Abstract
This mixed methods study examined holistic, analytic, and part marking models (MMs) in terms of their measurement properties and impact on candidate CEFR classifications in a semi-direct online speaking test. Speaking performances of 240 candidates were first marked holistically and by part (phase 1). On the basis of phase 1 findings—which suggested stronger measurement properties for the part MM—phase 2 focused on a comparison of part and analytic MMs. Speaking performances of 400 candidates were rated analytically and by part during that phase. Raters provided open comments on their marking experiences.
Results suggested a significant impact of MM; approximately 30% and 50% of candidates in phases 1 and 2 respectively were awarded different (adjacent) CEFR levels depending on the choice of MM used to assign scores. There was a trend of higher CEFR levels with the holistic MM and lower CEFR levels with the part MM. Although strong correlations were found between all pairings of MMs, further analyses revealed important differences. The part MM was shown to display superior measurement qualities particularly in allowing raters to make finer distinctions between different speaking ability levels. These findings have implications for the scoring validity of speaking tests.
Background and rationale
This study examines the impact of different marking models (MMs)—the approaches used for assigning ratings to performances—on candidate scores in the context of a semi-direct online speaking test. The test was originally developed to be part of a Cambridge Assessment English placement test battery as a quick measure of candidates’ ability to speak in a variety of general everyday contexts. It elicits a range of language features and functions through four task types: interview questions about the candidates and their background, a description and comparison of two photographs, questions related to a scenario, and a one-minute monologue on an abstract topic (see Appendix A for further information). The test is designed to progressively increase in difficulty along some of the features identified in Robinson’s (2001) framework of task complexity; for example, more familiar/here-and-now topics in the initial parts of the test and more abstract, open-ended topics in the final parts. While there is progression in task difficulty, each task provides scope for performances at a range of CEFR levels. The test is rated holistically; that is, a general overall evaluation of performance is given. The approach involves a rater assigning a single score to the candidate’s performance on the whole test based on a balanced consideration of four criteria: coherence and discourse management, language resource, pronunciation, and hesitation and extent. This score is based on a six-level rating scale covering levels A1 to C2 on the Common European Framework of Reference (CEFR) (Council of Europe, 2001)—see Appendix B.
The advantages of holistic scoring are considered to be its efficiency, ease of reporting (Davies et al., 1999), and lower cognitive demand on raters (Xi, 2007). In our assessment context, the use of holistic marking, which is relatively quick, reflects the operational need for a placement test with a short results turnaround. However, in light of potential future uses of the test including the provision of diagnostic information to candidates, there was a need to investigate empirically alternative MMs that would allow the generation of more fine-grained information on speaking performance, maintain the test’s scoring validity (Weir, 2005), and meet the practical demand for quick marking.
Choice of MM has been shown to affect rater marks (Barkaoui, 2011; Schoonen, 2005) and is thus an important aspect of a test’s validity argument. There is, however, little empirical research on the impact of different MMs on scores, particularly in speaking assessment. While choice of MM might be context-dependent and related to test purpose, research into how models compare can be valuable in informing decisions on scoring approaches.
The rationale for our study is, therefore, twofold: to inform the operational validation of a speaking test, and to contribute to building theory around the relative strengths and limitations of different MMs based on empirical evidence.
Marking models
In this paper, we use the term marking model to refer to “methods [used] to form judgments” on a performance (Harsch & Martin, 2013, p. 281). Although the term can extend to automated approaches to marking, we will be narrowing our focus to human-mediated MMs. Our definition does not include methods used for ensuring or evaluating judgment quality, such as double marking or different measurement techniques, such as G-theory.
A review of the literature suggests holistic and analytic scoring as the most widely used human-mediated MMs in writing and speaking assessment, with definitions and discussions of their strengths and limitations extensively documented (see, e.g., Davis, 2018; Hamp-Lyons, 1995; Lee, Gentile, & Kantor, 2009; Weigle, 2002). Part scoring is another MM that is operationally used by several international speaking tests such as TOEFL® and BULATS, as well as local tests such as the Oral English Proficiency Test for prospective international teaching assistants at Purdue University (Ginther, Dimova, & Yang, 2010; Yan, 2014). In contrast to the research on holistic and analytic MMs, empirical discussions of part scoring are less widely available in the assessment literature.
In the following sections, we discuss these three MMs in more detail.
Holistic
The practicality of holistic scoring is often seen as its main strength; theoretical arguments have also drawn attention to its suitability for the assessment of overall communicativeness (Weir, 1990), for representing “integrated higher-order skills” (Hunter, Jones, & Randhawa, 1996, p. 64), and as an antidote to “analytic reductionism” (White, 1984, p. 406). There are, however, limitations to the holistic approach, including limited transparency in the relative weighting of features which may be differentially applied by raters (Brown, 1995; Xi, 2007), the potential for raters to focus predominantly on what candidates can do well rather than on areas of weakness (Bacha, 2001), and an underlying assumption in holistic scoring that different features of performance develop at the same rate (Kroll, 1990), which is questionable from a second language acquisition point of view.
Analytic
Analytic scoring involves assigning separate scores to explicitly defined criteria (or dimensions) related to different aspects of performance (Davis, 2018; Xi, 2007). A strength of analytic MMs is that a multi-dimensional scale allows for a more systematic evaluation process where the criteria/dimensions and their relative weightings can be made explicit (Xi, 2007), which in turn provides raters with more clarity about the focus of ratings, thus potentially increasing reliability (Goulden, 1994). The collection of a number of observations through analytic scoring is a further positive feature, since greater reliability can be achieved through multiple observations (Lee, 2006; Barkaoui, 2011).
A multi-dimensional scale, moreover, reflects the complex nature of language and is therefore more in line with theoretical models of communicative language ability (Bachman & Savignon, 1986). Analytic scoring can reveal differences in strengths and weaknesses of performances as learners go through developmental stages (Hamp-Lyons, 1995) and, as such, better suits candidates with uneven profiles (Weigle, 2002). It can also offer diagnostic information to inform individual learning paths (Bacha, 2001).
The analytic MM is not without its limitations. Lee et al. (2009), for example, drew on high correlations between different analytic dimensions and holistic scores, to suggest that analytic scores may be psychometrically redundant. Other limitations include increased cognitive demands on raters (Underhill, 1987), difficulties in precisely defining scoring criteria (Douglas & Smith, 1997), and the need for rigorous rater training in reliably distinguishing between criteria (Xi, 2007).
Part
Part scoring involves assigning separate scores to different test parts on the premise that a single score covering a candidate’s overall performance on a number of tasks may not be an accurate representation of language ability and “might be overly influenced by either good or poor performance on a particular task” (Nakatsuhara, 2011, p. 36). As with the analytic MM, the collection of multiple marks has the potential to enhance reliability. Lee (2006), for example, reported a “large impact” of increasing the number of tasks on score dependability in the TOEFL® speaking test. Another advantage is that ratings can be awarded by a single rater or by multiple raters each scoring different parts of a candidate’s performance. A limitation is the shorter language samples elicited in each test part, which may not provide enough language for a valid and reliable evaluation. Although this can be countered with longer tests with multiple tasks of adequate length for rating, this may not always be possible because of practical constraints.
Part marking can be used in holistic and analytic models, although it may be too practically cumbersome and cognitively demanding for raters to assign a full range of analytic marks to each test part (Taylor & Galaczi, 2011). To the best of our knowledge, the speaking tests mentioned earlier that use part marking involve raters assigning holistic scores to each part. Nevertheless, this MM lends itself to different configurations, such as different criteria for different test parts and/or some parts marked holistically and others analytically.
Relationships between MMs
In this section, we look at empirical research on the relationships between MMs, drawing from studies on both writing and speaking assessment. It is worth emphasizing that although most of this research focuses on measurement properties of different MMs, there are broader conceptual issues at play. MMs are representations of what is considered important in performance. For example, a part MM attributes more importance to variation in performance across tasks and aligns more closely to task-centred approaches to construct definition (Brown et al., 2002; Norris et al., 2002). An analytic MM, on the other hand, places the primary focus on performance against different linguistic criteria; an approach that has closer affinity with trait-based approaches to construct definition (Chapelle, 1998). A full discussion of these different paradigms is beyond the scope of this paper, but it is important that they are taken into account when considering different MMs.
In the context of investigating the relationship between holistic and analytic scales, Bacha (2001) used a stratified sample of essays by L1 Arabic students of English and found high inter-rater agreement and strong correlations between the different analytic criteria, as well as between the holistic and analytic scores. Further analyses, however, showed that students’ performance on the different analytic criteria was statistically distinct. The diagnostic element of the analytic MM was seen as more “informative” for learning (Bacha, 2001, p. 381).
The effects of holistic and analytic MMs on writing were further investigated by Barkaoui (2011) and Wiseman (2012). Barkaoui’s (2011) data consisted of essays on two prompts written by adult learners of English from three proficiency levels. Findings from the Many-Facet Rasch Measurement (MFRM) analysis showed comparable ability estimates for the candidates across the MMs, suggesting that the two MMs “may be regarded as measuring the same underlying construct from a measurement point of view” (Barkaoui, 2011, p. 285). Nevertheless, observed differences between the two approaches indicated that the analytic MM resulted in lower standard errors for candidate ability estimates, separation of candidates into more statistically distinct levels, a higher proportion of candidates with acceptable infit values, and increased rater leniency. It is interesting to note that inter-rater agreement was lower with the analytic approach, which the author attributed to this model better capturing the raters’ “diversity of opinions and values” (Barkaoui, 2011, p. 288). In contrast, the analytic scale in Wiseman (2012) was associated with increased severity for raters. In both studies, the analytic approach was shown to display “greater measurement precision” (Barkaoui, 2011, p. 287) and to be more “sensitive” to differences in candidates’ writing abilities (Wiseman, 2012, p. 169). As highlighted by Barkaoui (2011), in an analytic MM there are multiple observations for each candidate (as opposed to a single observation with a holistic MM), and this additional information is likely to contribute to increased measurement precision.
An additional perspective on the relationship between MMs was offered by Harsch and Martin (2013). In this study, raters applied the same rating scale to a sample of scripts in three increasingly fine-grained methods: first, scores were assigned holistically to an overall performance; second, scores were awarded for each of the four criteria in the scale; and lastly, scores were assigned with reference to the descriptors within each of the criteria. Findings showed that raters’ levels of agreement decreased as the scale granularity increased, suggesting that the two more holistic approaches “masked deviances in how the raters applied the descriptors defining a criterion” (Harsch & Martin, 2013, p. 296).
In one of the few studies focusing on MMs in speaking, Xi (2007) explored the viability of analytic scoring in a large-scale holistically marked speaking test. Descriptors were extracted from the holistic scale to create three separate analytic scales, and performances covering a range of proficiency levels and L1s were subsequently marked by raters. Scores were awarded holistically to each task and then averaged to calculate an overall holistic score. The same performances were also marked analytically for each task. Findings showed high correlations between scores on different analytic criteria, which were taken to suggest that, from a psychometric perspective, the different dimensions were not sufficiently distinct from one another. Results also indicated varied profiles at the individual task level in some cases, but these profiles were flattened once scores were averaged across tasks, leading the author to conclude that analytic scores “would not provide additional information beyond what the holistic scores could offer for most examinees” (Xi, 2007, p. 281).
Also in the context of speaking assessment, Nakatsuhara (2011) focused on part marking in the IELTS Speaking test, and found differences in candidate scores on two of the test parts. Although these differences did not reach statistical significance, Nakatsuhara (2011) argued that results provided empirical support for a part MM. Similar to Xi (2007), the part scores were calculated by awarding analytic scores on the four IELTS criteria for each test part and subsequently aggregating the scores; an approach which, as Nakatsuhara cautions, may prove impractical in operational settings and result in an “increased burden on the examiners” (2011, p. 36).
Key issues from the literature
The following points emerge from our review of the literature:
First, although research suggested a trend of awarding comparable scores across holistic and analytic MMs, differences also emerged. For example, in Barkaoui (2011), candidates tended to be awarded higher scores with the analytic MM, whereas the opposite was observed in Wiseman (2012). Findings also diverged on the extent to which performances on different criteria were sufficiently distinct to warrant an analytic MM. A possible explanation is the differences in methodology and scales. For example, although Harsch and Martin (2013) and Xi (2007) used the same scales/descriptors and applied them in different ways, other studies such as Wiseman (2012) applied different rating scales/descriptors in their comparisons. We also found terminological inconsistencies which may have contributed to these contradictory results. For example, what Harsch and Martin (2013) refer to as “holistic-criterion” scoring is seen as analytic scoring by others (Barkaoui, 2011; Wiseman, 2012). Similarly, Xi’s (2007) holistic and analytic scoring is what O’Sullivan and Nakatsuhara (2011) would list as part-holistic and part-analytic scoring respectively.
Second, in reporting on the empirical relationships between different MMs, most studies have relied on correlations; however, this may disguise the effect of MMs on individual candidate marks. It is therefore essential to consider alternative ways of examining MMs, with an explicit focus on their impact on candidate final scores and classifications; in other words, their “practical significance” (Fulcher, 2003, p. 65). What constitutes practical significance is context-dependent. Given the use of the CEFR for score reporting and decision-making, practical significance is often defined in terms of CEFR levels, that is, cases in which candidates receive higher/lower CEFR levels as a result of the MM applied.
Third, the majority of empirical research on MMs is in writing assessment. There is, we believe, a need to gain a better understanding of the impact of MMs in speaking assessment.
Fourth, although the part MM has been used operationally in several speaking tests, there is little reported research on the comparison between the part MM and other marking approaches. Given that speaking and writing tests are typically designed with a range of task types aimed to tap into different aspects of the construct of interest, a systematic examination of the part MM is warranted.
Research aims
Our study aims to empirically evaluate the theoretical assumption that differences between MMs can influence performance scores. We attempt to address some of the issues raised in our literature review as follows: (a) using the same scale/set of descriptors and applying them in three different ways in order to control for the potentially confounding effects of variations in these; (b) focusing on the practical impact of MMs on candidate CEFR classifications; (c) selecting speaking assessment for the context of our study; and (d) including the part MM in addition to holistic and analytic approaches.
Our study is guided by the following research question:
How do the MMs under examination—holistic whole test (henceforth holistic), holistic by part (henceforth part), and analytic whole test (henceforth analytic)—compare in terms of (a) impact on candidate scores and CEFR classifications and (b) measurement properties?
Method
Design
This was a two-phase study comprising quantitative score data and qualitative rater comments, integrated in a concurrent mixed methods design (Creswell, 2013). We opted for a “competition” design where, in the first phase, our speaking test’s operational MM (holistic) would be compared to an alternative MM. Subject to empirical evidence, the stronger model in terms of measurement properties would be compared to our second alternative MM in the subsequent phase. To allow for a direct comparison of all three MMs simultaneously, the design included a linking of the data sets from the two phases through common raters and performances. Such an approach served as a practical and cost-effective solution to addressing the study’s research question.
In phase 1, we compared the holistic and part MMs. We limited our investigation to a holistic by part MM (as opposed to an analytic by part MM), owing to its likelihood for adoption in operational settings. On the basis of phase 1 results (see findings), phase 2 focused on a comparison of the part and analytic MMs. Figure 1 provides a snapshot of the study’s design, data collection, and analysis procedures.

Figure 1. Study design, data collection, and data analysis.
Participants
Four raters participated in phase 1. Given the larger data set in phase 2, an additional six raters participated in this phase. All raters (six females, four males) were L1 speakers of English, had over five years’ experience teaching ESL/EFL, and were trained/certified for a number of different Cambridge Assessment English speaking tests (see Table 1). All had worked as examiners for the BULATS online speaking test, which shared important similarities with the speaking test in our study in terms of format, task types, and assessment scale descriptors.
Table 1. Rater examining/teaching experience.
All raters worked independently in order to limit the potential for collusion between them.
Speech data
Data for the study were selected from a pool of available speaking tests (approximately 2500 candidates at the time of data collection), and comprised 240 performances in phase 1 and 400 in phase 2. Each performance had an associated holistic mark, assigned by a single examiner using standard operating procedures. A stratified sampling approach was used to select performances that covered a range of ability levels and L1s. Table 2 shows the breakdown of CEFR levels by data set according to operational holistic ratings. The distribution of the sample data set was designed to closely approximate that of the test population at the time of research. Thirty different L1s were represented, with Chinese (29.5%), Arabic (15.3%), and Portuguese (11.8%) the most frequent.
Table 2. CEFR level breakdown for phases 1 and 2 data sets.
Three candidates were later dropped from the analysis owing to audio quality.
Rating scales
The assessment scale is a six-level holistic scale covering four criteria: coherence and discourse management, language resource, pronunciation, and hesitation and extent (see Appendix B). For the purpose of our study and similar to Xi (2007), we created an analytic scale by extracting the descriptors from each of the four criteria in the holistic scale and displaying them separately. No changes were made to the content of the descriptors.
Data collection
Data collection took place in two phases with a three-month interval in between. Within each phase data collection was completed in two rounds with a one-week interval in between in a counter-balanced design to minimize any order or halo effects.
Phase 1 focused on a comparison of holistic and part MMs; a rating matrix was designed to ensure the following: (a) that each speaking test was marked holistically and by part; (b) that there was a link between MMs with each rater marking the same candidates’ performances using part and holistic MMs; and (c) that there was a link between raters through a common batch of performances. This design feature created a link between candidates and raters in order to meet the requirements of MFRM. The performances not in the common batch were single scored, that is, once scored holistically and once by part. On the basis of the rating matrix, speech files were allocated to raters along with the assessment scale and detailed instructions.
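The linking requirement described above can be checked mechanically before any MFRM run. The following is a minimal sketch, assuming the rating matrix is available as (rater, candidate) pairs; the function name `is_connected` and the toy identifiers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def is_connected(ratings):
    """Return True if every rater and candidate is linked, directly or
    indirectly, through shared ratings.

    `ratings` is an iterable of (rater_id, candidate_id) pairs.
    """
    # Build an undirected bipartite graph: rater nodes <-> candidate nodes.
    graph = defaultdict(set)
    for rater, candidate in ratings:
        r, c = ("rater", rater), ("candidate", candidate)
        graph[r].add(c)
        graph[c].add(r)
    if not graph:
        return True
    # Depth-first search from an arbitrary starting node.
    start = next(iter(graph))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    # Connected only if the search reached every node.
    return len(seen) == len(graph)


# Hypothetical toy matrices: "c0" plays the role of the common batch.
linked = [("R1", "c0"), ("R2", "c0"), ("R1", "c1"), ("R2", "c2")]
unlinked = [("R1", "c1"), ("R2", "c2")]
print(is_connected(linked), is_connected(unlinked))
```

In the first toy matrix both raters mark the common performance, so all elements fall into one network; in the second, each rater marks a disjoint set and their severity could not be compared on a common scale.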
Phase 2 was informed by the results of phase 1 and focused on a comparison of part and analytic MMs. Data collection for phase 2 closely followed phase 1 procedures, with MM as the main difference.
Upon completion of each phase of marking, raters were invited to provide open comments in response to a short questionnaire on their experiences and views of applying each pair of MMs. The questionnaire focused on raters’ preferences regarding MMs, ease of marking in each model, and the feasibility of part or analytic marking in operational conditions.
Data analysis
Scores awarded by raters were analysed using MFRM with FACETS (Linacre, 2018a). MFRM provides a technical solution to the well-documented rater effect in performance assessment (McNamara, 1996) by allowing different facets of the testing situation to be measured independently and then mapped onto a common linear scale measured in “logits”. Importantly, candidates’ ability measures from the analysis are estimated independently of the particular rater or task assigned to them, with their raw scores adjusted for the effects of the facets of performance. The resulting candidate fair-average mark is a more objective estimate of the candidate’s ability. MFRM is also robust against missing data as long as there is enough linking between different facet elements (Linacre, 2018b). This is particularly important from a practical perspective, as a fully crossed design may not be possible. We opted for a connected design where “a network of links exists through which every element that is involved in producing an observation is directly or indirectly connected to every other element of the same assessment context” (Eckes, 2009, p. 39). Our rating matrix was designed to ensure a linking of MMs through the same candidate performances, the linking of raters through a common batch of candidates, and finally a linking of the two phases with a common set of raters.
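For orientation, a three-facet many-facet Rasch model is commonly written in the form below. This is the general textbook formulation and is given here as an assumption; the exact model specification used in the study is not stated in the text, and the symbols are illustrative.

```latex
\[
\ln\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = B_n - C_j - D_i - F_k
\]
% P_{njik}     : probability that candidate n receives category k from rater j on element i
% P_{nji(k-1)} : probability of the adjacent lower category k-1
% B_n          : ability of candidate n (logits)
% C_j          : severity of rater j
% D_i          : difficulty of element i of the third facet (e.g., test part or criterion)
% F_k          : difficulty of being awarded category k relative to category k-1
```

Because all terms sit on the same logit scale, a candidate's estimated ability is adjusted for the severity of the particular rater and the difficulty of the particular part or criterion involved.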
To address the study’s overall research question, several MFRM models were examined. First, separate analyses were run for each MM: a two-facet (candidate, rater) model for analysing the holistic scores, a three-facet (candidate, rater, test part) model for analysing the part scores, and a three-facet (candidate, rater, criteria) model for analysing the analytic scores. In each phase, candidates’ fair-average marks were correlated and their rankings compared using a Wilcoxon Signed-Rank test in SPSS. Additionally, the fair-average marks from the MFRM analyses were converted into CEFR levels; marks were rounded down to the nearest integer given that operationally a candidate needs to meet all the descriptors in a level to be awarded that level. The percentage of candidates receiving the same CEFR classifications across MMs was then calculated. A range of statistics (as explained below) for the two pairings of MMs were compared. A three-facet (candidate, rater, MM) model was also run in each phase to allow for a direct comparison of each pair of MMs on the same logit scale. Lastly, a two-facet (candidate, MM) analysis was run, where fair-average marks from the MFRM analyses of different MMs were combined and analysed simultaneously for a direct comparison of all three MMs. In all analyses, the candidate facet was allowed to float with all other facets centred at zero.
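The conversion of fair-average marks to CEFR levels can be illustrated with a short sketch. It assumes that the six scale points are coded 1 to 6 and map onto A1 to C2 in order; that coding, the clamping of out-of-range values, and the name `to_cefr` are assumptions made for illustration only.

```python
import math

# Assumed coding: scale point 1 = A1, ..., scale point 6 = C2.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def to_cefr(fair_average):
    """Round a fair-average mark down to a whole band and return its CEFR label."""
    band = math.floor(fair_average)
    # Clamp to the ends of the scale (an assumption for out-of-range values).
    band = max(1, min(band, len(CEFR_LEVELS)))
    return CEFR_LEVELS[band - 1]


print(to_cefr(3.70), to_cefr(3.99), to_cefr(6.0))
```

Rounding down rather than to the nearest integer means that a mark of 3.99 is still reported at the third level, in keeping with the requirement that all descriptors in a level be met.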
Data analysis drew on a range of statistics that are generated in MFRM for each facet. These included parameter estimates for each facet and corresponding reliability indices: the standard error index; the separation statistics, which are useful for summarizing observations and drawing inferences about group trends; and the separation indices and strata, which estimate the number of statistically distinguishable performance levels and their associated reliability (Linacre, 2018b). We interpreted these indices mindful of their facet-dependency; for example, a high separation index with associated high reliability is desirable for candidates and can show that the test has successfully distinguished between different ability levels, whereas a low separation index and reliability is desirable for raters, who should be similar in measures.
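The relationship between the separation statistics can be sketched as follows, using the standard Rasch definitions: separation G is the "true" standard deviation of the measures divided by their root mean square error, strata is (4G + 1) / 3, and reliability is G² / (1 + G²). That the strata value corresponds to the H statistic reported later in this paper is an assumption, and the function name and input values are hypothetical.

```python
def separation_stats(true_sd, rmse):
    """Return (separation G, strata, reliability) for one facet.

    true_sd : standard deviation of the measures corrected for measurement error
    rmse    : root mean square standard error of the measures
    """
    g = true_sd / rmse                  # separation index
    strata = (4 * g + 1) / 3            # statistically distinct levels
    reliability = g**2 / (1 + g**2)     # separation reliability
    return g, strata, reliability


# Hypothetical values: measures spread twice as widely as their error.
g, strata, reliability = separation_stats(true_sd=2.0, rmse=1.0)
print(g, strata, round(reliability, 2))
```

When the elements of a facet are indistinguishable (G = 0), strata falls to 1/3 and reliability to 0, which is the desirable pattern for raters but not for candidates.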
Apart from group-level statistics, we considered fit statistics which “enable the diagnosis of aberrant observations and idiosyncratic elements” (Linacre, 2018b, p. 14) within each facet. Specifically relevant are the infit and outfit mean-square statistics, which can indicate misfit. They have an expected value of 1 and a range from zero to infinity where “the higher the . . . mean-square index, the more variability we can expect” in the rating patterns (Myford & Wolfe, 2000, p. 15). Here we only report infit, as it is less sensitive to outliers compared to outfit and because it is broadly viewed as more important when evaluating the fitness of the data to the model (Eckes, 2009; Myford & Wolfe, 2004). Values below 1 are considered to be “overfitting” the model and too predictable, while values above 1 are considered to be “underfitting” and too unpredictable (Linacre, 2018b) with the latter generally raising more cause for concern (Eckes, 2009; Linacre, 2018b). In line with Linacre (2018b), the current study adopted lower and upper control limits of 0.5 to 1.5 for the infit mean square index. A summary of results for all MFRM models can be found in Appendix E (phase 1) and Appendix F (phase 2).
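As an illustration of how the control limits operate, the sketch below computes an infit mean square as the information-weighted ratio of squared residuals to model variances and then classifies it against the 0.5 and 1.5 limits given above. The function names and the toy inputs are hypothetical; in practice these values are produced by FACETS rather than computed by hand.

```python
def infit_mean_square(observed, expected, variances):
    """Information-weighted mean square: sum of squared residuals
    divided by the sum of the model variances of the observations."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variances)


def classify_infit(infit_ms, lower=0.5, upper=1.5):
    """Label an infit mean square against the lower and upper control limits."""
    if infit_ms < lower:
        return "overfit"      # ratings more predictable than the model expects
    if infit_ms > upper:
        return "underfit"     # ratings less predictable than the model expects
    return "acceptable"


# Hypothetical ratings for one rater: observed scores, model expectations,
# and model variances of each observation.
ms = infit_mean_square([3, 4], [3.5, 3.5], [0.5, 0.5])
print(ms, classify_infit(ms), classify_infit(0.1), classify_infit(1.8))
```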
Raters’ open comments were analysed for common themes and insights that could further inform the study’s quantitative findings. Given the small number of raters involved and short questionnaire, the qualitative part of the project was comparatively limited. It involved the collation of rater comments in Excel and analysis for common themes which emerged from the data. The authors completed this stage collaboratively in order to ensure appropriate interpretation of data.
Findings
A comparison of holistic and part MMs
To compare the holistic and part MMs and explore their potential impact on candidate scores/CEFR classifications, we first estimated candidate ability levels from independent MFRM analyses of scores as awarded by the two MMs. This ensured that candidate measures were adjusted for the effects of raters. The holistic MM was associated with a slightly higher mean (M = 3.70; SD = 0.90) than the part MM (M = 3.64; SD = 0.83), although the results of a paired-sample t-test showed no statistically significant differences and a Wilcoxon Signed-Rank test indicated no significant difference in candidate rankings. We also correlated these fair-average measures and a strong statistically significant correlation emerged (r = .88, p < .01).
These fair-average scores were then converted to CEFR levels; results in Table 3 show a similar distribution of CEFR levels, with a slightly higher percentage of candidates at the A1/A2 levels for the part MM and generally small differences at the group level.
Table 3. Distribution of CEFR levels by holistic and part MM (%).
N = 240.
We then focused on the percentage of candidates that received the same CEFR classification across the two MMs, including the size and direction of the differences. Results showed that 68.6% of candidates were awarded the same CEFR classification regardless of the MM whereas 30.9% fell within an upper or lower adjacent level. Fewer than 1% received more than a level difference (see Appendix C for a contingency table with details of specific occasions where the two MMs converged/diverged). Candidates were likely to receive a higher CEFR level when the holistic MM was used. Possible explanations for this trend are drawn from rater feedback: one rater commented that when marking holistically he was “less likely to pay attention to a below par performance” for any particular part and another rater “tended to use what appeared to be the predominant level of language over the whole test”.
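The classification comparison described above amounts to tabulating, for each candidate, the distance between the two CEFR levels awarded. A minimal sketch follows; the function name `classification_agreement` and the toy data are hypothetical and do not reproduce the study's results.

```python
from collections import Counter

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def classification_agreement(levels_a, levels_b):
    """Return the percentage of candidates with the same, an adjacent,
    or a more distant CEFR level across two marking models, plus the
    number of candidates placed higher by the first model."""
    diffs = [LEVELS.index(a) - LEVELS.index(b) for a, b in zip(levels_a, levels_b)]
    labels = Counter(
        "same" if d == 0 else "adjacent" if abs(d) == 1 else "beyond"
        for d in diffs
    )
    n = len(diffs)
    percentages = {k: 100 * labels[k] / n for k in ("same", "adjacent", "beyond")}
    higher_with_a = sum(d > 0 for d in diffs)
    return percentages, higher_with_a


# Hypothetical levels for four candidates under two marking models.
holistic = ["B1", "B2", "A2", "C1"]
part = ["B1", "B1", "A2", "B1"]
print(classification_agreement(holistic, part))
```

Reporting the direction of the differences alongside the percentages makes a systematic tendency (for example, higher levels under one model) visible even when overall agreement is high.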
Raters also referred to “jagged profiles” in candidate performances, that is, differential performance on different test parts. “Candidates rarely fit a single band [level]”, one rater noted, adding that an advantage of a part MM was in “capturing candidate performances that sometimes varied widely in different parts of the test”.
These findings confirmed that the choice of MM has a practical impact on the final CEFR classifications of more than 30% of candidates. We therefore proceeded to compare the two MMs in an attempt to identify the model exhibiting superior measurement qualities.
We combined the two data sets, defined MM as an additional facet, and reran the MFRM analysis. The results indicated that the two MMs were not statistically distinct, with the separation indices confirming that the two MMs could not be reliably divided into different strata (H = 0.33; R = 0.00). A closer look at other indices, however, revealed differences: although the examinee statistics showed comparable distribution of speaking abilities—albeit slightly wider for the holistic MM (1.23 to 6.0) compared to the part MM (1.55 to 5.99)—the part MM separated candidates into more statistically distinct levels (HPart = 6.54; RPart = 0.90) than the holistic MM (HHolistic = 3.46; RHolistic = 0.85). In other words, the use of the part MM resulted in more reliable distinctions between candidates. Second, although the rater statistics showed similar severity rankings and acceptable levels of consistency based on individual and average infit mean square statistics for the four raters, the holistic MM showed overfit for the two most lenient raters, with infit mean square values close to zero. Although overfit may be “less productive for measurement”, it is not considered “degrading” (Linacre, 2018b, p. 279). Overfit is typically an indication of central tendency or restriction of range, and in this case, is in line with previous research on holistic MMs (Barkaoui, 2011; McNamara, 1996). In order to ensure that these two raters were not unduly affecting the results, we re-ran the MFRM by removing the problematic raters from the analyses and examined the impact on candidate separation indices. Although results showed a slight improvement in the separation indices for candidates (H = 3.55; R = 0.87), this difference was small and did not affect the interpretations. Moreover, the same two raters did not show any underfit/overfit in the remaining analyses and we therefore retained them in the analyses.
Additionally, we considered the percentage of unexpected responses flagged by FACETS; this figure was higher for the part MM (1.27%) than for the holistic MM (0.77%). A closer look revealed that it was the differential performance of the same three candidates on different test parts that resulted in an increased number of unexpected responses. This suggests that candidate performances on different test parts are varied enough to be distinguished by raters. The test part results substantiate this finding: different parts exhibited a difficulty range of 0.4 logits from −0.23 logits (Part 1 – easiest) to 0.17 logits (Part 4 – most difficult). The separation indices suggested that the different parts can be divided into a minimum of 2.64 statistically distinct difficulty strata (R = 0.85). Raters’ open comments indicated that they believed the part MM to be a more “fair” and “reliable” approach to assessment. Moreover, all four raters agreed that marking by part is feasible in operational conditions.
In summary, phase 1 findings suggested that choice of MM has a practical impact on the CEFR classifications of at least 30% of candidates, with a pattern of higher CEFR levels with the holistic MM. Of the two MMs under examination, we argue that there is a case to be made for adopting the part MM given its enhanced measurement properties: it reliably distinguishes between candidates from different ability levels and separates them into almost twice as many ability strata as the holistic MM. Qualitative support for the part MM derives from raters’ expressed preferences for this model in allowing more reliable and fair assessment of candidates. We therefore selected the part MM from this phase to be compared against the analytic MM in phase 2.
A comparison of part and analytic MMs
In phase 2, we compared the part and analytic MMs and explored their impact on candidate scores/CEFR classifications. Similar to phase 1, we first estimated candidate ability levels on the basis of the MFRM analyses as measured by the two MMs. The analytic MM resulted in a higher mean (M = 3.54; SD = 0.74) than the part MM (M = 3.31; SD = 0.84) and a paired-sample t-test revealed that this difference was statistically significant, t(395) = 9.23, p < .05, r = .42, with a moderate effect size. This was confirmed by the results of a Wilcoxon Signed-Rank test that showed statistically significant differences in the ranking of candidates across the two MMs, with candidates receiving different (higher) rankings with the analytic MM (z = −8.17, p < .01, r = −.42). We also correlated these fair-average measures; results showed a strong statistically significant correlation (r = .80, p < .01).
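The effect size reported for the paired-sample t-test can be recovered from the test statistic and its degrees of freedom. The sketch below assumes the common conversion r = sqrt(t² / (t² + df)); the text does not state which formula was used, and the function name is purely illustrative.

```python
import math


def effect_size_r(t: float, df: int) -> float:
    """Convert a t statistic into the correlation-type effect size r.

    Assumes the conversion r = sqrt(t^2 / (t^2 + df)); this is an
    assumption about how the reported r was derived.
    """
    return math.sqrt(t ** 2 / (t ** 2 + df))


# Values reported above for the analytic vs. part comparison.
r = effect_size_r(9.23, 395)
print(round(r, 2))  # 0.42
```

Under this assumption the result agrees with the reported r = .42.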
These fair-average measures were then converted to CEFR levels. Table 4 shows a broadly similar distribution of CEFR levels; aligning with the above findings, the analytic MM resulted in a slightly higher percentage of candidates at the upper B and C levels, whereas the part MM resulted in a higher percentage of candidates at the A levels.
Table 4. Distribution of CEFR levels by part and analytic MM (%). N = 397.
We then focused on the percentage of candidates receiving the same CEFR classification across the two MMs, taking into account the size and direction of any differences. Results showed approximately 52% of candidates receiving the same CEFR classification, with the remaining 48% falling within an adjacent upper/lower level. There was once again a systematic trend of lower CEFR levels with the part MM (see Appendix D for a contingency table with details of specific occasions where the two MMs converged/diverged). Two comments by raters help explain this trend: “there was no criterion relating to task achievement. I found myself applying this anyway, almost instinctively” and “even though there is no task completion, when marking by part, it’s easier to mark down answers that are not relevant to the task”.
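The agreement figures above rest on a simple tabulation: for each candidate, compare the CEFR level obtained under each MM and record whether the two match and, if not, which MM gave the lower level. A minimal sketch follows; the function name and the five candidate classifications are hypothetical and are not study data.

```python
from collections import Counter

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]


def agreement_summary(part_levels, analytic_levels):
    """Return the percentage of candidates with the same level under both
    MMs, with a lower level under the part MM, and with a higher level
    under the part MM."""
    counts = Counter()
    for part, analytic in zip(part_levels, analytic_levels):
        diff = CEFR.index(analytic) - CEFR.index(part)
        if diff == 0:
            counts["same"] += 1
        elif diff > 0:
            counts["part_lower"] += 1
        else:
            counts["part_higher"] += 1
    n = len(part_levels)
    return {key: 100 * counts[key] / n
            for key in ("same", "part_lower", "part_higher")}


# Hypothetical classifications for five candidates (illustration only).
part = ["A2", "B1", "B1", "B2", "C1"]
analytic = ["A2", "B2", "B1", "C1", "C1"]
summary = agreement_summary(part, analytic)
print(summary)
```

In this toy example three of five candidates receive the same level and the remaining two receive a lower level under the part MM, which mirrors the direction of the trend described above.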
These findings confirm that choice of MM could impact candidates, resulting in a drop or increase in their overall CEFR classifications. We then proceeded to further compare the two MMs to identify the model exhibiting superior measurement qualities.
Similar to phase 1, we combined the two data sets and defined MM as an additional facet. The results indicated that the two MMs were statistically distinct (χ² = 105.9, p < .01) with the separation indices (H = 9.94; R = 0.98) showing that the two MMs could be reliably divided into approximately nine difficulty strata. These findings suggest that regardless of the strong correlations between candidate ability measures across the two marking conditions, the part and analytic MMs distinguish between candidates in different ways and potentially tap into distinct aspects of the construct.
We subsequently considered other indices to help evaluate and compare the two MMs, and examined the part and criteria statistics to see whether candidates’ speaking abilities on the different test parts and criteria are sufficiently distinct to merit part or analytic marking. Similar to phase 1, the results of the part statistics showed that the four test parts in the speaking test exhibit a range of difficulty levels, from −0.20 (Part 1 – easiest) to 0.19 (Part 4 – most difficult), with the separation and reliability indices (H = 4.32; R = 0.90) suggesting that the different test parts can be reliably separated into a minimum of four statistically distinct difficulty strata. The null hypothesis that all parts exhibit similar difficulty measures was rejected (χ² = 39.3, p = .00). This serves as empirical evidence that candidates may be displaying speaking abilities on the different test parts that are sufficiently distinct, thus justifying a part MM. Similarly, the criterion measurement report of the analytic score data showed that the four criteria in the scale exhibited a range of difficulty measures, with coherence and discourse management as the easiest category (−0.31 logits) and pronunciation as the most difficult (0.13 logits). The separation strata and reliability indices (H = 5.02; R = 0.95) indicated that candidate performances on the different criteria are sufficiently varied to be reliably distinguished by raters, thus justifying the use of an analytic MM.
Candidate group-level statistics showed that the part MM separated candidates into slightly more statistically distinct levels than the analytic MM (HPart = 6.62, RPart = 0.96; HAnalytic = 5.58, RAnalytic = 0.94). The rater statistics generally showed acceptable levels of consistency for the ten raters in the study across the two MMs. However, two of the raters exhibited slight underfit with the analytic MM (infit mean square values > 1.5), whereas none of the raters in the part MM exhibited misfit. Lastly, the percentage of unexpected responses (those with absolute residual values > 2) was 1.30% and 1.10% for the analytic and part MMs respectively.
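The strata (H) and reliability (R) indices reported for candidates are two expressions of the same underlying separation ratio G. The sketch below assumes the standard Rasch definitions H = (4G + 1) / 3 and R = G² / (1 + G²); that these are the definitions FACETS applied here is an assumption, and the function names are illustrative.

```python
def separation_from_strata(h: float) -> float:
    """Invert H = (4G + 1) / 3 to recover the separation ratio G."""
    return (3 * h - 1) / 4


def reliability_from_separation(g: float) -> float:
    """Separation reliability R = G^2 / (1 + G^2)."""
    return g ** 2 / (1 + g ** 2)


# Candidate strata values reported above for each MM.
for label, h in [("part", 6.62), ("analytic", 5.58)]:
    g = separation_from_strata(h)
    print(label, round(reliability_from_separation(g), 2))
# part 0.96
# analytic 0.94
```

Under these assumed definitions, the reported candidate strata values reproduce the reported reliabilities for both MMs, which illustrates why a higher strata index goes hand in hand with a higher reliability.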
Raters’ views regarding preference for the two MMs were mixed; six out of 10 raters preferred the part MM, two raters were equally happy with either MM, and two raters preferred the analytic MM. Preference for the latter was based on the ability to capture uneven profiles: “I found it hard to balance the different aspects of assessment to give an overall mark in some cases, e.g. very good pronunciation but with limited vocabulary and grammar”. Another rater noted that “each part has its limits and there may not always be enough evidence to mark each part”.
Raters believed that marking the test by part or analytically was feasible in operational settings. One rater distinguished between face-to-face tests (where the examiner serves the dual role of interlocutor and rater) and computer-delivered tests (where the examiner is only focused on rating). In the latter, the time pressure is removed and rater cognitive load is lower, which supports the feasibility of awarding multiple scores particularly when “there is more control over the audio files and you can pause and move on as you like”.
To summarize, our findings from phase 2 suggest that the choice of MM has an impact on candidates’ CEFR classifications. Both part and analytic MMs exhibited precision in measurement, as expected given the number of observations per candidate (four in each case). Raters’ perceptions also provided support for both models in terms of feasibility of use in operational settings. There was, however, a slight advantage for the part MM on three grounds: (a) it allowed for finer distinctions between candidates; (b) raters showed higher consistency; and (c) the percentage of unexpected responses was lower. Additionally, more raters preferred the part MM, although given their limited number, this finding should be treated with caution.
A comparison of all MMs
Each of the two phases focused on a detailed comparison of pairings of MMs. Given the study’s linked design we were able to compare all three MMs simultaneously; candidate ability estimates from the independent FACETS analyses were analysed with a two-facet model consisting of candidates (n = 637) and MM (n = 3).
The MM map is presented in Figure 2 and the MM measurement report is summarized in Table 5 with MMs arranged in ascending order of difficulty.

Figure 2. Map of all MMs.
Table 5. MM measurement report.
Results show a logit range of 1.7, with the analytic MM as the easiest (logit value = −0.95) and the part MM as the most difficult (logit value = 0.75). All infit mean square values fall within an acceptable range. The separation and reliability indices (H = 9.65; R = 0.98) suggest that the different MMs can be reliably separated into a minimum of nine statistically distinct strata. The null hypothesis that all MMs exhibit similar difficulty measures is rejected (χ² = 199.4; df = 2; p = .00). This can be interpreted as follows: the probability of the same candidate receiving a different ability estimate as a function of choice of MM is statistically significant.
We also explored the extent to which these statistically significant results translate into practical significance in terms of CEFR classifications. In doing so, we considered the fair-average results associated with each MM and calculated the maximum difference between the easiest (analytic) and most difficult (part) MM (3.78 − 3.26 = 0.52). The value of 0.52 is approximately half a CEFR level, which can have a practically significant impact, particularly for borderline candidates. This effect is attenuated when comparing the two easiest or the two most difficult MMs. Nevertheless, these results demonstrate the potential impact of choice of MM on candidates and in this particular validation investigation have lent support to a change to an alternative MM.
We also ran a candidate × MM bias analysis; this analysis allows for an examination of whether each MM maintains a consistent level of difficulty across candidates. From a total of 1264 interaction terms, there were 389 bias terms with |Z| > 2. However, no single interaction displayed statistical significance (p > .05), and the summary statistics (χ² = 1037.5; df = 1264; p = 1.00) suggested that the different MMs do not disadvantage different candidates in a statistically distinct way. These somewhat contradictory results can be explained by the small number of observations for each candidate, which is critical for statistical significance testing (Eckes, 2009; M. Linacre, personal communication, October 2019). We therefore considered alternative approaches for identifying large and meaningful interaction terms. The first approach, following Eckes (2009), was to calculate the percentage of t values with |t| > 2 as an indication of large bias. In our data set, this was 3.1% (t-value range: −2.96 to 2.46). The second approach was to consider substantial differences between observed and expected averages for each interaction term. Given that half a CEFR level can be of practical significance in our context, we identified any cases where |Observed − Expected Average| > 0.5. In our data set, this was 4.7%; the largest absolute value was 0.88, which is less than one CEFR level. When both approaches are combined, this percentage is 4.8%. We can therefore conclude that in general MMs display a uniform level of difficulty across candidates; however, for a small percentage of the candidature (<5%), there is evidence of bias.
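The two flagging approaches can be sketched as follows. The `BiasTerm` record, the function name, and the four example interaction terms are hypothetical, and "combined" is read here as a term being flagged by either criterion, which is an assumption about how the two approaches were merged.

```python
from dataclasses import dataclass


@dataclass
class BiasTerm:
    t: float         # standardized bias statistic for one candidate x MM term
    observed: float  # observed average score
    expected: float  # model-expected average score


def flag_rates(terms, t_cut=2.0, diff_cut=0.5):
    """Return the percentage of terms flagged by |t| > t_cut, by
    |observed - expected| > diff_cut, and by either criterion."""
    n = len(terms)
    large_t = [abs(x.t) > t_cut for x in terms]
    large_diff = [abs(x.observed - x.expected) > diff_cut for x in terms]
    either = [a or b for a, b in zip(large_t, large_diff)]
    return tuple(100 * sum(flags) / n for flags in (large_t, large_diff, either))


# Hypothetical interaction terms (illustration only, not study data).
terms = [
    BiasTerm(t=2.4, observed=4.0, expected=3.3),    # flagged by both criteria
    BiasTerm(t=-0.5, observed=3.0, expected=3.1),   # not flagged
    BiasTerm(t=1.1, observed=3.75, expected=3.0),   # flagged by difference only
    BiasTerm(t=-2.2, observed=2.75, expected=3.0),  # flagged by t only
]
print(flag_rates(terms))  # (50.0, 50.0, 75.0)
```

Because most terms flagged by one criterion tend also to be flagged by the other, the combined percentage can sit only slightly above the larger of the two individual percentages, as in the figures reported above.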
Discussion
In this study, we compared different MMs in terms of their impact on candidate CEFR classifications and measurement qualities in a specific speaking assessment context. This was done with the view to potentially switch to a model that would allow for the generation of more fine-grained information on performance while maintaining and/or enhancing the test’s scoring validity and preserving the practical demands for a quick results turnaround.
There was strong evidence from both phases to suggest a significant impact of choice of MM; approximately 30% and 50% of the candidates in phases 1 and 2 respectively were awarded different (adjacent) CEFR levels depending on the MM. When all MMs were considered together in a single analysis, results showed half a CEFR level difference between the easiest and most difficult MM. Taken together, findings suggest that the probability of the same candidates receiving different scores/CEFR levels as a function of choice of MM is statistically and practically significant and should therefore be an important test validity consideration and carefully aligned with the purpose of that assessment.
Findings also indicated trends in the direction of these differences. In contrast to Barkaoui (2011) and in line with Wiseman (2012), our study found higher overall scores and CEFR levels with the holistic MM. Supported by raters’ open comments, it appears that the holistic MM lends itself to a benefit-of-the-doubt policy, with scores awarded on the basis of candidates’ best performance, which was also stipulated by Bacha (2001). A halo effect across performance on different tasks is possibly in operation with the holistic MM. The trend of lower scores/CEFR classifications with the part MM was also observed in the comparison with the analytic MM. We drew on raters’ comments to suggest that, in the absence of a task achievement criterion, the part MM seems to serve as an implicit alternative that allows raters to award lower scores for task-irrelevant responses or unsuccessful attempts at handling more complex tasks.
Focusing on a comparison of the MMs from a measurement perspective, our findings showed strong correlations between all pairings of MMs; however, further analyses revealed important differences. The two MMs that showed the strongest correlation with no statistically significant difference in means and candidate rankings were the holistic and part MMs, suggesting that the two are tapping into a similar construct. This is not surprising given that the same scale and criteria were used in both models, with the holistic MM applied to a whole performance and the part MM applied to individual test parts. Nevertheless, the part MM was shown to display superior measurement qualities, particularly in separating candidates into more ability strata. A possible explanation is that the part MM includes multiple observations for each candidate and this additional information may result in increased measurement precision (Barkaoui, 2011). To verify this, we used the average of the four test parts instead of the four independent scores in an additional MFRM analysis (not reported here because of space limitations) and observed a drop in measurement precision. This implies that candidates are indeed displaying differential performance across the four test parts and the part MM allows raters to make these finer-level distinctions. These results align with Nakatsuhara’s (2011) argument that a single score is not necessarily a good representation of a candidate’s ability on different test parts; they also substantiate Barkaoui (2011), who found that the holistic approach may not be sufficiently sensitive to differences in performances. This was echoed by raters, who noted in their comments that the part MM allowed them to take a more objective and critical stance and award scores that better represented candidate performances on each test part. For this reason, the part MM was selected as a stronger contender to be compared to the analytic MM in phase 2.
Results of the second phase also showed strong correlations between the analytic and part MMs; further statistical analyses, however, pointed to significant differences in means and ranking of candidates, suggesting that the two models may distinguish between candidates in distinct ways and may be tapping into different aspects of the speaking construct. A possible reason is that the analytic approach focuses raters’ attention more explicitly on individual features of performance such as fluency or grammar, whereas the part MM invites raters to consider language use holistically but in dynamic interaction with individual tasks.
Unlike the holistic MM, the analytic and part MMs both included multiple observations, that is, four score points per candidate performance. As expected, they displayed increased measurement precision. The part MM, nevertheless, exhibited marginally higher measurement precision, evidenced in the separation of candidates into more statistically distinct ability levels and higher levels of rater consistency. Raters were mixed in their preference for either approach, with the majority favouring the part MM. Both MMs were considered to be practically feasible in operational settings.
Taking all results together, we believe that a strong case can be made for adopting the part MM for the assessment context investigated in this study on the following grounds: first, it allows for more measurement precision compared to the current operational holistic model and the analytic MM; second, it can provide more diagnostic information to candidates in terms of their performance on different test parts; third, there is evidence that candidate performances on the different test parts are varied enough to merit part scoring; fourth, there is support from raters regarding its feasibility of application in operational settings; and lastly, it has the practical benefit of requiring minimal changes to the current rating scale and the possibility of implementation within a short timeframe.
Implications
The findings shed light on the influence of MMs in a semi-direct online speaking test, with evidence of significant impact on candidate scores/CEFR classifications. It is therefore essential for test development and validation activities to take MM into account and clearly justify the choice of MM.
All MMs under examination were successful in distinguishing between candidates from different ability levels; however, raters showed the highest sensitivity to differences in performance when using the part MM. As discussed in the literature review, the part MM is not commonly investigated or reported on. In speaking tests that consist of a variety of task types, there is an underlying assumption that candidate performances on different tasks may vary and as such, an MM that attunes to these variations is appropriate. We would like to argue that the part MM is a feasible alternative to the more commonly used methods of analytic and holistic scoring. Adopting a part MM can also facilitate the inclusion of a task achievement criterion, which is evaluated at the part level. The downside is that the part MM cannot necessarily reflect “jagged profiles” in terms of linguistic features of performance (e.g. fluency vs. grammar use). Possible solutions to address these limitations are discussed in the next section. Nevertheless, in light of these findings, a recommendation was made for the operational test in question to adopt a holistic-by-part MM, which included a task achievement descriptor per part.
An additional implication relates to the choice of MM when developing, training and evaluating automated marking systems. Automated assessment technologies largely rely on human-awarded marks as the gold standard for the training and evaluation of their systems (Chen et al., 2018). Our study has shown that choice of MM has an impact on those human-awarded marks, and as such it directly influences the source data for machine learning purposes and system evaluations. When reporting the results of human–machine agreement levels, it is important to be transparent about how those human marks are derived.
Future directions
The scope of our study was limited to three specific MMs and did not allow for an examination of other MMs. Hybrid MMs, for example, can be considered in addressing the limitations of the part MM; to illustrate, a task achievement criterion can be applied to each test part, alongside analytic criteria applied to the whole test. Alternatively, marking criteria can be tailored to the task focus; for example, an extended monologue task could be rated for discourse organization, grammar, and vocabulary; a question-and-answer task in the same test could focus on grammar and vocabulary, and a read-aloud task on pronunciation only. A full picture of ability can thus be provided through the complementary use of specific assessment criteria for specific tasks. This suggestion aligns with Taylor and Galaczi (2011) who advocated such an approach, although they cautioned that the choice of assessment criteria per test part would need to be empirically justified.
The wide use of automated scoring technologies also offers the potential for different hybrid models of machine and human scoring (Isaacs, 2018; de Jong, 2018). For example, an auto-marker can focus on features of speech that it can most reliably assess, such as fluency and pronunciation, while raters can focus on more complex elements such as coherence or task achievement. Alternatively, machines could be used to assess more predictable routine tasks with human raters focusing on extended spontaneous speech on less predictable topics. Such complementary approaches can increase scoring reliability and minimize the limitations of both human-mediated and automated MMs. An investigation into different hybrid MMs can therefore be an exciting avenue for future research.
Limitations
There are three main limitations in our study: first, the number of raters was small, which can affect the robustness of the statistical analyses and restrict generalizations of the qualitative data; second, the analytic scale used here was not independently piloted; and third, there were differences between the CEFR classifications in the operational data and those observed across the three MMs, which is likely to be attributable to noise in the operational single-scored data. These limitations notwithstanding, we believe we have provided evidence-based insights which can inform future theoretical discussions and operational considerations on MMs.
Appendix
Summary results for phase 2.
| | Fair-M Average | SD | Logit (M) | SE | Infit Mean Square (Average) | Strata | Separation R | n |
|---|---|---|---|---|---|---|---|---|
| Summary results | | | | | | | | |
| Candidate | 3.54 | 0.74 | 1.00 | 0.50 | 0.76 | 5.58 | 0.94 | 397 |
| Rater | 3.57 | 0.52 | N/A | 0.07 | 1.11 | 20.78* | 1.00 | 10 |
| Criterion | 3.65 | 0.06 | N/A | 0.04 | 1.00 | 5.02 | 0.95 | 4 |
| Summary results | | | | | | | | |
| Candidate | 3.31 | 0.84 | −0.28 | 0.60 | 0.78 | 6.62 | 0.96 | 397 |
| Rater | 3.45 | 0.66 | N/A | 0.08 | 1.07 | 18.27* | 1.00 | 10 |
| Part | 3.55 | 0.03 | N/A | 0.05 | 1.05 | 4.32 | 0.90 | 4 |
| Summary results | | | | | | | | |
| Candidate | 3.35 | 0.78 | −0.20 | 0.68 | 0.79 | 5.97 | 0.97 | 397 |
| Rater | 3.50 | 0.35 | N/A | 0.09 | 0.98 | 17.41* | 0.97 | 10 |
| MM | 3.51 | 0.09 | N/A | 0.04 | 1.02 | 9.94 | 0.98 | 2 |
| Summary results | | | | | | | | |
| Candidate | 3.48 | 0.80 | 0.21 | 1.40 | 0.95 | 5.72 | 0.94 | 637 |
| MM | 3.55 | 0.10 | N/A | 0.10 | 0.97 | 9.65 | 0.98 | 3 |

* See note in Appendix E.
Acknowledgements
We would like to thank Professor Anthony Green, Professor Mike Linacre, the three anonymous reviewers, and the editor for their insightful comments. Thanks also to Dr Trevor Benjamin for his help in the early stages of this research.
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
