Abstract
In second language writing assessment, rating scales and scores from human-mediated assessment have been criticized for a number of shortcomings including problems with adequacy, relevance, and reliability (Hamp-Lyons, 1990; McNamara, 1996; Weigle, 2002). In its testing practice, Euroexam International also detected that the rating scales for writing at B2 had limited discriminating power and did not adequately reflect finer shades of candidate ability. This study sought to investigate whether a level-specific checklist of binary choice items could be designed to yield results that accurately reflect differential degrees of ability in EFL essay writing at level B2. The participants were four language teachers working as independent raters. The study involved the task materials, operational rating scales, reported scores, and candidate scripts from the May 2017 test administration. In a mixed-methods strategy of inquiry, qualitative data from stimulated recall, think-aloud protocols, and semi-structured interviews informed statistical test and item analyses. The results indicated that the checklist items were more transparent, led to increased variance, and contributed to a more coherent candidate language profile than scores from the rating scales. The implications support the recommendation that checklists should be used for level-specific language proficiency testing (Council of Europe, 2001, p. 189).
Introduction
Extended written production in high-stakes contexts is traditionally assessed with rating scales. Writing test performance scores need to be reliable and lead to fair decisions even though they are a composite of candidate ability combined with the assessment criteria, rater behaviour, rating processes, and rating conditions (Hamp-Lyons, 1990; Weigle, 2002). Rating on a scale has limited discriminating power in relatively homogeneous populations. In this research, Euroexam International aimed to design a checklist that could adequately replace the operational rating scales for English as a Foreign Language (EFL) writing at level B2 in a high-stakes context. Over the years, we at Euroexam International repeatedly redesigned the rating scales but failed to overcome score inconsistencies. Besides scoring validity, my colleagues and I placed a special emphasis on increasing transparency and accountability. Checklists proved helpful in resolving these issues in previous research (Kim, 2010, 2011; Struthers, Lapadat, & MacMillan, 2013), and were recommended over rating scales when testing a particular level (Council of Europe, 2001).
Rating on a scale
When discussing the methods of scoring written products, the bulk of the literature exclusively focuses on rating scales. Alderson, Clapham, and Wall (1995) argued that writing examiners need a holistic or analytic rating scale. Bachman and Palmer (1996, pp. 204–208) associated scoring as right or wrong with selected responses or limited production, and they recommended rating scales when scoring extended production. The scale is the only instrument that features among the characteristics of performance-based assessment in McNamara’s representation (McNamara, 1996, p. 9). Weigle (2002) explicitly wrote that the first decision to be made with regard to scoring is the type of rating scale to be used.
The literature is equally united in acknowledging inconsistencies that characterize ratings. Some of the recurring issues that research has unveiled are as follows. Even with clear and explicit criteria, rater training, and controlled conditions, the judgement ultimately depends on the rater (Pollitt & Murray, 1996). The persistent unwanted variation in judgements led McNamara (2000, p. 37) to title his related chapter “The problem with raters.” Similarly, Hamp-Lyons (1990) pointed out that much of the noise stemmed from the rater. The literature abounds with reports of significant score discrepancies and advice on post-exam adjustment (Bonk & Oakey, 2003; Eckes, 2005; Lumley & McNamara, 1995; Trace, Meier, & Janssen, 2016; Weigle, 1998; Wigglesworth, 1993). Because rater training cannot eliminate differences in rater severity, unmodelled variability is viewed as a fact of life that needs to be better understood (Harsch & Martin, 2012; Harsch & Rupp, 2011; Kane, 2006; McNamara, 1996; Weigle, 2002).
Marking criteria are often defined in vague, abstract, and relative terms (Alderson, 1991, 2007; Alderson, Clapham, & Wall, 1995; Brindley, 1998; Mickan, 2003; Upshur & Turner, 1995; Weigle, 2002). A scale or rubric’s failure to communicate clear criteria places a heavy burden on raters (Fulcher, 1996; Fulcher, Davidson, & Kemp, 2011) and makes it difficult to report results to examinees (East & Weigle, 2016). Lack of empirical underpinning can also lead to inconsistencies with the findings of second language acquisition research (Chalhoub-Deville, 1997; Fulcher, 1996; Harsch & Martin, 2012; Knoch, 2011; North, 1997). In particular, scales that assume monotonicity across descriptor levels can lead to unjustified and unwarranted claims (Fulcher, 1996; Turner & Upshur, 2002).
Two aspects of localization emerge as central issues in assessment research. First, rater involvement in scale creation or revision is increasingly seen as a useful source of empirically derived insights (Harsch & Martin, 2012) and thus supports validity at development stages (Alderson, 1991; Weigle, 2002). Second, criteria need to be developed to serve a specific purpose (Broad, 2003; Harsch & Martin, 2012). Although the majority of testing is situated in multilevel contexts (Eckes, 2005; Hamp-Lyons & Kroll, 1996; Knoch, 2009; Lumley, 2005; Weigle, 1998), the local requirements can demand a level-specific method (Harsch & Martin, 2012; Harsch & Rupp, 2011). The CEFR (Council of Europe, 2001, p. 189) recommended rating on a scale when the language user is to be placed at a level on a series of bands, whereas rating on a checklist is preferred if ability is to be judged in relation to a list of points relevant for a particular level. In the next section I look briefly at how the issues with rating on a scale have been addressed in response to local demands.
Alternatives to rating on a scale
A number of alternative routes have been suggested in order to increase scoring validity. The Performance Decision Tree (Fulcher, Davidson, & Kemp, 2011) was devised in response to the inadequacies of measurement-driven scales with ordered descriptors. Designed for specific communicative contexts, this scoring instrument offers a rich description of performance data through binary decisions. Working towards numerical and objective scoring that would provide detailed analytic description, Struthers, Lapadat, and MacMillan (2013) developed a checklist to assess cohesion in children’s writing. Kim (2010, 2011) presented a 35-item checklist for academic purposes writing assessment with an overt diagnostic orientation for instructional practice to support student learning. Two common features that unite the Performance Decision Tree and checklists are (a) the concrete, finely grained, level-specific items and (b) the view of the assessor as a judge of binary decisions, building on evidence of the required quality (Brindley, 2000). Besides simplifying the reader’s work, objective binary judgements rely on precise observation that leads to shared construct interpretation (Diederich, French, & Carlton, 1961).
The rationale behind the present research is that some of the issues with human-mediated writing assessment are inherent in rating on a scale in level-specific contexts as a reflection of the tension between the scoring method and the measurement purpose. Adopting a horizontal approach with a list of criterion statements designed to measure proficiency at a particular level will likely provide a richer and more appropriate description of writing ability at the level in question.
Test specifications
Established in 2000, Euroexam International provides language tests in general, business, and academic English and German at levels A1 through C1. The level-specific exams comprise listening, reading, speaking, and writing papers. The productive activities are assessed by human raters using rating scales. The general English writing paper at level B2 is designed to cover a wide range of non-specialized knowledge areas in the personal and public domains (Euroexam International, 2019, pp. 31–33). The successful candidate is expected to manipulate an extensive vocabulary, display a high degree of grammatical control, use a limited range of complex structures, apply cohesive devices competently to produce a smoothly flowing text, and express themselves appropriately and effectively in a logical manner.
The writing paper comprises two tasks in 60 minutes. The first task is a semi-formal transactional email, and the second task is a neutral or formal discursive article, review, or essay of 150 words written for the general public. Rhetorical functions include describing or reporting on events, narrating, giving opinions, comparing and contrasting, exemplifying, synthesizing, analysing, evaluating, and expressing probability. The instructions specify the context and communicative roles in a maximum of 35 words. The assessment criteria are grammatical accuracy, cohesion and coherence, lexical control, content, orthography, development of ideas, and appropriacy (Euroexam International, 2019, p. 32).
The test has seen a number of changes to its structure, format, time allocation, and a partial redefinition of the construct. The assessment criteria for writing originally translated into a four-component rating scale: (a) task achievement [0,10], (b) range and accuracy [0,5], (c) coherence and cohesion [0,5], and (d) appropriacy [0,5] (Alloway, Bowing, Osváth, & Östör, 2005, pp. 187–188). In 2012, the test as a whole was redesigned, and the assessment criteria of grammatical accuracy and lexical control were now measured by two separate scale components: grammatical range and accuracy, and lexical range and accuracy.
In a previous study I conducted locally (Lukácsi, 2016, pp. 211–212), results revealed that raters did not perceive rating scale components to carry equal weight. Task achievement was seen as the governing principle, followed by grammatical range and accuracy, both in terms of time spent when assigning a score and as expressed in raters’ self-reflection.
The rating scales are in Appendix A. The structure of the rating system is demonstrated using the criteria for grammar from the rating scales for writing in Table 1 (Euroexam International, 2019, pp. 61–62).
Table 1. Rating scale for grammatical range and accuracy at B2.
Table 1 displays a rating scale of six steps, wherein bands 2 and 4 are completely undefined. Where explicit, the subskills are consistent; however, the descriptor for errors is missing from band 1 even though the entire band describes some kind of error. Band 3 is directly comparable to the general linguistic range (Council of Europe, 2001, p. 110) and grammatical accuracy (p. 114) descriptors for B2. The bands are endowed with all the often-criticized shortcomings of the CEFR descriptors (Alderson, 1991, 2007). The highly abstract nature of the descriptions is most apparent when comparing the grammatical structures aspect for levels 3 through 1. The rater is expected to differentiate between simple but mostly correct structures with some mistakes, those that are very simple with frequent and serious mistakes, and a mixture of these two in the intermediary band. These inherent weaknesses ultimately lead to unwanted score properties, such as regression to the mean and unobserved scale categories. Reported scores from the May 2017 live administration in Figure 1 demonstrated the limited spread of writing scores.

Figure 1. Reported score relative frequency distribution for writing in May 2017.
Figure 1 is a visual representation of the relative frequency of the reported scores on the writing paper at B2 in May 2017. The reporting scale ranges from 0 to 100 given that the reported scores are simple percentage-correct scores (Bachman, 2004, p. 300). Unobserved values in Figure 1 are not a result of the score transformation because raw scores range between 0 and 120. In May 2017, 67.97% of the population scored between 50 and 70 on the reporting scale. By contrast, only about 5% of the population achieved reported scores of 40 or less, or 80 or more. The central tendency of the writing scores was not reflected in the other parts of the test. Apart from limited variance and the leptokurtic distribution, some values appeared with a pronounced frequency, such as 55 (6.02%), 60 (7.09%), or 65 (5.10%). These score categories were the result of patterns in raw score allocation.
Euroexam International applies a combination of the conjunctive approach and compensation when reporting results. A reported average score of 60 indicates an overall pass in the complex exam of four test papers (i.e. listening, reading, speaking, and writing). By design, the level standard of each test paper is set to a reported score of 60. However, the minimum requirement for a pass on each test paper is a score of 40, so papers with better sub-scores can compensate for weaker ones. Technically, candidates with abilities below standard on some test papers can pass the complex exam as long as they display abilities above standard on others.
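The combination of the conjunctive approach and compensation can be illustrated with a minimal sketch. It assumes that both thresholds are inclusive and that the four test papers carry equal weight in the average; neither detail is stated above. The function name and the example scores are hypothetical.

```python
def overall_result(paper_scores, standard=60, paper_minimum=40):
    """Return 'pass' or 'fail' for reported scores (0-100) on each test paper.

    Assumptions (not confirmed by the text): thresholds are inclusive and
    the papers are weighted equally when the average is computed.
    """
    # Conjunctive element: every single paper must reach the minimum.
    if any(score < paper_minimum for score in paper_scores.values()):
        return "fail"
    # Compensatory element: stronger papers may offset weaker ones,
    # as long as the average reaches the level standard.
    average = sum(paper_scores.values()) / len(paper_scores)
    return "pass" if average >= standard else "fail"


# Hypothetical candidate: below standard on two papers, above on the others.
candidate = {"listening": 75, "reading": 70, "speaking": 55, "writing": 45}
print(overall_result(candidate))
```

In this invented case the candidate passes with an average of 61.25 although the speaking and writing scores fall below the level standard of 60, which is the situation the last sentence of the paragraph describes.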
Developing the checklist
The ultimate aim of this research was to design a set of genre- and level-specific checklists that could adequately replace the operational rating scales for EFL writing at level B2 for high-stakes proficiency testing. The study was measurement-driven in its aim to develop a tool that reflected the construct as described in the “Test specifications” section without modifying the tasks or having to reset the standard. A special emphasis was placed on increasing transparency and accountability, alongside the need for improving scoring validity (Weir, 2005, p. 22). In this paper, the focus is on the checklist for the essay writing task.
The primary research question that my research team and I set out to answer with this paper was as follows: Could a level-specific checklist of binary choice items be designed to yield results that accurately reflect differential degrees of ability in EFL essay writing at level B2?
In answering the research question, the development team examined the following:
how the locally developed checklist items performed in classical item analyses;
the fit of the checklist items in applications of modern test theory;
the merits of using a checklist relative to the current rating scales; and
the extent to which the checklist could differentiate levels of success in written products, and thus provide empirical support for scoring validity.
Methods
For this study, the research team at Euroexam International and I used a mixed-methods strategy of inquiry. In the initial qualitative phase, teachers’ beliefs were collected through verbal protocol analysis, and salient features of successful essays were listed by means of stimulated recall. The qualitative inquiry was followed by a quantitative analysis of scores from a live test administration and iterative rounds of item analysis. While refining the instrument, the research team members were repeatedly interviewed about their impressions of the checklist and its use.
Participants
At the time of the study, a total of 14 writing examiners comprised the population of raters. The four research team members were a subgroup of the raters working as independent contractors. They were all EFL teachers with at least ten years of experience (M = 14.27, SD = 3.41) in language testing. The research team showed substantial variability in teaching experience (M = 18.00, SD = 4.96).
Materials
The study utilized the operational task and essays from the May 2017 live exam administration. The essay task invited the candidates to discuss to what extent they agreed with the following statement: “The teaching in your country’s schools needs improving.” The candidates were to give reasons for and against and provide a conclusion at the end.
Altogether 184 essays were used in three waves of data collection. Table 2 shows the allocation of essays in each phase of the research.
Table 2. Allocation of essays in research phases.
Procedures
The 184 essays in Table 2 are listed and numbered consecutively. During the initial item-pool construction phase, all four research team members worked with the same four essays. Similarly, in the pilot project, the 30 essays numbered 5–34 were rated by every member of the research team. In the large-scale checklist evaluation phase, each individual rated the 30 common essays numbered 35–64 along with one unique batch of 30 essays numbered 65–94, 95–124, 125–154, or 155–184.
Checklist development
The research comprised three major phases (Figure 2). First, the research team compiled a checklist-questions pool. They transformed the questions into statements and then into items. Next, they piloted the items and reformulated them based on the pilot results and their experience making amendments where necessary. Then, they trialled the 36-item checklist on a large sample in order to unveil how effective and suitable it was.

Figure 2. A flowchart of the major phases of checklist development.
Item development
The aim of the initial item development phase was to compile a pool of statements based on a thorough review of the relevant literature with special emphasis on the CEFR, an analysis of the test specifications, and two waves of preliminary data collection. Research into second language writing in English has revealed operationalizable features that characterize successful texts. Effective compositions are more cohesive (Liu & Braine, 2005) and coherent (Maxwell & Falick, 1992), show lexical variation (Engber, 1995), and contain fewer lexical errors (Engber, 1995). High-quality texts are characterized by collocations and synonyms (Zhang, 2000), the use of longer and less frequent words (Ferris, 1994; Kyle & Crossley, 2016), phrasal elaboration (Biber, Gray, & Poonpon, 2011), longer sentences (Ortega, 2003), and reliance on references (Khalil, 1989). Text length is a reliable indicator of proficiency level (Intaraprawat & Steffensen, 1995; Sasaki, 2000). Incoherence results from poor elaboration or a lack of detail (Khalil, 1989). Grant and Ginther (2000) reported that more-able L2 writers show an increased and more skilled use of modal verbs, passives, articles, and prepositions. The checklist set out to incorporate these research findings.
In the first wave of data collection, the research team completed the test task and voiced their beliefs about salient features of successful essays in a stimulated recall. In the second wave, they evaluated four benchmarked scripts highlighting strengths and weaknesses using think-aloud protocols. In this way, the theoretical grounding of the literature was combined with and complemented by the emergent features from the verbal protocol analysis. The research team members formulated a list of questions based on these salient features. When collated, the questions formed a pool.
The pool contained 35 “yes” or “no” questions relating to 15 CEFR scales and a task-specific element. The “yes” or “no” questions were later reformulated as statements because feedback suggested that these might be easier to work with. Appendix B provides a comparative overview of how the items are linked to the CEFR scales and the operational rating scales. By reformulating the B2 band descriptors where possible, the research team sought to achieve level-specificity. Owing to the differential weighting of the assessment criteria both in the test specifications and in rating practice (Lukácsi, 2016, pp. 211–213), the illustrative scales were not represented by an equal number of items. The driving force of task achievement and grammatical range and accuracy was maintained through a marked representation. Assessment of lexis was limited because candidates can use a dictionary. Coherence and cohesion, and appropriacy, were represented by four items each.
At the end of the initial development phase, the research team compiled a pool of items with direct links to the illustrative scale descriptors in the CEFR. The first version of the checklist for essays was trialled in the pilot project phase.
Pilot project
The major aims of the pilot project were (a) to test the feasibility of the checklist as a measurement tool and (b) to collect evidence regarding the relevance and appropriacy of the items. In the pilot study, the research team assessed 30 essays selected at random in a complete rating design. In Table 2, these essays are numbered 5–34.
Some potentially vague concepts invited disagreement and were addressed in the subsequent discussions. In particular, situational authenticity required clear instructions on how much contextualization and how much realistic information the candidate was expected to provide. There was also tension between the task instruction specifying text length and the local scoring tradition of disregarding word count. The complete checklist for essays at B2 is available in Appendix B.
The checklist was designed to contain two basic item types. Items 14, 27, 28, 29, 30, and 34 were formulated with the word “consistent(ly)” aiming at accuracy, where errors were not allowed. Items 16, 17, 18, 21, and 33 expected demonstration of the ability aiming at range, so a single exemplification merited the credit. The research team purposefully avoided the use of inherently subjective terms, such as “many,” “some,” or “few,” and developed the items to fulfil the requirements for good descriptors as listed in Schneider and Lenz (2001, p. 47).
At the end of the pilot study, the checklist for essays at B2 was complete with 36 items, a set of concept-check questions, and four annotated scripts of how the checklist was meant to be used.
Large-scale checklist evaluation
The purpose of testing the checklist on a large sample was to collect data for test and item analysis. For the statistical analyses, a sample of 150 essays was selected randomly. Each research team member received a batch of 60 essays in a linked incomplete design. The link was 30 common essays that every team member rated. In Table 2, the common essays are numbered 35–64, and the essays allocated to individual members of the research team are numbered 65–94, 95–124, 125–154, or 155–184.
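The linked incomplete design can be sketched in a few lines. This is an illustration only: the rater labels other than R_01 are hypothetical, and assigning the unique batches to raters in consecutive order is an assumption, since the text does not say which rater received which batch.

```python
def allocate_essays(raters, common_ids, unique_ids, batch_size):
    """Give every rater the common link essays plus one unique batch."""
    unique_ids = list(unique_ids)
    if len(unique_ids) != len(raters) * batch_size:
        raise ValueError("unique essays must split into one full batch per rater")
    allocation = {}
    for index, rater in enumerate(raters):
        start = index * batch_size
        # Common essays first (the link), then this rater's own batch.
        allocation[rater] = list(common_ids) + unique_ids[start:start + batch_size]
    return allocation


raters = ["R_01", "R_02", "R_03", "R_04"]  # labels beyond R_01 are assumed
common = range(35, 65)                      # essays 35-64, rated by everyone
unique = range(65, 185)                     # essays 65-184, rated once each
design = allocate_essays(raters, common, unique, batch_size=30)

print({rater: len(batch) for rater, batch in design.items()})
```

Each rater ends up with 60 essays, and the union of all batches covers the 150 essays of the evaluation phase. The 30 common essays are what place the four raters on a shared frame of reference.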
Besides scoring with the checklist, the research team members also passed an overall impressionistic judgement on the essays from fail (1) through pass (2) to pass with distinction (3). The three-level scale was chosen so that grades of success could be distinguished with confidence (Linacre, 2014). The overall judgements served two purposes. First, in keeping with the primary research objective to develop a new measurement instrument without altering the construct or modifying the pass rate of the candidature, the overall judgements were expected to agree with the pass or fail classification based on the reported scores. Second, better performances were to score higher on the checklist. Therefore, the overall judgements were expected to show strong positive correlation with checklist sum scores where the items carry equal weight. These points will be discussed in more detail in the “Rating consistency” section.
The research team leader evaluated the checklist items with the help of the overall judgements and the reported scores. However, given the problems with the reported scores from rating scales, i.e. unobserved score categories, limited range, differences in rater severity, and lack of a shared interpretation, the correlation between checklist sum scores and reported scores is to be interpreted with caution. Consistent rating was a prerequisite for acceptable item responses that could be analysed applying classical and modern test theory. Exact agreement indices were calculated on repeated essay scores, and fit statistics were used to estimate person and item characteristics.
Results
Rating consistency
In order to assess rating consistency, the overall judgements were first compared with a converted classification score derived from reported scores. The conversion was directly related to the score calculations as explained in the regulations (Euroexam International, 2015, p. 4). Candidates must reach at least 40 score points on each test paper. Successful candidates achieving 70 score points or more on each test paper are awarded a pass with distinction. Table 3 shows the reported score to converted classification score transformations.
Table 3. Reported score to converted classification score transformation.
According to Table 3, an immediate fail in reported scores was coded as 1, mid-range reported scores were coded as 2, and pass with distinction was coded as 3. A comparative analysis between the overall impressionistic judgements by the research team and the converted classification scores from reported scores revealed a strong positive correlation, r = .741, p < .001. Exact agreement was 73.30%.
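The conversion and the exact agreement index can be sketched as follows. The cut-offs of 40 and 70 are taken from the regulations as summarized in this section and are assumed to be inclusive lower bounds; the function names and the sample values are hypothetical and do not reproduce the study data.

```python
def classify(reported_score, fail_below=40, distinction_from=70):
    """Convert a reported score (0-100) to a classification score of 1, 2, or 3.

    The cut-offs follow the regulations as summarized in the text; treating
    them as inclusive lower bounds is an assumption.
    """
    if reported_score < fail_below:
        return 1  # immediate fail
    if reported_score >= distinction_from:
        return 3  # pass with distinction
    return 2      # mid-range


def exact_agreement(first, second):
    """Percentage of cases in which two classifications are identical."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(a == b for a, b in zip(first, second))
    return 100 * matches / len(first)


# Invented example: four reported scores and four overall judgements.
converted = [classify(score) for score in (35, 58, 72, 64)]
judgements = [1, 2, 2, 2]
print(converted, exact_agreement(converted, judgements))
```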
Once evidence for consistent classification was identified, the overall impressionistic judgements were compared with reported scores and checklist sum scores, that is, the sums of unweighted checklist item total scores (Table 4).
Table 4. Spearman correlations between overall judgements, reported scores, and checklist sum scores.
Note. Correlation was significant at p < .001 (2-tailed).
As Table 4 shows, the overall judgements were in a strong positive correlation with both the reported scores and the checklist sum scores. Checklist sums were always more strongly related to the overall judgements than reported scores. Correlation between overall judgements and checklist total scores ranged from .688 to .881 (p < .001).
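For readers who wish to reproduce this kind of comparison, a Spearman coefficient can be computed as a Pearson correlation on ranks, with tied values sharing their average rank. The sketch below uses only the standard library and invented data; it is not the software used in the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation, handling ties through average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: overall judgements (1-3) and checklist sum scores.
judgements = [1, 2, 2, 3, 3]
checklist_sums = [8, 15, 19, 27, 24]
print(round(spearman(judgements, checklist_sums), 3))
```

Ties matter here because a three-level judgement scale necessarily produces many equal values, which is one reason a rank-based coefficient suits these data.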
Item analysis
When evaluating how the checklist worked, the research team leader analysed total scores and item level data following classical test theory, and modern test theory in Facets (Linacre, 2017). As for descriptive statistics, the checklist was viewed with regard to the total test score distribution and average values, whereas the individual items were assessed by item difficulty as expressed in the percentage correct and by quality, as indicated by point-biserial correlation (Crocker & Algina, 1986) and Ebel’s D (Ebel & Frisbie, 1986). Since these population dependent statistics combine information about the checklist with test taker characteristics, Facets was used to calculate item difficulty estimates in logits, which were then converted to fair-measure average scores. To judge the degree of fit to the model, infit mean-square statistics were used (Wright & Linacre, 1994).
The literature is divided regarding rules of thumb for acceptable item statistics (Bachman, 2004; Crocker & Algina, 1986; Ebel & Frisbie, 1986). However, there is general agreement on the preference for domain representation (Crocker & Algina, 1986, p. 336; Popham, 1978, p. 91). Therefore, items with proven unsatisfactory properties, that is, extreme item facility (p > .95 or p < .05, Mehrens & Lehmann, 1991, p. 259) or poor discrimination (r < .20, Mehrens & Lehmann, 1991, p. 163), were not automatically excluded.
Total scores on the checklist ranged from 3 to 31 with a median score of 19 and a mean of 18.34 (SD = 6.48). With the exception of item_24, all the items showed satisfactory item–test correlation (r > .20), whereas Ebel’s D did not reach a high enough level on eight items (D < .30; Ebel & Frisbie, 1986, p. 232). For this population of test takers, three items were seemingly too easy (p > .90).
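The three classical indices can be illustrated with a short sketch. Two details are assumptions rather than reported facts: the point-biserial is computed against the uncorrected total score, and Ebel’s D contrasts the top and bottom 27% of scripts, a conventional choice that the text does not confirm. The cut-offs in `flag_item` are those cited in this section; the data are invented.

```python
from math import sqrt


def item_facility(item):
    """Proportion of scripts credited on a binary (0/1) item."""
    return sum(item) / len(item)


def point_biserial(item, totals):
    """Correlation between a binary item and the total score (uncorrected)."""
    n = len(item)
    p = sum(item) / n
    if p in (0.0, 1.0):
        raise ValueError("item has no variance")
    mean_all = sum(totals) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    mean_credited = sum(t for t, x in zip(totals, item) if x == 1) / sum(item)
    return (mean_credited - mean_all) / sd_all * sqrt(p / (1 - p))


def ebel_d(item, totals, fraction=0.27):
    """Facility in the top group minus facility in the bottom group.

    The 27% group size is an assumed convention, not a reported detail.
    """
    n = len(item)
    size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: totals[i])
    lower, upper = order[:size], order[-size:]
    return sum(item[i] for i in upper) / size - sum(item[i] for i in lower) / size


def flag_item(p, r_pb, d):
    """List the cut-offs cited in the text that an item fails to meet."""
    flags = []
    if p > 0.95 or p < 0.05:
        flags.append("extreme facility")
    if r_pb < 0.20:
        flags.append("poor discrimination")
    if d < 0.30:
        flags.append("low Ebel's D")
    return flags


# Invented data: one item scored on six scripts, with their checklist totals.
item = [1, 1, 1, 0, 0, 0]
totals = [30, 28, 25, 12, 10, 8]
print(item_facility(item), round(point_biserial(item, totals), 3), ebel_d(item, totals))
```

All three indices depend on the sample of test takers, which is why the analysis also draws on difficulty estimates in logits.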
Item fit indices were used to assess whether each statement was productive for measurement. Wright and Linacre (1994) recommend retaining items with fit indices between 0.5 and 1.5. With the exception of item_03, all the items were judged to be productive for measurement. As a result, item_03 (“This text is the required length as defined by the task”) had to be eliminated so that acceptable model fit could be achieved. In the end, 35 checklist items were retained. Table 5 contains the item level statistics.
Table 5. Item level statistics of the 35 retained checklist items.
Note. Columns report the proportion correct, point-biserial correlation, Ebel’s D, adjusted raw score, item difficulty estimate, and information-weighted fit.
Table 5 shows that item facility had a range of 0.81 with extreme values of 0.98 and 0.17 (M = 0.52, SD = 0.24). Item fair-measure average range was 0.87 spreading from 0.99 to 0.12 (M = 0.50, SD = 0.28). The correlation between item facility and fair-measure average was nearly perfect (r = .998, p < .001). Item difficulty parameters were estimated as centered (i.e., their arithmetic average was 0).
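The information-weighted fit statistic used to screen the items can be illustrated for the simplest case. The sketch assumes a dichotomous Rasch model with person ability and item difficulty only, whereas the Facets analysis reported here also models rater severity; ability and difficulty values would normally come from the estimation software, and those below are invented.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of credit on a binary item under the dichotomous Rasch model."""
    return 1 / (1 + exp(-(ability - difficulty)))


def infit_mean_square(responses, abilities, difficulty):
    """Information-weighted mean-square: squared residuals over model variance."""
    squared_residuals = 0.0
    information = 0.0
    for observed, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        squared_residuals += (observed - expected) ** 2
        information += expected * (1 - expected)
    return squared_residuals / information


def is_productive(infit, low=0.5, high=1.5):
    """Apply the 0.5-1.5 range that the text cites from Wright and Linacre (1994)."""
    return low <= infit <= high


# Invented example: four scripts whose ability equals the item difficulty.
responses = [1, 0, 1, 0]
abilities = [0.0, 0.0, 0.0, 0.0]
infit = infit_mean_square(responses, abilities, difficulty=0.0)
print(infit, is_productive(infit))
```

A value near 1 indicates that the observed responses vary about as much as the model expects; values above the upper bound signal unmodelled noise, and values below the lower bound signal overly predictable responses.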
In order to check whether severity influenced the scores, checklist sum scores were compared. The data were collected randomly, the sample sizes were equal (n = 30 each, with the common essays excluded), and Levene’s test showed no heterogeneity of variance (F(3,116) = 1.346, p = .264). The results from a one-way ANOVA showed no significant rater effect among research team members (F(3,116) = 1.519, p = .214). Table 6 contains severity and fit statistics from a Facets analysis.
Table 6. Severity and fit statistics.
As Table 6 shows, R_01 was somewhat more lenient than the others, but all the research team members were well within an acceptable fit.
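The one-way ANOVA on checklist sum scores reported above can be sketched with the standard library. The function returns the F ratio and its degrees of freedom; obtaining the p value requires the F distribution, which is left to statistical software. The scores below are invented and serve only to show how four groups of 30 essays lead to the degrees of freedom (3, 116).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the scores around their own group mean.
    ss_within = sum(
        (score - mean) ** 2
        for group, mean in zip(groups, group_means)
        for score in group
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Invented scores: four raters, 30 unique essays each.
rater_scores = [[float(i % 7 + rater) for i in range(30)] for rater in range(4)]
f_ratio, df_between, df_within = one_way_anova(rater_scores)
print(round(f_ratio, 3), df_between, df_within)
```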
An important result of checklist use was the rise in the spread of the estimated abilities. When both score sets were placed on a scale ranging from 0 to 100, the reported score distribution was more limited (M = 55.89, SD = 18.58) than the checklist sums (M = 54.28, SD = 27.76). The internal consistency for the checklist of 35 items was α = .98, with a separation index G = 7.85. Therefore, the item set as a whole was judged to be productive for measurement.
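The two figures reported for the item set are consistent with each other if the standard Rasch relation between a separation index G and a separation reliability R is applied, R = G² / (1 + G²). Treating the reported coefficient as directly comparable to a separation reliability is an assumption; the text does not state how the two values were obtained.

```python
from math import sqrt


def reliability_from_separation(separation):
    """Separation reliability implied by a separation index G."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability):
    """Separation index implied by a reliability coefficient below 1."""
    return sqrt(reliability / (1 - reliability))


# G = 7.85 is the separation index reported in the text.
print(round(reliability_from_separation(7.85), 3))
```

A separation index of 7.85 means that the spread of the measures is nearly eight times their average measurement error, which corresponds to a reliability of about .98.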
Feedback
Given the importance of rater involvement (Harsch & Martin, 2012; Harsch & Rupp, 2011; Lumley, 2005), the members of the research team were repeatedly interviewed for feedback about checklist structure and use. During the pilot project, they typically commented on concepts and wording:
After collecting the comments, a set of concept check questions was developed together with a document clarifying potentially ambiguous concepts. Items that did not reflect the criterion clearly and transparently were reformulated.
During the large-scale checklist evaluation, the comments became increasingly self-reflexive:
Apart from qualities of professionalism, fairness and reliability in scoring, the research team also expressed a wish to be valued. As the volume of the workload increased, the comments related more to time efficiency:
Time efficiency is directly related to the amount of work raters can undertake and, consequently, to their income. The results are also to be officially announced within 30 days of the exam date.
Further, some general comments related possibly to the positive washback of using a transparent list:
Two driving principles behind the checklist development research, transparency and accountability, were mentioned as key values of a fair scoring system.
Validity
Since the major research question of the development project was to see if a checklist could be effective in reflecting test takers’ EFL writing ability with precision, but without altering operational requirements, comparative analyses were conducted between pass or fail classification based on reported scores and checklist sum scores. As strong support for the validity of the checklist, correlational analyses demonstrated that when summed, checklist scores reflected the overall judgements better than reported scores did. In terms of pass or fail decisions, 89.70% of the candidates were classified identically, with the checklist yielding 2.60% false positives and 7.70% false negatives, when compared with live-score groupings. The two pass or fail categorizations were found to be strongly related, χ²(1) = 49.610, p < .001.
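The comparison of the two pass or fail classifications can be sketched as a 2 × 2 cross-tabulation. The live-score grouping is taken as the reference, so a false positive is a checklist pass paired with a live-score fail. The chi-square is the uncorrected Pearson statistic; whether a continuity correction was applied in the study is not stated. The data are invented.

```python
def compare_classifications(live_pass, checklist_pass):
    """Agreement, error rates (in percent), and Pearson chi-square for a 2x2 table."""
    if len(live_pass) != len(checklist_pass) or not live_pass:
        raise ValueError("need two non-empty sequences of equal length")
    both_pass = false_negative = false_positive = both_fail = 0
    for live, checklist in zip(live_pass, checklist_pass):
        if live and checklist:
            both_pass += 1
        elif live:
            false_negative += 1   # live pass, checklist fail
        elif checklist:
            false_positive += 1   # live fail, checklist pass
        else:
            both_fail += 1
    n = both_pass + false_negative + false_positive + both_fail
    denominator = (
        (both_pass + false_negative) * (false_positive + both_fail)
        * (both_pass + false_positive) * (false_negative + both_fail)
    )
    if denominator == 0:
        raise ValueError("each classification needs both passes and fails")
    chi_square = n * (both_pass * both_fail - false_negative * false_positive) ** 2 / denominator
    return {
        "agreement": 100 * (both_pass + both_fail) / n,
        "false_positive": 100 * false_positive / n,
        "false_negative": 100 * false_negative / n,
        "chi_square": chi_square,
    }


# Invented example with ten candidates.
live = [True] * 6 + [False] * 4
checklist = [True] * 5 + [False] + [False] * 3 + [True]
summary = compare_classifications(live, checklist)
print(summary)
```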
Similarity at the classification level was viewed as a necessary but insufficient requirement to support a valid interpretation. Beyond that, the level-specific items that operationalized certain illustrative descriptors assured that the checklist was aligned to the CEFR. Since the operational rating scales were also based on the CEFR, the descriptors provided the common ground that verified the invariance of the underlying trait. Feedback from the research team provided further support for the checklist itself and scoring on a checklist in general. They unanimously felt that the scores were more reliable and fairer than they had experienced previously. They also stated that although there was a large number of items, they could reach an agreeable speed when working with an already familiarized list.
Discussion
This project of developing a level-specific checklist for the assessment of writing at level B2 aimed to replace the operational rating scales for three major reasons. First, the scores from the scales were difficult to model beyond the level of basic arithmetic. In applications of modern test theory, model fit was only achieved if observed score categories of contrasting meaning were collapsed, thereby making it difficult to differentiate between a pass and a fail. Second, the writing paper showed very limited score variance, which was not reflected in candidate performance in other skill areas. Third, previous research conducted locally (Lukácsi, 2016) pointed to the lack of a shared construct interpretation among principal examiners and necessitated curbing idiosyncratic rater behaviour.
The checklist was found to be useful in differentiating among weak performance levels where the rating scales would identically label candidates at the minimum requirement level. When interrater reliability was measured based on checklist use, consistent ratings were detected and unexpected rater behaviour was limited. It was not eliminated altogether, however, which points to the fact that the term expert judge is indeed hard to define (Brindley, 1991).
As a final step in establishing validity, the relationship between writing scores from the checklist and other measures of L2 attainment was examined. The Pearson correlation between checklist-fair-measure scores on the one hand, and test-paper-reported scores on the other, was always strong and positive, ranging from 0.605 to 0.679 (p < .001). This meant no significant change from the reported scores for writing, where values ranged between 0.600 and 0.659 (p < .001). Fair-measure-average scores from the checklist showed an increased strong-positive correlation with both the final composite result (r = .871, p < .001) and the pass rate (r = .808, p < .001). Therefore, the checklist proved to be a useful tool in re-establishing writing as a strong predictor of both the result and the pass or fail classification.
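For readers less familiar with the statistic, the Pearson correlation reported above can be sketched as follows. The helper `pearson_r` and the score lists are hypothetical illustrations, not the study's data or software. When one variable is a 0/1 pass or fail indicator, the same formula gives the point-biserial correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has zero variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for illustration only.
checklist_scores = [12, 18, 22, 25, 30]
composite_results = [40, 55, 58, 70, 82]
passed = [0, 0, 1, 1, 1]  # 0/1 coding yields a point-biserial correlation

print(pearson_r(checklist_scores, composite_results))
print(pearson_r(checklist_scores, passed))
```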
Previous research into the development and use of checklists in language testing had an overt diagnostic purpose (Kim, 2010, 2011; Struthers, Lapadat, & MacMillan, 2013), whereas the CEFR (Council of Europe, 2001, p. 189) recommends this instrument for level testing irrespective of test purpose (Hughes, 1989). Weigle (2002, p. 120) viewed the provision of diagnostic information inherent in analytic scoring as an advantage over a holistic score. Although, as Hughes (1989, p. 94) noted, “the whole is often greater than the sum of its parts,” this research found that checklists can help focus raters’ attention on construct-relevant details.
Research into writing assessment (Alderson, 1991; Bachman & Palmer, 1996; Hamp-Lyons, 1990; Weigle, 2002) emphasized the importance of clearly defined and explicit criteria. The general requirement of independence for good descriptors was that “they should allow for clear yes / no decisions” (Schneider & Lenz, 2001, p. 47). Feedback from raters (Eberharter, 2018) underlines the relative ease of rating on a checklist as opposed to rating on a scale. This study found further evidence that strictly focused criteria can help establish scoring validity, and increase transparency and accountability.
The checklist development study clearly demonstrated that by limiting idiosyncratic rater-scoring behaviour and subjective construct interpretation, rating can be redirected to observable phenomena. Further, such a highly specific analytic approach can provide diagnostic feedback (Alderson, 2005) and thus guide writing instruction. It can assist independent learning and increase learner autonomy (Nation, 2009) over and above its merit in supporting valid score interpretations.
Limitations
The checklist development study was impacted by some limitations. Further research, preferably with a larger number of experts, will be necessary to collect evidence about how adequately the checklist or the rating scales operationalize level B2 in the CEFR.
For the purpose of this study, the overall impressionistic judgements served as a basis for comparison. However, for reporting operational results, the standard will have to be set with precision. In the context of level-testing, the measurement instrument was targeting essays at level B2. A separate checklist will need to be developed for all the targeted genres and levels if it is adopted as the preferred scoring method. All negative items will need to be reworded to reflect the can-do approach adopted in the CEFR.
Conclusion
With this research, Euroexam International aimed to develop a level-specific checklist for assessing EFL writing as part of a high-stakes proficiency examination. The research team and I found that the checklist items performed satisfactorily in classical item analyses, and that the ultimately retained 35 binary statements showed adequate fit in applications of modern test theory. Further, complete with a set of concept-check questions and sample material, the instrument minimized unmodelled rater behaviour, led to a common construct interpretation, and was sensitive enough to differentiate among candidates on the tails of the score frequency distribution.
Although further development is necessary and future research will need to address differences in genre and attainment levels of foreign language ability, the study demonstrated that rating on a checklist as recommended by the CEFR (Council of Europe, 2001, p. 189) is a viable alternative when scoring written products in level-specific language proficiency testing.
Supplemental Material
Appendix_A – Supplemental material for Developing a level-specific checklist for assessing EFL writing by Zoltán Lukácsi in Language Testing
Appendix_B – Supplemental material for Developing a level-specific checklist for assessing EFL writing by Zoltán Lukácsi in Language Testing
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
