Abstract
This article provides a single, common-case study of a test retrofit project at one Colombian university. It reports on how the test retrofit project was carried out and describes the different areas of language assessment literacy the project afforded local teacher stakeholders. This project was successful in that it modified the test constructs and item types, while drawing stronger connections between the curriculum and the placement instrument. It also established a conceptual framework for the test and produced a more robust test form, psychometrically. The project intersected with different social forces, which impacted the project’s outcome in various ways. The project also illustrates how test retrofit provided local teachers with opportunities for language assessment literacy and with evidence-based knowledge about their students’ language proficiency. The study concludes that local assessment projects have the capacity to benefit local teachers, especially in terms of increased language assessment literacy. Intrinsic to a project’s sustainability are long-term financial commitment and institutionally established dedicated time, assigned to teacher participants. The study also concludes that project leadership requires both assessment and political skill sets, to conduct defensible research while compelling institutions to see the potential benefits of an ongoing test development or retrofit project.
Over the past 20 years in Colombia, English language teaching and learning has been at the center of increased attention. Responding to political calls for globalization, national initiatives (such as Colombia Bilingüe 2014–2018 and Programa Nacional de Inglés 2015–2025) have set target English proficiency levels for high school and university graduates and have funded extensive teacher training programs. At the Colombian university considered in this study, substantial, diverse efforts supporting its students’ English language study and use are also visible. These include English-medium instruction undergraduate content courses (Tejada-Sánchez & Molina-Naar, 2020); field-specific English coursework for MA students (Nausa et al., 2021); an English program for PhD students, focused on writing for English publication (Janssen, 2016); English for university administrative staff; an English tutoring program for undergraduates; and a tutoring program for graduate students and professors, focused on English for publication and other professional purposes (Janssen & Restrepo, 2019).
In this university’s English Department (pseudonym), there has simultaneously been an increased focus on language assessment. While language assessment has two principal interests—“research and theoretical work on the one hand, and test design, development and programme management on the other” (Buck, 2009, p. 166)—the focus in the English Department has been on the latter, because of the important connection assessment has with the teaching and learning of languages. As Bachman and Palmer wrote, language assessments can provide evidence of the results of learning and instruction . . . the effectiveness of the teaching program itself . . . what specific kinds of learning materials and activities should be provided to students . . . whether individual students or an entire class are ready to move on to another unit of instruction . . . for clarifying instructional objectives and, in some cases, for evaluating the relevance of these objectives and the instructional materials and activities based on them . . . (Bachman & Palmer, 1996, p. 8)
The department director and this author were in agreement about the crucial relevance that an active local language assessment program could have for teachers, connecting the pedagogies used in the department, the curriculum, and different assessment moments. Situated within this larger philosophical perspective was the retrofit of the English Department’s Classification Exam, or EDCE (pseudonym). This project’s goal was to update the constructs, item types, and assessment practices used within the classification exam, while also providing local teacher stakeholders with increased knowledge concerning language assessment and its relationship to the local curriculum and pedagogical practices. This article describes this project, as a single, common-case study of test retrofit, and it has two focuses: the progress of the retrofit project and the opportunities for language assessment literacy the project afforded. This article specifically addresses the following questions:
RQ 1. What was the result of the EDCE retrofit project in terms of changes made to the test? What local factors affected this project’s progress?
RQ 2. As part of the EDCE retrofit project, what different opportunities to increase and exercise language assessment literacy were afforded to local teacher stakeholders? What beliefs did stakeholders have about these opportunities?
By presenting this case, an additional account is added to the literature about locally developed test projects, the constraints they may experience, and the affordances these projects have for teacher stakeholders to increase and exercise their language assessment literacy.
Review of the relevant literature
Test retrofit
Fulcher and Davidson (2009) described how intact assessment instruments may require retrofit. Drawing on an architectural metaphor, they described how an exam’s logic structure may require an upgrade or change. Upgrades refer to making small shifts, for instance utilizing new technology within established item types, while changes refer to larger redesignations of exam uses, test-taker groups, and other structural elements (p. 124). While upgrades are frequent, Fulcher and Davidson (2009) described how change retrofits are “rare[ly] documented” (p. 135); arguably one of the most historic of these is the reconceptualization of the TOEFL test as the TOEFL iBT®. The project reported here is a change retrofit, because it includes new proficiencies (productive items, listening) in its classification structure.
To build retrofit projects, Fulcher and Davidson prioritized structuring the exam with an evidence-centered design framework (Mislevy et al., 2003). Evidence-centered design is notable as it begins the assessment development process with the consideration of exam validity (see Kane, 2016), ensuring that the purposes, uses, and interpretations of an exam are aligned with its design (Riconscente et al., 2016). To scaffold this alignment, evidence-centered design has test developers describe three conceptual layers of an exam: domain analysis, domain modeling, and conceptual assessment framework. As described by Lane et al. (2016), domain analyses capture the theoretical characteristics of the construct to be assessed. In domain modeling, these characteristics are then preferentially structured into statements that describe a summary of the domain, rationales for assessing this domain, the main skills to be assessed, what observations or potential work products look like, and the variables that affect the construct (see Riconscente et al., 2016, pp. 49–51). The theoretical work finishes with the conceptual assessment framework, “the machinery for generating assessment blueprints . . . provid[ing] the technical detail required for implementation, such as specifications, operational requirements, statistical models and details of rubrics” (Riconscente et al., 2016, p. 52).
Fulcher and Davidson (2009) completed their description of test retrofit by positing a list of 11 steps (pp. 137–138) and providing a discussion of retrofit evaluation. This project instead followed the 12 steps of exam development described in the volume by Lane et al. (2016), as they provided the level of description necessary for this author. These 12 steps are explained in more detail below in the Methodology section.
Factors related to project development
There are many factors that can impact a project’s successful completion. One of these is thorough planning. Indeed, the first step of Lane et al.’s (2016) 12 steps for test development is the creation of an overall plan. Elsewhere, seven of the 11 steps of test retrofit that Fulcher and Davidson (2009) provided describe aspects of planning, and the authors emphasize its “extensive” nature. This includes developing an understanding of the context and the test instrument, but also developing a shared understanding with other stakeholders. Robust planning can help protect a project from “conflicting priorities” and “constraints” (Buck, 2009), which result in situations of compromise (Davies, 2009). Hunter (2009) stated that projects are impacted by the degree to which they are sustainable: “clarity of objectives, detailed planning and clear leadership, necessary ingredients for successful implementation of any new development” (p. 207).
Different social factors and motivations can also impact a project’s successful completion. Alderson’s (2009) edited volume presented various case studies of different large-scale projects related to language teaching and assessment. Therein, Davies (2009) reported on how projects conducted in local environments may be impacted by the instrumental or sentimental rationales that the relevant institutions or individuals may have. The “sentimental role concerns values that are traditional and cultural, while the instrumental role has to do with economics and job-improved possibilities” (Davies, 2009, p. 45). In a case study of the development of a Grade 9 English exam in Slovenia, Pižorn and Nagy (2009) concluded that “politicians, especially when feeling vulnerable to public opinion at election time . . . ignore professional and informed advice and insist on particular courses of action for reasons of expediency [i.e., sentimentality]” (p. 193). In that case, political expediency resulted in project cancelation.
The quality of the interpersonal relationships between the people involved in projects—micropolitics according to Alderson (2009), or between institutions—macropolitics—can also impact a project’s success. Little and Lazenby Simpson (2009) presented a case of successful, recognized language program development in Ireland; however, this program was later closed, likely due to a lack of a consolidated relationship between the project and political leaders. Pižorn and Nagy (2009) presented two cases from Central Europe, which illustrate the problems that can arise when powerful individuals “exercised their arbitrary power in different ways” (p. 185). They went on to write that “ambitions, personal agendas, openness to change and attitudes to professionalism” (Pižorn & Nagy, 2009, p. 185) all can impact a project’s outcomes.
Davies (2009) shed light on how different rationales can intersect with the different qualities of interpersonal relationships. In the case study he reported on from Nepal, his analysis showed that intensive English instruction starting from Grade 8 would best utilize local resources (instrumental rationale). However, the Nepalese Ministry of Education (macropolitics) mandated the start of English instruction at Grade 4, maintaining the former policy; the institution thought that a reform that resulted in fewer years of study—from Grade 8—was unacceptable for sentimental reasons. These reflections are not meant to be read as an exhaustive summary of the factors that can affect project development, but rather as an indication of how a project is situated within and impacted by its local context, far beyond the logic structure defining the project.
Washback and language assessment literacy
Bachman and Palmer (1996) have written that “we should be able to bring about improvement in instructional practice through the use of assessments that incorporate or are compatible with what we believe to be principles of effective teaching and learning” (p. 109). This test retrofit project followed this line of thinking and was part of a larger initiative to provide scaffolding for positive washback to the English Department, drawing a stronger cohesion between the program’s theories, pedagogies, curriculum, and assessment practices (see Cheng & Curtis, 2004). However, Shohamy (2006), among others, importantly warned about the potentially negative political and social impacts of washback from assessment. Elsewhere, Davies (2009) concluded that “introducing an examination [washback] for a nonexistent syllabus [is] a lever not for change but for disaster” (p. 60). Xerri and Vella Briffa (2018) described the negative washback high-stakes assessment projects can have on teachers and students: decreased motivation or disempowerment.
To mitigate these concerns, Xerri and Vella Briffa, along with Shohamy (2006), Wall (2000), and Hunter (2009), advocated for the inclusion of teachers in test development projects. While these texts collectively described how this stance fosters political inclusion, personal empowerment, and professional development, today the discussion has evolved to state that teacher participation in assessment projects can promote language assessment literacy, the “theoretical, practical, and experiential knowledge base required for fulfilling assessment and testing functions in language-related situations . . . required by all stakeholders regardless of their assessment roles and the functions they need to fulfill” (Inbar-Lourie, 2013, pp. 301–302). Taylor (2013, p. 410) provided an eight-point list of what language assessment literacy includes:
• Knowledge of theory
• Technical skills
• Principles and concepts
• Language pedagogy
• Sociocultural values
• Local practices
• Personal beliefs/attitudes
• Scores and decision making
Different case studies have documented test development projects that promoted language assessment literacy with collaborating teachers. To wit, Brunfaut and Harding (2018) described a collaborative teacher-led assessment development initiative in Luxembourg. Aspects of language assessment literacy gained from project participation included highly trained test developers who were knowledgeable in the principles and concepts of assessment, the development of an assessment instrument, and the creation of educational policy. Kremmel et al. (2018) reported on a test development project in Austria, in which teachers increased their assessment knowledge of item writing and rating scale development. Participation by local teachers in smaller segments of assessment development projects also seems to increase and exercise teacher language assessment literacy. Indeed, Holzknecht et al. (2018) described one project in which teachers were involved in the design of rating scales. Increases in language assessment literacy from this project included a better understanding of reliability, validity, and task design, in addition to a more precise understanding of Common European Framework of Reference (CEFR) descriptors. Together, these findings underscored for this author the importance of including teachers in the EDCE retrofit project structure.
Methodology
This article presents a single, common-case study. Single-case studies follow the same logic that justifies single experiments or exploratory research paradigms; these are especially useful when doing research on less-discussed topics. This study is considered common-case as it “capture[s] the circumstances and conditions of an everyday situation—again because of the lessons it might provide about the social processes related to some theoretical interest” (Yin, 2018, p. 56). In this study, the common case addresses the local development of a test retrofit project.
Context
This study was conducted at a private, ranked university with approximately 14,000 undergraduate and 4000 graduate students. This non-denominational, center-oriented university, located in a major urban area, is highly esteemed locally for its research output, reputation with employers, number of professors with PhDs, and numerous regional and international collaborations. Its academic strengths include the areas of administration, biological sciences, economics, engineering, and law. The English Department, though existing for decades, is academically young. Of its 60-some English instructors, currently only seven have completed or are in the process of finishing their PhDs. Nonetheless, professional standards are very important in the English Department: all instructors have MA degrees in relevant fields of study, such as Applied Linguistics or TESOL; all are thought to have a C1+ level of English proficiency. Approximately 6000 students take an English course each semester.
Undergraduates at this university must complete a bilingualism requirement—a B2 level of proficiency—which may be done at any moment during their study. This can be accomplished in three ways: graduation from an international baccalaureate high school, a B2 score on a major international test, or the completion of the last course in the English Department’s undergraduate curriculum. This curriculum develops two course sequences. First is a basic segment, which at one time focused extensively on the English reading comprehension skills needed for undergraduate study at the university. In recent years, this heavy focus on reading has been reconceptualized, and productive language skills have been added to course programming. The basic segment is followed by a productive skills segment, which develops task-based coursework related to speaking and writing; for the students who fulfill the bilingual requirement through coursework, all courses in the productive segment are mandatory. Using an EDCE score, students are classified into a course level from the basic segment or for entrance into the productive skills segment. Approximately one third of undergraduates are exempt from the English Department’s programming because of their English proficiency upon university enrollment.
Participants
As outlined in Table 1, different local teacher stakeholders—both full-time teaching professors and adjunct professors—were approached for participation across the various stages of this test retrofit; they were compensated either monetarily or with project work hours, credited toward their administrative work quota. The teacher participants in this project had at least 5 years of full-time work experience and were an even mix of English and Spanish L1s. The project’s lead investigator (this author) had a prolonged engagement with the exam in various capacities: paper exam proctor, assessment project assistant, exam analyst, exam developer, and exam administrator. This diverse and extensive contact with the exam, its context, and the English Department’s language program over 16 years helps to build the project’s credibility (Yin, 2016, p. 86).
Table 1. EDCE retrofit and planned language assessment literacy.
Note: Exam development steps from Lane et al. (2016, p. 4). Exam retrofits written in
Steps 10–11, while conceptualized in this project, were not achieved.
Materials
To inform exam retrofit, anonymized, pre-existing test records from approximately 4400 BA, MA, and PhD students were considered for analysis; about 2000 test records included pilot item responses. This data set is thought to represent the entire database of test scores from 2016–2019. In this study, these data were analyzed to provide both a psychometric baseline of the test’s function and a basis of comparison for the piloted items developed by local teachers.
To establish the test design specifications to be developed in this project, teacher stakeholders developed different documentation during the exam retrofit process. These documents were modeled on test framework documents provided in Riconscente et al. (2016). As part of training exercises or feedback, the teacher stakeholders also received additional information—data analyses produced by this author—concerning the stage of the test retrofit project they were participating in (e.g., course level exam analyses, performance reports of piloted items). Selections of the materials the teachers received or created are included in this article either as tables or supplementary materials (see Table 1). A simple questionnaire based on Welch (2006) gathered the opinions of the teacher stakeholders about their experiences.
Retrofit procedures
Since the project was a change retrofit, the 12 steps of exam development described in Lane et al. (2016, p. 4) were followed (see Table 1). An overall plan was stipulated (Step 1), and then the first three layers of evidence-centered design were elaborated (Steps 2 and 3): domain analysis, domain modeling, and the conceptual assessment framework. To ensure a strong connection to the curriculum, this work was conducted by six of the English Department’s level coordinators. With the test framework defined, nine item writers were trained according to the procedures Welch (2006) described (Step 4). New items were reviewed, edited, and then consolidated into various exam forms that integrated pilot sections into the original EDCE (Steps 5 and 6). Items were piloted across several administrations, usually including about 300 test-takers each (Step 7). Problem items and distracters were reworked by two teachers and this author. When a complete test form was proposed, jMetrik (Meyer, 2014) was used to equate pilot form scores to the former exam’s score scale (Step 8). Although not described here, a standard setting study confirmed cut scores using the new exam version (Step 9). The development of score reports and test security elements (Steps 10 and 11) was not considered in this project, and test documentation (Step 12) was provided in the form of four internal reports.
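As an illustration of what score equating in Step 8 involves, the following is a minimal sketch of linear equating under a random-groups assumption. The source does not state which equating method or design was applied in jMetrik, so the method, the function name `linear_equate`, and the toy scores are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def linear_equate(new_scores, old_scores):
    """Return a function that maps a new-form raw score onto the old form's scale.

    Linear equating matches the mean and standard deviation of the two
    score distributions. It assumes the two groups are equivalent in ability
    (random-groups design), which is an assumption of this sketch.
    """
    new_mean, new_sd = mean(new_scores), pstdev(new_scores)
    old_mean, old_sd = mean(old_scores), pstdev(old_scores)
    if new_sd == 0:
        raise ValueError("new-form scores have no variance; cannot equate")
    slope = old_sd / new_sd

    def convert(score):
        # Same z-score on both forms: (y - old_mean) / old_sd == (x - new_mean) / new_sd
        return old_mean + slope * (score - new_mean)

    return convert


# Hypothetical toy data, not taken from the EDCE records.
pilot_form_scores = [10, 20, 30]
original_form_scores = [40, 60, 80]
to_original_scale = linear_equate(pilot_form_scores, original_form_scores)
equated = to_original_scale(25)
```

In this toy case the original form's scores are twice as spread out as the pilot form's, so a pilot score five points above the pilot mean lands ten points above the original mean.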
Analyses
To compare the performance of the original EDCE items and the items written by local teachers, descriptive statistics and several metrics from classical test theory are presented. Mean scores provide an indication of the EDCE’s fit with the test-takers’ abilities; measures of standard deviation show the degree to which the test was separating test-taker performances. p-values provide measures of each item’s difficulty, and item discrimination values indicate the degree to which each item separates more- and less-proficient test-takers. Mean item discrimination values have been calculated to gauge the general discrimination of different test sections.
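The two item-level metrics named above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the project: it assumes dichotomously scored (0/1) items and uses the corrected item-total (item-rest) correlation as the discrimination index, since the source does not specify which discrimination index was computed. All function names and the toy response matrix are hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(responses):
    """Return each item's p-value: the proportion of test-takers answering correctly."""
    n_items = len(responses[0])
    return [mean(row[i] for row in responses) for i in range(n_items)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_discrimination(responses):
    """Correlate each item with the total score on the remaining items."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    values = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [total - score for total, score in zip(totals, item)]
        values.append(pearson(item, rest))
    return values


# Hypothetical toy data: 4 test-takers (rows) by 3 items (columns).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
p_values = item_difficulty(responses)
discrimination = item_discrimination(responses)
mean_discrimination = mean(discrimination)
```

Here the first item is answered correctly by everyone, so it has a p-value of 1 and cannot separate more- and less-proficient test-takers, which the sketch reports as a discrimination of 0.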
It should be noted that this article provides several descriptions of different social factors that impacted the project’s progress. Observations about these social factors are this author’s reflections about highly contrastive moments during the project’s development.
Findings
EDCE origin and description
Informal conversations with two long-time staff members indicated that the EDCE was developed in the late 1970s or early 1980s as part of an inter-institutional initiative between a British university and the English Department. This resulted in a four-volume instructional series titled Reading and Thinking in English (Moore & Munévar, 1979/1980), whose content was reflected in the early version of the curriculum segment that focused extensively on reading skills. The EDCE seems linked to this initiative as it classified students into this course series, and it contains similar types of reading comprehension questions (main ideas, details, inferences) based within academic passages that the instructional texts develop. However, no documentation specifically connects the architecture of the EDCE to the language program (target language use domains, uses, how cut scores were determined), one issue that motivated an exam retrofit.
The original EDCE was a timed, 70-minute, multiple-choice exam (78 items; Scantron) that assessed vocabulary, grammar, and reading comprehension using one test form. As with most exams from this era, no relationships were established to any proficiency scale, nor did the exam enjoy recognition from any certifying agency. Since approximately 2008, the EDCE has been administered online using the university’s learning management system. During online administrations, answer options are presented in a randomized order, and exams are automatically scored. Score reports for both exam versions provide only the course level in which the student should enroll. Today, students can take the EDCE for free upon enrollment into the university; should students wish to take it later—or to retake the test—they must pay approximately 110.000 COP (~30 USD). It is unclear whether the English Department has access to these funds. Because there has been only one form in use for several decades, there were concerns about exam security; thus, item renewal became another reason for exam retrofit.
In 2018, large structural changes were made to the English Department undergraduate program, with the initial, reading-oriented segment of the program shifting to include important focuses on other language skills. Accordingly, a gap emerged between the skills developed in the language program and those assessed on the EDCE. Using a SWOT analysis (strengths, weaknesses, opportunities, threats), three exam options were considered for classification into the revised program: a purchased, external exam, the original EDCE, or a retrofitted EDCE (see Table 2 below). Because of its visible cost outlay, the option of a purchased, external exam was discarded despite its convenience, provision of CEFR-scaled scores, and face validity. The original EDCE, despite its convenience and financial savings, was omitted from consideration because of its lack of fit with test-taker abilities (described in detail below), unspecified test architecture, uncertain relationship to the curriculum, and security concerns. The last option, a retrofitted EDCE, was selected by the department director and this author, as they firmly believed that an EDCE retrofit would resolve the issues signaled above (instrumental) and create opportunities to increase and exercise local teachers’ language assessment literacy (instrumental, sentimental).
Table 2. SWOT analysis spotlighted benefits for an EDCE retrofit.
EDCE retrofit beginnings
The development of the EDCE retrofit was done within an environment that was highly protective of the original exam. Previous analyses of this exam had been sporadic and restricted by politically powerful individuals. In 2007, efforts to study and retrofit the EDCE by a Colombian graduate of a US doctoral program in language assessment were rejected. In 2012, Meier and this author were contracted by the English Department director to study the department’s PhD English program placement exam, which includes the EDCE as an exam subsection (Janssen & Meier, 2013); access was granted to approximately 120 PhD student test records, but to no BA or MA test records, which were part of the original request. It can be imagined that this was part of efforts to protect and conserve a test instrument that had long served the English Department. In 2018, a different department director—another politically powerful individual—proposed an EDCE retrofit project as part of a larger initiative to draw stronger cohesion between the pedagogies, curriculum, and assessment practices in the English Department. Access was granted to approximately 5000 BA and MA test records for study by this author.
In baseline analyses using classical test theory statistics from both 2012 and 2019, the EDCE performed consistently across different test-taker groups by education level (see Table 3). In all analyses, the test was highly reliable (α > 0.90), and the mean scores for all groups were similar, though not within the same 95% confidence intervals. For all groups, the means of all items’ p values (Mean p) were above 0.70, indicating that the placement test was too easy for all groups (Brown, 2006). As shown in the column Too Easy, between 30 and 40 of the 78 test items had p values above 0.80, depending on the test-taker group. Accordingly, increasing item difficulty became an important focus during test retrofit.
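The reliability and "Too Easy" figures reported above can be illustrated with a short sketch. It assumes dichotomously scored items, computes Cronbach's α from item and total-score variances, and flags items with p ⩾ 0.80, the threshold given in the note to Table 3. The function names and the toy response matrix are hypothetical and do not reproduce the EDCE data.

```python
from statistics import mean, pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a test-taker-by-item matrix of scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    Assumes at least two items and some variance in the total scores.
    """
    k = len(responses[0])
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def too_easy_items(responses, threshold=0.80):
    """Return the indices of items whose p-value is at or above the threshold."""
    k = len(responses[0])
    return [i for i in range(k) if mean(row[i] for row in responses) >= threshold]


# Hypothetical toy data: 5 test-takers (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
mean_p = mean(mean(row[i] for row in responses) for i in range(4))
flagged = too_easy_items(responses)
```

For these toy responses the item p-values are 0.8, 0.6, 0.4, and 0.2, so only the first item is flagged as too easy.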
Table 3. Similar exam performance across groups.
Note: Mean p: mean of all items’ p values; Too Easy: number of items with p ⩾ 0.80; MeanID: mean of all items’ discrimination values; Low Discrim.: number of items with item discrimination values < 0.20; /78: a total of 78 items.
From Janssen and Meier (2013).
From this study.
Table 3 also shows how patterns of dispersion (SD) were similar. In terms of the mean of the items’ discrimination values (MeanID), items typically separated less- and more-proficient test-takers adequately (see Ebel, 1979), with MeanID ranging between 0.31 and 0.35. This said, depending on the test-taker group, between 11 and 24 of the items had item discrimination values below 0.20. Addressing the lack of discrimination across sections of the exam was also a focus during EDCE retrofit.
The 12 steps of the EDCE retrofit process
After a global plan was posited (Step 1), the first layer of evidence-centered design was carried out: a domain analysis (Step 2). As the first part of the domain analysis, the assessment practices of the English Department exams were analyzed in terms of their constructs, relative weights, and item types used on the exams with the intention of using a similar system of weighting on the retrofitted exam. Figure 1 illustrates several findings from this analysis. Importantly, the percentages being assigned per skill area were consistent across the program segment for which the EDCE was being used to make classification decisions. This consistency was important as it afforded a single recommendation about how the EDCE retrofit could be weighted to reflect the local curricular design. Other findings that could be applied to the retrofitted exam included benchmarks (i.e., lexile levels) for different course levels and a pool of different possible item types used in the language program (e.g., multiple choice, multiple choice with multiple possible answers, ticking true/false, ticking correct/incorrect; other ticking “opposites,” matching, matching in tables, charts, graphic organizers, ordering information).

Figure 1. Exam analyses revealed shared percentages across the courses of one program segment, but differences in how these constructs were developed. ID questions refer to the identification of information, using a multiple-choice format. Examples of semi-productive questions include those that prompt students to conjugate a specified verb to a specified verb tense. Productive questions include open answers.
The analysis also showed, however, how very different categories of item types were being used to develop each construct in different course levels. As shown in Figure 1, some courses included semi-productive item types (e.g., conjugate the verb “write” in the correct verb tense for the context), while other courses utilized a preponderance of multiple-choice questions; wholly productive language skills (e.g., short answers) were not used except in the writing sections. Reflecting upon the analyses, one-on-one conversations were held with each level coordinator concerning topics about language assessment literacy: the language pedagogies and assessment principles being used (i.e., the local practices), while also brainstorming possible assessment practices in the course levels. While it was ultimately at the coordinators’ discretion to implement any changes into their course level, the EDCE project opted toward using more productive item formats.
The domain analysis (Step 2) was also informed by a literature review, during which six level coordinators gathered “substantive information about the domain of interest that will have direct implications for assessment, including how that information is learned and communicated” (Riconscente et al., 2016, p. 45). Accordingly, these coordinators considered the areas of reading and listening comprehension, the two constructs of focus in the EDCE retrofit, to develop a pedagogical overview for thinking about these constructs. This overview could be applied when making decisions about the courses or the EDCE retrofit project, ostensibly drawing the curriculum and exam more closely together. In the case of listening, three teaching professors and this author reviewed 19 texts over 5 weeks (Supplementary Material A lists the bibliography for listening). During each meeting, the professors presented and discussed with their group both a written and spoken summary of their texts. Their principal findings were consolidated in a Domain Analysis document (Supplementary Material B). This document consolidated the group’s understanding of how they view listening comprehension, comparing the key points from the two authors they found most salient (Mihai & Purmensky, 2016; Rost, 2011). For this language domain, the important points for level coordinators when developing their listening coursework included the different types of knowledge utilized when listening (e.g., phonological, syntactic, semantic) and the different functions a listener can have when responding to assessment tasks (e.g., identification, replication, comprehension). The domain analysis document also presented other high-level factors in the local program to be kept in mind during teaching or assessing the listening construct: a description of advanced listening proficiency from So et al. (2015), the two relevant university language requirements, and their own intuitive understanding of listening.
After domain analysis, the level coordinators consolidated this material into a domain model (Riconscente et al., 2016, pp. 49–53). This provided a “clear statement of the claims to be made about examinee [KSAs]” (Lane et al., 2016, p. 4). As can be seen in Supplementary Material C, the domain modeling provided an overview summary and rationale, and specified the focal KSAs to be developed in the listening component. Also listed were additional KSAs, the observations to be made about test-takers, and the characteristics of the tasks, including variable features.
Based on domain analysis and domain modeling, and informed by general information from the course level exams, a conceptual assessment framework was next defined (Step 3). In this document, the level coordinators and this author stipulated the specific tasks, variables, specifications, and observations that the exam would include (see Supplementary Material D). This drew close relationships between the constructs developed in the local curriculum and those that they recommended for development in the retrofitted EDCE. In terms of listening comprehension, weight was given to main ideas, supporting details, and the relationship between different ideas, as developed by cloze-answer responses. Less important for this construct were the recognition of varieties of English uncommon in the Colombian context, complex analytical tasks based on the listening passage, and very fast-paced speech.
With the assessment framework defined, item development and the subsequent analysis of the original and pilot EDCE items provided teacher stakeholders with moments to exercise the principles and concepts of language assessment (Step 4). During training sessions based on Welch (2006), this author presented an overview of assessment theory (validation; score uses; evidence-centered design). Item writers next became acquainted with the materials that would guide item writing (e.g., outlines of the specific item types desired; characteristics of the CEFR at each course level; lexical measures). Item writers then considered the technical concepts of item facility (also item difficulty; p-values) and item discrimination, and then learned how these metrics would be used to provide them with feedback about their piloted items. The session concluded by reviewing Haladyna and Rodriguez’s (2013) and Rodriguez’s (2016) suggestions for writing multiple-choice items, and then applying these principles and concepts to poorly written sample items.
During the allotted item writing period, item writers had two personalized feedback sessions. The first feedback session considered the potential source texts the teacher stakeholders had identified, especially in terms of content, language appropriateness, and copyright; the second feedback session provided ideas about improving the items, according to suggestions found in Haladyna and Rodriguez (2013). At the end of item writing, item writers turned in three texts with at least 25 items and the related meta-data concerning the items developed.
All item writers and level coordinators received feedback from an item analysis of pilot items (see Table 4) about different assessment practices. This feedback included p values, discrimination values, p values for the top, middle, and lower thirds of the test-taker group (to better sense for whom the item was easy or difficult), in addition to comments and the actions that would be taken in the future with the item. These teacher stakeholders also received a shortlist of the assessment features that worked best during piloting. These features, largely following the recommendations by Haladyna and Rodriguez (2013), included the following: three distracters seemed to be enough (i.e., fourth distracters were rarely chosen); distracters worked best when written in parallel form; distracters should include either all or no Spanish cognates; and multiple-choice verb-tense choice was always too easy for test-takers.
“Text Q” item performance feedback for item writer.
Note: For reasons of space, this table presents feedback for only five of nine items. Colors were used to signal problematic areas. ID: item discrimination.
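The kind of feedback described above can be illustrated with a minimal sketch. It assumes dichotomously scored items and an upper-lower discrimination index (p for the top third minus p for the bottom third); the article does not state which discrimination formula the project used, and the function names and data below are hypothetical.

```python
def proportion_correct(group, item):
    """Share of test-takers in `group` who answered `item` correctly (the p value)."""
    return sum(person[item] for person in group) / len(group)


def item_analysis(responses):
    """Return overall p, p by thirds, and an upper-lower ID for each item.

    `responses` holds one list of 0/1 item scores per test-taker and is
    assumed to contain at least three test-takers.
    """
    # Rank test-takers by total score, highest first, then split into thirds.
    ranked = sorted(responses, key=sum, reverse=True)
    cut = len(ranked) // 3
    top = ranked[:cut]
    middle = ranked[cut:len(ranked) - cut]
    low = ranked[len(ranked) - cut:]

    report = []
    for item in range(len(responses[0])):
        p_top = proportion_correct(top, item)
        p_low = proportion_correct(low, item)
        report.append({
            "item": item + 1,
            "p": proportion_correct(responses, item),
            "p_top": p_top,
            "p_middle": proportion_correct(middle, item),
            "p_low": p_low,
            # Assumed index: difference between top-third and bottom-third p values.
            "id": p_top - p_low,
        })
    return report


# Hypothetical data: six test-takers, three items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
report = item_analysis(responses)
for row in report:
    print(row)
```

In this invented data set, the first item is easy for almost everyone and therefore separates the top and bottom thirds less sharply than the other two items, which is the sort of pattern the p values by thirds are meant to make visible to an item writer.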
Item analysis of both the original and piloted EDCE items also provided important information to level coordinators about the local students’ different proficiencies in English, which could be used to inform choices about language pedagogy. As shown in Table 5, the undergraduate test-taker scores are moderately negatively skewed (skew = −0.78), which is sensible since the test is too easy for the test-taker group on the whole (p = .75). The result of this skew is that the means of the top and middle thirds of test-takers are above the mean of the group as a whole, while the mean for the bottom third is more than one standard deviation below the group mean, with skew for the lowest third = −0.94. Put plainly, the project provided important information about local students: that the low proficiency group is quite low, with the implication that local teacher stakeholders consider whether this group requires a qualitatively different pedagogical intervention than the higher performing groups.
Differences within undergraduate test-takers.
Note: SD: standard deviation; SE: standard error; CI: confidence interval.
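For readers less familiar with these statistics, the following sketch shows one way skew and the comparison of the bottom third against the group mean could be computed. It assumes the population (Fisher-Pearson) skewness coefficient; the article does not specify which skewness formula was used, and the scores below are invented for illustration only.

```python
from statistics import mean, pstdev


def skewness(scores):
    """Population (Fisher-Pearson) skewness: third central moment / SD cubed."""
    n = len(scores)
    m = sum(scores) / n
    m2 = sum((x - m) ** 2 for x in scores) / n
    m3 = sum((x - m) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


# Hypothetical total scores on an easy test: most scores cluster near the top.
scores = [2, 5, 8, 8, 9, 9, 9, 10, 10]

ordered = sorted(scores)
cut = len(ordered) // 3
bottom_mean = mean(ordered[:cut])   # mean of the lowest third
group_mean = mean(scores)
group_sd = pstdev(scores)

print("skew:", round(skewness(scores), 2))
print("bottom third more than 1 SD below group mean:",
      bottom_mean < group_mean - group_sd)
```

A negative value indicates a longer tail of low scores, which is the shape expected when a test is easy for most of the group but a smaller subgroup performs far below the rest.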
Another tendency uncovered through analysis provided teachers with another important characteristic of the local context: the impact of Spanish cognates on English language use. Indeed, one academic text (194 words) rated by the Free-Lexile analyzer as being 1200 L–1300 L, or B2–C1, contained 77 cognates (39.76%), in addition to 77 structure words, representing a total of 79.52% of the text (see Figure 2). When analyzing item data for the top, middle, and low thirds of test-takers (Table 6), it becomes clear that a strong majority of the top two thirds of test-takers understood the relationship between Spanish and English in terms of this linguistic feature, while an important proportion of the lower third understood this feature much less. This has direct implications for the pedagogies of the basic segment courses dedicated to this student group. Additional pedagogical information about local students generated by item analysis included how phrasal verbs and idiomatic expressions challenged students, as did less-common meanings of vocabulary items. Items that considered tone or an assertion’s degree of strength were also difficult.
Cognate passage p-values.

Cognates and structure words. English-Spanish cognates dominate academic texts when compared with structure words.
A last example of local pedagogical knowledge afforded by item analysis focused on the verb structures that are shared between English and Spanish. These structures were recognized by nearly all students in multiple-choice verb tense items. As Table 7 shows, almost no test-takers selected a gerund as a main verb in a sentence, something that is also not possible in Spanish. Furthermore, distracters with negative verb formations that are not possible in Spanish (e.g., no main verb; negation at the end of the verb phrase) were also wholly ignored. These findings made an argument for level coordinators to prioritize productive test-tasks instead of multiple-choice when assessing constructs related to verb tense, and to disprefer multiple-choice distracters that are grammatically impossible in both languages. Importantly, cloze response verb tense production challenged students with a difficulty appropriate for their proficiency, as did cloze items that looked at verbs and their collocations.
Test-takers did not select fabricated verb structures.
The teacher stakeholders were in near consensus that item writing was a positive experience for them. In response to the prompt “Describe how these concepts impact your teaching practice,” comments touched on how the project informed their technical skills, their application of assessment principles and concepts, and their own beliefs/attitudes:
Working in the design of this item has made me reflect upon my practices when creating exams or quizzes for my classes.
It has certainly helped me to recognize errors in design in previous tests and . . . to write better and more focused questions and distracters.
I am applying new criteria when designing exams. I am learning about Item Facility and Item Discrimination so this is giving my new perspectives and tools when designing and teaching. I consider it is really important to pay more attention to the way we design and think about exams. These results are not just passing or failing scales. Assessment goes far beyond and we as teachers need to explore and learn more about these.
Although item writers felt positive about the item writing experience, they also called for additional training. On these two concerns, item writers made comments such as:
I would love to get really good at understanding IF [item facility] and ID better so it becomes easier for me as a designer to understand them better and therefore design better items.
Again, I am better able to recognize weak item design, but I have not yet had the opportunity to apply these skills. I have made moves toward improving items in the courses I teach but have not been able to organize this effort with any coworkers as of yet. I can now see that many of the quizzes in our courses need fixing.
More similar training for teachers who are in charge of test designs, [especially] all teachers in the [first program segment].
While extensive planning had been invested into the conceptual tasks of defining the test framework and item writing in a collaborative environment (Steps 1 through 4), a marked contrast emerged during exam assembly and production (Steps 5 and 6), which involved uploading and editing the pilot items on the learning management system. Stated simply, the required time investment and the complex, technical nature of these practical tasks had been overlooked and underbudgeted. As project planning and funding did not contemplate the employment of a systems engineer, two assistants and this author spent extensive stretches of time copy-editing each question’s HTML code to ensure format consistency across the entire test. Technical issues with audio files were especially complex and affected the listening pilot. Work-arounds were developed with the university’s tech team, but these were limited by the learning management system’s functionality. These hindrances eroded the sentimental value of the project, and it was effectively put on hold for 6 months, even though new test items and a new test form equated to the original EDCE had been created.
After pilot administration (Step 7), analyses (Step 8) showed how the new test form made improvements over the original EDCE, while using fewer test items (k; see Table 8). As the mean item difficulty values (Pmean) decreased, that is, as items became harder on average for a group that had found the original form too easy, the new form better matched the test-taker abilities. The standard deviation values (SD) increased, indicating that test-takers were being more widely distributed by the pilot placement instrument. Also positively, the mean discrimination values (IDmean) for the pilot tests showed increased differentiation between more- and less-proficient test-taker groups; the alpha coefficient also stayed above 0.90 or increased compared with the original test.
Pilot testing showed improvements over the original EDCE.
Note: Pmean: mean of all items’ p values; IDmean: mean of all items’ item discrimination values; SD: standard deviation; SE: standard error.
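As an illustration of the reliability statistic reported in this comparison, the sketch below computes an alpha coefficient from a matrix of item scores. It assumes that the alpha reported is Cronbach's alpha and uses population variances throughout; the function name and the data are hypothetical and are not drawn from the EDCE.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-test-taker item score lists.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    # Variance of each item's scores across test-takers.
    item_variances = [
        pvariance([person[item] for person in responses]) for item in range(k)
    ]
    # Variance of test-takers' total scores.
    total_variance = pvariance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four test-takers, three dichotomously scored items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Because alpha tends to rise with the number of items, a shorter form that holds alpha above 0.90 suggests that its items are, on average, contributing more consistent information than those of the longer form.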
Today and the future
Faced with government restrictions concerning public gatherings in 2020–2021 related to COVID-19, the English Department canceled in-person exam administrations. Not knowing whether remote proctoring software would provide the secure environment desired, the department was reluctant to administer either the original or retrofitted EDCE during online administrations; as such, the EDCE retrofit was paused and an external exam was purchased for use throughout 2020. Because of funding changes, in 2021 the English Department returned to using the original EDCE, and simultaneously this author left for a post in North America. This said, the English Department is currently conducting an external accreditation; the process asks the department to designate different personnel to exam development and management. In addition, a full-time professor has been hired by the English Department and has worked with this author in 2021 to plan this project’s next steps. Ideally, these steps will help bolster a strong local community of practice—and an institutionalized structure within the department and university—to reenergize the EDCE retrofit project and provide teacher stakeholders with additional opportunities to develop and exercise their language assessment literacy.
Discussion
In terms of the assessment instrument itself, this case study shows how the EDCE retrofit project made two principal contributions. A conceptual design framework was described for the exam, connecting the assessment more rigorously to the local curriculum, and a psychometrically more robust test version was developed and was ready for implementation. A variety of additional test items, midway through the piloting sequence, have also been designed.
This case study illustrates how test development projects can become stalled where there are gaps in planning. While this project was invested heavily in the conceptual structure of the assessment, framed within Lane et al.’s (2016) outline of test development, less attention was given to the practical task load concerning the learning management system. The project did not have the long-term project financing or permanent staff to complete these additional tasks, which became more complex since participation was often based on availability. Test development in large-scale language programs cannot rely solely on dedicated project hours that are likely to shift across semesters but should be institutional in nature. By providing space within the institution’s infrastructure for a test project of this scale, a community of practice will have the chance to form around the test, something Brunfaut and Harding (2018) emphasized for project sustainability. With a robust community of practice in place, project developers will be better prepared for the “exogenous shocks” that Swales (2019) described as inevitably affecting all programs.
This project also illustrates how local test development projects are situated within specific social environments, whose constraints and affordances must be considered and directly addressed when planning. The steps of exam retrofit were impacted by politically powerful individuals (micropolitics), who could block or facilitate large projects. Indeed, Brunfaut and Harding (2018) concluded that “sustainability remains difficult to ensure, and many of the factors which may contribute or detract from the longevity of [a] project remain outside of the control of teams of teachers (or consultants)” (p. 170). They noted how persons may “view reform projects either favourably or sceptically depending on a range of issues unrelated to the quality of project outcomes” (p. 170). This social concern, addressed by Alderson (2009), merits further thought from the field; certainly, it should be addressed specifically during project planning in terms of developing a shared understanding with decision makers, as recommended by Fulcher and Davidson (2009).
Positively, this case study shows how the EDCE retrofit project provided local stakeholders with many different types of experiences adding to and applying their language assessment literacy. During the domain definition and test specification statements, stakeholders explicitly reviewed their knowledge of language pedagogy and created connections to principles and concepts related to assessment. Teacher stakeholders were provided feedback about their local practices (course level assessments), and additional assessment principles that might provide a more meaningful test score. They learned about the principles and concepts guiding item writing and applied these concepts in practice. By receiving data-driven information about many aspects of their students’ language performance, teacher stakeholders could adjust the local language pedagogies and curriculum. All together, this project provided professors with a “multidimensional language assessment literacy” (Brunfaut & Harding, 2018, p. 170), something that the purchase of external exams cannot provide. This increase in locally held assessment knowledge is especially important in non-center contexts such as Colombia, where information derived from local test-takers can be helpful in creating responsive and meaningful program development and delivery.
Conclusions
This common, single-case study had as its objective the documentation of one test retrofit project across its different phases. Specifically, this study adds evidence about the fundamental importance of exhaustive planning and how local forces can impact test development projects. This study shows how test development or retrofit projects afford multiple opportunities for local stakeholders to gain and exercise language assessment literacy. It can be argued that this project was successful, in that a conceptual framework, based within evidence-centered design, was established and a test form was created that was psychometrically more robust than the original form. Because stakeholders were included in the project’s steps, they gained experience with assessment theory (evidence-centered design) and with principles and concepts of assessment (item writing skills, item measures), and connected these to a locally developed language program. The study of data provided by local test-takers underscored the needs of local student groups for teacher stakeholders, so that these needs could be more closely addressed in their pedagogical responses.
However, this project struggled in its allocation of resources, especially concerning planning for the technical management of the learning platform. Indeed, detailed knowledge of assessment seems to be a different area of expertise than sustainable project development. What seemed missing here is a special type of “clear leadership.” This likely includes being an “accomplished process manager” (Hunter, 2009, p. 67) and a knowledge of “sound financial management and good business practice on the other” (Buck, 2009, p. 177). Project leaders must also be politically adept: “although reforms may be introduced by determined individuals, they are implemented by people who have their own opinions and biases, and so micropolitics [interpersonal relationships] may frustrate innovation and change” (Pižorn & Nagy, 2009, p. 188). Brunfaut and Harding (2018) made five recommendations for projects, including a phased scaffolding over a sufficient period of time, building toward both greater independence by local stakeholders and project sustainability. Of import, they recommended including language assessment literacy work as part of the project. They also paid heed to different personal factors: “teachers from different spheres of influence . . . who are dedicated and professional from the outset” (Brunfaut & Harding, 2018, p. 170).
The benefits of local test development described above require heavy institutional investment. This can be frustrated by the nature of institutions, “complex organisational structure[s] . . . [that] often have a completely erroneous or simplistic idea of what is involved in making good tests” (Buck, 2009, pp. 166–167). A first step toward ensuring this positive washback in the long term is for the field of language assessment to continue to reach out to program administrators, documenting for them how test development projects, carried out within a large language program, require a dedicated space within institutional structures in order to function. These projects are enormous in scale and need to be understood as such; as Buck wrote, test developers do much more than “write a few items, give them to test-takers, and report the number of items correct” (Buck, 2009, p. 168). This is the exact point that administrators must understand: the quality of a test lies in much more than its items; it lies in the uses and inferences that the test can support.
To argue for and carve out this institutional space, project managers also need macropolitical skills. Figueras (2009) described this as sustainable leadership, an “activist engagement with the forces that affect it, [that] builds an educational environment of organisational diversity that promotes cross-fertilisation of good ideas and successful practices in communities of shared learning and development” (p. 208). Indeed, project managers must be able to help institutions move past immediate concerns of cost, to see the importance of investing sustainably in projects like local test development and the development of their teacher stakeholders’ language assessment literacy. Superficially, this stance is innovative and gains the institution important distinction. More importantly, and in the long term, an operationalization of test retrofit as part of an ongoing curricular evaluation and development reflects a truly high-quality institution, one that cares about responding to student needs in addition to developing professional, locally relevant knowledge by its teacher stakeholders.
Supplemental Material
Supplemental material, sj-docx-1-ltj-10.1177_02655322221076153 for Local placement test retrofit and building language assessment literacy with teacher stakeholders: A case study from Colombia by Gerriet Janssen in Language Testing
Supplemental material, sj-docx-2-ltj-10.1177_02655322221076153 for Local placement test retrofit and building language assessment literacy with teacher stakeholders: A case study from Colombia by Gerriet Janssen in Language Testing
Supplemental material, sj-docx-3-ltj-10.1177_02655322221076153 for Local placement test retrofit and building language assessment literacy with teacher stakeholders: A case study from Colombia by Gerriet Janssen in Language Testing
Supplemental material, sj-docx-4-ltj-10.1177_02655322221076153 for Local placement test retrofit and building language assessment literacy with teacher stakeholders: A case study from Colombia by Gerriet Janssen in Language Testing
Footnotes
Acknowledgements
I thank the English Department’s directors and the different teacher stakeholders who participated in this project for their invaluable contributions, both to the department and also to my own learning process. I also thank the peer reviewers, whose feedback made important contributions to this article. All remaining errors or imprecisions are, of course, mine.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
