Abstract
The studies documented in the four articles in this special issue uniquely exemplify principles of design-based research as follows: by taking innovative approaches to significant problems in the contexts of real educational practices; by addressing fundamental pedagogical and policy issues related to language, learning, and teaching; and, in the process, by refining their claims and assessment systems. I analyze and compare the four studies in view of Anderson and Shattuck’s (2012) guiding principles of design-based research: real educational contexts, design and testing of a significant intervention, mixed research methods, multiple iterations, collaborative partnerships, and practical impact on educational practices. The four studies differ in numerous respects but are mutually informative about conducting systematic inquiry into diagnostic language assessments. The focus of their analyses on distinct aspects of language and communication relevant to particular educational programs and populations suggests that diagnostic language assessments tend more toward specific-purposes assessment than toward general language proficiency testing.
The four research-oriented articles in this special issue provide new insights, research innovations, as well as theoretical perspectives on and practical applications for language assessment. The focal topic that distinguishes and unifies their contributions is assessing learners’ abilities in second or foreign languages for diagnostic purposes. Each article offers unique, significant advances to understanding how assessments can inform and usefully guide the learning and teaching of languages. In addition, the four studies provide exemplary models for conducting design-based research, a new direction for language testing inquiry and the focus of my present analysis. The dual focus on diagnostic assessments and design-based research appears to have arisen, independently and perhaps inevitably, for each research team as they conducted systematic, exploratory studies on significant issues in educational practices guided by theoretical principles about learning, assessing, teaching, and languages. The researchers have considered these elements together – concurrently and in conjunction – while trying out, analyzing, and refining specific claims as well as assessment systems with learners and educators participating in real educational contexts. Their integration of research, theory, assessment, curricula, instruction, and learning stands in marked contrast to most prior studies of language assessment, or indeed of education, which have conventionally aimed to separate for analytic purposes these otherwise integrally unified elements.
Design characteristics
Design-based research has been around for a few decades in studies of education and of information and communication technologies, arising most visibly from studies of the introduction of innovative computer programs into schools and workplaces (e.g., Scardamalia & Bereiter, 1991) and even featuring in major handbooks on research methods (e.g., Schoenfeld, 2006). Nonetheless, surprisingly few published examples have appeared in the fields of language education or applied linguistics (Schleppegrell, 2013, is a rare, notable example). The Design-Based Research Collective (2003, p. 5) proposed that:
good design-based research exhibits the following five characteristics: First, the central goals of designing learning environments and developing theories or ‘prototheories’ of learning are intertwined. Second, development and research take place through continuous cycles of design, enactment, analysis, and redesign (Cobb, 2001; Collins, 1992). Third, research on designs must lead to sharable theories that help communicate relevant implications to practitioners and other educational designers (cf. Brophy, 2002). Fourth, research must account for how designs function in authentic settings. It must not only document success or failure but also focus on interactions that refine our understanding of the learning issues involved. Fifth, the development of such accounts relies on methods that can document and connect processes of enactment to outcomes of interest.
Brown (1992) is generally credited with initially arguing that design-based educational research needs to be situated in, and account systematically and theoretically for, real contexts of teaching and learning so as to ensure that “an effective intervention should be able to migrate from our experimental classroom to average classrooms operated by and for average students and teachers, supported by realistic technological and personal support” (Brown, 1992, p. 143). Reviews of the characteristics of design research studies such as Edelson (2002), Collins, Joseph, and Bielaczyc (2004), and, most recently, Anderson and Shattuck (2012) have demonstrated how published studies using this approach have focused on the following:
real educational contexts,
the design and testing of a significant intervention,
mixed research methods,
multiple iterations,
collaborative partnerships, and
practical impact on practice.
These six characteristics, highlighted by Anderson and Shattuck (2012), provide a convenient basis for me to compare how the four present studies of diagnostic language assessment exemplify principles of design-based research. Table 1 outlines the six characteristics as applied, in the remainder of this article, to the four studies. There are, however, two caveats. First, I do not imagine that any of the present researchers set out initially or explicitly to do design-based research, but it seems to me that is what they have done and are doing, and their innovative approaches are worth analyzing and amplifying collectively from this perspective. Second, the article by Harding, Alderson, and Brunfaut stands apart from the other three articles in this special issue because of its situation at a conceptualization stage of a design cycle – following principles derived from analyses presented already in Alderson, Brunfaut, and Harding (2014) among other sources – whereas the other three articles present empirical data from the implementation of design-based studies in specific educational settings.
Table 1. Six characteristics of design-based research in four diagnostic language assessments.
Real educational contexts and issues
All four of the studies are concerned with diagnostic assessments within and for real, naturally occurring situations of language teaching and learning, albeit each study has analyzed data from and about highly different populations, types of educational programs, and language abilities. Chapelle, Cotos, and Lee synthesize and evaluate results from three studies conducted within programs of English for Academic Purposes for international students who have recently entered diverse academic programs at a university in the United States. They analyze students’ and their instructors’ uses and perceptions of two different automated evaluation programs designed to assist students to improve particular aspects of their writing in English. Poehner, Zhang, and Lu situate their research within third-semester, university classes of Chinese for American students with intermediate proficiency levels in the language. They focus on innovative assessments of reading and listening comprehension, which in addition to assessing the skills or knowledge already acquired also uniquely assess, as indicators of the students’ second-language development and potential trajectories for learning and teaching, the kinds and extent of computer-mediated support that individuals use to complete assessment items successfully. Jang, Dunlop, Park, and van der Boom analyze the uses and perceptions of culturally diverse, pre-adolescent students in an Ontario elementary school, and of their teacher and parents, of diagnostic reading skill profiles derived from a standardized test of English reading and surveys of goal orientations and self-assessments. Harding et al. speculate on the application of principles for devising diagnostic assessments of reading and listening comprehension in second and foreign languages, drawing on examples from various prior studies in diverse educational settings.
Significant interventions
The four studies have each addressed, and illuminated, fundamental issues in pedagogical and assessment practices and policies. The three studies focusing on new data have done so while systematically field testing, refining, and providing validation evidence about specific technology-based systems for language assessments. Chapelle et al. offer new insights into the qualities and presentation of feedback that make a difference in students’ improvement of their written drafts and of their language and discourse abilities. Poehner et al. show that the extent and qualities of mediation required for students to make accurate responses to reading comprehension items can indicate their potential needs for certain kinds of instructional support, potentially informing judgments about suitable learning activities or placement into particular learner groups or courses. Jang et al. demonstrate the multifaceted, variable dimensions of young students’ reading abilities as well as key factors, such as learning orientations and parental influences, related to uses of individually tailored diagnostic skills profiles. Harding et al. articulate design elements to guide the principled creation of diagnostic language assessments that: emphasize pedagogical relevance and usefulness, incorporate the views of teachers and learners alike, identify stages of diagnosis related to the development of language abilities, and lead directly to future treatment.
Notably, as well, all four studies ground, justify, and evaluate their research on theoretical and pragmatic principles. Chapelle et al. draw on principles of validity argumentation to specify and exemplify a framework of inferences for which evidence is required in the validation of automated writing assessments: accuracy, generalization, extrapolation, explanation, decision making, relevant uses, and ramifications. Poehner et al. approach their inquiry firmly on the basis of Vygotskian sociocultural theories of learning, reviewing diverse prior approaches to implementing dynamic assessment, and justifying each step in their design, analyses, and interpretations by assuming that providing mediated support is necessary to assess learners’ language development and potential. Jang et al. draw directly on principles and methods of cognitive diagnostic assessment as well as theories and prior research about reading skills, goal orientations, and self-perception of abilities. Harding et al. aspire themselves to formulate a theory, based on practical principles from their analyses of relevant research and theories, to guide the future development of diagnostic assessments of reading and listening comprehension.
Mixed methods
A further characteristic of design-based research prominent in the four studies is their uses of multiple, complementary methods for collecting and analyzing data. The studies reported by Chapelle et al. analyzed (a) errors and revisions in students’ written drafts of four different compositions and the feedback on them provided by teachers and by Criterion; then, in their second study, the researchers combined and interpreted data from (b) survey questionnaires, think-aloud protocols, Camtasia screen recordings, real-time observations, and interviews. Poehner et al. followed stages of one-on-one pilot observations and interviews to identify prompts and mediating moves, then tried out their application to texts and items in the computer program while refining scoring procedures, embedding transfer items, and conducting statistical analyses on multiple measures of groups of students’ performances, all in order to profile and analyze students’ developmental trajectories. Jang et al. initially analyzed results from standardized reading tests as well as student and teacher questionnaires to construct diagnostic reading profiles for individual students; they then investigated uses of the profiles through student surveys on reading, writing, and home languages, a goals questionnaire, teacher interviews, parent surveys, test performances, and think-aloud protocols with a subsample of students. Harding et al. exemplify their argument through evaluation of extracts from a variety of prior studies that used diverse research methods, including task and item analyses, tests of component subskills, piloting of innovative assessment designs and interfaces, training studies, hypothetical modeling, and interviews with expert diagnosticians, teachers, and learners.
Multiple iterations
Design-based research is characteristically conceived through multiple iterations of trying out, analyzing, and then refining educational innovations and theoretical understandings. Research is considered to be an ongoing process in interactive cycles of theorizing, action, interpretation, and refinement rather than one-shot data collection to reach definitive conclusions. In turn, diagnostic assessment here is considered to involve cycles of performance, diagnosis, mediation, and further diagnosis rather than producing a single recommendation or test score as in conventional language proficiency tests. Chapelle et al.’s first study involved two iterations over five classes, and their second study involved six classes over two semesters. Poehner et al. report results here from two implementations of a test of Chinese administered first in a non-dynamic mode with a large group of students and then later in a mode of dynamic assessment with a sub-group. These analyses link directly to parallel tests developed and analyzed for French and Russian, all informed by previous phases of theoretical reflection, item design, piloting, and exploratory data analysis (Poehner & Lantolf, 2013). Jang et al. report on research that was prefaced by analyses of normative trends in a large-scale reading test in order to develop reading skill mastery profiles, which were then applied to investigate, over a six-month intensive case study in two classes in one school, the uses of differing reading skill profiles in relation to the goal orientations and perceptions of reading abilities of students and their teachers and parents. Harding et al. provide a theoretical rather than empirical analysis here, reflecting on principles derived from a previous interview study with expert diagnosticians (Alderson, Brunfaut, & Harding, 2014), along with consideration of diverse other sources, to propose a systematic basis for designing, conducting, and evaluating future diagnostic assessments of reading and listening comprehension.
Collaborative partnerships
Research collaboration is evident in the multiple authorship of each of the four articles here. Moreover, the views and actions of a variety of relevant stakeholders feature integrally in and inform each study, including teachers, students, and even parents (thereby realizing Harding et al.’s second and third principles for the design of diagnostic assessments). In addition to the researchers themselves, Chapelle et al.’s first study involved three instructors, five students, and then 17 students; and participants in their second study were 88 graduate students in 34 academic disciplines. Poehner et al. worked with four additional co-investigators, a computer programmer, and 28 students in pilot tests, and then 68 students in their listening test and 82 in their reading test. The team of Jang et al. included one professor, two graduate students (and two additional graduate assistants), and the participating teacher, all of whom collectively analyzed and interpreted data from 44 students and 17 of their parents.
Practical impact on practice
The final characteristic of design-based research highlighted by Anderson and Shattuck (2012), and guiding the four studies here, is the purpose of improving educational and assessment practices. Chapelle et al. seek to facilitate and guide students’ development of English academic writing abilities, evaluating how certain types of computer-mediated feedback are able to promote such development. It is interesting to note that their framework for evaluating the validity of automated writing assessments proposes a criterion of ramification: “Ramification is critical for classroom diagnostic assessment, where we need to be able to claim that learning results from assessment use” (p. 387). Poehner et al. apply the principle of mediated support – considered theoretically integral to the diagnosis of second-language development (but lacking in conventional language tests that evaluate fully formed abilities) – to innovative computer-mediated assessments. They seek “to develop assessment instruments and scoring mechanisms that allow us to better capture and represent learners’ ZPD” (p. 342) as a basis for informing subsequent language teaching and learning. Jang et al. prepared profiles that combine cognitive reading skills, goal orientations, and self-assessments of ability, seeking to illuminate – in the interests of knowing how to tailor assessment and instruction more directly to students’ abilities and goal orientations – “how students with different profiles respond to HDF based on the application of cognitive diagnostic modeling to population data from a provincial literacy assessment” (p. 360). Harding et al. articulate and evaluate five principles they consider to be integral to the design of effective diagnostic assessments of reading and listening comprehension.
Concluding remarks
These four studies are uniquely innovative and insightful as well as mutually, but differently, informative about diagnostic language assessments. The orientation of this work toward specific educational contexts for language learning, and the guiding notion of continuous development of assessment instruments and procedures, differ fundamentally from the principles guiding language proficiency tests for high-stakes decisions, which must by definition be distinct from, and not biased toward, particular educational programs, populations, content, or contexts (Cumming, 2014). Unlike language proficiency tests, which aim to provide scores that represent people’s abilities in a language comprehensively and in general, normative terms, the present four studies of diagnostic assessments conceive of language not as a single or uniform ability, but rather as multifaceted, variable systems of interrelated skills, knowledge, genres, and communication purposes – of which only certain elements might be assessed for diagnostic purposes in relation to curriculum or pedagogical aims and contexts. In Mislevy and Yin’s (2009, p. 264) terms, the present diagnostic assessments involve “rich context” and “opportunistic targets” that aim “to support students’ learning with individualized feedback.”
For example, Chapelle et al. analyzed students’ production of written texts for grammatical and stylistic errors as well as conventional rhetorical moves in research reports, acknowledging these elements to be only certain, partial indicators of English writing ability overall and to be variable across genres of writing as well as different academic disciplines. Poehner et al. designed test items that evaluate a selective, rather than comprehensive, sampling of students’ knowledge for reading and listening to Chinese lexis, grammar, discourse, culture, and phonology. Jang et al. followed theories of reading as a strategically self-regulated, multi-componential set of skills, but their initial reading profiles derived from the components of reading operationalized in a standardized test and educational curriculum, which assessed certain (but not all) reading skills such as vocabulary, grammar, explicit and implicit comprehension, inferencing, and summarizing. Moreover, their results show students’ learning of these abilities to be mediated variably by each student’s individual state of development, goal orientations, and self-perceptions as well as those of their teacher and parents. Harding et al. evaluated numerous models of the complex cognitive and linguistic skills inherent in skilled comprehension, puzzling over how to focus on diagnostic actions within their evident multifaceted complexity and interactions.
In sum, these varied and variable foci on specific aspects of language for educational purposes prompt me to wonder if diagnostic assessments necessarily have to focus on certain language skills relevant to a particular educational program, student population, and intended learning outcomes, rather than assuming comprehensive, holistic views of language, learners, or pedagogical contexts. Might a defining characteristic of diagnostic assessments be that they are situationally embedded in specific educational contexts in order for diagnoses to be purposeful, relevant, and effective – in respect to the content of a curriculum taught and studied and to the intentions of participating learners and instructors? At the same time, the iterative, design-based nature of the present four research projects suggests that diagnostic language assessments are not fixed or standardized, but rather need to develop and evolve as assessors refine their procedures and instruments, again in relation to improved understanding of the affordances and constraints of particular educational programs and settings.
Consider, for example, the information deriving from language assessments in one institutional context, such as a university, in relation to the roles and responsibilities of people who use the results of language assessments. Staff in a university registrar’s office may only want, and only have time in their workload to cope with, a single score from an internationally acknowledged, standardized language proficiency test such as TOEFL or IELTS in order to make a recommendation as to whether a student applying to the university has a sufficient command of English to cope with the demands of academic studies in that language, while being assured that the score on that language test compares validly and fairly with the scores of all other past and present applicants to the university from around the world. The coordinator or faculty advisor for an academic program at that same university, however, should expect more detailed information than a single score from a language test, relevant to the expectations of that academic program. Can a student applicant read, write, and interact orally in a certain language with sufficient proficiency to complete course assignments effectively, conduct required research tasks successfully, and perhaps work as a teaching or research assistant? To make such decisions requires diagnostic information from a language assessment, related directly to (a) the conditions and expectations for performance within the program of studies and work as well as (b) the capacity of each student applicant to meet those expectations, or to learn to meet them. To address these matters, a program coordinator or faculty advisor should also ask – and rightly expect relevant diagnostic information from a language assessment – might a student applicant benefit from preparatory or supplementary courses to improve their language proficiency, and if so in which aspects of the language and under what conditions?
In turn, instructors and administrators charged with preparatory or supplementary courses or other activities are the people who truly need detailed, relevant, and valid information from diagnostic language assessments to design, implement, evaluate, and refine their courses or other activities. Likewise, students in these courses and more generally should expect systematic, useful information from diagnostic language assessments to focus, guide, and evaluate their own language learning goals and experiences. Parallels to such university settings could be easily made with the institutional contexts of language assessments for professional or work-related certification, identification of needs and pedagogy for students in schools, or initiatives to promote heritage or less commonly taught languages.
It is for these educational uses of diagnostic language assessments that the present articles by Chapelle et al., Poehner et al., Jang et al., and Harding et al. provide exemplary models that demonstrate theoretical principles, systematic research, and practice-oriented innovations to guide future developments. But the limitations of these novel initiatives are also evident. Only a few, limited aspects of people’s full range of language abilities are addressed, as may necessarily be so in diagnostic assessments. The assessment instruments and procedures documented here are all preliminary, oriented to specific courses and populations, and in need of further or even ongoing development. New computer-mediated technologies for diagnostic assessments offer appealing, useful prospects but are difficult to imagine being implemented efficiently on a broad, sustainable scale. The four present studies have involved some of the best researchers internationally applying their efforts to develop diagnostic language assessments, but could others with less knowledge, experience, and funding produce comparable successes and insights? A perplexing question overall is whether many institutions such as universities, colleges, schools, and professional and vocational certification agencies would be willing to commit the resources, time, and expertise needed to design and implement effective, sustainable language assessments to serve distinctly diagnostic purposes – given the inherently limited, local applications to specific educational contexts that diagnostic assessments fulfill and the long-term, resource-intensive processes of development that design-based research requires.
Funding
I thank the organizers of the Language Testing Research Colloquium, held in Seoul, Korea, in July 2013 for funding my travel to the conference in order to present these ideas as a discussant at the colloquium.
