Diagnosing diagnostic language assessment

Abstract

Diagnostic language assessment (DLA) is attracting growing attention from language teachers, testers, and applied linguists. With this recent surge of interest, there is an urgent need to assess where the field of DLA currently stands and to develop a general sense of where it should be heading. The current article, as the first article in this special issue, aims to provide a general theoretical background for discussion of DLA and address some fundamental issues surrounding it. More specifically, the article (a) examines some of the defining characteristics of DLA and its major components, (b) reviews the current state of DLA in conjunction with these components, and (c) identifies some promising areas of future research and development where important breakthroughs can be made. Some of the major obstacles and challenges facing DLA are identified and discussed, along with some possible solutions to them.

Keywords

Diagnosis, diagnostic language assessment, feedback, multi-staged diagnosis, multi-tiered feedback, remedial learning, validity framework

Diagnostic language assessment (DLA) is currently attracting a great deal of attention from language testers, language teachers, second language acquisition and writing researchers, and many applied linguists (Alderson, 2005; Alderson, Brunfaut, & Harding, 2014; Kunnan & Jang, 2009; Lee & Sawaki, 2009; Read, 2008). The recent surge of interest in DLA in the field of language testing and learning seems to be linked with increasing needs and demands for tailored assessments that pinpoint the source of students’ problems in language learning (or use) and provide means for learners and teachers to deal effectively with the root causes of those problems. In conjunction with this, there has also been, in my view, a gradual paradigm shift in language testing from selection/screening-oriented testing to more learning-oriented assessment. These new learning-focused approaches emphasize the learning-inducing nature of language assessment in the sense that assessment should be geared toward identifying learning potential and promoting further learning beyond what the test-takers currently know or are able to do.

There have been several important driving forces, or related developments, behind the rise of such learning-oriented assessment approaches, including DLA, in language assessment. Many of these are connected with efforts either to extract finer-grained diagnostic information about test-takers from existing assessments or to develop new assessment tools and procedures that can be used for diagnostic purposes. The first of these originates from the “formative assessment” movement. Formative assessment refers to an assessment process in which students’ learning is monitored on an ongoing basis to identify the difficulties the learners are experiencing in reaching the targeted learning goals and to provide explicit feedback to the teachers and learners, so that they can adjust their teaching and learning accordingly (Black & Wiliam, 2009; Bloom, Hastings, & Madaus, 1971; Rea-Dickins & Gardner, 2000; Torrance & Pryor, 1998). Formative assessments, which are usually low-stakes and classroom-based, tend to be embedded in a particular program of learning. One important thing to note in their relationship to DLA is that provision of explicit feedback, which is purported to impact subsequent teaching and learning, is an essential component of both formative assessments and DLA.

The second important development related to DLA is the heightened awareness among language testers of the consequential validity of language assessments (Messick, 1996), particularly the washback effect of large-scale, high-stakes tests on language learning and instruction (Alderson & Wall, 1993; Cheng, Watanabe, & Curtis, 2004). The washback effect refers to the degree to which the introduction and use of particular tests influence teachers and learners to do things that they would not otherwise do and that promote or inhibit learning (Alderson & Wall, 1993; Messick, 1996). The washback effect can also be linked to the notion of systemic validity, which has been proposed to refer to the curricular and instructional changes that can be brought about by the introduction of a test (Cohen, 1994; Frederiksen & Collins, 1989). Messick (1996) cautions that, in order to secure consequential validity evidence for a test, we need to establish clear causal connections between particular qualities of the test and the particular washback effects it may have introduced. This point is particularly relevant for DLA, since it is specifically intended to bring about positive impacts on students’ subsequent learning.

The third important development is the rising demand for enhanced score reporting systems for assessments and the refinement of advanced psychometric techniques to support them. The heightened awareness of washback has motivated assessment developers and testing organizations not only to introduce new assessment tasks and scoring rubrics but also to upgrade their score reporting systems so that they can provide a score profile and finer-grained information about test-takers’ strengths and weaknesses. A number of related approaches have been used to extract fine-grained information about the linguistic knowledge and subskill mastery states of test-takers from their performance on language tests (Buck & Tatsuoka, 1998; Buck, Tatsuoka, & Kostin, 1997; Jang, 2005; Lee & Sawaki, 2009; von Davier, 2005). DLA was a central focus when these cognitively diagnostic psychometric models were applied to data from existing large-scale language tests to extract fine-grained information about test-takers’ strengths and weaknesses. Application of cognitively diagnostic models (mostly based on retrofitting to existing non-diagnostic tests) has, however, focused on receptive skills sections, such as listening and reading, with only a few exceptions (e.g., Kim, 2011).

Lastly, the most recent development in relation to DLA is the remarkable progress in automated scoring and feedback technology for constructed response tasks, such as writing and speaking tasks. Building upon computational linguistics and natural language processing (NLP) technologies, many of these automated essay and speech scoring systems attempt to emulate what human raters do in rating test-takers’ speaking and writing samples, and they have demonstrated high rates of score agreement with human raters (Lee, Gentile, & Kantor, 2010; Enright & Quinlan, 2010; Xi, 2010). In the context of large-scale speaking and writing assessments, the automated scoring system can serve as a second scorer to complement a human rater’s score. When these scoring technologies reach maturity, the automated evaluation system is also expected to serve as the main evaluator, providing test-takers with useful performance and diagnostic feedback in addition to composite/multi-trait scores for their performance. Another exciting possibility is that such automated scoring and evaluation engines can be merged with computer- or internet-based, individualized (or collaborative) learning programs, which will make it possible to provide learners and prospective test-takers with diagnostic feedback and remedial learning opportunities in an integrated way.

Despite such remarkable developments and their promise for the future, however, several DLA scholars have criticized the lack of clear definitions of the major components and essential characteristics of DLA, of guiding frameworks for DLA development and validation, and of instruments and procedures that are appropriate for DLA purposes (Alderson, 2005; Lee & Sawaki, 2009). One urgent need is to build a general consensus on the definitions, constitutive components and processes, and essential characteristics of DLA, to evaluate its current state of development, and to identify future avenues for research and development. In addition, in order to achieve continuous progress in DLA, it would be necessary for DLA researchers not only to watch closely new developments in existing DLA methodology (e.g., the cognitive diagnosis approach) and its application in language assessment but also to look closely into other related fields of inquiry and practice, such as reading disability diagnosis, dynamic assessment, and automated essay evaluation and feedback. (Some of these challenging fronts of exploration for DLA are addressed in other articles in this volume.)

With these issues as a backdrop, this article aims to provide an overall background for DLA and discuss some fundamental issues surrounding it. More specifically, the article (a) examines some of the defining characteristics of DLA that distinguish it from other forms (or purposes) of assessment, (b) reviews the current state of DLA by discussing some challenges in relation to the three major components of DLA, and (c) identifies some promising areas for future research and development of DLA where important breakthroughs can be made.

Definition and essential characteristics of DLA

Definition and historical background

The term diagnosis originates from the Greek word “diagignoskein,” which means “discerning” or “distinguishing” (Harper, 2010). The Greek word is formed by combining two lexical morphemes, “dia” (apart) and “gignoskein” (to learn, know, or perceive). The term first came into wide use in the field of medicine, where its general meaning has been closely related to the identification of the nature and cause of a disease or disorder (American Heritage® Medical Dictionary, 2007). One important feature of medical diagnosis is the distinction, and the connections, made between symptoms (or signs) and a disease as their root cause. In medical diagnosis, a variety of mental, psychological, and bodily symptoms are examined to identify and determine the nature of the disease causing the symptoms. Such a diagnostic process involves measurement, assessment, data gathering, and interpretation of the collected data and information, which results in a doctor’s diagnosis, feedback to the patient, prescription of drugs, and eventually treatment (Langlois, 2002; Treasure, 2011). One thing to note here is that medical diagnosis is ultimately aimed at finding the true cause of the symptoms and properly treating it rather than (or as well as) simply alleviating the symptoms.¹

In the field of language testing, the terms diagnosis and diagnostic testing have long been used in connection with the major purposes (or types) of testing (Davies, 1968; Henning, 1987). More recently, Kunnan and Jang (2011) revisited these classification schemes for purposes of language testing and suggested that the major purposes of testing can be analyzed and differentiated according to the time direction of assessment and impact. According to their scheme of classification, proficiency testing is concerned primarily with assessing acquisition/learning from the past and predicting performance for the future; achievement (or attainment) testing, primarily with assessment of past learning; and finally diagnosis, with assessment of past learning/acquisition and provision of information for future learning. Such a time-orientation perspective sheds important light on the nature of diagnostic testing.

It seems, however, that a couple of additional elaborations could further advance our thinking on DLA and improve the conceptual clarity of the classification. First, although achievement testing, particularly of a summative nature, is concerned mainly with learning achieved in the past, formatively oriented achievement testing is concerned with both the past and the future, as is diagnostic testing. This is reinforced by the fact that feedback is an essential component of both formative assessment and DLA. Second, we probably need to differentiate further between proficiency and diagnostic assessment, since both are concerned with the past and the future. Two criteria can help here. The first has to do with whether testing focuses more on what one is able to do or on what one is not yet able to do. Proficiency testing has more to do with identifying one’s overall strengths and competencies, while diagnostic testing is intended to assess one’s weaknesses and their underlying causes in addition to the overall strengths. The second has to do with the increased specificity (or finer grain size) of assessment for DLA, especially in terms of assessing the learners’ weaknesses and their specific underlying causes.

Diagnostic language assessment (DLA) is defined in this article as the process of identifying test-takers’ (or learners’) weaknesses, as well as their strengths, in a targeted domain of linguistic and communicative competence and providing specific diagnostic feedback and (guidance for) remedial learning. One interesting feature of the definition is that the strengths and weaknesses are identified together in this conceptual framework, despite DLA’s original focus on the weaknesses. Although the initial goal of DLA is to identify the learners’ overall strengths and weaknesses and pinpoint the root causes of the weaknesses, its ultimate goal is to promote further learning on the part of the learners and increase their overall growth potential through diagnostic feedback and remedial activities. Not only the learners/test-takers but also many other stakeholder groups (e.g., teachers, parents, school administrators, DLA developers, and professionals in testing organizations) can take part in, and contribute to, the process of DLA.

Some essential characteristics of DLA

Several scholars have attempted to identify important characteristics of DLA which can help to distinguish DLA from the existing formats of language assessment (Davies, 1968; Bachman, 1990; Kunnan & Jang, 2011). More recently, Alderson et al. (2014), based on their review of diagnostic practices in the fields of medicine, mechanics, computer programming, and literacy education, have created a list of general principles of DLA. In this section, some of the distinguishing characteristics of DLA are identified, selected, and elaborated, with some additional points raised for clarification.

Conceptualizing normality and goals

When we refer to the language learners’ weaknesses (or deficiencies) in addition to their strengths in DLA, we implicitly assume that there is some sort of a normally, healthily, or fully functioning state for the learner’s targeted competence. The learners’ targeted competence can also be further divided into its sub-competencies or constituent components, and such norms can possibly be specified at various levels of granularity, for instance, at the level of an overall language proficiency, a particular language skill (e.g., reading, writing), domain (e.g., lexis, grammar), or task (e.g., general understanding, writing an email). These normally functioning states can serve as reference points (or target goals) against which the learners’ current states of knowledge, skills, and abilities are measured to identify the learning gap.

When we conceive of a fully functioning state of a learner’s communicative system (or competence), for example, an underlying assumption is that all of the constituent components and processes are coordinated effectively to achieve required communication goals. Such ways of defining normality (or targeted goals) seem to be very much dependent on the metaphor of language use and performance for a fully developed system. Given that most of the second language learners’ competencies are still in development, one could argue that the development (e.g., undeveloped, partially developed, fully developed) and the functioning states (i.e., malfunctioning, partially functioning, highly functioning) should be treated and evaluated as two different dimensions of normality in DLA. This may also suggest that the functioning states can be established separately for each of the development levels or stages if these two dimensions are combined into a unified framework.

By the same token, can we also define such normality at more specific levels of tasks, items, and attributes (including knowledge, subskills, or processes)? Can we decompose a task into parts and subparts in terms of required knowledge, skills, and processes? Can we also identify some attributes required jointly by a series of items or tasks? If any of the knowledge, skills, and processes required for completion of a certain task or a set of tasks is not acquired, activated, and coordinated effectively, how is such a deficiency (or functioning failure) manifested in the learners’ responses, performance, and behaviors? Can we establish standards for normality not only for the overall performance on the tasks/items but also for each of the attributes required for the completion of tasks and items? Can we use such standards to assess the gap between the learners’ current performance levels and the targeted goals and identify the weaknesses and deficiencies that cause such a gap? All of these related issues need to be carefully taken into consideration when we conceptualize and define the normal state of functioning and development in the context of DLA.

Identifying potential areas for growth

The primary goals of DLA are to identify the learners’ weaknesses and the underlying causes of the weaknesses and help the learners to overcome or improve on those weaknesses that prevent them from moving beyond the current stage of learning/performance to the next level. In DLA, the learner’s weaknesses are often associated with deficiencies, misconceptions, faulty understanding, incompletely learned/unlearned knowledge, or unacquired/incompletely acquired skills. However, many researchers prefer to use more positively nuanced words to refer to the weaknesses, such as “areas for improvement” (Lee & Sawaki, 2009), “areas in which a student needs further help” (Alderson, Clapham, & Wall, 1995), and “learning potential” (Poehner, Zhang, & Lu’s article in this volume). Hirch (2014), in particular, proposes using the term nondum abilities (which means “abilities yet to be developed”) to minimize the negative nuances of terminology and emphasize the teachability and learnability of the attributes of interest.

As mentioned previously, however, the process of diagnosis involves not only identifying the weaknesses but also distinguishing them from the strengths. If possible, the identified weaknesses can be further decomposed into weak and strong components, so that the root causes of the initially identified weakness are pinpointed at more specific levels. This means that these decomposing and disentangling processes can be repeated at different levels of specificity until a correct diagnosis is achieved. So in an ideal form of DLA, the weaknesses should be identified, represented, and described in a detailed and specific manner. When the information about the weaknesses is provided to the learners, it is desirable to present it in parallel with that for the strengths. Even in the stage of remedial learning and intervention, when we attempt to improve on, or augment, the components identified as weak in the targeted domain, we would want to build new elements into the existing structures of knowledge, skills, and processes that are already functioning at the expected level. This means that while we need to focus on identifying and treating the root causes of the weaknesses in DLA, we should not lose sight of their relationship to the strengths and their potential impact on the whole system of functioning and performance.

Multi-componential representation of constructs and problems

In DLA, the learners’ weaknesses are identified along with their strengths at a general level and further decomposed into their constituent parts/subparts, components/subcomponents, and processes/subprocesses to pinpoint the root causes of the learning and performance problem. Closely connected with such decomposing processes is the multi-componential representation of the assessment construct and the learning/performance problems in DLA. Distinguishing between the weaknesses and strengths requires us to have a multi-componential view of the competence. In a sense, it is such multi-componential views of the construct and the problem adopted by DLA that make it possible to generate more specific and detailed information about the test-takers’ weaknesses and their underlying causes than can be obtained from existing general-purpose proficiency and achievement tests.

Within such a multi-componential framework of representation, it is considered desirable to report a profile of component (or attribute) scores showing weakness–strength patterns in addition to a single summative test score. One important point is that DLA developers should make a special effort to make clear what is represented by each of the component/attribute scores. For instance, do they refer to particular attributes of a particular language skill (e.g., gist understanding on a reading comprehension test)? Do they refer to specific types of linguistic knowledge, such as vocabulary or grammar? Do they focus on both the product and the process of language processing or on only one of the two? When a particular weakness is analyzed, decomposed, and reduced to its root cause(s), this task is most likely done in relation to other constituent components and subcomponents of an assessment construct, either at a general level (e.g., a whole test, a whole section of the test) or at more specific levels (tasks/items/sub-items). It is important to understand and define the relationships among these related components and subcomponents of the assessment construct, for instance, compensatory and non-compensatory relationships among the components. In compensatory relationships, mastery of all the components involved in a language task is not necessarily required for a test-taker to perform satisfactorily on the task because strength in one component may compensate for weakness in another. In non-compensatory relationships, however, weakness in one component cannot be completely compensated for by strengths in other components in terms of task performance (Lee & Sawaki, 2009).
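The contrast between compensatory and non-compensatory relationships can be illustrated with a minimal sketch. The two combination rules used here (a simple average and a simple product of attribute mastery values) are illustrative assumptions, not the specific psychometric models used in the cognitive diagnosis literature, and the attribute names and values are hypothetical.

```python
import math
from typing import Mapping, Sequence


def compensatory(mastery: Mapping[str, float], required: Sequence[str]) -> float:
    """Average the mastery values: strength in one attribute offsets weakness in another."""
    return sum(mastery[a] for a in required) / len(required)


def non_compensatory(mastery: Mapping[str, float], required: Sequence[str]) -> float:
    """Multiply the mastery values: one weak attribute limits performance on the whole task."""
    return math.prod(mastery[a] for a in required)


# Hypothetical learner who has mastered vocabulary (1.0) but not inferencing (0.0).
learner = {"vocabulary": 1.0, "inferencing": 0.0}
task = ["vocabulary", "inferencing"]  # attributes assumed to be required by the task

print(compensatory(learner, task))      # 0.5
print(non_compensatory(learner, task))  # 0.0
```

Under the averaging rule the learner still receives partial credit for the task, whereas under the product rule the unmastered attribute blocks successful performance altogether.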

Increased specificity for assessment and feedback

Although increased specificity (or finer granularity) for diagnosis and feedback is an important defining characteristic of DLA, there seems to be a fundamental dilemma in defining it, as has been noted earlier by several researchers in language testing (Alderson et al., 1995; Bachman, 1990). The core of the problem is that specificity (or granularity) is not an absolute concept but a relative one (e.g., more or less specific, compared to something) in DLA. It is not easy to determine how specific is specific enough for diagnosis and feedback in DLA. For instance, we can say that the four skill section scores (listening, reading, speaking, and writing) of a test provide more specific information than the total test score (or weighted sum of the section scores) alone. However, the subskill/attribute scores (e.g., general understanding, specific understanding, inferencing, and synthesizing in reading) estimated through cognitive diagnostic analysis from the reading section of a language test provide even more specific information than the reading section score alone. Each of these attributes can, in turn, be further subdivided into smaller, more micro-level attributes or skills.
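The relative nature of specificity can be made concrete with a small sketch in which the same responses are reported at three grain sizes: attribute, section, and total. All names and numbers below are hypothetical, and the total is an unweighted roll-up of item counts, which is an assumption made only for simplicity (an operational test might use a weighted sum of section scores).

```python
from typing import Dict, Iterable, Tuple

Score = Tuple[int, int]  # (items answered correctly, items administered)

# Hypothetical attribute-level results: skill section -> attribute -> score.
attribute_scores: Dict[str, Dict[str, Score]] = {
    "reading": {
        "general understanding": (4, 5),
        "specific understanding": (3, 5),
        "inferencing": (2, 5),
        "synthesizing": (1, 5),
    },
    "listening": {
        "general understanding": (5, 5),
        "specific understanding": (3, 5),
    },
}


def roll_up(parts: Iterable[Score]) -> Score:
    """Combine finer-grained scores into one coarser-grained score."""
    parts = list(parts)
    return (sum(c for c, _ in parts), sum(t for _, t in parts))


# Coarser levels are derived from the finer ones, never the other way around.
section_scores = {skill: roll_up(attrs.values()) for skill, attrs in attribute_scores.items()}
total_score = roll_up(section_scores.values())

print(total_score)                # (18, 30)
print(section_scores["reading"])  # (10, 20)
```

The total of 18 out of 30 conceals that reading is weaker than listening, and the reading section score of 10 out of 20 in turn conceals that synthesizing is the weakest reading attribute in this made-up profile.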

Then, a central question is as follows: How can we determine the optimal level of specificity or granularity for diagnosis and feedback in different DLA contexts? One possible starting point is to conceptualize specificity as a property assessed on a continuum rather than a simple dichotomy (specific vs. non-specific). Next, we can ask a more concrete, reformulated question: What specificity levels of feedback are most effective and useful for remedial learning activities? This means that such questions need to be rephrased into a series of more learning- and instruction-centered ones. For instance, what grain size of diagnostic information is most likely to be acted upon by learners and teachers? What is the manageable and treatable level of specificity in DLA? How useful do learners and teachers find different types of feedback information (e.g., numeric, verbal, visual, generic, and task-specific)? In addition, we probably should consider various other factors (e.g., learner characteristics, test purposes, assessment construct, and test design) in determining the optimal level of specificity for diagnosis and feedback.

In relation to this, one worthy avenue for future research is to carefully examine what the accumulated body of feedback research tells us about the effectiveness of various feedback types and methods in terms of facilitating self-regulated learning on the part of the learner. A large body of knowledge and expertise has accumulated over time in feedback research in education (Butler & Winne, 1995; Hattie & Timperley, 2007) and second language acquisition (Lyster & Ranta, 1997) as well as second language writing (Biber, Nekrasova, & Horn, 2011; Bitchener & Ferris, 2011; Hyland & Hyland, 2006). For instance, what are the major findings about the potential impact of various learner characteristics (e.g., language proficiency, learning styles, L1 backgrounds, and gender) on the effectiveness of feedback at particular levels of specificity? In addition to the structural aspect of feedback, the language, wordings, and lengths of segmentation of the feedback can be important factors impacting the effectiveness of the feedback.

Impacting future learning and instruction

DLA is both backward-looking and forward-looking (Davies, 1968; Kunnan & Jang, 2011). It is backward-looking in the sense that it is intended to assess what the learners have not yet learned as well as what they have already learned. It is forward-looking since it has a very explicit goal of impacting and facilitating subsequent remedial learning and instruction by providing diagnostic feedback and (guidance for) remedial activities designed to strengthen the identified weaknesses and increase the overall growth potential. Of course, even high-stakes language tests developed for non-diagnostic purposes can have a huge washback effect on learning and instruction, which may be rather general, systemic, complex, and difficult to trace. In contrast, the kind of impact intended by DLA is designed to be relatively more direct, specific, substantive, individualized, and customized to the needs of the learners and teachers. Since the intended impact of DLA is mostly realizable through diagnostic feedback and remedial learning, it is critically important to ensure the accuracy of diagnosis, the effectiveness of diagnostic feedback, and the close alignment between diagnostic feedback and subsequent remedial learning activities.

Three core components of DLA

In order to translate the working definition and defining characteristics of DLA into the design and development of DLA instruments and procedures, it would be helpful to conceptualize it in terms of its major component parts, define each of them, and establish the relationships among these parts in a systematic, principled way. In this article, the DLA procedure is divided into three main parts: (a) diagnosis, (b) (diagnostic) feedback, and (c) remedial learning. Figure 1 shows these three main components of DLA, which contain various subcomponents, and the relations among them.

Figure 1. Three major components of diagnostic language assessment (DLA).

The first component is diagnosis, which is the core component of DLA. The central goal of diagnosis in DLA is to identify not only one’s strengths but also weaknesses that prevent the learner from moving beyond the current level of learning (or knowledge, proficiency, ability, or competence) to the next level. Diagnosis requires developing and using various diagnostic instruments (including tests, questionnaires, and tools) and procedures for collecting, analyzing, and scoring the test-takers’ data, estimating the learners’ pattern of strengths and weaknesses in a targeted domain, and, if necessary, classifying the learners into groups of learners with the same (or a similar) pattern of strengths and weaknesses. In this stage, it is important not only to separate the weaknesses from the strengths at a general level but also to pinpoint the root causes (or underlying causes) of the weaknesses at more specific levels, which may ideally require multiple rounds of diagnostic testing.
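The classification step mentioned above, in which learners with the same pattern of strengths and weaknesses are grouped together, can be sketched as follows. This is a minimal illustration rather than a cognitive diagnostic model: the attribute names, the learner scores, and the mastery cutoff of 0.5 are all hypothetical assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[bool, ...]  # True = strength, False = weakness, one entry per attribute


def group_by_pattern(
    profiles: Dict[str, Dict[str, float]],
    attributes: Sequence[str],
    cutoff: float = 0.5,  # hypothetical mastery threshold
) -> Dict[Pattern, List[str]]:
    """Group learners who share the same strength/weakness pattern."""
    groups: Dict[Pattern, List[str]] = defaultdict(list)
    for learner, scores in profiles.items():
        pattern = tuple(scores[a] >= cutoff for a in attributes)
        groups[pattern].append(learner)
    return dict(groups)


attributes = ["vocabulary", "grammar", "inferencing"]
profiles = {
    "learner_a": {"vocabulary": 0.9, "grammar": 0.8, "inferencing": 0.3},
    "learner_b": {"vocabulary": 0.7, "grammar": 0.6, "inferencing": 0.2},
    "learner_c": {"vocabulary": 0.4, "grammar": 0.9, "inferencing": 0.8},
}

groups = group_by_pattern(profiles, attributes)
print(groups[(True, True, False)])  # ['learner_a', 'learner_b']
```

Learners in the same group could then, in principle, receive the same feedback and remedial activities, although a dichotomous cutoff of this kind discards information that a graded profile would retain.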

The second component is (diagnostic) feedback, which is designed to describe, summarize, and present the results of diagnosis to the learners, teachers, and other stakeholders of assessment in various formats. DLA has the relatively strong mission of facilitating the learners’ future learning, which can be realized through carefully designed feedback and remedial activities provided to, and acted upon by, learners, teachers, and educational institutions. So, once the strengths and weaknesses (along with the root causes of the weaknesses) are identified, such information should be communicated effectively to the learners and teachers, so that they can take necessary or recommended actions to strengthen the identified weaknesses and increase the overall learning potential.

The current practice of diagnostic feedback in DLA is to provide it in the form of an enhanced score report, which includes a succinct summary of the learners’ patterns of comparative weaknesses and strengths in addition to their overall proficiency (Kunnan & Jang, 2011). The enhanced score report can include both quantitative and qualitative information about the test-takers’ knowledge/skill states and performance. The quantitative information can comprise a profile of skill/attribute scores in the case of the multiple-choice (MC) item section of the test or analytic/trait/dimension scores in the case of the constructed response (CR) task section, together with the section scores and their weighted composite (the total score), and a vector of item scores (along with the key and the response option chosen by the test-takers for each of the items). The quantitative information can also be presented in visual/graphic form in the score report and then further translated into verbal descriptions (or descriptors) that are easily understandable to the learners, all of which are expected to increase the interpretability and meaningfulness of the diagnostic information.
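The translation of quantitative information into verbal descriptors described above can be sketched in a few lines. The band labels and the cutoffs of 0.8 and 0.5 are hypothetical choices made for illustration; an operational score report would need empirically justified cutoffs and carefully worded descriptors.

```python
from typing import Dict, List


def describe(attribute: str, proportion: float) -> str:
    """Turn one attribute score into a short verbal descriptor (hypothetical bands)."""
    if proportion >= 0.8:
        band = "a relative strength"
    elif proportion >= 0.5:
        band = "developing"
    else:
        band = "an area for improvement"
    return f"{attribute}: {band} ({proportion:.0%} of items correct)"


def score_report(profile: Dict[str, float]) -> List[str]:
    """List descriptors from strongest to weakest attribute."""
    ordered = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return [describe(name, value) for name, value in ordered]


# Hypothetical reading profile for one test-taker.
profile = {"inferencing": 0.4, "general understanding": 0.8, "specific understanding": 0.6}
for line in score_report(profile):
    print(line)
```

Presenting strengths and weaknesses in a single ordered list is one simple way of keeping the two in parallel in the report, rather than listing weaknesses in isolation.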

The third component has to do with remedial learning and instruction. Remedial learning (or treatment/intervention) is a set of learning activities that are designed to help the learners to improve on the attributes that were identified as weak and thereby reach the desired goals of learning and proficiency in the targeted domain of language and communication. A rather strong position is taken in this article that responsible DLA should include some degree of remedial learning as an essential component. At the moment, we can think of three major types of remedial learning and instruction programs: (a) remedial programs that DLA developers are able to provide, or are responsible for providing, in connection with the diagnostic feedback from the DLA system; (b) remedial learning/instruction programs that teachers and educational institutions can provide for their students; and (c) self-help remedial activities that learners can undertake by themselves, independently of the remedial programs provided by the DLA developers or the teachers.

The first and second types of remedial programs may need further elaboration here. First, the remedial programs that DLA developers are able to provide can include the following: (1) a session to review the test items/tasks incorrectly, incompletely, or unsatisfactorily answered by the test-takers, with a focus on the learners’ misconceptions, faulty understanding, and cognitive/performance errors; (2) a session to review the relevant sections of various reference and resource materials available online/offline (e.g., an online dictionary, a grammar book, or a textbook) that address the learners’ weaknesses, misconceptions, and cognitive/performance errors associated with the items/tasks; and (3) some practice and re-assessment sessions with various equivalent forms of the tests that were used in diagnosis. Second, when DLA is designed, developed, and implemented in close connection with a school curriculum, syllabus, or course of instruction, the teachers and schools can play a major role in developing and implementing a long- or short-term remedial program for the students, which can run in parallel with, or be embedded in, ongoing regular instruction. In such contexts, school teachers and administrators probably need to work in close collaboration (or consultation) with DLA developers.

Major challenges and issues facing DLA

In diagnosing the current state of DLA, a good starting point is to review some important issues surrounding DLA in relation to the three main components. For example, do we have a good understanding of what we are supposed to diagnose in DLA? How capable are we of diagnosing what we want to diagnose at the desired levels of granularity? Do we have a solid understanding of the most effective ways to describe and communicate diagnostic results to learners and teachers? How much should DLA developers be involved in developing and providing remedial learning activities for the learners? Some of these issues are further elaborated below.

Conceptual frameworks for assessment design, analysis, and diagnosis

A fundamental goal of various learning-oriented assessment approaches, including DLA, is to identify the gap between the learners’ current states of knowledge/skills and their targeted learning goals (or norms) and encourage them to work harder toward the desired goals. DLA, in particular, attempts to help the learners and teachers achieve such goals by assessing the learners’ weaknesses and strengths and providing feedback and guidance for subsequent learning. The first big challenge, as discussed previously, is that there does not seem to be an absolute criterion for determining the optimal level of specificity (granularity) for DLA. In relation to this, what seems to be lagging behind most is our understanding of the following: (a) how to define the constituent parts and subparts of the assessment construct to be decomposed in different contexts of DLA; and (b) how to develop tests, instruments, tools, and procedures to assess the learners’ strengths and weaknesses at different levels of specificity and link the assessment results to remedial learning activities.

In cognitive diagnosis approaches (CDA), for instance, researchers make special efforts to identify and define a set of meaningful attributes (including knowledge, skills, and processes) that are considered underlying factors impacting the learners’ performance on items and tasks through a careful process of constructing an “item-attribute matrix,” which is often called a Q-matrix, for a given test (Lee & Sawaki, 2009). The weakness–strength patterns are defined and identified based on the attributes included in the Q-matrix. The same may be true when an analytic (multi-trait) scoring scheme is developed and applied to score constructed response tasks in speaking and writing assessments. However, these attributes/rating dimensions may not lend themselves well to identifying weaknesses at sufficiently specific levels and pinpointing the underlying causes of the weaknesses. Even here, some of these attributes/dimensions are super-ordinate attributes that are formed by collapsing some of the more specific subattributes or subdimensions. Moreover, the number and specificity of attributes/dimensions can also be constrained by various logistical considerations (e.g., the total number of items/tasks, item/task types, analytical tools, etc.).
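The role of the Q-matrix can be made concrete with a small sketch. It assumes a hypothetical five-item test, three hypothetical attributes, and dichotomously scored responses. The attribute summary computed here is a naive proportion-correct figure over the items that require each attribute; it is offered only to show how the matrix links items to attributes and is not one of the psychometric models used in cognitive diagnosis.

```python
# Hypothetical attributes and Q-matrix: rows are items, columns are attributes.
# A 1 means the item is assumed to require that attribute.
ATTRIBUTES = ["vocabulary", "syntax", "inference"]
Q_MATRIX = [
    [1, 0, 0],  # item 1
    [1, 1, 0],  # item 2
    [0, 1, 0],  # item 3
    [0, 1, 1],  # item 4
    [0, 0, 1],  # item 5
]


def attribute_profile(responses, q_matrix, attributes):
    """Proportion correct on the items that require each attribute.

    `responses` holds 1 (correct) or 0 (incorrect) for each item.
    Returns None for an attribute that no item measures.
    """
    profile = {}
    for k, name in enumerate(attributes):
        relevant = [responses[i] for i, row in enumerate(q_matrix) if row[k] == 1]
        profile[name] = sum(relevant) / len(relevant) if relevant else None
    return profile


def weak_attributes(profile, threshold=0.6):
    """Attributes whose summary falls below an illustrative threshold."""
    return [a for a, p in profile.items() if p is not None and p < threshold]


responses = [1, 1, 0, 0, 1]  # hypothetical learner
profile = attribute_profile(responses, Q_MATRIX, ATTRIBUTES)
print(profile)
print(weak_attributes(profile))
```

Because items 2 and 4 each load on two attributes, an incorrect answer cannot be attributed to a single cause from this summary alone, which is one reason such coarse profiles may fail to pinpoint underlying weaknesses.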

In addition, it is important to understand how specific assessment results can be linked to overall assessment. From a strategic point of view, it seems necessary to tackle these issues on two major fronts. As some researchers note, DLA can be either theory-based or syllabus-based (Bachman, 1990), which implies that DLA can be developed from the frameworks of both proficiency and achievement testing. Kunnan and Jang (2011) also propose that it is feasible to incorporate the capability of diagnostic feedback into existing achievement and proficiency testing systems. Given current trends, future development in DLA is very likely to be linked not only to innovative efforts to develop “purely diagnostic tests” consisting primarily of tasks and items of a more specific or discrete-point nature but also to continuous efforts to strengthen or augment the diagnosticity (or diagnostic power) of existing assessment practices.

Ensuring the effectiveness of diagnostic feedback

Although feedback, more precisely diagnostic feedback, is a critical component of DLA, the whole field of language testing seems to have only a rudimentary understanding of what it should consist of, how it should be structured and linked to diagnosis and remedial learning, and what types of feedback are most effective in facilitating and promoting subsequent remedial learning. In educational contexts, feedback is defined as “information provided by an agent (e.g., teacher, peer, book, parent, self, experience) regarding aspects of one’s performance or understanding” (Hattie & Timperley, 2007, p. 81). Feedback in general has been claimed to be most effective when it is targeted at pinpointing and treating misconceptions and faulty understanding revealed through the learner’s performance on items/tasks, rather than complete lack of understanding (Hattie & Timperley, 2007). Most such types of feedback are designed to impact subsequent learning activities positively and ultimately help the test-takers become self-regulating learners who can self-monitor, seek appropriate feedback, and self-adjust their learning processes. Some researchers, such as Butler and Winne (1995), see feedback as an “inherent catalyst” for all self-regulated learning activities.

DLA researchers have been grappling with a fundamental dilemma in meeting two conflicting requirements: (a) satisfying psychometric requirements for obtained profile scores; and (b) making DLA-based feedback as meaningful and effective as possible for subsequent learning activities. First, the psychometric requirements demand that the reported scores (profile, section, and total) meet certain standards of psychometric quality (e.g., accuracy, generalizability, and validity) and be described at a certain level of abstraction. Such score profiles can give a general picture of the weakness–strength patterns, but are somewhat detached from the learners’ actual performance on each individual item/task. In contrast, feedback can be most effective in facilitating subsequent learning when it is addressed to a specific learning context or a specific item/task. In DLA, such contextualized feedback is most likely to take the form of informing learners of the misconceptions, faulty understanding, and cognitive/performance errors identified in their responses to incorrectly answered items or unsatisfactorily performed tasks. In the ideal feedback system, the two conflicting demands for “abstraction” and “usefulness” should be accommodated in a complementary way. This requires that general, specific, and contextualized types of diagnostic feedback in DLA not only be combined in an integrated way but also be linked to one another in a systematic, meaningful, and defensible way.

In this regard, a core task in designing such a feedback system would be to establish valid and meaningful links between various pieces of information produced across different components of DLA horizontally, across different levels of granularity within each DLA component, and between the quantitative and qualitative information produced by DLA. The first type of link should be established between feedback and the two other components of DLA (i.e., diagnosis and remedial learning), which suggests that diagnostic feedback should not only reflect the results of diagnosis accurately but also align itself closely with remedial learning activities to ensure its positive impact and effectiveness. The second type of link is between diagnostic information generated at different levels of specificity and granularity (from section scores to attribute scores to item/task-specific misconceptions/errors) within the feedback component. The third type of link should be established between the quantitative and qualitative information reported in the score and feedback report, which requires an elaborate system and process of translation, transformation, score anchoring, or standard setting.

Remedial learning/instruction

Remedial learning or instruction has not yet been studied much in connection with DLA. When we think of remedial learning in connection with DLA, the first key issue is to what degree DLA developers should engage in designing remedial learning activities for the learners. It has not been very clear whether remedial learning should be included as an essential component of a DLA system or whether it should be completely left up to the learners and teachers to determine what they are going to do with the diagnostic results and feedback. Some DLA developers may want to take a minimalist approach to remedial learning by deciding not to go much further than simply providing feedback about the learners’ identified weaknesses and strengths. However, the minimalist approach may run the risk of negating or weakening the DLA’s original mission of promoting further learning and impacting subsequent learning positively.

There are a number of possible ways to increase the probability that the learners heed and act upon the feedback and take the necessary remedial actions to address the identified problems. (See Jang, Dunlop, Park, & Vander Boom’s article in this volume for more information about the actionability of feedback.) These methods, in principle, require strengthening the links (and integration) between the diagnostic feedback and remedial activities by including more specific guidance for remedial learning and instruction in the score and feedback report. More specifically, it would also be helpful to have an elaborate feedback system with a built-in routing and referral mechanism that guides the learner to relevant remedial activities, and to make it compulsory (or required) for the learners to go through a course of recommended remedial activities and be re-assessed on what they learned and how much they improved. This also assumes that the feedback and remedial learning components can be designed, developed, and implemented in an integrated way.
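A routing and referral mechanism of this kind can be sketched as a simple lookup from weak attributes to recommended activities, with re-assessment flagged as required. The attribute names, the activity labels in `REFERRALS`, and the threshold are all hypothetical, and the sketch makes no claim about how an operational DLA system would implement routing.

```python
# Hypothetical mapping from attributes to remedial activities.
REFERRALS = {
    "vocabulary": ["Review the missed items with dictionary entries",
                   "Vocabulary practice set"],
    "syntax": ["Review the relevant grammar reference section",
               "Sentence-combining practice set"],
}
DEFAULT_REFERRAL = ["Consult the teacher for an individual plan"]


def remedial_plan(profile, threshold=0.6):
    """Route each weak attribute to activities and require re-assessment.

    `profile` maps attribute names to scores between 0 and 1.
    Attributes at or above the threshold receive no referral.
    """
    plan = {}
    for attribute, score in profile.items():
        if score < threshold:
            plan[attribute] = {
                "activities": REFERRALS.get(attribute, DEFAULT_REFERRAL),
                "reassessment_required": True,
            }
    return plan


def improvement(before, after):
    """Score change per attribute between diagnosis and re-assessment."""
    return {a: after[a] - before[a] for a in before if a in after}


plan = remedial_plan({"vocabulary": 0.9, "syntax": 0.4, "inference": 0.5})
for attribute, entry in plan.items():
    print(attribute, entry["activities"])
```

The `improvement` helper stands in for the re-assessment step: comparing the profile before and after the remedial activities gives a first indication of how much the learner improved.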

Conclusions and some areas for future exploration

DLA, along with other learning-oriented language assessment approaches, represents language testers’ collective attempts to make language assessments more accountable and responsible for their consequences on learning and instruction. In this article, many more questions have been asked than answers provided about DLA, partly because DLA is still in an early period of development. New developments in psychometrics and measurement have served as important inspirations and driving forces of exploration for DLA. New possibilities are also being created for DLA by recent progress in automated essay and speech evaluation technology, online corpus linguistic analysis, and information technology-enhanced learning programs. At this particular point in the evolution of DLA, what we can do as a community is to continue to develop, evaluate, and improve the DLA system, with the optimistic belief that such efforts will eventually enhance the diagnostic power of existing diagnostic and non-diagnostic tests and also deepen our understanding of the ways to maximize the positive impact of assessment on, and its contribution to, language learning.

In this regard, the following are some promising areas for future exploration by DLA researchers and developers.

Multi-staged diagnosis, multi-layered feedback systems

In this DLA framework, a multi-staged approach to diagnosis is combined with a multi-layered system of feedback. First, a multi-staged DLA system refers to a series of hierarchically and horizontally linked tests that are administered together, preferably starting from tests assessing weaknesses and strengths at general levels and then moving on to those assessing the underlying causes of the weaknesses. The diagnostic information obtained at the different stages of DLA should be linked across stages. Second, a multi-layered feedback system is an interactive feedback system that can adjust its level of granularity in response to the expressed needs of, or the specificity levels of feedback requested by, individual learners and teachers, which is analogous to an “online interactive map” with zooming-in and zooming-out functions. This requires that data and information covering a wide range of granularity be collected, analyzed, and linked across different stages and levels of DLA. If such a multi-tiered DLA system is successfully established, it will give us not only a bird’s-eye view of the weakness–strength patterns of the language learners but also a close-up view of the underlying causes of the identified weaknesses.
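The zooming analogy can be illustrated with a minimal sketch in which feedback is stored as a tree (section, then attribute, then item-level comment) and the requested depth determines how much of the tree is shown. The `FeedbackNode` class, the `zoom` function, and the sample messages are hypothetical and serve only to illustrate the idea of linked layers.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackNode:
    """One piece of feedback, with more specific feedback nested beneath it."""
    label: str
    message: str
    children: list["FeedbackNode"] = field(default_factory=list)


def zoom(node: FeedbackNode, depth: int) -> list[tuple[int, str, str]]:
    """Return (level, label, message) rows from the top layer down to `depth`.

    depth=0 gives only the most general layer; larger values zoom in.
    """
    rows = []

    def walk(current: FeedbackNode, level: int) -> None:
        rows.append((level, current.label, current.message))
        if level < depth:
            for child in current.children:
                walk(child, level + 1)

    walk(node, 0)
    return rows


# Hypothetical feedback for one learner.
report = FeedbackNode(
    "Reading section", "Overall performance is below the target level.",
    [
        FeedbackNode(
            "Inference", "A relative weakness.",
            [FeedbackNode("Item 12", "The chosen option restates the text "
                                     "rather than drawing a conclusion.")],
        ),
        FeedbackNode("Vocabulary", "A relative strength."),
    ],
)

for level, label, message in zoom(report, depth=1):
    print("  " * level + f"{label}: {message}")
```

Because each specific comment sits beneath the general statement it elaborates, zooming in or out never breaks the link between layers.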

Utilization of learners’ misconceptions and errors in refining diagnosis and feedback

The underlying causes of the language learners’ weaknesses can be manifested or revealed through various indicators observed in their behaviors, performance, and responses to the assessment tasks. Such indicators can probably be found in the contents of incorrect options selected by the learner for the MC items, incorrect answers supplied for short-answer items, and extended responses constructed for speaking or writing tasks. In the case of constructed response tasks, the weaknesses can be indicated by various features of the constructed responses, which may include under-developed idea units; lack of coherence/cohesion and of lexical/structural variety; inappropriate choice of words/structures; collocational, morpho-syntactic, and pronunciation/intonation errors; and the presence of disfluency markers. These indicators can also be closely related to the learners’ misconceptions, faulty understanding, insufficient breadth/depth of lexical/grammatical knowledge, incompletely acquired comprehension subskills, underdeveloped automaticity, or various other linguistic/communicative deficiencies. In the future, there should be more systematic investigations into the relationships between various types of learner errors/text features on the one hand and the deficiencies and weaknesses revealed by them on the other. Insights from such investigations can inform future efforts to refine the existing DLA methods, particularly the ways to diagnose the root causes of the learners’ weaknesses. If such refined technologies of diagnosis and feedback are successfully developed and implemented in future DLA tools and systems, they can greatly improve the accuracy of diagnosis and the effectiveness of feedback.

Refining validity frameworks for DLA

One interesting feature of a diagnostic score and feedback report is that it can include not only score information that can be psychometrically evaluated but also qualitative information that describes the nature of attributes (knowledge, traits, skills, or processes), learners’ proficiency levels on the attributes, weakness–strength patterns, and referral information for recommended remedial activities. Meaningful and defensible linkages should be created, through the process of translation, score anchoring, and standard setting, between the qualitative and the quantitative information provided in the report, and these linkages should also be carefully evaluated and validated. (See also Chapelle, Cotos, & Lee’s article in this volume for a validity framework for DLA.) Moreover, the ultimate test for DLA has to do with the extent to which the diagnostic feedback and associated remedial activities result in learning gain and improvement on the part of the learners. Consequential validity evidence should therefore be viewed as a critical component of the DLA validity framework, which requires identifying the causal connections between various features of DLA systems and the learners’ improvement on the treated weaknesses and overall progress toward the targeted learning goals. All of this suggests that the validity framework for DLA should be expanded to include not only the arguments for the diagnostic score profiles and their utility but also those for the corresponding qualitative information and the linkage established between these two types of information.

Footnotes

Funding

This research project was supported by the Overseas Study Support Fund for Humanities and Social Sciences Faculty of Seoul National University (SNU). I would like to thank Charles Alderson, Rosalie Hirch, and three anonymous reviewers for their helpful comments about the earlier manuscript.

References

Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. New York: Continuum.

Alderson, J. C., Brunfaut, & Harding (2014). Toward a theory of diagnosis in second and foreign language assessment: Insights from professional practice across diverse fields. Applied Linguistics (advanced access). Retrieved February 21, 2014, from http://proxy-net.snu.ac.kr/a4e6947/_Lib_Proxy_Url/applij.oxfordjournals.org/content/early/recent

Alderson, J. C., Clapham, C. M., & Wall (1995). Language test construction and evaluation. Cambridge, UK: Cambridge University Press.

Alderson, J. C., & Wall (1993). Does washback exist? Applied Linguistics, 14, 115–120.

American Heritage® Medical Dictionary. (2007). New York: Houghton Mifflin. Retrieved July 1, 2014, from http://medicaldictionary.thefreedictionary.com/diagnosis

Bachman, L. F. (1990). Fundamental considerations in language testing. New York: Oxford University Press.

Biber, Nekrasova, & Horn (2011). The effectiveness of feedback for L1-English and L2-writing development: A meta-analysis (TOEFL iBT Research Report No. TOEFLiBT-14). Princeton, NJ: ETS.

Bitchener & Ferris (2011). Written corrective feedback in second language acquisition and writing. New York: Routledge.

Black, P. J., & Wiliam (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation, and Accountability, 21(1), 5–31.

Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971). Handbook of formative and summative evaluation of student learning. New York: McGraw-Hill.

Buck & Tatsuoka, K. K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15, 119–157.

Buck, Tatsuoka, & Kostin (1997). The subskills of reading: Rule-space analysis of a multiple-choice test of second language reading comprehension. Language Learning, 47, 423–466.

Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65, 245–281.

Chapelle, Cotos, & Lee (2015). Diagnostic assessment with automated writing evaluation: A look at validity arguments for new classroom assessments. Language Testing, 32(3), 000–000.

Cheng, Watanabe, & Curtis (2004). Washback in language testing: Research contexts and methods. New York: Routledge.

Cohen (1994). Assessing language ability in the classroom (2nd ed.). Boston, MA: Newbury House/Heinle & Heinle.

Davies (1968). Language testing symposium: A psycholinguistic perspective. London: Oxford University.

Enright, M. K., & Quinlan (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing, 27(3), 317–334.

Frederiksen, J. R., & Collins (1989). A systems approach to educational testing. Educational Researcher, 18(9), 27–32.

Harper (2010). Online etymology dictionary. New York: Author. Retrieved July 1, 2014, from www.etymonline.com/index.php?allowed_in_frame=0&search=diagnosis&searchmode=none

Hattie & Timperley (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.

Henning (1987). A guide to language testing. Rowley, MA: Newbury House.

Hirch, R. (2014). Developing theory-based diagnostic tests of grammar: Application of processability theory. Unpublished M.A. thesis, Seoul National University, Seoul, South Korea.

Hyland (2006). Feedback on second language students’ writing. Language Teaching, 39(2), 83–101.

Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching and learning in the context of NG TOEFL. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Jang, E. E., Dunlop, Park, & Vander Boom (2015). Mediation of goal orientations and perceived ability on junior students’ responses to diagnostic feedback. Language Testing, 32(3), 000–000.

Kim, Y.-H. (2011). Diagnosing EAP writing ability using the Reduced Reparameterized Unified Model. Language Testing, 28(4), 509–541.

Kunnan, A. J., & Jang, E. E. (2011). Diagnostic feedback in language assessment. In M. H. Long & C. J. Doughty (Eds.), Handbook of language teaching (pp. 610–627). New York: Wiley-Blackwell.

Langlois, J. P. (2002). Making a diagnosis. In M. B. Mengel, W. L. Holleman, & S. A. Fields (Eds.), Fundamentals of clinical practice (2nd ed., pp. 198–218). New York: Kluwer Academic.

Lee, Y.-W., Gentile, & Kantor (2010). Toward automated multi-trait scoring of essays: Investigating relationships among holistic, analytic, and text feature scores. Applied Linguistics, 31(3), 391–417.

Lee, Y.-W., & Sawaki (2009). Cognitive diagnosis approaches to language assessment: An overview. Language Assessment Quarterly, 6(3), 172–189.

Lyster & Ranta (1997). Corrective feedback and learner uptake: Negotiation of form in communicative classrooms. Studies in Second Language Acquisition, 19(1), 37–66.

Messick (1996). Validity and washback in language testing. Language Testing, 13(3), 241–256.

Poehner & Zhang (2015). Building computerized dynamic assessment (C-DA): Diagnosing L2 development according to learner responsiveness to mediation. Language Testing, 32(3), 337–357.

Rea-Dickins & Gardner (2000). Snares and silver bullets: Disentangling the construct of formative assessment. Language Testing, 17(2), 215–243.

Read (2008). Identifying academic language needs through diagnostic assessment. Journal of English for Academic Purposes, 7, 180–190.

Torrance & Pryor (1998). Investigating formative assessment: Teaching, learning and assessment in the classroom. Buckingham: Open University Press.

Treasure (2011). Diagnosis and risk management in primary care: Words that count, numbers that speak. New York: Radcliffe.

von Davier (2005). A general diagnostic model applied to language testing data (ETS Research Report No. RR-05–16). Princeton, NJ: ETS.

(2010). Automated scoring and feedback systems: Where are we and where are we heading? Language Testing, 27(3), 291–300.