Abstract
Previous research in cognitive diagnostic assessment (CDA) of L2 reading ability has frequently been conducted using large-scale English proficiency exams (e.g., TOEFL, MELAB). Using CDA, it is possible to analyze individual learners’ strengths and weaknesses in multiple attributes (i.e., knowledge, skill, strategy) measured at the item level. This study explored how a placement test score could be used for diagnosing the L2 reading ability of incoming students to an adult ESL program. Five content experts completed a reading placement test and identified the attributes required for successfully completing each item on the test, while referring to the list of L2 reading attributes. Then, content experts’ codings were analyzed and developed into an item-by-attribute Q-matrix. The Fusion model, a type of cognitive diagnostic model (CDM), was used for refining the Q-matrix and diagnosing 1982 learners’ strengths and weaknesses in L2 reading. Results suggest that 10 major L2 reading attributes were involved in the reading test. In addition, examinees’ strengths and weaknesses were identified for the overall group, three reading proficiency groups (i.e., beginner, intermediate, and advanced), and individual learners. Such information could be provided to ESL program administrators and teachers for enhancing the reading curriculum and developing instructional materials.
Keywords
Cognitive diagnostic assessment (CDA) has recently received much interest in the field of second language assessment since it can be used to provide diagnostic feedback to stakeholders and ultimately promote language learning. A diagnostic assessment could be defined as “a test which is used for the purpose of discovering a learner’s specific strengths or weaknesses” (ALTE, 1998, p. 142). To be effective, diagnostic feedback should contain detailed information on individual examinees’ strengths and weaknesses of a cognitive domain (e.g., reading, mathematics). Moreover, in order to provide high-quality feedback, it is beneficial to identify the various components of a cognitive domain based on both theoretical and empirical evidence. For example, in conducting cognitive diagnostic assessment of second language (L2) reading ability, L2 reading ability and its components could be defined using theories of language ability in general and L2 reading.
Providing stakeholders with detailed feedback on students’ strengths and weaknesses could lead to enhanced learning of English. For example, program administrators and teachers can determine the type of future instruction that is necessary for the students (Hughes, 1989; Lee & Sawaki, 2009a). Specifically, a classroom teacher could use diagnostic information to design remedial instructional materials for an individual learner. At a more global level, program administrators could develop a curriculum that meets a group of learners’ needs.
In CDA, the various components of a cognitive domain are referred to as attributes. Attributes are “[cognitive] procedures, skills, or knowledge a student must possess in order to successfully complete the target task” (Birenbaum, Kelly, & Tatsuoka, 1993, p. 443). That is, attributes are domain-specific skills and knowledge that are needed to demonstrate mastery in a cognitive domain (Chipman, Nichols, & Brennan, 1995; Leighton & Gierl, 2007). CDA is powerful since it allows examining individual learners’ strengths and weaknesses of multiple attributes measured at an item level. These attributes should not be a mere list of skills. Rather, they should be systematically categorized according to a specific cognitive theory. For instance, attributes of L2 reading ability – or L2 reading attributes – consist of knowledge, skills, and strategies, which are involved in comprehending texts (Templin, 2004; Birenbaum et al., 1993). L2 reading attributes include various reading skills and strategies, which will be used interchangeably in this paper. However, in the literature, some claim that skill is a predisposition to perform in a certain way, which is automatized and unconscious in nature (Urquhart & Weir, 1998), whereas strategy is the conscious process involved in reading (Alderson, 2000; Cohen, 2007). Others note the differences between skills and strategies (e.g., Faerch & Kasper, 1983; Purpura, 1999), but state that a strategy could become more automatized and skill-like as readers become more proficient. In the present study, reading attributes will encompass the notion of language knowledge, skills, and strategies involved in reading.
Previous studies in CDA of L2 reading (e.g., Buck, Tatsuoka, & Kostin, 1997; Jang, 2005; Kasai, 1997; Li, 2011; Sawaki, Kim, & Gentile, 2009; Scott, 1998) have used large-scale English proficiency tests (e.g., TOEIC, TOEFL, MELAB) for identifying examinees’ mastery of L2 reading attributes. For instance, in a ground-breaking study of diagnosing L2 reading, Jang (2005) first conducted a verbal protocol analysis to identify the L2 reading attributes used by ESL learners in completing the reading section of the Test of English as a Foreign Language (TOEFL). Nine attributes were identified: (1) deducing word meaning from the context; (2) determining word meaning out of the context; (3) comprehending text through syntactic and semantic links; (4) comprehending text-explicit information; (5) comprehending text-implicit information at the global level; (6) inferring major arguments or a writer’s purpose; (7) comprehending negatively stated information; (8) summarizing major ideas from minor details; and (9) determining contrasting ideas through diagrammatic display. The Fusion model (Hartz, Roussos, & Stout, 2002), a type of cognitive diagnostic model (CDM), was used to evaluate the attributes and to diagnose examinees’ L2 reading ability. Individual diagnostic score reports were provided to ESL learners and teachers, who found them useful for learning.
More recently, Sawaki et al. (2009) used expert judgment to identify the attributes involved in the reading and listening sections of the Test of English as a Foreign Language Internet-based Test (TOEFL iBT) as part of an attempt to provide diagnostic score reports to stakeholders. For the reading section, six content experts first identified the L2 reading attributes involved in each test. These attributes were identified based on the three reading constructs of the test (i.e., basic comprehension, inferencing, and reading to learn) and task analysis. Afterwards, the Fusion model analysis was implemented to refine the L2 reading attributes into the following: (1) understanding word meaning; (2) understanding specific information; (3) connecting information; and (4) synthesizing and organizing information. However, this study did not further examine the pedagogical impact of CDA.
In fact, few studies of CDA in L2 reading, except for Jang (2005), have explored the type of diagnostic feedback that could be provided to stakeholders or the pedagogical implications it could have. It may have been difficult to do so considering that these studies used large-scale proficiency tests, which were often administered outside of instructional contexts. Rather, prior research has focused mostly on the feasibility of using CDMs in language assessment or identifying L2 reading attributes involved in L2 reading tests. To explore the full capacity of CDA, for example within an ESL program, it is necessary to explore the type of diagnostic feedback that could be provided to program administrators and teachers, and how such information could be potentially used for instructional purposes.
Diagnostic information could be highly useful in the context of the current study, an adult ESL program within a US college. In the program, placement test scores have been used for placing incoming students into classes at different proficiency levels. In the beginning of each semester, ESL teachers are given information on the learners’ total placement scores as a reference. Yet it is often insufficient to identify their students’ strengths and weaknesses in reading. Therefore, some teachers design their own diagnostic tests (e.g., a short essay or a conversation test) or assignments to better gauge their students’ needs, indicating a need for more diagnostic information regarding the students. Using CDA, diagnostic information could be extracted not only at the individual learner level, but also at different language proficiency levels (i.e., beginners, intermediates, advanced). Such information could be provided to ESL program administrators and teachers, and lead to enhanced reading curriculum at both the classroom-level and program-level.
In addition, the identified L2 reading attributes in previous studies vary widely, partly because the attributes were not drawn from a unified reading theory or language ability model. The current study defines reading as the process of constructing meaning while interacting with the text (RAND Reading Study Group, 2002). It is an instance of L2 language use in which language ability is used for understanding the written text (Bachman & Palmer, 1996). Therefore, reading ability could be discussed within the larger framework of a language ability model, which defines language ability as consisting of language knowledge and strategic competence (e.g., Bachman & Palmer, 1996; Purpura, 2004). In other words, reading ability can be examined in terms of both language knowledge and strategic competence, which are necessary when interacting with the text. Using a language ability model, complemented by various reading theories, as a framework for L2 reading research will allow one to define the relationship among attributes very clearly in a systematic fashion.
Therefore, the main purpose of the current study was to diagnose ESL learners’ L2 reading ability based on their placement test scores with the ultimate goal of providing diagnostic feedback to ESL program administrators and teachers. The Fusion model, a type of CDM, was used for cognitive diagnosis of L2 reading ability. CDMs are psychometric models that can be used to evaluate students’ strengths and weaknesses in a cognitive domain (de la Torre, 2009). Specifically, the Fusion model is an attribute-level item-response model (Roussos, DiBello, Stout, Hartz, Henson, & Templin, 2007), used for determining examinees’ mastery (i.e., strengths) and non-mastery (i.e., weaknesses) of each attribute at the item level. To fulfill the study purpose, first, the L2 reading attributes were drawn from a general model of language ability (e.g., Bachman & Palmer, 1996; Purpura, 2004) and reading theories (e.g., Cohen & Upton, 2006; Grabe, 2009; Phakiti, 2007; Pressley & Afflerbach, 1995; Weir, Hawkey, Green, & Devi, 2009). Then, information on the identified L2 reading attributes was used to diagnose the L2 reading ability of adult ESL learners enrolled in an ESL program in a US college. Diagnostic information was extracted from their placement test scores not only at the individual learner level, but also at different language proficiency levels (i.e., beginner, intermediate, advanced).
The present study investigated the following research questions:
What are the major L2 reading attributes involved in successfully completing a reading test?
What are the strengths and weaknesses of the examinees’ L2 reading ability for the overall group, three different proficiency groups (i.e., beginners, intermediates, advanced), and individual learners?
Literature review
Q-matrix development
The preliminary step in conducting a cognitive diagnosis of learners’ test performance is to construct a Q-matrix. A Q-matrix is a “cognitive design matrix that explicitly identifies the cognitive specification for each item” (de la Torre, 2009, p. 2). It represents the cognitive components required in answering test items, by having rows represent items, and columns represent attributes or skills (Tatsuoka, 1990). The Q-matrix is important in conducting studies involving cognitive diagnostic models since the diagnostic power of the models is often determined by the theoretical and empirical soundness of the Q-matrix (Lee & Sawaki, 2009a).
The Q-matrix was first proposed by Tatsuoka (1990), who integrated the conceptual frameworks of Falmagne and Doignon (1988) and Haertel and Wiley (1993). The matrix links unobservable knowledge states to observable item-response patterns (Tatsuoka, 1995). The Q-matrix consists of an [i × k] matrix of binary information in ones and zeros, where i is the number of items on the test and k represents the number of attributes. For a given element in the Q-matrix in the ith row and the kth column ($q_{ik}$), a value of 1 indicates that item i requires attribute k, and a value of 0 indicates that it does not.
A sample Q-matrix.
The Q-matrix focuses on the conjunctive interaction of attributes on a task. Conjunctive interaction means that the correct response to an item requires the knowledge of all attributes represented in the item. By contrast, compensatory interaction of attributes on a task refers to when a high level of competence in one attribute compensates for a low level of competence in another attribute in completing an item (DiBello, Roussos, & Stout, 2007). Owing to the conjunctive nature of the Fusion model, test-takers are assumed to possess the knowledge of all attributes tested by an item to correctly respond to the item (refer to the ‘Fusion model analysis’ section for more details on the model).
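The contrast between conjunctive and compensatory interaction can be made concrete with a small sketch. The Q-matrix values, the mastery profile, and both function names below are hypothetical and serve only as illustration; the compensatory rule shown (proportion of required attributes mastered) is one simple possibility, not a rule taken from the literature cited above.

```python
# Hypothetical 4-item x 3-attribute Q-matrix (rows = items, columns = attributes).
# An entry of 1 means the item requires that attribute; values are illustrative only.
Q = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

def conjunctive_ideal_response(alpha, q_matrix):
    """Return 1 for each item whose required attributes are ALL mastered, else 0."""
    return [
        int(all(a == 1 for a, q in zip(alpha, row) if q == 1))
        for row in q_matrix
    ]

def compensatory_score(alpha, q_matrix):
    """Return, per item, the proportion of required attributes that are mastered."""
    return [
        sum(a * q for a, q in zip(alpha, row)) / sum(row)
        for row in q_matrix
    ]

# An examinee who has mastered attributes 1 and 2 but not attribute 3.
alpha = [1, 1, 0]
print(conjunctive_ideal_response(alpha, Q))
print(compensatory_score(alpha, Q))
```

Under the conjunctive rule, the examinee is expected to miss every item that requires attribute 3, whereas the compensatory score gives partial credit for the attributes that were mastered.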
Furthermore, the specificity of the attributes defined in the Q-matrix is crucial because the granularity of attributes affects psychometric analysis. The more fine-grained the attributes are, the richer the diagnosis of learners’ strengths and weaknesses will be. However, there is a trade-off between the richness and the stability of diagnosis (Haberman & von Davier, 2006). According to Lee and Sawaki (2009b), [a]s the level of specification for the Q-matrix becomes more detailed, a much larger number of items are required to represent a universe of the attributes in the test. In addition, it is likely that the more fine grained the attributes are, the more difficult it can become to maintain the consistency of diagnosis across occasions or test forms, potentially contributing to instability and unreliability of examinee classification. (p. 184)
Thus, in practice, similar attributes are often combined to reduce the number of attributes. For instance, Sawaki et al. (2009) treated two skills, understanding and connecting information within a paragraph and understanding and connecting information across paragraphs, as involving similar cognitive procedures. Therefore, they were combined into one attribute called connecting information (p. 198).
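One way to see why finer-grained attributes strain the stability of diagnosis is to count the possible mastery profiles. The sketch below assumes each attribute is scored as mastered or not mastered, so that K attributes yield 2^K profiles; this counting argument is offered as an illustration and is not taken from the studies cited above. The function names are hypothetical.

```python
def n_mastery_profiles(n_attributes):
    """Number of distinct mastery/non-mastery profiles for binary attributes."""
    return 2 ** n_attributes

def examinees_per_profile(n_examinees, n_attributes):
    """Average number of examinees available per possible profile."""
    return n_examinees / n_mastery_profiles(n_attributes)

# Compare a coarse, a moderate, and a fine-grained attribute specification
# for a sample the size of the one in this study (1982 examinees).
for k in (4, 10, 15):
    print(k, n_mastery_profiles(k), round(examinees_per_profile(1982, k), 2))
```

With 10 attributes there are already 1024 possible profiles, so merging conceptually similar attributes quickly reduces the number of classifications the test must support.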
Fusion model analysis
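The Fusion model (Hartz et al., 2002), a type of cognitive diagnostic model, is used to make inferences about each examinee’s mastery level of each attribute, based on the examinee’s item responses (DiBello & Stout, 2008). Specifically, the Fusion model is a skill-level item-response model (Roussos et al., 2007), which “expresses the probability of an examinee [j] answering an item i correctly as a function of both the examinee characteristics (skills mastery level profile vector) and item characteristics (item parameters that are linked to an items required skills)” (DiBello & Stout, 2008, p. 9). The item-response function is as follows:
$$P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j, \theta_j) = \pi_i^{*} \prod_{k=1}^{K} \left(r_{ik}^{*}\right)^{(1 - \alpha_{jk})\, q_{ik}} \, P_{c_i}(\theta_j)$$
That is, the probability that examinee j answers item i correctly ($X_{ij} = 1$) depends on the examinee’s attribute mastery profile $\boldsymbol{\alpha}_j$ (where $\alpha_{jk} = 1$ if examinee j has mastered attribute k and 0 otherwise), the Q-matrix entries $q_{ik}$, the item parameters $\pi_i^{*}$ and $r_{ik}^{*}$, and a term $P_{c_i}(\theta_j)$ that involves the examinee’s ability parameter $\theta_j$.
In this study, for model simplification, the Reduced Fusion model was used by repressing the ability parameter $\theta_j$, so that the term $P_{c_i}(\theta_j)$ drops out of the function.
The parameter $\pi_i^{*}$ is the probability of correctly applying all the attributes required by item i, given that the examinee has mastered all of them.
A high value above .6 indicates that examinees will have a good chance of correctly responding to the item if they have mastered all the necessary attributes.
The parameter $r_{ik}^{*}$ is a penalty for not having mastered attribute k on item i. That is, it is the ratio of (1) the probability of invoking all attributes on item i given an examinee is a non-master of attribute k but a master of all other attributes for the item and (2) the probability of invoking all attributes on item i given an examinee is a master of all attributes for item i. It compares correct item-response probability between the mastery of attribute k and the non-mastery of attribute k. Thus, it is a reverse indicator of an item’s discriminating power for attribute k. For example, a low $r_{ik}^{*}$ value indicates that item i discriminates well between masters and non-masters of attribute k.
The Fusion model also computes the examinee population’s probability of mastery of attribute k, or $p_k$.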
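As a minimal sketch of the reduced item-response function, assume the usual reduced parameterization: the probability of a correct response is an item-level baseline `pi_star` (the success probability for masters of all required attributes) multiplied by a penalty `r_star[k]` for every required attribute the examinee has not mastered. The function name and all parameter values below are hypothetical; this is not the Arpeggio implementation.

```python
def reduced_fusion_prob(pi_star, r_star, q_row, alpha):
    """Probability of a correct response under a reduced Fusion-style model.

    pi_star: baseline probability for an examinee who has mastered every required attribute
    r_star:  per-attribute penalties in (0, 1]; lower means stronger discrimination
    q_row:   the item's row of the Q-matrix (1 = attribute required)
    alpha:   the examinee's mastery profile (1 = mastered)
    """
    prob = pi_star
    for r, q, a in zip(r_star, q_row, alpha):
        # The penalty applies only when the attribute is required (q = 1)
        # and not mastered (a = 0); otherwise the exponent is 0 and r ** 0 = 1.
        prob *= r ** ((1 - a) * q)
    return prob

# Hypothetical item that requires attributes 1 and 3 out of three attributes.
pi_star = 0.9
r_star = [0.4, 1.0, 0.8]
q_row = [1, 0, 1]

full_master = reduced_fusion_prob(pi_star, r_star, q_row, [1, 1, 1])
lacks_attr_1 = reduced_fusion_prob(pi_star, r_star, q_row, [0, 1, 1])
lacks_attr_3 = reduced_fusion_prob(pi_star, r_star, q_row, [1, 1, 0])
print(full_master, lacks_attr_1, lacks_attr_3)
```

In this illustration the low penalty value for attribute 1 produces a much larger drop in success probability than the value for attribute 3, which is the sense in which a low ratio signals high discriminating power.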
Moreover, the Fusion model produces information regarding each examinee’s probability of mastery for each attribute, or
In the function,
Methods
Participants
Coders
Five content experts in ESL identified the L2 reading attributes involved in the reading test. Content experts were recruited as coders since the purpose of the study was to identify the L2 reading attributes required for successfully completing the reading test items. Numerous attributes could be involved in reading, but the focus was on examining the major attributes that were essential for correctly responding to the test items. The coders were graduate students in TESOL or Applied Linguistics at a US college and had considerable knowledge and experience in language testing and L2 reading. They comprised three females who were near-native speakers of English and two males who were native speakers of English. The coders all had experience of teaching ESL, including L2 reading. Two of the coders had experience of teaching in the ESL program concerned.
Examinees of the reading test
Examinees were 1982 new incoming students to the adult ESL program within a US college. Participants were 18 years or older and had diverse linguistic and cultural backgrounds. Some of the students were immigrants residing in the area, whereas others were international students and international corporate executives. Therefore, students were expected to have a wide range of reading proficiency.
Instruments
The list of L2 reading attributes
The list of L2 reading attributes (as presented in Table 2) was developed by drawing upon previous literature on language ability models (e.g., Bachman, 1990; Bachman & Palmer, 1996; Purpura, 2004) and reading theories (e.g., Cohen & Upton, 2006; Phakiti, 2007; Pressley & Afflerbach, 1995; Weir et al., 2009). The list consisted of L2 reading attributes that were presumed to be involved in reading.
Attributes of L2 reading ability.
Considering that reading ability is an instance of language use, L2 reading ability was defined within a broader communicative language ability (CLA) model (e.g., Bachman & Palmer, 1996; Purpura, 2004), in which language ability consists of language knowledge and strategic competence. Thus, L2 reading ability largely consists of language knowledge and strategic competence, necessary when a learner’s language ability interacts with the written text. Language knowledge is further categorized into grammatical knowledge, consisting of grammatical form and semantic meaning, and pragmatic knowledge. Grammatical form refers to “linguistic forms on the subsentential, sentential and suprasentential levels” (Purpura, 2004, p. 61) whereas semantic meaning concerns the “literal meaning expressed by sounds, words, phrases and sentences” (Purpura, 2004, p. 61). On the other hand, pragmatic knowledge is defined in terms of various pragmatic meanings. Pragmatic meaning draws on grammatical form and semantic meaning to convey the implied meaning of utterances, and can simultaneously embody a range of contextual, sociolinguistic (meaning associated with social norms, preferences, and expectations), sociocultural (meaning related to cultural norms, preferences, and expectations), psychological (meaning embodying attitude, affect), and rhetorical (meaning derived from coherence and genre) meanings (Grabowski, 2009; Purpura, 2004).
L2 reading ability also involves strategic competence, comprising both metacognitive strategies and cognitive strategies. Metacognitive strategies are conscious or unconscious reading activities that regulate cognitive strategies, such as assessing the situation, monitoring, and evaluating (Bachman & Palmer, 1996; O’Malley & Chamot, 1990; Phakiti, 2007; Purpura, 1999; Wenden, 1991). Assessing the situation refers to the higher-order executive function of assessing one’s own knowledge, resources, and constraints of the situation before engaging in a task. Monitoring strategies involve determining the effectiveness of one’s task performance during the task, whereas evaluating strategies refer to determining the effectiveness of task performance after its completion (Purpura, 1999). The specific constituents of these strategies were drawn from various sources in the reading literature (Phakiti, 2007; Pressley & Afflerbach, 1995; Weir et al., 2009). In contrast to metacognitive strategies, cognitive strategies of L2 reading ability are processes that directly interact with the target language and involve those required for comprehension (understanding the text), memory (transforming information to store in memory), and retrieval (recalling information from memory) (O’Malley & Chamot, 1990; Phakiti, 2007; Purpura, 1999; Wenden, 1991). These were further categorized into sub-strategies by drawing upon previous reading literature (e.g., Cohen & Upton, 2006; Phakiti, 2007; Pressley & Afflerbach, 1995; Weir et al., 2009).
Reading test
The reading test data was from the reading section of a placement test used in the adult ESL program, developed to place incoming students into classes at different proficiency levels. The reading test consisted of four reading passages and 30 multiple-choice items, which were scored dichotomously. As seen in Table 3, the reading passages differed in a number of ways: topic (i.e., complaint letter, Argentine ants, mysteries of smell, demise of Neanderthals), text structure (i.e., causation, compare/contrast, description, problem/solution), length (i.e., 185 to 533 words), and the number of items measured (i.e., six to nine items).
Structure of the reading test.
According to test specifications, the 30 reading test items (Items 44–73 in the placement test) measured two different types of meaning as seen in Table 4, either semantic meaning or pragmatic meaning of the text. Semantic meaning refers to the literal meaning expressed by utterances, whereas pragmatic meaning refers to comprehending highly contextualized implied meaning of the text based on the context or the readers’ prior knowledge (Purpura, 2004).
Item types.
Procedures
To identify the L2 reading attributes required for successfully completing each item on the reading test, individual items needed to be coded into attribute specifications. First, content experts participated in a group training session and familiarized themselves with the list of L2 reading attributes. Rather than recognizing all the language knowledge and strategies possibly involved in answering an item, they were asked to indicate the major attributes they utilized to respond to each item. It should be noted that not all attributes may be measured using one single test, nor suit the purpose of an adult ESL placement test. For example, if the reading passage is available only in the beginning and becomes unavailable while responding to the items, examinees may need to extensively tap into their memory strategy, which was not relevant for the current reading test. Coders practiced coding sample items by first solving each item and then immediately marking the major attributes they used on the blank Q-matrix. The list of L2 reading attributes was used as a reference whenever needed. After the training session, each coder coded the 30 items on the reading test. Once the coders finished coding the items, they individually met with the researcher to report the identified L2 reading attributes. They explained why they chose certain attributes at the item level. These sessions were recorded using a digital recorder.
Coded data were compiled and analyzed to develop an initial Q-matrix. Attributes which three or more coders agreed upon (i.e., 60% or higher coder agreement) were selected as essential for the item and included in the initial Q-matrix. Because more than two coders were involved (i.e., five coders), Fleiss’ kappa was calculated to measure the agreement among the coders (Landis & Koch, 1977). To comply with the criterion proposed by Hartz et al. (2002), attributes that were measured by fewer than three items, thereby not providing statistically meaningful information, were either merged with similar attributes or deleted from the Q-matrix (refer to the initial Q-matrix in Appendix A).
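The compilation steps described above can be sketched as follows. The data layout (`codings[c][i][k]`), the function names, and the toy codings are all hypothetical, and the choice to treat each item-by-attribute cell as one rated subject when computing Fleiss’ kappa is an assumption, since the unit of analysis for the agreement statistic is not specified here.

```python
def initial_q_matrix(codings, min_coders=3):
    """Keep an attribute for an item when at least min_coders coders marked it."""
    n_items, n_attrs = len(codings[0]), len(codings[0][0])
    return [
        [int(sum(coder[i][k] for coder in codings) >= min_coders) for k in range(n_attrs)]
        for i in range(n_items)
    ]

def sparse_attributes(q_matrix, min_items=3):
    """Indices of attributes measured by fewer than min_items items (merge or delete)."""
    return [k for k in range(len(q_matrix[0])) if sum(row[k] for row in q_matrix) < min_items]

def agreement_counts(codings):
    """Per item-attribute cell: [number of coders marking 0, number marking 1]."""
    n_coders = len(codings)
    counts = []
    for i in range(len(codings[0])):
        for k in range(len(codings[0][0])):
            marked = sum(coder[i][k] for coder in codings)
            counts.append([n_coders - marked, marked])
    return counts

def fleiss_kappa(counts):
    """Fleiss' kappa; counts[s][j] = raters assigning subject s to category j.

    Undefined (division by zero) when every rating falls in a single category.
    """
    n_subjects, n_raters = len(counts), sum(counts[0])
    # Observed agreement, averaged over subjects.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_subjects * n_raters)) ** 2
        for j in range(len(counts[0]))
    )
    return (p_bar - p_e) / (1 - p_e)

# Toy example: five coders, two items, two attributes.
codings = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[0, 0], [1, 1]],
    [[1, 0], [0, 1]],
]
print(initial_q_matrix(codings))
print(fleiss_kappa(agreement_counts(codings)))
```

For routine use, an established implementation of Fleiss’ kappa in a statistics package would be preferable to this hand-rolled version.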
The initial Q-matrix was further refined through Fusion model analysis in an iterative process. Reading test data were analyzed in conjunction with the Q-matrix using the Arpeggio Suite program, which implements the Fusion model. The first step in the Fusion model analysis involved an analysis of Markov Chain Monte Carlo (MCMC) convergence to guarantee that model parameters reached a stable value (Roussos, Templin, & Henson, 2007). Three parameters were evaluated: (1) examinees’ probability of mastery for each attribute (
After refining the Q-matrix, two types of fit statistics were measured to evaluate the fit of the model to the data: (1) FUSIONStats and (2) item mastery statistics (IMStats). FUSIONStats compare the difference between the observed item p-values (the proportion of observed correct items) and the estimated item p-values (the proportion of estimated correct items). A low difference between the two p-values suggests a good fit of the data. Item mastery statistics (IMStats) compare the observed performance of masters versus non-masters at the item level. In addition, the reliability of the Fusion model was examined by analyzing the Correct Classification Rate (CCR), the consistency of classification of examinees into masters versus non-masters of attributes. Finally, examinees’ probability of mastery for each attribute (
Results
Q-matrix development
Content experts’ coding was compiled into a Q-matrix. In the Q-matrix as seen in Appendix A, the rows represent the 30 items from the CEP reading test (Item 44 to Item 73), while the columns indicate the L2 reading attributes. The overall coder agreement obtained using Fleiss’ kappa was .38, indicating fair agreement among the coders (Landis & Koch, 1977). Attributes that were measured by fewer than three items were either merged with similar attributes or deleted from the Q-matrix following Hartz et al. (2002). Specifically, the five different types of pragmatic meanings (i.e., contextual, sociolinguistic, sociocultural, psychological, and rhetorical meanings) were measured by only eight items, so they were merged into one pragmatic meaning attribute. Also, the strategy of integrating ideas was measured by only two items, so it was merged into the summarizing strategy, which was conceptually similar. Likewise, predicting strategy was merged with inferencing strategy owing to their similarities.
Therefore, the initial Q-matrix consisted of 10 attributes, including five knowledge-related attributes (i.e., lexical; cohesive; sentence; paragraph/text; and pragmatic meaning) and five reading strategies (i.e., identifying word meaning; finding information; skimming; summarizing; inferencing). Each item appeared to measure a minimum of two to a maximum of four L2 reading attributes. Numerous reading attributes were involved in completing an item due to the complex nature of reading (Alderson, 2000; Urquhart & Weir, 1998). In addition, all items minimally measured one knowledge-related attribute and one reading strategy, indicating that readers used sequences of knowledge-related attributes and strategies when completing reading tasks.
To examine the statistical meaningfulness of the identified attributes in the initial Q-matrix, Fusion model analysis was conducted using the Arpeggio software. First, the Markov Chain Monte Carlo (MCMC) method was run with a long chain length of 200,000 and a burn-in period of 10,000 to guarantee convergence of parameters. Among the various parameters, the convergence of overall examinees’ probability of mastery for each attribute (
The fit of the Fusion model to the data was checked using FUSIONStats and item mastery statistics (IMStats). FUSIONStats analysis produced the observed item p-values and the estimated item p-values. The absolute difference between the observed and the estimated item p-value was minimal, and was below the suggested value of .05 for all items (L. Roussos, pers. comm., March 8, 2010). Moreover, the mean absolute difference between the p-values was low at .005. These results, summarized in Table 5, suggested that the Fusion model fit the data well.
Observed and estimated item p-values.
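The comparison of observed and estimated item p-values amounts to a simple per-item check. The sketch below assumes the two sets of p-values are already available as lists; the function name and the example values are hypothetical, and only the .05 threshold comes from the text above.

```python
def p_value_fit(observed, estimated, threshold=0.05):
    """Summarize absolute differences between observed and estimated item p-values."""
    diffs = [abs(o - e) for o, e in zip(observed, estimated)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        # Zero-based positions of items whose difference reaches the threshold.
        "flagged_items": [i for i, d in enumerate(diffs) if d >= threshold],
    }

# Hypothetical p-values for three items; the third item would be flagged.
observed = [0.80, 0.55, 0.30]
estimated = [0.79, 0.56, 0.36]
print(p_value_fit(observed, estimated))
```

An empty `flagged_items` list corresponds to the situation reported here, in which every item fell below the suggested value.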
In addition, item mastery statistics (IMStats) were used to compare the observed performance of masters versus non-masters at the item level. Three different values were examined to evaluate IMStats: (1) the phat (m), which refers to the probability of correctly responding to an item given mastery of the attributes required for that item (over all items); (2) the phat (nm), indicating the probability of correctly responding to an item given non-mastery of the attributes required for that item (over all items); and (3) the pdiff, indicating the average difference between phat (m) and phat (nm) across items. In the current data, as seen in Table 6, the average phat (m) across all items was .869, indicating that the average probability of a correct response to an item by masters of attributes was high at 86.9%. In contrast, the average phat (nm) was .474, so the average probability of a correct response to an item by non-masters of attributes was much lower at about 47.4%. Thus, the pdiff was .395, indicating that the masters of attributes on an item outperformed non-masters of attributes on average by 39.5 percentage points across all items. This high value indicated a good fit between the estimated model and the observed data, suggesting a strong diagnostic power of the model.
Probability of correctly responding to an item.
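The three IMStats summaries follow directly from the per-item values. The sketch below assumes that phat (m) and phat (nm) have already been obtained for each item; the function name and the two-item example are hypothetical. With the averages reported above, the arithmetic is .869 − .474 = .395.

```python
def im_stats(phat_m, phat_nm):
    """Return (mean phat(m), mean phat(nm), pdiff) across items."""
    mean_m = sum(phat_m) / len(phat_m)
    mean_nm = sum(phat_nm) / len(phat_nm)
    # The average of the per-item differences equals the difference of the averages.
    pdiff = sum(m - nm for m, nm in zip(phat_m, phat_nm)) / len(phat_m)
    return mean_m, mean_nm, pdiff

# Hypothetical values for two items.
mean_m, mean_nm, pdiff = im_stats([0.9, 0.8], [0.5, 0.4])
print(mean_m, mean_nm, pdiff)
```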
Figure 1 visually depicts the probability of getting the item correct for both the masters and the non-masters of attributes on each item. Generally, there was a clear pattern of separation between the masters and the non-masters of attributes for each item.

Item mastery statistics.
Furthermore, the reliability of the Fusion model was examined by evaluating the Correct Classification Rate (CCR) index. CCR refers to the consistency of classification of examinees into masters versus non-masters of attributes if the same test were administered to the same examinee group multiple times (Roussos et al., 2007). The CCR values range between zero and one. In the current data, the CCR was high at .876, suggesting a high reliability of the Fusion model.
Cognitive diagnosis of learners’ L2 reading ability
Examinees’ L2 reading performance was analyzed in terms of their mastery probability of the L2 reading attributes. The results were analyzed for the following: (1) the overall group; (2) three reading proficiency groups (i.e., beginner, intermediate, and advanced); and (3) individual examinees.
Overall group
The attribute mastery of the overall group was investigated by examining the population’s probability of mastery for each attribute (
Overall group’s attribute mastery probability.

Overall group’s mastery of L2 reading attributes (N = 1982).
Three reading proficiency groups – beginner, intermediate, and advanced groups
To examine the attribute mastery probability of three reading proficiency groups (i.e., beginner, intermediate, and advanced), the entire examinee population was stratified into five different reading proficiency levels according to their total scores. Using comprehensive sampling (Wiersma, 2000), three sub-groups were selected from the examinee population: beginner (bottom 20%), intermediate (middle 20%), and advanced (top 20%) levels. Then, the L2 reading attribute mastery probabilities of the three reading proficiency levels were obtained by averaging each group’s individual examinees’ attribute mastery probability,
As presented in Table 8 and Figure 3, the beginners performed poorly on the reading test, with very low mastery probabilities across all 10 attributes. Their mastery probabilities showed very little variability, ranging from approximately .05 to .2. The intermediates demonstrated average performance across the attributes. Their mastery probabilities for the knowledge-related attributes ranged from approximately .53 (cohesive meaning) to .70 (pragmatic meaning), while those for the reading strategies ranged from .56 (summarizing) to .95 (skimming), indicating more variability than among the beginners. Notably, the intermediates’ mastery probabilities on three strategies – identifying word meaning, finding information, and skimming – were very high, at above .9. This suggests that these three strategies were comparatively easier to master than the other attributes. Unsurprisingly, the advanced group had very high mastery probabilities, above .95, across all 10 L2 reading attributes.
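The grouping procedure described above (ranking examinees by total score, taking the bottom, middle, and top 20%, and averaging each group's attribute mastery probabilities) can be illustrated with a minimal sketch. The function name `group_mastery_profiles`, the rank-based banding, and the example data are assumptions made for illustration; they are not the study's actual procedure, software, or results.

```python
from statistics import mean


def group_mastery_profiles(total_scores, mastery_probs):
    """Return mean attribute mastery probabilities for three score-based groups.

    total_scores: one total test score per examinee.
    mastery_probs: one list per examinee, holding that examinee's
        mastery probability for each attribute.
    """
    # Rank examinees from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = len(order)
    k = n // 5  # size of one 20% band (assumption: bands are cut by rank)
    if k == 0:
        raise ValueError("need at least five examinees")
    bands = {
        "beginner": order[:k],                # bottom 20%
        "intermediate": order[2 * k:3 * k],   # approximately the middle 20%
        "advanced": order[n - k:],            # top 20%
    }
    profiles = {}
    for name, members in bands.items():
        rows = [mastery_probs[i] for i in members]
        # Average each attribute (column) over the group's examinees.
        profiles[name] = [mean(col) for col in zip(*rows)]
    return profiles


# Hypothetical data: 10 examinees, 2 attributes (not values from the study).
scores = [3, 9, 1, 7, 5, 0, 8, 2, 6, 4]
probs = [[s / 10, 1 - s / 10] for s in scores]
profiles = group_mastery_profiles(scores, probs)
```

Cutting the bands by rank rather than by score thresholds keeps the group sizes equal even when total scores are tied, although tied examinees at a band boundary are then assigned arbitrarily.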
L2 reading attribute mastery of reading proficiency groups.

L2 reading attribute mastery statistics of reading proficiency groups.
Individual examinees
To diagnose individual examinees’ L2 reading ability, the individual examinees’ attribute mastery probabilities, or
Individual examinees’ mastery probability of L2 reading attributes.
Among the 10 examinees, the performance of two – Examinees A and B – was examined in more detail. As seen in Figure 4, Examinee A had a mastery probability over .7 on Attributes 4, 6, 8, and 9. However, his performance on the remaining attributes was less successful, suggesting that these were his weaknesses. On the other hand, as seen in Figure 5, Examinee B obtained a high mastery probability of over .7 on most attributes, the exceptions being Attributes 2, 3, 4, and 9. These cases illustrate how two examinees may receive the same total score but differ in their strengths and weaknesses in L2 reading ability.
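Reading an individual profile in this way amounts to comparing each attribute's mastery probability against a cut-off. The sketch below assumes the .7 value used descriptively above as that cut-off; the function name `describe_profile` and the probability values are hypothetical, chosen only to mirror the patterns described for Examinees A and B, and are not the study's data.

```python
def describe_profile(mastery_probs, attribute_names, cutoff=0.7):
    """Split attributes into strengths (above cutoff) and weaknesses (the rest)."""
    strengths = [name for name, p in zip(attribute_names, mastery_probs) if p > cutoff]
    weaknesses = [name for name, p in zip(attribute_names, mastery_probs) if p <= cutoff]
    return strengths, weaknesses


attribute_names = [f"Attribute {k}" for k in range(1, 11)]

# Hypothetical mastery probabilities (illustrative only, not from the study).
examinee_a = [0.35, 0.20, 0.45, 0.85, 0.30, 0.90, 0.40, 0.75, 0.80, 0.25]
examinee_b = [0.90, 0.30, 0.40, 0.20, 0.85, 0.80, 0.95, 0.75, 0.35, 0.90]

strengths_a, weaknesses_a = describe_profile(examinee_a, attribute_names)
strengths_b, weaknesses_b = describe_profile(examinee_b, attribute_names)
```

A score report built on such a split would list the two examinees' weaknesses separately, even if their total scores were identical.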

L2 reading attribute mastery of Participant A.

L2 reading attribute mastery of Participant B.
Discussion and conclusion
Summary of the findings
The purpose of this study was to diagnose learners’ L2 performance on an L2 reading placement test with the ultimate goal of developing diagnostic feedback for ESL program administrators and teachers. This first required constructing the Q-matrix. Analysis of the Q-matrix identified 10 major L2 reading attributes necessary for completing the reading test, including five knowledge-related attributes (i.e., lexical; cohesive; sentence; paragraph/text; and pragmatic meaning) and five reading strategies (i.e., identifying word meaning; finding information; skimming; summarizing; inferencing). These findings suggest that the list of L2 reading attributes could be used as a framework for CDA research of L2 reading.
Examinees demonstrated different mastery probabilities across the 10 L2 reading attributes. This not only indicated their strengths and weaknesses, but also suggested that some attributes were more difficult to master than others. Among the knowledge-related attributes, cohesive meaning was the most difficult to master, whereas pragmatic meaning was the easiest. Cohesive meanings are mapped onto various cohesive forms through lexical cohesion, reference, and substitution (Halliday & Hasan, 1976), and replace previously mentioned parts in the text (de Beaugrande, 1980; Halliday & Hasan, 1976). Although research suggests that knowledge of cohesive meaning assists with comprehension (Cohen, Glasman, Rosenbaum-Cohan, Ferrara, & Fine, 1979), few studies have examined the relative difficulty of cohesive meaning compared to other types of meaning on L2 reading tests. However, studies on L2 grammar (e.g., Ameriks, 2009; Dakin, 2010; Liao, 2009) have found that cohesive meaning is more difficult than lexical meaning, which concurs with the results of the current study. Cohesive devices require an accurate understanding of the relationship between two parts of the text, making them challenging to master (Dakin, 2010).
In contrast to cohesive meanings, pragmatic meaning was the easiest to master. This finding may appear to counter the claims made in prior research, which has generally viewed pragmatic meaning as highly contextualized implied meaning, making it more difficult to understand than literal or intended meanings (e.g., Gray, 1960; Herber, 1978; Purpura, 2004). However, in accordance with the findings of the current study, Liao (2009) also found that pragmatic meaning items had a higher mean in comparison to semantic meaning items on a reading test. She explained that other factors, such as textual features, may have affected the difficulty of these items. In addition, test-takers may have lacked decoding skills, but had sufficient prior knowledge, which could have compensated for their deficiencies in language proficiency (Stanovich, 1980), resulting in a comparatively higher mean for pragmatic meaning items.
Among reading strategies, summarizing and inferencing were found to be more difficult than the other three strategies (i.e., identifying word meaning, finding information, skimming). These findings regarding the hierarchy of L2 reading attributes concur with previous research (Grabe & Stoller, 2002; Lumley, 1993). Summarizing may have been comparatively challenging for L2 learners since it requires readers first to comprehend the overall text and then to extract the gist from it. Understanding the gist involves numerous elements, such as knowledge of grammar, vocabulary, discourse structure, and various cognitive processes (Pressley, 2002). Similarly, inferencing may have been difficult to master owing to its complexity (e.g., Hammadou, 1991; Hosenfield, 1977; Long, Seely, Oppy, & Golding, 1996). It requires readers to understand not only the literal meaning of the text, but also its implied meaning. Overall, both summarizing and inferencing were identified as difficult attributes because they required higher-level processing of information from the text (Grabe, 2009).
Moreover, the three reading proficiency groups (i.e., beginner, intermediate, advanced) showed distinct attribute mastery probabilities. Not surprisingly, beginners had low mastery probabilities across all 10 attributes, with little variability. The intermediates had attribute mastery probabilities similar to those of the overall group, with high variability across attributes. Notably, their mastery probabilities on three strategies – identifying word meaning, finding information, and skimming – were very high. This suggests that these three strategies were comparatively easier to master than the other attributes. The advanced group had very high mastery probabilities across all 10 L2 reading attributes. Although examinees as a group showed a distinct attribute mastery pattern, individuals within each group still demonstrated high variability.
Study implications and suggestions for the future
Pedagogical implications could be drawn from this study regarding the type of diagnostic feedback that could be provided to various stakeholders. The detailed information regarding the attribute mastery probabilities of the overall group and the three reading proficiency groups, presented in Appendix C, could be provided as part of a score report to ESL program administrators and teachers. They could refer to this information to further refine the L2 reading curricula for each reading proficiency level. For beginners and intermediates, it is important to design a reading curriculum that reflects their uneven mastery pattern across L2 reading attributes. Typically, L2 reading attributes with lower mastery probabilities (e.g., cohesive meaning, summarizing, inferencing) would require more instruction than attributes with higher mastery probabilities (e.g., finding information, skimming). Advanced learners had high mastery probabilities across the 10 attributes, which could be attributed to their performing exceptionally well on the reading test, making it difficult to identify their weaknesses. To prevent such a ceiling effect, future studies need to include more difficult items in the test.
Similarly, ESL instructors could be provided with diagnostic information regarding individual learners’ strengths and weaknesses for instructional purposes. They could develop lesson plans tailored to individual learners’ specific language weaknesses or needs. For instance, if an individual learner lacked knowledge of cohesive meaning and knowledge of sentence meaning (refer to Figure 4), the teacher should focus on teaching these attributes. This could potentially lead to improvements in the learner’s L2 reading ability.
Although this study provides pedagogical implications for enhancing ESL instruction, some limitations exist. It used an ex post facto design (Wiersma, 2000), in which an existing reading test was retrofitted for diagnostic purposes, whereas ideally the test should have been designed specifically as a diagnostic test (Alderson, 2005). In a truly diagnostic test, the item-by-attribute relationship would have been pre-determined, which would have eased the process of developing a Q-matrix. Retrofitting a CDM to an existing test can limit, by design, the richness of the diagnostic information that can be extracted. Thus, in this study, attributes that were measured by fewer than three items, and therefore did not add meaningful information to the matrix, were either merged with other attributes or deleted from the Q-matrix, following the suggestions of Hartz et al. (2002). However, using a diagnostic test with a pre-determined Q-matrix may prevent such loss of attribute data and therefore ultimately lead to better CDA of learners’ language performance.
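The three-item rule mentioned above can be checked mechanically once a Q-matrix is available. The following is a minimal sketch, assuming the Q-matrix is stored as a list of item rows with 0/1 entries; the function name `sparse_attributes`, the attribute labels, and the small example matrix are hypothetical and do not reproduce the study's Q-matrix.

```python
def sparse_attributes(q_matrix, attribute_names, min_items=3):
    """Return {attribute: item count} for attributes measured by fewer than min_items items."""
    # Each column of the Q-matrix corresponds to one attribute;
    # summing a column counts the items that require that attribute.
    counts = [sum(column) for column in zip(*q_matrix)]
    return {
        name: count
        for name, count in zip(attribute_names, counts)
        if count < min_items
    }


# Hypothetical Q-matrix: 5 items (rows) by 3 attributes (columns).
q_matrix = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
attribute_names = ["attr_1", "attr_2", "attr_3"]

flagged = sparse_attributes(q_matrix, attribute_names)
```

Attributes flagged in this way are candidates for merging or deletion; deciding which of the two is appropriate remains a substantive judgment that the count alone cannot make.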
In the future, CDA research should focus on enhancing methods for developing the Q-matrix. There is as yet no standardized method of Q-matrix development. In this study, the Q-matrix was developed by having content experts code the reading test while referring to the list of L2 attributes. This was based on the assumption that content experts would know the most efficient way to reach the key to the item, whereas ESL learners may not. For instance, a beginning learner may refer to the wrong cues within a text to respond to an item, but may still respond to the item correctly. However, combining learners’ verbal protocols with expert judgment may have provided a more elaborate account of the major reading attributes involved in the test. Also, in the current study, coders coded the items individually and did not have an opportunity to convene to reach an agreement, which led to relatively low coder agreement. Holding additional group meetings may have led to higher coder agreement and better overall Q-matrix development.
In addition, considering the practical implications of diagnostic assessment, there is a dire need for more CDA research in L2 reading. Few studies have examined the effect of providing diagnostic feedback to stakeholders. Jang (2005) found that diagnostic feedback to learners and teachers contributed to enhancing learners’ L2 reading ability. However, in that study, students’ improvements could not be directly connected to the diagnostic information because they did not notice their weak attributes. She also found that teachers did not necessarily utilize diagnostic information for instructional purposes. However, only three teachers were interviewed, making it difficult to generalize the findings. Therefore, future studies need to address the washback effect of providing diagnostic feedback to different stakeholders. In the current research context of an ESL program, subsequent studies need to investigate program administrators’ and teachers’ perceptions of the score reports and their actual application of diagnostic information for instructional purposes. Also, it would be useful to identify the kinds of diagnostic information that are most informative and beneficial for pedagogical purposes.
Footnotes
Appendix A.
Appendix B.
Appendix C:
Acknowledgements
I would like to thank James Purpura and Michael Kieffer for their comments on earlier drafts, as well as the anonymous reviewers of Language Testing for their valuable feedback and suggestions.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
