Application of a Cognitive Diagnostic Model to a High-Stakes Reading Comprehension Test

Abstract

General cognitive diagnostic models (CDMs) such as the generalized deterministic input, noisy, “and” gate (G-DINA) model are flexible in that they allow for both compensatory and noncompensatory relationships among the subskills within the same test. Most of the previous CDM applications in the literature have been add-ons to simulation studies. Although there are some applications of CDMs such as the Fusion Model and the Rule Space Model to educational assessment data in general and second-language data in particular, there are few studies applying general models such as the G-DINA. The purpose of the present study was to demonstrate the application of the G-DINA to the reading comprehension data of a high-stakes test. To this end, an initial Q-matrix was developed, validated, and cross-validated. The skill profiles of the test takers were estimated using the “CDM” package in R. Throughout, the process of constructing and validating a Q-matrix was elaborated on, the benefits of general models were emphasized, and implications for research investigating inter-skill relationships were discussed. Finally, suggestions for further research, to better take advantage of the flexibilities of general diagnostic models, were presented.

Keywords

attribute, CDM, G-DINA, Q-matrix, subskill

Cognitive diagnostic models (CDMs) can maximize opportunities to learn by “pinpointing why students perform as they do” (Leighton & Gierl, 2007, p. 5). They decompose test tasks into the strategies, processes, and knowledge required to perform successfully on each task, thereby helping teachers to replace students’ faulty strategies (Embretson, 1983). Unlike conventional educational psychometric models such as item response theory (IRT), which are based on an investigator’s expectations of what cognitive processes test takers follow to solve problems in test-taking situations, CDMs are based on empirical evidence of the actual processes and strategies they follow in these situations. CDMs diagnose test takers’ competency along a set of multiple discrete/dichotomous skills. They predict the probability of an observable categorical response from unobservable (i.e., latent) categorical variables. These discrete latent variables have been variously termed skill, subskill, attribute, knowledge, ability, processes, and strategies.¹

Based on a comprehensive review of the literature, Rupp and Templin (2008) put forth a definition of CDMs which included the following defining characteristics: their multidimensionality, confirmatory nature, complex loading structure, and probabilistic nature. Just like factor analysis (FA) and IRT models, CDMs include multiple latent predictor variables. However, unlike conventional IRT and FA models, which assign to respondents a single score on a continuous scale representing a broadly defined ability, CDMs assign respondents to multidimensional skill profiles by classifying them as masters versus nonmasters of each skill involved in any given test. Moreover, CDMs are different from multidimensional IRT and FA models in that latent variables in CDMs are discrete or categorical (i.e., they indicate mastery/nonmastery), whereas ability estimates in multidimensional IRT models and factor scores in FA models are continuous. Because FA and IRT models typically operationalize broadly defined dimensions, they usually have a simple loading structure in the sense that each item loads on just one dimension. In contrast, in CDMs, where narrowly defined constructs are operationalized, each item typically requires multiple subskills leading to what is known as within-item multidimensionality (Adams, Wilson, & Wang, 1997; Baghaei, 2012). CDMs are also confirmatory in that the processes, strategies, or subskills required to perform successfully on items of any given test are specified in a Q-matrix (Tatsuoka, 1983) according to a substantive theory of a construct. Using an analogy from confirmatory factor analysis, a Q-matrix is the loading structure of a CDM wherein item-by-skill relationships are hypothesized. Then the theory-driven Q-matrix is tested against real data. CDMs are confirmatory in another sense which is rarely discussed. 
According to Rupp and Templin (2008), CDMs are also confirmatory in that how attributes interact in the response process should be specified a priori; that is, whether attributes combine in a compensatory or conjunctive relationship (see below for more explanation of the terms) to produce the correct answer should be specified in advance. The process of selecting the right model (i.e., either compensatory or conjunctive) should be informed by domain theories or extant literature. Finally, CDMs are probabilistic in two ways: They express (a) a given respondent’s performance level in terms of the probability of mastery of each one of the postulated attributes separately and (b) the probability of his or her having a specific skill mastery profile or belonging to each latent class. Take the present study, for example, where five subskills (K = 5) were postulated to underlie performance on the test under study. With five subskills, there are 2⁵ = 32 skill mastery profiles,² each representing a latent class. As the second row of Table 8 shows (see below), for Respondent 2, the chances are 43% and 16% that he belonged to Latent Classes 32 and 31, respectively. The probabilities of belonging to Latent Classes 11, 12, 15, 16, 25, 27, and 28 are about .03, .01, .08, .06, .08, .07, and .07, respectively, and about 0 for the rest of the latent classes. Therefore, he is assigned to the latent class with the highest probability, that is, Latent Class 32. As Table 9 shows (see below), the probabilities that Respondent 14238, for example, with the skill profile of [01011] has mastered Attributes 1 to 5 are .58, .91, .74, 1.00, and .82, respectively.

Depending on whether or not CDMs specify inter-skill relationships a priori, they are classified into two groups: general and specific, as shown in Table 1. General CDMs allow for both compensatory and noncompensatory relationships within the same test. The strength of the general models is that they allow each item to pick the model that best fits it rather than force-assigning a single model to all the items. Specific CDMs, on the other hand, allow for either compensatory or noncompensatory relationships, but not both, in the same test. Most of the specific models are subsumed under the general models.

Table 1.

CDM Types.

	CDM type	Examples	Author(s)
Specific	Compensatory	1. Deterministic-input, noisy-or-gate model (DINO)	Templin and Henson (2006)
		2. Compensatory reparameterized unified model (C-RUM)	S. M. Hartz (2002)
		3. Additive CDM (ACDM)	de la Torre (2011)
	Noncompensatory	1. Deterministic-input, noisy-and-gate model (DINA)	Junker and Sijtsma (2001)
		2. Noncompensatory reparameterized unified model (NC-RUM)	DiBello, Stout, and Roussos (1995); S. M. Hartz (2002)
General	Both compensatory and noncompensatory	1. General diagnostic model (GDM)	von Davier (2005)
		2. Log-linear CDM (LCDM)	Henson, Templin, and Willse (2009)
		3. Generalized deterministic-input, noisy-and-gate model (G-DINA)	de la Torre (2011)

In compensatory models, mastery of one or some of the attributes required to get an item right can compensate for nonmastery of the other attributes. In contrast, in noncompensatory or conjunctive models, lack of mastery of one attribute cannot be completely compensated for by mastery of the other attributes in terms of item performance.
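The contrast can be illustrated with the two simplest specific models in Table 1. The following is a minimal sketch, assuming the usual guess-and-slip formulation of the DINA and DINO models; the function names, the item, and the parameter values are hypothetical and are not estimates from this study.

```python
def dina_prob(alpha, q_row, guess, slip):
    """DINA (conjunctive): success requires ALL attributes flagged in q_row."""
    has_all = all(a == 1 for a, r in zip(alpha, q_row) if r == 1)
    return 1 - slip if has_all else guess


def dino_prob(alpha, q_row, guess, slip):
    """DINO (compensatory): ANY one attribute flagged in q_row is enough."""
    has_any = any(a == 1 for a, r in zip(alpha, q_row) if r == 1)
    return 1 - slip if has_any else guess


# Hypothetical item requiring Attributes 1 and 2 out of three attributes.
q_row = (1, 1, 0)
guess, slip = 0.2, 0.1  # illustrative values only

for alpha in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]:
    print(alpha,
          dina_prob(alpha, q_row, guess, slip),
          dino_prob(alpha, q_row, guess, slip))
```

Under these assumed values, a test taker who has mastered only one of the two required attributes stays at the guessing level under the DINA but reaches the full success probability under the DINO.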

Review of the Literature

CDMs have been used in two ways: (a) retrofitting (post hoc analysis) of existing non-diagnostic tests to extract richer information and (b) designing a set of items or tasks from the beginning for diagnostic purposes. Most of the applications of CDMs in educational measurement in general and language testing in particular are cases of retrospective specification (post hoc analysis) of the knowledge and skills evaluated by existing non-diagnostic tests. CDM studies in the literature have focused mostly on psychometric modeling rather than the actual application of CDMs. Most of the few applications have involved math data and been add-ons to simulation studies (e.g., J. Chen & de la Torre, 2013; J. Chen, de la Torre, & Zhang, 2013; Cui, Gierl, & Chang, 2012; de la Torre, 2009; de la Torre & Douglas, 2004; Henson, Templin, & Willse, 2009; Hou, de la Torre, & Nandakumar, 2014; Templin & Bradshaw, 2013; von Davier, 2014). Henson (2009) argues,

Although this direction was necessary as a first step toward establishing a very basic set of statistical principles, the growing emphasis of the methodology for diagnostic classification models (DCMs) is now on providing evidence that these models, in application, can provide the information that has been promised. (p. 34)

Most of the applications of CDMs in educational assessment in general and language assessment in particular have involved the Rule Space Model (e.g., Buck & Tatsuoka, 1998; Buck, Tatsuoka, & Kostin, 1997; Buck, VanEssen, Tatsuoka, Kostin, Lutz, & Phelps, 1998; Kasai, 1997; Kasai & Saito, 1996; Scott, 1998) and the Fusion Model (e.g., Jang, 2009; A. Y. A. Kim, 2015; Y. H. Kim, 2011; Li, 2011; Li & Suen, 2013; Sawaki, Kim, & Gentile, 2009). However, other models have also been applied. von Davier (2005) applied the general diagnostic model (GDM) to the reading and listening sections of Test of English as a Foreign Language (TOEFL). Ravand, Barati, and Widhiarso (2012) applied the DINA model to second language (L2) reading data.

However, as DiBello, Roussos, and Stout (2007) note, both compensatory and noncompensatory models assumed by specific CDMs make simplifying assumptions about the relationships between attribute mastery and response probability. They further argue,

These types of [simplifying] assumptions reduce the number of item parameters to be estimated, thus reducing standard errors of estimation . . . this can be especially useful when the total number of items measuring a given attribute is small or if the number of examinees in the sample is small. But these kinds of parameter reduction may also introduce unwanted bias if the assumptions are not warranted. (DiBello et al., 2007, p. 985)

According to de la Torre and Lee (2013), employing general models is helpful in that “(a) CDMs need not be specified a priori, and (b) multiple, statistically determined CDMs can be used within a single assessment” (p. 370).

There are few studies investigating the application of the general CDMs. von Davier (2005) applied the GDM to L2 data. Templin and Hoffman (2013) demonstrated the application of the log-linear CDM (LCDM) to the grammar section of the TOEFL. To the best of the author’s knowledge, few, if any, studies have demonstrated the application of the G-DINA. An advantage of the G-DINA is that, unlike the GDM and LCDM, which require a restricted research license and a commercial software program, respectively, the G-DINA can be estimated with the free software program R and with the Ox (Doornik, 2007) code prepared by de la Torre, available by contacting Jimmy de la Torre at j.delatorre@rutgers.edu.

Therefore, the present study intends to demonstrate the application of the G-DINA to a high-stakes reading comprehension test. Throughout, the flexibilities of the model are demonstrated and focused upon, and the outputs are interpreted and issues related to the application of the model are discussed. Specifically, how the G-DINA can be used to inform inter-skill relationships in reading comprehension is discussed.

G-DINA Model

The G-DINA was proposed by de la Torre (2011) as a generalization of the DINA model. The DINA model is a noncompensatory model, which classifies test takers into two groups for each item: those who have mastered all the subskills required by item j (ξ_ij = 1) and those who have not mastered at least one of the required attributes (ξ_ij = 0). According to the DINA model, “lacking one required attribute for an item is the same as lacking all the required attributes for the item” (de la Torre, 2011, p. 179). de la Torre (2011) argues that this assumption might not hold for the group ξ_ij = 0. Unlike the DINA model, the G-DINA does not assume an equal probability of success for all those who lack one, some, or all of the required attributes for an item.

The modeling approach adopted by the G-DINA is analogous to that of analysis of variance (ANOVA): The probability of success is expressed as a set of main and interaction effects. Specific CDMs are derived from the G-DINA by removing the main and/or interaction effects. For an item j that requires two attributes, $α_{1}$ and $α_{2}$, the probability in the G-DINA model that student i answers the item correctly is defined as follows:

$$P(X_{ij} = 1 \mid α_{1}, \dots, α_{K}) = δ_{j0} + δ_{j1} α_{1} + δ_{j2} α_{2} + δ_{j12} α_{1} α_{2}$$

The parameter $δ_{j0}$ is the item intercept, that is, the probability of a correct answer to an item when none of the required attributes for the item has been mastered. For two attributes, there are two main effects, $δ_{j1}$ and $δ_{j2}$, and one interaction effect, $δ_{j12}$.
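The formula can be made concrete with a short sketch. It assumes the identity-link form given above and generalizes it to any number of required attributes by summing the effects of every subset of mastered attributes; the function name `gdina_prob` and all δ values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def gdina_prob(alpha, delta):
    """Identity-link G-DINA success probability for one item.

    alpha: tuple of 0/1 mastery indicators for the item's required attributes.
    delta: dict keyed by tuples of attribute positions; () is the intercept,
           (0,) a main effect, (0, 1) a two-way interaction, and so on.
    """
    mastered = [k for k, a in enumerate(alpha) if a == 1]
    prob = 0.0
    # Add the effect of every subset of the mastered attributes.
    for size in range(len(mastered) + 1):
        for subset in combinations(mastered, size):
            prob += delta.get(subset, 0.0)
    return prob


# Hypothetical saturated item with two required attributes.
saturated = {(): 0.1, (0,): 0.1, (1,): 0.3, (0, 1): 0.2}
# Dropping the main effects leaves a DINA-like item: only joint mastery helps.
dina_like = {(): 0.1, (0, 1): 0.4}

for alpha in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(alpha, round(gdina_prob(alpha, saturated), 2),
          round(gdina_prob(alpha, dina_like), 2))
```

With the main effects removed, the probability stays at the intercept unless both attributes are mastered, which is how a specific model is obtained as a constrained version of the general one.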

Method

Data

The test analyzed in this study is the reading comprehension section of the Iranian National University Entrance Examination (INUEE), a four-option multiple-choice high-stakes test held annually to admit candidates to master’s programs in English studies. The test is an advanced assessment designed for candidates holding a bachelor’s degree who seek to pursue a master’s degree in state universities. The test is composed of two sections: content knowledge and general English (GE). The GE section consists of four parts: grammar (10 items), vocabulary (20 items), cloze (10 items), and reading comprehension (20 items). The candidates are supposed to answer the test in 60 min. The 20 reading comprehension items and a sample of 10,000 candidates (69% females and 31% males) who took the test in 2012 were selected for this study. The participants were mostly aged between 22 and 25 years.

For the purpose of the present study, the item response data were randomly divided into two groups: a calibration group and a validation group (terms borrowed from multigroup FA). First, the adequacy of the initially specified subskills was explored using the calibration sample; then, the final set of subskills was validated with the validation sample.
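A random split of this kind can be sketched as follows. This is only an illustration: the function name, the seed, and the use of consecutive integers as respondent identifiers are assumptions, and the study does not report how the split was actually carried out.

```python
import random


def split_half(respondent_ids, seed=2012):
    """Randomly split respondents into calibration and validation halves."""
    ids = list(respondent_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical identifiers for a sample of 10,000 candidates.
calibration, validation = split_half(range(1, 10001))
print(len(calibration), len(validation))
```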

Q-Matrix Construction

The quality of a diagnostic assessment is affected by how correctly the subskills underlying performance on the items of any given test have been specified. To define the attributes involved in a test, various sources such as test specifications, content domain theories, analysis of item content, think-aloud protocol analysis of examinees’ test-taking process, and the results of relevant research in the literature can be consulted (Embretson, 1991; Leighton & Gierl, 2007; Leighton, Gierl, & Hunka, 2004). According to Lee and Sawaki (2009a), in CDMs retrofitted to existing non-diagnostic tests, where a detailed cognitive model of task performance is not available, “brainstorming about possible attributes that elaborate on an existing test specification might serve as a good point of departure” (p. 176). Because the test employed in the present study had not been developed for diagnostic purposes, the author took the following steps to ensure, as much as possible, that the subskills identified were valid: (a) The author invited two university instructors to brainstorm on the possible attributes measured by the test, (b) three other university instructors and three master’s students were invited to independently specify the attributes measured by each item, (c) the Q-matrix was empirically validated and revised, and (d) the final Q-matrix was cross-validated with the other half of the sample (i.e., validation group). Each one of the steps is explained in detail below.

The author invited three university instructors, who had been teaching reading comprehension at BA level to English majors for at least 10 years, to identify the possible attributes measured by the test. They specified a set of five attributes: reading for details, reading for inference, reading for main idea (henceforth referred to as Detail, Inference, and Main idea), Syntax, and Vocabulary. Three other university instructors holding PhDs, with more than 5 years of experience teaching reading comprehension, and three master’s students studying English Language Teaching, who had taken the same test a year before to enter the master’s program, were invited to independently specify the attributes measured by each of the 20 reading comprehension items. They were trained for a session on how to code the attributes measured by each item. Then they read the passages and independently coded the items for the major attributes they utilized to respond to each item. An initial Q-matrix (Tatsuoka, 1985) was developed. Attributes on which at least two thirds (i.e., four) of the coders agreed were included in the Q-matrix. A Fleiss’ kappa of .59 indicated moderate agreement among the coders (Landis & Koch, 1977).
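The agreement statistic and the inclusion rule can be sketched as follows. The sketch uses the standard Fleiss' kappa formula and assumes that each item-by-attribute cell is treated as one rated subject with two categories (required, not required); the study does not state how the ratings were arranged, and the counts below are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters putting subject i in category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement for each subject, then averaged.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the marginal category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


def keep_attribute(votes, n_coders=6):
    """Two-thirds rule: keep the attribute if at least 2/3 of the coders chose it."""
    return 3 * votes >= 2 * n_coders


# Hypothetical codings: [coders saying "required", coders saying "not required"].
ratings = [[6, 0], [5, 1], [4, 2], [0, 6], [1, 5], [3, 3]]
print(round(fleiss_kappa(ratings), 2))
print([keep_attribute(row[0]) for row in ratings])
```

With six coders, the rule keeps an attribute for an item when four or more coders selected it.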

The disagreements mostly concerned whether, in addition to attributes such as Detail, Main idea, and Inference, Vocabulary and Syntax were also required for correctly answering some of the items. The reading passages were relatively long, with many difficult words and, in some cases, complicated sentences, and the student judges perceived them as very difficult. The author decided to resolve the disagreements in favor of the students’ codings for three reasons: (a) Student judges had the experience of taking the same test in a high-stakes context, so their codings were thought to be more indicative of the real processes involved in reading comprehension and hence more reliable; (b) as noted by Leighton and Gierl (2007), expert judges’ ability is usually well above that of the students, and the students do not necessarily follow the same processes as specified by expert judges; and (c) the follow-up empirical validation of the Q-matrix would indicate whether the skill was consequential for correctly answering the item or not.

The “initial” Q-matrix is presented in Table 2. In this table, 1s indicate that the item requires the attribute whereas 0s indicate that the item does not require the attribute.

Table 2.

Initial Q-Matrix.

Item	Detail	Inference	Main idea	Syntax	Vocabulary
1	0	1	0	1	1
2	1	0	0	0	1
3	0	1	0	0	1
4	0	0	1	1	1
5	0	0	0	0	1
6	0	0	0	1	1
7	0	0	0	1	1
8	0	0	1	0	1
9	0	1	0	0	0
10	1	0	0	0	0
11	0	1	0	0	1
12	1	1	0	0	1
13	0	0	0	1	1
14	0	0	1	0	0
15	0	1	0	0	0
16	0	1	0	0	1
17	0	1	0	0	1
18	1	0	0	0	1
19	0	0	0	1	1
20	0	0	0	1	0

In the next step, the Q-matrix was empirically validated with half of the data (i.e., the calibration sample). The adequacy of the Q-matrix was explored through the procedure suggested by de la Torre and Chiu (2010) using code written in Ox (Doornik, 2007).

Q-Matrix Revision

In the first run of the Ox code, the following suggestions for the Q-matrix revision were provided: For Item 2, it was suggested that Detail, specified by judges as one of the requirements of the item, be removed from the Q-matrix. Had the attribute been removed, the only attribute remaining in the Q-matrix for Item 2 would have been Vocabulary. Believing that statistical analysis cannot be the only driving force for Q-matrix revision, both the author and the judges further inspected the content of the item. After this inspection, the student and instructor judges and the author unanimously agreed that Detail was required for the item; hence, it was kept in the Q-matrix. For Item 9, addition of Vocabulary was suggested. Conversely, for Items 11, 16, 17, and 18, for which Vocabulary had initially been specified alongside Inference or Detail (see Table 2), the suggestion was to remove Vocabulary from the Q-matrix. I revised the Q-matrix accordingly and reran the Ox code.

After several rounds of revisions and rerunning the Ox, I came up with the Q-matrix presented in Table 3. Four items were affiliated with Attribute 1 (Detail), eight items with Attribute 2 (Inference), three items with Attribute 3 (Main idea), six items with Attribute 4 (Syntax), and finally, 12 items were associated with Attribute 5 (Vocabulary).

Table 3.

Final Q-Matrix.

Item	Detail	Inference	Main idea	Syntax	Vocabulary
1	0	1	0	0	1
2	1	0	0	0	1
3	0	1	0	0	1
4	0	0	1	1	1
5	0	0	0	0	1
6	0	0	0	1	1
7	0	0	0	1	1
8	0	0	1	0	1
9	0	1	0	0	1
10	1	0	0	0	0
11	0	1	0	0	0
12	1	1	0	0	1
13	0	0	0	1	1
14	0	0	1	0	0
15	0	1	0	0	0
16	0	1	0	0	0
17	0	1	0	0	0
18	1	0	0	0	0
19	0	0	0	1	1
20	0	0	0	1	0
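The tallies reported for the final Q-matrix, and the entries that changed between Tables 2 and 3, can be checked with a few lines of code. This is a minimal sketch in which each item's row is transcribed as a string of 0s and 1s; the variable names are illustrative only.

```python
ATTRIBUTES = ["Detail", "Inference", "Main idea", "Syntax", "Vocabulary"]

# One string per item (Items 1-20), transcribed from Tables 2 and 3.
INITIAL_Q = [
    "01011", "10001", "01001", "00111", "00001",
    "00011", "00011", "00101", "01000", "10000",
    "01001", "11001", "00011", "00100", "01000",
    "01001", "01001", "10001", "00011", "00010",
]
FINAL_Q = [
    "01001", "10001", "01001", "00111", "00001",
    "00011", "00011", "00101", "01001", "10000",
    "01000", "11001", "00011", "00100", "01000",
    "01000", "01000", "10000", "00011", "00010",
]

# Number of items measuring each attribute in the final Q-matrix.
items_per_attribute = {
    name: sum(int(row[k]) for row in FINAL_Q)
    for k, name in enumerate(ATTRIBUTES)
}

# Items that require more than one attribute.
multi_attribute_items = [
    item for item, row in enumerate(FINAL_Q, start=1) if row.count("1") > 1
]

# Entries that differ between the initial and the final Q-matrix.
changes = [
    (item, ATTRIBUTES[k], "added" if new[k] == "1" else "removed")
    for item, (old, new) in enumerate(zip(INITIAL_Q, FINAL_Q), start=1)
    for k in range(len(ATTRIBUTES))
    if old[k] != new[k]
]

print(items_per_attribute)
print(multi_attribute_items)
print(changes)
```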

Purely relying on the items and the data to come up with the final Q-matrix is like an exploratory study that might capitalize on chance: It is almost always possible to come up with a set of subskills that fit the data. To ensure that the subskills are still meaningful in other contexts, I used the second half of the sample (i.e., the validation sample) to explore the meaningfulness of the Q-matrix. In the first run with the “initial” Q-matrix, the suggestions for revision were exactly the same as the ones suggested in the first run with the calibration sample. Therefore, in the next run, the “final” Q-matrix, obtained in the last run of the Ox code with the calibration sample, was used. The suggestions were exactly the same as the ones suggested by Ox in the last run with the calibration sample. The conclusion was that the Q-matrix also held for the second half of the sample.

Model Fit

As with any statistical model, the results of a CDM are meaningless if the model fits the data poorly. Fit of a model can be ascertained in two ways: checking fit of the model to the data (i.e., absolute fit) and comparing the model with other rival models (i.e., relative fit). For the purpose of the present study, the absolute fit of the G-DINA was evaluated by comparing the observed and model-predicted statistics and inspecting the classification consistency and accuracy of the model.

The following fit indices were inspected to check the fit of the model:

Mx2 (W. H. Chen & Thissen, 1997), a test of global model fit that uses the test statistics of all item pairs. It is the mean difference between the model-predicted and observed response frequencies. Large differences are taken as evidence that there are dependencies between the items. Because respondents draw upon the same cognitive processes to respond to the items, dependencies are expected. But if a CDM fits the data well, “the χ² test statistic is expected to be 0 within each latent class as the attribute profile of the respondents would perfectly predict the observed response patterns” (Rupp, Templin, & Henson, 2010, p. 269).

The mean absolute difference for the item-pair correlations (MADcor) statistic (DiBello et al., 2007). It is the difference between the observed and the model-predicted item correlations.

Mean residual covariance (MADRESIDCOV). MADRESIDCOV (McDonald & Mok, 1995) is the mean difference between matrices of observed and reproduced item correlations.

Q3 (MADQ3) statistic (Yen, 1984). MADQ3 is calculated by subtracting the model-predicted from the observed responses of the respondents and computing the average of the pairwise correlation of these residuals.

The average root mean square error of approximation (RMSEA) for the item parameters.

Classification consistency (P_c) and accuracy (P_a) (Cui et al., 2012). P_c and P_a refer to the reliability and validity of the examinees’ classification into the latent classes or master/nonmaster of each separate skill. P_c is an indicator of the degree to which an examinee is consistently classified into the same latent class or will be indicated as master/nonmaster of the same attribute on re-administration of the same or a parallel form of the test while P_a refers to the degree to which an examinee’s classification matches his true latent class or he is truly identified as master/nonmaster of any given attribute.
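As an illustration of how the correlation-based indices above are formed, the following sketch computes a MADcor-type statistic as the mean absolute difference between observed and model-predicted correlations over all item pairs. The two correlation matrices are hypothetical, and the sketch is not the routine used by the “CDM” package.

```python
from itertools import combinations


def madcor(observed, predicted):
    """Mean absolute difference between two correlation matrices over unique item pairs."""
    n_items = len(observed)
    pairs = list(combinations(range(n_items), 2))
    total = sum(abs(observed[i][j] - predicted[i][j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical observed and model-predicted correlations for three items.
observed = [
    [1.00, 0.30, 0.25],
    [0.30, 1.00, 0.20],
    [0.25, 0.20, 1.00],
]
predicted = [
    [1.00, 0.28, 0.26],
    [0.28, 1.00, 0.21],
    [0.26, 0.21, 1.00],
]

print(round(madcor(observed, predicted), 4))
```

Values close to zero mean that the model reproduces the observed item-pair correlations closely.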

As Table 4 shows, the “CDM” package provides a test of significance for Mx2. A nonsignificant value (p > .05) indicates good fit. The value of Mx2 in the present study was 7.73, which is not significant (p = .12). There are no hard-and-fast rules for most of the other model fit indices, which are based on observed and model-predicted statistics. For all these indices, the closer the value is to zero, the better the model fits. The MADcor in the present study was 0.006. DiBello et al. (2007) considered a MADcor of 0.049 in Jang (2005) and Roussos et al. (Roussos, DiBello, Henson, Jang, & Templin, 2006; Roussos, DiBello, & Stout, 2006) as suggesting a good fit of the CDM to the data. For MADRESIDCOV, MADQ3, and RMSEA, values below .05 show good fit. Except for the MADRESIDCOV value, which was .12, the other indices were well below .05, indicating good fit of the G-DINA to the data.

Table 4.

G-DINA Fit Statistics.

Fit index	Estimate	Significance
Mx2	7.73	.12
MADcor	0.006	—
MADRESIDCOV	0.12	—
MADQ3	0.025	—
RMSEA	0.01	—

As the first row of Table 5 shows, the P_a and P_c values for the whole latent class pattern in the present study were .81 and .73, respectively. The other rows of Table 5 display the degree to which the test takers were consistently and accurately classified as masters and nonmasters of each separate skill. The values for all the skills were relatively high. There are no clear-cut criteria for P_a and P_c values. C. Ying (personal communication, November 12, 2013) suggested a value of .7 or .8 for the P_a and P_c as acceptable classification rates. In the light of the results obtained by Cui et al. (2012), .68 and .52 for P_a and P_c, respectively, for the fraction subtraction data (Tatsuoka, 2002), and also considering Ying’s suggestion, reliability and validity of the classifications in the present study are acceptable.

Table 5.

Classification Consistency (P_c) and Accuracies (P_a).

	G-DINA
P_a	.81
P_c	.73
P_a Skill1	.71
P_c Skill1	.94
P_a Skill2	.86
P_c Skill2	.95
P_a Skill3	.69
P_c Skill3	.89
P_a Skill4	.94
P_c Skill4	.88
P_a Skill5	.83
P_c Skill5	.81

Data Analysis

Data were analyzed using the “CDM” package, Version 3.4-4 (Robitzsch, Kiefer, George, & Uenlue, 2014), in R. The “CDM” package employs marginal maximum likelihood estimation using the expectation-maximization algorithm. The results showed that Syntax, mastered by about 73% of the test takers, was the easiest attribute, followed by Vocabulary, Detail, Main idea, and Inference, mastered by 64%, 60%, 54%, and 50% of the test takers, respectively. As explained above, CDMs group test takers into 2^K latent classes. In the present study, as Table 6 shows, test takers were classified into 2⁵ = 32 latent classes. For space considerations, data for only the first three and the last three latent classes are presented in the table. The second column of the table shows the possible attribute profiles for the latent classes.

Table 6.

Class Probabilities.

Latent class	Skill profile	Class p	Class expected frequency
1	00000	.149	3223.178
2	10000	.001	18.088
3	01000	.004	81.380
—	—	—	—
30	10111	.009	198.654
31	01111	.018	388.697
32	11111	.373	8073.891

As the third column of Table 6 reads, the attribute profile of α₃₂ = [11111] had the highest class probability. Approximately 37% of the respondents (as shown in the last column, about 8073 respondents) in the present study were classified as belonging to this last latent class hence expected to have mastered all of the five attributes. Skill profile of α₁ = [00000] had the second highest class probability of about .15 indicating that approximately 15% (about 3223 respondents) of the test takers were expected to have mastered none of the attributes.

To save space, the G-DINA parameter estimates for only the first two items of the reading comprehension test are displayed in Table 7. The second column represents the attributes required by any item, the third column displays the skill mastery patterns, and the fourth column represents the probability of success on each item due to mastery of the attributes required by any item on the test. The number of parameters estimated for each item is a function of the number of attributes required by that item. Because G-DINA is a saturated CDM, all the main effects for the attributes and their possible interactions are estimated. For example, for Items 1, 2, 3, 6, 7, 8, 9, 13, and 19, which required two attributes, four parameters were estimated each: one intercept, two main effects for the attributes, and one interaction effect. For Items 4 and 12, which required three attributes, eight parameters were estimated each: one intercept, three main effects, and four interaction effects. The intercept parameters show the probability of answering each item correctly when none of the attributes required by the item has been mastered. The main effects show the increase in the probability of correctly answering each item when any of the attributes has been mastered and the interaction effects show the increase in the probability when a combination of the attributes has been mastered.
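The parameter counts described above follow from the fact that a saturated item with K_j required attributes has 2^K_j parameters. The sketch below splits that total into intercept, main effects, and interactions; the function name is illustrative, and the tally of items by number of required attributes is taken from Table 3.

```python
from math import comb


def gdina_parameter_counts(n_required):
    """Split the 2**n_required parameters of a saturated item by effect type."""
    interactions = sum(
        comb(n_required, order) for order in range(2, n_required + 1)
    )
    return {"intercept": 1, "main": n_required, "interaction": interactions}


# Number of items by number of required attributes, tallied from Table 3.
items_by_size = {1: 9, 2: 9, 3: 2}

total = 0
for size, n_items in items_by_size.items():
    counts = gdina_parameter_counts(size)
    per_item = sum(counts.values())  # equals 2 ** size
    total += per_item * n_items
    print(size, counts, per_item)

print("total item parameters:", total)
```

For three required attributes, the four interaction effects are the three two-way interactions and the single three-way interaction.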

Table 7.

G-DINA Parameters.

Item number	partype.attr	skillcomb	p
1	V2-V5	A00	.11
1	V2-V5	A10	.18
1	V2-V5	A01	.42
1	V2-V5	A11	.60
2	V1-V5	A00	.10
2	V1-V5	A10	.12
2	V1-V5	A01	.13
2	V1-V5	A11	.53

Note. V1 to V5 are Detail, Inference, Main Idea, Syntax, and Vocabulary, respectively.

partype.attr refers to the subskills required by each item.

skillcomb refers to the mastery status of the required subskills: 1 = mastery, 0 = nonmastery.

As Table 7 shows, those who had not mastered either of the attributes required by Item 1, namely Inference and Vocabulary, had about an 11% chance of guessing and getting the item right. Those who had mastered only Inference had an 18% chance of success, that is, .18 − .11 = .07 higher than those who had mastered neither attribute. Those who had mastered only Vocabulary had a 42% chance of success, an increase of .42 − .11 = .31. Mastery of Vocabulary thus increased success on the item more than mastery of Inference, indicating that Vocabulary discriminated more between its masters and nonmasters. For masters of both attributes, the probability of getting the item right (i.e., of not slipping) was .60, so that mastery of both Inference and Vocabulary added .60 − .11 − .07 − .31 = .11 beyond the two main effects. A point worthy of note is that the probabilities for any given item should not add up to one because they are conditional probabilities.
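The link between the pattern probabilities in Table 7 and the G-DINA effects can be sketched as follows. The sketch assumes that the p column holds the conditional probability of success for each mastery pattern, as the column description states, and that the identity-link form of the model applies; the function name is hypothetical.

```python
def deltas_from_probs(p00, p10, p01, p11):
    """Recover identity-link G-DINA effects for a two-attribute item.

    p00, p10, p01, p11: success probabilities for the patterns A00, A10, A01, A11.
    Returns (intercept, main effect 1, main effect 2, interaction).
    """
    d0 = p00
    d1 = p10 - p00
    d2 = p01 - p00
    d12 = p11 - p10 - p01 + p00
    return d0, d1, d2, d12


# Pattern probabilities copied from Table 7.
item1 = deltas_from_probs(0.11, 0.18, 0.42, 0.60)
item2 = deltas_from_probs(0.10, 0.12, 0.13, 0.53)

print([round(d, 2) for d in item1])
print([round(d, 2) for d in item2])
```

Under these assumptions, Item 2 has small main effects and a large interaction, so success depends mainly on mastering both attributes, whereas most of the gain on Item 1 comes from the main effects.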

Table 8 displays, for five respondents, the probabilities of belonging to each one of the 32 latent classes. In the table, the values for each respondent with the given response pattern represent the posterior probability that he belongs to latent class c with the given skill profile. For example, for Respondent 2, the chances are 43% and 16% that he belonged to Latent Classes 32 and 31, respectively. Put another way, there is a 43% chance that he has mastered all five attributes and a 16% chance that he has mastered Inference, Main idea, Syntax, and Vocabulary.

Table 8.

Class Probabilities for Respondents.

Response pattern	Class 1	Class 2	Class 3	Class 4	Class 5	Class 6	Class 7	Class 8	Class 9	Class 10	Class 11	Class 12	Class 13	Class 14	Class 15	Class 16	Class 17	Class 18	Class 19	Class 20	Class 21	Class 22	Class 23	Class 24	Class 25	Class 26	Class 27	Class 28	Class 29	Class 30	Class 31	Class 32
00000000000000000000	.98	.00	.00	.00	.00	.00	.00	.00	.01	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00
11100100001000100000	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.03	.01	.00	.00	.08	.06	.00	.00	.00	.00	.00	.00	.00	.00	.08	.00	.07	.07	.00	.00	.16	.43
01000000010110010000	.00	.00	.00	.00	.00	.00	.00	.00	.02	.02	.02	.06	.00	.00	.07	.47	.00	.00	.00	.00	.00	.00	.00	.00	.03	.02	.00	.04	.00	.00	.00	.23
10000000000000000000	.10	.00	.00	.00	.00	.00	.00	.00	.73	.00	.01	.00	.06	.00	.03	.01	.00	.00	.00	.00	.00	.00	.00	.00	.06	.00	.00	.00	.00	.00	.00	.00
00000100001000000000	.93	.00	.00	.00	.00	.00	.00	.00	.02	.00	.00	.00	.00	.00	.00	.00	.04	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00	.00
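As a minimal sketch of how a row of Table 8 can be read, the snippet below takes the nonzero entries in the second row and ranks the latent classes by posterior probability. The values are copied from the table; the variable names are hypothetical and not part of the “CDM” package output.

```python
# Posterior class probabilities for the second respondent in Table 8.
# Classes not listed here have a reported probability of .00.
reported = {11: 0.03, 12: 0.01, 15: 0.08, 16: 0.06,
            25: 0.08, 27: 0.07, 28: 0.07, 31: 0.16, 32: 0.43}

posterior = [0.0] * 32
for cls, p in reported.items():
    posterior[cls - 1] = p  # classes are numbered from 1

# Rank classes from most to least probable and keep the top two.
ranked = sorted(range(1, 33), key=lambda c: posterior[c - 1], reverse=True)
top_two = [(c, posterior[c - 1]) for c in ranked[:2]]

# The rounded table entries need not sum to exactly 1.
total = round(sum(posterior), 2)

print(top_two)
print(total)
```

The two most probable classes are the ones singled out in the text, and the row total falls slightly short of 1 only because the table entries are rounded to two decimals.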

The “CDM” package also generates the probability that each test taker has mastered each of the subskills involved in answering the items of a given test. To save space, only a slice of the output is presented in Table 9. It shows, for each respondent with the given ID, response pattern, and skill profile, the probability of having mastered Attributes 1 to 5. As the table shows, the probabilities that Respondent 14238, with the skill profile [01011], has mastered Attributes 1 to 5 are .58, .91, .74, 1.00, and .82, respectively.

Table 9.

Skill Mastery Probabilities.

ID	Pattern	Skill profile	p	Attribute1	Attribute2	Attribute3	Attribute4	Attribute5
1	00000000000000000000	00000	.98	.00	.002	.003	.01	.004
14238	11100100001000100000	01011	.43	.58	.91	.74	1.00	.82
6085	01000000010110010000	10010	.47	.84	.90	.78	1.00	.32
282	10000000000000000000	00010	.73	.01	.05	.10	.90	.08
153	00000100001000000000	00000	.93	.00	.00	.01	.02	.04
4345	00001111100110000000	00001	.81	.09	.06	.08	.12	1.00
1875	11001100000011000000	01111	.51	.54	.73	.91	1.00	.98
1099	00001100110111000010	10111	.88	.95	.93	.98	1.00	.95
362	00001110100010000001	00011	.77	.04	.04	.09	.91	.96
1265	00101110000010000011	00011	.87	.03	.02	.07	.99	.97
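One common way to turn attribute-level mastery probabilities such as those in Table 9 into a mastery/nonmastery classification is to apply a cutoff. The sketch below assumes a cutoff of .5, which the article does not state, and uses two rows copied from the table; the function name is hypothetical. A profile obtained attribute by attribute in this way need not coincide with the single most probable latent class.

```python
def classify(probabilities, cutoff=0.5):
    """Return a mastery profile string: '1' where probability >= cutoff, else '0'."""
    return "".join("1" if p >= cutoff else "0" for p in probabilities)

# Attribute mastery probabilities (Attributes 1 to 5) for two respondents in Table 9.
table9 = {
    "282": [0.01, 0.05, 0.10, 0.90, 0.08],
    "362": [0.04, 0.04, 0.09, 0.91, 0.96],
}

profiles = {rid: classify(probs) for rid, probs in table9.items()}
print(profiles)
```

For these two respondents the thresholded profiles agree with the skill profiles listed in the table.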

As argued earlier, general models allow different CDMs to hold for different items on the same test. de la Torre and Lee (2013) used the Wald test to choose, on statistical grounds, the best-fitting model for each multi-attribute item. Specifically, the function developed by de la Torre and Lee evaluates the fit of the G-DINA, at item level, against that of the DINA, the deterministic-input, noisy-or-gate (DINO) model, and the additive CDM (ACDM). The results of the Wald test, run in Ox, showed that among the 11 items that required more than one subskill, the DINO fit Items 6, 7, 13, and 19; the ACDM fit Items 1, 3, and 9; the DINA fit Items 2 and 8; and the G-DINA fit the other three multi-attribute items.
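To make the contrast among the three reduced models concrete, the sketch below computes success probabilities for a two-attribute item under the standard DINA, DINO, and identity-link ACDM item response functions. All parameter values (guess, slip, intercept, main effects) are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
from itertools import product

def dina(alpha, guess, slip):
    # Noncompensatory: success is likely only if ALL required attributes are mastered.
    return 1 - slip if all(alpha) else guess

def dino(alpha, guess, slip):
    # Compensatory: mastering ANY one required attribute is enough.
    return 1 - slip if any(alpha) else guess

def acdm(alpha, intercept, main_effects):
    # Additive: each mastered attribute contributes its own main effect, no interaction.
    return intercept + sum(d * a for d, a in zip(main_effects, alpha))

# Hypothetical item parameters (illustrative only).
guess, slip = 0.2, 0.1
intercept, main_effects = 0.2, [0.3, 0.4]

table = {}
for alpha in product([0, 1], repeat=2):  # (0,0), (0,1), (1,0), (1,1)
    table[alpha] = (
        dina(alpha, guess, slip),
        dino(alpha, guess, slip),
        acdm(alpha, intercept, main_effects),
    )

for alpha, (p_dina, p_dino, p_acdm) in table.items():
    print(alpha, f"DINA={p_dina:.2f} DINO={p_dino:.2f} ACDM={p_acdm:.2f}")
```

Under the DINA only the (1, 1) group departs from the guessing level, under the DINO every group except (0, 0) does, and under the ACDM each attribute raises the probability by a fixed amount. The saturated G-DINA adds an interaction term to the additive form, which is why it can reproduce any of the three patterns.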

Discussion

The present study demonstrated the application of the G-DINA to data from a high-stakes reading comprehension test. The results showed that the two “flat” skill mastery profiles, namely “nonmaster of all skills” α1 = [00000] and “master of all skills” α32 = [11111], were the most prevalent skill profiles. This finding is in line with other CDM studies (e.g., Lee & Sawaki, 2009b; Li, 2011; Ravand et al., 2012). The prevalence of flat skill profiles can be due to high positive correlations among the attributes (Rupp et al., 2010) or to unidimensionality of the measure used, where a master of one skill tends to be a master of the other skills and a nonmaster of one tends to be a nonmaster of the others (Lee & Sawaki, 2009b). High tetrachoric correlations among the attributes were observed in the present study: Correlations among the four attributes other than Vocabulary ranged from .78 to .95, whereas correlations between Vocabulary and the other attributes ranged from .38 to .61.

As explained above, general CDMs such as the G-DINA allow model fit to be checked at two levels: the macro (test) level and the micro (item) level. At the macro level, when the G-DINA fits the data, the problem of model selection is solved. The implication of such fit is that, at least for some items, the relationship among the subskills is compensatory, for others it is noncompensatory, and for still others the relationship is not yet known. For this last group of items, the G-DINA itself fits.

At item level, specific CDMs such as the DINA, DINO, ACDM, and NC-RUM are more interpretable in terms of the relationships among the attributes, whereas general models are hard to interpret (Rojas, de la Torre, & Olea, 2012). The implication is that general models are favored at test level, whereas specific models are to be preferred at item level. At test level, general models do not view the relationships among the attributes through the limited lens of any single specific CDM; they allow different CDMs to hold for different items within the same test. In other words, general CDMs allow the researcher to hypothesize varying relationships among the attributes across the items. It is more plausible to hypothesize that this relationship changes with the difficulty of the attributes, the area of language tapped by the items, the cognitive load of the attributes (e.g., whether they tap higher or lower order thinking), and so on, than to assume the same relationship (either compensatory or noncompensatory) across all items of a test.

The results of item-level model selection showed that different models held for different multi-attribute items: the DINO fit Items 6, 7, 13, and 19; the ACDM fit Items 1, 3, and 9; the DINA fit Items 2 and 8; and the G-DINA fit the other three multi-attribute items. The relationships among the attributes are difficult to interpret for items such as Item 4, for which the G-DINA fit. However, using the G-DINA can open up new avenues of research regarding inter-skill relationships at item level. Items for which the G-DINA fit can “direct a researcher’s attention to those problems where cognition may not be well understood” (Henson et al., 2009, p. 208).

As the results of the present study showed, Syntax was the easiest attribute and Inference the most difficult. The second most difficult attribute was Main Idea, followed by Detail and Vocabulary. This hierarchy of difficulty of the L2 reading attributes concurs with previous research (Grabe & Stoller, 2002; Lumley, 1993). The findings are also in line with those of Baghaei and Ravand (2015), who applied the linear logistic test model (Fischer, 1973; see also Baghaei & Kubinger, 2015) to these data. Harding, Alderson, and Brunfaut (2015) argued that “it is probably reasonable to accept that both first language and L2 reading involve a number of different ‘levels’ of ability” (p. 4). According to Harding et al., syntax and vocabulary are lower level attributes, whereas understanding the main idea, making inferences, and understanding specific details are higher level L2 reading processes. Understanding the main idea of a reading passage involves knowledge of vocabulary, grammar, and discourse, and the use of different cognitive processes (Pressley, 2002). In a similar vein, inferencing is a complex attribute and hence difficult to master (Long, Seely, Oppy, & Golding, 1996); it involves understanding both the literal and the implied meanings of a text. Main Idea and Inference were identified as the two most difficult subskills because they involve higher level processing of the information in the passages (Grabe, 2009).

Finally, the G-DINA can also provide insights into possible misspecifications in the Q-matrix. According to the attribute plot for Item 4 (shown in Figure 1), the main effect of the second attribute required for Item 4 (Syntax) and the two- and three-way interactions of this attribute with the other attributes were all close to the intercept value; that is, the increase in the probability of a correct answer due to mastery of Syntax was nearly zero. The two-way interaction of Syntax with Detail and Vocabulary also added almost nothing to the probability of a correct answer. Although the Q-matrix validation through the general procedure suggested by de la Torre and Chiu (2010) did not suggest any misspecification in the q-vector for Item 4, the G-DINA showed that the weight associated with this attribute was very low.

Figure 1.

G-DINA attribute probability plot for Item 4.

According to Lee (2015), there are three components to diagnostic language assessment: diagnosis, feedback, and remedial learning. The first and core component involves identifying test takers’ weaknesses and their root causes, as well as their strengths, and can be carried out through CDM. Providing feedback for remedial learning through CDM can help secure the consequential validity of language assessments (Messick, 1996). One interesting area for further research is how learner characteristics (e.g., proficiency level, gender, learning style) and the granularity of the attributes affect the effectiveness of diagnostic feedback. There is, however, one important caveat. As Lee (2015) argued, because the competencies of most language learners are still developing, a distinction should be made among undeveloped, partially developed, and fully developed competencies, and at each of these stages it should be assessed whether the competencies and their constituent components are malfunctioning, partially functioning, or highly functioning. In other words, when a weakness or deficiency is identified, it should be made clear whether it reflects the stage of development (i.e., an undeveloped or partially developed competency) or whether a fully developed competency is malfunctioning or only partially functioning because of factors affecting performance.

One more point should be made before closing this section. CDM has been largely motivated by the call for more formative assessment in the No Child Left Behind Act (2001). Formative assessment is carried out repeatedly during a course of instruction to monitor learning, or change resulting from feedback and instruction. CDMs suit the purpose of formative assessment best when they are used to identify the strengths and weaknesses of language learners over time rather than being applied to one-shot assessments. Intervention-induced changes over time can be modeled with an appropriate growth model. However, growth models were proposed to model change in continuous latent abilities over repeated measures, whereas CDMs deal with binary mastery/nonmastery statuses, so conventional growth models cannot be used to measure change in these statuses. As a way around this problem, Li, Cohen, Bottge, and Templin (2015) integrated the DINA model with latent transition analysis to analyze change in binary skill mastery over time.

Limitations and Suggestions for Further Research

There is as yet no standardized method of Q-matrix development. In this study, the Q-matrix was developed by having a group of students and content experts code the reading test with reference to a list of attributes previously specified by another group of experts. A think-aloud procedure would have yielded a more authentic determination of the attributes required to perform successfully on the test. Future studies can investigate relationships among attributes of varying difficulty. Challenging the findings of Alderson and Lukmani (1989) and corroborating those of Brutten, Perkins, and Upshur (1991) and Lumley (1993), the present study found a hierarchy of attribute difficulty. Future research can examine the stability of this hierarchy across test takers of different proficiency levels (e.g., high, mid, and low).

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit. Applied Psychological Measurement, 21, 1-24.

Alderson, J. C., & Lukmani (1989). Cognition and reading: Cognitive levels as embodied in test questions. Reading in a Foreign Language, 5, 253-270.

Baghaei (2012). The application of multidimensional Rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10, 233-252.

Baghaei, & Kubinger, K. D. (2015). Linear logistic test modeling with R. Practical Assessment, Research & Evaluation, 20, 1-11. Retrieved from http://pareonline.net/getvn.asp?v=20&n=1

Baghaei, & Ravand (2015). A cognitive processing model of reading comprehension in English as a foreign language using a linear logistic test model. Learning and Individual Differences, 43, 100-105.

Brutten, S. R., Perkins, & Upshur, J. A. (1991, March). Measuring growth in ESL reading. Paper presented at the Thirteenth Annual Language Testing Research Colloquium, Princeton, NJ.

Buck, & Tatsuoka (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15, 119-157.

Buck, Tatsuoka, & Kostin (1997). The subskills of reading: Rule-space analysis of a multiple-choice test of second language reading comprehension. Language Learning, 47, 423-466.

Buck, VanEssen, Tatsuoka, Kostin, Lutz, & Phelps (1998). Development, selection and validation of a set of cognitive and linguistic attributes for the SAT I Verbal: Analogy section (Research Report, RR-98-19). Princeton, NJ: Educational Testing Service.

Chen, & de la Torre (2013). A general cognitive diagnosis model for expert-defined polytomous attributes. Applied Psychological Measurement, 37, 419-437. doi:10.1177/0146621613479818

Chen, de la Torre, & Zhang (2013). Relative and absolute fit evaluation in cognitive diagnosis modeling. Journal of Educational Measurement, 50, 123-140. doi:10.1111/j.1745-3984.2012.00185.x

Chen, W. H., & Thissen (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265-289.

Cui, Gierl, M. J., & Chang, H. H. (2012). Estimating classification consistency and accuracy for cognitive diagnostic assessment. Journal of Educational Measurement, 49, 19-38.

de la Torre (2009). DINA model and parameter estimation: A didactic. Journal of Educational and Behavioral Statistics, 34, 115-130.

de la Torre (2011). The generalized DINA model framework. Psychometrika, 76, 179-199.

de la Torre, & Chiu, C. Y. (2010, April). General empirical method of Q-matrix validation. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.

de la Torre, & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333-353.

de la Torre, & Lee, Y. S. (2013). Evaluating the Wald test for item-level comparison of saturated and reduced models in cognitive diagnosis. Journal of Educational Measurement, 50, 355-373.

DiBello, L. V., Roussos, L. A., & Stout, W. F. (2007). Review of cognitively diagnostic assessment and a summary of psychometric models. In C. R. Rao & Sinharay (Eds.), Handbook of statistics. Volume 26: Psychometrics (pp. 979-1030). Amsterdam, The Netherlands: Elsevier.

DiBello, L. V., Stout, W. F., & Roussos (1995). Unified cognitive psychometric assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361-390). Hillsdale, NJ: Lawrence Erlbaum.

Doornik, J. A. (2007). Object-oriented matrix programming using Ox (6th ed.). London, England: Timberlake Consultants Press.

Embretson, S. E. (1983). Construct validity: Construct representation vs. nomothetic span. Psychological Bulletin, 93, 179-197.

Embretson, S. E. (1991). A multidimensional latent trait model for measuring learning and change. Psychometrika, 56, 495-515.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Grabe (2009). Reading in a second language: Moving from theory to practice. Cambridge, England: Cambridge University Press.

Grabe, & Stoller (2002). Teaching and research reading. Harlow, UK: Longman.

Harding, Alderson, J. C., & Brunfaut (2015). Diagnostic assessment of reading and listening in a second or foreign language: Elaborating on diagnostic principles. Language Testing. Advance online publication. doi:10.1177/0265532214564505

Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive abilities: Blending theory with practicality (Doctoral dissertation). University of Illinois at Urbana–Champaign.

Henson, R. A. (2009). Diagnostic classification models: Thoughts and future directions. Measurement: Interdisciplinary Research and Perspectives, 7, 34-36.

Henson, R. A., Templin, J. L., & Willse, J. T. (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191-210. doi:10.1007/s11336-008-9089-5

Hou, de la Torre, J. D., & Nandakumar (2014). Differential item functioning assessment in cognitive diagnostic modeling: Application of the Wald test to investigate DIF in the DINA model. Journal of Educational Measurement, 51, 98-125.

Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching and learning in the context of NG TOEFL (Doctoral dissertation). University of Illinois at Urbana–Champaign.

Jang, E. E. (2009). Cognitive diagnostic assessment of L2 reading comprehension ability: Validity arguments for fusion model application to LanguEdge assessment. Language Testing, 26, 31-73.

Junker, B. W., & Sijtsma (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258-272.

Kasai (1997). Application of the rule space model to the reading comprehension section of the test of English as a foreign language (TOEFL) (Doctoral dissertation). University of Illinois, Urbana-Champaign, IL.

Kasai, & Saito (1996, April). The rule space model applied to the reading comprehension section of the Test of English as a Foreign Language (TOEFL). Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

Kim, A. Y. A. (2015). Exploring ways to provide diagnostic feedback with an ESL placement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing, 32, 227-258. doi:10.1177/0265532214558457

Kim, Y. H. (2011). Diagnosing EAP writing ability using the reduced reparameterized unified model. Language Testing, 28, 509-541. doi:10.1177/0265532211400860

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

Lee, Y.-W. (2015). Diagnosing diagnostic language assessment. Language Testing, 32, 299-316. doi:10.1177/0265532214565387

Lee, Y.-W., & Sawaki (2009a). Application of three cognitive diagnosis models to ESL reading and listening assessments. Language Assessment Quarterly, 6, 239-263. doi:10.1080/15434300903079562

Lee, Y.-W., & Sawaki (2009b). Cognitive diagnosis approaches to language assessment: An overview. Language Assessment Quarterly, 6, 172-189. doi:10.1080/15434300902985108

Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees’ thinking processes. Educational Measurement: Issues and Practice, 26, 3-16.

Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s Rule-Space Approach. Journal of Educational Measurement, 41, 205-237.

Li (2011). Evaluating language group differences in the subskills of reading using a cognitive diagnostic modeling and differential skill functioning approach (Doctoral dissertation). Pennsylvania State University, State College.

Li, Cohen, Bottge, & Templin (2015). A latent transition analysis model for assessing change in cognitive skills. Educational and Psychological Measurement. Advance online publication. doi:10.1177/0013164415588946

Li, & Suen, H. K. (2013). Detecting native language group differences at the subskills level of reading: A differential skill functioning approach. Language Testing, 30, 273-298.

Long, D. L., Seely, M. R., Oppy, B. J., & Golding, J. M. (1996). The role of inferential processing in reading ability. In B. K. Britton & A. C. Graesser (Eds.), Models of understanding text (pp. 189-214). Mahwah, NJ: Lawrence Erlbaum.

Lumley (1993). The notion of subskills in reading comprehension tests: An EAP example. Language Testing, 10, 211-234.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. Multivariate Behavioral Research, 30, 23-40. doi:10.1207/s15327906mbr32-001

Messick (1996). Validity and washback in language testing. Language Testing, 13, 241-256.

No Child Left Behind Act of 2001 (NCLB) Public Law 107-110.

Pressley (2002). Metacognition and self-regulated comprehension. In A. E. Farstrup & S. J. Samuels (Eds.), What research has to say about reading instruction (3rd ed., pp. 184-200). Newark, DE: International Reading Association.

Ravand, Barati, & Widhiarso (2012). Exploring diagnostic capacity of a high stakes reading comprehension test: A pedagogical demonstration. Iranian Journal of Language Testing, 3, 11-37.

Robitzsch, Kiefer, George, A. C., & Uenlue (2014). CDM: Cognitive diagnosis modeling (R package version 3.0-12). Retrieved from https://cran.r-project.org/web/packages/CDM/index.html

Rojas, de la Torre, & Olea (2012, April). Choosing between general and specific cognitive diagnosis models when the sample size is small. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia, Canada.

Roussos, L. A., DiBello, L. V., Henson, R. A., Jang, E. E., & Templin, J. L. (2006). Skills diagnosis for education and psychology with IRT-based parametric latent class models. In Embretson & Roberts (Eds.), New directions in psychological measurement with model-based approaches (pp. 35-69). Washington, DC: American Psychological Association.

Roussos, L. A., DiBello, L. V., & Stout (2006). Diagnostic skills-based testing using the Fusion-Model-Based Arpeggio system. In Leighton & Gierl (Eds.), Cognitively diagnostic assessment in education: Theory and practice (pp. 275-318). New York, NY: Cambridge University Press.

Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification models: A comprehensive review of the current state-of-the-art. Measurement: Interdisciplinary Research and Perspectives, 6, 219-262. doi:10.1080/15366360802490866

Rupp, A. A., Templin, J. L., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford.

Sawaki, Kim, H. J., & Gentile (2009). Q-matrix construction: Defining the link between constructs and test items in large-scale reading and listening comprehension assessments. Language Assessment Quarterly, 6, 190-209.

Scott, H. S. (1998). Cognitive diagnostic perspectives of a second language reading test (Unpublished doctoral dissertation). University of Illinois, Urbana-Champaign, Urbana, IL.

Tatsuoka (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51, 337-350.

Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345-354.

Tatsuoka, K. K. (1985). A probabilistic model for diagnosing misconceptions by the pattern classification approach. Journal of Educational and Behavioral Statistics, 10, 55-73.

Templin, & Bradshaw (2013). Measuring the reliability of diagnostic classification model examinee estimates. Journal of Classification, 30, 251-275.

Templin, & Hoffman (2013). Obtaining diagnostic classification model estimates using Mplus. Educational Measurement: Issues and Practice, 32, 37-50.

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287-305.

von Davier (2005). A general diagnostic model applied to language testing data (RR-05-16). Princeton, NJ: Educational Testing Service.

von Davier (2014). The log-linear cognitive diagnostic model (LCDM) as a special case of the general diagnostic model (GDM). Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/ets2.12043/abstract

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145. doi:10.1177/014662168400800201

Item	Detail	Inference	Main idea	Syntax	Vocabulary
1	0	1	0	1	1
2	1	0	0	0	1
3	0	1	0	0	1
4	0	0	1	1	1
5	0	0	0	0	1
6	0	0	0	1	1
7	0	0	0	1	1
8	0	0	1	0	1
9	0	1	0	0	0
10	1	0	0	0	0
11	0	1	0	0	1
12	1	1	0	0	1
13	0	0	0	1	1
14	0	0	1	0	0
15	0	1	0	0	0
16	0	1	0	0	1
17	0	1	0	0	1
18	1	0	0	0	1
19	0	0	0	1	1
20	0	0	0	1	0

Item	Detail	Inference	Main idea	Syntax	Vocabulary
1	0	1	0	0	1
2	1	0	0	0	1
3	0	1	0	0	1
4	0	0	1	1	1
5	0	0	0	0	1
6	0	0	0	1	1
7	0	0	0	1	1
8	0	0	1	0	1
9	0	1	0	0	1
10	1	0	0	0	0
11	0	1	0	0	0
12	1	1	0	0	1
13	0	0	0	1	1
14	0	0	1	0	0
15	0	1	0	0	0
16	0	1	0	0	0
17	0	1	0	0	0
18	1	0	0	0	0
19	0	0	0	1	1
20	0	0	0	1	0
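The structure of a Q-matrix such as the one directly above can be summarized with a few lines of code. The sketch below assumes that this second matrix is the one used in the final analysis, an inference from the text's statement that Item 1 requires Inference and Vocabulary, since the extracted tables carry no captions. The variable names are illustrative.

```python
# Columns: Detail, Inference, Main idea, Syntax, Vocabulary (rows are Items 1 to 20).
q_matrix = [
    [0, 1, 0, 0, 1], [1, 0, 0, 0, 1], [0, 1, 0, 0, 1], [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1], [0, 0, 0, 1, 1], [0, 0, 0, 1, 1], [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0],
]

# Items whose q-vector contains more than one required attribute.
multi_attribute_items = [
    item for item, q_vector in enumerate(q_matrix, start=1) if sum(q_vector) > 1
]

# With K binary attributes there are 2**K possible skill profiles (latent classes).
n_attributes = len(q_matrix[0])
n_latent_classes = 2 ** n_attributes

print(multi_attribute_items)
print(len(multi_attribute_items), n_latent_classes)
```

Under this assumption the matrix yields 11 multi-attribute items and 32 latent classes, matching the counts given in the text.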
