A Longitudinal Higher-Order Diagnostic Classification Model

Abstract

Providing diagnostic feedback about growth is crucial to formative decisions such as targeted remedial instruction or interventions. This article proposes a longitudinal higher-order diagnostic classification modeling approach for measuring growth. The new approach provides quantitative values of overall and individual growth by constructing a multidimensional higher-order latent structure that accounts for the correlations among multiple latent attributes examined across different occasions. In addition, potential local item dependence among anchor (or repeated) items can be taken into account. Model parameter estimation is explored in a simulation study. An empirical example is analyzed to illustrate the applications and advantages of the proposed modeling approach.

Keywords

cognitive diagnosis, diagnostic classification model, longitudinal data, anchor-item, local item dependence, DINA

A central topic in educational research and assessment is measuring change in student learning across occasions (Fischer, 1995). Measuring individual growth or change relies on longitudinal data collected from multiple measures of an achievement construct along the growth trajectory (Wang, Jiao, & Zhang, 2013; Wang, Kohli, & Henn, 2016). To date, much research on individual or overall change has been conducted in fields such as developmental, educational, and applied psychology.

In recent years, cognitive diagnosis has received great attention, particularly in the areas of educational and psychological measurement (Rupp, Templin, & Henson, 2010). One of the main objectives of cognitive diagnosis is to evaluate respondents’ mastery or nonmastery of skills (also called “attributes”) and to provide diagnostic feedback that helps teachers or clinicians make decisions about remedial teaching or targeted interventions (Zhan, Jiao, & Liao, 2018). Several diagnostic classification models (DCMs), also known as cognitive diagnosis models, have been developed, such as the deterministic inputs, noisy “and” gate (DINA) model (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977) and the deterministic inputs, noisy “or” gate (DINO) model (Templin & Henson, 2006). Some general DCMs are also available (de la Torre, 2011; Henson, Templin, & Willse, 2009; von Davier, 2008). However, most DCMs do not address the measurement of growth in several possibly related attributes over multiple occasions, which could be very important for remedial teaching or targeted intervention.

Unlike continuous latent variables in the item response theory (IRT) models, the attributes in DCMs are categorical (typically, binary). Therefore, the methods for modeling growth in the IRT framework may not be directly extended to capture growth in the mastery of attributes. For example, the change in the mastery of attributes may not be directly modeled by the variance–covariance-based methods (Collins & Wugalter, 1992) when assuming multiple continuous latent variables follow a multivariate normal distribution (e.g., Andrade & Tavares, 2005; von Davier, Xu, & Carstensen, 2011).

In DCMs, to account for change in attributes, Li, Cohen, Bottge, and Templin (2016) proposed combining latent transition analysis (LTA; Collins & Wugalter, 1992), also known as the mixed hidden (or latent) Markov model (Van de Pol & Langeheine, 1990), with the DINA model in repeated measures. Likewise, Kaya and Leite (2017) combined the LTA with the DINA model and the DINO model, respectively. Such LTA-based methods provide an attribute-level transition probability matrix rather than the quantitative value of change that is more commonly used. Additionally, they assume that attributes are independent and that their transition probabilities are also independent. However, those independence assumptions may be tenuous because the attributes may be correlated (de la Torre & Douglas, 2004; Rupp et al., 2010). Recently, focusing on modeling learning trajectories, Wang, Yang, Culpepper, and Douglas (2018) proposed a higher-order, hidden Markov model for attribute transitions. Compared with the above two LTA-based methods, Wang et al.’s model used a set of observed and latent covariates, such as intervention indicators and a time-invariant general learning ability, to model the attribute-level transition probabilities. The correlations among attributes on the first occasion and the correlations among different transition probabilities were also accounted for. Additionally, Wang et al.’s model assumed learning trajectories to be nondecreasing. Rather than employing attribute-level hidden Markov models, Chen, Culpepper, Wang, and Douglas (2017) considered an attribute–pattern-level approach for approximating the learning trajectory space. In Chen et al.’s model, the attribute–pattern-level transition probability matrix explicitly provides the probabilities of remaining in the same pattern or changing to other patterns from one occasion to the next. However, Chen et al.’s model assumed that the transition probabilities of different attribute patterns were the same on different occasions and the same for each individual.

Essentially, these transition probability-based methods analyze longitudinal data from the latent class modeling perspective, and all of them can be taken as a special case or an application of the mixture hidden Markov model (Vermunt, Tran, & Magidson, 2008). Moreover, despite the use of the repeated measures design in these studies, local item dependence among a person’s responses to the repeated items on different occasions (Cai, 2010; Paek, Park, Cai, & Chi, 2014) was not taken into account. In the IRT framework, it has been demonstrated that local item dependence affects model parameter estimation, equating, and estimation of test reliability (e.g., Bradlow, Wainer, & Wang, 1999; Jiao, Kamata, Wang, & Jin, 2012; Jiao & Zhang, 2015; Sireci, Thissen, & Wainer, 1991; Tao & Chao, 2016; Wang & Wilson, 2005; Zhan, Wang, Wang, & Li, 2014). Similarly, in the DCM framework, if local item dependence is ignored, large estimation errors in item parameters could appear, and the correct classification rate of attributes might decrease (Zhan, Li, Wang, Bian, & Wang, 2015; Zhan, Liao, & Bian, 2018).

Additionally, Hansen (2013) proposed a longitudinal unidimensional DCM for repeated measures in which only one attribute is required on each occasion and multiple attributes are required across occasions. Further, a higher-order latent structural model (also known as the higher-order latent trait model; de la Torre & Douglas, 2004) was employed to account for associations among the attributes across different time points, and local item dependence among repeated items was accounted for by using additional random-effect latent variables. Although, in theory, DCMs are employed to measure multiple dimensions of latent constructs rather than a unidimensional attribute on a given occasion, this model provides insight into longitudinal analysis in diagnostic assessment from a perspective that differs from the transition probability-based methods.

This study proposes a new longitudinal diagnostic classification modeling approach for measuring growth, which can be used not only in the repeated measures design but also in the anchor-item design. Among the numerous DCMs, the interpretability of the DINA model makes it the most popular one. Thus, in this study, the DINA model is taken as an example to illustrate the conceptualization of the proposed modeling approach. The proposed method can be easily extended to many other DCMs such as the log-linear DCM (LCDM; Henson et al., 2009) and its special cases. The rest of the article starts with a review of the DINA model with a higher-order latent structure. Then the proposed longitudinal DINA (denoted as the Long-DINA) model is presented and illustrated. Finally, item response data from a physics achievement test are analyzed to illustrate the application of the proposed modeling approach.

Longitudinal Diagnostic Classification Modeling

DINA Model With a Higher-Order Latent Structure

Let Y_ni be the observed response of person n to item i. In the DINA model, the relationship among attributes and an observed response can be expressed as (DeCarlo, 2011; Rupp et al., 2010; von Davier, 2014):

$$\operatorname{logit}\left(P(Y_{ni} = 1 \mid \alpha_{n})\right) = \lambda_{i0} + \lambda_{i(K)} \prod_{k=1}^{K} \alpha_{nk}^{q_{ik}}, \tag{1}$$

where logit(x) = log(x / (1 − x)); $P(Y_{ni} = 1 \mid \alpha_{n})$ is the probability of a correct response by person n to item i; $\lambda_{i0}$ and $\lambda_{i(K)}$ are the intercept and the K-way interaction effect parameters, respectively, for item i. In such a case, the guessing and slipping probability of item i ($g_{i}$ and $s_{i}$) can be expressed, respectively, as follows:

$$g_{i} = \frac{\exp(\lambda_{i0})}{1 + \exp(\lambda_{i0})} \quad \text{and} \quad s_{i} = 1 - \frac{\exp(\lambda_{i0} + \lambda_{i(K)})}{1 + \exp(\lambda_{i0} + \lambda_{i(K)})};$$

$\alpha_{nk}$ is the mastery indicator of person n on attribute k (k = 1,…, K), with $\alpha_{nk} = 1$ if person n masters attribute k and $\alpha_{nk} = 0$ otherwise. The Q-matrix (Tatsuoka, 1983) is an I × K matrix with element $q_{ik}$ indicating whether attribute k is required to answer item i correctly; $q_{ik} = 1$ if the attribute is required and 0 otherwise.
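As a minimal sketch of Equation 1 (not the authors' implementation), the following Python snippet computes the DINA response probability from the intercept and interaction parameters and recovers the guessing and slipping probabilities. The function names, the example Q-matrix row, and the attribute profiles are illustrative assumptions; the parameter values are borrowed from the high-quality items in the simulation study.

```python
import math

def expit(x):
    """Inverse logit: exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))

def dina_prob(alpha, q_row, lam0, lamK):
    """P(Y = 1 | alpha) under the logit-form DINA model (Equation 1)."""
    # eta = prod_k alpha_k ** q_k is 1 only if every required attribute is mastered.
    eta = 1
    for a_k, q_k in zip(alpha, q_row):
        eta *= a_k ** q_k
    return expit(lam0 + lamK * eta)

# Hypothetical item that requires attributes 1 and 2 out of K = 3.
q_row = [1, 1, 0]
lam0, lamK = -2.197, 4.394

p_master = dina_prob([1, 1, 0], q_row, lam0, lamK)     # both required attributes mastered
p_nonmaster = dina_prob([1, 0, 1], q_row, lam0, lamK)  # attribute 2 not mastered

g = expit(lam0)                # guessing probability
s = 1.0 - expit(lam0 + lamK)   # slipping probability
```

Under these assumed values, a respondent lacking any required attribute answers correctly with probability g, and a respondent mastering all required attributes answers correctly with probability 1 − s.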

In practice, attributes in a test are often correlated. In such cases, it may be assumed that a general continuous latent ability underlies these attributes. Let $\alpha_{nk}$ be person n’s attribute k and $\theta_{n}$ be the general ability of person n. The probability of $\alpha_{nk} = 1$ conditional on $\theta_{n}$ is defined as follows (de la Torre & Douglas, 2004):

$$\operatorname{logit}\left(P(\alpha_{nk} = 1 \mid \theta_{n})\right) = \delta_{k}\theta_{n} + \beta_{k}, \tag{2}$$

where $\delta_{k}$ and $\beta_{k}$ are the slope and intercept parameters of attribute k, respectively. To reduce computational burden, the attribute slope parameter $\delta_{k}$ can be further constrained as $\delta_{k} = \delta$, suggesting all attributes share the same slope parameter (de la Torre, Hong, & Deng, 2010), or $\delta_{k} = 1$ for all attributes (Ma & de la Torre, 2016), similar to the Rasch modeling.

Modeling Growth in DCMs

Basic modeling

In DCMs, attributes are typically modeled as categorical, especially binary variables. Thus, the longitudinal modeling approaches within the IRT framework such as the multivariate normal distribution strategy (e.g., von Davier et al., 2011) and the latent growth (curve) model-based strategy (e.g., Wang et al., 2016) cannot be employed directly. However, the general continuous latent trait θ in the higher-order latent structural model (Equation 2) can be an alternative.

The proposed model for two time points is presented graphically in Figure 1; it can also be extended to more time points. The DINA model (or another DCM) is specified as the first-order model to link the attributes of a respondent to the observed response data at each time point. Further, a second-order latent structural model is specified to determine the respondents’ mastery status on the attributes. Thus, at a given time point, the first two orders represent the higher-order DINA model (Equation 2). For the proposed model, the relationship between the general latent traits measured at different time points is specified at the third order. In other words, the Long-DINA model is a multidimensional extension of the higher-order DINA model, but the multidimensionality does not refer to different general ability dimensions; rather, it refers to the same general ability measured at different time points. Theoretically, this third-order model can use either of two strategies for measuring individual growth, that is, the multivariate normal distribution strategy and the latent growth model-based strategy. As the repeated measures design is not always feasible in educational measurement, a more common practice of test administration over time involves multiple test forms that share anchor items. This design is called the anchor-item design (e.g., the nonequivalent groups with anchor test design); however, it may induce local item dependence among a respondent’s responses to the same anchor items on multiple occasions. Therefore, additional random-effect latent variables or testlet effects (Bradlow et al., 1999; e.g., γ₁ in Figure 1) can be introduced to account for local item dependence. The number of such random-effect variables is the same as the number of anchor items (Cai, 2010).

Figure 1.

A graphical representation of the Long-DINA model for two time points.

First order

In Figure 1, responses Y₂₍₁₎ and Y₂₍₂₎ are for the same anchor item at two time points. The specific factor γ₁ should be added in the first-order model to capture local item dependence. To account for local item dependence in DCMs, Hansen (2013) and Hansen, Cai, Monroe, and Li (2016) proposed a higher-order, hierarchical DCM, which can be viewed as a combination of the two-tier item factor model (Cai, 2010) and the LCDM. Like the two-tier item factor model, Hansen’s model can only account for local item dependence due to one source. Zhan, Li, Wang, Bian, and Wang (2015) proposed (within-item) multidimensional testlet-effect DCMs, which simultaneously account for multiple sources of local item dependence within one item (Rijmen, 2011; Zhan et al., 2014). Multiple sources of within-item local item dependence may be present in an assessment when testlet-based items are repeatedly used or used as anchor items (e.g., Zu & Liu, 2010). However, modeling an additional tier of specific factors could substantially increase model complexity. To simplify the proposed model, only one source of local item dependence is modeled in this study.

Following Hansen’s and Zhan et al.’s models, for a given occasion t (t = 1,…, T), the first order of the Long-DINA model can be expressed as

$$\operatorname{logit}\left(P(Y_{nit} = 1 \mid \alpha_{nt})\right) = \lambda_{i0t} + \lambda_{i(K)t} \prod_{k=1}^{K} \alpha_{nkt}^{q_{ikt}} + s_{im}\gamma_{nm}, \tag{3}$$

where $Y_{nit}$ denotes the response of person n to item i on occasion t; $\alpha_{nt} = (\alpha_{n1t}, \dots, \alpha_{nKt})'$ denotes person n’s attribute profile on occasion t; $\lambda_{i0t}$ and $\lambda_{i(K)t}$ are the intercept and K-way interaction effect parameter for item i on occasion t, respectively; $q_{ikt}$ is the element of the $I_{t} \times K$ Q-matrix on occasion t; $\gamma_{nm} \sim N(0, 1)$ is the mth (m = 1,…, M) specific dimension parameter for person n; and the $\gamma_{nm}$ are independent of each other. To simplify the computation, the item slopes on the mth specific dimension are constrained to be equal, $s_{im} = s_{m}$ (Cai, 2010; Paek et al., 2014; Wang et al., 2016). Note that Equation 3 is the complete version of the first-order model. Fixing some or all of the specific dimensions to zero yields restricted models, as illustrated in the empirical example.
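To illustrate why the shared specific dimension in Equation 3 captures local item dependence, the sketch below computes the covariance between a respondent's two responses to the same anchor item, conditional on the attribute states, by integrating over γ ~ N(0, 1) with a simple trapezoid rule. This is only an illustration under assumed values (the helper names, the integration grid, and the chosen attribute states are not from the article); the item parameters and the slope of 0.8 mirror the simulation settings.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def anchor_item_prob(eta, lam0, lamK, s_m, gamma):
    """Equation 3 for one anchor item; eta is the product term (0 or 1)."""
    return expit(lam0 + lamK * eta + s_m * gamma)

def expect_over_gamma(f, lo=-8.0, hi=8.0, n=4001):
    """Approximate E[f(gamma)] for gamma ~ N(0, 1) with the trapezoid rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for j in range(n):
        x = lo + j * step
        weight = 0.5 if j in (0, n - 1) else 1.0
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += weight * f(x) * density
    return total * step

def anchor_covariance(eta_1, eta_2, lam0, lamK, s_m):
    """Cov(Y_t1, Y_t2 | attribute states) induced by the shared gamma."""
    p1 = lambda g: anchor_item_prob(eta_1, lam0, lamK, s_m, g)
    p2 = lambda g: anchor_item_prob(eta_2, lam0, lamK, s_m, g)
    joint = expect_over_gamma(lambda g: p1(g) * p2(g))
    return joint - expect_over_gamma(p1) * expect_over_gamma(p2)

# Hypothetical respondent: nonmaster on occasion 1 (eta = 0), master on occasion 2 (eta = 1).
cov_with_effect = anchor_covariance(0, 1, -2.197, 4.394, s_m=0.8)
cov_without_effect = anchor_covariance(0, 1, -2.197, 4.394, s_m=0.0)
```

With a positive slope the two responses remain positively correlated even after conditioning on the attributes, whereas setting the slope to zero removes the dependence, which corresponds to the restricted model without specific dimensions.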

Second order

In the IRT framework, multidimensional IRT models allow for the modeling of individual growth (te Marvelde, Glas, Van Landeghem, & Van Damme, 2006). Andersen (1985) proposed a between-item multidimensional Rasch model to measure individual differences on different occasions. Embretson (1991) proposed a within-item multidimensional Rasch model for learning and change. As between-item multidimensionality is a special case of within-item multidimensionality, Embretson’s model can be taken as an extension of Andersen’s model (Adams, Wilson, & Wang, 1997; von Davier et al., 2011). In addition, two- and three-parameter logistic multidimensional IRT models (e.g., von Davier et al., 2011; Paek et al., 2014) can be employed in longitudinal studies.

In this study, a two-parameter logistic multidimensional higher-order latent structural model was used. For a given occasion t, the second order of the Long-DINA model can be expressed as

$$\operatorname{logit}\left(P(\alpha_{nkt} = 1 \mid \theta_{n})\right) = \delta_{kt}\theta_{nt} + \beta_{kt}, \quad \theta_{n} = (\theta_{n1}, \dots, \theta_{nT})', \tag{4}$$

where $\theta_{nt}$ is person n’s general ability on occasion t, and $\delta_{kt}$ and $\beta_{kt}$ are the slope and intercept parameters of attribute k on occasion t, respectively. The $\theta_{nt}$ are constrained to be independent of the $\gamma_{nm}$. Equation 4 is a between-attribute multidimensional model that is similar to Andersen’s model. However, the major difference between these two models is that $\alpha_{nkt}$ in Equation 4 is latent, whereas the item response in Andersen’s model is observed. As a starting and reference point for subsequent occasions, $\theta_{n1}$ is constrained to follow a standard normal distribution, $\theta_{n1} \sim N(0, 1)$; the means and variances of $\theta_{nt}$ (t ≥ 2) are freely estimated. In addition, the same attributes, and hence the same latent construct, are assumed to be measured on different occasions (Bianconcini, 2012), that is, $K_{t} = K$. Correspondingly, the slope and intercept parameters of the kth attribute are constrained to be constant across occasions, $\delta_{kt} = \delta_{k}$ and $\beta_{kt} = \beta_{k}$. Each respondent’s general ability and attribute mastery probabilities are allowed to change over occasions.
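The following sketch shows how Equation 4 lets mastery probabilities change over occasions while the attribute slopes and intercepts stay fixed. The attribute parameters are the true values used in the simulation study; the general abilities of the example respondent are hypothetical, and the function name is an assumption made for illustration.

```python
import math

def mastery_probs(theta_t, delta, beta):
    """P(alpha_kt = 1 | theta_t) for each attribute k under Equation 4."""
    return [1.0 / (1.0 + math.exp(-(d * theta_t + b))) for d, b in zip(delta, beta)]

# Occasion-invariant attribute parameters (delta_kt = delta_k, beta_kt = beta_k).
delta = [1.0, 1.25, 1.5]
beta = [-1.0, 0.0, 1.0]

# Hypothetical general abilities of one respondent on three occasions.
theta_n = [0.0, 0.5, 1.0]

probs_by_occasion = [mastery_probs(theta_t, delta, beta) for theta_t in theta_n]
```

Because all slopes are positive, every attribute's mastery probability rises as the respondent's general ability grows from one occasion to the next.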

Third order

The most straightforward and general method assumes multiple general abilities follow a T-way multivariate normal distribution. Thus, the third order of the Long-DINA model assumes that

$$\theta_{n} = (\theta_{n1}, \dots, \theta_{nT})' \sim MVN_{T}(\mu_{\theta}, \Sigma_{\theta}), \tag{5}$$

with a mean vector $\mu_{\theta} = (\mu_{1}, \dots, \mu_{T})'$ and a variance–covariance matrix

$$\Sigma_{\theta} = \begin{bmatrix} \sigma_{1}^{2} & & \\ \vdots & \ddots & \\ \sigma_{1T} & \cdots & \sigma_{T}^{2} \end{bmatrix}, \tag{6}$$

where $\mu_{1} = 0$ and $\sigma_{1}^{2} = 1$; $\sigma_{1T}$ is the covariance of the first and Tth general abilities. Alternatively, the latent growth model-based strategy can be employed in the third order. That is, $\theta_{nt}$ is assumed to be a linear or nonlinear combination of random coefficients or growth factors (Kohli & Harring, 2013) across occasions. Note that the latent growth model-based strategy is not employed in this study and is left for future exploration.

Rebuilding the Longitudinal Data and Longitudinal Q-Matrix

In the Long-DINA model, response data from different occasions are combined and calibrated simultaneously. The longitudinal data then form an $N \times \sum_{t=1}^{T} I_{t}$ matrix, and the longitudinal Q-matrix is constructed as a $\sum_{t=1}^{T} I_{t} \times TK$ matrix:

$$\text{Longitudinal } Q = \begin{bmatrix} Q_{1} & & & & \\ \vdots & \ddots & & & \\ 0 & \cdots & Q_{t} & & \\ \vdots & \ddots & \vdots & \ddots & \\ 0 & \cdots & 0 & \cdots & Q_{T} \end{bmatrix},$$

where $Q_{t}$ is the sub-Q-matrix for the test on the tth occasion and blank entries are zero. In such a case, the length of the estimated attribute pattern of each person is TK, which represents the mastery status of K attributes on T occasions rather than TK distinct attributes for each person. Correspondingly, the posterior mixing proportions are computed on each occasion separately. Further, items from different occasions should be sequentially recoded for simultaneous estimation: Items from the tth occasion are recoded as $\sum_{s=0}^{t-1} I_{s} + 1$ to $\sum_{s=0}^{t} I_{s}$, where $I_{0} = 0$.
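The construction of the longitudinal Q-matrix and the item recoding can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy sub-Q-matrices are assumptions, while the item counts in the second example (15, 15, and 17) are those of the empirical example.

```python
def build_longitudinal_q(sub_qs):
    """Stack occasion-specific I_t x K sub-Q-matrices block-diagonally.

    Returns a (sum of I_t) x (T * K) matrix as a list of rows.
    """
    T = len(sub_qs)
    K = len(sub_qs[0][0])
    rows = []
    for t, q_t in enumerate(sub_qs):
        for q_row in q_t:
            if len(q_row) != K:
                raise ValueError("every sub-Q-matrix must have K columns")
            # Zeros before and after the block that belongs to occasion t.
            rows.append([0] * (t * K) + list(q_row) + [0] * ((T - t - 1) * K))
    return rows

def recoded_item_range(item_counts, t):
    """First and last recoded item labels (1-based) for occasion t = 1, ..., T."""
    start = sum(item_counts[:t - 1]) + 1
    end = sum(item_counts[:t])
    return start, end

# Toy example with T = 2 occasions, K = 2 attributes, and 2 or 3 items per occasion.
q1 = [[1, 0], [1, 1]]
q2 = [[0, 1], [1, 0], [1, 1]]
long_q = build_longitudinal_q([q1, q2])

# Item counts of the empirical example.
ranges = [recoded_item_range([15, 15, 17], t) for t in (1, 2, 3)]
```

Here `long_q` has 5 rows and 4 columns, and `ranges` reproduces the recoding used later in the empirical analysis (Items 1 to 15, 16 to 30, and 31 to 47).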

Overall and Individual Growth

Equations 3 through 6 together constitute the Long-DINA model. Using the Long-DINA model, both the overall and individual growth can be computed. The overall mean growth at the population level is ${\hat{μ}}_{t + 1} - {\hat{μ}}_{t}$ , and the overall scale change at the population level is ${\hat{σ}}_{t + 1} / {\hat{σ}}_{t}$ (Paek et al., 2014). In addition, the model can estimate the change in the mixing proportions of possible attribute patterns, the change in the mean mastery probability of each attribute across all students, and the change in the number of students who master each attribute. For individual growth, the growth in the general ability can be computed as ${\hat{θ}}_{n (t + 1)} - {\hat{θ}}_{n t}$ , and changes in the mastery status of each attribute can also be reported.

In the Long-DINA model, the number of estimated parameters is $2 \sum_{t = 1}^{T} (I_{t} - d_{t}) + 2 K + T (T - 1) / 2 + 2 (T - 1) + 3 M$ . More specifically, there are (1) $3 M + 2 \sum_{t = 1}^{T} (I_{t} - d_{t})$ item parameters, including 2M parameters for anchor items, $2 \sum_{t = 1}^{T} (I_{t} - d_{t})$ parameters for nonanchor items, and M item slopes on the specific latent variables for anchor items, where M is the total number of anchor items and $d_{t}$ is the number of anchor items on occasion t; (2) 2K latent structural parameters, including K attribute slopes and K attribute intercepts; and (3) T(T − 1)/2 + 2(T − 1) parameters of the general abilities, including (T − 1) means, (T − 1) variances, and T(T − 1)/2 covariances. Obviously, the number of model parameters is mainly influenced by the number of occasions, T. In addition, the complexity of the model structure increases with T. Furthermore, the number of possible attribute patterns is $2^{KT}$, which increases exponentially with K and T. Therefore, the computational burden could be heavy when the number of occasions is large or even moderate, which should be considered when applying the proposed model to real data.
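The parameter count can be checked with a short helper that applies the formula exactly as written. The function names are assumptions made for illustration, and the example plugs in the simulation design described later (20 items per occasion, 4 anchor items, 3 attributes).

```python
def long_dina_param_count(items_per_occasion, anchors_per_occasion, K, M):
    """Number of estimated parameters, following the formula in the text.

    items_per_occasion:   [I_1, ..., I_T]
    anchors_per_occasion: [d_1, ..., d_T]
    K: number of attributes; M: total number of anchor items.
    """
    T = len(items_per_occasion)
    nonanchor = 2 * sum(i - d for i, d in zip(items_per_occasion, anchors_per_occasion))
    anchor = 2 * M                # intercept and interaction of each anchor item
    specific_slopes = M           # one slope per specific dimension
    structural = 2 * K            # K attribute slopes and K attribute intercepts
    ability = 2 * (T - 1) + T * (T - 1) // 2  # means, variances, covariances
    return nonanchor + anchor + specific_slopes + structural + ability

def n_attribute_patterns(K, T):
    """Number of possible longitudinal attribute patterns, 2 ** (K * T)."""
    return 2 ** (K * T)

count_t2 = long_dina_param_count([20, 20], [4, 4], K=3, M=4)
count_t3 = long_dina_param_count([20, 20, 20], [4, 4, 4], K=3, M=4)
patterns_t2 = n_attribute_patterns(3, 2)
patterns_t3 = n_attribute_patterns(3, 3)
```

Under these inputs, going from two to three occasions raises the parameter count from 85 to 121 and the number of attribute patterns from 64 to 512, which illustrates the growth in computational burden noted above.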

Overall, as the proposed modeling approach is similar to the longitudinal IRT modeling approach, the interpretation of the proposed model is more straightforward than transition probability-based methods. The Long-DINA model uses a multidimensional higher-order latent structural model to approximate the correlations among attributes at each occasion as well as across occasions. More importantly, local item dependence among a respondent’s responses to the same anchor items on multiple occasions can be modeled in the Long-DINA model. Essentially, the Long-DINA model can be seen as a special application of the higher-order, hierarchical DCM (Hansen, Cai, Monroe, & Li, 2016) in longitudinal studies. The relationship between these two models is quite similar to that between the multidimensional IRT models and the longitudinal IRT models (e.g., te Marvelde et al., 2006).

A Simulation Study

Design and Data Generation

A simulation study was conducted to evaluate the parameter recovery of the Long-DINA model under different conditions. Three independent variables were manipulated: (a) the sample size (N), at two levels of 200 and 500; (b) the quality of anchor items (QA), at two levels of high ($\lambda_{i0t} = -2.197$ and $\lambda_{i(K)t} = 4.394$) and moderate ($\lambda_{i0t} = -1.387$ and $\lambda_{i(K)t} = 2.774$); and (c) the number of occasions (T), at two levels of two and three. For the high-quality anchor items, the aberrant response (i.e., guessing and slipping) probabilities are approximately equal to 0.1, while for the moderate-quality anchor items, they are approximately equal to 0.2. Low-quality anchor items were not considered because, in practice, it is not common to use low-quality items as anchor items. The moderate-quality condition is labeled “Low” in Tables 1 through 3.
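The correspondence between the item parameters and the aberrant response probabilities can be verified by inverting the guessing and slipping expressions. The sketch below is illustrative only; the function names are assumptions.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def dina_item_parameters(g, s):
    """Intercept and interaction parameters implied by guessing g and slipping s."""
    lam0 = logit(g)                 # g = expit(lam0)
    lamK = logit(1.0 - s) - lam0    # 1 - s = expit(lam0 + lamK)
    return lam0, lamK

high_quality = dina_item_parameters(0.1, 0.1)
moderate_quality = dina_item_parameters(0.2, 0.2)
```

Guessing and slipping probabilities of 0.1 give parameters of about −2.197 and 4.394, and probabilities of 0.2 give about −1.386 and 2.773, which agree with the reported design values up to rounding.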

Within each occasion, three attributes ($K_{t} = 3$) were measured by 20 items ($I_{t} = 20$), and the first 4 items were used as anchor items. The simulated test structure for the T = 2 condition is presented in Figure 2 as an example. The simulated Q-matrices are presented in Figure 3. Nonanchor-item parameters were fixed at $\lambda_{i0t} = -2.197$ and $\lambda_{i(K)t} = 4.394$. For the general abilities, the correlations among them were set as .9. The overall mean growths were set as 0.5, and the overall scale changes were set as 1.25. Four specific dimensions, one per anchor item, were assumed to follow a standard normal distribution, and the slopes on the specific dimensions were set as 0.8. In sum, the true person parameters, including T general abilities and four specific dimensions, were generated from a (T + 4)-way multivariate normal distribution, $\Theta \sim MVN_{(T + 4)}(\mu, \Sigma)$. On each occasion, the true attribute pattern for each person was generated according to Equation 4; the true attribute intercept parameters were β = (−1, 0, 1)′, and the true attribute slope parameters were δ = (1, 1.25, 1.5)′.
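One way to generate the general abilities under these settings is sketched below using only the Python standard library. It rests on stated assumptions: the mean growth of 0.5 is read as a constant increase in the mean per occasion, the scale change of 1.25 as a constant ratio of successive standard deviations, and the specific dimensions (independent standard normal variables) are omitted for brevity. All function names are hypothetical.

```python
import math
import random

def ability_moments(T, mean_growth=0.5, scale_change=1.25, corr=0.9):
    """Mean vector and covariance matrix of the T general abilities."""
    mu = [mean_growth * t for t in range(T)]   # mu_1 = 0
    sd = [scale_change ** t for t in range(T)] # sigma_1 = 1
    sigma = [[(1.0 if i == j else corr) * sd[i] * sd[j] for j in range(T)]
             for i in range(T)]
    return mu, sigma

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            lower[i][j] = math.sqrt(acc) if i == j else acc / lower[j][j]
    return lower

def draw_abilities(mu, sigma, n_persons, seed=0):
    """Draw general abilities from MVN(mu, sigma) via the Cholesky factor."""
    rng = random.Random(seed)
    lower = cholesky(sigma)
    T = len(mu)
    draws = []
    for _ in range(n_persons):
        z = [rng.gauss(0.0, 1.0) for _ in range(T)]
        draws.append([mu[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                      for i in range(T)])
    return draws

mu, sigma = ability_moments(T=3)
thetas = draw_abilities(mu, sigma, n_persons=20000, seed=1)
mean_growth_12 = sum(th[1] - th[0] for th in thetas) / len(thetas)
```

With a large number of simulated respondents, the average of the individual differences between the first two occasions should be close to the specified mean growth of 0.5.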

Figure 2.

Simulated test structure for the T = 2 condition in the simulation study. Occasions are shown in parentheses.

Figure 3.

K × I_t Q-matrices of three occasions in the simulation study. t = tth occasion; gray = 1; blank = 0. When T = 2, only first two Q-matrices were used.

Estimation and Analysis

Response data from different occasions were combined and calibrated simultaneously. Thus, items on Occasion 2 were recoded as Items 21 through 40, and items on Occasion 3 were recoded as Items 41 through 60. Then, for the conditions with T = 2, the longitudinal data formed an N × 40 matrix, and the longitudinal Q-matrix was constructed as a 40 × 6 matrix; for the conditions with T = 3, the longitudinal data formed an N × 60 matrix, and the longitudinal Q-matrix was constructed as a 60 × 9 matrix.

For the model parameter estimation, flexMIRT, version 3.5 (Cai, 2015), was used. In flexMIRT, the default Bock–Aitkin expectation maximization (EM) algorithm (Bock & Aitkin, 1981) was used for parameter estimation, and the Richardson extrapolation method was used to compute standard error. Specifically, the maximum number of cycles was set as 20,000 and 100 for the E-step and M-step, respectively; and the convergence criteria were 10⁻⁴ and 10⁻⁷ for the E-step and the M-step, respectively. Sample codes with comments are provided in the Appendix.

Thirty replications were implemented in each simulated condition. To evaluate model parameter recovery, bias and root mean square error (RMSE) were computed. The attribute correct classification rate (ACCR) and the pattern correct classification rate (PCCR) were computed to evaluate the classification accuracy of individual attributes and profiles. Additionally, the recovery of the overall mean and scale growth across occasions was evaluated.
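The evaluation criteria can be sketched as follows. These are standard definitions written under the assumption that bias and RMSE are taken over replicated estimates of one parameter and that classification rates are taken over respondents; the article does not spell out the formulas, and the toy data are invented for illustration.

```python
import math

def bias(estimates, true_value):
    """Mean estimate minus the true value."""
    return sum(estimates) / len(estimates) - true_value

def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

def accr(true_profiles, est_profiles):
    """Proportion of respondents correctly classified, per attribute."""
    n = len(true_profiles)
    K = len(true_profiles[0])
    return [sum(t[k] == e[k] for t, e in zip(true_profiles, est_profiles)) / n
            for k in range(K)]

def pccr(true_profiles, est_profiles):
    """Proportion of respondents whose whole attribute pattern is recovered."""
    n = len(true_profiles)
    return sum(list(t) == list(e) for t, e in zip(true_profiles, est_profiles)) / n

# Toy data: estimates of one intercept over three replications, and four respondents.
intercept_estimates = [-2.1, -2.3, -2.2]
true_profiles = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
est_profiles = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 1]]

b = bias(intercept_estimates, -2.197)
r = rmse(intercept_estimates, -2.197)
attribute_rates = accr(true_profiles, est_profiles)
pattern_rate = pccr(true_profiles, est_profiles)
```

Applying `pccr` to patterns of length TK (all occasions concatenated) rather than length K gives the longitudinal version of the pattern-level rate.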

Results

Figure 4 summarizes the recovery of item parameters. First, the recovery of the intercept parameters was better than that of the interaction parameters. Further, the mean bias and mean RMSE for the study condition with a sample size of 200 were larger than those with a sample size of 500, indicating that a larger sample size led to better recovery of item parameters. In addition, the number of occasions and the quality of anchor item had little effect on the recovery of item parameters.

Figure 4.

Recovery of item parameters. T = the number of occasions; N = sample size; QA = the quality of anchor item.

The recovery of attributes on different occasions is summarized in Table 1. The PCCR focuses on whether the K attributes can be correctly recovered on a given occasion; in contrast, the Longitudinal PCCR focuses on whether all TK attributes can be correctly recovered. The ACCR and PCCR values both increased over occasions. According to the Longitudinal PCCR, high-quality anchor items clearly improved the recovery of the attributes, whereas the benefit of a larger sample size was less evident. In addition, the Longitudinal PCCR decreased as the number of occasions increased.

Table 1.

Recovery of Attributes in Different Simulation Conditions

T	N	QA	t	ACCR α₁(t)	ACCR α₂(t)	ACCR α₃(t)	PCCR	Longitudinal PCCR
2	200	High	1	.99	.97	.95	.92	.86
2	200	High	2	.99	.97	.96	.93	.86
2	200	Low	1	.97	.96	.95	.89	.83
2	200	Low	2	.98	.97	.96	.92	.83
2	500	High	1	.99	.98	.96	.93	.87
2	500	High	2	.99	.98	.97	.94	.87
2	500	Low	1	.97	.97	.95	.89	.83
2	500	Low	2	.98	.97	.96	.92	.83
3	200	High	1	.99	.98	.96	.93	.85
3	200	High	2	.99	.98	.97	.94	.85
3	200	High	3	.99	.99	.98	.96	.85
3	200	Low	1	.97	.96	.95	.90	.79
3	200	Low	2	.98	.97	.96	.92	.79
3	200	Low	3	.98	.98	.98	.95	.79
3	500	High	1	.99	.98	.96	.93	.85
3	500	High	2	.99	.97	.99	.95	.85
3	500	High	3	.99	.99	.98	.96	.85
3	500	Low	1	.97	.97	.95	.90	.80
3	500	Low	2	.98	.97	.96	.92	.80
3	500	Low	3	.98	.98	.97	.95	.80

Note. T = the number of occasions; t = tth occasion; N = sample size; QA = the quality of anchor items; ACCR = attribute correct classification rate; PCCR = pattern correct classification rate.

Table 2 presents the recovery of the general abilities on different occasions. For Occasion 1, virtually all conditions resulted in similar mean absolute bias; for Occasions 2 and 3, the mean absolute bias was slightly higher. Overall, the sample size and the quality of anchor items had no evident effect on the recovery of the general abilities. Further, the RMSE of $\theta_{t+1}$ is larger than that of $\theta_{t}$, which means that the accuracy in the recovery of the general abilities diminished over occasions.

Table 2.

Recovery of the General Abilities

T	N	QA	θ(1) MA_Bias	θ(1) M_RMSE	θ(2) MA_Bias	θ(2) M_RMSE	θ(3) MA_Bias	θ(3) M_RMSE
2	200	High	.34	.62	.35	.72
2	200	Low	.35	.63	.35	.76
2	500	High	.35	.63	.37	.71
2	500	Low	.36	.64	.37	.72
3	200	High	.31	.57	.33	.66	.36	.73
3	200	Low	.32	.58	.34	.67	.41	.77
3	500	High	.31	.57	.33	.65	.38	.72
3	500	Low	.31	.57	.33	.65	.39	.72

Note. T = the number of occasions; N = sample size; QA = the quality of anchor items; MA_Bias = mean absolute bias across all respondents; M_RMSE = mean RMSE across all respondents; RMSE = root mean square error.

Table 3 summarizes the recovery of the overall mean and the overall scale growth. For the overall mean growth, the bias is close to zero across all conditions; by contrast, for the overall scale growth, negative biases can be found, indicating that the Long-DINA model underestimated overall scale changes. Larger sample sizes seem to help the recovery, especially in terms of RMSE. The quality of anchor items did not evidently affect the recovery of these parameters.

Table 3.

Recovery of the Overall Mean Growth and the Scale Growth

T	N	QA	Change of Occasion	Mean Growth Bias	Mean Growth RMSE	Scale Growth Bias	Scale Growth RMSE
2	200	High	t₁ → t₂	−.02	.15	−.09	.29
2	200	Low	t₁ → t₂	.03	.19	−.03	.21
2	500	High	t₁ → t₂	−.02	.08	−.11	.18
2	500	Low	t₁ → t₂	.03	.11	−.10	.17
3	200	High	t₁ → t₂	−.01	.16	−.14	.24
3	200	High	t₂ → t₃	.01	.12	−.09	.21
3	200	Low	t₁ → t₂	−.02	.12	−.13	.24
3	200	Low	t₂ → t₃	.01	.18	−.12	.25
3	500	High	t₁ → t₂	−.01	.10	−.15	.19
3	500	High	t₂ → t₃	−.03	.08	−.18	.20
3	500	Low	t₁ → t₂	−.01	.08	−.10	.17
3	500	Low	t₂ → t₃	−.02	.11	−.12	.19

Note. T = the number of occasions; N = sample size; QA = the quality of anchor items; RMSE = root mean square error.

The recovery of the attribute intercept and slope parameters is presented in Figure 5. For the attribute intercepts, the bias is close to zero across all conditions, while the bias for the attribute slopes is slightly larger. Larger sample sizes seem to help the recovery, especially in terms of RMSE. By contrast, the quality of the anchor items has no evident effect on the recovery of the attribute intercept parameters.

Figure 5.

Recovery of the attribute intercept and attribute slope parameters. T = the number of occasions; N = sample size; QA = the quality of anchor item.

An Empirical Example

Data Description

Item response data from a physics achievement test about electric current and voltage were used to illustrate the application of the proposed Long-DINA model. Response data were available for three occasions. On Occasion 1, 264 eighth-grade students from seven classrooms took part in the assessment in a school in Hangzhou, Zhejiang Province, China. After 1 week, 221 students from six classrooms remained on Occasion 2. Another week later, 209 students from the same six classrooms remained on Occasion 3. Among the 209 students, seven students missed data collection on Occasion 2. Thus, 202 respondents took part in all three tests. The same four attributes were assessed by all tests, namely, electric current (α₁), voltage (α₂), circuit analysis (α₃), and Ohm’s law (α₄; resistance).

There were 17 items in each of the first two tests. Items 1 through 5 were dichotomous fill-in-the-blanks items, Items 6 through 15 were dichotomous multiple-choice items, and the last 2 constructed-response items were polytomously scored. Among the 20 items in the last test, Items 1 through 6 were dichotomous fill-in-the-blanks items, Items 7 through 17 were dichotomous multiple-choice items, and the last 3 constructed-response items were polytomously scored. For the current study, only dichotomous items were used. Items 1, 3, 6, 7, and 8 on Occasion 1 were the same as Items 2, 5, 9, 12, and 15 on Occasion 2. Items 1 and 8 on Occasion 1 were the same as Items 5 and 16 on Occasion 3, and Items 7 and 10 on Occasion 2 were the same as Items 13 and 8 on Occasion 3. The three Q-matrices and the test structure are presented in Figures 6 and 7, respectively. Students with missing responses to more than 7 items were removed, while other missing data were treated as missing at random. The final cleaned data set contained 197 students, with 15 dichotomous items on each of the first two occasions and 17 dichotomous items on the last occasion.

Figure 6.

Three K × I_t Q-matrices for the empirical example, where blank means 0, gray means 1, and red or blue squares represent anchor items.

Figure 7.

Test structure for the empirical example. Nonanchor items are omitted. Occasions are in parentheses.

Analysis

Consistent with the simulation study, response data from different occasions were combined and calibrated simultaneously. Accordingly, items on Occasion 2 were recoded as Items 16 through 30 and those on Occasion 3 as Items 31 through 47. Thus, the longitudinal data formed a 197 × 47 matrix, and the longitudinal Q-matrix was constructed as a 47 × 12 matrix.
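The recoding and stacking described above can be sketched as follows. This is a minimal illustration, not the authors' code: the occasion-specific Q-matrices are random placeholders (the real ones appear only in Figure 6), they are stored here as items × attributes (the transpose of the K × I_t layout in the figure), and the helper names `recode` and `stack_q` are hypothetical.

```python
import random

K = 4                             # attributes measured on every occasion
ITEMS_PER_OCCASION = [15, 15, 17]  # dichotomous items kept per occasion


def recode(occasion, item):
    """Map (occasion, within-occasion item number) to the combined item number."""
    offset = sum(ITEMS_PER_OCCASION[:occasion - 1])
    return offset + item


def stack_q(q_list, k):
    """Place each occasion's items-by-attributes Q-matrix on the block diagonal."""
    n_occasions = len(q_list)
    rows = []
    for t, q in enumerate(q_list):
        for q_row in q:
            row = [0] * (k * n_occasions)
            row[t * k:(t + 1) * k] = q_row  # attributes of occasion t only
            rows.append(row)
    return rows


# Placeholder Q-matrices (hypothetical entries, for shape checking only).
random.seed(0)
q_list = [[[random.randint(0, 1) for _ in range(K)] for _ in range(n)]
          for n in ITEMS_PER_OCCASION]

long_q = stack_q(q_list, K)
print(len(long_q), len(long_q[0]))  # 47 12
print(recode(2, 1), recode(3, 17))  # 16 47
```

Under this layout, an item on occasion t loads only on the four attribute columns of occasion t, which is what makes the combined calibration equivalent to a 12-attribute model with structural zeros.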

Two models were fitted to the data: a complete model (denoted as cLong-DINA), in which seven specific dimensions (γ₁–γ₇) were included for all anchor items, and a simple model (denoted as sLong-DINA) that ignored any specific dimensions. As mentioned earlier, θₙ₁ and all γₘ were constrained to follow a standard normal distribution, and the item slopes on each γₘ were constrained to be equal and were estimated. The M₂ statistic (Hansen et al., 2016) was used to evaluate the absolute model data fit, and the Akaike information criterion (AIC; Akaike, 1974) and Bayesian information criterion (BIC; Schwarz, 1978) were computed for each model to evaluate the relative model data fit. The likelihood ratio test (i.e., Δ −2 log-likelihood [Δ −2LL]) was also employed because the sLong-DINA model is nested within the cLong-DINA model.
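One way to see where the count of seven specific dimensions may come from is to group the anchor items reported in the Data Description into sets of identical items. The sketch below assumes that the items in each sentence are paired in the order listed and that each set of identical items shares one specific dimension; both are our reading rather than something the text states explicitly, and the function names are hypothetical.

```python
# Each link joins two (occasion, item) pairs that are reported to be the same item.
anchor_links = [
    ((1, 1), (2, 2)), ((1, 3), (2, 5)), ((1, 6), (2, 9)),
    ((1, 7), (2, 12)), ((1, 8), (2, 15)),   # Occasion 1 with Occasion 2
    ((1, 1), (3, 5)), ((1, 8), (3, 16)),    # Occasion 1 with Occasion 3
    ((2, 7), (3, 13)), ((2, 10), (3, 8)),   # Occasion 2 with Occasion 3
]


def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def anchor_groups(links):
    """Merge linked items into groups of identical items (union-find)."""
    parent = {}
    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


groups = anchor_groups(anchor_links)
for m, group in enumerate(groups, start=1):
    print(f"gamma_{m}: {group}")
print(len(groups))  # 7
```

Under these assumptions, two of the seven groups span all three occasions and the other five span two occasions each.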

Results

Table 4 presents the model data fit indexes of the two compared models. The value of M₂ for the cLong-DINA model was 1,091.88 with 1,030 degrees of freedom (df), and the RMSEA based on M₂ was 0.02. By comparison, the value of M₂ for the sLong-DINA model was 1,097.96 with 1,037 df, and the RMSEA based on M₂ was also 0.02. These results indicate that both the cLong-DINA model and the sLong-DINA model provide reasonably good fit. Additionally, the −2LL of the cLong-DINA model is slightly smaller, which is expected because the cLong-DINA model is more general than the sLong-DINA model. However, AIC and BIC both favored the sLong-DINA model, and the likelihood ratio test also shows that the sLong-DINA model does not fit significantly worse than the cLong-DINA model (Δ −2LL = 1.61, df = 7, p > .05). The estimated sₘ for each specific dimension is presented in Table 5. Under the cLong-DINA model, only the estimates of s₁ and s₃ are higher than 0.01, which suggests that local item dependence among the anchor items had limited impact. This may also explain why AIC and BIC favor the sLong-DINA model. Thus, only the results pertaining to the sLong-DINA model are discussed next.

Table 4.

Summary of Model Data Fit in the Empirical Example

Model	M₂	df	p	RMSEA	NP	−2LL	AIC	BIC
cLong	1,091.88	1,030	.088	.02	98	9,867.89	10,063.89	10,385.65
sLong	1,097.96	1,037	.092	.02	91	9,869.50	10,051.50	10,350.27

Note. cLong = complete Long-DINA model; sLong = simple Long-DINA model; NP = number of estimated parameters; −2LL = −2 log-likelihood; AIC = Akaike information criterion; BIC = Bayesian information criterion; df = degree of freedom; RMSEA = root mean square error of approximation. Boldface means smaller number.
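The relative fit indexes in Table 4 can be reproduced from the reported −2LL, NP, and sample size using the standard definitions AIC = −2LL + 2·NP and BIC = −2LL + NP·ln(N). The sketch below does so and also recomputes the likelihood ratio comparison. The RMSEA formula uses N − 1 in the denominator, which is one common convention and an assumption here, since the text does not state which form was used.

```python
import math

N = 197  # students in the cleaned data set

# Values transcribed from Table 4.
fits = {
    "cLong": {"m2": 1091.88, "df": 1030, "n_par": 98, "neg2ll": 9867.89},
    "sLong": {"m2": 1097.96, "df": 1037, "n_par": 91, "neg2ll": 9869.50},
}


def aic(neg2ll, n_par):
    return neg2ll + 2 * n_par


def bic(neg2ll, n_par, n):
    return neg2ll + n_par * math.log(n)


def rmsea(m2, df, n):
    # Assumed convention: sqrt(max(M2 - df, 0) / (df * (N - 1))).
    return math.sqrt(max(m2 - df, 0.0) / (df * (n - 1)))


for name, f in fits.items():
    print(name,
          round(aic(f["neg2ll"], f["n_par"]), 2),
          round(bic(f["neg2ll"], f["n_par"], N), 2),
          round(rmsea(f["m2"], f["df"], N), 2))

# Likelihood ratio test of the nested sLong model against the cLong model.
delta_neg2ll = fits["sLong"]["neg2ll"] - fits["cLong"]["neg2ll"]
df_diff = fits["cLong"]["n_par"] - fits["sLong"]["n_par"]
CHI2_CRIT_95_DF7 = 14.067  # .95 quantile of the chi-square distribution, 7 df
significant = delta_neg2ll > CHI2_CRIT_95_DF7
print(round(delta_neg2ll, 2), df_diff, significant)
```

The recomputed BIC for the cLong model differs from the tabled value only in the second decimal place, which is consistent with rounding of the reported −2LL.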

Table 5.

Estimated Item Slopes of Specific Dimensions

Fit Model	s₁	s₂	s₃	s₄	s₅	s₆	s₇
cLong	.74 (0.40)	.00 (1.10)	.47 (0.74)	.00 (1.21)	.00 (0.46)	.01 (0.33)	.00 (0.70)

Note. Standard error is in parentheses. cLong = complete Long-DINA model.

Figure 8 presents the overall mean and scale growth of general ability with time. The overall mean growth is $\hat{\mu}_{2} - \hat{\mu}_{1} = 0.34$ and $\hat{\mu}_{3} - \hat{\mu}_{2} = 0.42$; that is, the overall mean growth from Occasion 2 to 3 is slightly larger than that from Occasion 1 to 2. The overall scale growth is $\hat{\sigma}_{2} / \hat{\sigma}_{1} = 1.16$ and $\hat{\sigma}_{3} / \hat{\sigma}_{2} = 1.35$, which means that the gap between students widened over time. To better understand these two concepts, we divided the 197 students into (relatively) high- and (relatively) low-ability groups according to the median of the estimated general ability on Occasion 1. Figure 9 presents the overall mean growth of these two groups with time. The high-ability group grows more than the low-ability group, which grows only slightly and remains almost the same across occasions.

Figure 8.

The overall mean and scale growth of the general ability with time.

Figure 9.

The overall mean growth of the high- and low-ability students.

Figure 10 presents the estimated overall growth of the mean mastery probability across all students with time. In sum, the mean mastery probabilities of all four attributes increase with time. The mastery probability and the growth tendency of Attribute 1 are close to those of Attribute 2, and a similar relationship can be found between Attributes 3 and 4. Figure 11 presents the overall change in the number of students who mastered each attribute with time. Similarly, these numbers increase with time.

Figure 10.

The overall growth of the mean mastery probability of each attribute with time.

Figure 11.

The overall growth of the number of students who mastered each attribute with time.

Table 6 presents the estimated means, variances, and covariances of the general abilities. The correlation between general Abilities 1 and 2 is 0.95, between general Abilities 1 and 3 is 0.91, and between general Abilities 2 and 3 is 0.86. The high correlations may be due to the short time intervals between occasions. In addition, the model-implied (tetrachoric) correlations among attributes are presented in Table 7; these were computed after all respondents had been classified. Moderate correlations are found among attributes both within and across occasions. Such results indicate that the LTA-based method with attribute independence assumptions may oversimplify the real-world complexity.

Table 6.

The Estimated Mean Vector and Variance and Covariance Matrix

Parameters	θ₍₁₎	θ₍₂₎	θ₍₃₎
θ₍₁₎	1.00	0.95	0.91
θ₍₂₎	1.10 (.21)	1.34 (.31)	0.86
θ₍₃₎	1.43 (.24)	1.56 (.36)	2.44 (.49)
	μ₍₁₎	μ₍₂₎	μ₍₃₎
	.00	.34 (.64)	.76 (.46)

Note. The diagonal and lower triangle contain the variances and covariances, and the upper triangle contains the correlations. Standard errors are in parentheses.
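As a check on how the growth summaries relate to Table 6, the sketch below recomputes the overall mean growth (differences of adjacent means), the overall scale growth (ratios of adjacent standard deviations), and the correlations from the reported means, variances, and covariances. It works from the rounded table values, so small discrepancies from the reported figures are to be expected.

```python
import math

# Estimated means, variances, and covariances transcribed from Table 6.
means = [0.00, 0.34, 0.76]
cov = [
    [1.00, 1.10, 1.43],
    [1.10, 1.34, 1.56],
    [1.43, 1.56, 2.44],
]

sd = [math.sqrt(cov[t][t]) for t in range(3)]

# Overall mean growth between adjacent occasions.
mean_growth = [means[t + 1] - means[t] for t in range(2)]

# Overall scale growth: ratio of adjacent standard deviations.
scale_growth = [sd[t + 1] / sd[t] for t in range(2)]


def corr(i, j):
    """Correlation between the general abilities on occasions i+1 and j+1."""
    return cov[i][j] / (sd[i] * sd[j])


print([round(g, 2) for g in mean_growth])   # [0.34, 0.42]
print([round(g, 2) for g in scale_growth])  # [1.16, 1.35]
print(round(corr(0, 1), 3), round(corr(0, 2), 3), round(corr(1, 2), 3))
```

The recomputed correlation between Occasions 1 and 3 is about 0.915, slightly above the reported 0.91. The difference is plausibly due to rounding in the tabled covariances.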

Table 7.

The Model-Implied Correlation Among Attributes

Parameters	α₁₍₁₎	α₂₍₁₎	α₃₍₁₎	α₄₍₁₎	α₁₍₂₎	α₂₍₂₎	α₃₍₂₎	α₄₍₂₎	α₁₍₃₎	α₂₍₃₎	α₃₍₃₎	α₄₍₃₎
α₁₍₁₎	1.00
α₂₍₁₎	.42	1.00
α₃₍₁₎	.38	.51	1.00
α₄₍₁₎	.34	.36	.43	1.00
α₁₍₂₎	.42	.58	.56	.27	1.00
α₂₍₂₎	.41	.65	.52	.50	.69	1.00
α₃₍₂₎	.29	.61	.57	.34	.45	.69	1.00
α₄₍₂₎	.21	.65	.39	.22	.61	.53	.42	1.00
α₁₍₃₎	.38	.39	.36	.22	.19	.10	.28	.12	1.00
α₂₍₃₎	.20	.63	.49	.40	.38	.59	.57	.35	.45	1.00
α₃₍₃₎	.25	.59	.56	.31	.52	.58	.51	.51	.33	.66	1.00
α₄₍₃₎	.15	.45	.50	.29	.43	.34	.53	.39	.55	.55	.44	1.00

Figure 12 presents the change of the posterior mixing proportions with time. Take the attribute patterns (0000) and (1111) as two examples. The posterior mixing proportion of (0000) on Occasions 1, 2, and 3 is .18, .15, and .12, respectively. In contrast, the posterior mixing proportion of (1111) on Occasions 1, 2, and 3 is .14, .21, and .29, respectively. In sum, the proportion of students who master all attributes increases with time, and the proportion of students who master no attributes decreases with time.

Figure 12.

Overall change of posterior mixing proportion with time.

In addition to the overall growth, the growth of individuals can be analyzed with the Long-DINA model. Three examples are presented in Table 8. For student ID = 2, after two rounds of remedial teaching, the general ability increased substantially, from 0.37 to 1.28. Similarly, all attribute mastery statuses changed to 1 by Occasion 3. This indicates that the remedial teaching was effective for this person. By contrast, for student ID = 50, the general ability remained almost constant across the three occasions, and the same holds for this person’s mastery of attributes. This means that the remedial teaching was not effective for this person. In addition, even though the general ability increased from 0.13 to 1.09, the student with an ID of 197 still had not mastered the fourth attribute after two rounds of remedial teaching. Meanwhile, this student may have forgotten the second attribute on the second occasion.

Table 8.

Three Examples of Individual Growth of General Ability and Attributes With Time

Student ID	Growth	Parameter	t = 1	t = 2	t = 3
2	General ability	θ	0.37	0.74	1.28
	Attributes	α₁	1	1	1
		α₂	1	1	1
		α₃	0	1	1
		α₄	0	0	1
50	General ability	θ	−1.04	−0.94	−0.90
	Attributes	α₁	1	1	1
		α₂	0	0	0
		α₃	0	0	0
		α₄	0	0	0
197	General ability	θ	0.13	0.63	1.09
	Attributes	α₁	0	1	1
		α₂	1	0	1
		α₃	0	1	1
		α₄	0	0	0
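The individual profiles in Table 8 can be summarized programmatically. The sketch below, using only the values in the table, reports each student's gain in general ability, the number of attributes mastered on each occasion, and any attribute whose status changed from mastery to nonmastery between adjacent occasions. The function names are hypothetical and serve only this illustration.

```python
# Values transcribed from Table 8. "alpha" holds one mastery pattern
# (alpha1..alpha4) per occasion, in the order t = 1, 2, 3.
table8 = {
    2:   {"theta": [0.37, 0.74, 1.28],
          "alpha": [[1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]},
    50:  {"theta": [-1.04, -0.94, -0.90],
          "alpha": [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]},
    197: {"theta": [0.13, 0.63, 1.09],
          "alpha": [[0, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]},
}


def theta_gain(record):
    """Change in general ability from the first to the last occasion."""
    return round(record["theta"][-1] - record["theta"][0], 2)


def mastered_counts(record):
    """Number of attributes mastered on each occasion."""
    return [sum(pattern) for pattern in record["alpha"]]


def lost_attributes(record):
    """(attribute, from_occasion, to_occasion) for every 1 -> 0 transition."""
    lost = []
    patterns = record["alpha"]
    for t in range(1, len(patterns)):
        for k, (before, after) in enumerate(zip(patterns[t - 1], patterns[t]), start=1):
            if before == 1 and after == 0:
                lost.append((k, t, t + 1))
    return lost


for student_id, record in table8.items():
    print(student_id, theta_gain(record), mastered_counts(record), lost_attributes(record))
```

For student 197 the only loss flagged is the second attribute between Occasions 1 and 2, which matches the interpretation given in the text.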

Overall, the results from fitting the Long-DINA model to the data indicate that the remedial teaching was more effective for high-performing students than for low-performing students. This result is consistent with the Matthew effect in education (Walberg & Tsai, 1983), whereby students starting out at a higher level gain more on average than students starting at a lower level of proficiency (von Davier et al., 2011).

Conclusions and Discussions

This study proposed a longitudinal diagnostic classification modeling approach for measuring individual growth, especially for the anchor-item design (it can also be used in the repeated-measures design). Unlike the LTA-based method, the new modeling approach estimates the overall and individual growth and simultaneously retains the advantages of the higher-order latent structure (e.g., reduction in the number of model parameters) by constructing a multidimensional higher-order latent structure to take into account the correlations among multiple attributes. Additionally, potential local item dependence among anchor items can be taken into account. An empirical example was analyzed to illustrate the application and advantages of the proposed modeling approach.

The proposed modeling approach is a first attempt at measuring individual growth in cognitive diagnostic assessments by incorporating a multidimensional higher-order latent structure. Despite the promising findings, further study is still needed. (a) Only a DINA-based model was employed to illustrate the modeling approach. Although the approach can be easily extended to the LCDM and its special cases, its performance based on other DCMs still needs further investigation. (b) Currently, only the single-group situation was considered; an extension to multiple-group modeling (e.g., von Davier et al., 2011) can be explored in the future. (c) In practice, students are nested in classrooms, and classrooms are further nested in schools. Thus, multilevel modeling (e.g., Fox & Glas, 2001; Huang, 2015; Jiao & Zhang, 2015) can also be incorporated into the third order of the proposed modeling approach. (d) Theoretically, polytomous attributes (Karelitz, 2004) provide more information than dichotomous attributes in describing growth in longitudinal studies, as the former are more refined than the latter. Although the proposed modeling approach currently focuses on binary attributes, there is no conceptual challenge in extending the idea to model polytomous attributes by using the polytomous higher-order latent structural model (Zhan, Wang, & Li, in press). (e) Detailed comparisons between the proposed approach and other longitudinal diagnosis methods (e.g., transition probability-based methods) under the same conditions could be an interesting topic in the future. (f) In our empirical example, most respondents are allocated to patterns in which the first or the second attribute is mastered, whereas fewer respondents are allocated to patterns in which neither of the first two attributes is mastered (see Figure 12), which suggests that these four attributes may follow a hierarchical structure (Leighton, Gierl, & Hunka, 2004). It would be meaningful and practical to explore how to apply the Long-DINA model to hierarchical attribute situations. (g) Recently, some studies have focused on utilizing response time information in cognitive diagnosis (e.g., Minchen, de la Torre, & Liu, 2017; Zhan, Jiao, et al., 2018). How to incorporate response time information into the proposed longitudinal modeling approach is also an interesting topic (e.g., Wang, Zhang, Douglas, & Culpepper, 2018). (h) Parameters in the Long-DINA model can also be estimated by using Bayesian Markov chain Monte Carlo algorithms; a tutorial is given by Zhan, Jiao, and Man (2017). Finally, it should be noted that the current Long-DINA model is already complicated, which leads to a heavy computing burden, especially for complex test situations (e.g., more occasions, more attributes, and more anchor items). Thus, computing capability should also be considered in further extensions.

It is worth noting that a necessary condition must be satisfied when using the proposed modeling approach: the latent attributes measured by multiple tests must be invariant over time; that is, the achievement construct must not shift across occasions. Occasionally, this assumption may be violated in practice. For instance, in cognitive areas (e.g., mathematics and reading), the target dimensions may change as students’ grade levels increase (Wang & Jiao, 2009; Wang et al., 2013). In such cases, different attributes due to construct shift may be examined in multiple measures on different occasions. Therefore, the general abilities on different occasions may have different meanings (i.e., contain different target attributes). The complexity in computation and interpretation of such an extension needs further exploration.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant No. 31600908) and the Key Program of Educational Science Planning of Zhejiang Province, China (Grant No. 2019SB112), and the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (Grant No. 19YJC190025).

References

Adams, R. J., Wilson, & Wang (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23. doi:10.1177/0146621697211001

Akaike (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Andersen, E. B. (1985). Estimating latent correlations between repeated testings. Psychometrika, 50, 3–16. doi:10.1007/BF02294143

Andrade, D. F., & Tavares, H. R. (2005). Item response theory for longitudinal data: Population parameter estimation. Journal of Multivariate Analysis, 95, 1–22. doi:10.1016/j.jmva.2004.07.005

Bianconcini (2012). A general multivariate latent growth model with applications to student achievement. Journal of Educational and Behavioral Statistics, 37, 339–364. doi:10.3102/1076998610396886

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: An application of an EM algorithm. Psychometrika, 46, 443–459. doi:10.1007/BF02293801

Bradlow, E. T., Wainer, & Wang (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168. doi:10.1007/BF02294533

Cai (2010). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75, 33–57. doi:10.1007/s11336-009-9136-x

Cai (2015). flexMIRT® version 3.00: Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

Chen, Culpepper, S. A., Wang, & Douglas (2017). A hidden Markov model for learning trajectories in cognitive diagnosis with application to spatial rotation skills. Applied Psychological Measurement, 42, 5–23. doi:10.1177/0146621617721250

Collins, L. M., & Wugalter, S. E. (1992). Latent class models for stage-sequential dynamic latent variables. Multivariate Behavioral Research, 27, 131–157. doi:10.1207/s15327906mbr2701_8

DeCarlo, L. T. (2011). On the analysis of fraction subtraction data: The DINA model, classification, latent class sizes, and the Q-matrix. Applied Psychological Measurement, 35, 8–26. doi:10.1177/0146621610377081

de la Torre (2011). The generalized DINA model framework. Psychometrika, 76, 179–199. doi:10.1007/s11336-011-9207-7

de la Torre & Douglas (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353. doi:10.1007/BF02295640

de la Torre, Hong, & Deng (2010). Factors affecting the item parameter estimation and classification accuracy of the DINA model. Journal of Educational Measurement, 47, 227–249. doi:10.1111/j.1745-3984.2010.00110.x

Embretson, S. E. (1991). Implications of a multidimensional latent trait model for measuring change. In Collins, L. M., & Horn, J. L. (Eds.), Best methods for the analysis of change: Recent advances, unanswered questions, future directions (pp. 184–197). Washington, DC: American Psychological Association.

Fischer, G. H. (1995). Some neglected problems in IRT. Psychometrika, 60, 459–487. doi:10.1007/BF02294324

Fox, J. P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66, 271–288. doi:10.1007/BF02294839

Haertel, E. H. (1989). Using restricted latent class models to map the skill structure of achievement items. Journal of Educational Measurement, 26, 301–321. doi:10.1111/j.1745-3984.1989.tb00336.x

Hansen (2013). Hierarchical item response models for cognitive diagnosis. Unpublished doctoral dissertation, University of California, Los Angeles, CA.

Hansen, Cai, & Monroe (2016). Limited-information goodness-of-fit testing of diagnostic classification item response models. British Journal of Mathematical and Statistical Psychology, 69, 225–252. doi:10.1111/bmsp.12074

Henson, Templin, & Willse (2009). Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika, 74, 191–210. doi:10.1007/s11336-008-9089-5

Huang, H.-Y. (2015). A multilevel higher-order item response theory model for measuring latent growth in longitudinal data. Applied Psychological Measurement, 39, 362–372. doi:10.1177/0146621614568112

Jiao, Kamata, Wang, & Jin (2012). A multilevel testlet model for dual local dependence. Journal of Educational Measurement, 49, 82–100. doi:10.1111/j.1745-3984.2011.00161.x

Jiao & Zhang (2015). Polytomous multilevel testlet models for testlet-based assessments with complex sampling designs. British Journal of Mathematical and Statistical Psychology, 68, 65–83. doi:10.1111/bmsp.12035

Junker, B. W., & Sijtsma (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272. doi:10.1177/01466210122032064

Karelitz, T. M. (2004). Ordered category attribute coding framework for cognitive assessments (Unpublished doctoral dissertation). University of Illinois at Urbana-Champaign, Champaign, IL, USA.

Kaya & Leite, W. L. (2017). Assessing change in latent skills across time with longitudinal cognitive diagnosis modeling: An evaluation of model performance. Educational and Psychological Measurement, 77, 369–388. doi:10.1177/0013164416659314

Kohli & Harring, J. R. (2013). Modeling growth in latent variables using a piecewise function. Multivariate Behavioral Research, 48, 370–397. doi:10.1080/00273171.2013.778191

Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for cognitive assessment: A variation on Tatsuoka’s rule-space approach. Journal of Educational Measurement, 41, 205–237. doi:10.1111/j.1745-3984.2004.tb01163.x

Cohen, Bottge, & Templin (2016). A latent transition analysis model for assessing change in cognitive skills. Educational and Psychological Measurement, 76, 181–204. doi:10.1177/0013164415588946

de la Torre (2016). GDINA: The generalized DINA model framework (R package version 1.2.1). Retrieved from http://CRAN.R-project.org/package=GDINA

Macready, G. B., & Dayton, C. M. (1977). The use of probabilistic models in the assessment of mastery. Journal of Educational and Behavioral Statistics, 2, 99–120. doi:10.3102/10769986002002099

Minchen, de la Torre, & Liu (2017). A cognitive diagnosis model for continuous response. Journal of Educational and Behavioral Statistics, 34, 115–130. doi:10.3102/1076998617703060

Paek, Park, H.-J., Cai, & Chi (2014). A comparison of three IRT approaches to examinee ability change modeling in a single-group anchor test design. Educational and Psychological Measurement, 74, 659–676. doi:10.1177/0013164413507062

Rijmen (2011). Hierarchical factor item response theory models for PIRLS: Capturing clustering effects at multiple levels. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 4, 59–74.

Rupp, Templin, & Henson (2010). Diagnostic measurement: Theory, methods, and applications. New York, NY: Guilford Press.

Schwarz (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Sireci, S. G., Thissen, & Wainer (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237–247. doi:10.1111/j.1745-3984.1991.tb00356.x

Tao & Cao (2016). An extension of IRT-based equating to the dichotomous testlet response theory model. Applied Measurement in Education, 29, 108–121. doi:10.1080/08957347.2016.1138956

Tatsuoka, K. K. (1983). Rule Space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354. doi:10.1111/j.1745-3984.1983.tb00212.x

te Marvelde, J. M., Glas, C. A., Van Landeghem, & Van Damme (2006). Application of multidimensional item response theory models to longitudinal data. Educational and Psychological Measurement, 66, 5–34. doi:10.1177/0013164405282490

Templin & Henson (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11, 287–305. doi:10.1037/1082-989X.11.3.287

Van de Pol & Langeheine (1990). Mixed Markov latent class models. Sociological Methodology, 20, 213–247. doi:10.2307/271087

Vermunt, J. K., Tran, & Magidson (2008). Latent class models in longitudinal research. In Menard (Ed.), Handbook of longitudinal research: Design, measurement, and analysis (pp. 373–385). Burlington, MA: Elsevier.

von Davier (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61, 287–307. doi:10.1002/j.2333-8504.2005.tb01993.x

von Davier (2014). The DINA model as a constrained general diagnostic model: Two variants of a model equivalency. British Journal of Mathematical and Statistical Psychology, 67, 49–71. doi:10.1111/bmsp.12003

von Davier & Carstensen, C. H. (2011). Measuring growth in a longitudinal large-scale assessment with a general latent variable model. Psychometrika, 76, 318–336. doi:10.1007/s11336-011-9202-z

Walberg, H. J., & Tsai, S.-L. (1983). Matthew effects in education. American Educational Research Journal, 20, 359–373. doi:10.3102/00028312020003359

Wang, Kohli, & Henn (2016). A second-order longitudinal model for binary outcomes: Item response theory versus structural equation modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23, 455–465. doi:10.1080/10705511.2015.1096744

Wang & Jiao (2009). Construct equivalence across grades in a vertical scale for a K-12 large-scale reading assessment. Educational and Psychological Measurement, 69, 760–777. doi:10.1177/0013164409332230

Wang, Jiao, & Zhang (2013). Validation of longitudinal achievement constructs of vertically scaled computerized adaptive tests: A multiple-indicator, latent-growth modelling approach. International Journal of Quantitative Research in Education, 1, 383–407. doi:10.1504/IJQRE.2013.058307

Wang, Yang, Culpepper, S. A., & Douglas, J. A. (2018). Tracking skill acquisition with cognitive diagnosis models: A higher-order, hidden Markov model with covariates. Journal of Educational and Behavioral Statistics, 43, 57–87.

Wang, Zhang, Douglas, & Culpepper (2018). Using response times to assess learning progress: A joint model for responses and response times. Measurement: Interdisciplinary Research and Perspectives, 16, 45–58. doi:10.1080/15366367.2018.1435105

Wang, W.-C., & Wilson (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126–149. doi:10.1177/0146621604271053

Zhan, Jiao, & Liao (2018). Cognitive diagnosis modelling incorporating item response times. British Journal of Mathematical and Statistical Psychology, 71, 262–286. doi:10.1111/bmsp.12114

Zhan, Jiao, & Man (2017). Using JAGS for Bayesian cognitive diagnosis modeling: A tutorial. arXiv preprint:1708.02632. Retrieved from https://arxiv.org/abs/1708.02632

Zhan, Wang, W.-C., Bian, & Wang (2015). The multidimensional testlet-effect cognitive diagnostic models. Acta Psychologica Sinica, 47, 689–701. doi:10.3724/SP.J.1041.2015.00689

Zhan, Liao, & Bian (2018). Joint testlet cognitive diagnosis modeling for paired local item dependence in response times and response accuracy. Frontiers in Psychology, 9, 607. doi:10.3389/fpsyg.2018.00607

Zhan, Wang, W.-C., & Wang (2014). The multidimensional testlet-effect Rasch model. Acta Psychologica Sinica, 46, 1208–1222. doi:10.3724/SP.J.1041.2014.01218

Zhan, Wang, W.-C., & Li (in press). A partial mastery, higher-order latent structural model for polytomous attributes in cognitive diagnostic assessments. Journal of Classification.

Liu (2010). Observed score equating using discrete and passage-based anchor items. Journal of Educational Measurement, 47, 395–412. doi:10.1111/j.1745-3984.2010.00120.x