Abstract
Interim assessment occurs throughout instruction to provide feedback about what students know and have achieved. Whereas currently available cognitive diagnostic computerized adaptive testing (CD-CAT) designs focus on assessment at a single time point, the authors discuss several designs of interim CD-CAT suited to the learning context. The interim CD-CAT differs from currently available CD-CAT designs primarily because students’ mastery profiles (i.e., skills mastery) change due to learning, and new attributes are added periodically. Moreover, hierarchies exist among attributes taught sequentially, and such information could be used during item selection. Two specific designs are considered. In the first, new attributes are taught in Stage II, but the student’s mastery status on the previously taught attributes stays the same. In the second, new attributes are taught and previously taught attributes can also be further learned or forgotten in Stage II. For both designs, the authors propose an individual prior, which considers a person’s learning history and the population learning model, to start an interim CD-CAT. Simulation results show that the Stage II CD-CAT using the individual prior outperforms the methods using population priors. The GDINA (generalized deterministic inputs, noisy, “and” gate) diagnostic index (GDI) is extended to accommodate item hierarchies, and analytic results are provided to further illustrate the types of items that are most popular during item selection. As the first study that focuses on the application of CD-CAT in a learning context, the methods and results presented herein show the great promise of using CD-CAT to monitor learning.
To fully embody the potential of computerized adaptive testing (CAT) for facilitating individualized learning on a mass scale (Quellmalz & Pellegrino, 2009), a CAT should have built-in diagnostic features. Kingsbury (2009) has dubbed adaptive tests geared toward cognitive diagnosis as “Idiosyncratic CAT” and has found promising applications in providing teachers with information for targeted instruction. CAT based on cognitive diagnostic models (CDMs)—namely, CD-CAT (cognitive diagnostic computerized adaptive testing)—has become a psychometrically sound option. CDMs provide a level of control in scaling, linking, and item banking that is unavailable with other simpler subscore methods.
A large body of simulation work has explored different item selection methods. For example, Xu et al. (2016) provided a non-asymptotic theory-based approach to guide initial item selection in CD-CAT, which considerably reduces the test length. In addition, well-established information-based criteria such as the mutual information index (C. Wang, 2013), the modified posterior-weighted Kullback–Leibler index (MPWKL; Kaplan et al., 2015), or the GDINA (generalized deterministic inputs, noisy, “and” gate) diagnostic index (GDI; Kaplan et al., 2015) have been shown to be highly efficient. However, these available CD-CAT methods mostly focus on assessment at a single time point (e.g., beginning- or end-of-semester). As a result, the test may start randomly (if no a priori information about the test takers is available) and the items usually cover a broad range of attributes with the goal of maximizing the precision of the entire cognitive profile. In contrast, item selection for interim CD-CAT should take into account the general learning model as well as an individual student’s learning history. Developing CD-CAT for such a purpose is essential to close the learning-assessment feedback loop. The interim CD-CAT differs from currently available CD-CAT designs primarily because students’ mastery profile (denoted as α) changes due to learning, and new attributes are added periodically. Therefore, at the
In this article, the authors explore viable interim CD-CAT designs that reflect the above-mentioned features. The unique contribution of the study is twofold: (a) although using individual collateral information to start CAT is not new (e.g., van der Linden, 1999), this is the first study to position CD-CAT in a learning context with changing sets of attributes, so the application context is unique. Previous studies either considered cross-sectional CD-CAT or assumed that the collection of attributes stays the same over time. (b) The study illustrates the performance of GDI when an attribute hierarchy exists and provides analytical results on the types of items that are preferred during item selection. The findings will shed light on future item bank design. In what follows, the authors first briefly introduce CDMs and GDI, followed by the interim CD-CAT designs as well as relevant analytic results. Then two simulation studies are presented to demonstrate the performance of interim CD-CAT under different scenarios.
CDMs
To support these next-generation assessments aimed at providing fine-grained feedback for students and teachers (Leighton & Gierl, 2007; Templin & Bradshaw, 2014), CDMs have arisen as advanced psychometric models in the past few decades. In essence, CDMs are restrictive latent class models that uncover the skills/attributes a student possesses at the time of assessment. Each latent profile constitutes one latent class. Denote the mastery profile of a person by
To model the item responses as a function of item and person characteristics, de la Torre (2011) proposed a GDINA model, which is one of the most flexible CDMs. The GDINA model specifies the probability that person i answers item j correctly as follows:
where
GDI
The GDI is used as the basic item selection method in this article because it is computationally fast and performs as well as, and sometimes better than, computationally more intensive methods such as the mutual information method (C. Wang, 2013) or the modified PWKL (Kaplan et al., 2015; Xu et al., 2016). The GDI measures the weighted variance of the conditional success probabilities of an item. The definition is as follows. Let
where
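To make the verbal definition concrete, the following minimal sketch computes the GDI of a single item as the posterior-weighted variance of its conditional success probabilities over the reduced attribute patterns. The function name `gdi`, the argument names, and the example item parameters are hypothetical and chosen only for illustration; they are not taken from the article.

```python
import numpy as np

def gdi(success_prob, posterior):
    """Posterior-weighted variance of an item's success probabilities.

    success_prob[c]: P(X_j = 1 | reduced attribute pattern c)
    posterior[c]:    current posterior weight of reduced pattern c
    """
    success_prob = np.asarray(success_prob, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    posterior = posterior / posterior.sum()        # guard against unnormalized weights
    mean_p = np.sum(posterior * success_prob)      # weighted mean success probability
    return float(np.sum(posterior * (success_prob - mean_p) ** 2))

# Hypothetical single-attribute items evaluated under a uniform posterior.
uniform = [0.5, 0.5]
sharp_item = gdi([0.1, 0.9], uniform)   # separates masters from non-masters
flat_item = gdi([0.4, 0.6], uniform)    # separates them only weakly
```

In this illustration the more discriminating item receives the larger index, so it would be preferred when items are ranked by GDI.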
Interim CD-CAT Design
In this section, several design aspects of interim CD-CAT are discussed in turn: (a) efficiently synthesizing information from prior stages to update α when the size of α keeps growing over time, (b) selecting items in the presence of attribute hierarchies, and (c) updating prior information to take learning into account.
Specifically, at the (t+ 1)th learning stage, if a set of skills denoted by
Update of α
First of all, during a CD-CAT in which the size of α is fixed, one can save computation time in the sequential maximum a posteriori (MAP; Huebner & Wang, 2011) update of α by using the posterior density from the previous step as the current prior. That is, suppose the cardinality of
where
Compared with Equation 3, Equation 4 circumvents the need of cycling through all n items to compute the likelihood, hence reducing the computation time, in particular when either n or
Second, as alluded to earlier, during an instructional unit, new skills are taught and evaluated periodically. Hence, the posterior density of
where
It should be noted that, even though the conditional correct response probability of an item depends only on the reduced vector (i.e., Equation 1), and so is GDI, the posterior density of the full vector will be updated. In other words, if an item measures
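The computational point of this section, that carrying the previous posterior forward as the prior gives the same result as recomputing the likelihood over all administered items, can be checked numerically. The sketch below is only an illustration under assumed values: the two-attribute latent space (classes ordered 00, 10, 01, 11), the three items, their success probabilities, and the helper names are all hypothetical. Note that the first item measures only the first attribute, yet the posterior over all four full patterns is updated.

```python
import numpy as np

def item_likelihood(p_correct, x):
    """Likelihood of a 0/1 response x under each latent class."""
    p_correct = np.asarray(p_correct, dtype=float)
    return p_correct if x == 1 else 1.0 - p_correct

def update_posterior(prior, p_correct, x):
    """One sequential step: the previous posterior acts as the prior."""
    post = np.asarray(prior, dtype=float) * item_likelihood(p_correct, x)
    return post / post.sum()

# Hypothetical success probabilities per latent class (00, 10, 01, 11).
P = np.array([[0.2, 0.8, 0.2, 0.8],    # item measuring attribute 1 only
              [0.1, 0.1, 0.9, 0.9],    # item measuring attribute 2 only
              [0.1, 0.3, 0.3, 0.9]])   # item measuring both attributes
x = [1, 0, 1]                          # observed responses
prior = np.full(4, 0.25)

# Sequential update: one multiplication per newly administered item.
post_seq = prior.copy()
for p_j, x_j in zip(P, x):
    post_seq = update_posterior(post_seq, p_j, x_j)

# Batch update: cycle through all items to rebuild the full likelihood.
lik = np.prod([item_likelihood(p_j, x_j) for p_j, x_j in zip(P, x)], axis=0)
post_batch = prior * lik / np.sum(prior * lik)

map_class = int(np.argmax(post_seq))   # index of the MAP latent class
```

Both routes yield the same posterior; the sequential route simply avoids revisiting earlier items at every step.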
Item Selection
Given the many desirable features of GDI, it could be used for item selection in interim CD-CAT. At stage t, the eligible items would be those that require one or multiple skills in
Besides this simple modification, it would be interesting to delve deeper into the types of items that are preferable because this knowledge would guide future item bank design. When all attributes are independent, Xu et al. (2016) proposed a non-asymptotic theory-based approach to guide initial item selection in CD-CAT. In particular, they derived the minimum number of items required to identify the attribute pattern of a student as well as the specific types of initial items that are required to reach the optimal classification results, under both ideal and practical scenarios. Their proposal was later used by Chang et al. (2018) in their nonparametric CD-CAT design. While Xu et al.’s (2016) results are suitable for independent attribute structures, the authors extend the results to scenarios with attribute hierarchies. The two lemmas presented below are for the ideal case of conjunctive CDM when students answer correctly all questions they are capable of and incorrectly otherwise. Their proofs are presented in the online appendix. The conclusions will be empirically verified in non-ideal cases via simulation studies. On a side note, the Q-matrix does not necessarily have to conform to the attribute hierarchies (Templin & Bradshaw, 2014; Templin & Hoffman, 2013).
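The ideal-case assumption used in the lemmas can be stated operationally: under a conjunctive CDM without slipping or guessing, a student answers an item correctly exactly when every attribute required by the item's q-vector is mastered. A minimal sketch follows; the function name and the example vectors are illustrative assumptions.

```python
import numpy as np

def ideal_response(alpha, q):
    """Return 1 if every attribute required by q is mastered in alpha, else 0."""
    alpha = np.asarray(alpha)
    q = np.asarray(q)
    return int(np.all(alpha >= q))

# Hypothetical student who has mastered the first two of four attributes.
alpha = (1, 1, 0, 0)
r_both = ideal_response(alpha, (1, 1, 0, 0))    # requires attributes 1 and 2
r_third = ideal_response(alpha, (0, 0, 1, 0))   # requires attribute 3
r_first = ideal_response(alpha, (1, 0, 0, 0))   # requires attribute 1 only
```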
Lemma 1: Suppose a test intends to measure K attributes with the attribute hierarchies summarized in a K-by-K reachability matrix
For instance, when K = 4 and the attributes exhibit the divergent structure shown in Figure 1, the reachability matrix takes the form of

Figure 1. An illustration of the divergent structure.
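As a rough illustration of how a reachability matrix encodes an attribute hierarchy, the sketch below derives one from direct prerequisite links and then enumerates the permissible attribute patterns. The hierarchy used here (attribute 1 precedes attribute 2, which in turn precedes attributes 3 and 4) is a hypothetical divergent structure chosen for the example and may not match Figure 1. The convention that entry (j, k) equals 1 when attribute j is a direct or indirect prerequisite of attribute k (or j = k) is likewise an assumption.

```python
import itertools
import numpy as np

def reachability(adjacency):
    """Reachability matrix from direct prerequisite links (self-links included)."""
    A = np.asarray(adjacency, dtype=int)
    K = A.shape[0]
    R = np.linalg.matrix_power(A + np.eye(K, dtype=int), K)
    return (R > 0).astype(int)

def permissible_patterns(R):
    """All 0/1 patterns in which every mastered attribute has its prerequisites mastered."""
    K = R.shape[0]
    keep = []
    for alpha in itertools.product([0, 1], repeat=K):
        a = np.array(alpha)
        # Mastering attribute k requires mastering every j with R[j, k] = 1.
        if all(a[R[:, k] == 1].all() for k in range(K) if a[k] == 1):
            keep.append(alpha)
    return keep

# Hypothetical direct links: A1 -> A2, A2 -> A3, A2 -> A4.
adjacency = np.zeros((4, 4), dtype=int)
adjacency[0, 1] = adjacency[1, 2] = adjacency[1, 3] = 1

R = reachability(adjacency)
patterns = permissible_patterns(R)
```

Under this assumed hierarchy only 6 of the 16 possible patterns remain permissible, which is the kind of reduction in the latent space that the hierarchy-aware prior exploits.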
Lemma 2 (Extension of Theorem 1 in Xu et al., 2016): Consider a person with attribute profile α = 0. In the ideal case, to identify
Moreover, consider a person with attribute profile
C1: There are
where
C2: There is a set of item(s) that requires none of the last
Again, use K = 4 and the divergent structure in Figure 1 as an example. As all attributes form a single cluster, when α = 0, only one item with
Similarly, if α = (1,1,0,0), then we only need two items with the following Q-matrix:
For sequential item selection, Theorem 2 in Xu et al. (2016) cannot be easily extended due to the complication of the attribute hierarchies. That is, the first item does not have to be a single-attribute item anymore. However, the general conclusion in Xu et al. (2016) still holds: During item selection, when the answer to an item is wrong, implying that the corresponding attributes may not be mastered, then the non-mastered attributes should not be required by the next item. On the contrary, if the answer is right, then the mastered attribute could be “unspecified” for the next item.
Using Figure 1 as an example, one interesting takeaway message from Lemma 2 is when
When Learning Happens
Learning could happen between two interim CD-CATs, which means a student’s
To be specific, their model could produce either attribute-level or pattern-level first-order transition probabilities. That is, between any two time points, the attribute-level transition probability spells out as
Of course,
Now with pattern-level transition probabilities known from a learning model, the authors propose to update priors of
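One simple way to picture a prior update driven by pattern-level transition probabilities is to propagate a distribution over Time I patterns through a row-stochastic transition matrix. The sketch below does this for a single attribute; it is an illustration under assumed values, not the authors' exact update, and it ignores the addition of new attributes. The function name, the learning rate of 0.6, the forgetting rate of 0.1, and the two starting distributions are all hypothetical.

```python
import numpy as np

def next_stage_prior(distribution, transition):
    """Propagate a distribution over patterns through a transition matrix.

    transition[a, b] = P(pattern b at the next stage | pattern a now).
    """
    d = np.asarray(distribution, dtype=float)
    T = np.asarray(transition, dtype=float)
    if not np.allclose(T.sum(axis=1), 1.0):
        raise ValueError("each row of the transition matrix must sum to 1")
    prior = d @ T
    return prior / prior.sum()

# Hypothetical single attribute: states are non-mastery (0) and mastery (1).
T = np.array([[0.4, 0.6],    # non-master: stays with 0.4, learns with 0.6
              [0.1, 0.9]])   # master: forgets with 0.1, retains with 0.9

individual_posterior = np.array([0.8, 0.2])   # one student's Time I posterior
population_dist = np.array([0.5, 0.5])        # population distribution at Time I

individual_prior = next_stage_prior(individual_posterior, T)
population_prior = next_stage_prior(population_dist, T)
```

The individual prior starts from a particular student's Time I posterior, whereas the population prior starts from the population distribution, so the two differ whenever the student's history is informative.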
Simulation Studies
Two simulation studies were conducted to mimic typical scenarios in a learning context. In both cases, two time points were considered for illustration, but the methods could be used with additional time points. In the first scenario, new attributes are added at Time II, and these new attributes may require the previously taught attributes as prerequisites. In the second and more interesting scenario, not only are new attributes added at Time II, but students’ true mastery profiles on the attributes measured at Time I also change due to learning and forgetting. The simulation designs and results for these two studies are presented in detail below.
Study I Design
Manipulated factors
In Study I, the three manipulated factors are (a) the total number of attributes measured across two time points, K = 6 or 10; (b) the relationship between the attributes tested at two time points, that is, independent or hierarchical; and (c) the underlying CDM, additive CDM (ACDM) or GDINA, both with identity link. Even though attribute hierarchies could take on different forms, the authors believe one type is sufficiently representative to show the general trend in the comparison between independent and hierarchical structures. The attribute hierarchies are shown in Figure 2 for both levels of K. When the attributes are independent, there are no direct linkages among any attributes. Fully crossing the first two factors results in four simulated conditions under each CDM, or eight conditions in total.

Figure 2. Attribute hierarchies in the simulation study. (A) K = 6 and (B) K = 10.
Item bank construction
The item bank size was set in proportion to K. When K = 6, there are 480 items in the item bank, with the Q-matrix designed as follows: The first 150 items only measure
As to the item parameters for ACDM with identity link, the intercepts were simulated from a uniform distribution U(0.01, 0.2), and the main effects were simulated as
As to the item parameters for the GDINA model (de la Torre, 2011), a generic item parameter,
When K = 10, there are 600 items in the item bank, with the Q-matrix designed similarly to the previous condition, as follows. The first 180 items only measure
Sample construction
The sample size in the four conditions was simulated to be proportional to the number of permissible latent classes, that is, N = 100 × number of permissible latent classes. Although the total sample size was not held equal between the K = 6 and K = 10 conditions due to the intrinsic difference in the dimensionality of the two latent spaces, the number of people per permissible latent class was held equal. The total sample size therefore is 2,400 (K = 6, hierarchical), 6,400 (K = 6, independent), 16,800 (K = 10, hierarchical), and 102,400 (K = 10, independent). These are typical sample sizes used in other CD-CAT research (e.g., Kaplan et al., 2015; Xu et al., 2016).
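The reported totals can be checked against the rule of 100 simulees per permissible latent class. The short sketch below backs out the number of permissible classes implied by each reported sample size and compares the independent conditions with 2 to the power K; the dictionary layout and variable names are illustrative only.

```python
# Reported total sample sizes per condition (from the text above).
reported_n = {
    (6, "hierarchical"): 2_400,
    (6, "independent"): 6_400,
    (10, "hierarchical"): 16_800,
    (10, "independent"): 102_400,
}
per_class = 100  # simulees per permissible latent class

# Number of permissible latent classes implied by each total.
implied_classes = {cond: n // per_class for cond, n in reported_n.items()}

# With independent attributes every one of the 2**K patterns is permissible.
independent_ok = all(
    implied_classes[(K, "independent")] == 2 ** K for K in (6, 10)
)
```

The independent conditions match 2^6 = 64 and 2^10 = 1,024 classes, and the hierarchical totals imply 24 and 168 permissible classes, respectively.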
Two methods were considered for Time II CD-CAT design: The baseline method that only exploits population-level information of attribute hierarchies, that is,
Study I Results
Table 1 presents the pattern and attribute recovery rates under different manipulated conditions. For attribute recovery, we present the mean recovery rate for the first three (or five) and last three (or five) separately as they show different patterns under restricted and non-restricted conditions. The raw attribute recovery rates per condition are presented in the online appendix (Tables A2–A5). The results consistently show that using individual priors produces higher pattern and attribute recovery rates than using population priors. The difference in pattern recovery rate ranges from 5% (e.g., GDINA, K = 10, hierarchical, non-restricted) to almost 40% (e.g., ACDM, K = 10, independent, restricted). When we restrict item selections in Time II CD-CAT to items that measure at least one of the last three attributes (i.e.,
Table 1. Study I: Pattern and Average Attribute Recovery Rates Under Different Manipulated Conditions.
Note. “P” denotes population prior, whereas “I” denotes individual prior. “Restricted” condition refers to the scenario where item selection in Stage II CD-CAT is from the 330 items that measure at least one of the three attributes, that is,
To further explore the types of items (i.e., items with certain q-vectors) that are more likely to be selected, we draw the heat map of item exposure in Time II conditioning on the different true
Study II Design
The same eight conditions were considered in Study II. The item bank was also the same as in Study I. In addition, to model learning transitions between Time I and Time II, we assumed the attribute-level learning rate,
To form the
Assuming the attribute hierarchies follow the pattern in Figure 2, we have
Three methods are considered for Time II CD-CAT. (a) The baseline method exploits attribute hierarchies (if they exist). That is, the population prior used at Time II is a uniform prior, excluding impermissible latent classes that violate the hierarchical relationship. This method takes neither individual response history from Time I nor learning transition patterns into consideration. (b) The population prior method uses both attribute hierarchies and learning transitions. Because the transition matrix constructed using Equation 6 already accounts for attribute hierarchy, for this method, the prior for
Study II Results
Table 2 presents the pattern and mean attribute recovery rates for Time II CD-CAT from different conditions. Several patterns can be summarized from the table. First, using the individual prior consistently produces the highest pattern and mean attribute recovery rates in all conditions, followed by using the population prior, whereas the baseline condition yields the lowest recovery rates. Second, when the learning rate is high and the forgetting rate is low, the pattern recovery rates are on average 8% to 16% higher than in the low learning and high forgetting rate conditions when there is an attribute hierarchy, and the difference is around 15% to 27% when the attributes are independent. The improvement in mean attribute recovery rate is also quite visible. This is because when the learning rate is high and the forgetting rate is low, the individual prior distribution in Time II CD-CAT is more informative and concentrated, as reflected by its smaller entropy (see Figure A4 in the online appendix, for instance). Although the baseline method does not take advantage of the individual priors, having a higher learning rate leads to higher correlations among the attributes, which in turn leads to more accurate results (see C. Wang, 2013). Third, and consistent with the findings in Study I, when attributes have a hierarchical relationship instead of an independent structure, the recovery rates are uniformly higher. Fourth, and again consistent with the findings in Study I, restricting item selection in Time II CD-CAT to a subset of the item pool lowers the pattern recovery rate, mainly due to the poor recovery of the first three (or five) attributes. Fifth, and unsurprisingly, using the GDINA model leads to much higher α recovery rates across all conditions.
Finally, the difference in pattern recovery rate between using the individual prior and the population prior is only about 0.1% to 0.8% in most conditions, implying that the population learning model and attribute hierarchy provide sufficient information such that using individual response history from Time I is only marginally beneficial. Figure A5 provides further evidence of the difference between population and individual priors, using the condition of “K = 6, hierarchical, non-restricted, high-learning, and ACDM” as an example. As shown, the population prior for each permissible pattern is roughly the center of the individual priors, and both priors are 0 for impermissible patterns. This finding matters in an online learning context with live streaming, where there may be hundreds of thousands of users and storing each individual user’s data on the fly may pose a challenge to the server. In this case, using the population prior is preferable because it yields nearly the same precision as the individual prior.
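The entropy argument made under the second finding can be illustrated with a small computation: a prior concentrated on a few latent classes has lower Shannon entropy than a diffuse one, and structural zeros for impermissible patterns contribute nothing. The distributions below are hypothetical and are not those plotted in the online appendix.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]          # structural zeros contribute nothing
    return float(-np.sum(nonzero * np.log(nonzero)))

# Hypothetical priors over four latent classes.
diffuse = entropy([0.25, 0.25, 0.25, 0.25])        # uninformative prior
concentrated = entropy([0.85, 0.05, 0.05, 0.05])   # informative prior
with_zeros = entropy([0.5, 0.5, 0.0, 0.0])         # two impermissible classes
```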
Table 2. Study II: Pattern and Average Attribute Recovery Rates Under Different Manipulated Conditions.
Note. B = baseline, P = population prior, I = individual prior; ACDM = additive cognitive diagnostic model; GDINA = generalized deterministic inputs, noisy, “and” gate.
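For readers who wish to reproduce summaries of this kind, the sketch below computes a pattern recovery rate and per-attribute recovery rates from true and estimated mastery profiles. It assumes the usual definitions (the proportion of simulees whose entire pattern is recovered, and the proportion recovered on each attribute); the function name and the toy data are hypothetical.

```python
import numpy as np

def recovery_rates(true_alpha, est_alpha):
    """Pattern recovery rate and per-attribute recovery rates.

    Both inputs are N-by-K arrays of 0/1 mastery indicators.
    """
    true_alpha = np.asarray(true_alpha)
    est_alpha = np.asarray(est_alpha)
    match = true_alpha == est_alpha
    pattern_rate = float(match.all(axis=1).mean())   # whole profile correct
    attribute_rates = match.mean(axis=0)             # each attribute separately
    return pattern_rate, attribute_rates

# Hypothetical true and estimated profiles for four simulees, K = 3.
true_alpha = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
est_alpha = [[1, 1, 0], [1, 1, 0], [0, 0, 0], [1, 1, 0]]

pattern_rate, attribute_rates = recovery_rates(true_alpha, est_alpha)
```

Averaging `attribute_rates` over a subset of columns gives the mean recovery rate for the corresponding group of attributes.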
Discussion
As emphasized in Knowing What Students Know (Pellegrino et al., 2001), there is a need to move from assessment at a single point in time to assessment practices that guide “additional teaching, supports, or interventions that will help students master challenging material” (Fact Sheet: Testing Action Plan). This step is consistent with the spirit of the learning-assessment cycle, which calls for longitudinal, dynamic assessments that assist teachers in understanding how knowledge and skill grow in sophistication (National Education Technology Plan, 2017). In particular, research has shown that providing timely, informative feedback can greatly improve learning (Hanna, 1976; Kluger & DeNisi, 1996). Interim assessments that are given frequently throughout the instructional period need to be short and highly efficient, which makes CAT promising (Kingsbury et al., 2014).
To embed CD-CAT in weekly instructions, an interim CD-CAT differs from the current available CD-CAT designs primarily because students’
Throughout the study, the authors use the GDI (Kaplan et al., 2015) as the item selection index because it is easy to compute and works well. One particular advantage of GDI is that its computation complexity only increases with
This study provides a proof-of-concept illustration of embedding CD-CAT in an interim assessment setting. Several new directions are worth exploring in the future to further solidify this application. First, we compare the hierarchical versus independent structures in the simulation studies and find that the former scenario yields more precise α recovery. This improved precision is due to the known attribute hierarchies that are reflected in the structural 0s in the prior. Therefore, knowing the attribute relationship is a prerequisite for success, and future studies should be devoted to exploring and validating attribute hierarchies from data, such as the exploratory approaches proposed in C. Wang and Lu (2019), and Lu and Wang (2019), or the confirmatory approaches proposed in Templin and Bradshaw (2014). Second, throughout the study, we assume
There are several limitations of the current study that are worth mentioning. First, the maximum K we considered is 10, whereas in practice, K could be much larger, such as 20 or more. The ultra-large K scenario can still be handled well using the current design for the following reasons. (a) Regardless of how large K is in a test, as long as the number of attributes measured by a single item is relatively small, selecting items using GDI is always efficient. (b) Large K imposes a computational challenge on the interim update of α. But if one uses Equation 4 with matrix programming, the calculation can be done very efficiently. (c) Large K requires a long test length to reach enough measurement precision. In this regard, ancillary information such as that from learning models may be particularly useful. A recent study proposes a promising four-step latent regression approach to improve attribute classification accuracy in ultra-high dimensional data (Sun & de la Torre, 2020). The second limitation is that we assume the attribute hierarchies and learning models are specified correctly. It is certainly expected that any misspecification in these two critical components would deteriorate the performance of CD-CAT. A future robustness study should be conducted to evaluate the extent to which misspecification may do more harm than the benefit that the ancillary information brings. Finally, although only one type of attribute hierarchical relationship was considered in the simulation study due to space limits, it is expected that other hierarchical structures would also outperform the independent structure as long as the hierarchies are specified correctly. In fact, if there are more linkages among attributes (than specified in Figure 2), yielding a further reduced number of permissible patterns, the
Supplemental Material
sj-pdf-1-apm-10.1177_0146621621990755 – Supplemental material for On Interim Cognitive Diagnostic Computerized Adaptive Testing in Learning Context
Supplemental material, sj-pdf-1-apm-10.1177_0146621621990755 for On Interim Cognitive Diagnostic Computerized Adaptive Testing in Learning Context by Chun Wang in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project is partly supported by the University of Washington Royalty Research Fund A143697.
Supplemental Material
Supplemental material is available for this article online.
