Attribute hierarchy, the underlying prerequisite relationship among attributes, plays an important role in applying cognitive diagnosis models (CDM) for designing efficient cognitive diagnostic assessments. However, there are limited statistical tools to directly estimate attribute hierarchy from response data. In this study, we proposed a Bayesian formulation for attribute hierarchy within CDM framework and developed an efficient Metropolis within Gibbs algorithm to estimate the underlying hierarchy along with the specified CDM parameters. Our proposed estimation method is flexible and can be adapted to a general class of CDMs. We demonstrated our proposed method via a simulation study, and the results from which show that the proposed method can fully recover or estimate at least a subgraph of the underlying structure across various conditions under a specified CDM model. The real data application indicates the potential of learning attribute structure from data using our algorithm and validating the existing attribute hierarchy specified by content experts.
Cognitive diagnosis models (CDMs) have been widely applied to the field of educational assessment, psychiatric diagnosis, and other social sciences (Chen et al., 2015; de la Torre & Douglas, 2004, 2008; Sorrel et al., 2016; K. K. Tatsuoka, 1984; C. Tatsuoka, 2002; von Davier, 2008). In conjunction with diagnostic assessments, this family of structured latent class models uses subjects’ observed responses to specifically designed diagnostic items to determine fine-grained classification of the underlying latent attribute patterns. However, an unsolved problem for applying CDM in a learning context is that there is still a lack of efficient statistical tools to directly learn attribute hierarchy from the observed data. An attribute hierarchy in a given domain summarizes the relationships among skills, indicating the mastery of an attribute is a prerequisite to the mastery of another one. Different attribute hierarchies have been observed in many learning contexts, such as mathematics learning (Leighton & Gierl, 2007), critical reading (Wang & Gierl, 2011), and syllogistic reasoning (Leighton et al., 2004). Research has indicated that an attribute hierarchy has a potential to play an important role in designing an effective diagnostic assessment and an adaptive learning system (Chen et al., 2018).
Most studies of attribute hierarchy within CDM assume attribute hierarchy is known and develop a series of models that incorporates the attribute hierarchy (Leighton et al., 2004; K. K. Tatsuoka, 2009; Templin & Bradshaw, 2014) or investigate the impact of attribute hierarchies on model fit and item parameter estimation (de la Torre et al., 2010) and attribute pattern classification accuracy (Tu et al., 2019). Results from these studies demonstrate that if the attribute hierarchy is correctly specified, then with the restriction on the permissible attribute patterns given a specified structure, the number of latent attribute patterns can be reduced from the full space and this can improve the model fit and item parameter estimation. Furthermore, the sample size needed to obtain stable parameter estimates from CDM calibrations will decrease as well, and this can result in faster computing time especially when there are a large number of attributes and items (Tu et al., 2019).
However, identification and validation of attribute hierarchy are usually difficult. The attribute hierarchy of a given learning context is typically determined by the domain experts based on cognitive theories. This might cause a structure misspecification problem due to subjective understandings and opinions. Few attempts in CDM literature explored the possibility of identifying attribute hierarchy from subjects’ response data. For example, several studies have developed hypothesis testing methods when some attribute structures have already been identified (e.g., Ma & Xu, 2021; Templin & Bradshaw, 2014). However, these approaches cannot be applied when no prior information of attribute hierarchy is available. Wang and Lu (2021) developed regularization-based approaches to infer attribute hierarchy directly from data. Similarly, Ma et al. (2022) proposed a penalized likelihood approach to estimate the Q matrix and restricted attribute profiles. The structure is then inferred by partial orders. These penalized likelihood approaches infer hierarchy structures through restricted attribute profile spaces, but the estimation of the actual hierarchy structure requires stepwise reconstruction of the binary representations of the partial orders.
In viewing of the aforementioned limitations of current approaches for estimating attribute hierarchy, there is considerable scope of future research on developing new methods in learning attribute hierarchy from data. Borrowing the idea of structure learning problem in the field of machine learning, we view the underlying attribute hierarchy as a structure of directed acyclic graph (DAG) and then aim to develop new Bayesian estimation approaches for learning attribute hierarchy from data within cognitive diagnosis. The rationale of framing the problem in this way is because the DAG specifies the paths connecting nodes and the distribution satisfies certain conditional independence implied by the Markov assumption. These properties are similar to the assumptions for attribute hierarchy structures in cognitive diagnosis. However, different from traditional structure learning problems, which consider the structures of observational data without hidden variables, or the hidden variables are not included in the structures, CDMs are restricted latent class models with constraints on model parameters, and the structure of latent variable is the vital information to be learned for CDM with attribute hierarchies. Therefore, the existing algorithms of structure learning (e.g., Colombo & Maathuis, 2014; Colombo et al., 2012) cannot be directly applied to our setup. Thus, the goal of this study is to develop an efficient algorithm to estimate the hierarchy structure along with the estimation of CDM model parameters and subjects’ latent attribute patterns. Our proposed algorithm explores the sample space of DAGs through random walks on transitive reductions and is applicable to a general class of CDMs under the Bayesian framework. In addition, the proposed algorithm gives exploratory discovery of hierarchical structures with posterior probabilities of potential DAGs (or edges in DAG); the posteriors provide more information on the possible prerequisite relationships among attributes than a strict estimated structure does.
The remainder of this article is organized as follows. Section 2 introduces the general CDM structures and proposes the Bayesian framework for attribute hierarchy. Section 3 introduces the developed Metropolis within Gibbs algorithm for structure learning and model estimation. Section 4 reports the simulation results of the proposed sampling algorithm using a simple CDM model as an example. Section 5 applies the algorithm to a real-world data. Finally, Section 6 provides a summary of our method and findings, discusses the implications of the study, and provides potential future research directions.
2. Bayesian Formulation for CDMs With Attribute Hierarchy
In this section, we first provide a general setup for CDM and then introduce the Bayesian estimation framework for CDMs with attribute hierarchy. Consider a scenario that N subjects complete J diagnostic questions measuring K attributes. The binary response vector, indicating the response to each item is right or wrong, from subject i is denoted by . Within the CDM framework, a latent attribute profile with K binary components is assumed for each subject, generically denoted by α. That is, , indicating the nonmastery or mastery of K attributes in an educational learning context. Under the independent structure, the attribute profile patterns are all permissible, and they are represented by the index set . Additionally, CDMs use a binary matrix to indicate whether an attribute is measured by an item (K. K. Tatsuoka, 1985). Here, qj is the jth row of Q, implying the required attributes for item j. CDMs then impose assumptions about how each subject’s latent attribute profile and each item’s characteristics together determine the subject’s correct response probability to that item. Specifically, in the general setup of CDMs (e.g., de la Torre, 2011; von Davier, 2008), s are modeled as independent Bernoulli variables with parameter , that is
Here, f is a link function, and for item j with parameter and subject i with attribute profile . can be formulated as
Note that the parameter vector indexed by is equivalent to . The choice of the link function and the sparsity (zeros and nonzeros) of are specified by the assumption of a given CDM and qj (e.g., Chen et al., 2020; de la Torre, 2011).
The existence of an attribute hierarchy directly impacts the cardinality of the attribute profile space, thus influencing . To start with, we first formulate an attribute hierarchical structure as a DAG. This is because the prerequisite relationship is single-directed, and two attributes cannot be each other’s prerequisite at the same time. Figure 1 shows four types of hierarchy structure proposed by Leighton et al. (2004). Each of these structures is clearly a DAG with attributes as nodes and a directed edge from k to represents that is the prerequisite of . From now on, we use to denote an attribute hierarchy, where V is the set of nodes (latent attributes) and E is the set of directed edges (prerequisite relationships), and we use the notation G to represent the adjacency matrix of structure and thus simplify the hierarchical structure as G for the remainder of this article.
Illustration of four types of attribute hierarchy structure. (a) Linear. (b) Convergent. (c) Divergent. (d) Unstructured.
The attribute profile space is determined by a specific G. For example, for the linear structure presented in Figure 1a, an attribute profile with and is not permissible as this violates the assumption that attribute 1 is the prerequisite for attribute 2. Thus, the cardinality of the attribute profile space given a G will be smaller than , which is the largest value when we do not assume any specific hierarchical structure among attributes. Another component one needs to carefully specify is the Q matrix when a G is assumed. In fact, there is a debate in the literature regarding whether Q matrix needs to reflect the specified attribute hierarchy. On the one hand, several research studies assume items do not necessarily reflect the specified structure, thus have used an unstructured item-by-attribute Q matrix in their studies regarding attribute hierarchy in CDM (e.g., de la Torre et al., 2010; Templin & Bradshaw, 2014). The unstructured Q matrix only labels the attribute(s) directly measured by an item as 1(s). On the other hand, another group of researchers think a structured Q matrix that reflects the specified attribute hierarchy is necessary (e.g., Leighton et al., 2004; Tu et al., 2019). The structured Q matrix not only labels the attributes directly measured by an item but also their prerequisites.
In our study, we assume the Q matrix is known, and the goal is to estimate the latent attribute profiles, item parameters, and the DAG directly from the observed response data. Thus, we prefer to assume the Q matrix is unstructured, as the true attribute hierarchy is unknown. In addition, another benefit of using an unstructured Q is that the sparsity of model parameters is fully determined by Q and a CDM model assumption. In other words, given a specified CDM assumption (i.e., conjunctive, disjunctive, or compensatory assumption), the sparsity of has a one-to-one relationship with the unstructured Q (Chen et al., 2020), and this helps simplify our model estimation procedure. In addition, even though the Q matrix does not reflect the specified attribute hierarchy structure, the estimation method described in the following section already restricts the possible structure through the latent attribute profile space, so that the estimation result from our method will still align with the implied structure if there were any.
In summary, let be the collection of item parameters for J items. With the conditional independence assumption, the likelihood of the response vector of N subjects, , is
where is a matrix representing attribute profiles of all subjects. As noted above, due to the fact that the information about the Q-matrix is implied by the sparsity of B, the likelihood (2) can be simplified as . To establish the Bayesian estimation framework for G, let be the prior for all possible DAGs. The posterior distribution of G is
The derivation of the posterior of G given Y is nontrivial, and we use the Bayesian directed graphical models for factorization of the joint probabilities of p(A, B, G) and propose the following two tasks to sequentially update G, , and .
3. Metropolis Within Gibbs Sampling Algorithm
In this section, we first focus on the estimation of , , and G for a general class of CDMs formulated by Equation 1. We will then illustrate the proposed algorithm using a specific CDM as an example. To start with, the CDMs with attribute hierarchy can be formulated by a graphical model as shown in Figure 2.
Graphical model of the Bayesian framework for cognitive diagnosis model with hierarchy structure.
3.1. Gibbs Sampling for Model Parameters Under Hierarchy Structure
We assume that the hierarchy structure G only affects the attribute profiles proportions and hence the profiles A. Then, the problem of attribute hierarchy estimation under this Bayesian graphical model framework essentially depends on . The full conditional distribution of G is as follows:
According to (4), we propose to sequentially update A, , B, and G based on their posteriors through the Gibbs sampling algorithm. Specifically, we assume follows a Dirichlet distribution , where is a hyperparameter vector, and A follows a multinomial distribution with parameters .
Algorithm 1 shows the Gibbs sampling algorithm for all model parameters and the hierarchy structure illustrated in Figure 2. Algorithm 1 mainly consists of the procedures that update CDM parameters given G (Steps 3–5) and that update G given a CDM model (Step 6). We illustrate these procedures in the following sections.
Algorithm 1 Gibbs Sampling Algorithm for Attribute Hierarchy Estimation.
1: Initialize with an non-hierarchical structure , attribute pattern proportions and individual attribute profiles A0, and model parameters . 2: fort in do 3: Update from . 4: Update from . 5: Update from . 6: Update from . 7: end for
3.2. Efficiently Sampling of G
The main challenge of Algorithm 1 is in developing a sampling scheme to generate from , which may not be in a closed form. With a constrain of a specific structure and a reasonable value of K, the cardinality of G is generally not extremely large, and we proposed to use a Metropolis–Hastings (M–H) algorithm to sample G from the whole space of all possible DAGs through adding or removing one directed edge at each iteration. A key step in this M–H algorithm is to propose a transition step to ensure an efficient search of the candidate sample space. We first introduce the proposed transition step and then present the developed M–H algorithm.
An efficient transition step
To start with, DAGs with different numbers of edges might have the same dependent relationships. This is known as Markov equivalence property of DAGs (Chickering, 2002). For example, Figure 3 shows two DAGs that have the same directed connection (the so-called reachability). The extra edge in Figure 3b shows the connection of , which is already implied by the path of . In terms of conditional independence, under the global Markov condition, both graphs imply that Node 3 is independent of Nodes 2 and 4 given Node 1.Therefore, Figure 3a with a simpler structure can represent both DAGs, and a random walk from Fig. 3(a) to (b) is inefficient. To make efficient transitions, we propose a graph generating mechanism using the notion of transitive reduction to construct a simple structure. A transitive reduction is a subgraph of G with the fewest edges that maintain the same reachability as G, and it can be retrieved from any given DAGs by selecting edges that are the only paths between its endpoints (Aho et al., 1972). In this regard, the subgraph is unique for DAGs in the same equivalence class. Adopting this idea, we propose to only allow transitions between transitive reduction DAGs.
Illustration of directed acyclic graphs belonging to the same Markov equivalence class.
Proposed M–H Sampler
Based on the concept of transitive reductions, we developed an M–H method, documented by Algorithm 2, to sample from (line 6 of Algorithm 1). Specifically, starting with any transitive reduction G, with a prespecified probability p, a new directed edge is added between two nodes that are not connected through any paths. This guarantees the newly added edge is the only path between its endpoints; thus, the newly proposed remains a transitive reduction. Otherwise, an existing edge is removed. Note that if the current is a transitive reduction, then the newly proposed after removing one edge is still a transitive reduction, because the remaining edges are still the only paths between its endpoints.
Let denote the transition probability from G to the proposed new , then
if the proposed step is adding an edge,
Correspondingly the acceptance ratio of a proposed M–H sampler is
where is assumed to be uniform among possible equivalent classes of DAGs, and follows a Dirichlet distribution. When G is a graph with no edges, which implies an independent structure, follows the original prior of Dirichlet distribution with parameters of length . When G imposes a specific hierarchical structure, we use the aggregation property of Dirichlet distribution, such that now follows a Dirichlet distribution with parameters of length , where CG is the index set of permissible α given G, . In particular, we have
where represents the index of attribute profile patterns that are coerced into the permissible pattern c under the structure G. The computation of coerced permissible patterns is discussed in Section 3.3. For computation efficiency, we collapsed A in the M–H samplers, such that
The acceptance ratio then becomes
The proposed transition step in Algorithm 2 is irreducible, because from any transitive reduction DAG G in the Markov equivalence class, one can reach a new transitive reduction DAG by removing the edges in G sequentially and adding the edges in one by one thereafter. This irreducibility of the proposed transition step guarantees Algorithm 2 can search through the whole space of Markov equivalent DAGs. Another remark is on the prespecified probability p to control the step of proposing different structures. We hope the algorithm to explore the sampling spaces of other model parameters conditioned on a given structure and thus propose to change the structure after a few iterations of exploration. The prespecified p can be set to 1, where at each iteration, the algorithm proposes a new structure. We suggest to select p based on the acceptance ratio of the proposed samples in the Metropolis step, where the acceptance ratio between 20% and 30% is preferred.
Algorithm 2 Metropolis–Hasting Algorithm for Sampling Attribute Hierarchy Structure.
1: The current state of the structure is G and attributes profiles , with a pre-specified probability p 2: General a u from Uniform : 3: ifthen 4: Propose a new by randomly adding one directed edge between two nodes that are not connected by any paths. 5: else ifthen 6: Propose a new by randomly removing one existing edge. 7: end if 8: Compute the acceptance ratio in Equation (9) and generate w from Uniform : 9: ifthen 10: Accept the new . 11: else 12: Stay with the current G. 13: end if
3.3. Sampling CDM Model Parameters
The derivation of full conditional distributions for A, , and B is explicit for a general class of CDMs following the Bayesian formulation (e.g., Chen et al., 2020). For example, for DINA model, these posteriors can be derived explicitly using certain specified priors, such as the formulations in Culpepper (2015). Even if the derivation of posteriors is explicit, the computation of and in each iteration needs to be sequentially adapted. This is because different G structures imply different constrained spaces of A and . The A at the current iteration may not comply with the space implied by a new in the next one.
When an attribute pattern under independent structure, indexed by , is not permissible under the proposed structure G, its corresponding permissible pattern should comply with the constraint G. Denote the new pattern under structure G by . By the aggregation property of Dirichlet distribution, the becomes a vector of length following an aggregated Dirichlet distribution, with
where means and are equivalent under structure G. For example, consider a simple case for , and the current G has one edge from to , as shown in Figure 4a. The space of A given G is , , , , , . Two attribute patterns, and , are not permissible with G. Because the random walk approach only modifies one edge at each proposed step, there will be two simple scenarios of adding or removing a prerequisite relationship. If removing an edge for the next step, resulting in a new (Figure 4b); then, is no longer the prerequisite of . This implies all eight attribute patterns are permissible in the space of A given in Figure 4b. Then, is computed as the Dirichlet distribution with variables. Similarly, if the next step is to add an edge, then a new prerequisite relationship is proposed, resulting in a (Figure 4c). In this case, the space of is , , , , . That is, one permissible attribute pattern under G (Figure 4a), , is no longer permissible and the subjects under which will be reclassified into under the new in Figure 4c. This causes a reduced number of permissible patterns under . In this case, based on the aggregation property of Dirichlet distribution, we only need to collapse the impermissible attribute patterns into the permissible ones under the reduced space, and is an aggregated Dirichlet distribution with a collapsed hyperparameter.
Illustration of directed acyclic graphs and permissible patterns. (a) G. (b) after removing an edge. (c) after adding an edge.
In the case that an edge is eliminated, the number of permissible mastery patterns increases and all current patterns are still permissible under the new structure; therefore, they can be directly used in the calculation for acceptance ratio (as conditioned on the current patterns). Then, in the later steps of Gibbs sampling, attribute patterns are updated through Dirichlet-multinomial distribution, where all attribute profiles are updated sequentially relying on the updated profiles only, that is, is proportional to the number of subjects among the existing subjects classified as . The Dirichlet parameter is then updated and conditioned on the updated attributes.
3.4. Summary of the Proposed Algorithms
In summary, incorporating Algorithm 2 into line 6 of Algorithm 1, we have the Metropolis-within-Gibbs (MH-Gibbs) algorithm for sampling the hierarchical structure and model parameters together. The algorithm is applicable to general CDMs, with explicit derivation of the posteriors of CDM model parameters. Given a specific CDM model, essentially, one only needs to modify the sampling procedures for the CDM part (Steps 3–5 in Algorithm 1) and the proposed MH algorithm for sampling G (Algorithm 2) is applicable to any CDMs. As an illustration, we present the Gibbs sampling procedures for DINA model given G here. For DINA model, where s is a vector of slipping parameters and g is a vector of guessing parameters. Assume the prior for is Dirichlet , the priors for s and g are Beta and Beta , respectively. Under the proposed random walk, the prior for G is uniform on all transitive reductions of DAG of size K. Therefore, the posteriors for Gibbs samplers of A, , s, and g are
where ic is the number of subjects among the current subjects that are classified as possessing attribute profile , and are the numbers of correct and incorrect responses to item j from subjects who possess all required skills, and and are the numbers of correct and incorrect responses to item j from subjects who miss at least one required skill.
4. Simulation Study
In this section, we examined the performance of our proposed algorithm assuming response data follow the DINA model. To demonstrate the generalizability of the proposed algorithm, in addition to this simple model, we implemented the algorithm for general restricted latent class models discussed in Section 2. The results can be found in the Appendix. For this simulation study, we use different sample sizes ( and 1,000), test lengths ( and ), numbers of attributes ( and 7), attribute dependence levels ( and ), and attribute hierarchy structures (shown in Figures 5 and 6). To evaluate the performance of the proposed algorithm in the case of balanced and unbalanced attribute profile distribution, we use two different distributions to simulate the true attribute profile. Specifically, when , the profiles are randomly generated from one of its permissible pattern; thus, the true is relatively balanced among permissible patterns. When , the profiles are generated from a higher order probit model that
Four hierarchy structures in simulation study for K = 4. (a) G1. (b) G2. (c) G3. (d) G4.
Four hierarchy structures in simulation study for K = 7. (a) G1. (b) G2. (c) G3. (d) G4.
where follows a multivariate normal distribution with mean zeros and a covariance matrix , and the attribute profiles are converted into permissible patterns of the given hierarchy structure. Using this generating approach, the true is unbalanced and the proportions of all zeros or all ones are higher than others. The true slipping and guessing parameters are fixed at 0.2.
To evaluate the simulation performance, we ran 100 replications for all cases. First, we checked the convergence of the proposed algorithm using the multivariate proportional scale reduction factor (Brooks & Gelman, 1998), and the results indicate the algorithm reaches the convergence criterion after 6,000 iterations. Thus, we ran all chains of 20,000 iterations with a 15,000 burn-in period. The hyperparameter for Dirichlet prior of is set to be , and the hyperparameters for slipping and guessing parameters are set to be . All model parameters are estimated by posterior means. The hierarchy structure is inferred by the inclusion probability on each element of the adjacency matrices from the posterior samples. The classification of attribute profiles is evaluated by overall pattern accuracy rate (PAR), which is defined by and attribute accuracy rate (AAR), defined by . The estimation of model parameters is evaluated by the root mean squared error (RMSE):
4.1. Simulation Results
Recovery of G for K = 4
For a DAG represented by an adjacency matrix, its th element represents a potential edge from attribute i to attribute j, indicating attribute i is the prerequisite of attribute j. The final estimated adjacency matrix is presented by a weighted matrix, where the th element represents the proportions of edge included in the 100 replications of all posterior samples. A hierarchical structure can then be recovered from the weighted matrix with a specified threshold of inclusion probability. Table 1 reports the estimated hierarchy structures using weighted adjacency matrix under four different specified structures for and averaging over 100 replications. First, the estimated inclusion probabilities of all attribute hierarchy structures are comparable when and (columns 2 and 3 of Table 1), indicating the distribution of attribute profile has little impact on the estimation of G. Estimation in N = 1,000 case has a slightly higher inclusion probability for true edges than in case. Second, we note that for all considered attribute hierarchical structures, the collection of estimated edges is a subset of the list of true edges. As discussed in Section 3.2, some edges in DAG can be implied from other paths, and removing those edges will not affect the hierarchical structure implied by the graph. Note that the extra edges estimated from our algorithm in Table 1 for G1, G2, and G4 are all implied edges, which do not violate the true structure. Therefore, based on the true structure, the proposed method correctly estimated the true edges or the implied edges. Lastly, we observe that the inclusion probabilities for true edges are around 0.5, the inclusion probabilities for implied edges are smaller than those for true edges, and the inclusion probabilities for directed edges not implied in the true structure are all 0 in case. This suggests that the estimated hierarchy structures of 100 replications are either the true structure or a subgraph of the true structure. To further illustrate this, Table 2 summarizes the overall structure estimation over 100 replications for and . The threshold for inclusion probability among posterior samples is fixed as 0.5. The proposed method always identifies the true structure or a subset of it. In fact, for those estimated subset structures, the inclusion probabilities for missing edges are around 0.4, and the inclusion probabilities for nonexisting edges are 0, so one may consider a loose threshold in practice to get a better recovery of G.
Estimated Structures in Adjacency Matrices of the Proposed Method for K = 4 and J = 20 Cases
G
N = 1,000
G1
.000
.544
.208
.153
.000
.545
.184
.183
.000
.523
.199
.159
.000
.529
.202
.141
.000
.000
.557
.193
.000
.000
.542
.184
.000
.000
.521
.174
.000
.000
.512
.164
.000
.000
.000
.530
.000
.000
.000
.535
.000
.000
.000
.498
.000
.000
.000
.473
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
G2
.000
.522
.544
.139
.000
.530
.529
.145
.000
.510
.516
.132
.000
.531
.522
.130
.000
.000
.000
.497
.000
.000
.000
.499
.000
.000
.000
.490
.000
.000
.000
.485
.000
.000
.000
.496
.000
.000
.000
.509
.000
.000
.000
.481
.000
.000
.000
.496
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
G3
.000
.497
.495
.500
.000
.538
.541
.531
.000
.485
.497
.505
.000
.534
.511
.522
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
G4
.000
.502
.542
.248
.000
.522
.545
.230
.000
.500
.502
.253
.000
.518
.533
.261
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.507
.000
.000
.000
.523
.000
.000
.000
.505
.000
.000
.000
.518
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
Note. is denoted by the inclusion probabilities of each entry in the sampled adjacency matrices. The proportions are calculated from 100 replications of each case across all posterior samples. True edges are in bold.
Frequency of Corrected Estimated Structures Among 100 Replications for K = 4 and J = 20 Cases
N = 500, J = 20
N = 1,000, J = 20
Structure
True
Subset
False
True
Subset
False
G1
73
27
0
77
23
0
G2
72
28
0
72
28
0
G3
72
28
0
76
24
0
G4
70
30
0
72
28
0
G1
54
46
0
61
39
0
G2
65
35
0
64
36
0
G3
56
44
0
72
28
0
G4
60
40
0
70
30
0
Note. The threshold of inclusion probability is set to be . “True” refers to estimated structures (including implied edges) equivalent to the true structure, “Subset” refers to estimated structures missing some edges, and “False” refers to estimated structures with extra edges or wrong directions.
Furthermore, a better performance can be observed when both the sample size and test length increase. Specifically, N = 1,000 and case gives a better performance than N = 1,000 case. Tables 3 and 4 report the estimated structures and frequency of correctly estimated structures over 100 replications when N = 1,000 and . The final estimated inclusion probabilities of true edges are all greater than 0.5, and the frequency of correctly identified structures is also higher compared with the condition .
Estimated Structures Ĝij in Adjacency Matrices of the Proposed Method for K = 4 and J = 40 Cases
G
N = 1,000, J = 40, ρ = 0
N = 1,000, J = 40, ρ = 0.5
G1
.000
.566
.198
.145
.000
.566
.184
.142
.000
.000
.576
.189
.000
.000
.559
.187
.000
.000
.000
.562
.000
.000
.000
.547
.000
.000
.000
.000
.000
.000
.000
.000
G2
.000
.535
.548
.139
.000
.521
.541
.134
.000
.000
.000
.536
.000
.000
.000
.507
.000
.000
.000
.534
.000
.000
.000
.528
.000
.000
.000
.000
.000
.000
.000
.000
G3
.000
.547
.547
.532
.000
.530
.525
.535
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
G4
.000
.528
.549
.249
.000
.526
.547
.222
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.000
.542
.000
.000
.000
.542
.000
.000
.000
.000
.000
.000
.000
.000
Note. is denoted by the inclusion probabilities of each entry in the sampled adjacency matrices. The proportions are calculated from 100 replications of each case across all posterior samples. True edges are in bold.
Frequency of Corrected Estimated Structures Among 100 Replications for K = 4 and J = 40 Cases
N =1,000
Structure
True
Subset
False
G1
89
11
0
G2
77
23
0
G3
82
18
0
G4
81
19
0
G1
82
18
0
G2
75
25
0
G3
77
23
0
G4
77
23
0
Note. The threshold of inclusion probability is set to be . “True” refers to estimated structures (including implied edges) equivalent to the true structure, “Subset” refers to estimated structures missing some edges, and “False” refers to estimated structures with extra edges or wrong directions.
Recovery of CDM parameters for K = 4
Table 5 reports the overall attribute classification accuracy rate, the estimation accuracy rate for each attribute, and the RMSEs and biases of slipping and guessing parameters across all conditions. Again, the parameter estimation results are similar when and . The proposed method accurately classifies the attribute profiles of subjects using the estimated structure, with PAR greater than 0.957 and AAR greater than 0.982 across all conditions. We also note that for a given attribute hierarchy structure, the classification accuracy for the attribute at a lower level tends to be higher than those at a higher level. For example, for the linear stricture G1 that has as the lowest level and as the highest level, the AAR for . The RMSEs for s and g are smaller than 0.035 across all conditions, indicating an accurate estimation of item parameters for DINA. For case, the attribute classification accuracy is slightly better than case, and the RMSEs are smaller for a larger sample size. The average biases of the slipping and guessing parameters are close to 0.
Estimated Attribute Profile Accuracy and Item Parameter RMSEs and Biases of the Proposed Method for K = 4 and J = 20 Cases
Structure
PAR
AAR1
AAR2
AAR3
AAR4
RMSEs
Biass
RMSEg
Biasg
G1
.978
.988
.992
0.996
1
.017
.003
.022
.006
G2
.978
.985
.992
0.991
0.999
.017
−.008
.022
.001
G3
.957
.986
.988
0.989
0.989
.025
.003
.020
−.001
G4
.967
.986
.990
0.991
0.995
.021
.002
.021
.006
G1
.977
.985
.994
0.997
0.999
.018
.006
.034
.001
G2
.972
.983
.993
0.992
0.999
.018
.003
.033
−.003
G3
.961
.982
.990
0.991
0.990
.023
.002
.035
−.004
G4
.969
.984
.993
0.991
0.996
.018
.004
.033
.005
N = 1,000
G1
.982
.984
.999
0.999
1
.011
.002
.017
.003
G2
.979
.989
.993
0.996
0.998
.011
−.004
.016
−.003
G3
.966
.984
.991
0.995
0.991
.015
.003
.013
.002
G4
.976
.984
.995
0.995
0.999
.014
.005
.018
.005
N = 1,000
G1
.974
.977
.996
1
0.999
.012
.004
.018
.003
G2
.969
.983
.993
0.988
0.998
.014
−.001
.016
−.003
G3
.964
.988
.986
0.992
0.993
.016
.002
.014
.001
G4
.973
.988
.989
0.992
0.999
.016
−.003
.021
.002
Note. “RMSE” and “RMSE” refer to average RMSEs of slipping and guessing parameters. “Bias” and “Bias” refer to average biases of slipping and guessing parameters. RMSE = root mean squared error; AAR = attribute accuracy rate; PAR = pattern accuracy rate.
Similar with the findings for the recovery of G, the increase of both sample size and test length leads to better recovery of CDM model parameters. Table 6 summarizes the attribute profile accuracy and item parameter estimation from the DINA when . The result is consistent with the simulation studies by Sen and Cohen (2021), where the classification accuracy for DINA model may not be monotonic with increased sample sizes. The recovery of attribute profiles may be jointly affected by sample size, test length, and mastery rate. Table 7 reports the RMSEs of estimated under the , and cases. The RMSEs of are smaller for case compared to cases, which is consistent with the PAR estimations. The average biases of estimated are less than 0.002 across different sample sizes and test lengths.
Estimated Attribute Profile Accuracy and Item Parameter RMSEs and Biases of the Proposed Method for K = 4 and J = 40 Cases
Structure
PAR
AAR1
AAR2
AAR3
AAR4
RMSEs
Biass
RMSEg
Biasg
G1
.994
.996
.999
1
1
.011
−.003
.016
.001
G2
.993
.995
.998
0.999
0.999
.012
.001
.016
.004
G3
.991
.995
.998
0.998
0.999
.013
.003
.013
.001
G4
.993
.996
.999
0.998
0.999
.011
.003
.014
−.001
G1
.985
.988
.996
1
1
.013
−.001
.015
.002
G2
.983
.989
.997
0.997
0.999
.013
.002
.013
.004
G3
.981
.989
.997
0.997
0.998
.014
.002
.014
−.004
G4
.984
.990
.997
0.997
0.999
.012
.000
.015
.001
Note. “RMSE” and “RMSE” refer to average RMSEs of slipping and guessing parameters. “Bias” and “Bias” refer to average biases of slipping and guessing parameters. RMSE = root mean squared error; AAR = attribute accuracy rate; PAR = pattern accuracy rate.
RMSEs of Estimated π for K = 4 Cases
Structure
G1
.009
.011
.006
.007
.002
.004
G2
.007
.009
.007
.009
.003
.004
G3
.005
.008
.005
.008
.003
.005
G4
.005
.008
.007
.009
.004
.006
Note. The RMSEs are computed by averages of RMSEs of estimated attribute profile proportions across 100 replications. RMSE = root mean squared error.
Recovery of G for K = 7
To simplify the presentation, instead of presenting the adjacency matrix that documents the inclusion probability for each pair of attributes, we presented Figure 7 with the inclusion probabilities annotated by the side of estimated edges when test length equals 20. First, the estimated inclusion probabilities for true edges are similar for and , except for the linear structure G1. Second, different from what we observe for , the algorithm estimated a few edges that violate the true hierarchy structure for G1, G2, and G3. For case, the inclusion probabilities are slightly worse than case. This is because for DINA model, attribute classification accuracy depends on sample size, test length, and attributes together, and the hierarchy structure estimation is affected accordingly. However, the proposed algorithm estimated the existing true edges (and their implied edges) with inclusion probabilities larger than the estimated nonexist edges. In fact, as K gets larger, the sample space for DAGs becomes larger; thus, it is more difficult to detect the true structure for than for . Finally, we observe that the linear structures (G1 and G3) are more difficult to recover than the divergent structure and convergent structure G4.
Estimated directed acyclic graph edge probabilities from simulated data for K = 7 and J = 20. Note. The thick edges are the edges from the true structures, and the dashed edges are edges not appearing in the true structures. Implied edges are omitted from the graph.
Table 8 summarizes the overall structure estimation over 100 replications for test length equals 20. Because the inclusion probabilities for are relatively smaller than estimations, we set the threshold of inclusion probability for final estimation to be 0.25. As presented in Figure 7, the inclusion probability of true edges is larger than the inclusion probability of nonexisting edges in the simulation results. Therefore, if there are two edges of opposite directions among a pair of attributes, we include the edge with the larger inclusion probability in the final estimation. The proposed method identifies the true structure or a subset of it for for most replications. The algorithm performs better on the divergent G2 and convergent G4, compared to linear structures G1 and G3.
Frequency of Corrected Estimated Structure Among 100 Replications for K = 7 and J = 20 Cases
Structure
True
Subset
False
True
Subset
False
G1
82
4
14
72
25
3
G2
96
4
0
96
4
0
G3
79
12
9
71
29
0
G4
91
8
1
75
25
0
G1
65
18
17
65
30
5
G2
90
8
2
88
12
0
G3
56
18
26
54
33
13
G4
72
23
5
68
24
8
Note. The threshold of inclusion probability is set to be . “True” refers to estimated structures (including implied edges) equivalent to the true structure, “Subset” refers to estimated structures missing some edges, and “False” refers to estimated structures with extra edges or wrong directions.
Similar to cases, the performance of attribute hierarchy recovery improves with the increase of both sample size and test length. Figure 8 and Table 9 report the inclusion probability and corrected estimated structures, respectively, for . Compared to case, there are no incorrect edges; all estimated structures are true or at least a subset of the true structure.
Estimated directed acyclic graph edge probabilities from simulated data for K = 7 and J = 20. Note. The thick edges are the edges from the true structures, and the dashed edges are edges not appearing in the true structures. Implied edges are omitted from the graph.
Frequency of Corrected Estimated Structure Among 100 Replications for K = 7, N = 1,000, and J = 40 Cases
Structure
True
Subset
False
G1
93
7
0
G2
94
6
0
G3
89
11
0
G4
91
9
0
G1
90
10
0
G2
90
10
0
G3
85
15
0
G4
83
17
0
Note. The threshold of inclusion probability is set to be . “True” refers to estimated structures (including implied edges) equivalent to the true structure, “Subset” refers to estimated structures missing some edges, and “False” refers to estimated structures with extra edges or wrong directions.
Recovery of CDM parameters for K = 7
Tables 10 and 11 report the overall attribute classification accuracy rate, the estimation accuracy rate for each attribute, and the RMSEs and biases of slipping and guessing parameters. The overall findings are similar to . However, we note that the overall PAR accuracy rate is smallest on divergent G2, which is different from its hierarchy estimation performance. This is due to the fact that G2 results in the largest number of permissible attribute patterns among the four structures, which naturally produces lower PAR than others. The AAR of each attribute of G2 is comparable with those from all other structures. For case, the attribute classification accuracy is slightly worse than case, and the RMSEs are smaller for a larger sample size. The average biases of the slipping and guessing parameters are close to 0.
Estimated Attribute Profile Accuracy and Item Parameter RMSEs of the Proposed Method for K = 7 and J = 20 Cases
Structure
PAR
AAR1
AAR2
AAR3
AAR4
AAR5
AAR6
AAR7
RMSE
Bias
RMSE
Bias
G1
.927
.983
.996
.996
.990
.995
.969
.988
.018
−.003
.030
.006
G2
.832
.984
.989
.987
.960
.978
.940
.955
.022
−.004
.027
−.008
G3
.892
.984
.990
.993
.974
.991
.967
.973
.018
.005
.029
.006
G4
.896
.975
.993
.990
.991
.992
.948
.989
.019
.006
.027
−.008
G1
.825
.990
.996
.878
.989
.988
.965
.993
.019
.008
.042
−.003
G2
.794
.996
.987
.987
.943
.972
.921
.940
.024
.006
.039
.006
G3
.843
.989
.993
.965
.989
.987
.943
.991
.021
.003
.036
−.006
G4
.878
.996
.993
.990
.954
.976
.957
.986
.022
.008
.037
−.009
G1
.865
.958
.988
.985
.942
.964
.939
.932
.021
.008
.029
−.006
G2
.854
.987
.987
.976
.933
.961
.928
.941
.022
.007
.026
.007
G3
.763
.953
.988
.969
.938
.966
.931
.936
.017
.006
.027
−.008
G4
.848
.969
.988
.980
.971
.976
.912
.930
.019
.009
.026
.006
G1
.788
.967
.988
.991
.977
.989
.910
.928
.019
.006
.032
−.008
G2
.781
.960
.984
.990
.973
.975
.927
.929
.022
−.008
.033
.009
G3
.749
.934
.977
.978
.921
.948
.925
.933
.023
.006
.035
.008
G4
.769
.979
.987
.979
.942
.961
.922
.943
.022
−.005
.033
.008
Note. “RMSE” and “RMSE” refer to RMSEs of slipping and guessing parameters. RMSE = root mean squared error; AAR = attribute accuracy rate; PAR = pattern accuracy rate.
Estimated Attribute Profile Accuracy and Item Parameter RMSEs of the Proposed Method for K = 7 and J = 40 Cases
Structure
PAR
AAR1
AAR2
AAR3
AAR4
AAR5
AAR6
AAR7
RMSE
Bias
RMSE
Bias
N = 1,000
G1
.965
.984
.998
.999
.997
.998
.997
.991
.014
.002
.012
−.004
G2
.909
.992
.996
.998
.988
.994
.963
.969
.013
.003
.015
−.003
G3
.878
.972
.997
.989
.995
.999
.923
.997
.017
−.006
.016
.007
G4
.905
.991
.991
.993
.982
.987
.956
.999
.014
.004
.015
.003
N = 1,000
G1
.941
.970
.994
.999
.996
.997
.985
.999
.015
−.007
.019
.006
G2
.897
.989
.998
.996
.985
.994
.956
.967
.021
−.005
.025
.008
G3
.838
.948
.999
.991
.994
.999
.923
.974
.023
.004
.025
.006
G4
.901
.991
.996
.997
.985
.987
.949
.991
.016
−.006
.021
.005
Note. “RMSE” and “RMSE” refer to RMSEs of slipping and guessing parameters. RMSE = root mean squared error; AAR = attribute accuracy rate; PAR = pattern accuracy rate.
Table 12 reports the average computation time (in seconds) of the proposed algorithm under each condition. Simulation and estimation work was completed using R by a seamless interfacing established with RcppArmadillo and Rcpp packages (Eddelbuettel et al., 2011). On average, for , the proposed algorithm took less than 6 minutes for a chain with length , and for , it takes about 45 minutes to finish a chain with , and chain length . The computation time increases exponentially in K because of possible attribute profile patterns and possible candidate edges.
Average Computation Time of the Proposed Algorithm Under Each Condition in Seconds
N
K
J
G1
G2
G3
G4
500
4
20
0
156.25
162.86
156.39
156.70
500
4
20
0.5
157.38
162.52
156.92
158.38
1,000
4
20
0
214.47
229.23
229.70
229.75
1,000
4
20
0.5
233.83
240.48
237.79
235.53
1,000
4
40
0
287.44
321.94
307.56
309.55
1,000
4
40
0.5
292.33
324.03
319.99
312.64
500
7
20
0
1,119.36
1,137.94
1,124.62
1,135.18
500
7
20
0.5
1,123.78
1,139.26
1,130.20
1,136.79
1,000
7
20
0
1,944.25
2,090.44
1,967.32
2,075.93
1,000
7
20
0.5
2,082.74
2,131.28
2,065.55
2,187.02
1,000
7
40
0
2,579.00
2,618.58
2,614.96
2,616.03
1,000
7
40
0.5
2,646.88
2,131.28
2,633.92
2,658.55
Note. The computation time is averaged across 100 replications under each condition. The computing is done on a single CPU core with 4-GB memory.
5. Data Application
In this section, we applied the proposed MH-Gibbs algorithm in estimating attribute hierarchy on the Examination for the Certificate of Proficiency in English (ECPE) data. This data set contains 2,922 test takers’ responses to 28 ECPE test items, which has been analyzed by several studies regarding attribute hierarchy (e.g., Ma & Xu, 2021; Templin & Bradshaw, 2014; Wang & Lu, 2021). Three attributes were identified, including (1) morphosynatacic rules, (2) cohesive rules, and (3) lexical rules. The response matrix and Q matrix can be loaded in the R package “CDM” (George et al., 2016). A linear hierarchical structure of the three attributes, , was specified or estimated by several previous studies (e.g., Templin & Bradshaw, 2014; Wang & Lu, 2021). In this analysis, we applied the developed algorithm to learn the attribute hierarchy structure from the data using both DINA and GDINA model. For each CDM model, a total of five chains with a random starting point were first ran to check the convergence using the maximum value of Gelman–Ruin proportional scale reduction factor (Gelman & Rubin, 1992). Results indicate the algorithm converged after 3,000 iterations. Thus, we summarize the result based on a single chain of the last 5,000 iterations of the 10,000 iterations. The estimated adjacency matrix from both DINA and GDINA was very consistent, and here, we presented the result based on the DINA model in Table 13. We can conclude that Attribute 3 is the prerequisite of Attribute 2 as the (3, 2) element is nonzero (0.584). Thus, our algorithm estimated a subgraph of the expert specified attribute hierarchy, which is consistent with the finding from the simulation study that our algorithm is able to at least recover a subgraph of the true structure. This demonstrates that our estimated structure can serve as the starting structure for cognitive experts if no prior information regarding the attribute hierarchy exists. The proposed method did not identify the relationship of compared to the previous studies. We found the performance of the proposed algorithm could potentially be affected by the prespecified Q matrix, where the three skills were loaded unbalanced among the test items. The missing edge allows two more permissible patterns (1,0,0) and (1,0,1), and the corresponding ideal response pattern of these two attribute profiles only differs from other permissible patterns by 5 items under the currently prespecified unstructured Q matrix. Further improvement of the proposed algorithm may require joint modeling of attribute structure and corresponding structured Q matrix.
The Estimated Adjacency Matrix for Examination for the Certificate of Proficiency in English Data
1
2
3
1
0
0.010
0
2
0
0
0
3
0
0.584
0
6. Discussion
Attribute hierarchy plays an important role in applying CDMs for designing efficient cognitive diagnostic assessments, yet there are limited statistical tools to directly learn the latent hierarchy structure among attributes. In this study, we formulated the estimation of attribute hierarchy underlying CDMs with a Bayesian framework and developed an efficient MCMC algorithm to estimate the hierarchy structure together with CDM model parameters directly. The exploration of sampling space of hierarchy structure uses a random walk on transitive reduction of acyclic directed graphs to effectively search possible hierarchy states. Using the DINA model as an example, our simulation study demonstrates that the proposed method can fully recover or estimate at least a subgraph of the underlying structure across various conditions. In addition, we applied our proposed algorithm in estimating the hierarchy using the ECPE data. Results indicate that our algorithm is able to recover a subgraph of the prespecified attribute hierarchy. This indicates the potential of using our developed algorithm in learning attribute hierarchy from data, and the results can be used to validate the existing attribute hierarchy specified by content experts.
One finding from the simulation study is about the impact of the true attribute hierarchical structure on our proposed estimation algorithm. We note that the linear structure is more difficult to estimate correctly than nonlinear structures. This may be due to the construction of the M–H step in the proposed algorithm, where we proposed to only add or remove one edge at a time. Other stochastic construction of graphs, such as exchanging edges or reverting direction, can be considered in the algorithm to further expedite the exploration of the sample space of DAGs in the future.
Another finding from the simulation study is that the inclusion probabilities calculated from sampled posterior adjacency matrices are often smaller than , which indicates that the acceptance ratio of adding an edge to the free structure is not very high. This is because currently we used a Dirichlet prior for the proportions of attribute profiles based on the nonhierarchical structure, and the hyperparameter in simulation and data application is . When the current G is the nonhierarchical structure (all zeros) and the newly proposed in Algorithm 2 Line 6 has one edge, , then dominates the acceptance ratio. Therefore, in practice, one can use an unbalanced to deliberately make the acceptance ratio in favor of certain dependent structures. This can further improve the inclusion probabilities for potential edges.
The proposed method gives an exploratory discovery of potential hierarchical structures. The estimated adjacency matrix with inclusion probability of potential edges provides more information on the possible prerequisite relationships among attributes than a strict estimated structure does. However, if the researchers look for further interpretation on the attribute mastery, an explicit structure is needed. The choice of a threshold provides guidance on how to decide on explicit structures. The selection of threshold values to include the edge is subjective in this study. A potential strategy is to consider the overall possible DAGs of size K. The exact enumeration given by Robinson (1977) shows the total number of possible DAGs with K nodes is given by the recurring function :
Because the proposed algorithm considers a uniform prior, then the edges (off-diagonal elements in G) have an equal probability of appearing in the sample space of DAGs, which is upper bounded by . Because is superexponential in K, therefore one can consider an approximate value of as the threshold. Alternatively, a subjective threshold depending on posterior samples such as a certain percentile of edge inclusion probability could also be considered.
Even though the current study mainly used the DINA model in the simulation study, the algorithm can be generally applicable to other CDM specifications as long as the posteriors can be derived explicitly. The Appendix showed another simple simulation of the proposed method on a compensatory main-effect restricted latent class model. In addition, one limitation of the current study is that we assume Q is known and unstructured. In the future, we can estimate Q together with other CDM model parameters and the underlying attribute hierarchy. Our current model framework assumes that the item parameter matrix B fully implies the structure of Q; thus, the estimation of structured Q can be easily incorporated into the existing algorithm through the estimated sparsity of B. Furthermore, the proposed algorithm uses a Dirichlet prior on attribute pattern proportions and utilizes the aggregation property of Dirichlet distribution. An alternative formulation using Polya urn model for attribute profiles can be further investigated to collapse the parameter . Finally, future research can potentially extend the proposed algorithm to longitudinal CDMs, where the transition probabilities among different attribute patterns are also constrained by G. Under specific model assumptions such as the first-order Markov model, the proposed algorithm is also feasible with explicit derivation of the transition matrix analogous to the construction of .
Footnotes
Appendix
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This research was supported by National Science Foundation under award SES-2051198.
ORCID iD
Yinghan Chen
References
1.
AhoA. V.GareyM. R.UllmanJ. D. (1972). The transitive reduction of a directed graph. SIAM Journal on Computing, 1(2), 131–137.
2.
BrooksS. P.GelmanA. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455.
3.
ChenY.CulpepperS.LiangF. (2020). A sparse latent class model for cognitive diagnosis. Psychometrika, 85(1), 121–153.
4.
ChenY.LiX.LiuJ.YingZ. (2018). Recommendation system for adaptive learning. Applied Psychological Measurement, 42(1), 24–41.
5.
ChenY.LiuJ.XuG.YingZ. (2015). Statistical analysis of Q-matrix based diagnostic classification models. Journal of the American Statistical Association, 110(510), 850–866. https://doi.org/10.1080/01621459.2014.934827
6.
ChickeringD. M. (2002). Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(1), 445–498.
7.
ColomboD.MaathuisM. H. (2014). Order-independent constraint-based causal structure learning. The Journal of Machine Learning Research, 15(1), 3741–3782.
8.
ColomboD.MaathuisM. H.KalischM.RichardsonT. S. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1), 294–321.
9.
CulpepperS. A. (2015). Bayesian estimation of the DINA model with Gibbs sampling. Journal of Educational and Behavioral Statistics, 40(5), 454–476.
de la TorreJ.DouglasJ. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69(3), 333–353. https://doi.org/10.1007/bf02295640
12.
de la TorreJ.DouglasJ. A. (2008). Model evaluation and multiple strategies in cognitive diagnosis: An analysis of fraction subtraction data. Psychometrika, 73(4), 595–624.
13.
de la TorreJ.HongY.DengW. (2010). Factors affecting the item parameter estimation and classification accuracy of the DINA model. Journal of Educational Measurement, 47(2), 227–249.
14.
EddelbuettelD.FrançoisR.AllaireJ.ChambersJ.BatesD.UsheyK. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18.
15.
GelmanA.RubinD. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 457–472. https://doi.org/10.1214/ss/1177011136
16.
GeorgeA. C.RobitzschA.KieferT.GroßJ.ÜnlüA. (2016). The R package CDM for cognitive diagnosis models. Journal of Statistical Software, 74, 1–24.
17.
LeightonJ. P.GierlM. J. (2007). Why cognitive diagnostic assessment. In LeightonJ. P.GierlM. J. (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 3–18). Cambridge University Press. https://doi.org/10.1017/cbo9780511611186.001
18.
LeightonJ.P.GierlM.HunkaS. (2004). The attribute hierarchy model: An approach for integrating cognitive theory with assessment practice. Journal of Educational Measurement, 41(3), 205–236.
19.
MaC.OuyangJ.XuG. (2022). Learning latent and hierarchical structures in cognitive diagnosis models. Psychometrika, 88, 175–207.
20.
MaC.XuG. (2021). Hypothesis testing for hierarchical structures in cognitive diagnosis models. Journal of Data Science, 20(3), 279–302.
21.
RobinsonR. W. (1977). Counting unlabeled acyclic digraphs. In Combinatorial mathematics v (pp. 28–43). Springer. https://doi.org/10.1007/BFb0069178
22.
SenS.CohenA. S. (2021). Sample size requirements for applying diagnostic classification models. Frontiers in Psychology, 11, 621251.
23.
SorrelM. A.OleaJ.AbadF. J.de la TorreJ.AguadoD.LievensF. (2016). Validity and reliability of situational judgement test scores: A new approach based on cognitive diagnosis models. Organizational Research Methods, 19(3), 506–532.
24.
TatsuokaC. (2002). Data analytic methods for latent partially ordered classification models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51(3), 337–350.
25.
TatsuokaK. K. (1984). Analysis of errors in fraction addition and subtraction problems. Computer-Based Education Research Laboratory.
26.
TatsuokaK. K. (1985). A probabilistic model for diagnosing misconceptions by the pattern classification approach. Journal of Educational and Behavioral Statistics, 10(1), 55–73.
27.
TatsuokaK. K. (2009). Cognitive assessment: An introduction to the rule space method. Routledge.
28.
TemplinJ. L.BradshawL. (2014). Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika, 79(2), 317–339.
29.
TuD.WangS.CaiY.DouglasJ.ChangH.-H. (2019). Cognitive diagnostic models with attribute hierarchies: Model estimation with a restricted Q-matrix design. Applied Psychological Measurement, 43(4), 255–271.
30.
von DavierM. (2008). A general diagnostic model applied to language testing data. British Journal of Mathematical and Statistical Psychology, 61(2), 287–307.
31.
WangC.GierlM. J. (2011). Using the attribute hierarchy method to make diagnostic inferences about examinees’ cognitive skills in critical reading. Journal of Educational Measurement, 48(2), 165–187.
32.
WangC.LuJ. (2021). Learning attribute hierarchies from data: Two exploratory approaches. Journal of Educational and Behavioral Statistics, 46(1), 58–84.