Abstract
This article presents a process for evaluating whether a content domain is suitable for diagnostic classification models (DCMs), an optimized set of steps for constructing a test blueprint for DCM applications, and a real-life example illustrating the process. Candidate content domains were evaluated against a set of criteria defined to improve the success rate of DCM implementation. Given the selected domain, the Q-matrix was determined by a simulation-based approach that used correct classification rates as the criterion. Finally, a physics test based on the final Q-matrix was developed, administered, and analyzed by the author and the subject-matter experts (SMEs).
Keywords
Introduction
Researchers have been implementing diagnostic classification models (DCMs) in educational settings to support and measure learning on various topics. Examples include proportional reasoning in math (Tjoe and de la Torre, 2014), grade school and undergraduate mathematics (Gierl et al., 2010; Mejía-Ramos et al., 2017), and teachers’ understanding of rational numbers (Bradshaw et al., 2014). Across these applications, several challenges have been identified concerning the identification of attributes. For example, Bradshaw et al. (2014) concluded that identifying distinct but related traits could be a primary challenge when attempting to define attributes and that reliably discriminating between two related attributes could be difficult. Beyond the definition of attributes, it is also important to be able to measure them appropriately. For example, some guidance has been offered on determining the number of attributes (e.g., Nájera et al., 2021) and the number of items that should measure each attribute within a test (e.g., Henson & Douglas, 2005).
Although the literature has identified specific challenges that may explain poor or moderately fitting models, as well as characteristics associated with good fit, these themes have not been brought together into a set of criteria for deciding whether a construct or domain could actually benefit from a diagnostic modeling approach. The current literature on DCM applications includes examples in mathematics and science, but these examples do not adequately describe the specific steps used to identify an appropriate domain from among several candidates. It is likely that DCMs cannot be applied equally successfully to all domains.
Even when a domain has been identified, additional real-world constraints may affect the success of a DCM. These constraints, such as allowable testing time, test length, and the feasibility of extracting separable attributes from a specific domain, can affect the nature of the test and whether it is even possible for all attributes to be reliably measured.
A critical point of this article is that, although DCMs can be useful, they likely cannot be applied in all domains given limitations and constraints such as those discussed above. No literature has been found that provides guidelines emphasizing the real-life characteristics that make a DCM application feasible. In the current study, four factors are discussed and examined in evaluating whether a domain makes a DCM application realistic: (1) whether the domain can be separated into different attributes, (2) how many attributes are appropriate, (3) whether the attributes can be measured separately, and (4) whether the final items introduce construct-irrelevant variation.
The first factor is whether a given domain can naturally be separated into different attributes. To measure attributes using a test, the attributes must first be identified. In the literature, there are several examples of DCM research where specific attributes were identified. Most of these studies attempted to retrofit a diagnostic model to a test that had already been created, which can be challenging and may limit the results (Jang, 2009; Kim, 2015; Li et al., 2016; Ravand, 2016; von Davier, 2005). In contrast, some studies identified the attributes before creating the test. For example, Tjoe and de la Torre (2014) identified six core attributes in proportional reasoning through a series of meetings with subject matter experts (SMEs) before creating any test. Bradshaw et al. (2014) identified a set of attributes for a test of teachers’ multiplicative reasoning before constructing the test. Identifying distinct but related traits could be a primary challenge during attribute identification (Bradshaw et al., 2014). Therefore, the first factor to be considered is whether distinct attributes can be extracted from a domain.
The second factor concerns the number of attributes. According to the evaluated sample of DCM applications in Sessoms and Henson (2018), the number of attributes typically used in actual DCM applications ranges from 4 to 23, and the average number of attributes estimated was 8. Depending on the test length, the number of attributes should be limited to an appropriate range, such that each attribute can be measured at least three times (Kuo et al., 2016). For example, Henson and Douglas (2005) suggested that a 20-item exam can measure each of 8 attributes at least three times. Related to the number of attributes is the grain size of the attributes. The two are related because more general attributes will result in fewer attributes for a given construct, yet assuming that such an attribute is dichotomous may not be reasonable. In contrast, if the grain size is “small” and more detailed, assuming the attribute is dichotomous may be more reasonable, yet more attributes will typically be needed for the same construct.
The third factor is whether the attributes can be measured separately. This factor can manifest in two ways. First, it may not be possible to write items that measure each attribute distinctly. If some attributes must always appear together to form an item, they will be confounded during estimation. For example, a typical convex lens image question in middle school physics must include both the attribute of understanding the concept of a convex lens and the attribute of calculating the distance, so these two attributes cannot be measured distinctly in a typical test. Second, even if experts think the attributes can be measured separately by a test, some attributes may be indistinguishable in the population’s responses. Bradshaw et al. (2014) ran into this issue: the authors could not reliably discriminate between two related attributes and were compelled to combine them.
Lastly, a factor that has not been emphasized in the literature as a reason for the success of a diagnostic model, but that is crucial in testing in general, is avoiding the introduction of construct-irrelevant variance, as specified in the Standards (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 2014). Construct-irrelevant variation may give an unfair advantage or disadvantage to one or more subgroups of test-takers (Standards, AERA, APA, & NCME, 2014). In a multidimensional model framework, it is even more important that items still look familiar to students. Even if items that measure different combinations of attributes can be formed, such items should be familiar to students so that they do not introduce construct-irrelevant variation into the problem-solving process. That is, if an item must be written in a highly unusual way to measure a set of attributes, then that item could be problematic.
In addition to these four factors, another issue that the literature has rarely addressed is the construction of the Q-matrix under real-life constraints. Many studies have discussed the construction of the Q-matrix when response data or an item bank is given (Henson & Douglas, 2005; Huang et al., 2022; Kuo et al., 2016; Nájera et al., 2021). However, no literature has been found on building a Q-matrix without any given items or responses under application constraints, for example, under a fixed test length or under a condition where items can only be written to measure some of the attributes. Therefore, it is important to consider not only how the domain is selected but also the nature of the Q-matrix during the item writing process, as constraints are common in real-world applications.
To address this concern, the authors proposed a new simulation approach that identifies a Q-matrix under real-life constraints based on the inferred correct classification rate (CCR) of the attributes. This new approach was used to create the Q-matrix and guide the writing of items. After a domain is selected, the rest of the process includes formulating a list of attributes, constructing a Q-matrix using the simulation approach, and developing and administering a diagnostic test.
This article intends to demonstrate how a domain should be evaluated in advance based on the factors identified in the literature and then, given the domain, how to construct a Q-matrix (test blueprint) given constraints on how items can be written and how many items can be included in the test. An example of the process is provided in the domain of mechanical efficiency in middle school physics.
Method
This study was partitioned into four phases, as illustrated in Figure 1: (1) evaluating the domain where DCMs would be helpful and applicable and defining attributes in this domain, (2) constructing the Q-matrix, (3) test development and administration, and (4) test analysis. Prior to evaluating the domain, SMEs needed to be familiarized with DCMs. As will typically be the case, the SMEs were familiar with the basic concepts of the content (physics) but were not used to thinking about those concepts in the context of DCMs. Thus, brief materials about DCMs were provided to the SMEs prior to the interview; then, during the interview, any questions and critical content about the basic concepts of attributes, the Q-matrix, and DCMs were reviewed. At this point, the SMEs knew the basic terminology, such that we were able to work through the phases as a group. Working as a group, they addressed each of the four phases mentioned above. When asked to answer specific questions, the SMEs were allowed to disagree with each other at first. However, they were asked to reach a single final conclusion by consensus through discussion.

Figure 1. The flowchart for diagnostic classification model application.
Given that the SMEs were familiar with DCMs, for the first phase (identify a domain), SMEs were asked to inspect the four factors discussed in the “Introduction” section for identifying a potential domain for the application of DCM. The first factor is whether the SMEs can extract attributes from the target domain. The SMEs were informed in a conversation before the interview and were reminded again during the interview that attributes are skills that a student can master or not; hence, the attributes are dichotomous. To meet this requirement, the SMEs went through each domain for eighth-grade physics and then identified all domains that could satisfy the first factor in the interview.
Considering the set of domains where SMEs felt they could have a set of dichotomous attributes, the second factor (number of attributes) was addressed by taking into account the expected length of the test and the total number of attributes that could be feasibly measured in that time. This diagnostic test was administered in the form of either an after-school homework assignment or a class quiz, depending on the teachers’ decision. In the interview, SMEs were asked how much time would be available to administer the test and how many items could be included in the test. Both SMEs discussed and agreed that a 30-minute multiple-choice test containing a total of 20 items would be feasible. Based on the literature review, tests that contain 20 items typically measure between four and eight attributes (Cui et al., 2012; de la Torre & Douglas, 2004; Li & Suen, 2013; Ravand, 2016). SMEs were asked to extract attributes from the few domains that satisfied the first factor. Any domain with more than eight or fewer than four attributes would be eliminated.
To evaluate whether the attributes can be measured separately (the third factor to consider), the SMEs were asked to create items that measure each attribute by itself. Notice that if there were attributes that must always be measured together, then they would be indistinguishable in the model. Thus, they could not be measured singly and must be either dropped or merged with other attributes. Lastly, the SMEs examined whether students would be familiar with questions written in that way. This last consideration is important because there may be situations where it is conceivably possible to write an item that will measure an attribute, but the item would need to be written in a way or style that students had never seen. Such an item might be answered incorrectly, or be more difficult, simply because the student was not familiar with it, which would increase the potential for bias or construct-irrelevant variance.
After considering these factors, one domain was selected that was most likely to result in a useful application of a DCM. This domain is then used as the target domain for the diagnostic test. After deciding on the topic, the SMEs explicitly defined all the desired attributes based on the curriculum requirements. The detailed definitions of all the attributes are described in Appendix A.
After the best domain was selected and the basic form of items had been discussed, it was necessary to determine what diagnostic model would most likely be appropriate. An appropriate DCM should be able to describe the relationship between the attributes and the probability of answering a test item correctly. It should also be well studied in the literature. Model types are often described based on the relationship between the attributes and the probability of a correct response. Two common types of models are conjunctive and disjunctive models.
In conjunctive models, a lack of mastery in one attribute cannot be compensated for by mastery of any other attribute (Henson et al., 2009). One commonly seen conjunctive DCM is the deterministic input, noisy “and” gate model (DINA; Junker & Sijtsma, 2001). In the DINA model, if someone has mastered all the required attributes for an item, that person should probably get that item right, and this probability is described by a parameter in the model. If someone lacks one or more required attributes, that person should probably miss that item, and this probability is also described by a parameter in the DINA model. In disjunctive models, by contrast, mastery of a subset of skills, or sometimes one skill, can lead to a high probability of a correct response even when other attributes measured by the item have not been mastered (Henson et al., 2009). In this study, the SMEs were asked whether students needed to master all the required attributes to answer an item correctly and whether lacking mastery in even one attribute would result in a student having a high chance of missing the item. Based on the SMEs’ responses, it is possible to select a model that reflects this behavior.
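As a minimal sketch of the conjunctive rule just described, the DINA response probability can be written as a short Python function. The function name `dina_prob` and the example guessing and slipping values are illustrative assumptions, not values from this study.

```python
def dina_prob(q_row, alpha, guess, slip):
    """Probability of a correct response under the DINA model.

    q_row : 0/1 list, attributes required by the item (one Q-matrix row)
    alpha : 0/1 list, attributes mastered by the examinee
    guess : P(correct | at least one required attribute is missing)
    slip  : P(incorrect | all required attributes are mastered)
    """
    # Conjunctive rule: every required attribute must be mastered.
    has_all_required = all(a == 1 for q, a in zip(q_row, alpha) if q == 1)
    return 1.0 - slip if has_all_required else guess


# Hypothetical item requiring attributes 1 and 2 (of 3), with assumed g = 0.2, s = 0.1.
item = [1, 1, 0]
p_master = dina_prob(item, [1, 1, 0], guess=0.2, slip=0.1)     # all required mastered
p_nonmaster = dina_prob(item, [1, 0, 1], guess=0.2, slip=0.1)  # attribute 2 missing
```

Mastering an attribute the item does not require (attribute 3 in the second call) has no effect, which is the sense in which one attribute cannot compensate for another.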
In the second phase, the SMEs constructed a list of all possible combinations of attributes that could be measured by a single item, a step that, to our knowledge, has not been reported in the literature on Q-matrix construction. Although it had already been determined in the first phase that items could be written to measure each attribute by itself, it is possible that not all combinations of attributes can be measured by a single item. For example, it might not be possible to write an item that measures five attributes simultaneously; thus, that combination might be excluded from the list. Note that any given combination of attributes measured by a single item can be thought of as a Q-matrix vector. Thus, the list created by the SMEs contains all the possible Q-matrix vectors. Any test constructed for this domain can only be constructed using items that measure the attribute combinations in this list, which is a constraint for test construction.
From this list, a subset of potential attribute combinations (allowing for replacement; hence, two or more items could theoretically measure the same attribute combination) was determined, such that if items were created, the attributes would be measured as well as possible. Measuring an attribute well, in this case, means that the CCRs of all attributes will be large. This phase was completed taking into account the test length restrictions. A simulation approach was used to predict how well each attribute would be measured given a specific test (i.e., the Q-matrix). Given the multiple possible Q-matrices and the estimated CCRs for each, the Q-matrix with the highest set of CCRs for all attributes was identified. This Q-matrix was then used as the blueprint for item construction. This process is described in detail in the following paragraphs.
Note that ideally, one could optimize this process. However, because item parameters are unknown, the goal is to determine a process that would result in a good combination of items to measure the defined attributes. In addition to the item parameters being unknown, it is infeasible to try every combination, computationally or as a real test. Thus, a simulation approach was defined to determine an effective 20-item test for assessing those attributes when not all combinations were possible.
To illustrate the simulation approach, let x represent the number of items on the test, based on real-life testing time limitations, and let y represent the number of possible attribute combinations for which an item could be written. The simulation approach was conducted as follows:
1. A total of x rows were randomly selected, with replacement, from the list of possible Q-matrix vectors (y rows).
2. Profile data (mastery or nonmastery of the attributes) for 500 students were randomly simulated, assuming an equal probability for all mastery patterns. Note that equal probabilities are used as the worst-case scenario. If some attribute patterns are more likely than others, this information could increase CCRs if a Bayesian approach were used.
3. The response data set was simulated from the Q-matrix generated in Step 1 and the profile data generated in Step 2 using the reduced Reparameterized Unified Model (R-RUM; Hartz, 2002). In the simulated data, the parameters were fixed, such that the …
4. The simulated response data were then calibrated using the R-RUM again. This calibration resulted in item parameters and, more importantly, each examinee’s posterior probability of mastery for each attribute. These posterior probabilities of mastery can then be used to estimate examinee mastery.
5. The CCR was estimated by comparing the estimated attribute mastery to the true attributes originally simulated in Step 2. Note that in this case, the CCR is computed as the proportion of times that estimated mastery matches true mastery across all attributes and simulated examinees.
6. Steps 1 through 5 were then repeated a total of 4,000 times. The Q-matrix that resulted in the highest CCR was used as the final Q-matrix and the test design.
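The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the item parameters `pi_star` and `r_star` are assumed values shared by all items, and the calibration step is replaced by classification with the true generating parameters and a uniform prior, so no model is re-estimated. All function names and the toy list of feasible rows are hypothetical.

```python
import itertools
import math
import random


def rrum_prob(q_row, alpha, pi_star=0.9, r_star=0.4):
    """Reduced RUM: pi* times one penalty r* per required but unmastered attribute.
    pi_star and r_star are assumed values, shared by all items for simplicity."""
    n_missing = sum(1 for q, a in zip(q_row, alpha) if q == 1 and a == 0)
    return pi_star * r_star ** n_missing


def simulate_ccr(q_matrix, n_students, rng):
    """Steps 2-5 for one candidate Q-matrix; returns the overall CCR."""
    n_attr = len(q_matrix[0])
    profiles = list(itertools.product([0, 1], repeat=n_attr))
    # P(correct) for every (profile, item) pair.
    probs = [[rrum_prob(row, p) for row in q_matrix] for p in profiles]
    hits = 0
    for _ in range(n_students):
        true_idx = rng.randrange(len(profiles))          # step 2: uniform profiles
        x = [rng.random() < p for p in probs[true_idx]]  # step 3: item responses
        # Step 4 (simplified): posterior over profiles using the TRUE item
        # parameters and a uniform prior, instead of re-estimating the model.
        like = [math.prod(p if xi else 1.0 - p for p, xi in zip(pr, x))
                for pr in probs]
        total = sum(like)
        for k in range(n_attr):                          # step 5: compare to truth
            post_k = sum(l for l, p in zip(like, profiles) if p[k] == 1) / total
            hits += (post_k > 0.5) == (profiles[true_idx][k] == 1)
    return hits / (n_students * n_attr)


def search_q_matrix(candidate_rows, test_length, n_replications, n_students, seed=0):
    """Steps 1 and 6: draw random Q-matrices and keep the one with the highest CCR."""
    rng = random.Random(seed)
    best_q, best_ccr = None, -1.0
    for _ in range(n_replications):
        # Step 1: sample rows with replacement from the feasible list.
        q_matrix = [rng.choice(candidate_rows) for _ in range(test_length)]
        ccr = simulate_ccr(q_matrix, n_students, rng)
        if ccr > best_ccr:
            best_q, best_ccr = q_matrix, ccr
    return best_q, best_ccr


# Toy run: 3 attributes and a hypothetical list of feasible rows (not Appendix B).
toy_rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
best_q, best_ccr = search_q_matrix(toy_rows, test_length=8,
                                   n_replications=20, n_students=100)
```

The toy run uses small settings so that it finishes quickly. The settings described in the text (20 items, 500 simulated students, 4,000 replications) can be passed through the same arguments, although pure Python would be considerably slower at that scale.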
This process uses the R-RUM as the diagnostic model because we believe the relationship between the attributes is conjunctive in the mechanical efficiency domain, and we want to have some flexibility in the model. Even if the R-RUM model is not the final correct model, as long as the correct model is a conjunctive model rather than a disjunctive or mixed type model, this method will result in a reasonable Q-matrix.
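For reference, the reduced RUM item response function is usually written as shown below. The notation is a standard convention assumed here rather than taken from this article: $\pi_j^*$ is the probability of a correct response to item $j$ for an examinee who has mastered all attributes the item requires, $r_{jk}^*$ is the penalty applied when required attribute $k$ is not mastered, $q_{jk}$ is the Q-matrix entry, and $\alpha_{ik}$ indicates whether examinee $i$ has mastered attribute $k$.

```latex
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i)
  = \pi_j^{*} \prod_{k=1}^{K} \left( r_{jk}^{*} \right)^{q_{jk}\,(1 - \alpha_{ik})},
  \qquad 0 < r_{jk}^{*} < 1 .
```

Each missing required attribute multiplies the success probability by a factor below one, so the model is conjunctive, while the attribute-specific penalties give it more flexibility than the DINA model, which applies a single drop for any missing attribute.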
Based on the final Q-matrix, which specified the blueprint for what items should measure, a multiple-choice item test was developed by two SMEs. This test was then reviewed and examined by a third SME.
Four different forms were constructed based on the given Q-matrix to decrease cheating. To create these four forms, items were arranged in three item “difficulty blocks.” That is, all the SMEs were required to rate each item as easy, medium, or hard based on their professional judgment. If there was disagreement on a rating, SMEs were asked to discuss it among themselves until they reached an agreement. Items were arranged from easy to hard on the test. Items with the same difficulty rating were rearranged within their block to create the four test forms.
After the test (and corresponding forms) had been created, the final step was to collect data and calibrate a DCM. Students were directed to different forms based on their student ID to ensure that the number of students who took each form was similar. The test was administered through Qualtrics (Qualtrics, Provo, UT). Students could take the test anytime during the two days the Qualtrics link was open. While there was no time limit specified in the Qualtrics administration of the test, students were asked to finish the test within a certain time limit (30 minutes) and informed that they could not go back and change answers once the test began. Only fully responded cases were recorded. The data were collected and downloaded through Qualtrics. Duplicate responses, as well as responses under 5 minutes and over 30 minutes, were eliminated.
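The form construction described above (a fixed easy-to-hard block order, with items reordered within each block) can be illustrated with a short sketch. The difficulty ratings below are hypothetical placeholders, and `build_forms` is an illustrative helper, not the procedure actually used by the SMEs.

```python
import random

LEVELS = ("easy", "medium", "hard")


def build_forms(difficulty_by_item, n_forms=4, seed=0):
    """Return n_forms item orderings: easy block, then medium, then hard,
    with the order of items shuffled within each block."""
    rng = random.Random(seed)
    forms = []
    for _ in range(n_forms):
        form = []
        for level in LEVELS:
            block = [item for item, d in difficulty_by_item.items() if d == level]
            rng.shuffle(block)  # reorder only within the difficulty block
            form.extend(block)
        forms.append(form)
    return forms


# Hypothetical SME ratings for a 20-item test (item number -> difficulty).
ratings = {i: "easy" for i in range(1, 9)}
ratings.update({i: "medium" for i in range(9, 16)})
ratings.update({i: "hard" for i in range(16, 21)})
forms = build_forms(ratings)
```

Random shuffling does not guarantee that the forms differ from one another, so in practice the resulting orderings would still need to be checked.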
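The exclusion rules at the end of this paragraph can be expressed as a small filter. The field names `student_id` and `duration_min` are hypothetical, and keeping the first response of each duplicated student is an assumption, since the text does not say which duplicate was retained.

```python
def clean_responses(records, min_minutes=5, max_minutes=30):
    """Keep the first response per student, then drop responses whose
    duration falls outside [min_minutes, max_minutes]."""
    seen_ids = set()
    kept = []
    for rec in records:
        if rec["student_id"] in seen_ids:
            continue  # duplicate response from the same student
        seen_ids.add(rec["student_id"])
        if min_minutes <= rec["duration_min"] <= max_minutes:
            kept.append(rec)
    return kept


# Hypothetical export rows, in submission order.
raw = [
    {"student_id": "A01", "duration_min": 18.0},
    {"student_id": "A01", "duration_min": 22.5},  # duplicate
    {"student_id": "A02", "duration_min": 3.2},   # under 5 minutes
    {"student_id": "A03", "duration_min": 41.0},  # over 30 minutes
    {"student_id": "A04", "duration_min": 29.9},
]
cleaned = clean_responses(raw)
```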
After the students’ response data were collected, a DINA model was fitted, and item parameters and students’ profiles were generated using the R package CDM (Robitzsch et al., 2020). Even though the R-RUM is used in the simulation approach rather than the DINA model, it is our belief that this approach is relatively robust to the selection of a specific conjunctive model. The DINA model was selected for two reasons. First, the SMEs determined that students need to master all the required attributes to get an item correct, and this quality is captured by a conjunctive model such as the DINA model. Second, the DINA model is widely applied in the literature (Cui et al., 2012; de la Torre, 2009; Huang et al., 2022; Li & Suen, 2013; Templin & Henson, 2006). After calibrating the data using the DINA model, item parameter estimates, attribute probabilities, and profile proportions were generated, reported, and analyzed based on the selected model. As specified in Equations 1 and 2, the item discrimination index (IDI; Lee, de la Torre, & Park, 2012) and the root mean square error of approximation (RMSEA) item fit index for each item j were computed and analyzed.
The IDI is a measure of how well an item is able to distinguish between masters and nonmasters, and the RMSEA is an absolute fit index. Both indices can be interpreted as indicators of item quality. Items whose discrimination was markedly higher or lower than that of the other items were examined. In addition, when interpreting the item RMSEA values, any value greater than 0.1 is classified as poor fit, values less than 0.1 and greater than 0.05 as moderate fit, and values less than 0.05 as good fit (Kunina-Habenicht et al., 2009). The guessing and slipping parameters for each item were computed and summarized. Specific outliers with a guessing or slipping parameter greater than 0.5 were inspected. The proportion of students answering each item correctly (the p value) was computed for each item and summarized.
In addition to the item characteristics, the properties of the attributes and the profiles were examined. Specifically, the estimated proportion of people in the population with each attribute mastery profile was examined. The ten profiles with the highest proportions were evaluated. The SMEs were interviewed on their expectations of how many students should master all the attributes and how many should master none. This information was compared to the observed percentages of all-mastery and nonmastery. Where there was an inconsistency, the potential reasons for it were analyzed.
Lastly, the proportion of mastery for each attribute was computed and summarized. The SMEs were asked in the interview about their expectations for the most and least mastered attributes, and these expectations were then compared to the actual results. The association between attributes was also explored using the tetrachoric correlations between the attributes.
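The screening rules in this paragraph can be illustrated with a few helper functions. This sketch assumes the common DINA form of the discrimination index, IDI = (1 - s) - g, which may differ in notation from the article's Equation 1. The RMSEA values are taken as given rather than computed, the handling of values exactly at 0.05 or 0.1 is an assumption, and the item parameters are made up for illustration.

```python
def idi_dina(guess, slip):
    """Item discrimination under DINA: P(correct | master) - P(correct | nonmaster)."""
    return (1.0 - slip) - guess


def rmsea_fit(rmsea):
    """Label item fit with the cutoffs quoted in the text (boundary handling assumed)."""
    if rmsea > 0.10:
        return "poor"
    if rmsea > 0.05:
        return "moderate"
    return "good"


def flag_items(item_params, cutoff=0.5):
    """Return items whose guessing or slipping parameter exceeds the cutoff."""
    return [item for item, (g, s) in item_params.items() if g > cutoff or s > cutoff]


# Hypothetical (guess, slip) estimates for three items.
params = {"item1": (0.15, 0.10), "item2": (0.55, 0.20), "item3": (0.30, 0.60)}
idi = {item: idi_dina(g, s) for item, (g, s) in params.items()}
flagged = flag_items(params)
```

An IDI near 1 means masters almost always succeed and nonmasters almost always fail, whereas an IDI near 0 means the item barely separates the two groups.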
Results
Domain Evaluation Phase
There are 13 chapters in the eighth-grade physics curriculum. In this class, domains are taught in chapters, so each chapter represents one domain. The SMEs reviewed each chapter to determine whether dichotomous attributes could be extracted from the 13 domains. Whenever there was a disagreement, they discussed it until a consensus was reached. If a consensus could not be reached, this would be interpreted as a situation where discrete attributes could not be reliably identified for that domain (although this situation did not occur across the 13 domains). After reviewing all the chapters, the SMEs concluded that all domains have attributes that could be extracted.
After determining that all domains (all 13 chapters) could be broken down into dichotomous attributes, the next step was to determine whether the number of attributes (and grain size) was reasonable for the permitted testing window. Because the diagnostic test was administered as after-school homework, the SMEs decided the total testing time should be less than 30 minutes, and as a result, they felt that the test should be composed of no more than 20 multiple-choice items. According to the literature, the average number of attributes measured by a diagnostic test is eight (Sessoms & Henson, 2018). Considering that the test is short (20 items), only domains with fewer than eight attributes were considered. Four of the 13 domains were determined to have fewer than eight attributes.
Given the four possible domains with eight or fewer attributes identified, it was necessary to determine whether they were all distinct attributes that could also be measured uniquely. Thus, SMEs were asked whether separate items could be written to measure each of the attributes in the four domains. They determined that two of the four domains had attributes that are typically measured together with other attributes in one question rather than singly, which would create a possible confound for a DCM. For example, a typical question in the convex lens image domain requires both understanding the concept of the convex lens and calculating the distance. It is uncommon for these two attributes to be measured separately in an item; thus, this domain was excluded. Therefore, two domains, buoyancy force and mechanical efficiency, were considered the domains in which an assessment designed for DCMs would most likely be successful and useful.
Recall that the goal is to identify the domain that could most reasonably work with diagnostic models. As such, in thinking about how to measure attributes in the given domain, we did not want to depart from the typical assessment (e.g., item type) students are usually exposed to. As an initial check, instead of evaluating all the possible ways of writing items, the SMEs were asked whether items could be written to measure each of the attributes in the previous step, with follow-up questions exploring whether items measuring only a single attribute would also be familiar to students. Thus, in this step, SMEs evaluated whether students are familiar with items that measure only one attribute in these two domains.
Typical items in the mechanical efficiency domain are mostly computational questions. Students need to complete multiple steps to reach the final correct answer. There are a total of seven attributes in this domain, and each one is equally important and easy to define. Each attribute is defined as being able to identify when it is appropriate to use one formula, memorize the correct formula, and know how to use the formula to obtain a specific value. SMEs can easily create items that measure one attribute or a combination of attributes by providing some information and asking students to compute the answer using the required attribute(s). Most importantly, these items will look very similar to the items that students are usually given on a test. In buoyancy force, by contrast, some attributes are computational skills, whereas others concern analysis, experimentation, and recognition. Different types of attributes make the test development process more complex, and the items may need to differ from the typical questions students are usually exposed to. For example, a typical question in buoyancy force usually combines computational, analytical, and recognition skills. Although SMEs could create an item that measures only one attribute, it would look unfamiliar to students and risk introducing construct-irrelevant variance into the test design.
Based on these criteria and the number of attributes, mechanical efficiency was selected as the most suitable domain for constructing the DCM-scored test. Given that this domain was the most appropriate, the SMEs were asked in the interview whether students needed to master all the required attributes to obtain a correct answer (or at least have a good chance of a correct answer) for the items. The SMEs discussed and agreed that students need all the attributes measured by an item to reach the correct answer. Because this relationship is consistent with a conjunctive model and the sample size is small, the DINA model was selected as the DCM for this physics test.
As a next step, SMEs were asked to formally define and describe the attributes. In the mechanical efficiency domain, there are seven required formulas that are used to solve questions in this domain. Each attribute is defined as overall mastery of each formula and its appropriate use. Thus, mastery is defined as knowing each formula, including correctly memorizing and applying the formula. The SMEs also helped describe the relationship of the seven formulas, which is shown in Figure 2 (a detailed description is presented in Appendix A). Although there are seven formulas, certain components in the formula could be computed from others. For example,

Figure 2. The relationship of the seven formulas.
Q-Matrix Development Phase
After defining all of the attributes, the SMEs were asked to list all of the possible attribute combinations that could be realistically measured by a single item. Assuming no restrictions, here are a total of 127 different combinations that could theoretically be measured when there are a total of seven attributes in a domain:
The SMEs considered all possible combinations and evaluated what a possible item would look like to measure that combination of attributes. Any combination of attributes that could not be measured by a single item was eliminated. For example, Attributes 3 and 7 could not be measured simultaneously by a single item because they do not have any shared variable to be connected into one item. The remaining combinations contained all-possible combinations of attributes for which hypothetical items could be written to measure. Note that these combinations also represent all possible 0/1 rows of a Q-matrix for the constructed test. In this study, a total of 37 combinations of the original possible combinations were retained as possible items that could be included in the test. All possible combinations of attributes that could be measured by a single item are included in Appendix B.
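The counting behind this step can be reproduced with a short enumeration. The filter below encodes only the single exclusion mentioned as an example (Attributes 3 and 7 cannot share an item). The SMEs' full set of judgments, which yields the 37 retained rows in Appendix B, is not reproduced here, and the function name is illustrative.

```python
import itertools


def all_q_vectors(n_attr):
    """All nonempty 0/1 attribute combinations, i.e., candidate Q-matrix rows."""
    return [v for v in itertools.product([0, 1], repeat=n_attr) if any(v)]


vectors = all_q_vectors(7)  # 2**7 - 1 combinations

# Example exclusion from the text: no item measures Attributes 3 and 7 together.
# (Attribute k is stored at index k - 1.)
feasible = [v for v in vectors if not (v[2] == 1 and v[6] == 1)]
```

This one rule alone removes the 32 vectors that contain both attributes. Further exclusions of the same kind would be applied to arrive at the list actually used.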
Because the test length was set to 20 items by the SMEs based on the allowable testing time, a subset of all possible items must be used to construct a test and its corresponding Q-matrix. Note that multiple items measuring the same attributes could be included in a test. However, when trying to measure seven attributes using only 20 items, not all combinations of items are equally informative. For example, if all 20 items measured only Attribute 1, then Attribute 1 could be measured well, but no information about Attributes 2–7 could be obtained. As a result, a “good” Q-matrix with only 20 items needed to be determined with the goal of measuring all attributes equally well, while also respecting the fact that items cannot measure all possible attribute combinations.
A simulation study was conducted to determine the Q-matrix structure that would measure all attributes well given these constraints. Table 1 provides a summary of the estimated CCRs for attributes across the 4,000 simulated 20-item tests, including the average, minimum, maximum, and standard deviation of the CCR. Recall that the Q-matrix of each simulated test was generated by randomly sampling, with replacement, from the list of 37 possible sets of attributes measured by a single item.
Summary of Correct Classification Rate for Attributes
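The logic of this search can be sketched in a few lines of Python. The sketch is illustrative only and is not the procedure used in the study: it treats the item parameters as known rather than estimated, assumes a common guessing and slipping value for all items, draws mastery profiles uniformly, and uses a hypothetical stand-in pool because the 37 SME-approved rows appear only in Appendix B. All function names and parameter values are assumptions made for illustration.

```python
import itertools
import math
import random

K = 7  # number of attributes
PROFILES = list(itertools.product([0, 1], repeat=K))  # all 128 mastery profiles

def ideal_response(profile, q_row):
    """DINA ideal response: 1 only if every attribute the item requires is mastered."""
    return int(all(a >= q for a, q in zip(profile, q_row)))

def simulate_ccr(q_matrix, n_examinees, guess, slip, rng):
    """Simulate DINA data for one Q-matrix and return the mean attribute-level CCR."""
    ideal = [[ideal_response(p, q) for q in q_matrix] for p in PROFILES]
    # Log-probabilities of (incorrect, correct) responses given the ideal response.
    log_prob = {0: (math.log(1 - guess), math.log(guess)),
                1: (math.log(slip), math.log(1 - slip))}
    hits = 0
    for _ in range(n_examinees):
        truth = rng.randrange(len(PROFILES))  # profiles drawn uniformly (assumption)
        answers = [int(rng.random() < (1 - slip if e else guess)) for e in ideal[truth]]

        # Maximum-likelihood classification with item parameters treated as known.
        def loglik(idx):
            return sum(log_prob[e][x] for e, x in zip(ideal[idx], answers))

        estimate = max(range(len(PROFILES)), key=loglik)
        hits += sum(t == e for t, e in zip(PROFILES[truth], PROFILES[estimate]))
    return hits / (n_examinees * K)

def search_q_matrix(pool, n_items, n_tests, n_examinees, guess, slip, seed):
    """Draw n_tests random Q-matrices from pool (with replacement); keep the best CCR."""
    rng = random.Random(seed)
    best_q, best_ccr = None, -1.0
    for _ in range(n_tests):
        q_matrix = [rng.choice(pool) for _ in range(n_items)]
        ccr = simulate_ccr(q_matrix, n_examinees, guess, slip, rng)
        if ccr > best_ccr:
            best_q, best_ccr = q_matrix, ccr
    return best_q, best_ccr

# Hypothetical stand-in for the SME-approved rows: every pattern with one to
# three attributes that does not pair Attributes 3 and 7.
pool = [p for p in PROFILES if 1 <= sum(p) <= 3 and not (p[2] and p[6])]

best_q, best_ccr = search_q_matrix(pool, n_items=20, n_tests=20, n_examinees=50,
                                   guess=0.2, slip=0.2, seed=2024)
print(round(best_ccr, 3))
```

The small values of `n_tests` and `n_examinees` keep the sketch fast; a realistic run would use many more simulated tests (the study used 4,000) and would estimate the model from the simulated responses instead of assuming known parameters.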
The Q-matrix with the highest CCR for attributes (0.83) was selected for the diagnostic test, and it is listed in Table 2. The final Q-matrix contains 20 items. Items 1 through 10 have a simple structure, such that each item measures only one attribute. Items 11 through 14 measure two attributes, and Items 15 through 20 measure three attributes. The Q-matrix measures Attributes 1 and 6 the most (seven times each) and Attribute 4 the least (three times); three is typically cited as the minimum number of times an attribute should be measured (Kuo et al., 2016).
Final Q-Matrix With the Highest Correct Classification Rate
Test Development and Administration Phase
After obtaining the target Q-matrix for the test, a multiple-choice test was developed by two of the SMEs and then reviewed by the third SME. Specifically, for each row of the Q-matrix, the SMEs wrote an item that measures those specific attributes. Because of the previous review, it was known that items could be written to measure all combinations of attributes in this Q-matrix. Four test forms were created by rearranging the order of the questions. Students were directed to different test forms based on the first letter of their last name. The diagnostic test was administered through Qualtrics to ninth-grade students in five middle schools in Guangdong province. A total of 560 responses were recorded in Qualtrics. After eliminating duplicate responses, as well as responses completed in under 5 minutes or over 30 minutes, 397 responses were retained for the analysis.
Test Analysis Phase
After the data were exported from Qualtrics, the response data were analyzed in R using the “CDM” package. The item parameters and item statistics for the DCM-scored test are shown in Table 3. The p value is the proportion of examinees who answered an item correctly, and it serves as the indicator of item difficulty. The average p value is .742, and 12 of the 20 items have a p value greater than .7. Items that measure one attribute tend to have a higher p value than items that measure more than one attribute. Item quality was low for Items 1, 2, 4, and 8, each of which has an item discrimination index (IDI) below 0.4. Items 1, 2, 4, 8, 9, and 10 have a guessing parameter estimate higher than 0.5; all of these items measure only one attribute. Items that measure two or three attributes have lower guessing parameters. Most of the items have a relatively small slipping parameter; the highest slipping parameter estimate belongs to Item 12 (0.238). The mean of the RMSEA item-fit index is 0.08. The RMSEA item-fit indices for the DINA model showed three items with good fit (RMSEA < .05), 13 items with moderate fit (RMSEA < .10), and four items with poor fit (RMSEA > .10; Kunina-Habenicht et al., 2009).
Item Parameter and Item Statistics for the 20 Items
Note. RMSEA = root mean square error of approximation; IDI = item discrimination index.
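For readers who want to see how these item statistics relate to one another, the following Python sketch computes a p value from scored responses, an IDI from the DINA guessing and slipping parameters, and a fit label from the RMSEA cutoffs quoted above. The IDI formula used here, P(correct | master) minus P(correct | non-master), is an assumption about how the index is defined for the DINA model; the function names and the example values are hypothetical and are not taken from Table 3.

```python
def p_value(item_scores):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return sum(item_scores) / len(item_scores)

def dina_idi(guess, slip):
    """IDI, assumed here to be P(correct | master) - P(correct | non-master)."""
    return (1.0 - slip) - guess

def rmsea_fit_label(rmsea):
    """Label item fit using the cutoffs quoted in the text (.05 and .10)."""
    if rmsea < 0.05:
        return "good"
    if rmsea < 0.10:
        return "moderate"
    # The text does not say how a value of exactly .10 is labeled; treated as poor here.
    return "poor"

# Hypothetical item: guessing 0.55, slipping 0.10, answered correctly by 8 of 10 examinees.
scores = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(p_value(scores))
print(round(dina_idi(0.55, 0.10), 2))
print(rmsea_fit_label(0.08))
```

Under this definition, an item with a guessing parameter above 0.5 cannot reach an IDI of 0.5 even with no slipping, which is consistent with the single-attribute items showing both high guessing and low discrimination.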
Table 4 provides the tetrachoric correlations between the seven attributes. The correlation between Attributes 3 and 6 is the lowest with a value of 0.35, which could be due to the fact that both attributes are related to calculating
Tetrachoric Correlations Between the Seven Attributes
Table 5 provides the 10 profiles with the highest estimated proportions in the population, which together account for almost 80% of the population. It also shows that 49.2% of the students have mastered all of the attributes and 5.2% of the students have mastered none. One SME was interviewed about her expectation of how many students would master all the attributes and how many would master none. She anticipated that approximately 40% of the students would master all the attributes and approximately 5% would master none. The estimates are broadly consistent with this SME's expectations. No particular hierarchical structure or learning pattern appears when examining the profiles. Table 6 provides the estimated proportion of mastery for all the attributes. The proportion of masters for each attribute is higher than 0.6, and the average proportion of masters is 0.7. Attributes 4 and 7 have the lowest proportions of mastery, which is consistent with the SMEs’ expectations. The SMEs expected Attribute 4 to be the hardest formula to apply and Attribute 7 to be the formula students most easily forget. The rest of the attributes were expected to be equally easy.
Partial Profile Proportion
Proportion of Mastery for Each Attribute
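The link between profile proportions and attribute-level mastery proportions can be made concrete with a small sketch: the proportion of masters for an attribute is the sum of the proportions of all profiles in which that attribute is mastered. The Python below uses a hypothetical three-attribute example; the numbers are invented for illustration and are not the study's estimates, and the function name is an assumption.

```python
def attribute_mastery(profile_proportions):
    """Sum profile proportions over the profiles in which each attribute is mastered."""
    n_attributes = len(next(iter(profile_proportions)))
    mastery = [0.0] * n_attributes
    for profile, proportion in profile_proportions.items():
        for k, mastered in enumerate(profile):
            if mastered:
                mastery[k] += proportion
    return mastery

# Hypothetical profile distribution over three attributes (proportions sum to 1).
toy_profiles = {
    (1, 1, 1): 0.50,  # mastered all attributes
    (0, 0, 0): 0.05,  # mastered none
    (1, 0, 0): 0.15,
    (1, 1, 0): 0.20,
    (0, 0, 1): 0.10,
}
toy_mastery = attribute_mastery(toy_profiles)
print([round(m, 2) for m in toy_mastery])
```

Because every attribute's proportion includes the all-mastered profile, a large all-mastered class places a floor under each attribute's mastery proportion.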
In conclusion, the results of this article support the possibility of applying a DCM to a middle-school physics test through careful selection of the content domain and a simulation approach to Q-matrix construction. The results of the test were promising despite some items that showed inadequate fit and quality. In addition, the correlations between attributes showed that most attributes could be distinguished, and the proportions of attribute mastery and of mastery profiles are reasonably consistent with the SMEs’ expectations.
Discussion
Conclusion
The focus of this article is to address critical challenges in the application of DCMs. Specifically, the literature has documented particular challenges that arise when attempting to apply DCMs. This article places front and center the fact that DCMs cannot be applied to any domain without first considering the typical features of that domain, and it demonstrates the factors to consider before application. Some characteristics of a domain make a DCM a more favorable psychometric scoring model than others. In some instances, particular features prevented, or at least limited, testing a domain with a DCM-scored test; thus, the domain needs to be examined prior to application. In addition, this article showed that, even when a domain is carefully evaluated, there may be additional constraints on test length (how much time is allowed) and on how items could actually be written. Given these constraints, this article provides a simulation-based approach to determine a reasonable test blueprint (target Q-matrix). In doing so, the quality of the test is not confounded with the initial test design.
The results demonstrate that it is possible to apply a DCM to a middle-school physics test with careful evaluation of the content domain and a simulation approach for the construction of a target Q-matrix. Using the phases outlined in this article, the DCM results showed that only four of the 20 items had poor fit and only four had poor quality. On average, the correlations between the attributes were moderate, although a few correlations were high. For example, the correlation between Attributes 4 and 5 was the highest, with a value of 0.91. A high correlation might arise because the two attributes are not distinct in nature; in that case, mastery of one attribute in the general population would tend to accompany mastery of the other. Further discussion with the SMEs would be necessary to determine whether attributes that are highly correlated with each other could be merged into a single attribute. Both the majority of the student profiles and the mastery proportions of the attributes are consistent with the SMEs’ expectations. While examining the profiles, no learning or hierarchical structure pattern emerged, which is also consistent with the assumed independent relationship among the attributes.
Although the results have some issues, in a typical application new items would be constructed and piloted, and after the item parameters were analyzed based on responses, items of insufficient quality would be dropped or refined in the next round. The current study contains only a single iteration. Even with some poor-quality items and high correlations among some attributes, the test is still considered acceptable and provides usable diagnostic information.
Although the current literature on DCM applications includes many real-data examples, most applications have used a retrofitting approach: the exams were not designed for diagnostic models, and a diagnostic model was fit to them after the fact. Even among the few DCM applications in which a test was constructed specifically for a diagnostic model, limited information exists on how to select a domain and construct a Q-matrix. Most studies have relied on SMEs for domain selection and Q-matrix construction. This study provides a new perspective, in which the psychometrician can guide the domain evaluation process and the Q-matrix construction using practical methods.
Limitations and Future Directions
While this study showcases a domain evaluation and Q-matrix construction procedure with a full application, it is not without limitations. Even though this was a low-stakes and easy assignment, many students were motivated, which may explain the high proportion of students who mastered all attributes (49.2%). However, no hierarchical structure or distinctive mastery pattern was identified among profiles with very few mastered attributes. This could reflect low motivation among some students, whose true mastery of the attributes may be better than their responses suggest. Mechanical efficiency is not the most challenging domain for ninth-grade students, and it usually forms a small part of a test rather than the basis of a whole test. Because the coverage of this domain is small compared to a full test, it is not uncommon for some of the attributes to be highly correlated. Some attribute pairs did show a high correlation, possibly because the two attributes are not distinct in nature. Further investigation could examine whether the highly correlated attributes are actually distinct or whether they could be merged into one attribute. Typically, new items should be tested before actual administration. In the future, follow-up item analyses could be done to inspect the items, poor-quality items could be dropped or refined, and the test could be improved. The current study provides some validity evidence; however, more evidence could be collected to support the validity of the test.
Several directions for future work follow. First, the results of the current study could be used to improve items before administration, which might result in more reliable test outputs. Items with poor quality or fit could be examined for potential improvement or elimination before an actual test. Second, more work could be done to provide internal and external validity evidence to support student classifications. For example, to provide internal validity evidence, a think-aloud protocol with students could determine whether the attributes defined by the SMEs are the skills students use during the problem-solving process. To provide external evidence, a correlation study could compare this DCM-scored test with other physics tests. Finally, once the usefulness of the diagnostic feedback has been verified, future work should investigate how to present the feedback to students and teachers. Methods such as surveys, interviews, or interventions could be used to evaluate the feedback.
Footnotes
Appendix A
Appendix B
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
