Abstract
This study proposes a multiple-group cognitive diagnosis model to account for the fact that students in different groups may use distinct attributes or use the same attributes but in different manners (e.g., conjunctive, disjunctive, and compensatory) to solve problems. Based on the proposed model, this study systematically investigates the performance of the likelihood ratio (LR) test and Wald test in detecting differential item functioning (DIF). A forward anchor item search procedure was also proposed to identify a set of anchor items with invariant item parameters across groups. Results showed that the LR and Wald tests with the forward anchor item search algorithm produced better calibrated Type I error rates than the ordinary LR and Wald tests, especially when items were of low quality. A set of real data were also analyzed to illustrate the use of these DIF detection procedures.
Introduction
Cognitively diagnostic assessments (CDAs; Nichols et al., 1995) aim to provide students with diagnostic feedback by analyzing their responses to test items. To ensure that the feedback is psychometrically valid and sound, many cognitive diagnosis models (CDMs) have been proposed, including the deterministic inputs, noisy “and” gate (DINA; Haertel, 1989) model, the deterministic inputs, noisy “or” gate (DINO; Templin & Henson, 2006) model, and the generalized deterministic inputs, noisy “and” gate (G-DINA) model (de la Torre, 2011), to name a few. CDMs are restricted latent class models, where the latent variables are typically binary, representing the presence and absence of attributes of interest. The estimated attribute profile characterizes the strengths and weaknesses of the student and thus may be used for personalized learning.
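As a minimal illustration of how a conjunctive CDM differs from a disjunctive one, the following Python sketch computes the success probability of a single item under the DINA and DINO models from an attribute profile, a q-vector, and guessing and slip parameters. The function names and all numeric values are hypothetical and serve only as an example.

```python
def dina_eta(alpha, q):
    """Conjunctive rule: 1 only if every attribute required by the item is mastered."""
    return int(all(a == 1 for a, r in zip(alpha, q) if r == 1))


def dino_eta(alpha, q):
    """Disjunctive rule: 1 if at least one attribute required by the item is mastered."""
    return int(any(a == 1 for a, r in zip(alpha, q) if r == 1))


def success_probability(eta, guess, slip):
    """P(correct) is 1 - slip when the ideal response is 1, and guess otherwise."""
    return 1.0 - slip if eta == 1 else guess


# Hypothetical example: the item requires both attributes,
# and the student has mastered only the first one.
alpha = (1, 0)
q_vector = (1, 1)
guess, slip = 0.2, 0.1

p_dina = success_probability(dina_eta(alpha, q_vector), guess, slip)
p_dino = success_probability(dino_eta(alpha, q_vector), guess, slip)
print(p_dina, p_dino)
```

Under the conjunctive rule this student is treated like a non-master of the item, whereas under the disjunctive rule mastering one required attribute is enough.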
Despite a large number of CDMs available, most of them assume that all students come from the same population, which may not be the case in practice. A few researchers have suggested using multiple-group models to handle sample heterogeneity (e.g., George & Robitzsch, 2014; Xu & Davier, 2008). The multiple-group models allow the comparison between students from different groups, such as countries or genders (Johnson et al., 2013), and may be used for detecting differential item functioning (DIF; for example, George & Robitzsch, 2014) and accommodating missing responses (Rose et al., 2017).
To unlock the potentials of CDMs, many important statistical routines are needed. One of these routines is the procedure for detecting DIF items. Because DIF is closely related to test fairness, detecting DIF has become a routine task in psychometric analyses. An item is defined as a DIF item when students from different groups with the same ability show different probabilities of success (Magis et al., 2010). Similarly, in the CDM context, an item is said to function differently when the probability of success to an item differs across manifest groups of students with the same attribute profile (Hou et al., 2014). The presence of DIF items has been viewed as a potential threat to test validity and could worsen the attribute estimation (Paulsen et al., 2020).
To date, only a few DIF detection procedures for CDMs have been investigated. Zhang (2006) investigated the performance of Mantel–Haenszel (MH; Holland & Thayer, 1988) and SIBTEST (Shealy & Stout, 1993) in DIF detection by matching students on their test scores, true scores, and attribute profiles from the DINA model. However, the attribute profiles for different groups were not estimated separately, which could yield biased estimates when DIF items exist. Also, the MH and SIBTEST performed poorly in detecting nonuniform DIF (Zhang, 2006). F. Li (2008) modified the higher-order DINA model (de la Torre & Douglas, 2004) to separate the construct-relevant DIF from the construct-irrelevant DIF. The higher-order DINA model was estimated without equality constraints on item parameters across the reference and focal groups, and then the DIF was investigated through the marginalized differences in probabilities of success of an item. However, under some conditions, Type I error rates were out of control. Hou et al. (2014) proposed to use the Wald test to detect both uniform and nonuniform DIF in the DINA model and found that the Wald test, which performed as well as, if not better than, the MH and SIBTEST methods, suffered from inflated Type I error when items were of low quality. Hou et al. (2020) have recently used the Wald test for detecting DIF under the G-DINA model. Liu et al. (2019) examined the performance of the Wald test with different types of covariance matrix in DIF detection and found that the covariance matrix estimated using the complete information approach (Philipp et al., 2018) produced better calibrated Type I error than the item-wise information matrix. It should also be noted that all these studies (Hou et al., 2014; F. Li, 2008; X. Li & Wang, 2015; Liu et al., 2019; Paulsen et al., 2020; Zhang, 2006) investigated the DIF detection based on the DINA model, which is one of the simplest CDMs and may not hold in practice. An exception is X. Li and Wang (2015), who developed a model for DIF detection based on the loglinear CDM (Henson et al., 2009) by introducing additional item parameters. A major limitation of this method is that the model can only be estimated using the Markov chain Monte Carlo (MCMC) algorithm, which can be very time-consuming. Another exception is Svetina et al. (2017), where the reparameterized unified model was fit to the data and the generalized logistic regression method with item purification was used to identify DIF items among accommodated and nonaccommodated groups in the National Assessment for Educational Progress (NAEP). However, Svetina et al. (2017) did not examine the performance of the item purification.
The goal of this study is threefold: (a) to develop a multiple-group generalized deterministic inputs, noisy “and” gate (MG-GDINA) model to relax the conjunctive assumption of the MG-DINA model by Johnson et al. (2013) and George and Robitzsch (2014), (b) to compare the performance of the likelihood ratio (LR) test and Wald test in detecting DIF based on the MG-GDINA model, and (c) to propose a forward anchor item search (FS) procedure to be used along with the LR and Wald tests for DIF detection. The remaining parts of this article are laid out as follows. Section “Multiple-Group G-DINA Model” introduces the MG-GDINA model, based on which Section “Detecting DIF Items Using the MG-GDINA Model” presents the LR test and Wald test for DIF detection. Section “Simulation Study” gives a simulation study for evaluating the performance of the proposed procedures, followed by a real data example in Section “Real Data Analysis.” The article concludes with a brief summary of findings and a discussion of future directions.
Multiple-Group G-DINA Model
Let
The G-DINA model is a generalized DINA model developed by de la Torre (2011). For item
where
The MG-GDINA model is a straightforward extension of the G-DINA model to account for multiple groups. It assumes that different groups may have different q-vectors and parameters for item
where
Detecting DIF Items Using the MG-GDINA Model
This section presents how the LR test and the Wald test can be used to identify DIF items based on the MG-GDINA model. Although the MG-GDINA model can be used for more than two groups, in this article the DIF detection is considered for only two groups, namely, the reference group and the focal group, as in previous studies (e.g., Hou et al., 2014, 2020; Zhang, 2006). It is, however, straightforward to extend the procedures to three or more groups.
LR Test for DIF Detection
The LR test used in the item response theory framework for DIF detection (IRT LR-DIF; for example, A. S. Cohen et al., 1996) can, in theory, be used in conjunction with the aforementioned MG-GDINA model without any major modifications. Specifically, when it is unclear which items are DIF-free, the common practice is to fit data using two models: (a) a simpler model that treats all items as anchor items and (b) an augmented model that treats all items except the studied one as anchor items. The LR statistic can be calculated from the observed likelihoods of these two models. A limitation of this procedure is that the simpler model may not fit data well when some DIF items are assumed to be DIF-free (Wang & Yeh, 2003), which could yield an LR statistic deviating from its theoretical distribution (Maydeu-Olivares & Cai, 2006). To address this issue, some strategies have been proposed that use a single DIF-free item or only a few DIF-free items to link two groups (González-Betanzos & Abad, 2012).
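The LR computation described above can be sketched as follows, assuming the marginal log-likelihoods of the simpler and augmented models are already available from fitted models. The log-likelihood values, the function names, and the degrees of freedom used here are illustrative assumptions rather than quantities from this study. The chi-square tail probability is obtained from a simple series expansion so that the block relies only on the standard library; it is adequate for moderate statistic values, not a replacement for a statistical library.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable (series for the incomplete gamma)."""
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    # Series: P(a, x) = x^a e^{-x} / Gamma(a + 1) * sum_n x^n / ((a + 1)...(a + n))
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= x / (a + n)
        total += term
    lower = math.exp(-x + a * math.log(x) - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - lower)


def lr_test(loglik_simple, loglik_augmented, df):
    """LR statistic and p value for two nested models.

    df is the number of parameters constrained in the simpler model
    (an input the user must supply; it is not derived here).
    """
    stat = 2.0 * (loglik_augmented - loglik_simple)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods for one studied item.
statistic, p_value = lr_test(loglik_simple=-1002.0, loglik_augmented=-1000.0, df=2)
print(round(statistic, 3), round(p_value, 4))
```

The item would be flagged when the p value falls below the chosen alpha level.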
However, in CDMs, different groups are on the same scale naturally, and assuming some items are DIF-free is not necessary. Because of this, this study modifies the IRT LR-DIF procedure for DIF detection in CDMs. Specifically, two MG-GDINA models were fitted to the data, where the simpler one assumes that the item parameters of all items except the studied one are free to estimate across groups. The resulting marginalized log-likelihood is denoted by
which is
Wald Test for DIF Detection
The Wald test (Wald, 1943) is a widely used hypothesis test in statistics. In the context of CDMs, it has been used for comparing nested models (de la Torre, 2011; de la Torre & Lee, 2013; Ma & de la Torre, 2019a; Ma et al., 2016), detecting DIF (George & Robitzsch, 2014; Hou et al., 2014, 2020; Liu et al., 2019), and validating the Q-matrix empirically (Ma & de la Torre, 2019b; Terzi, 2017; Terzi & Sen, 2019). To detect DIF items using the Wald test, Hou et al. (2014) calibrated data for each group separately, whereas in this study the MG-GDINA model is adopted, which calibrates multiple groups concurrently. Unlike George and Robitzsch (2014), because two groups are on the same scale automatically, the parameters of all items were allowed to vary across groups. The Wald test is then conducted for all the studied items one by one. A
where
Hou et al. (2014) calculated the covariance matrix by inverting the information matrix for each item separately and ignored the population proportion parameters. The resulting Wald test has been found too liberal, especially when the sample size is small and items are of low quality (Hou et al., 2014). Philipp et al. (2018) showed that the covariance matrix can be better estimated using an outer-product of gradient (OPG) method when all parameters are taken into consideration. Liu et al. (2019) also showed that the Wald test based on the OPG method with all parameters produced better calibrated Type I error rates for the DIF detection based on the DINA model. Therefore, this study considers all parameters when calculating the covariance matrix of
DIF-Free Item Identification
Note that the aforementioned DIF detection methods based on the LR and Wald statistics do not assume any items to be DIF-free, but it is likely that not all items in a test exhibit DIF. Specifying items that are DIF-free as anchor items may stabilize parameter estimation and in turn improve the performance of the LR and Wald statistics in detecting DIF items. A related procedure that has been widely used in the IRT context is the item or scale purification (e.g., Clauser et al., 1993). Despite a number of variants, the purification usually treats all items as anchor items at the beginning of the process to obtain comparable ability scales for two groups and then removes items that exhibited DIF in each iteration from the anchor set to obtain a “purified” scale. The purification is not used in this study because, unlike IRT models, parameters of CDMs from two groups are naturally on the same scale and viewing all items as DIF-free is unnecessary. An FS algorithm is introduced below, which shares the same goal as the purification, that is, to identify a set of DIF-free anchor items, but starts by assuming none of the items is DIF-free. Compared with the purification, which can be viewed as a “backward” search algorithm, the FS algorithm has the potential to remove the impact of including DIF items in the anchor set.
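The general idea of a forward search can be sketched as an iterative loop. This is only one plausible reading, not the authors' exact algorithm: it assumes that each round re-tests all items given the current anchor set, that items with nonsignificant results form the next anchor set, and that the loop stops when the anchor set no longer changes or a maximum number of iterations is reached. The `run_tests` callback, which would wrap the model fitting and the LR or Wald test, is hypothetical, and the toy p values are invented for illustration.

```python
def forward_anchor_search(items, run_tests, alpha=0.05, max_iter=10):
    """Iteratively build an anchor set, starting from no anchor items.

    run_tests(anchors) is a hypothetical callback returning a dict that maps
    each item to the p value of its DIF test given the current anchor set.
    """
    anchors = frozenset()  # forward search: no item is assumed DIF-free at the start
    for _ in range(max_iter):
        p_values = run_tests(anchors)
        new_anchors = frozenset(i for i in items if p_values[i] >= alpha)
        if new_anchors == anchors:  # anchor set is stable
            break
        anchors = new_anchors
    flagged = [i for i in items if i not in anchors]
    return sorted(anchors), flagged


def toy_run_tests(anchors):
    """Invented p values: item C looks like DIF only when no anchors are used."""
    return {"A": 0.50, "B": 0.001, "C": 0.04 if not anchors else 0.20}


anchor_items, dif_items = forward_anchor_search(["A", "B", "C"], toy_run_tests)
print(anchor_items, dif_items)
```

In this toy run, the initial flag on item C disappears once item A serves as an anchor, which mirrors the intended benefit of stabilizing estimation with anchor items.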
LR test with FS
To detect DIF items using the LR test with the FS procedure (denoted by LR-FS for short), the aforementioned LR-DIF method was conducted first, based on which let
Wald test with FS
To detect DIF items using the Wald test with the FS procedure (denoted by Wald-FS for short), the aforementioned Wald-DIF method was conducted first, based on which let
Simulation Study
Design
A simulation study was conducted to assess the performance of the LR-DIF, the LR-FS, the Wald-DIF, and the Wald-FS. Five factors were manipulated.
Type of DIF
Both uniform and nonuniform DIF were considered. An item is said to exhibit uniform DIF when it favors one group relative to the other consistently for all attribute profiles, or nonuniform DIF when it does not consistently favor a certain group. In particular, for simulation purposes, if item
DIF magnitude
The DIF magnitude for item
Percentage of DIF items
Similar to Paulsen et al. (2020) and Qiu et al. (2019), this study considered that 0%, 20%, and 40% of items exhibited DIF. The DIF items were randomly selected from all possible items with the constraint that one-third of the DIF items required a single attribute, one-third required two attributes, and the remaining one-third required three attributes.
Sample size per group
This study considered three levels of sample size for each group: N = 500, 1,000, and 2,000. The former two levels were in line with Hou et al. (2014), and the last level was included because the G-DINA model is more complicated than the DINA model used in Hou et al. (2014). These levels are also in line with the review of 36 CDM applications by Sessoms and Henson (2018), where the mean and median of sample sizes in these studies were 1,788 and 1,255, respectively, and 30% of these studies involved samples of 2,000 or more participants.
Item quality
Similar to Ma et al. (2016), item quality had three levels:
In addition to the factors manipulated, other factors were fixed to make the simulation more manageable. In particular, the numbers of items and attributes were fixed to
In sum, this study consists of 3 (Sample Size) × 3 (Item Quality) = 9 conditions without any DIF items and 2 (Type of DIF) × 3 (Sample Size) × 3 (Item Quality) × 2 (Proportion of DIF Items) × 2 (DIF Sizes) = 72 conditions with some DIF items. Under each condition, 300 data sets were generated and four DIF detection procedures were carried out.
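The condition counts above can be checked by enumerating the design. The sketch below crosses the factor levels described in the text; the short level labels (e.g., "high", "small") are shorthand chosen here, not the authors' notation.

```python
from itertools import product

sample_sizes = [500, 1000, 2000]
item_quality = ["high", "moderate", "low"]
dif_types = ["uniform", "nonuniform"]
dif_proportions = [0.20, 0.40]
dif_magnitudes = ["small", "large"]
replications = 300

# Conditions without DIF items: Sample Size x Item Quality.
null_conditions = list(product(sample_sizes, item_quality))

# Conditions with DIF items: all five manipulated factors crossed.
dif_conditions = list(
    product(dif_types, sample_sizes, item_quality, dif_proportions, dif_magnitudes)
)

total_data_sets = replications * (len(null_conditions) + len(dif_conditions))
print(len(null_conditions), len(dif_conditions), total_data_sets)
```

With 300 replications per condition, the 81 conditions amount to 24,300 simulated data sets, each analyzed by four procedures.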
Analysis
To assess the performance of these four procedures in detecting DIF items, the following two criteria were considered.
Type I error
Type I error rates were calculated as the proportion of DIF-free items that were incorrectly flagged as DIF items. Note that nine conditions were considered where all items in each replication were DIF-free and 72 conditions where only a portion of items was DIF-free. For either case, the Type I error rates were calculated for each of the DIF-free items and then averaged across all DIF-free items measuring the same number of attributes. The Type I error rates are not expected to be equal to the nominal level because of the sampling errors, but have a 95% chance of falling within
Empirical power
Statistical power indicates the performance of a hypothesis test in rejecting a false null hypothesis. To compare statistical power rates of different procedures, all procedures should have comparable observed Type I error rates. However, this is not the case in this study as can be observed in Section “Results.” Consequently, the empirical power rates calculated from the empirical distributions under the null hypothesis were examined. In particular, the 95th percentile of the test statistic of each procedure was calculated under the null condition where all items were DIF-free and used as the empirical cutoff. The empirical power rate, which was calculated for each test under each condition, is defined as the percentage of obtained test statistics that were greater than the empirical cutoff under the same condition. The empirical power rates were also averaged across all items requiring the same number of attributes under each condition. As in de la Torre and Lee (2013), a test power of .8 or above is considered adequate.
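The empirical power calculation described above can be sketched as follows. The code assumes that the test statistics for a given item and procedure have been collected across replications under both the null condition and the matching DIF condition. The function names and the toy numbers are hypothetical, and the percentile uses the "inclusive" interpolation rule of Python's statistics module, which may differ slightly from the rule used in the study.

```python
import statistics


def empirical_cutoff(null_statistics):
    """95th percentile of the test statistics obtained when all items are DIF-free."""
    # n=20 yields cut points at 5%, 10%, ..., 95%; the last one is the 95th percentile.
    return statistics.quantiles(null_statistics, n=20, method="inclusive")[-1]


def empirical_power(dif_statistics, cutoff):
    """Proportion of test statistics under a DIF condition that exceed the empirical cutoff."""
    exceed = sum(1 for s in dif_statistics if s > cutoff)
    return exceed / len(dif_statistics)


# Toy numbers for illustration only.
null_stats = [float(v) for v in range(1, 101)]
dif_stats = [90.0, 96.0, 100.0, 50.0]

cutoff = empirical_cutoff(null_stats)
power = empirical_power(dif_stats, cutoff)
print(cutoff, power)
```

Using the empirical cutoff rather than the theoretical critical value puts procedures with different observed Type I error rates on a comparable footing.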
The Wald test and the LR test were performed at the .05 alpha level. The maximum number of iterations for the FS algorithm was set at 10. Data simulation and DIF detection were implemented using the GDINA R package (Ma & de la Torre, 2020) and the sample code can be downloaded from https://doi.org/10.17605/OSF.IO/3579Y. To better understand the results, mixed analyses of variance (ANOVAs) were performed for each of the criteria using the R package rstatix (Kassambara, 2020). To examine the sizes of different effects, the generalized
Results
Type I error rates
Type I error rates calculated under the conditions where all items were DIF-free and the conditions where some items exhibited DIF were presented separately. In particular, Figure 1 shows the observed Type I error rates when all items were DIF-free. It can be observed that, when items were of high or moderate quality, all procedures can generally maintain the observed Type I error rates within a reasonable range around the nominal level, especially under large sample conditions. In particular, as shown in Figure 1, the Wald-DIF and Wald-FS produced averaged observed Type I error rates within
Figure 1. Observed Type I error rates when all items were DIF-free.
Figure 2. Observed Type I error rates when some items exhibited DIF.
To analyze the impact of different design factors on the observed Type I error rates, mixed ANOVA was employed. The ANOVA tables and nontrivial interaction plots are given in the Online Appendix. The highest-order nontrivial interaction is the three-way interaction of Sample Size × Item Quality × Number of Attributes (
Empirical power
The empirical power rates were analyzed using mixed ANOVA and Figure 3 displays the empirical power rates of four DIF detection methods at different combinations of factors with nontrivial effects. Results showed that the DIF detection method had a small main effect
Figure 3. Empirical power rates.
The mixed ANOVA also revealed that two three-way interactions had nontrivial effects, that is, Item Quality × DIF Magnitude × Number of Attributes measured
Finally, the conditions where adequate power rates can be observed cannot be easily summarized. As shown in Figure 3, these four procedures may be only able to detect DIF items of small DIF magnitude with adequate power rates when items were of high quality and the sample size was large. In contrast, some procedures were more likely to correctly detect DIF items of a large DIF magnitude even under some less favorable conditions. For example, under uniform DIF and large sample conditions, the Wald-FS and LR-FS can yield adequate power rates (i.e.,
Real Data Analysis
To illustrate the use of the Wald and LR tests in detecting DIF items in practice, a set of real data were analyzed, which are part of a larger data set obtained from a Dutch-language version of the Millon Clinical Multiaxial Inventory-III, a self-report clinical instrument (Millon et al., 2009; Rossi et al., 2007). For the current illustration, 30 items that were examined in Ma et al. (2016) were analyzed, with three clinical scales or attributes, namely, somatoform (Scale H), thought disorder (Scale SS), and major depression (Scale CC). Ma et al. (2016) only analyzed the item responses of male respondents, but in this study the responses of 471 female respondents and 739 male respondents were analyzed using the aforementioned Wald and LR tests with and without the FS procedure. The Q-matrix can be found in Ma et al. (2016).
Table 1 gives the test statistics and p values for items that were flagged by at least one of the four DIF detection methods. Note that the p values were adjusted using the Holm (1979) method to control the familywise error rate at the .05 nominal level for multiple comparisons. It can be observed that 11 items were flagged by all four methods and that the LR-DIF flagged the most DIF items (i.e., 15), whereas the Wald-FS method flagged the fewest (i.e., 11). In addition, the four DIF detection methods produced inconsistent results for Items 2, 20, 21, and 24. Based on the simulation study, the LR-FS performed relatively well when the sample size was small for each group. Therefore, the items that were identified as DIF-free by the LR-FS method were used as anchor items for the recalibration of the data. Figure 4 displays the estimated endorsement probabilities of female and male respondents to Items 2, 20, and 21. It can be observed that all three items exhibited uniform DIF, where male respondents seem to have lower endorsement probabilities than female respondents after controlling for their latent attribute profiles. Figures for other items can be found in the Online Appendix, and it can be observed that all DIF items identified by the LR-FS method exhibited uniform DIF (i.e., the female group had higher endorsement probabilities), with the only exception of Item 5.
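The Holm (1979) step-down adjustment mentioned above can be sketched in a few lines. The function name and the example p values are hypothetical and are not taken from Table 1.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p value by (m - k + 1), cap at 1,
        # and enforce monotonicity across the ordered p values.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values for three items.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
flagged = [p < 0.05 for p in adjusted]
print(adjusted, flagged)
```

In this example only the first item remains significant at the .05 level after adjustment, even though all three raw p values are below .05.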
Table 1. DIF Detection Results Based on Different Methods.
Note. Bold values represent nonsignificant
Figure 4. Estimated endorsement probabilities (with standard errors) of Items 2, 20, and 21 for female and male respondents.
The simulation study showed that item quality had a major impact on the performance of DIF detection procedures. The estimated guessing and slip parameters are given in the Online Appendix and the averaged guessing and slip parameter estimates were
Summary and Discussion
In this study, a multiple-group G-DINA model has been developed, which allows us to model item responses from different groups at the same time by accounting for the fact that students in different groups may solve the problems in distinct manners. Based on the MG-GDINA model, this study focuses on procedures for detecting DIF items using the Wald and LR tests. This study modifies the traditional IRT LR-DIF procedure for detecting DIF using the LR test in CDMs. This study also proposed an FS algorithm that can be used in conjunction with the LR and Wald tests for DIF detection.
The simulation study showed that the Type I error rates of all four procedures were relatively well behaved when items were of high or moderate quality, though the Wald-DIF and Wald-FS tended to be conservative when items were of high quality, the sample size was small, and some items exhibited DIF. The LR-DIF and LR-FS could be slightly liberal when the sample size was small. When items were of low quality, all four procedures, in general, yielded inflated Type I error rates, and the FS procedure became particularly important for controlling the inflation of the Type I error for both Wald and LR tests. The LR-FS exhibited better controlled Type I error rates than the Wald-FS when the number of attributes required was 2 or 3, but the Wald-FS method performed slightly better when the number of attributes required was 1. Although none of the procedures outperformed the others consistently, the LR-FS method appears to be a reasonable choice under most conditions.
The Wald test has been well documented to produce inflated Type I error for model comparison (de la Torre & Lee, 2013; Ma et al., 2016) and DIF detection (Hou et al., 2014) when items were of poor quality. Although a different approach to estimating the variance–covariance matrix has been employed in this study, the Wald test still tends to be liberal when items were of poor quality, though the incorporation of the FS procedure could help control the inflation to some degree. In contrast, although the LR test does not involve the estimation of the variance–covariance matrix, it also results in inflated Type I error and false positive rates when item quality was not desirable.
All procedures exhibited relatively low empirical power in detecting DIF. Acceptable levels of empirical power were only noted under favorable conditions (i.e., large DIF magnitude and sample size, fewer attributes required, high item quality, and uniform DIF), though the LR-FS method tended to perform as well as, if not better than, the other methods investigated in terms of the empirical power rates. The simulation study indicates that developing test items of good quality and performing DIF analysis using a relatively large sample are the most important factors.
This study contributes to the literature by developing the multiple-group model and by systematically investigating several DIF detection procedures using the proposed multiple-group model, but it is not without limitations. First, although this study manipulated several important factors, some factors were fixed. In particular, this study only considered DIF detection methods for two groups and assumed an equal sample size for both groups; this study also simulated students’ attribute profiles from a discrete uniform distribution; the structure of the Q-matrix, along with test length and the number of attributes, was also fixed. Researchers may vary some of these factors in future research. In addition, although the findings from this study support the use of the FS procedure, the FS procedure could be time-consuming. This is because the FS procedure usually involves multiple iterations and, at each iteration, the data need to be calibrated multiple times. Future studies may examine whether the proposed FS procedure can be further simplified.
Supplemental Material
Online_Appendix – Supplemental material for Detecting Differential Item Functioning Using Multiple-Group Cognitive Diagnosis Models
Supplemental material, Online_Appendix for Detecting Differential Item Functioning Using Multiple-Group Cognitive Diagnosis Models by Wenchao Ma, Ragip Terzi and Jimmy de la Torre in Applied Psychological Measurement
Footnotes
Acknowledgements
The authors thank Gina Rossi for access to the data used in the Real Data Analysis section.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
