Abstract
The performance of χ2 difference tests based on limited information estimation methods has not been extensively examined for differential functioning, particularly in the context of multidimensional item response theory (MIRT) models. Chi-square tests for detecting differential item functioning (DIF) and global differential item functioning (GDIF) in an MIRT model were conducted using two robust weighted least square estimators: weighted least square with adjusted means and variance (WLSMV) and weighted least square with adjusted means (WLSM), and the results were evaluated in terms of Type I error rates and rejection rates. The present study demonstrated systematic test procedures for detecting different types of GDIF and DIF in multidimensional tests. For the χ2 tests for detecting GDIF, WLSM tended to produce inflated Type I error rates for small sample size conditions, whereas WLSMV appeared to yield lower error rates than the expected value on average. In addition, WLSM produced higher rejection rates than WLSMV. For the χ2 tests for detecting DIF, WLSMV appeared to yield somewhat higher rejection rates than WLSM for all DIF tests except for the omnibus test. The error rates for both estimators were close to the expected value on average.
When a test is used to make comparisons across subpopulations (e.g., gender), it is assumed that the test measures psychological attributes in the same way across all groups being compared, which is often called measurement invariance. When this assumption is not met, the information obtained from such tests could result in biased consequences for some groups of examinees. If the measurement invariance is violated at the test or item level, the test or the item is said to exhibit differential functioning, referred to as differential test functioning (DTF) or differential item functioning (DIF), respectively. Investigating differential functioning is especially critical in assessing group difference on tests or in making selection decisions on high-risk examinations.
Although developments in item response theory (IRT) have provided a useful framework for test practitioners to examine differential functioning in tests or items, such IRT approaches have been mostly developed within the framework of unidimensional IRT models. However, many educational and psychological tests are multidimensional to some degree. Therefore, considerable attention in IRT has been devoted to developing multidimensional item response theory (MIRT) models (e.g., Ackerman, Gierl, & Walker, 2003; Reckase, 2009) and thus to applying MIRT models for studying differential functioning on multidimensional tests (e.g., Bolt & Johnson, 2009; Oshima, Raju, & Flowers, 1997).
Several DIF detection approaches have been proposed for tests measuring multiple dimensions. Stout, Li, Nandakumar, and Bolt (1997) extended the SIBTEST procedure (Shealy & Stout, 1993), a nonparametric approach to studying DIF that does not assume a specific IRT model, and proposed MULTISIB to account for a two-dimensional data structure. Oshima et al. (1997) demonstrated the use of an IRT-based technique for analyzing DIF and DTF in multidimensional data. Before differential functioning analysis, this procedure requires multidimensional linking (Oshima et al., 1997). Another approach for detecting DIF in multidimensional data is to introduce a group-specific DIF parameter in MIRT models, which can be evaluated either by comparing it with a cutoff effect size value (Fukuhara & Kamata, 2011) or by conducting a statistical test (Bolt & Johnson, 2009). Finally, differential functioning on multidimensional tests can be examined by using a χ2 difference test (also known as the likelihood ratio test). One advantage of using the χ2 test is that it does not necessarily require a linking process because the item parameters can be estimated simultaneously across groups with concurrent calibration.
The performance of the χ2 test has been extensively investigated, but most researchers have focused on studying DIF in unidimensional IRT models. In addition, most DIF studies using the χ2 difference test were based on full information maximum likelihood (FIML) estimation, probably because FIML is one of the most common and standard methods for estimating IRT model parameters (Baker & Kim, 2004). An alternative estimation is based on limited information (LI) estimation, which traditionally has been implemented within the framework of factor analysis (FA). For a comparison between FIML and LI and detailed explanations, see Forero and Maydeu-Olivares (2009), for example.
To date, χ2 tests based on LI have not been extensively studied for differential functioning on multidimensional tests. However, most educational and psychological tests are, to some degree, multidimensional, and practitioners may encounter situations in which they have no choice but to rely on the χ2 tests based on LI. The purpose of the present study was to investigate the performance of the χ2 tests obtained from two LI estimators for studying differential functioning in an MIRT model under various simulation conditions. The test statistics were evaluated in terms of Type I error rates and rejection rates for detecting differential functioning at the test and item levels. Hereafter, differential functioning at the test level is referred to as global differential item functioning (GDIF). This study also proposes systematic procedures for detecting GDIF and DIF using two GDIF tests and five DIF tests, which may provide useful information for understanding different types of GDIF and DIF on multidimensional tests.
MIRT Model
A compensatory multidimensional two-parameter model for dichotomous data was used in this study. The model takes the following general form:
P(xi = 1 | θ1, θ2, . . ., θk) = f(ai1θ1 + ai2θ2 + . . . + aikθk + di), (1)
where θ1, θ2, . . ., θk represent k dimensions or abilities; ai1, ai2, . . ., aik are discrimination parameters for the k dimensions; di is an intercept parameter related to the difficulty of item i; and f is a cumulative distribution function (CDF), chosen to be either a normal or logistic CDF. Many IRT models, including MIRT models, can be presented in the FA model framework, and the correspondence between the two models’ parameters has been investigated (e.g., Kamata & Bauer, 2008; Takane & de Leeuw, 1987).
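As an illustration of the compensatory structure described above, the following minimal Python sketch evaluates the response probability for one item. The function names and the item parameters are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Logistic CDF, the alternative choice for f."""
    return 1.0 / (1.0 + math.exp(-z))

def mirt_prob(theta, a, d, cdf=normal_cdf):
    """P(correct) = f(a_1*theta_1 + ... + a_k*theta_k + d) for one item."""
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return cdf(z)

# Hypothetical two-dimensional item (illustrative values only).
a_item, d_item = [1.2, 0.4], 0.0

# An average examinee on both dimensions.
p_average = mirt_prob([0.0, 0.0], a_item, d_item)

# Compensation: a high theta_1 offsets a low theta_2 (1.2*1 + 0.4*(-3) = 0).
p_compensated = mirt_prob([1.0, -3.0], a_item, d_item)
```

Both examinees end up with essentially the same linear predictor, which is what "compensatory" means here: a deficit on one dimension can be made up by a surplus on the other.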
In this article, two robust weighted least square (RWLS) estimators are used as LI methods: weighted least square with adjusted means and variance (WLSMV) and weighted least square with adjusted means (WLSM; Muthén, du Toit, & Spisic, 1997). In fact, the parameter estimates and their standard errors from WLSMV and WLSM are equivalent. However, the χ2 values differ, because WLSM produces a mean-adjusted χ2, whereas WLSMV produces a mean- and variance-adjusted χ2 (Finney & DiStefano, 2006). These RWLS estimators are chosen because they work well with various sample sizes and are computationally fast, especially when the number of dimensions and/or the sample size is large (Flora & Curran, 2004; Muthén et al., 1997). In addition, these estimators are effective and accurate in estimating confirmatory factor analytic (CFA) model parameters with relatively small samples (Millsap & Yun-Tein, 2004) and are recommended as estimation methods for MIRT models (Tate, 2003). In particular, WLSMV is suggested for CFA with categorical indicators (Muthén & Muthén, 2010).
Chi-Square Difference Tests
The χ2 difference tests for detecting differential functioning compare two hierarchically nested models and evaluate whether one model fits better than the other. A statistical significance of the χ2 difference statistic indicates the presence of differential functioning. In a typical DIF study, two groups are considered: a reference group and a focal group. The groups are manifest, such as gender or ethnicity groups. The following section presents two sequential tests for detecting different types of GDIF, and the next section shows five sequential tests for detecting different types of DIF.
Significance Tests for GDIF
Although Equation 1 can be applied to k-dimensional data, the simplest case (k = 2) was considered in this study. In addition, each item was assumed to load on both dimensions. The two GDIF tests considered here were factor-loading invariance and scalar invariance tests, which are often implemented via multigroup CFA models. Lack of factor-loading invariance indicates differences in item discrimination parameters and thus implies nonuniform DIF, whereas lack of scalar invariance represents differences in item intercept parameters (difficulties) and therefore implies uniform DIF. Hereafter, discrimination invariance is used to refer to factor-loading invariance, and intercept invariance is used to refer to scalar invariance. To study GDIF, we considered three nested models with different constraints as suggested in FA models (Meredith, 1993): (a) a configural invariance model, in which all item parameters are free to be estimated across groups under the same factor structure; (b) a weak invariance model, in which only discrimination parameters are constrained to be equal across groups; and (c) a strong invariance model, in which all item parameters are constrained to be equal.
To resolve metric indeterminacy, the means and variances of the two ability dimensions were fixed at 0 and 1, respectively, in both groups for all models. In addition, the correlation between the two dimensions was also fixed at 0 for all models. The a parameter for the first item was fixed at 0 on the second dimension for both groups (a12 = 0) to deal with the model identification problem for the within-item test structure in which all items were loaded on both dimensions. By calculating the difference in the χ2 values between the configural and weak invariance models, we can test whether GDIF is present due to differences in the a parameters across groups (the discrimination invariance test). Furthermore, by comparing the difference in the χ2 values between the weak and strong invariance models, we can test whether GDIF occurs due to differences in the d parameters (the intercept invariance test). Each test statistic approximately follows a χ2 distribution with degrees of freedom (df) equal to the difference in the number of free parameters between the two models compared.
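The df bookkeeping and the reference to a χ2 distribution can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: `chi2_sf` and `difference_test` are hypothetical helper names, and the df counts assume that the only parameters constrained between adjacent models are the discriminations (then the intercepts) of a 40-item, two-dimensional test with a12 fixed at 0. Note that with WLSMV and WLSM the raw χ2 difference must first be adjusted, as described under Data Analysis Methods, before it is referred to the χ2 distribution.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form base cases for df = 1 and df = 2.
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1
    else:
        sf, k = math.exp(-half), 2
    # Recurrence: sf(k + 2) = sf(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        sf += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return sf

def difference_test(stat, ddf, alpha=0.05):
    """Return (p_value, significant) for an (already adjusted) difference statistic."""
    p_value = chi2_sf(stat, ddf)
    return p_value, p_value < alpha

# df under the stated assumptions: 40 items, 2 dimensions, a12 fixed at 0.
n_items, n_dims = 40, 2
df_discrimination = n_items * n_dims - 1  # 79 equality constraints on a parameters
df_intercept = n_items                    # 40 equality constraints on d parameters

# Hypothetical difference statistic, for illustration only.
p_value, significant = difference_test(120.0, df_discrimination)
```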
Significance Tests for DIF
When χ2 difference tests are conducted at the item level, some items are used to set a common metric across groups (i.e., anchor items). As the metric of the item parameter estimates depends on the anchor items, the anchor items are assumed to be DIF-free. The item tested for DIF is referred to as the studied item. By adopting the general procedures described in IRTLRDIF (Thissen, 2001), we proposed a series of DIF tests for dichotomous items fitted with a two-dimensional IRT model. Five sequential DIF tests were of interest. A flowchart describing the five DIF tests is presented in Figure 1. The first test (Test 1) was an omnibus test for detecting DIF in the a and d parameters simultaneously. This test is an “omnibus” test because it could be significant if there is DIF in any single item parameter or any combination of item parameters in the studied item. If Test 1 is nonsignificant, no additional tests are needed (i.e., go to “Stop” in the chart). However, if the test statistic is significant, a test for DIF in two a parameters (Test 2) can be conducted. The statistical significance for Test 2 indicates the presence of nonuniform DIF, which could be due to DIF in a1 or a2 or DIF in both a parameters. Therefore, following the significant result for Test 2, subsequent tests can be performed. For example, a test for DIF in a1 (Test 3) is conducted to see whether the DIF in Test 2 occurs due to a1. Then, Test 4 (either 4-1 or 4-2) can be conducted to examine whether DIF exists in a2. Depending on the path we take, we can end up with “nonuniform DIF in only a2” or “nonuniform DIF in both a parameters.” Note that the order of Tests 3 and 4 is not important in Figure 1. If Test 2 is not significant, we go directly to Test 5 to see whether DIF exists in d. The significance of Test 5 indicates the presence of uniform DIF.

Figure 1. Flowchart for assessing DIF.
The five χ2 test statistics were obtained by comparing χ2 values between two nested models: a compact model (a simpler model) and an augmented model (a more complex model). The constraints used in defining the two models were different across the five tests. The constraints for each test are described in Appendix A in the online version of this article.
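The sequential logic of the five DIF tests can be sketched as a small decision function. This is only an illustration of the path described in the text: `classify_dif` is a hypothetical name, `p_a2` stands for the p value from whichever branch of Test 4 (4-1 or 4-2) was run, and the two "unresolved" outcomes are assumptions added here because the text does not say how the flowchart labels those cases.

```python
def classify_dif(p_omnibus, p_both_a=None, p_a1=None, p_a2=None, p_d=None, alpha=0.05):
    """Walk the sequential DIF tests and return a label for the studied item."""
    # Test 1: omnibus test on a1, a2, and d together.
    if p_omnibus >= alpha:
        return "no DIF"

    # Test 2: both a parameters.
    if p_both_a is None:
        raise ValueError("Test 2 p value is required after a significant Test 1")
    if p_both_a < alpha:
        # Tests 3 and 4: locate the nonuniform DIF in a1 and/or a2.
        if p_a1 is None or p_a2 is None:
            raise ValueError("Tests 3 and 4 p values are required after a significant Test 2")
        a1_sig, a2_sig = p_a1 < alpha, p_a2 < alpha
        if a1_sig and a2_sig:
            return "nonuniform DIF in both a parameters"
        if a1_sig:
            return "nonuniform DIF in only a1"
        if a2_sig:
            return "nonuniform DIF in only a2"
        return "nonuniform DIF, source unresolved"  # assumption, not from the text

    # Test 5: d parameter, reached when Test 2 is nonsignificant.
    if p_d is None:
        raise ValueError("Test 5 p value is required after a nonsignificant Test 2")
    if p_d < alpha:
        return "uniform DIF"
    return "omnibus DIF, type unresolved"  # assumption, not from the text

# Hypothetical p values for one studied item.
label = classify_dif(0.01, p_both_a=0.30, p_d=0.001)
```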
Monte Carlo Study
Simulation Design
Forty items and two sample sizes (n = 500 and 1,000) were used for the reference (R) and focal (F) groups. Three sample size combinations were simulated: R500/F500, R1,000/F1,000, and R1,000/F500. These conditions were selected to resemble values observed in previous DIF studies (S. Kim & Cohen, 1998; Oshima et al., 1997; Suh & Bolt, 2011). θ1 and θ2 were assumed to be uncorrelated for both groups and were generated from a bivariate normal distribution, with each dimension with a mean of 0 and a variance of 1. The simulation conditions for the DIF and GDIF tests are explained in the following sections. Item responses were generated using Equation 1 with a normal CDF. One hundred replications were simulated for each condition using S-Plus (MathSoft, Inc., 1998).
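The data-generating step can be sketched in Python as follows (the study itself used S-Plus). The item parameters below are hypothetical placeholders, since the generating values in Table A1 are only available in the online version; `simulate_group` is likewise an illustrative name.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_group(n_persons, items, rng):
    """Generate 0/1 responses; items is a list of (a1, a2, d) tuples."""
    data = []
    for _ in range(n_persons):
        # Independent standard normal draws give uncorrelated abilities
        # with mean 0 and variance 1 on each dimension.
        theta1 = rng.gauss(0.0, 1.0)
        theta2 = rng.gauss(0.0, 1.0)
        row = []
        for a1, a2, d in items:
            p = normal_cdf(a1 * theta1 + a2 * theta2 + d)
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# Hypothetical item parameters (NOT the values in Table A1).
items = [(1.2, 0.2, 0.0), (0.2, 1.2, -0.5), (0.8, 0.8, 0.5)]

rng = random.Random(2024)                        # fixed seed for reproducibility
reference = simulate_group(1000, items, rng)     # e.g., the R1,000 group
focal = simulate_group(500, items, rng)          # e.g., the F500 group

# Proportion correct on the first item in the reference group.
prop_first_item = sum(row[0] for row in reference) / len(reference)
```

Repeating this for each replication and condition would yield the data sets that are then passed to the estimation software.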
DIF test conditions
To assess the performance of the five DIF tests, three data sets were generated: (a) non-DIF condition, (b) uniform DIF condition, and (c) nonuniform DIF condition. The item parameters used to generate non-DIF data are provided in Table A1 in the online version. Test data for 40 items were generated. The item parameters were selected from the values used in a previous analysis (Reckase, 2009). Three clusters of items were considered: (a) the first 10 items were highly loaded on θ1, (b) the next 10 items were highly loaded on θ2, and (c) the last 20 items were almost equally loaded on both θ parameters. Among the 40 items, only 9 studied items (three items from each cluster) were examined in the non-DIF condition: Items 5, 7, 8, 11, 14, 18, 25, 30, and 39. These items were selected to represent different patterns in the a parameters and different levels of the d parameter (small, medium, and large). When each of the nine items was tested for DIF, the remaining 39 items were used as anchor items.
Four DIF conditions were considered: low and high levels of uniform DIF and low and high levels of nonuniform DIF conditions. Four DIF items were simulated under each condition. Items 1 to 36 in Table A1 were used as anchor items, and the item parameters for the last four items (37-40) were simulated to show different types of DIF. Table A2 in the online version presents the item parameters for the four studied items under each condition. For the uniform DIF condition, the item parameters for Items 37 to 39 in the R group were selected to reflect the means of the three clusters in Table A1. The item parameters for Item 40 were selected to represent a different multidimensional pattern, similar to the values used in a previous study (Oshima et al., 1997). The d parameters for the four items were increased by 0.5 for the F group, which represented a low level of uniform DIF magnitude. A high level of DIF was simulated by a 1.0 difference in the d parameter. For the nonuniform DIF condition, a low level was introduced by a 0.3 increase in one a or both a parameters for the F group. For Items 37 to 39, the d for the F group was also raised by 0.5, representing DIF in a and d, whereas for Item 40, the same d parameter was used for the two groups, representing DIF in the a parameter only. A high level of nonuniform DIF was simulated by a 0.6 difference in the a parameters. The 0.3 difference in the a parameter as well as the 0.5 and 1.0 differences in the d parameter were chosen to coincide with other DIF studies (e.g., S. Kim & Cohen, 1992; Oshima et al., 1997; Suh & Bolt, 2011). The 0.6 difference in the a parameter was selected to examine the effect of larger DIF in the a parameter.
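To make the distinction between the two DIF types concrete, the sketch below applies the low-level shifts described above (0.5 in d, 0.3 in a1) to a hypothetical reference-group item and compares the two groups' response probabilities along θ1. The reference item and the helper names are assumptions for illustration; the actual values are in Table A2 of the online version.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob(theta1, theta2, item):
    """Response probability for an item given as (a1, a2, d)."""
    a1, a2, d = item
    return normal_cdf(a1 * theta1 + a2 * theta2 + d)

def add_dif(item, delta_a1=0.0, delta_a2=0.0, delta_d=0.0):
    """Return focal-group parameters obtained by shifting the reference item."""
    a1, a2, d = item
    return (a1 + delta_a1, a2 + delta_a2, d + delta_d)

reference_item = (1.0, 0.5, 0.0)                       # hypothetical values
focal_uniform = add_dif(reference_item, delta_d=0.5)   # low uniform DIF
focal_nonuniform = add_dif(reference_item, delta_a1=0.3)  # low DIF in a1 only

# Focal minus reference probability along theta1, with theta2 held at 0.
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
uniform_gap = [prob(t, 0.0, focal_uniform) - prob(t, 0.0, reference_item) for t in grid]
nonuniform_gap = [prob(t, 0.0, focal_nonuniform) - prob(t, 0.0, reference_item) for t in grid]
```

Under these assumed values, the uniform gap keeps the same sign at every ability level, whereas the nonuniform gap changes sign across θ1, which is why the two forms of DIF are tested through different parameters.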
GDIF test conditions
To investigate the performance of the two GDIF tests, three data sets were generated: (a) non-GDIF condition, (b) unidirectional GDIF condition, and (c) balanced-directional GDIF condition. In the unidirectional condition, all DIF items were introduced against the F group, whereas in the balanced-directional GDIF condition, one half of the DIF items functioned differently against the R group, and the other half functioned differently against the F group (see Oshima et al., 1997). The balanced conditions were included to examine if a cancellation effect existed at the test level.
The three GDIF simulation conditions are shown in Table A3 of the online version. A 40-item test was analyzed as one measure for each GDIF condition. The non-GDIF conditions were generated by using the same item parameters in Table A1 across groups. For each of the four unidirectional GDIF conditions (low and high uniform GDIF and low and high nonuniform GDIF conditions), the four DIF items under the corresponding DIF condition in Table A2 were analyzed with the 36 anchor items in Table A1. The four DIF conditions in Table A2 were the unidirectional GDIF conditions, because all DIF items were introduced against the F group. Table A3c shows the item parameters used to generate the balanced-directional GDIF conditions. Two types of balanced-directional GDIF conditions were considered: uniform and nonuniform. In the balanced-directional uniform GDIF condition, Items 37 and 39 favored the R group, whereas Items 38 and 40 favored the F group to the same degree. In the balanced-directional nonuniform GDIF condition, two types of nonuniform DIF were included: equally weighted a parameters and unequally weighted a parameters for the two dimensions. In both cases, DIF in the d parameter was introduced simultaneously with DIF in the a parameters due to the frequent cooccurrence of these forms of DIF (Suh & Bolt, 2011).
Ten percent GDIF (four DIF items out of 40 items) conditions were considered for the unidirectional GDIF and balanced-directional GDIF conditions. The 10% condition was used to coincide with other studies (e.g., S. Kim & Cohen, 1992; Oshima et al., 1997). To examine the effect of a larger DIF percentage on the performance of the two GDIF tests, a 30% condition was included in the balanced-directional GDIF conditions. In the 30% conditions, the same parameters in Table A3c were used three times to create 12 DIF items to control for the effect of different types of DIF items introduced in the test. In this case, Items 1 to 28 in Table A1 were used as anchor items.
Table 1. Proportions of Significant Chi-Square Difference Test Statistics for Detecting GDIF.
Note. The DIF items included in the balanced-directional conditions show low levels of DIF in both parameters. GDIF = global differential item functioning; WLSMV = weighted least square with adjusted means and variance; WLSM = weighted least square with adjusted means; DIF = differential item functioning.
Data Analysis Methods
For each data set, the χ2 test statistics for GDIF and DIF were obtained from WLSMV and WLSM estimators in Mplus 6 (Muthén & Muthén, 2010). Both estimators were based on a probit link function. When WLSMV and WLSM are used, the χ2 values should be adjusted because the difference in the χ2 values is not distributed as a χ2 (Muthén & Muthén, 2010). Therefore, two adjusted χ2 tests were performed. For WLSMV, the DIFFTEST option in Mplus (for more details, see Asparouhov & Muthén, 2006) was used, whereas for WLSM, a scaled χ2 difference test (Satorra & Bentler, 2001) was calculated by using scaling correction values from the Mplus output.
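The scaled difference computation for WLSM can be sketched as below. The formula is the commonly documented Satorra-Bentler scaled difference; the sketch assumes that the reported χ2 values are the mean-adjusted (scaled) statistics, so that multiplying each by its scaling correction factor recovers the unscaled value. The function name and the input numbers are hypothetical.

```python
def scaled_difference(t0, d0, c0, t1, d1, c1):
    """Satorra-Bentler (2001) scaled chi-square difference.

    t0, d0, c0: scaled chi-square, df, and scaling correction factor of the
                nested (more constrained, compact) model.
    t1, d1, c1: the same quantities for the comparison (augmented) model.
    Returns (scaled difference statistic, difference in df).
    """
    if d0 <= d1:
        raise ValueError("the nested model must have more df than the comparison model")
    # Scaling correction for the difference test.
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction; the test is not defined")
    # Difference of the unscaled statistics, rescaled by cd.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, d0 - d1

# Hypothetical output values, for illustration only.
trd, ddf = scaled_difference(t0=120.0, d0=80, c0=1.2, t1=100.0, d1=70, c1=1.1)
```

The resulting statistic is then referred to a χ2 distribution with `ddf` degrees of freedom. When both correction factors equal 1, the result reduces to the ordinary difference between the two χ2 values.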
Results
Because the performance of the χ2 test statistic largely depends on recovering item parameters, a parameter recovery study was conducted. The item parameters were accurately estimated. The detailed procedure and results are provided in Appendix B in the online version of this article.
Type I Error and Rejection Rates for the GDIF Tests
Table 1 presents the proportions of significant χ2 test statistics for detecting GDIF at α = .05. The two columns, “Test a s” and “Test d s,” show the results of the discrimination invariance and intercept invariance tests. As noted, the tests for the a parameters preceded the tests for the d parameters as a sequential procedure for detecting GDIF. Thus, the proportions for the tests for the d parameters were calculated based on the results from the tests for the a parameters, not based on 100 replications. Table 1, panel a, shows the Type I error rates. For the tests for the a parameters with WLSMV, the error rates for R500/F500 were equal to the expected value (.05), whereas the other two sample size conditions yielded error rates lower than .05. For WLSM, the R1,000/F1,000 condition showed a lower value than .05, whereas the other sample size conditions produced inflated error rates. The error rates from WLSM were higher than those from WLSMV on average. For the tests for the d parameters, the error rates were zero for both estimators across all conditions.
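The way the sequential proportions are formed can be sketched as follows. This sketch assumes, consistent with the later description of the unidirectional results, that the test for the d parameters is evaluated only on replications that the test for the a parameters did not reject. The function name and the rejection indicators are hypothetical.

```python
def sequential_rates(reject_a, reject_d):
    """Return (rate for the a test, conditional rate for the d test).

    reject_a: one boolean per replication for the test of the a parameters.
    reject_d: outcome of the d test per replication; entries for replications
              already rejected by the a test are ignored and may be None.
    """
    n = len(reject_a)
    rate_a = sum(reject_a) / n
    # Only replications not rejected by the a test move on to the d test.
    followed = [rd for ra, rd in zip(reject_a, reject_d) if not ra]
    rate_d = sum(followed) / len(followed) if followed else None
    return rate_a, rate_d

# Hypothetical outcomes for four replications.
rate_a, rate_d = sequential_rates(
    [True, False, False, False],
    [None, True, True, False],
)
```

Because the denominator of the second rate is the number of replications that reached the second test, it can differ from the total number of replications and can be very small when the first test rejects almost always.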
In the uniform GDIF condition with the unidirectional GDIF design (Table 1, panel b), the results from the tests for the a parameters indicate Type I error rates because uniform DIF was simulated for the last four items, whereas those from the tests for the d parameters represent rejection rates. For the tests for the a parameters, the error rates from WLSM were higher than those from WLSMV on average. For the tests for the d parameters, both estimators detected all replications that had not been rejected by the tests for the a parameters. In the nonuniform GDIF condition with the unidirectional design, all columns indicate rejection rates. In general, WLSM showed larger rejection rates than WLSMV. As the magnitude of nonuniform GDIF increased, the rejection rate of the tests for the a parameters increased. Only the magnitude of DIF in the a parameters was varied; the amount of GDIF introduced in the d parameter was equal across the two nonuniform conditions. In this regard, one interesting pattern was found: the rejection rates of the tests for the d parameters in the low nonuniform condition were higher than those in the high nonuniform condition. This is probably attributable to the fact that uniform GDIF can be weakened when the slopes (the a parameters) of the items differ substantially between the two groups. This tendency seemed to diminish as the sample size increased.
In the 10% uniform GDIF condition with the balanced-directional design (Table 1, panel c), the results were similar to the case in the low uniform GDIF with the unidirectional design. This indicates that there was no clear cancellation effect of using the balanced-directional design as opposed to the unidirectional design. Likewise, no cancellation effect was observed in the 10% nonuniform DIF condition with the balanced design relative to the low nonuniform DIF condition with the unidirectional design. In fact, the balanced-directional design yielded higher rejection rates than the unidirectional design. This can be attributed in part to the different types of DIF items included in each design, not to the different design effects. The Type I error rates for the tests for the a parameters in the uniform GDIF with the balanced-directional design did not appear to be substantially affected by the percentages of DIF, whereas the rejection rates of the tests for the a parameters in the nonuniform DIF conditions increased as the percentage of DIF changed from 10% to 30%. For the tests for the d parameters, both estimators showed high rejection rates. The Type I error and rejection rates from WLSM were slightly higher than those from WLSMV.
Regarding the effect of sample sizes on rejection rates, in general, the rejection rates increased as the sample size increased except for the low nonuniform GDIF condition with the unidirectional design, where the lowest rejection rates were observed for the tests for the a parameters in the large sample size (R1,000/F1,000) condition. However, the rejection rates increased in the corresponding balanced-directional design (the nonuniform GDIF with 10% condition). Again, given the different nonuniform DIF items included in the two conditions, this exception can result from the different types of DIF items included in each condition. Indeed, based on a careful examination of the results in Table 3, Item 40 in the low nonuniform DIF showed this pattern more clearly than other items. Item 40 was the only item that showed DIF in the a parameter but not in the d parameter. Because the item parameters for Item 40 were simulated in the unidirectional design, not in the balanced-directional design, this item may contribute to the pattern shown in Table 1.
Table 1 (panels b and c) also presents the proportions of the correctly identified GDIF data sets among the 100 replications by using the two GDIF tests sequentially. The proportions in the uniform GDIF with the unidirectional design ranged from .92 to 1.00. The two estimators showed a similar accuracy. The proportions in the nonuniform GDIF with the unidirectional design varied a lot, ranging from .16 to 1.00. WLSM showed better accuracy than WLSMV. For the proportions in the balanced-directional design, WLSMV showed better accuracy than WLSM in the uniform GDIF conditions, whereas the opposite result was observed in the nonuniform GDIF conditions on average.
Type I Error and Rejection Rates for the DIF Tests
Table A4 in the online version shows the Type I error rates of the omnibus test (Test 1 in Figure 1). As explained earlier, nine studied items were tested for DIF at α = .05. The different types of DIF items showed no clear pattern. However, the sample size appeared to affect the Type I error rates. The proportions of significant test statistics for the R500/F500, R1,000/F500, and R1,000/F1,000 conditions for WLSMV were .05, .07, and .03, respectively, on average, across all studied items, and the proportions for WLSM were .06, .08, and .03, respectively.
Table 2 shows the proportions of the significant χ2 statistics under the uniform DIF conditions. Because DIF was introduced in the d parameter, following the test procedures described in Figure 1, we conducted three DIF tests (omnibus test, test for the a1 and a2 parameters, and test for the d parameter) at α = .05. As in Table 1, the proportions of each subsequent test (the test for the a1 and a2 parameters and the test for the d parameter) were calculated based on the results of the preceding tests conducted following the DIF test procedure, not based on the 100 replications. Table 2 also shows the proportions of the correctly identified DIF among the 100 replications using the three sequential tests.
Table 2. Proportions of Significant Chi-Square Difference Test Statistics for Detecting DIF Under the Uniform DIF Condition.
Note. DIF = differential item functioning; WLSMV = weighted least square with adjusted means and variance; WLSM = weighted least square with adjusted means.
The rejection rates for the omnibus test increased as the magnitude of the uniform DIF increased. In the low uniform DIF conditions, the rejection rates for the omnibus tests increased as the sample size increased, and WLSM tended to yield slightly higher rejection rates than WLSMV. Regarding the different types of studied items, Item 39 (which was equally loaded on both dimensions) exhibited the highest rejection rates on average across the estimators. The rejection rates in the high uniform conditions were all equal to 1.00. The test for the a1 and a2 parameters produced Type I error rates higher than the expected value on average for both estimators in the low uniform DIF condition, whereas they yielded error rates close to the expected value in the high uniform condition. Again, Item 39 yielded the largest error rates among the four studied items. The R1,000/F1,000 condition tended to produce smaller error rates than the other conditions. The test for the d parameter detected the items containing DIF perfectly regardless of the magnitude of DIF. The proportions of the correctly identified uniform DIF items ranged from .28 to .92 in the low uniform DIF condition. The proportions tended to increase as the sample size increased. WLSM showed slightly better accuracy than WLSMV. The proportions in the high uniform DIF condition ranged from .90 to .99, and the two estimators showed a similar accuracy.
Table 3 displays the proportions of the significant χ2 test statistics for the five DIF tests under the nonuniform DIF conditions at α = .05. Because detecting nonuniform DIF was of interest, four DIF tests (omnibus test, test for the a1 and a2 parameters, test for the a1 parameter, and test for the a2 parameter) were conducted. However, we also provided the results of the test for the d parameter, because one might be additionally interested in whether DIF exists in the d parameter or not. The proportions of each subsequent test (test for the a1 and a2 parameters) were calculated based on the results of the preceding tests conducted following the sequential procedure in Figure 1, not based on the 100 replications. The proportion for Test d was calculated based on the results of the omnibus test. Table 3 also shows the proportions of the correctly identified nonuniform DIF items among the 100 replications by using the four sequential tests.
Table 3. Proportions of Significant Chi-Square Difference Test Statistics for Detecting DIF Under the Nonuniform DIF Condition.
Note. The numbers in the shaded cells indicate Type I error rates, whereas other numbers indicate rejection rates. DIF = differential item functioning; WLSMV = weighted least square with adjusted means and variance; WLSM = weighted least square with adjusted means.
Depending on the simulated studied items, some tests showed Type I error rates, and other tests showed rejection rates. The numbers in the shaded cells in Table 3 indicate Type I error rates, and the other numbers indicate rejection rates. As expected, the rejection rates increased as the magnitude of DIF in the a parameter increased. Indeed, all five tests detected almost all high nonuniform DIF items. Both estimators showed similar performances. Type I errors ranged from .01 to .07 on average, and WLSM produced smaller error rates than WLSMV. In the low nonuniform DIF condition, except for the omnibus tests, WLSMV tended to produce slightly higher rejection rates than WLSM. However, as the magnitude of DIF increased, the discrepancy tended to diminish.
Regarding the different types of studied items, Item 38 showed the highest rejection rates in the omnibus test among the four studied items. Item 38 was equally weighted on both dimensions for both groups, and DIF was introduced in both a parameters along with DIF in the d parameter. Item 40 showed the lowest rejection rates on the omnibus test, which was expected because this item exhibited DIF in the a2 parameter only without having DIF in the d parameter. However, this item showed the highest rejection rates in the test for the a1 and a2 parameters as well as the test for the a2 parameter. Comparing the results of the test for the a1 parameter, Item 37 appeared to produce higher rejection rates than Item 38. For both items, the same amount of DIF in the a1 parameter was introduced, but Item 38 also contained DIF in the a2 parameter. Thus, for Item 38, DIF in the a1 parameter might have been masked by DIF in the a2 parameter. For the test for the d parameter, no apparent pattern was associated with the types of items, and Items 37 to 39 showed good rejection rates. However, as shown in Table 1, panel b, the high nonuniform DIF condition tended to produce slightly lower rejection rates for the test for the d parameter than the low nonuniform condition, especially when Item 38 was analyzed for the R500/F500 condition. The rejection rates varied depending on the sample size, but no apparent pattern was observed. The average proportions of the correctly identified nonuniform DIF items for WLSMV and WLSM were .64 and .62, respectively, in the low nonuniform DIF condition, whereas the averages were .94 and .95, respectively, in the high nonuniform DIF condition.
Discussion and Conclusion
The results from the present study provide implications for studying GDIF and/or DIF in multidimensional tests using two estimators: WLSMV and WLSM. First, for the two tests for GDIF, WLSM produced higher Type I error rates and rejection rates than WLSMV on average across all conditions. Because failing to detect tests with GDIF can be a more serious error than falsely detecting tests without GDIF in real testing applications, WLSM is preferred in studying GDIF based on the results from this study. In addition, WLSM provided higher proportions of correctly identified GDIF than did WLSMV. However, Type I error rates affect rejection rates: an estimator with inflated error rates will also tend to show inflated rejection rates. Therefore, investigating an empirical χ2 distribution in addition to the theoretical distribution for testing the null hypotheses with the use of WLSM would be worthwhile. For the five tests for DIF, WLSMV yielded slightly higher rejection rates than did WLSM for all tests but the omnibus test, in which WLSM exhibited higher rejection rates than did WLSMV. Although WLSMV tended to show slightly higher error rates than did WLSM for some conditions, the average errors were close to the expected value. The proportions of correctly identified DIF were very similar between the two estimators. Based on these results, if one has to choose one of the two estimators for studying DIF using only the omnibus test, WLSM would be a better choice. However, WLSMV is recommended when all five tests, including the omnibus test, are of interest, because WLSMV outperformed WLSM on every DIF test except the omnibus test.
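The suggestion to consult an empirical χ2 distribution can be illustrated with a minimal sketch. It assumes a test with 1 degree of freedom and mimics an inflated test statistic by multiplying simulated χ2 values by a hypothetical factor of 1.3; neither the factor nor the number of replications comes from this study. The empirical critical value is taken from the upper tail of statistics generated under the null hypothesis, so that rejection rates can be judged against a cutoff that holds the nominal error rate.

```python
import random
from statistics import NormalDist

def theoretical_critical_df1(alpha=0.05):
    """Upper-alpha critical value of a chi-square with 1 df.

    For 1 df the chi-square variate is a squared standard normal,
    so the cutoff is the square of the two-sided z critical value.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z

def empirical_critical(null_stats, alpha=0.05):
    """Order statistic near the (1 - alpha) quantile of null-generated statistics."""
    ordered = sorted(null_stats)
    k = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[k]

def rejection_rate(stats, critical):
    """Proportion of statistics strictly exceeding the critical value."""
    return sum(s > critical for s in stats) / len(stats)

rng = random.Random(2024)
inflation = 1.3  # hypothetical inflation of the test statistic, for illustration only
# Statistics simulated under the null: inflated squared standard normals (1 df).
null_stats = [inflation * rng.gauss(0, 1) ** 2 for _ in range(5000)]

theo = theoretical_critical_df1()
emp = empirical_critical(null_stats)

print(f"theoretical cutoff: {theo:.3f}")
print(f"empirical cutoff:   {emp:.3f}")
print(f"Type I error with theoretical cutoff: {rejection_rate(null_stats, theo):.3f}")
print(f"Type I error with empirical cutoff:   {rejection_rate(null_stats, emp):.3f}")
```

In an actual application the null statistics would come from fitting the models to data generated without GDIF or DIF, and the resulting empirical cutoff would then be applied to the statistics from the GDIF or DIF conditions.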
Second, although the Type I error rates tended to be smaller in the condition of the large sample size, the effect of the sample size on the rejection rates varied depending on the types of tests and GDIF/DIF conditions. In general, the rejection rates increased as the sample size increased except for the low nonuniform GDIF and DIF conditions. As explained in the Results section, this exception might be attributed to the characteristics of Item 40 included in these conditions. However, because the type of DIF in Item 40 (i.e., showing DIF in the a parameter without DIF in the d parameter) tends to occur less frequently in practice, the rejection rates seem to increase as the sample size increases, consistent with prior DIF research (e.g., E. Kim & Yoon, 2011; Suh & Bolt, 2011).
Third, although the error rates did not seem to be affected by the magnitude of DIF, the rejection rates appeared to be affected by the magnitude of DIF (low vs. high), as expected. There were several interesting findings regarding the results of the rejection rates for the tests on the intercept parameter. One finding was that the intercept invariance test (the GDIF test) and the test for the d parameter (the DIF test) with both estimators detected almost all uniform GDIF and DIF items, even when the magnitude of DIF was low. This implies that the 0.5 difference in the d parameter between the two groups is either a moderate or a large GDIF (or DIF) effect size. Alternatively, the sample sizes may have affected the rejection rates of the χ2 tests. Because the statistical power of the χ2 tests is sensitive to sample sizes, studying effect size measures of DIF magnitude can help facilitate practical interpretations of the DIF test results. Thus, it would be valuable to consider effect size measures in using MIRT models and examine how they perform under the simulation conditions considered in this study, compared with the χ2 tests. To date, few studies have been conducted on developing effect size measures for MIRT models, especially at the item level (e.g., Bryant, Williamson, Wooten, & Forde, 2004). Therefore, investigating effect size measures for MIRT models should be pursued as a future study. It would also be interesting to examine the performance of alternative fit indices (e.g., the root mean square error of approximation [RMSEA] and the comparative fit index [CFI]) in detecting GDIF and DIF (e.g., E. Kim & Yoon, 2011). Another finding was that when DIF in the a parameter was present along with DIF in the d parameter in the unidirectional design, the intercept invariance test (GDIF test) was not very effective compared with when DIF was present in the d parameter only. This tendency was more severe when small samples were analyzed with WLSMV in the high nonuniform DIF condition. Therefore, as the co-occurrence of DIF in both parameters is common in practice, WLSM would again be preferred for studying GDIF, especially when small sample sizes are analyzed and high levels of nonuniform DIF are expected on the test.
Finally, the balanced-directional design did not seem to affect the rejection rates for either invariance test compared with the unidirectional design. The two GDIF tests were not sensitive to the different directions of the GDIF conditions. However, the DIF percentage (10% vs. 30%) appeared to affect the rejection rates of the discrimination invariance test. The effect of the DIF percentage on the intercept invariance test could not be effectively examined, because the 0.5 difference was large enough to be detected under almost all conditions. Thus, it would be interesting to investigate the rejection rates under smaller uniform DIF conditions (e.g., 0.25 difference in the d parameter).
The findings are limited in several respects, as this was a simulation study. One limitation is that only two RWLS estimators were considered. Comparing other possible approaches (e.g., the differential functioning of items and test [DFIT] and MULTISIB) with the RWLS estimators considered in this study would be interesting. Furthermore, the impact of more realistic simulation conditions, such as distributional differences between two groups, should be investigated. In this study, the mean and standard deviation of the latent variables were fixed at 0 and 1, respectively, for the reasons explained in footnote 4. Therefore, the same distributions were generated for both groups. Future research is needed to examine whether the approaches considered in this study can be applied when the distributions are unequal in the two groups. Another concern is that because identifying the “correct” number of dimensions is not always simple and easy in real data analysis (Reckase, 2009), investigating the effect of model misfit or misspecification of the number of dimensions on the performance of the χ2 difference tests is important. Finally, the relationship between the items and the dimensionality of the test was established within the CFA framework in this study. This assumption, however, often does not hold in practice, as the structure may be unknown. In such cases, exploratory FA can be carried out as an option for searching for a better-fitting measurement model. It is therefore left for a future study to examine how differently and accurately the two approaches can detect differential functioning.
The present study focused on comparing the performance of the χ2 difference tests based on WLSMV and WLSM, for detecting GDIF and DIF in an MIRT model through a Monte Carlo study, which has not been comprehensively examined in the literature. Therefore, the results and implications from this study provide test practitioners with useful information concerning the relative value of choosing one estimator over the other in studying GDIF and/or DIF, especially when one has no choice but to rely on the χ2 test based on LI methods. Furthermore, this study demonstrated systematic procedures using two GDIF and five DIF tests, which can provide a practical guideline for studying different types of GDIF and DIF and subsequently revising a test or items.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
