Abstract
Researchers are commonly interested in group comparisons such as comparisons of group means, called impact, or comparisons of individual scores across groups. A meaningful comparison can be made between the groups when there is no differential item functioning (DIF) or differential test functioning (DTF). During the past three decades, much progress has been made in detecting DIF and DTF. However, little research has been conducted on what researchers can do after such detection. This study presents and evaluates a confirmatory multigroup multidimensional item response model to obtain the purified item parameter estimates, person scores, and impact estimates on the primary dimension, controlling for the secondary dimension due to DIF. In addition, the item response model approach was compared with current practices of DIF treatment such as deleting and ignoring DIF items and using multigroup item response models through simulation studies. The authors suggested guidelines for DIF treatment based on the simulation study results.
Introduction
Many psychological studies involve group comparisons such as cultural, ethnic, gender, or treatment group comparisons. The group comparisons can be made between group means or individual scores across groups. To make meaningful group comparisons, the measurement of a construct or set of constructs is assumed to be equivalent across the groups, which is called measurement invariance (e.g., Meredith & Millsap, 1992). Measurement invariance implies that the distribution of the test score, conditional on a given value of the construct or the latent variable, is invariant across groups. If measurement invariance does not hold at the test level or at the item level, the test or the item is said to present differential test functioning (DTF) or differential item functioning (DIF), respectively, in the item response theory (IRT) approach.
When a test is intended to be unidimensional, DTF or DIF can be viewed as the consequence of failing to account for one or more secondary dimensions that are not explained by the primary dimension to be measured (e.g., Ackerman, 1992; Bolt & Stout, 1996; Shealy & Stout, 1993). Specifically, Shealy and Stout (1993) pointed out that DTF occurs when there are secondary dimension(s) and the reference and focal groups do not have equal distributions on the secondary dimension(s) (e.g., the two groups have different means).
Procedures for detecting DIF items are by now well established in psychometric research using item response models (see Millsap & Everson, 1993, for an overview of DIF detection methods). However, little attention has been paid to how one can treat DIF items for valid group comparisons. In the item development stage, a larger number of items than is needed is created, and the detected DIF items can be revised or removed. When researchers cannot get involved in item development and have to use developed tests for their own research purposes, they can rarely revise detected DIF items and then recollect data with the revised items. Furthermore, deleting DIF items in the developed tests may lower test reliability and content validity.
Lord (1980) cited a solution (suggested by Gary Marco) to purify a test by deleting DIF items and then scoring only based on non-DIF items. When a large portion of items in a test (e.g., above 50%) are detected as DIF items, using a separate scale for each group is often recommended (Bolt, Hare, Vitale, & Newman, 2004). However, when a smaller portion of test items (e.g., less than 30% of test items) is known to exhibit DIF, an IRT purification method can be used to estimate the item parameters, person scores, and group mean difference (called impact hereafter) on the primary dimension. De Boeck, Cho, and Wilson (2011) presented a secondary dimension modeling approach to obtain purified item parameter estimates using a confirmatory mixture multidimensional item response model when DIF items are known and groups of interest are unknown (i.e., latent classes or mixtures).
The purpose of this article is to present an IRT purification method for item calibration and scoring in the presence of DIF after a subset of items is detected as DIF items and to compare the performance of the method with that of other current DIF item treatments. The importance of DIF item treatment may differ depending on the purposes of the test used (e.g., Borsboom, 2006). The authors of the current study consider the use of test scores to detect impact and individual differences in the construct being measured. Unlike De Boeck et al. (2011), this article focuses on manifest groups such as gender and ethnicity (instead of latent classes) using a confirmatory multigroup multidimensional item response model and evaluates the secondary dimension modeling approach to obtain purified IRT item parameter estimates and person scores via a simulation study. The multigroup multidimensional item response model has been used in DIF detection contexts (e.g., Oshima, Raju, & Flowers, 1997). The novel aspect of the model as presented in the current study is the specification of a dimension structure that models a secondary dimension due to DIF. The specified methods can easily be extended to include a number of primary and secondary dimensions; however, for the sake of simplicity, this article focuses on the two-dimensional model, with one primary dimension and one secondary dimension.
The article is organized as follows. First, survey results about the practices of DIF treatment are presented. Subsequently, the purification method for item calibration and scoring in the presence of DIF items using a confirmatory multigroup multidimensional item response model is described. In addition, other DIF item treatments are described using item response models, based on the survey results of current practice for DIF items. Next, a simulation study is presented that evaluates the proposed method in comparison with current practices for treating DIF items. Finally, the article concludes with a summary and discussion.
Examples of DIF Item Treatment
To report how researchers treat DIF items in practice, 27 articles published in five American Psychological Association journals were reviewed. For review results and details, see Table 1 in Online Appendix A. Five distinct practices for dealing with DIF items were observed: (a) delete DIF items (30%), (b) no further action (33%; i.e., a specific DIF treatment was not mentioned), (c) ignore DIF items (26%; i.e., all items including DIF items were calibrated), (d) calibrate items for each group (7%; i.e., multigroup analysis), and (e) model DIF (4%). Only one article, Nye and Drasgow (2011), used a modeling DIF approach. However, they did not model a secondary dimension separate from the primary dimension, implying that the group difference and individual scores in their model may not be meaningful for group comparisons.
Below, the modeling DIF approach is presented with a two-parameter confirmatory multigroup multidimensional item response model. For comparison with the modeling DIF approach, two-parameter unidimensional item response models for deleting DIF, ignoring DIF, and multigroup analysis are shown in Online Appendix B. The multigroup approach (Bock & Zimowski, 1997) allows for separate item parameter estimates for each group for the DIF items, and it links the estimates between groups to a common latent metric through the non-DIF items.
Modeling DIF Using a Multigroup Multidimensional Item Response Model
In the modeling DIF approach, it is assumed that DIF items are known after DIF detection methods are used. In addition, it is assumed that there are shifts with the DIF magnitudes on item parameters for the items suspected of DIF for a focal group. A secondary dimension is modeled to explain individual differences in endorsement probabilities for the focal group and DIF items (see the section “DIF items and multidimensionality” in Online Appendix C for details). The logic of the method is to estimate item parameters, person scores, and impact on a primary dimension from all persons and items, while controlling for the secondary dimension from persons in the focal group and for DIF items. That is, the reference group has one dimension (the primary dimension), whereas the focal group has two dimensions (the primary and secondary dimensions).
A confirmatory multiple-group multidimensional item response model can be described as follows:
where
For the reference group (
For the focal group (
In the reference group, all items load on only the primary dimension. In the focal group, all items load on the primary dimension, and only DIF items load on the secondary dimension.
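The loading structure just described can be sketched as follows. The notation is an assumption made for illustration and is not a reproduction of the article's Equation 1: y_{jig} is the binary response of person j in group g to item i, θ_{j1g} and θ_{j2F} are person scores on the primary and secondary dimensions, α_{i1} and α_{i2} are the corresponding item discriminations, β_i is the item location written in intercept form, and d_i equals 1 for a DIF item and 0 otherwise.

```latex
\begin{align*}
\text{Reference group } (g = R):\quad
  \operatorname{logit} P(y_{jiR} = 1 \mid \theta_{j1R})
  &= \alpha_{i1}\,\theta_{j1R} - \beta_{i}, \\
\text{Focal group } (g = F):\quad
  \operatorname{logit} P(y_{jiF} = 1 \mid \theta_{j1F}, \theta_{j2F})
  &= \alpha_{i1}\,\theta_{j1F} + d_{i}\,\alpha_{i2}\,\theta_{j2F} - \beta_{i}.
\end{align*}
```

In this sketch, α_{i1} and β_i carry no group subscript, which corresponds to holding the primary-dimension item parameters equal across groups, and the term d_i α_{i2} θ_{j2F} is what absorbs the group-specific behavior of the DIF items in the focal group.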
To identify the model, the following constraints are imposed:
Each item discrimination parameter for the primary dimension (
Comparisons Among Four DIF Treatment Practices
Online Appendix B shows the summary of the four DIF treatment approaches, as specified in the earlier section. An example was provided in Online Appendix E to illustrate the four DIF treatment approaches, deleting, ignoring, multigroup, and modeling. In this section, the advantages and disadvantages of each DIF treatment approach in terms of estimating the item parameters for all items (
In the deleting DIF approach, only non-DIF items can be calibrated. This lowers test reliability and content validity, especially when the DIF magnitude is high and the number of DIF items is large. The impact parameter cannot be estimated simultaneously with the item parameters, unless it is calculated based on the person scores as a subsequent analysis.
In the DIF study literature, the degree of DIF is mainly characterized with respect to DIF magnitudes and the number of DIF items (e.g., Kim & Cohen, 1992; Oshima et al., 1997). Thus, the ignoring DIF approach may not be problematic when the DIF magnitude is low and the number of DIF items is small. However, in the presence of non-ignorable DIF (e.g., high DIF magnitudes or a large number of DIF items), item parameter estimates and person scores can be biased. As in the deleting DIF approach, the impact parameter cannot be estimated simultaneously with the item parameters.
In the multigroup DIF approach, because item parameters without DIF magnitudes are estimated only with the reference group for DIF items, the standard errors of item parameter estimates for DIF items can be larger than those obtained with both the reference and the focal groups (as in the two-step multigroup DIF approach and in the modeling DIF approach). In the presence of DIF, the impact and person scores from the one-step multigroup DIF approach are from different dimensions between the two groups. As explained earlier, the reference group has the primary dimension, and the focal group has the primary and the secondary dimension. Thus, in the one-step multigroup DIF approach, the impact is not meaningful, and the person scores cannot be compared on the same scale. These limitations of the one-step approach can be overcome with a two-step multigroup DIF approach, where the first step is to have the same item parameter estimates between the two groups (using item parameter estimates from the reference group) and the second step is to obtain the impact and person scores. However, this requires an additional step (compared with the modeling DIF approach). In addition, the uncertainty of the item parameter estimates is ignored in estimating the impact because item parameters are treated as known in the second step. Thus, the standard error of the impact estimate can be smaller than that in the modeling DIF approach.
In the modeling DIF approach, comparable item parameter estimates and person scores between the reference and focal groups are obtained using all items, by controlling for the secondary dimension due to the DIF items. In this regard, the modeling DIF approach does not hurt content validity, which is not the case for the deleting DIF approach. Furthermore, because item parameters on the primary dimension are estimated with the equality constraints on the item parameters between the reference group and the focal group, the standard errors of the item parameter estimates can be smaller (compared with the one-step multigroup DIF approach). The item parameters and impact parameter on the primary dimension are estimated simultaneously, such that the uncertainty of the item parameter estimates can be incorporated in the estimation of the impact parameter. However, the number of parameters is larger than in the two-step multigroup approach because the primary and secondary dimensions are modeled simultaneously. Accordingly, the sampling variability in the modeling DIF approach can be larger than that in the multigroup DIF approach, especially when the number of DIF items increases (because the number of item discriminations of the secondary dimension increases).
Simulation Study
The simulation study addresses the following two questions, assuming the presence of DIF: (a) Does the modeling approach perform well in explaining the secondary dimension due to DIF to have purified IRT item parameter estimates and person scores? (b) What are the consequences of deleting and ignoring DIF items (as the two common current practices according to survey results presented in Online Appendix A) in item parameter estimates and person scores? To answer these two questions, DIF and impact were generated based on the two-group (two-parameter) item response model as a population data-generating model (a special case of the multigroup analysis [Equation 2 in Online Appendix B] when some portions of the items are non-DIF items). The model was chosen over the confirmatory multigroup multidimensional item response model (Equation 1) as a population data-generating model because the interest is in how the secondary dimension modeling approach can perform to obtain purified IRT item parameter estimates and person scores, not in parameter recovery of the model.
For the first research question in the simulation study, item parameters and person scores in the population model were compared with the item parameter estimates and the predicted person scores from the primary dimension in the modeling DIF approach using the confirmatory multigroup multidimensional item response model. In comparison with the modeling DIF approach, multigroup DIF approaches with one step (for item parameters) and two steps (for impact and person scores) were fit to the same generated datasets. For the second question, the models for deleting and ignoring DIF practices were fit to the same generated datasets.
Simulation Designs
As the focus of this article was on investigating the effect of DIF items on item parameter estimates and person scores, varying conditions were considered for different patterns of DIF. The simulation conditions include the number of DIF items (10%, 30%, or 50%), magnitude of DIF (low or high), and type of DIF (uniform or nonuniform). In addition, the sample size design (a balanced group design or an unbalanced group design) was considered a varying condition that affects IRT item parameter estimation. As shown in Table 1 in Online Appendix A, item response theory–likelihood ratio–differential item functioning (IRT-LR-DIF) is the most commonly used IRT DIF detection method. Woods (2008) reported a literature review regarding the number of persons and items in 16 papers in which IRT-LR-DIF was applied. The mean number of examinees in the reference group was 1,081, and the mean number of items (excluding a few outliers) was 20. Furthermore, the 20-item test was also chosen in other measurement invariance studies (e.g., Clark & LaHuis, 2012; Finch, 2005; Flowers, Oshima, & Raju, 1999; Meade & Bauer, 2007). For these reasons, fixed conditions were chosen in this study for the two groups (i.e., the reference group and the focal group), 20 items and 2,000 persons in total, as used in Woods. Fully crossing the four varying conditions resulted in 24 (3 × 2 × 2 × 2) conditions.
All DIF items were introduced against the focal group. Namely, first, the item parameters for the reference group were generated and then the item parameters for the focal group were manipulated by introducing DIF magnitudes in designated DIF items.
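The fully crossed design can be enumerated directly. The following is a minimal sketch; the variable names are hypothetical, and only the factor levels, the 20-item test length, and the resulting counts come from the text.

```python
from itertools import product

# Levels of the four varying factors, as listed in the design description.
dif_percent = [10, 30, 50]
dif_magnitude = ["low", "high"]
dif_type = ["uniform", "nonuniform"]
sample_design = ["balanced", "unbalanced"]

N_ITEMS = 20  # fixed test length

# One dictionary per simulation condition (hypothetical layout).
conditions = [
    {
        "dif_percent": p,
        "magnitude": m,
        "dif_type": t,
        "design": d,
        "n_dif_items": N_ITEMS * p // 100,  # 2, 6, or 10 DIF items
    }
    for p, m, t, d in product(dif_percent, dif_magnitude, dif_type, sample_design)
]

print(len(conditions))
```

Crossing three levels of the number of DIF items with two magnitudes, two DIF types, and two sample size designs yields the 24 conditions.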
A latent variable for the reference group,
For the reference group, item discriminations were generated from a log-normal distribution with a mean of 0 and a variance of .25, which is used as a prior distribution in the BILOG-MG program (Zimowski, Muraki, Mislevy, & Bock, 1996). Item locations were generated from a standard normal distribution. Table 5 in Online Appendix F presents the generated item parameters for the condition with 10% DIF items, nonuniform DIF, and high magnitude, to illustrate the generation of the DIF conditions explained below. Item parameters for the reference group in the population data-generating model (
Number of DIF items
Three levels of the number of DIF items were considered: 10%, 30%, and 50% (two, six, and 10 items, respectively). In the DIF study literature, 30% DIF is considered a large number of DIF items (e.g., Oshima et al., 1997). As indicated in the introduction, the purification method may not be recommended over having a separate scale for each group when there is a large portion of DIF items. Reise, Widaman, and Pugh (1993) noted that partial measurement invariance may hold when less than half of the items had significant modification indices (MIs) for factor loadings of the common factor model. The 50% DIF item condition was included to investigate the relative performance of the different DIF treatments in the presence of a larger number of DIF items. The first 18 items, 14 items, and 10 items in Table 5 of Online Appendix F were used as anchor items (i.e., non-DIF items) for the 10%, 30%, and 50% conditions, respectively.
DIF magnitudes and type of DIF items
The DIF items were simulated under each of the four DIF conditions: low and high levels of uniform DIF and low and high levels of nonuniform DIF conditions. The DIF magnitudes were chosen to coincide with other DIF studies (e.g., Oshima et al., 1997; Suh & Bolt, 2011).
For the uniform DIF type, the item location parameters for DIF items increased by 0.5 for the focal group, thus making these items harder for the focal group. The 0.5 difference in the item location represents a low level of uniform DIF magnitude. A high level of DIF magnitude was simulated by introducing a 1.0 difference in the item location parameter.
For the nonuniform DIF type, a low level of nonuniform DIF was introduced at a shift level of 0.3 in the item discrimination parameter, such that the item discrimination parameters for the focal group were set 0.3 lower than for the reference group. For a low level of nonuniform DIF, the item location parameter(s) for the focal group also decreased by 0.5, representing DIF in location. A high level of nonuniform DIF was simulated by decreasing the location by 1.0 and the discrimination by 0.6.
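The item generation and DIF manipulation described above can be sketched as follows. Several points are assumptions rather than details from the text: the log-normal mean of 0 and variance of .25 are taken to refer to the log scale (so the log-scale standard deviation is 0.5), the DIF items are placed last because the first items serve as anchors, and no guard is applied if a shifted discrimination becomes very small. The function names are hypothetical.

```python
import numpy as np

# (location shift, discrimination shift) for the focal group's DIF items.
DIF_SHIFTS = {
    ("uniform", "low"): (0.5, 0.0),
    ("uniform", "high"): (1.0, 0.0),
    ("nonuniform", "low"): (-0.5, -0.3),
    ("nonuniform", "high"): (-1.0, -0.6),
}


def generate_reference_items(n_items, rng):
    """Draw reference-group discriminations and locations."""
    # Assumption: mean 0 and variance .25 are on the log scale (sd = 0.5).
    a = rng.lognormal(mean=0.0, sigma=0.5, size=n_items)
    b = rng.standard_normal(n_items)
    return a, b


def make_focal_items(a_ref, b_ref, n_dif, dif_type, magnitude):
    """Copy the reference parameters and shift the last n_dif items."""
    a_foc, b_foc = a_ref.copy(), b_ref.copy()
    if n_dif > 0:
        shift_b, shift_a = DIF_SHIFTS[(dif_type, magnitude)]
        b_foc[-n_dif:] += shift_b
        a_foc[-n_dif:] += shift_a
    return a_foc, b_foc


rng = np.random.default_rng(1)
a_ref, b_ref = generate_reference_items(20, rng)
# 10% DIF items (two items), nonuniform DIF, high magnitude.
a_foc, b_foc = make_focal_items(a_ref, b_ref, 2, "nonuniform", "high")
```

Keeping the shifts in one lookup table makes it easy to check each condition against the magnitudes stated in the text.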
Two scale-level effect sizes, signed test difference in the sample (STDS) and unsigned test difference in the sample (UTDS; Meade, 2010), were calculated to show how much DIF exists in the designed DIF conditions regarding the numbers, types, and magnitudes of DIF. The patterns of scale-level effect sizes can differ, depending on the manipulation of the DIF patterns. Table 6 in Online Appendix G presents two scale-level DIF effect size measures, the STDS and the UTDS; the values are on the total score scale (the two measures can range from 0 to 20) using one simulated data set for each simulation condition.
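As a rough illustration of how such scale-level effect sizes can be computed, the sketch below follows a common reading of Meade (2010): for each item, the expected score under the focal-group parameters minus the expected score under the reference-group parameters is averaged over focal-group persons (signed or in absolute value), and the item-level values are summed over items. The 2PL form a(θ − b), the use of true rather than estimated person scores, and the function names are assumptions.

```python
import numpy as np


def expected_item_scores(theta, a, b):
    """2PL endorsement probabilities; rows are persons, columns are items."""
    logits = a[None, :] * (theta[:, None] - b[None, :])
    return 1.0 / (1.0 + np.exp(-logits))


def stds_utds(theta_focal, a_ref, b_ref, a_foc, b_foc):
    """Signed and unsigned test differences over focal-group persons."""
    diff = (expected_item_scores(theta_focal, a_foc, b_foc)
            - expected_item_scores(theta_focal, a_ref, b_ref))
    signed_item = diff.mean(axis=0)            # item-level signed difference
    unsigned_item = np.abs(diff).mean(axis=0)  # item-level unsigned difference
    return float(signed_item.sum()), float(unsigned_item.sum())


# Hypothetical example: 20 items, uniform DIF of 0.5 on the last two items.
theta_f = np.random.default_rng(0).normal(0.5, 1.0, 500)
a = np.ones(20)
b = np.zeros(20)
b_f = b.copy()
b_f[-2:] += 0.5
stds, utds = stds_utds(theta_f, a, b, a, b_f)
```

When all DIF items favor the same group, as in this uniform example, no cancellation occurs and the two measures have the same absolute value; they diverge when differences cancel across items or persons.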
Sample size design
Balanced and unbalanced designs were considered. For the balanced design, 1,000 persons were assigned to each group. According to Woods’s (2008) literature review of the applications of IRT-LR-DIF, the mean ratio of the number of examinees in the reference group to the number of examinees in the focal group was 3. Thus, for the unbalanced design, 1,500 persons were assigned to the reference group and 500 persons were assigned to the focal group.
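Given item parameters for each group, responses for the two sample size designs can be simulated along the following lines. This is a sketch under stated assumptions: the 2PL form a(θ − b), a standard normal latent variable for the reference group, and a focal-group mean of 0.5 with unit variance are all illustrative choices, and the function names are hypothetical. Only the group sizes are taken from the text.

```python
import numpy as np

# Group sizes (reference, focal) for the two sample size designs.
GROUP_SIZES = {"balanced": (1000, 1000), "unbalanced": (1500, 500)}


def simulate_2pl(theta, a, b, rng):
    """Draw binary responses from a 2PL model (assumed form a * (theta - b))."""
    logits = a[None, :] * (theta[:, None] - b[None, :])
    prob = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(prob.shape) < prob).astype(int)


def simulate_two_groups(design, a_ref, b_ref, a_foc, b_foc, rng, focal_mean=0.5):
    """Simulate person scores and responses for both groups."""
    n_ref, n_foc = GROUP_SIZES[design]
    theta_ref = rng.normal(0.0, 1.0, n_ref)         # assumed N(0, 1)
    theta_foc = rng.normal(focal_mean, 1.0, n_foc)  # assumed impact
    y_ref = simulate_2pl(theta_ref, a_ref, b_ref, rng)
    y_foc = simulate_2pl(theta_foc, a_foc, b_foc, rng)
    return (theta_ref, y_ref), (theta_foc, y_foc)


rng = np.random.default_rng(2)
a = np.ones(20)
b = np.linspace(-2.0, 2.0, 20)
b_focal = b.copy()
b_focal[-2:] += 1.0  # high uniform DIF on two items
(ref_theta, ref_y), (foc_theta, foc_y) = simulate_two_groups(
    "unbalanced", a, b, a, b_focal, rng
)
```

Each call produces one replication; repeating it with different seeds gives the replicated data sets to which the four DIF treatment models would be fit.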
Evaluation Measures
Two accuracy measures were considered to compare the item parameter estimates and person scores across four different DIF approaches: bias and root mean square error (RMSE). The bias is given by
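As a minimal sketch, the conventional definitions of the two measures can be computed as shown below: bias is the mean difference between the estimates and the true value, and RMSE is the square root of the mean squared difference. Treating the inputs as estimates of a single parameter across replications is an assumption about how the summary values were formed, and the function names are hypothetical.

```python
import numpy as np


def bias(estimates, true_value):
    """Mean signed difference between the estimates and the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(estimates - true_value))


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))


# Hypothetical estimates of one item location across three replications.
example_estimates = [0.9, 1.1, 1.3]
example_bias = bias(example_estimates, 1.0)
example_rmse = rmse(example_estimates, 1.0)
```

Because RMSE combines squared bias and sampling variance, an approach can have a small bias and still show a large RMSE when its estimates vary widely across replications.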
Analysis
Mplus 7.11 (Muthén & Muthén, 1998-2014) was used to fit four models with marginal maximum likelihood estimation (ESTIMATOR=MLR in Mplus). For multigroup and modeling approaches, the KNOWNCLASS option for TYPE=MIXTURE was used in Mplus. Prediction errors of the person scores are not available for multigroup and modeling approaches with the MLR estimator. Thus, ESTIMATOR=BAYES (with default priors and hyperpriors) was used to obtain the prediction errors, and 100 imputations were used in Mplus. The posterior median of the person scores was used to calculate the IRT reliability when the posterior distribution for the portion of the person scores was not symmetric.
Results
Due to the page limit, the hypotheses for the simulation study are reported in Online Appendix H. No convergence problems were encountered in any replication for all four approaches in the balanced design. However, in the unbalanced design, one replication had a convergence problem in the modeling DIF approach. The replication was excluded from the analyses of the results.
As expected from the research hypotheses in Online Appendix H, the patterns in the results were similar between the balanced and unbalanced designs for DIF effects, even though there were different magnitudes of bias and RMSE due to the different number of persons in the reference and focal groups in the designs. Regarding the magnitudes of RMSE in the balanced and unbalanced designs, the expected results were found for the multigroup and modeling DIF approaches (except that the impact estimates in the two-step multigroup DIF approach were not much different between the balanced and unbalanced designs). However, unexpectedly, in the deleting and ignoring DIF approaches, the RMSEs were smaller in the unbalanced design than those in the balanced design for the item location estimates and the person scores. Further investigation showed that these unexpected results were due to the different location shift when the impact was ignored in the deleting and ignoring DIF approaches. Specifically, the mean true person score across all persons was 0.264 in the balanced design, whereas it was 0.137 in the unbalanced design. The smaller shift in the unbalanced design than in the balanced design resulted in a smaller bias for the item location estimates and the person scores.
Because this study focused on DIF item treatment comparisons, the results for the DIF effects are reported below for the balanced design. Results for the unbalanced design are reported in Online Appendix I.
Item parameters
Tables 1 and 2 report the average bias, RMSE, and ratio across the items for the item discrimination estimates and the item location estimates, respectively. The item parameter results for the deleting DIF approach were not comparable with the other three approaches because only non-DIF items were used for calibration. However, bias and RMSE are reported in Tables 1 and 2 to interpret them within the deleting DIF approach. The results of the multigroup approach were from the item parameter estimates with the one-step approach, and the results for the item discrimination parameters of the modeling DIF approach were based on the primary dimension.
Table 1. Average Bias, RMSE, and Ratio Across Items for Item Discrimination Estimates.
Note. RMSE = root mean square error; DIF = differential item functioning; - = not applicable.
Table 2. Average Bias, RMSE, and Ratio Across Items for Item Location Estimates.
Note. RMSE = root mean square error; DIF = differential item functioning; - = not applicable.
In the deleting DIF approach, bias for the item discrimination (α) estimates was similar across DIF conditions, and RMSE increased mainly with the increasing number of DIF items. Bias and RMSE for the item-location parameter (β) estimates increased mainly with the increasing number of DIF items.
For the other three approaches, the following patterns were found for the item discrimination parameter (α). First, in terms of bias, the multigroup and modeling DIF approaches had similar values across the simulation conditions and produced smaller bias than the ignoring DIF approach (except one condition, nonuniform type, low magnitude, and 30% DIF items). Second, RMSE for the multigroup DIF approach was smaller than that of the modeling DIF approach, which was the expected result because of the larger number of parameters in the modeling DIF approach. In the modeling DIF approach, RMSE tended to be larger with an increasing number of DIF items and with high magnitudes in both the uniform and nonuniform DIF types, whereas in the multigroup DIF approach this pattern was found only in the nonuniform DIF type. Third, the average ratio of SE_MG to SE_M (the standard errors of the item discrimination estimates in the multigroup and modeling DIF approaches, respectively) was higher than 1.0, indicating smaller standard errors in the modeling DIF approach than in the multigroup DIF approach, for all conditions except three simulation conditions in the uniform DIF type. For all simulation conditions except these three, larger differences in the ratio were found as the number of DIF items increased, especially with the nonuniform DIF type.
For the item-location parameter (
Person scores and IRT reliability
Table 3 presents the average bias and RMSE across persons for person scores (θ) and IRT reliability for each DIF treatment approach. In the table, the results of the multigroup DIF approach were based on the two-step approach. When only non-DIF items were used for scoring in the deleting DIF approach, the bias of the person scores did not change across all levels of DIF conditions, whereas the RMSE for the person scores was influenced by the number of DIF items. Among the other three DIF treatments, the following patterns were found. First, bias and RMSE for the ignoring DIF approach were larger than those for the multigroup and modeling DIF approaches. Although some patterns emerged, the differences between the multigroup and modeling DIF approaches were small. Second, overall, slightly larger bias and RMSE (except two conditions) were found in the modeling DIF approach than in the multigroup DIF approach for the uniform condition. However, the opposite pattern was found for the nonuniform condition, except three conditions in which the same RMSE was found between the multigroup and modeling DIF approaches. Third, bias and RMSE in the multigroup and modeling approaches increased as the number of DIF items and the DIF magnitudes increased (except conditions with 10% DIF items and the nonuniform DIF type). As expected, IRT reliability was mainly influenced by the number of DIF items in the ignoring DIF approach. IRT reliability was similar between the multigroup DIF treatment (with two steps) and the modeling DIF approach, and was not affected by the simulation conditions.
Table 3. Average Bias and RMSE Across Persons for Person Scores and IRT Reliability.
Note. Results for multigroup approach were based on the two-step approach to compare results between the reference and focal groups; bias and RMSE for the multigroup approach were based on maximum likelihood estimates for the comparison with deleting and ignoring approaches; reliability for the multigroup (with the two-step) and modeling approaches were calculated based on Bayes estimation. RMSE = root mean square error; IRT = item response theory; DIF = differential item functioning.
Impact and variances of person scores
The impact (µ) and variance (σ²) of person scores for the focal group can be estimated in the multigroup DIF approach (with two steps) and the modeling DIF approach. Results are reported in Tables 4 and 5 for impact and variance estimates, respectively. The following patterns were found for the impact estimates. First, overall, larger bias and RMSE were found in the multigroup DIF approach than in the modeling DIF approach (except one condition in RMSE, 10% DIF, low magnitude, and uniform DIF type). Second, for all conditions, the impact was underestimated in the multigroup DIF approach (except one condition, 10% DIF, low magnitude, and nonuniform type in which there was no cancelation of DIF across items and persons), whereas it was slightly overestimated in the modeling DIF approach. Third, bias and RMSE increased when the number of DIF items and the magnitudes of the DIF items increased in the multigroup DIF approach. This pattern was also found for bias in the modeling DIF approach, but the degree of the effects of the simulation factors was not as large as in the multigroup DIF approach. RMSE was mainly influenced by the number of DIF items in the modeling DIF approach.
Table 4. Bias, RMSE, and Ratio for Impact Estimates.
Note. Impact for the multigroup approach was estimated with the two-step approach. RMSE = root mean square error; DIF = differential item functioning.
Table 5. Bias, RMSE, and Ratio for Variance Estimates of Person Scores.
Note. Variance for the multigroup approach was estimated with the two-step approach. RMSE = root mean square error; DIF = differential item functioning.
The following patterns were observed regarding the variance of person scores. First, bias in the multigroup DIF approach was larger than bias in the modeling DIF approach across all conditions. However, RMSE was smaller in the multigroup DIF approach than in the modeling DIF approach in the uniform DIF type (except one condition, 30% DIF, high magnitude, uniform DIF type), whereas the opposite pattern was found in the nonuniform DIF type. Second, as shown in Table 4 for impact estimates, bias and RMSE in the multigroup DIF approach increased as the number of DIF items and the magnitudes of the DIF items increased. This pattern was also observed for bias in the modeling approach, but the degree of the effects of the simulation factors was not as large as in the multigroup approach.
Discussion
The purpose of this study was to present the modeling DIF approach and to evaluate it by comparing its performance with that of other DIF treatments such as deleting, ignoring, and multigroup (with one step for item parameters and two steps for impact and person scores) approaches. Overall, the simulation results were consistent with the hypothesized ones, with a few exceptions as noted earlier. The following general patterns were found in the simulation study. First, the multigroup and modeling DIF approaches outperformed the deleting and ignoring approaches for item parameter estimates and person scores. Second, overall, the multigroup approach with two steps works well, compared with the modeling DIF approach. Third, the modeling DIF approach can be a viable method to treat DIF items for most DIF conditions, except when the number of DIF items is large (e.g., 50%). Below, guidelines for choosing one DIF treatment method over another based on the simulation results are provided.
Parameters and Information of Interest
Given the mixed results for the multigroup and modeling DIF approaches, researchers can choose one of the methods depending on the parameters of interest (i.e., item parameters, person scores, impact, and variance of the person scores) and information they need (i.e., accuracy [quantified with bias], overall accuracy [quantified with RMSE], or precision [quantified with standard error]).
The simulation results showed that the multigroup DIF approach with one step can provide better overall accuracy for item parameter estimates than the modeling DIF approach in most DIF conditions. For the nonuniform DIF type, the overall accuracy for item discrimination parameter estimates in the modeling DIF approach can be similar to that in the ignoring DIF approach. The overall accuracy of the item location estimates in the modeling DIF approach was similar to that of the multigroup approach with one step only when there are 10% and 30% DIF items. The modeling DIF and the multigroup DIF with a two-step approach for item parameter estimates are recommended when the number of DIF items is not large (e.g., less than 30%).
However, the standard error of the item parameter estimates in the modeling DIF approach can be smaller than that in the multigroup DIF approach, because item parameters are estimated from all persons in the data in the modeling DIF approach, whereas they are estimated only from persons in the reference group in the multigroup DIF approach. The item parameter estimates from the modeling DIF approach can be more precise than those from the multigroup DIF approach, especially when there are high DIF magnitudes and a large number of DIF items (e.g., 50%). If the reference group contains a large number of persons, the precision of the two-step multigroup approach can be relatively good. Unless there are many more persons in the reference group, the modeling DIF approach is preferred to the multigroup DIF approach when researchers need to use the standard errors of the item parameter estimates, such as when creating an item bank or implementing IRT equating.
For the person scores, the simulation results showed only small differences between the multigroup DIF approach and the modeling DIF approach. However, there was a pattern in which the multigroup DIF approach was slightly better than the modeling DIF approach for the uniform DIF type, whereas the modeling DIF approach was better than the multigroup DIF approach for the nonuniform DIF type. Thus, one of the approaches can be chosen depending on the DIF type found in DIF analyses. Regarding impact and variance, the modeling DIF approach performed better than the multigroup DIF approach in terms of accuracy (bias) and overall accuracy (RMSE).
Multigroup Analysis Versus Modeling DIF
The specification of the multigroup and modeling approaches was based on the assumption that one of the two groups can be set as the reference group and the other as the focal group. It is more critical to justify the reference group selection in the multigroup approach than in the modeling approach. When one of the two groups is arbitrarily set as the reference group, results (i.e., purified item parameters, person scores, impact, and variance of the person scores) can change in the multigroup approach because of the two-step nature of its scoring and impact estimation. In contrast, the results from the modeling approach will not change when the reference group is chosen arbitrarily. Thus, when a reference group cannot be clearly justified, the modeling approach is recommended over the multigroup approach.
DIF Effect Sizes and Test Score Uses
DIF effects can be characterized differently, depending on the number of DIF items, the magnitude of the DIF, and the type of DIF. The DIF effect sizes can be calculated at the item and at the test levels to quantify the effects of these factors on the total score scale or latent variable scale. The DIF effect sizes can be a useful guideline in deciding a DIF treatment method.
In interpreting DIF effect sizes, it is important to decide whether the interpretation is made for the individual score level or the whole population level (e.g., Borsboom, 2006). As an example of the whole population level, when there is a large portion of nonuniform DIF items, cancelation is allowed between a reference group and a focal group. In this case, it is expected that differential effects between ignoring DIF and multigroup analysis or between ignoring DIF and modeling DIF will be small in impact estimate. When researchers care about the impact only, ignoring DIF would not cause bias in the presence of a full cancelation (i.e.,
In fact, it may be challenging to provide a general guideline for interpreting DIF effect sizes because the interpretation of the magnitude of DIF effect size may vary across test score uses. For example, the scale-level DIF effect size (ranged from 0 to 10),
This study has several limitations. First, as in other simulation studies, the simulation conditions employed in the study design were limited, including the 20-item test and −0.5 impact, because the authors’ main interest in the simulation study was to evaluate their proposed method and compare it with other methods under various DIF conditions. Investigating the four different approaches in more extensive simulation studies including various testing conditions is needed to provide more general guidelines for item calibration and scoring practices in the presence of DIF items.
Second, the multigroup and modeling approaches were based on the assumptions that there are two categories of items (i.e., DIF items and non-DIF items) and that the DIF items were flagged with the right criterion. Thus, the performance of these approaches can differ depending on the quality of the DIF detection. It is known that the power and Type I error of DIF item detection vary across DIF detection methods (e.g., Bolt, 2002). Thus, it may be common to obtain different categorizations of items (i.e., DIF items vs. non-DIF items) depending on the DIF detection method, which can be a potential problem when using the multigroup and modeling approaches. One possible strategy for dealing with this problem is to show the sensitivity of the item parameter estimates, person scores, and impact results to the different DIF categorizations produced by different DIF detection methods. When the results are consistent across the different DIF detection methods, they can be reported as final.
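A first step in such a sensitivity check is to quantify how much the DIF categorizations themselves disagree before refitting any model. The following is a rough sketch under stated assumptions, not a procedure from this study: the method labels, the flagged item numbers, and the function names `jaccard` and `categorization_agreement` are all hypothetical, and the Jaccard index is only one of several possible agreement measures.

```python
from itertools import combinations


def jaccard(flags_a, flags_b):
    """Jaccard overlap between two sets of flagged item numbers."""
    a, b = set(flags_a), set(flags_b)
    union = a | b
    # Two empty sets are treated as perfect agreement (no DIF items flagged).
    return 1.0 if not union else len(a & b) / len(union)


def categorization_agreement(flagged_by_method):
    """Compare DIF categorizations produced by several detection methods.

    flagged_by_method : dict mapping a method label to the items it flagged
    Returns pairwise Jaccard overlaps, the items flagged by every method,
    and the disputed items (flagged by some methods but not by all).
    """
    if not flagged_by_method:
        raise ValueError("at least one detection method is required")
    methods = sorted(flagged_by_method)
    pairwise = {
        (m1, m2): jaccard(flagged_by_method[m1], flagged_by_method[m2])
        for m1, m2 in combinations(methods, 2)
    }
    flag_sets = [set(flagged_by_method[m]) for m in methods]
    flagged_by_all = set.intersection(*flag_sets)
    disputed = set.union(*flag_sets) - flagged_by_all
    return pairwise, flagged_by_all, disputed


# Hypothetical flags from three detection methods on a 20-item test.
flags = {
    "method_A": [3, 7, 12, 18],
    "method_B": [3, 7, 12],
    "method_C": [3, 7, 15, 18],
}
pairwise, flagged_by_all, disputed = categorization_agreement(flags)
print(pairwise, flagged_by_all, disputed)
```

The disputed items are the ones whose treatment would differ across categorizations, so the item parameter estimates, person scores, and impact could then be re-estimated under each categorization and compared.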
Third, the model specification for the modeling DIF approach is limited to one primary dimension and one secondary dimension for binary responses. It would be worthwhile to extend the modeling DIF approach to more complex DIF patterns, such as more than one primary or secondary dimension, and/or to polytomous responses.
The current study presented and evaluated an IRT purification method for item calibration and scoring in the presence of DIF after a subset of items was detected as DIF items. In addition, the method was compared with current DIF treatment methods: the deleting DIF, ignoring DIF, and multigroup approaches. It is hoped that this article improves the practice of dealing with DIF items and leads to further discussions and studies on the treatment of DIF items in evaluating measures.
Footnotes
Acknowledgements
The authors are grateful to Dr. Isabel Gauthier (Vanderbilt University) for making the data available for application and to Dr. Sonya Sterba (Vanderbilt University) for comments on an earlier draft.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
