Evaluation of Model Selection Strategies for Cross-Level Two-Way Differential Item Functioning Analysis

Abstract

Model specification issues for the cross-level two-way differential item functioning model were previously investigated by Patarapichayatham et al. (2009). Their study showed that an incorrect model specification can easily lead to biased estimates of key parameters. The objective of this article is to provide further insight into the issue by focusing on the impact of model selection strategies. Six model selection strategies were compared. Through analyses of repeatedly simulated data, the frequency with which each model was selected as the best model and the resulting parameter estimates were evaluated. The Bayesian information criterion (BIC) strategy tended to choose incomplete models more often than the other strategies and led to more biased parameter estimates.

Keywords

two-way DIF, hierarchical generalized linear modeling (HGLM), model selection strategies, AIC, BIC, ABIC, likelihood ratio tests

Introduction

Differential item functioning (DIF) can be described as a difference in an item's difficulty between subgroups of examinees who have the same ability level on the trait being measured. While many DIF detection strategies have been proposed and studied, there have been attempts to fit a two-way DIF model with two DIF factors, where one DIF factor is an individual characteristic variable and the other is a cluster characteristic variable (e.g., Chaimongkol, Huffer, & Kamata, 2006; Kamata, Bauer, & Miyazaki, 2008; Kamata & Cheong, 2007; Vaughn, 2007). Patarapichayatham, Kamata, and Kanjanawasee (2009) referred to this model as the cross-level two-way DIF model.

Patarapichayatham et al. (2009) pointed out that the cross-level two-way DIF model had often been described as an incomplete model, and they investigated the impact of incomplete model specifications on the model's parameter estimates. They investigated four models: the full model, an incomplete model without the two-way interaction between DIF factors, an incomplete model without the cluster-level DIF, and an incomplete model without both the two-way interaction and the cluster-level DIF. Their study revealed that estimates of the individual-level DIF and the three-way interaction were substantially biased under the three incomplete models. Dropping a nonzero cluster-level DIF affected these two parameters much more than dropping a nonzero two-way interaction, and dropping both affected them more than dropping either one alone. Although their study highlighted the importance of correct model specification, it did not examine model selection strategies. Since a practitioner is likely to use a model selection strategy to justify the selection of an incomplete model, it is important to understand the impact of model selection strategies for the cross-level two-way DIF model. Therefore, this study investigated the impact of six model selection strategies on model selection and model parameter estimates. Furthermore, this study expanded the simulation conditions that Patarapichayatham et al. (2009) implemented.

Design for the Simulation Study

Modeling

As Patarapichayatham et al. (2009) described, the cross-level two-way DIF model for dichotomously scored items is a Rasch family item response model and can be written in the three-level hierarchical generalized linear modeling (HGLM) framework as

$$\log\left(\frac{p_{ijk}}{1 - p_{ijk}}\right) = \gamma_{1} X_{j} + \gamma_{2} W_{k} + \gamma_{3i} I_{i} + \gamma_{4} X_{j} W_{k} + \gamma_{5i} X_{j} I_{i} + \gamma_{6i} W_{k} I_{i} + \gamma_{7i} X_{j} W_{k} I_{i} + \theta_{j}^{(2)} + \theta_{k}^{(3)}, \tag{1}$$

where $p_{ijk}$ is the probability of a correct answer for item $i$ by individual $j$ in cluster $k$, $X_{j}$ is an individual-level DIF factor, $W_{k}$ is a cluster-level DIF factor, and $I_{i}$ is an item indicator. As a result, $\gamma_{1}$ is the individual-level ability difference, $\gamma_{2}$ is the cluster-level ability difference, $\gamma_{3i}$ is the item difficulty for item $i$, and $\gamma_{4}$ is the two-way interaction effect between the individual- and cluster-level DIF factors. In this study, $\gamma_{4}$ is referred to as the “two-way interaction” for simplicity. Furthermore, $\gamma_{5i}$ is the individual-level DIF effect for item $i$, $\gamma_{6i}$ is the cluster-level DIF effect for item $i$, and $\gamma_{7i}$ is the individual-level DIF × cluster-level variable interaction for item $i$. In this study, $\gamma_{7i}$ is referred to as the “three-way interaction.” Last, $\theta_{j}^{(2)}$ is the individual-level ability and $\theta_{k}^{(3)}$ is the cluster-level ability. Since $\theta_{j}^{(2)}$ and $\theta_{k}^{(3)}$ are random effects, their variances $\sigma^{2}_{\theta_{j}^{(2)}}$ and $\sigma^{2}_{\theta_{k}^{(3)}}$ are parameters to be estimated. These parameters were estimated within the two-level confirmatory factor analysis (CFA) model by the full-information maximum likelihood estimator using Mplus software.
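As a reading aid, the linear predictor in Equation (1) can be evaluated directly for one person-item pair. The following is a minimal sketch, not the estimation code used in the study: the function name, the dictionary layout, and the 0/1 coding of the DIF factors (1 for the focal group) are assumptions made for illustration.

```python
import math

def correct_response_probability(x, w, item, gamma, theta_person, theta_cluster):
    """Probability of a correct answer implied by Equation (1).

    x, w: assumed 0/1 codes for the individual- and cluster-level DIF factors.
    item: index of the item whose indicator I_i equals 1.
    gamma: dict with scalars g1, g2, g4 and per-item lists g3, g5, g6, g7.
    """
    eta = (
        gamma["g1"] * x                # individual-level ability difference
        + gamma["g2"] * w              # cluster-level ability difference
        + gamma["g3"][item]            # item parameter for item i
        + gamma["g4"] * x * w          # two-way interaction
        + gamma["g5"][item] * x        # individual-level DIF for item i
        + gamma["g6"][item] * w        # cluster-level DIF for item i
        + gamma["g7"][item] * x * w    # three-way interaction for item i
        + theta_person                 # individual-level random effect
        + theta_cluster                # cluster-level random effect
    )
    # Inverse logit turns the log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-eta))
```

Dropping a term from the model corresponds to fixing the matching entry of `gamma` at zero.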

Like Patarapichayatham et al. (2009), we set up four models based on Equation (1): the full model (Model 1), an incomplete model without the two-way interaction (Model 2), an incomplete model without the cluster-level DIF (Model 3), and an incomplete model with neither the two-way interaction nor the cluster-level DIF (Model 4).
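The four models differ only in whether the two-way interaction and the cluster-level DIF are included, so their relationships can be sketched with sets. This is an illustrative encoding under assumed names (`MODEL_TERMS`, `is_nested`), not part of the original analysis; it makes explicit which pairs of models are nested.

```python
# Optional terms per model; every other term of Equation (1) is kept in all four models.
MODEL_TERMS = {
    1: frozenset({"two_way_interaction", "cluster_level_dif"}),  # full model
    2: frozenset({"cluster_level_dif"}),                         # no two-way interaction
    3: frozenset({"two_way_interaction"}),                       # no cluster-level DIF
    4: frozenset(),                                              # neither term
}

def is_nested(reduced, full):
    """True when the optional terms of `reduced` are a proper subset of those of `full`."""
    return MODEL_TERMS[reduced] < MODEL_TERMS[full]
```

Under this encoding, Models 2, 3, and 4 are each nested in Model 1, and Model 4 is nested in Models 2 and 3, while Models 2 and 3 are not nested in each other.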

Model Selection Strategies

A total of six model selection strategies were evaluated. Each of the first three strategies used a single information criterion: the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the adjusted Bayesian information criterion (ABIC). Values of the target information criterion were compared across the four models, and the model with the smallest value was selected as the best model.
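The single-criterion strategies can be sketched as follows. The article does not print the formulas, so the definitions here are the commonly used ones and should be read as assumptions: AIC = −2LL + 2k, BIC = −2LL + k ln(n), and ABIC taken to be the sample-size adjusted BIC, −2LL + k ln((n + 2)/24). The log-likelihoods, parameter counts, and the choice of n below are made up for illustration.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC, and sample-size adjusted BIC for one fitted model."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        # Assumed ABIC form: BIC with n replaced by (n + 2) / 24.
        "ABIC": deviance + n_params * math.log((n_obs + 2.0) / 24.0),
    }

def select_by_criterion(fits, criterion):
    """Pick the model number whose criterion value is smallest."""
    return min(fits, key=lambda model: fits[model][criterion])

# Hypothetical fit results for two of the four models.
fits = {
    1: information_criteria(-15010.0, 20, 2500),
    2: information_criteria(-15012.0, 19, 2500),
}
best_by_aic = select_by_criterion(fits, "AIC")
best_by_bic = select_by_criterion(fits, "BIC")
```

With these made-up numbers, AIC prefers Model 1 and BIC prefers Model 2, because the BIC penalty per parameter, ln(2500), is larger than the AIC penalty of 2.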

The fourth model selection strategy combined the results from all three information criteria: if at least two of the three criteria indicated the same model as the best model, that model was selected. It was possible, however, for all three criteria to indicate different models. In that case, no model selection decision was made. This strategy was referred to as the “2 of 3” strategy in this study.
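The “2 of 3” rule amounts to a majority vote among the three criteria. A minimal sketch, with a hypothetical function name and `None` standing in for “no decision”:

```python
from collections import Counter

def two_of_three(best_by_aic, best_by_bic, best_by_abic):
    """Return the model chosen by at least two criteria, or None if all three disagree."""
    votes = Counter([best_by_aic, best_by_bic, best_by_abic])
    model, count = votes.most_common(1)[0]
    return model if count >= 2 else None
```

For example, `two_of_three(1, 2, 1)` returns Model 1, while `two_of_three(1, 2, 3)` returns `None`.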

The fifth model selection strategy evaluated the p-values of the two-way interaction and the cluster-level DIF. If both parameters were statistically significant at the α = .05 level, Model 1 was considered the best model. If only the two-way interaction was significant, Model 2 was considered the best model. Similarly, if only the cluster-level DIF was significant, Model 3 was considered the best model. If neither parameter was significant, Model 4 was considered the best model. This model selection strategy was referred to as the “p-value” strategy in our study.
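The p-value rule can be written as a small function. This sketch copies the mapping from significance pattern to model number exactly as it is worded in the paragraph above; the function and argument names are hypothetical, and the use of a strict inequality at α is an assumption.

```python
def select_by_p_values(p_two_way, p_cluster_dif, alpha=0.05):
    """Map the two significance tests to a model number, following the rule as stated."""
    two_way_significant = p_two_way < alpha
    cluster_dif_significant = p_cluster_dif < alpha
    if two_way_significant and cluster_dif_significant:
        return 1
    if two_way_significant:
        return 2  # only the two-way interaction was significant
    if cluster_dif_significant:
        return 3  # only the cluster-level DIF was significant
    return 4      # neither parameter was significant
```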

The sixth model selection strategy used a series of likelihood ratio tests (LRTs) for pairs of nested models. First, we tested Models 2 and 4 and Models 3 and 4. Four scenarios were possible. First, if neither test was significant, we tested Models 1 and 4. If that test was significant, Model 1 was considered the best model; otherwise, Model 4 was considered the best model. Second, if the test was significant only for Models 3 and 4, we tested Models 1 and 3. If Model 1 was significantly better than Model 3, Model 1 was considered the best model. However, if there was no statistical evidence that Model 1 was better than Model 3, Model 3 was considered the best model. Third, if the test was significant only for Models 2 and 4, we further tested Models 1 and 2. If this test was significant, Model 1 was considered the best model. However, if there was no statistical evidence that Model 1 was better than Model 2, Model 2 was considered the best model. Finally, if both tests were significant, that is, both Models 2 and 3 were significantly better than Model 4, we further tested Models 1 and 2 and Models 1 and 3. If both tests were again significant, Model 1 was considered the best model. If only the test for Models 1 and 2 was significant, Model 3 was considered the best model. Similarly, if only the test for Models 1 and 3 was significant, Model 2 was considered the best model. However, if neither test was significant, there was no statistical evidence that Model 1 was better than Model 2 or Model 3. In this case, no model selection decision was made, because the LRT could not compare Models 2 and 3, which are not nested.
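The LRT decision sequence can be summarized as a small decision tree. The sketch below takes the outcome of each pairwise test as a boolean (True when the test is significant); how each test statistic and its degrees of freedom are obtained is left outside the sketch. The function and argument names are hypothetical, and `None` stands for “no decision.”

```python
def select_by_lrt(sig_2v4, sig_3v4, sig_1v4, sig_1v2, sig_1v3):
    """Follow the four scenarios of the LRT strategy and return a model number or None."""
    if not sig_2v4 and not sig_3v4:
        # Neither Model 2 nor Model 3 improves on Model 4: test the full model against Model 4.
        return 1 if sig_1v4 else 4
    if sig_3v4 and not sig_2v4:
        # Only Model 3 improves on Model 4: test the full model against Model 3.
        return 1 if sig_1v3 else 3
    if sig_2v4 and not sig_3v4:
        # Only Model 2 improves on Model 4: test the full model against Model 2.
        return 1 if sig_1v2 else 2
    # Both Models 2 and 3 improve on Model 4.
    if sig_1v2 and sig_1v3:
        return 1
    if sig_1v2:
        return 3  # Model 1 beats Model 2 but not Model 3
    if sig_1v3:
        return 2  # Model 1 beats Model 3 but not Model 2
    return None   # Models 2 and 3 are not nested, so no decision is made
```

Only the tests relevant to the scenario at hand are consulted; for instance, `sig_1v4` is ignored unless both first-stage tests are nonsignificant.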

Simulation Conditions

Two cluster sizes (50 and 100 students in each school), three magnitudes of two-way interaction (small [0.1], medium [0.2], and large [0.3]), and three magnitudes of cluster-level DIF (small [0.2], medium [0.4], and large [0.6]) were investigated, totaling 18 simulation conditions. The number of clusters (schools) was fixed at 50 in all conditions. Among 50 clusters in each simulation condition, 25 clusters were considered to be in the focal group and the remaining 25 were considered to be in the reference group of the cluster-level DIF factor. For each cluster, half of the students were categorized into the focal group for the individual-level DIF factor, and the remaining were categorized into the reference group.
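The fully crossed design can be enumerated in a few lines. This is only a sketch of the design grid described above; the variable names are illustrative.

```python
from itertools import product

N_CLUSTERS = 50                         # schools, fixed in all conditions
CLUSTER_SIZES = (50, 100)               # students per school
TWO_WAY_INTERACTION = (0.1, 0.2, 0.3)   # small, medium, large
CLUSTER_LEVEL_DIF = (0.2, 0.4, 0.6)     # small, medium, large

conditions = [
    {
        "cluster_size": size,
        "two_way": two_way,
        "cluster_dif": cluster_dif,
        "n_students": N_CLUSTERS * size,  # total sample size in the condition
    }
    for size, two_way, cluster_dif in product(
        CLUSTER_SIZES, TWO_WAY_INTERACTION, CLUSTER_LEVEL_DIF
    )
]
```

The grid has 2 × 3 × 3 = 18 conditions, with 2,500 or 5,000 simulated students per data set depending on the cluster size.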

It was assumed that the test consisted of 12 dichotomous items with item difficulties ranging from −2.0 to 2.5. The individual-level mean ability difference was 0.2, and the cluster-level mean ability difference was 0.1. Only 1 of the 12 items was assumed to display nonzero DIF. Regarding DIF magnitudes, the individual-level DIF was −0.5, and the three-way interaction was 0.3. For individuals in the reference groups of both the individual-level DIF factor and the cluster-level DIF factor, the DIF magnitude was 0.0. If the individual was in the focal group for the individual-level DIF factor but in the reference group for the cluster-level DIF factor, the DIF magnitude was −0.5. Similarly, if the individual was in the reference group for the individual-level DIF factor but in the focal group for the cluster-level DIF factor, the DIF magnitude was 0.2, 0.4, or 0.6, depending on the simulation condition. Moreover, if the individual was in the focal group for both the individual-level and the cluster-level DIF factors, the DIF magnitude was the sum of the two-way interaction, the individual-level DIF, the cluster-level DIF, and the three-way interaction, which was 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, or 0.7, depending on the simulation condition. Furthermore, the variances of the individual-level ability ($\sigma^{2}_{\theta_{j}^{(2)}}$) and the cluster-level ability ($\sigma^{2}_{\theta_{k}^{(3)}}$) were assumed to be 1.0. Dichotomous item response data were then randomly generated, and the four models were fit to estimate parameters, with 50 replications for each simulation condition. The best model was selected by each of the six model selection strategies for each replication. The impact of the six model selection strategies was evaluated by the frequency with which each model was selected as the best model in the 18 simulation conditions. The means and the standard deviations of the six parameters in the best model were evaluated as well.
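The DIF magnitudes for the four group combinations follow from simple arithmetic, which the sketch below reproduces. It assumes 0/1 coding with 1 for the focal group; the function name is hypothetical.

```python
from itertools import product

INDIVIDUAL_DIF = -0.5   # individual-level DIF for the studied item
THREE_WAY = 0.3         # three-way interaction for the studied item

def dif_magnitude(x, w, two_way, cluster_dif):
    """Total shift for the DIF item given group codes x (individual) and w (cluster)."""
    return INDIVIDUAL_DIF * x + cluster_dif * w + (two_way + THREE_WAY) * x * w

# Distinct magnitudes for examinees in the focal group on both factors,
# rounded to one decimal to remove floating-point noise.
focal_focal = sorted(
    {
        round(dif_magnitude(1, 1, two_way, cluster_dif), 1)
        for two_way, cluster_dif in product((0.1, 0.2, 0.3), (0.2, 0.4, 0.6))
    }
)
```

For example, with a two-way interaction of 0.1 and a cluster-level DIF of 0.2, the focal-focal magnitude is 0.1 − 0.5 + 0.2 + 0.3 = 0.1, and across the nine combinations the distinct values run from 0.1 to 0.7 in steps of 0.1.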

Results

First, the frequencies of Model 1 being selected as the best model for each of the six model selection strategies for the 18 simulation conditions were evaluated. The results are summarized in Figure 1. In the figure, a cross mark (×) is placed for conditions in which Model 1 was most frequently selected as the best model. In general, the frequency sharply increased as the magnitude of the cluster-level DIF and the two-way interaction increased. An exception was the p-value strategy, with which the frequencies were already high when the magnitude of the two-way interaction was low. For the p-value strategy, a sharp increase of the frequency was observed only for the change in the cluster-level DIF magnitudes. Moreover, Model 1 was most frequently selected as the best model in all simulation conditions by the p-value strategy.

Figure 1. Frequency of Model 1 being selected

The effect of the cluster-level DIF magnitude differed more between simulation conditions when the magnitude of the two-way interaction was larger. However, the difference between the effects of medium and large magnitudes of the cluster-level DIF ($\gamma_{6i}$ = 0.4 and 0.6) was very small in the large cluster size conditions. This indicates that a larger cluster size helped identify a smaller cluster-level DIF effect. Overall, the AIC and the p-value strategies tended to choose Model 1 as the best model more often than the other model selection strategies. The other strategies required larger two-way interaction and cluster-level DIF effects for Model 1 to be the most frequently selected model.

Second, the frequencies of Model 2 being selected as the best model were evaluated, although a summary figure is not presented here. Overall, the frequency decreased as the magnitudes of the cluster-level DIF and the two-way interaction increased. The difference in the frequencies was greater at the small magnitude of the two-way interaction ($\gamma_{4}$ = 0.1). However, the frequencies became small and very close to each other when the magnitude of the two-way interaction was large ($\gamma_{4}$ = 0.3). Moreover, in conditions with the large cluster size, the frequencies for conditions with medium and large magnitudes of the cluster-level DIF ($\gamma_{6i}$ = 0.4 and 0.6) were almost identical. Overall, all model selection strategies tended to select Model 2 when the two-way interaction effect and the cluster-level DIF were small. In some conditions, every strategy except the p-value strategy chose Model 2 as the best model most frequently.

Third, the frequencies of Model 3 being selected as the best model were evaluated. The frequency tended to increase as the magnitude of the two-way interaction became larger and the magnitude of the cluster-level DIF became smaller. In fact, Model 3 had the highest frequency only when the two-way interaction was the largest ($\gamma_{4}$ = 0.3) and the cluster-level DIF was the smallest ($\gamma_{6i}$ = 0.2), and only for four of the six selection strategies (BIC, ABIC, 2 of 3, and LRT). With the p-value strategy, the frequencies were zero for all simulation conditions.

Last, the frequencies of Model 4 being selected were evaluated. In general, the frequencies for Model 4 decreased when the magnitudes of the two-way interaction and the cluster-level DIF increased, except for the p-value strategy, for which the frequencies were zero or near zero in all simulation conditions. Model 4 was never the most frequently selected model with the AIC and the p-value strategies. With the other model selection strategies, Model 4 was the most frequently selected model only when the cluster-level DIF was the smallest and the two-way interaction was not the largest. The only exception was with BIC, for the condition with the medium cluster-level DIF ($\gamma_{6i}$ = 0.4) and the smallest two-way interaction ($\gamma_{4}$ = 0.1) with the small cluster size.

Although numerical results are not reported here, the bias and the standard errors were evaluated for six parameters ($\gamma_{1}$, $\gamma_{2}$, $\gamma_{5i}$, $\gamma_{7i}$, $\sigma^{2}_{\theta_{j}^{(2)}}$, and $\sigma^{2}_{\theta_{k}^{(3)}}$) for each model selection strategy. With the BIC strategy, the estimates of the individual-level DIF and the three-way interaction were quite different from the true values across the three simulation factors. This is probably because the BIC strategy selected incomplete models more often and consequently produced biased estimates for these two parameters. The standard errors indicated that the choice of model selection strategy did not affect the stability of the parameter estimates.
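One common way to summarize such results is to compare the mean estimate across replications with the true value and to report the spread of the estimates. The article does not state its exact computation, so the sketch below is an assumption: bias is taken as the mean estimate minus the true value, and variability as the sample standard deviation across replications. The estimates shown are made up.

```python
from statistics import mean, stdev

def summarize_estimates(estimates, true_value):
    """Summarize one parameter's estimates across replications."""
    average = mean(estimates)
    return {
        "mean": average,
        "bias": average - true_value,   # assumed definition of bias
        "sd": stdev(estimates),         # sample standard deviation across replications
    }

# Hypothetical estimates of the individual-level DIF (true value -0.5).
summary = summarize_estimates([-0.4, -0.5, -0.6], true_value=-0.5)
```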

Conclusions

Our study demonstrated that model selection and the quality of parameter estimates were indeed affected by model selection strategies and by the three simulation factors. It was evident that when the two-way interaction effect, the cluster-level DIF effect, and the cluster size became larger, all model selection strategies tended to select the complete model. On the other hand, when these effects and the cluster size were smaller, model selection strategies did not necessarily select the complete model. The performance of AIC in this study was consistent with previous research on model selection (e.g., Kang & Cohen, 2007; Li, Cohen, Kim, & Cho, 2009); that is, AIC tended to select a more complicated model. The performance of AIC contrasted directly with that of BIC, as the BIC strategy appeared to choose a simpler model rather than a more complicated alternative, similar to what Li et al. (2009) found in their study. However, the performance of some model selection strategies was inconsistent in some simulation conditions. These results are not surprising, because other studies have found that results from different model selection strategies are not always consistent (e.g., Kang, Cohen, & Sung, 2009; Li et al., 2009). In practice, a researcher can use a model selection strategy to justify an incomplete model. However, our results indicated that BIC by itself is not a recommended criterion for the cross-level two-way DIF detection model. If it is used, we recommend using it along with other information criterion values, as in our “2 of 3” strategy. Nonetheless, we must keep in mind that different model selection strategies may suggest different models.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The original work of this paper was conducted while the first author was receiving the Strategic Scholarship for Frontier Research Network from the Commission on Higher Education of Thailand.

References

Chaimongkol, Huffer, W. F., & Kamata (2006). An explanatory differential item functioning (DIF) model by the WinBUGS 1.4. Songklanakarin Journal of Science and Technology, 29, 449-458.

Kamata, Bauer, D. J., & Miyazaki (2008). Multilevel measurement model. In O’Connell, A. A., & McCoach, D. B. (Eds.), Multilevel analysis of educational data (pp. 345-388). Charlotte, NC: Information Age.

Kamata, & Cheong (2007). Multilevel Rasch model. In von Davier, & Carstensen, C. H. (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 217-232). New York, NY: Springer.

Kang, & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331-358.

Kang, Cohen, A. S., & Sung, H.-J. (2009). Model selection indices for polytomous items. Applied Psychological Measurement, 33, 499-518.

Li, Cohen, A. S., Kim, S.-H., & Cho, S.-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33, 353-373.

Patarapichayatham, Kamata, & Kanjanawasee (2009). Cross-level two-way differential item functioning analysis model by multilevel Rasch modeling. Journal of Research Methodology & Cognitive Science, 7, 15-21.

Vaughn, B. K. (2007). A hierarchical generalized linear model of random differential item functioning for polytomous items: A Bayesian multilevel approach (Unpublished doctoral dissertation). Florida State University, Tallahassee.