Abstract
The logistic regression (LR) procedure for testing differential item functioning (DIF) typically depends on the asymptotic sampling distributions. The likelihood ratio test (LRT) usually relies on the asymptotic chi-square distribution. Also, the Wald test is typically based on the asymptotic normality of the maximum likelihood (ML) estimation, and the Wald statistic is tested using the asymptotic chi-square distribution. However, in small samples, the asymptotic assumptions may not work well. The penalized maximum likelihood (PML) estimation removes the first-order finite sample bias from the ML estimation, and the bootstrap method constructs the empirical sampling distribution. This study compares the performances of the LR procedures based on the LRT, Wald test, penalized likelihood ratio test (PLRT), and bootstrap likelihood ratio test (BLRT) in terms of the statistical power and type I error for testing uniform and non-uniform DIF. The result of the simulation study shows that the LRT with the asymptotic chi-square distribution works well even in small samples.
The logistic regression (LR) procedure (Swaminathan & Rogers, 1990) is a popular method for testing differential item functioning (DIF). In the LR procedure, an LR is used to model the probability of getting an item correct using a conditioning variable (e.g., observed total test score), a group membership, and an interaction between the conditioning variable and group membership. An item is said to show DIF if the regression coefficients related to the group membership or group–conditioning interaction are statistically significantly different from zero. The regression coefficients are usually estimated using the maximum likelihood (ML) estimation, and the statistical hypotheses about DIF are typically tested based on asymptotic distributions. However, in small samples, it is well known that the ML estimation may be biased (Cordeiro & McCullagh, 1991; Firth, 1993) and the asymptotic distributions may not work well (Davison & Hinkley, 1997; MacKinnon, 2009).
In the previous studies of DIF, it has been pointed out that the lack of robustness of the ML estimation and asymptotic assumption may limit the use of the LR procedure in small samples (Parshall & Miller, 1995; Swaminathan & Rogers, 1990). Therefore, it is important to determine the extent to which the traditional LR procedure based on the ML estimation and asymptotic assumption is valid in small samples. Also, it is worthwhile to examine whether other methods, such as the penalized maximum likelihood (PML) estimation, penalized likelihood ratio test (PLRT), and bootstrap likelihood ratio test (BLRT), could be considered as alternatives. The PML estimation is an estimation method developed to address the issue of potential bias of the ML estimation in small samples (Firth, 1993), and the PLRT compares the likelihoods of two nested models based on the PML estimation. In the BLRT, the likelihood ratio test (LRT) statistic is tested based on the empirical sampling distribution constructed from bootstrap samples, rather than the asymptotic sampling distribution derived from the asymptotic theory (Efron, 1979). The goal of this study is to compare the performances of the traditional LR procedure and the potential alternatives in small samples. More specifically, the traditional LR procedure based on the ML estimation and asymptotic assumption was compared with the LR procedures based on the PLRT and BLRT, especially in small samples, in terms of the statistical power and type I error rate for testing uniform and non-uniform DIF.
For more detailed discussion on the LR procedure, let us consider the following three LR equations (Fidalgo, Alavi, & Amirian, 2014):
where
As can be seen from the discussion above, hypothesis tests in the LR procedure typically depend on asymptotic sampling distributions. The LRT depends on the asymptotic chi-square distribution of the test statistic. Also, the Wald test is based on the asymptotic normality of ML estimators. In general, ML estimators have been widely used in many different settings because of their desirable asymptotic properties: they are asymptotically unbiased and normally distributed. However, these properties may not hold in small samples. The ML estimators have a bias that increases as the sample size decreases, usually known as finite-sample or small-sample bias. Therefore, in small samples, a bias correction for the ML estimators (e.g., the PML estimation) may be worthwhile (Cordeiro & McCullagh, 1991; Firth, 1993). Also, the sampling distribution of the ML estimators may deviate from the normal distribution in small samples. In such a case, statistical inferences based on the non-parametric empirical sampling distribution (e.g., bootstrap) can be more accurate than statistical inferences based on the asymptotic normal distribution (MacKinnon, 2009).
In the specific context of DIF, previous studies have also raised concerns about testing DIF in small samples. Swaminathan and Rogers (1990) pointed out that the asymptotic result of an LR may not be a valid indicator of the presence of DIF in small samples. Mazor, Clauser, and Hambleton (1992) reported that more than 50% of the DIF items were missed when the Mantel–Haenszel (MH) procedure was used for samples of 500 or fewer examinees. Rogers and Swaminathan (1993) found that the distributional assumptions for the LR and MH procedures were less often met in small samples. Parshall and Miller (1995) compared the MH procedures based on the asymptotic chi-square distribution with the MH procedures based on the exact test. In their study, the performance of the MH procedures was extremely limited when the focal group contained fewer than 100 examinees. Roussos and Stout (1996) compared the Simultaneous Item Bias Test (SIBTEST) and the MH procedure in small samples and found that the two methods performed satisfactorily in terms of type I error rates under all simulation conditions. Camilli (2006) also pointed out that the DIF test using the traditional LRT might be problematic in small samples.
Building on the previous studies, this study was designed to compare the performances of different statistical inferential methods for the LR procedure, especially in small samples. More specifically, the null hypothesis, which is
Method for Testing DIF
Swaminathan and Rogers (Wald Tests)
Swaminathan and Rogers (1990) assumed that the LR coefficients in Equation 1 follow a multivariate normal distribution:
where
where the
Then, it was shown that the null hypothesis can be tested using the following test statistic:
which follows the chi-square distribution with two degrees of freedom.
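The Wald test described above can be illustrated with a minimal sketch. It is not the authors' code: the simulated data, the coefficient values, the coding of group membership (0 = reference, 1 = focal), and all function names are assumptions made for illustration. The sketch fits the full LR model by Newton-Raphson, inverts the information matrix to estimate the covariance of the coefficients, and forms the quadratic form for the joint hypothesis that the group and interaction coefficients are zero. It uses only the standard library; for two degrees of freedom the chi-square upper-tail probability is exp(-x/2).

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_logit(X, y, max_iter=50, tol=1e-10):
    """Newton-Raphson ML fit; returns (beta, information matrix)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        step = solve(info, grad)
        if max(abs(s) for s in step) < tol:
            break  # converged: info was evaluated at the returned beta
        beta = [b + s for b, s in zip(beta, step)]
    return beta, info

def wald_dif_test(score, group, y):
    """Joint Wald test that the group and interaction coefficients are zero."""
    X = [[1.0, s, g, s * g] for s, g in zip(score, group)]
    beta, info = fit_logit(X, y)
    # Columns of the inverse information: estimated covariance of beta.
    cov = [solve(info, [float(i == j) for i in range(4)]) for j in range(4)]
    a, b, d = cov[2][2], cov[2][3], cov[3][3]
    b2, b3 = beta[2], beta[3]
    # Quadratic form (b2, b3) S^{-1} (b2, b3)' with S the 2x2 covariance block.
    stat = (d * b2 * b2 - 2.0 * b * b2 * b3 + a * b3 * b3) / (a * d - b * b)
    return stat, math.exp(-stat / 2.0)  # chi-square upper tail, df = 2

# Hypothetical data: standardized matching score, uniform DIF of -0.8 logits.
random.seed(1)
n = 200
score = [random.gauss(0.0, 1.0) for _ in range(n)]
group = [i % 2 for i in range(n)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(0.2 + s - 0.8 * g))) else 0
     for s, g in zip(score, group)]
stat, p_value = wald_dif_test(score, group, y)
print(round(stat, 3), round(p_value, 4))
```

In applied work a statistics package would normally supply the estimates and their covariance matrix; the hand-written solver here only keeps the sketch self-contained.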
The LRT
The LRT compares the fit of two competing models to test whether the observed difference in model fit of the two models is statistically significant. The two competing models are typically called the augmented model, which is a more complex one, and compact model, which is a less complex one. If the observed difference in model fit is statistically significant, the augmented model is preferred, whereas if the difference is not statistically significant, the compact model is preferred based on the principle of parsimony. The test statistic
where
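As a rough illustration of the LRT for DIF, the sketch below fits a compact model (intercept and matching score) and an augmented model (adding group membership and the score-by-group interaction) and refers twice the difference in log-likelihoods to the chi-square distribution with two degrees of freedom. The simulated data, coefficient values, and function names are hypothetical, and the matching score is assumed to be standardized; this is a sketch under those assumptions, not the implementation used in the study.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_logit(X, y, max_iter=50, tol=1e-8):
    """Newton-Raphson ML fit; returns (beta, maximized log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log1p(math.exp(eta))
    return beta, loglik

# Hypothetical data: group 0 = reference, 1 = focal, uniform DIF of -0.8 logits.
random.seed(1)
n = 200
score = [random.gauss(0.0, 1.0) for _ in range(n)]
group = [i % 2 for i in range(n)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-(0.2 + s - 0.8 * g))) else 0
     for s, g in zip(score, group)]

compact = [[1.0, s] for s in score]
augmented = [[1.0, s, g, s * g] for s, g in zip(score, group)]
_, ll_compact = fit_logit(compact, y)
_, ll_augmented = fit_logit(augmented, y)

lrt = 2.0 * (ll_augmented - ll_compact)
# For df = 2 the chi-square upper-tail probability is exp(-x / 2).
p_value = min(1.0, math.exp(-lrt / 2.0))
print(round(lrt, 3), round(p_value, 4))
```

Because the compact model is nested in the augmented model, the statistic is non-negative up to numerical error.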
The PLRT
The PML estimation was developed to address the issue of the finite sample bias of the ML estimation. The PML estimation removes the first-order bias from the ML estimation by using a penalized log-likelihood, which is the traditional log-likelihood plus a penalty term. In the PML estimation, the penalty is given to the deviation from a desired outcome, and therefore, it will pull or shrink the PML estimates away from the traditional ML estimates (Cole, Chu, & Greenland, 2014; Firth, 1993). In broad terms, the PML estimation is a form of regularized estimation, which improves the estimation using some form of additional information. From a Bayesian perspective, penalizing the likelihood corresponds to specifying a prior distribution, and the penalized log-likelihood can be regarded as the log of the posterior distribution of the parameter of interest, up to an additive constant. For exponential family models in the canonical parameterization, the PML estimation is equivalent to maximizing a likelihood that is penalized by the Jeffreys invariant prior (Firth, 1993; Heinze, 2006). The PML or Bayesian approach has been used to obtain more stable parameter estimates in item response theory (IRT; Mislevy, 1986; Swaminathan & Gifford, 1985). Recently, the PML estimation was used to obtain parameter estimates for the two-parameter logistic model (2PLM) in IRT with only 20 examinees, a sample size for which the traditional ML estimation may not be applicable (Paolino, 2013). Given the PML estimates, the PLRT compares the penalized likelihoods of two nested models.
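A minimal sketch of Firth-type PML estimation for a logistic regression may make the idea concrete. It assumes the commonly used modified-score iteration, in which each residual y - p is replaced by y - p + h(0.5 - p), with h the leverage of the observation; the function names and the toy data are hypothetical, and no step-halving or other safeguards are included. The example uses an intercept-only model in which all 10 responses are correct, so the ML estimate is infinite, whereas the PML estimate is finite and corresponds to the probability (10 + 0.5) / (10 + 1).

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def firth_logit(X, y, max_iter=200, tol=1e-10):
    """Penalized ML (Firth-type) estimates via the modified score equations."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
                 for xi in X]
        # Fisher information X'WX with W = diag(p(1 - p)).
        info = [[sum(p * (1.0 - p) * xi[a] * xi[c] for xi, p in zip(X, probs))
                 for c in range(k)] for a in range(k)]
        score = [0.0] * k
        for xi, yi, p in zip(X, y, probs):
            # Leverage h_i = w_i * x_i' (X'WX)^{-1} x_i.
            h = p * (1.0 - p) * sum(u * v for u, v in zip(xi, solve(info, xi)))
            for a in range(k):
                score[a] += (yi - p + h * (0.5 - p)) * xi[a]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy data: intercept-only model, 10 out of 10 correct responses.
X = [[1.0] for _ in range(10)]
y = [1] * 10
beta = firth_logit(X, y)
prob = 1.0 / (1.0 + math.exp(-beta[0]))
print(round(beta[0], 4), round(prob, 4))  # log(21) is about 3.0445; 10.5/11 is about 0.9545
```

The same function accepts a design matrix with score, group, and interaction columns. A PLRT would additionally require the penalized log-likelihood, that is, the log-likelihood plus half the log-determinant of the information matrix, evaluated for the two nested models.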
The BLRT
In general, the bootstrap method may be considered as an alternative to the asymptotic approaches when the validity of the asymptotic approximation is suspect (Davison & Hinkley, 1997; MacKinnon, 2009). In the bootstrap method, bootstrap samples of size
In the LRT, the
(a) Fit a compact model,
(b) Generate a bootstrap sample from the original data under the null hypothesis, and then calculate the
(c) Repeat Step (b)
(d) Calculate the bootstrap
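The steps above can be sketched as a parametric bootstrap. The sketch rests on several assumptions that are not taken from the study: bootstrap responses are simulated from the fitted compact model (so that the null hypothesis of no DIF holds), the bootstrap p-value is the proportion of bootstrap LRT statistics at least as large as the observed one, only 99 bootstrap samples are drawn to keep the run time short, and the data and function names are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def sigmoid(eta):
    """Logistic function with the argument clipped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, eta))))

def fit_logit(X, y, max_iter=25, tol=1e-6):
    """Newton-Raphson ML fit; returns (beta, maximized log-likelihood)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    loglik = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        loglik += math.log(p) if yi else math.log(1.0 - p)
    return beta, loglik

def lrt_stat(score, group, y):
    """LRT statistic for DIF and the fitted compact-model coefficients."""
    beta0, ll0 = fit_logit([[1.0, s] for s in score], y)
    _, ll1 = fit_logit([[1.0, s, g, s * g] for s, g in zip(score, group)], y)
    return 2.0 * (ll1 - ll0), beta0

def bootstrap_lrt(score, group, y, n_boot=99, seed=0):
    rng = random.Random(seed)
    observed, beta0 = lrt_stat(score, group, y)            # step (a)
    exceed = 0
    for _ in range(n_boot):                                # step (c)
        # Step (b): simulate responses from the compact (no-DIF) model.
        y_star = [1 if rng.random() < sigmoid(beta0[0] + beta0[1] * s) else 0
                  for s in score]
        stat, _ = lrt_stat(score, group, y_star)
        exceed += stat >= observed
    return observed, exceed / n_boot                       # step (d)

# Hypothetical data: 50 reference and 50 focal examinees, uniform DIF.
random.seed(2)
score = [random.gauss(0.0, 1.0) for _ in range(100)]
group = [i % 2 for i in range(100)]
y = [1 if random.random() < sigmoid(0.2 + s - 0.8 * g) else 0
     for s, g in zip(score, group)]
observed, p_boot = bootstrap_lrt(score, group, y)
print(round(observed, 3), round(p_boot, 3))
```

With so few bootstrap samples the p-value is coarse; a real analysis would use far more replications and a faster fitting routine.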
A Simulation Study
A Monte Carlo simulation in this study was designed to compare the performances of the aforementioned different statistical inferential methods for the LR procedure. The performances in small samples were particularly of interest because the asymptotic sampling distributions may not work well in small samples. The factors manipulated in this study were (a) the sample sizes of the reference and focal groups in DIF tests (50R/50F, 100R/100F, 150R/50F, 250R/250F, 450R/50F, and 500R/500F), (b) the effect sizes of DIF for a studied item based on the area between item response functions (0, 0.6, and 0.8), (c) the ability distributions of the focal group (
In this simulation study, the sample size was the key factor because the focus of this study was to compare the performances of different inferential approaches for the LR procedure in small samples. The definition of smallness varied across different studies. Fidalgo, Ferreres, and Muñiz (2004) examined the performance of the MH procedure in small samples with the sample sizes of 100, 150, 200, and 250. Parshall and Miller (1995) used 500R/25F, 500R/50F, 500R/100F, and 500R/200F to compare the performance of the exact and asymptotic MH procedures in small samples. The sample size requirements for Educational Testing Service (ETS) DIF analysis are at least 200 members in the smaller group and at least 500 in total (Zwick, 2012). In the present study, the performances of different inferential approaches were compared with the sample sizes of 50R/50F (total sample size
The number of items was fixed to 40. Among the 40 items, only one item was simulated as a DIF item. The amount of DIF in the studied item was induced following Swaminathan and Rogers (1990) and Jodoin and Gierl (2001). To induce DIF, the item parameters of the three-parameter logistic model (3PLM) for the reference and focal groups were chosen such that pre-specified areas between the item response functions for the two groups were obtained based on the formula given by Raju (1988). More specifically, to induce uniform DIF, the item discrimination (
where
where
Table 1. Item Parameters of the 3PLM for R and F Groups.
Note. 3PLM = three-parameter logistic model; R = reference; F = focal; DIF = differential item functioning; a = discrimination parameter; b = difficulty parameter; c = guessing parameter.
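To illustrate how an area between item response functions quantifies the size of DIF, the sketch below integrates the absolute difference between two 3PLM curves numerically. The item parameters are hypothetical and are not the values used in the study, and the scaling constant D = 1.7 is an assumption. For uniform DIF with equal discrimination and guessing parameters, the closed-form area is (1 - c) times the absolute difference in difficulty, which provides a check on the numerical result.

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PLM."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def unsigned_area(ref, foc, lo=-8.0, hi=8.0, steps=16000):
    """Trapezoidal approximation of the area between two item response functions."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        diff = abs(irf_3pl(theta, *ref) - irf_3pl(theta, *foc))
        total += diff * (0.5 if i in (0, steps) else 1.0)
    return total * h

# Hypothetical (a, b, c) values: same a and c, difficulty shifted by 0.75.
reference = (1.0, 0.0, 0.2)
focal = (1.0, 0.75, 0.2)
area = unsigned_area(reference, focal)
print(round(area, 3))  # closed form: (1 - 0.2) * 0.75 = 0.6
```

The same function can be applied to a pair of items with different discrimination parameters to approximate the area for non-uniform DIF.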
The ability distributions of the reference and focal groups were also manipulated in this study. The ability distribution of the reference group was modeled as
Results
The statistical power and type I error rate of the LR procedures based on four different inferential methods were calculated for each of the 72 simulation conditions. Tables 2 and 3 show the results for uniform and non-uniform DIF, respectively. In each table, the other simulation conditions, which are the sample size in reference (
Table 2. Statistical Power and Type I Error Rate of Different Inferential Methods for Uniform DIF.
Note.
Table 3. Statistical Power and Type I Error Rate of Different Inferential Methods for Non-Uniform DIF.
Note.
The type I error rate was calculated as the proportion of the replications that showed DIF when the effect size of DIF is zero. Bradley (1978) suggested liberal, moderate, and strict criteria of robustness. In Tables 2 and 3, the value of the type I error marked with *, **, and *** indicates that the type I error rate is liberally [0.025, 0.075], moderately [0.040, 0.060], and strictly [0.045, 0.055] robust based on the robustness criteria suggested by Bradley. All the four methods show strictly robust type I error rate in most of the cases when the sample sizes are 1,000 (
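The calculation of an empirical type I error rate and its classification against the three intervals quoted above can be sketched as follows. The function names and the p-values are hypothetical; the intervals are the ones stated in the text for a nominal level of 0.05.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications in which the null hypothesis is rejected."""
    return sum(p < alpha for p in p_values) / len(p_values)

def bradley_robustness(rate):
    """Strictest robustness criterion met by a type I error rate (nominal 0.05)."""
    bands = [("strict", 0.045, 0.055),
             ("moderate", 0.040, 0.060),
             ("liberal", 0.025, 0.075)]
    for label, lower, upper in bands:
        if lower <= rate <= upper:
            return label
    return "not robust"

# Hypothetical p-values from four replications generated with no DIF.
print(rejection_rate([0.01, 0.20, 0.50, 0.03]))  # 0.5
print(bradley_robustness(0.050))                 # strict
print(bradley_robustness(0.058))                 # moderate
print(bradley_robustness(0.070))                 # liberal
```

Applied to replications generated with a non-zero DIF effect, `rejection_rate` gives the empirical statistical power instead.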
The statistical power was calculated as the proportion of the replications that showed DIF when the effect sizes of DIF are 0.4 and 0.6. In Tables 2 and 3, the numbers highlighted with bold font represent the highest statistical power within each simulation condition. Similar to the case of the type I error, all the four methods yield similar statistical power when the sample sizes are 1,000 (
In addition to the comparisons among the four methods, several patterns can be identified across different simulation conditions in Tables 2 and 3. The statistical power of uniform DIF tests is higher than that of non-uniform DIF tests. The ability distributions of the reference and focal groups seem to influence the statistical power in opposite directions for uniform and non-uniform DIF. In uniform DIF, the statistical power seems to be slightly higher when the ability distributions are the same, whereas in non-uniform DIF, the statistical power seems to be slightly higher when the ability distributions are different. Given the same total sample size, the cases with the balanced sample sizes across reference and focal groups (
In Figure 1, the histograms and chi-square Q-Q plots of the LRT statistics calculated from the 10,000 bootstrap samples (i.e., empirical sampling distributions) are presented to compare the empirical and asymptotic sampling distributions for the sample sizes of

Figure 1. Histograms and chi-square quantile-quantile (Q-Q) plots.
Discussion
The ML estimates are only asymptotically unbiased and normally distributed. Therefore, there have been concerns about testing DIF using the LR procedure based on the asymptotic properties of the ML estimates when sample sizes are small (Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990). Because the null hypothesis of the DIF test in the LR procedure involves the regression coefficients from the LR, the potential finite sample bias of the ML estimates may degrade the performance of the LR procedure in small samples. Moreover, the potential deviation of the true sampling distribution from the assumed asymptotic chi-square distribution also may degrade the performance of the LR procedure. This study examined whether the LR procedure based on the asymptotic properties of the ML estimates still produces satisfactory statistical power and type I error in small samples, and also whether the LR procedures based on the PLRT or BLRT may be considered as alternatives.
The simulation results in this study indicate that the LRT, in which the LRT statistic comparing two likelihoods from the ML estimation is tested using the asymptotic chi-square distribution with two degrees of freedom, shows slightly better performance than the other methods in terms of statistical power, although the difference in performance does not appear to be large enough to matter for practical purposes. The robustness of the type I error rate seems to be similar in the LRT, PLRT, and BLRT. According to the results, the LR procedure based on the asymptotic properties of the ML estimation still seems to work well even in small samples, and therefore, the LR procedures based on the PLRT and BLRT may not need to be considered as alternatives.
At this point, it may be worthwhile to discuss why the LR procedures based on the PLRT and BLRT show slightly lower performance, even though the PML may reduce the finite sample bias and the bootstrap method may capture the potential deviation of the true sampling distribution in small samples. The PML originally was developed to remove the first-order term from the asymptotic bias of the ML estimates by modifying the score function (Firth, 1993). However, there is a trade-off between the bias and variance of the resulting PML estimates (Fan & Tang, 2013). Firth (1993) pointed out that the merit of bias reduction in any particular problem needs to be compared with any sacrifice in precision that might result. In our specific problem, it seems that the merit of reducing the bias of the PML estimates is not appreciable compared with the sacrifice in precision.
Bootstrap hypothesis tests, for their part, are very reliable in many cases (Davidson & MacKinnon, 1999; Nylund et al., 2007; Park, 2003). However, this is not true in every case. As Davidson and MacKinnon (2007) pointed out, when the results from the bootstrap hypothesis test and asymptotic test are similar, we can be fairly confident that the asymptotic test is reasonably accurate. In such a case, it might be more reasonable to use the asymptotic test considering the computational cost of the bootstrap hypothesis test. Figure 1 shows the similarity in the sampling distributions of the asymptotic test and the bootstrap hypothesis test in our specific problem. Although the empirical sampling distributions constructed from the bootstrap samples seem to have slightly heavier right tails compared with the theoretical chi-square distributions, the two distributions appear to be similar for practical purposes. The discrepancy between the two distributions seems to decrease as the sample size increases, according to the chi-square Q-Q plots in Figure 1. This result is reasonable because the asymptotic assumptions should work well in large samples.
One of the factors that can influence the performance of the bootstrap hypothesis tests is the number of bootstrap samples. It is known that the smaller the number of bootstrap samples, the less powerful the test (Jockel, 1986). The simulation results in this study also show that the bootstrap hypothesis test with 10,000 bootstrap samples performed slightly better than the one with 1,000 bootstrap samples in terms of statistical power. Another factor that can influence the performance of the bootstrap hypothesis tests is the data generation process for the bootstrap samples (MacKinnon, 2009). In this study, the original data were generated using the 3PLM in the IRT, whereas the bootstrap samples were generated using the LR equation with the coefficients satisfying the null hypothesis. The slightly lower performance of the BLRT may be due to this discrepancy in the data generation process. This study tested only a single data generation process, and different data generation processes may yield different results, which is a limitation of this study. However, considering the similarity in the sampling distributions of the asymptotic test and the bootstrap hypothesis test shown in Figure 1, different data generation processes are not expected to change the results of this study substantially.
In all, the LR procedure based on the asymptotic LRT seems to work well even in small samples. Although the results from the PLRT and BLRT were similar to the results from the asymptotic method in this study, the PML and bootstrap methods have outperformed the asymptotic method in cases where the asymptotic assumptions are suspect. Therefore, investigating the applicability of the PML and bootstrap methods in such cases would be an interesting topic for future research in the area of measurement.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
