Crossing SIBTEST (CSIB) is designed to detect crossing differential item functioning (DIF) as well as unidirectional DIF. A theoretical formula for the power of CSIB is derived based on the asymptotic distribution of the test statistic under the null and alternative hypotheses. The derived power formula provides insight into the factors that influence the power of CSIB, including DIF effect size, standard error, and sample size. The power formula and these factors are further discussed in the context of the item response theory (IRT) three-parameter logistic (3PL) model. Simulation results show that the theoretical power is consistent with the observed rejection rate. The power of CSIB is compared with that of the unidirectional SIBTEST in theory and through simulation.
In item response theory, an item with differential item functioning (DIF), or item bias, has different item characteristic curves (ICCs) for examinees from different groups (usually a reference group and a focal group). DIF describes the fact that examinees with the same ability may have different probabilities of answering the item correctly, depending on which group they come from. When the item favors one group over the other across the ability range of interest, the DIF is unidirectional (Figure 1A). When the item favors one group in one ability range and the other group in another ability range, the ICCs cross, and there is crossing DIF (Figure 1B). Some DIF procedures, such as the Mantel–Haenszel (MH) test (Holland & Thayer, 1988; Mantel & Haenszel, 1959) and SIBTEST (Shealy & Stout, 1993), are designed for detecting unidirectional DIF but not crossing DIF. Other DIF procedures, such as the logistic regression DIF test (Swaminathan & Rogers, 1990) and Crossing SIBTEST (Li & Stout, 1996), can handle crossing DIF.
Figure 1. Unidirectional DIF and crossing DIF: (A) ICCs for a unidirectional DIF item, (B) ICCs for a crossing DIF item, (C) difference in ICCs for a unidirectional DIF item, and (D) difference in ICCs for a crossing DIF item.
In this article, a theoretical formula for the power of Crossing SIBTEST (CSIB) is derived. A power formula for CSIB can help in planning a DIF study that uses the CSIB procedure by providing the means for sample size calculation. It also provides insight into the factors that influence the power of CSIB. The derivation of the power formula is based on the large-sample distribution of the test statistic under the null hypothesis and under the alternative hypothesis. In the power formula, three components determine the power of a statistical test: effect size, standard error, and critical value. Based on the power formula, the discussion focuses on factors, including item characteristics (difficulty, discrimination, guessing, etc.) and examinee characteristics (ability distribution), that may affect the power of CSIB, and on how they affect these three components. A comparison is also drawn with the unidirectional SIBTEST (USIB), for which the power formula was derived in Z. Li (2014). There are similarities between CSIB and USIB in the effect size and the standard error; however, the critical value for CSIB is more complicated. Although CSIB was designed for detecting crossing DIF, where USIB may lack power, CSIB can nonetheless be used for unidirectional DIF detection (Li & Stout, 1996). With the power formulas for CSIB and USIB, the question of whether it is justifiable to use CSIB in preference to USIB for unidirectional DIF can be answered. Recently, Chalmers (2018) provided a new development in crossing SIBTEST (CSIB2). This article focuses on the power formula for CSIB. For CSIB2, the issues related to the power calculation are discussed, but a full development of a power formula for CSIB2 is beyond the scope of this article.
The rest of the article is organized as follows. The CSIB procedure is introduced, and then the null and alternative distributions of the CSIB statistic are discussed. The power formula for CSIB test is then derived based on the asymptotic distribution of the CSIB statistic. The formula is then applied to the item response theory (IRT) model and the factors that affect the power are discussed. Simulation studies are conducted to compare the power values from the formula and the actual rejection rates observed in the simulation.
Crossing SIBTEST
According to Shealy and Stout (1993), SIBTEST defines the average amount of (unidirectional) DIF for a dichotomous studied item by
$$\beta_U = \int \left[P_R(\theta) - P_F(\theta)\right] f(\theta)\, d\theta, \tag{1}$$
where $P_R(\theta)$ and $P_F(\theta)$ are the ICCs for the reference and the focal groups, respectively, and $f(\theta)$ is the density function for the pooled latent distribution. It is the weighted total signed area between the two ICCs, where the weight is $f(\theta)$. Note that the area between the ICCs is a signed area: the part where the difference is above 0 is positive and the part where it is below 0 is negative. Although the signed area is useful for describing unidirectional DIF, it is not suited for crossing DIF because positive and negative areas may cancel out.
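In contrast, Crossing SIBTEST defines the DIF effect by the weighted absolute area between the ICCs (Li & Stout, 1996). Assuming there is one crossing point for the two ICCs at $\theta_c$, the crossing DIF size is
$$\beta_C = \int_{-\infty}^{\theta_c} \left[P_R(\theta) - P_F(\theta)\right] f(\theta)\, d\theta - \int_{\theta_c}^{\infty} \left[P_R(\theta) - P_F(\theta)\right] f(\theta)\, d\theta. \tag{2}$$
Note that $|\beta_C| = \int \left|P_R(\theta) - P_F(\theta)\right| f(\theta)\, d\theta \ge |\beta_U|$, where $\beta_U$ is the signed (unidirectional) DIF size defined above. Therefore, by definition, $\beta_C$ is guaranteed to have an absolute value no smaller than that of $\beta_U$. The two-sided test for DIF is considered, where the null and the alternative hypotheses are
$$H_0: \beta = 0 \quad \text{versus} \quad H_1: \beta \neq 0,$$
where $\beta$ is $\beta_C$ for CSIB and $\beta_U$ for USIB.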
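The cancellation of signed areas can be illustrated numerically. The following is a minimal sketch, not the author's code: it assumes a 3PL curve with scaling constant D = 1.7, a standard normal pooled ability density, and two hypothetical item pairs; the function names are illustrative only.

```python
import math

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve. D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def std_normal_pdf(theta):
    return math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)

def dif_areas(ref, foc, lo=-6.0, hi=6.0, n=4000):
    """Midpoint-rule estimates of the signed and absolute weighted areas
    between two ICCs; ref and foc are (a, b, c) tuples."""
    h = (hi - lo) / n
    signed = 0.0
    absolute = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        d = (icc_3pl(t, *ref) - icc_3pl(t, *foc)) * std_normal_pdf(t) * h
        signed += d
        absolute += abs(d)
    return signed, absolute

# Hypothetical items: equal difficulty but different discrimination
# (ICCs cross at theta = 0), and equal discrimination but different
# difficulty (ICCs never cross).
crossing = dif_areas((1.2, 0.0, 0.2), (0.7, 0.0, 0.2))
unidirectional = dif_areas((1.2, 0.0, 0.2), (1.2, 0.3, 0.2))
print("crossing item:       signed = %.4f, absolute = %.4f" % crossing)
print("unidirectional item: signed = %.4f, absolute = %.4f" % unidirectional)
```

For the crossing item the signed area is essentially zero while the absolute area is not, which is why a signed-area measure can miss crossing DIF; for the unidirectional item the two measures coincide.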
CSIB and USIB use the total score of the matching subtest (consisting of only non-DIF items) to stratify examinees in different ability groups, where examinees having the same matching score are considered to be in the same ability level. In each stratum, the difference in probability of correct response to the item between the reference and focal groups is estimated by difference in the proportions of correct response.
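USIB estimates the DIF effect by
$$\hat\beta_U = \sum_{k=0}^{n} \hat p_k \left(\bar Y^{*}_{Rk} - \bar Y^{*}_{Fk}\right), \tag{3}$$
where $\bar Y^{*}_{Rk}$ and $\bar Y^{*}_{Fk}$ are the adjusted average scores of the studied item(s) for the reference and the focal groups that achieve the same total matching score $k$, which ranges from 0 to the total number of items $n$, and $\hat p_k$ is the proportion of examinees in the stratum with matching score $k$. The average scores are adjusted by a regression correction method to remove the inflating effect of impact due to the measurement error introduced by using the matching score instead of the unknown true ability to match the examinees (Jiang & Stout, 1998; Shealy & Stout, 1993). For hypothesis testing, a standardized z-score statistic is formulated by dividing $\hat\beta_U$ by its standard error:
$$B_U = \frac{\hat\beta_U}{\hat\sigma(\hat\beta_U)}, \tag{4}$$
and the standard error is estimated by
$$\hat\sigma(\hat\beta_U) = \sqrt{\sum_{k=0}^{n} \hat p_k^{2} \left(\frac{s_{Rk}^{2}}{n_{Rk}} + \frac{s_{Fk}^{2}}{n_{Fk}}\right)}, \tag{5}$$
where $s_{Rk}^{2}$ and $s_{Fk}^{2}$ are the sample variances of the scores for examinees in stratum $k$ from the reference group and the focal group, and $n_{Rk}$ and $n_{Fk}$ are the corresponding sample sizes. For large sample size, the standard error estimate converges to the population parameter (Z. Li, 2014):
$$\sigma = \sqrt{\int \left[\frac{\sigma_R^{2}(\theta)}{N_R f_R(\theta)} + \frac{\sigma_F^{2}(\theta)}{N_F f_F(\theta)}\right] f^{2}(\theta)\, d\theta}, \tag{6}$$
where $\sigma_R^{2}(\theta)$ and $\sigma_F^{2}(\theta)$ are the population variances of the scores for examinees with latent trait value $\theta$; $N_R$ and $N_F$ are the total sample sizes for the reference and focal groups; and $f_R(\theta)$ and $f_F(\theta)$ are the densities of the latent trait distributions for the reference and focal groups, so the pooled density is $f(\theta) = \rho_R f_R(\theta) + \rho_F f_F(\theta)$, where $\rho_R = N_R/(N_R + N_F)$ and $\rho_F = N_F/(N_R + N_F)$. Under the null hypothesis of no DIF, $B_U$ has the standard normal distribution as its large-sample distribution: $B_U \sim N(0, 1)$. Therefore, the null hypothesis is rejected if $|B_U| > z_{1-\alpha/2}$, where the critical value $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution (e.g., for $\alpha = .05$, $z_{1-\alpha/2} = 1.96$).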
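The stratified estimate and its standard error can be sketched in a few lines. This is an illustration under stated assumptions rather than the DIF-Pack implementation: the regression correction is omitted (the stratum means are taken as already adjusted), the stratum weight is the pooled proportion of examinees, and the function name and input layout are hypothetical.

```python
import math

def usib_statistic(strata):
    """strata: one tuple per matching score,
    (mean_ref, mean_foc, var_ref, var_foc, n_ref, n_foc).
    Returns (beta_hat, se_hat, B). The regression correction used by
    SIBTEST is NOT applied here; means are assumed already adjusted."""
    total = float(sum(s[4] + s[5] for s in strata))
    beta_hat = 0.0
    var_hat = 0.0
    for mean_ref, mean_foc, var_ref, var_foc, n_ref, n_foc in strata:
        p_k = (n_ref + n_foc) / total          # stratum weight
        beta_hat += p_k * (mean_ref - mean_foc)
        var_hat += p_k ** 2 * (var_ref / n_ref + var_foc / n_foc)
    se_hat = math.sqrt(var_hat)
    return beta_hat, se_hat, beta_hat / se_hat

# Two hypothetical strata with equal group sizes.
example = [(0.6, 0.5, 0.24, 0.25, 100, 100),
           (0.8, 0.7, 0.16, 0.21, 100, 100)]
beta_hat, se_hat, b_stat = usib_statistic(example)
print("beta_hat = %.3f, se = %.4f, B = %.2f" % (beta_hat, se_hat, b_stat))
```

The resulting B would be compared with 1.96 for a two-sided test at the .05 level.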
In CSIB, the first step is to obtain an estimate $\hat k_c$ of the crossing score $k_c$, and after that the DIF effect is estimated by
$$\hat\beta_C = \sum_{k < \hat k_c} \hat p_k \left(\bar Y^{*}_{Rk} - \bar Y^{*}_{Fk}\right) - \sum_{k > \hat k_c} \hat p_k \left(\bar Y^{*}_{Rk} - \bar Y^{*}_{Fk}\right). \tag{7}$$
The CSIB statistic is given by
$$B_C = \frac{\hat\beta_C}{\hat\sigma(\hat\beta_C)}, \tag{8}$$
where $\hat\sigma$ is the same as in the USIB statistic. However, unlike in USIB, there is no easily derived large-sample distribution for the CSIB statistic under the null hypothesis, as pointed out in Li and Stout (1996). Therefore, randomization tests are used to obtain the p-value, and the null hypothesis of no DIF is rejected if the p-value is less than the level of the test, for example, $\alpha = .05$. If the null is rejected and the estimated crossing score is outside the range of the matching score, the DIF is further determined to be unidirectional, whereas a crossing score within the range indicates crossing DIF. In this way, both unidirectional and crossing DIF can be detected by CSIB. Note that the rule for rejecting the null hypothesis is determined by the null distribution of the CSIB statistic, which is studied next.
The Null and Alternative Distributions of the CSIB Statistic
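Let $d_k = \bar Y^{*}_{Rk} - \bar Y^{*}_{Fk}$. CSIB estimates the crossing point by fitting a line to the points $(k, d_k)$, $k = 0, \ldots, n$, by the weighted least squares method, where the weight $w_k$ is the sample size in each matching score group,
$$(\hat a, \hat b) = \arg\min_{a,\, b} \sum_{k=0}^{n} w_k \left(d_k - a - b k\right)^{2},$$
and the point at which the regression line crosses the x-axis is the estimated crossing point:
$$\hat k_c = -\hat a / \hat b.$$
For convenience, the nearest integer to $-\hat a/\hat b$ is used as the estimate of $k_c$, and if the crossing point is below 0 or above $n$, the estimate is set to 0 or $n$, respectively.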
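The crossing-score step can be sketched as a weighted least squares fit followed by rounding and clamping. This is a minimal illustration, assuming that matching scores run from 0 to the length of the input minus one and that every stratum has a positive count; the function name and the handling of a perfectly flat fit are choices of this sketch, not of the original procedure.

```python
def estimate_crossing_score(diffs, counts):
    """diffs[k]: reference-minus-focal difference at matching score k.
    counts[k]: number of examinees at matching score k (the WLS weight).
    Returns the estimated crossing score, clamped to [0, len(diffs) - 1]."""
    last = len(diffs) - 1
    total = float(sum(counts))
    k_bar = sum(w * k for k, w in enumerate(counts)) / total
    d_bar = sum(w * d for w, d in zip(counts, diffs)) / total
    sxx = sum(w * (k - k_bar) ** 2 for k, w in enumerate(counts))
    sxy = sum(w * (k - k_bar) * (d - d_bar)
              for k, (w, d) in enumerate(zip(counts, diffs)))
    if sxy == 0.0:
        # Flat fitted line: no crossing; returning the upper bound is an
        # arbitrary convention of this sketch.
        return last
    # Fitted line: d = d_bar + (sxy / sxx) * (k - k_bar); solve d = 0 for k.
    root = k_bar - d_bar * sxx / sxy
    root = min(max(root, -1.0), last + 1.0)   # guard against huge values
    # Note: Python's round() sends exact .5 cases to the nearest even integer.
    return min(max(round(root), 0), last)

# Hypothetical strata: differences change sign at score 4.
counts = [5, 10, 20, 40, 60, 80, 60, 40, 20, 10, 5]
diffs = [0.02 * (k - 4) for k in range(11)]
print(estimate_crossing_score(diffs, counts))
```

When the fitted line crosses zero outside the score range, the clamped value signals that the item is being treated as unidirectional.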
Figures 2A, 2C, and 2E show how this algorithm works for a non-DIF item, a crossing DIF item, and a unidirectional DIF item, respectively. For a non-DIF item, the fitted intercept and slope both have expectation 0, so the estimated crossing point is a ratio of two independent normal variables with mean 0, which results in a Cauchy random variable. Figure 2B shows the empirical distribution of the CSIB statistic obtained from 10,000 replicated datasets simulated for a non-DIF item. One can immediately see that the null distribution of the CSIB statistic is not the standard normal distribution but a bimodal distribution that is symmetric about 0. A mixture of two symmetric normal components is fitted to the simulated values, and the fitted density curve agrees well with the empirical density (i.e., the histogram). Based on this null distribution, the critical value for a Type I error rate $\alpha$ is the value beyond which the total two-sided tail area is $\alpha$. This critical value is larger than the one obtained under the standard normal distribution. For example, for $\alpha = .05$, the critical value obtained from the empirical null distribution in Figure 2B is 2.3, which is larger than the critical value 1.96 for $N(0, 1)$. The null distribution of the CSIB statistic does not have a closed form but can be simulated using the Monte Carlo method as long as the generating IRT model for the null model is known. Data simulated with typical parameter values for the three-parameter logistic (3PL) model show that the critical values for $\alpha = .05$ are close to 2.3.
Figure 2. Distributions of CSIB statistics for a non-DIF, a crossing DIF, and a unidirectional DIF item: (A) identification of the crossing point $k_c$ under the null hypothesis, (B) distribution of the CSIB statistic under the null hypothesis, (C) identification of the crossing point $k_c$ for a crossing DIF item, (D) distribution of the CSIB statistic for a crossing DIF item, (E) identification of the crossing point $k_c$ for a unidirectional DIF item, and (F) distribution of the CSIB statistic for a unidirectional DIF item.
As the CSIB statistic does not follow the standard normal distribution, Li and Stout (1996) used a randomization procedure to determine the significance of the CSIB test. In the randomization procedure, each difference score is multiplied by $+1$ or $-1$ at random, and a CSIB statistic is calculated from the values with randomly assigned signs. A randomization null distribution is constructed from the CSIB statistics calculated over many iterations of the randomization (Li and Stout recommended 1,000 iterations), and the randomization p-value is the proportion of iterations in which the absolute value of the statistic is greater than the value obtained from the original data. The null hypothesis is rejected if the randomization p-value is less than the user-specified level of significance, for example, .05. Simulation studies in Li and Stout (1996) showed that the randomization test controls the Type I error of the CSIB procedure. This is not surprising because, in essence, the randomization test constructs a null distribution for the CSIB statistic that approximates the true null distribution and identifies the critical value from it. Under the null hypothesis of no DIF, this randomization null distribution is equivalent to the empirical null distribution described in the previous paragraph; thus, the corresponding critical value for the CSIB statistic is the same. (However, as revealed later in the simulation study section, this equivalence does not hold under the alternative hypothesis, which causes the randomization test to lose some power.)
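Both ways of obtaining a reference distribution can be sketched on stratum-level summaries. The code below is a simplified stand-in, not the DIF-Pack algorithm: instead of simulating item responses from an IRT model, it assumes the stratum difference scores are independent normal variables with known sampling variances, uses one possible sign convention for the crossing statistic, and uses hypothetical stratum counts. All function names are illustrative.

```python
import math
import random

def crossing_score(diffs, counts):
    """Weighted least squares zero crossing, rounded and clamped."""
    last = len(diffs) - 1
    total = float(sum(counts))
    k_bar = sum(w * k for k, w in enumerate(counts)) / total
    d_bar = sum(w * d for w, d in zip(counts, diffs)) / total
    sxx = sum(w * (k - k_bar) ** 2 for k, w in enumerate(counts))
    sxy = sum(w * (k - k_bar) * (d - d_bar)
              for k, (w, d) in enumerate(zip(counts, diffs)))
    if sxy == 0.0:
        return last
    root = min(max(k_bar - d_bar * sxx / sxy, -1.0), last + 1.0)
    return min(max(round(root), 0), last)

def csib_statistic(diffs, counts, variances):
    """Crossing statistic: weighted differences below the crossing score
    minus those above it, divided by the USIB-type standard error.
    variances[k] is the sampling variance of diffs[k]."""
    total = float(sum(counts))
    k_c = crossing_score(diffs, counts)
    beta_c = 0.0
    var = 0.0
    for k, (w, d, v) in enumerate(zip(counts, diffs, variances)):
        p = w / total
        if k < k_c:
            beta_c += p * d
        elif k > k_c:
            beta_c -= p * d
        var += p * p * v
    return beta_c / math.sqrt(var)

def monte_carlo_critical_value(counts, variances, alpha=0.05, reps=4000, seed=1):
    """Upper-alpha quantile of |B| when every difference has mean 0."""
    rng = random.Random(seed)
    sds = [math.sqrt(v) for v in variances]
    stats = sorted(
        abs(csib_statistic([rng.gauss(0.0, s) for s in sds], counts, variances))
        for _ in range(reps))
    return stats[math.ceil((1.0 - alpha) * reps) - 1]

def randomization_p_value(diffs, counts, variances, n_iter=1000, seed=1):
    """Share of random sign flips whose |B| exceeds the observed |B|."""
    rng = random.Random(seed)
    observed = abs(csib_statistic(diffs, counts, variances))
    exceed = 0
    for _ in range(n_iter):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(csib_statistic(flipped, counts, variances)) > observed:
            exceed += 1
    return exceed / n_iter

# Hypothetical bell-shaped stratum counts over 21 matching scores, with an
# assumed item variance of 0.2 and equal group sizes within each stratum.
counts = [int(20 + 200 * math.exp(-0.5 * ((k - 10) / 4.0) ** 2)) for k in range(21)]
variances = [4 * 0.2 / w for w in counts]
print("Monte Carlo critical value:", monte_carlo_critical_value(counts, variances))
```

Because the crossing score is chosen from the same data that enter the statistic, the simulated critical value in this simplified setting comes out above the normal-theory 1.96, in line with the bimodal null distribution described above.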
Under the alternative hypothesis, the asymptotic distribution of the CSIB statistic for a crossing DIF item is a normal distribution with mean $\beta_C/\sigma$ and standard deviation 1 (Figure 2D). For a unidirectional DIF item, the asymptotic distribution of the CSIB statistic is a normal distribution with mean $\beta_U/\sigma$ and standard deviation 1 (Figure 2F).
Power Formula for Crossing SIBTEST
For the two-sided test of the hypotheses $H_0: \beta_C = 0$ versus $H_1: \beta_C \neq 0$, the null hypothesis is rejected if the CSIB statistic $B_C$ is in the rejection region $|B_C| > z^{*}$, where $z^{*}$ is the critical value. By definition, the Type I error is the probability of rejecting the null given that the null hypothesis is true, $\alpha = P(|B_C| > z^{*} \mid H_0)$, and the power is the probability of rejecting the null given that the alternative hypothesis is true, $P(|B_C| > z^{*} \mid H_1)$. Under the alternative hypothesis, the CSIB statistic asymptotically follows a normal distribution, $B_C \sim N(\beta_C/\sigma, 1)$, so the power is
$$\text{Power} = 1 - \Phi\!\left(z^{*} - \frac{\beta_C}{\sigma}\right) + \Phi\!\left(-z^{*} - \frac{\beta_C}{\sigma}\right), \tag{9}$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. Figure 3 illustrates the Type I error and power for the CSIB statistic. The tail areas formed by the rejection region under the null distribution are symmetric about zero because the null distribution is symmetric, and their total area is the Type I error $\alpha$. The same rejection region forms tail areas under the alternative distribution that are not symmetric, and their total area is the power. When the DIF size is large, the tail area opposite to the sign of $\beta_C$ is very small and can often be ignored.
Figure 3. Null and alternative distributions for the CSIB statistic.
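The two-tail power calculation described above can be written as a short function. This sketch assumes the statistic is normal with mean equal to the effect size divided by the standard error and unit variance under the alternative; the default critical value 2.3 is the empirical value reported in the text for the .05 level, and the function name is hypothetical.

```python
from statistics import NormalDist

def csib_power(beta, sigma, crit=2.3):
    """Probability that |B| exceeds crit when B ~ N(beta / sigma, 1).
    crit = 2.3 is the empirical CSIB cutoff quoted in the text for
    alpha = .05; pass crit=1.96 for the normal-theory USIB cutoff."""
    phi = NormalDist().cdf
    shift = beta / sigma
    upper_tail = 1.0 - phi(crit - shift)
    lower_tail = phi(-crit - shift)
    return upper_tail + lower_tail

# Hypothetical effect size and standard error.
print("CSIB cutoff 2.3 :", round(csib_power(0.06, 0.02), 3))
print("USIB cutoff 1.96:", round(csib_power(0.06, 0.02, crit=1.96), 3))
```

Note that the function describes the alternative distribution only; plugging in a zero effect with the 2.3 cutoff does not return .05, because the null distribution of the CSIB statistic is not standard normal.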
The following equation is derived by substituting the values for and as shown in Equations (2) and (6)
In the case that is the score of a dichotomous item, or , where is the IRF for group ; and , where . Also, let be the total sample size of the combined groups, and and be the sample proportions for the reference and focal groups. Then the power formula for CSIB with a single studied dichotomous item is
The power formula for CSIB is similar to the formula for USIB. In Equation (9), if the crossing effect size is replaced by the unidirectional effect size, and the CSIB critical value by the critical value for the standard normal distribution, the power formula for SIBTEST is obtained (Z. Li, 2014).
Factors That Influence CSIB Power
From the power formula, it can be seen that the power is determined by DIF effect size, standard error and critical value:
1. The size of the DIF effect $\beta_C$.
2. The standard error parameter $\sigma$, which is determined by the following factors:
a. The total sample size $N$;
b. The proportions of the focal group and the reference group, $\rho_F$ and $\rho_R$, in the sample;
c. The distributions of the latent trait, $f_R(\theta)$ and $f_F(\theta)$;
d. The IRFs of the studied item, $P_R(\theta)$ and $P_F(\theta)$.
3. The Type I error rate $\alpha$, which determines the critical value.
Of these factors, only the sample size and the group proportions can be manipulated. The other factors are item characteristics or examinee characteristics that are considered fixed once the studied item and the target populations are determined. The power always increases as the total sample size increases. If the total sample size is fixed, a balanced design with $N_R = N_F$ will always have larger power than an unbalanced design. If one of the group sizes, for example, $N_R$, is fixed, increasing the other group size will always increase the power because the total sample size is larger.
Note that in deriving the CSIB power formula (Equation 11), no assumption was made about the form of the item response functions, $P_R(\theta)$ and $P_F(\theta)$, or the latent trait distributions, $f_R(\theta)$ and $f_F(\theta)$. Therefore, the formula is applicable to any parametric or nonparametric IRT model and latent trait distribution. In general, the integrals in the formula do not have a closed form but can always be evaluated numerically. The numerical solution is adequate for power and sample size calculations when planning a DIF study. However, it gives little insight into how the effect size, the standard error, and the power are influenced by the item parameters and person parameters in an IRT model. Next, the IRT 3PL model is considered, for which a closed-form approximate formula can be derived and such insight can be gained.
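A numerical power and sample size calculation of the kind mentioned above might look as follows. This is a sketch under explicit assumptions, not the article's formula verbatim: both groups share a standard normal ability density, the item follows a 3PL model with D = 1.7, the effect size is taken as the absolute weighted area between the ICCs, the squared standard error is taken as v_R/N_R + v_F/N_F with v_g the integral of P_g(1 - P_g) against the density, and the cutoff 2.3 is the empirical value quoted in the text. Item parameters and function names are hypothetical.

```python
import math
from statistics import NormalDist

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL item characteristic curve (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def integrate_terms(ref, foc, lo=-6.0, hi=6.0, n=4000):
    """Return (beta, v_ref, v_foc): the absolute weighted area between the
    ICCs and the two variance integrals, under a shared N(0, 1) density."""
    h = (hi - lo) / n
    beta = v_ref = v_foc = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        w = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi) * h
        p_r = icc_3pl(t, *ref)
        p_f = icc_3pl(t, *foc)
        beta += abs(p_r - p_f) * w
        v_ref += p_r * (1.0 - p_r) * w
        v_foc += p_f * (1.0 - p_f) * w
    return beta, v_ref, v_foc

def power_from_terms(terms, n_ref, n_foc, crit=2.3):
    """Two-tail normal power for the given group sizes."""
    beta, v_ref, v_foc = terms
    sigma = math.sqrt(v_ref / n_ref + v_foc / n_foc)
    shift = beta / sigma
    phi = NormalDist().cdf
    return 1.0 - phi(crit - shift) + phi(-crit - shift)

def n_per_group_for_power(terms, target=0.8, crit=2.3, step=50, n_max=100000):
    """Smallest balanced group size (in multiples of step) reaching target."""
    for n in range(step, n_max + 1, step):
        if power_from_terms(terms, n, n, crit) >= target:
            return n
    return None

# Hypothetical crossing DIF item: (a, b, c) for reference and focal groups.
terms = integrate_terms((1.2, 0.0, 0.2), (0.7, 0.0, 0.2))
print("power at 1,000 per group:", round(power_from_terms(terms, 1000, 1000), 3))
print("group size for 80% power:", n_per_group_for_power(terms))
```

Because the integrals do not depend on the sample sizes, they are computed once and reused while searching over group sizes.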
Closed-Form Formulas Under IRT 3PL Model With
Consider the IRT 3PL model for dichotomous items where the IRF is given by
where $\theta$ is the latent trait, $a$ is the discrimination parameter, $b$ is the difficulty parameter, and $c$ is the guessing parameter. Assuming that the guessing parameters are the same for the two groups, $c_R = c_F$, the ICCs for the two groups cross at
$$\theta_c = \frac{a_R b_R - a_F b_F}{a_R - a_F}.$$
When there is no difference in the discrimination parameter but a difference in the difficulty parameter, the ICCs do not intersect and the DIF is uniform. It is also possible that, even when there is a difference in the discrimination parameter, the intersection point lies outside the range of interest of the latent trait distribution, so the DIF is essentially unidirectional rather than crossing. If it is further assumed that the two groups have the same normal latent trait distribution with a common mean and standard deviation, the effect size can be shown (proof in online Supplemental Appendix A) to have a closed-form approximate formula
where
In the formula, and are the item parameters for the reference group, and and are differences of the item parameters between the focal and the reference groups.
From Equation (14), one can see how the DIF effect size is influenced by the IRT parameters. The presence of the guessing parameter $c$ reduces the DIF effect size by a factor of $1 - c$. The crossing DIF effect can be decomposed into the two parts in the bracket: the uniform DIF, which is proportional to the difference in difficulty, and the nonuniform DIF, which is proportional to the relative difference in discrimination. The DIF effect is also influenced by the relative location of the item difficulty and the latent distribution. The effect size is large when an item’s difficulty is close to the mean ability of the population, that is, when the distance between the two is small. When an item is too easy or too difficult relative to the target population, the effect size becomes very small and DIF is hard to detect.
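When the reference and focal groups have the same latent trait distribution, the standard error can be approximated by its value evaluated under the null hypothesis:
$$\sigma \approx \sqrt{\frac{1}{N \rho_R \rho_F} \int P(\theta)\left[1 - P(\theta)\right] f(\theta)\, d\theta},$$
where $N$ is the total sample size, $\rho_R$ and $\rho_F$ are the proportions of the reference and focal groups, and $P(\theta)$ is the common IRF under the null hypothesis.
Besides the sample size $N$ and the proportions $\rho_R$ and $\rho_F$, other factors that influence the standard error are reflected in the integral $\int P(\theta)\left[1 - P(\theta)\right] f(\theta)\, d\theta$. The integrand has two parts: $P(\theta)\left[1 - P(\theta)\right]$ is related to the item characteristics, and $f(\theta)$ is the latent trait distribution.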
For the 3PL model with item parameters , , and , and assuming , a closed-form approximation for the standard error has been derived (see online Supplemental Appendix):
If the closed-form formulas for the effect size and the standard error are substituted into Equation (9), a closed-form power formula can be derived.
Power Formulas for Decision Rule Based on DIF Effect Size
When the p-value is less than .05, the null hypothesis is rejected and the item is declared to have DIF. However, when the sample size is very large, the test may give a small p-value even if the observed DIF size is small and of no practical significance. In practice, when deciding whether an item has DIF, it is often useful to also consider the actual DIF effect size $\hat\beta$. For example, Dorans and Kulick (1986) developed the standard p-difference, which can be considered a special case of the SIBTEST statistic. The standard p-difference uses a decision rule based only on the DIF effect size: an item is declared a DIF item if the DIF effect size is greater than some prespecified value.
Consider the decision rule that declares a DIF item if $|\hat\beta| > \beta_0$, where $\beta_0$ is the prespecified critical value for $\hat\beta$ (e.g., $\beta_0 = 0.05$). As the asymptotic distribution of $\hat\beta$ is $N(\beta, \sigma^{2})$, the power is derived by calculating the probability of rejecting the null hypothesis:
$$\text{Power} = 1 - \Phi\!\left(\frac{\beta_0 - \beta}{\sigma}\right) + \Phi\!\left(\frac{-\beta_0 - \beta}{\sigma}\right).$$
The formulas for $\beta$ and $\sigma$ (Equations 2 and 6) can then be substituted to calculate the power.
The power depends on the sample size only through the standard error $\sigma$, which is inversely proportional to $\sqrt{N}$. Consider the two decision rules: (a) reject the null if $|B| > z^{*}$, or equivalently $|\hat\beta| > z^{*}\sigma$, where $z^{*}$ is the critical value for the test statistic, and (b) reject the null if $|\hat\beta| > \beta_0$ with $\beta_0 = 0.05$. When the sample size is very large and the population DIF size $\beta$ is positive, even if $\beta$ is small, the power for decision rule (a) will be close to 1 and the null is almost always rejected. On the other hand, when the population value $\beta$ is less than 0.05, the power for decision rule (b) will be close to 0 for a very large sample size, which means the null is almost never rejected.
It is advisable that practitioners base decision rules on both the p-value and the DIF effect size to accommodate both statistical significance and practical significance. For the decision rule that requires both $|B| > z^{*}$ and $|\hat\beta| > \beta_0$, the power formula is
$$\text{Power} = 1 - \Phi\!\left(\frac{\max(z^{*}\sigma,\, \beta_0) - \beta}{\sigma}\right) + \Phi\!\left(\frac{-\max(z^{*}\sigma,\, \beta_0) - \beta}{\sigma}\right).$$
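The behavior of the three decision rules for very large samples can be checked with a small sketch. It assumes the effect size estimate is normal with mean equal to the population effect and standard deviation equal to the standard error, treats the estimated standard error as equal to its population value (so that the combined rule reduces to a single cutoff equal to the larger of the two thresholds), and uses 2.3 and 0.05 as illustrative cutoffs. Function names are hypothetical.

```python
from statistics import NormalDist

def two_sided_power(beta, sigma, cutoff):
    """P(|beta_hat| > cutoff) when beta_hat ~ N(beta, sigma^2)."""
    phi = NormalDist().cdf
    return 1.0 - phi((cutoff - beta) / sigma) + phi((-cutoff - beta) / sigma)

def power_p_value_rule(beta, sigma, crit=2.3):
    """Rule (a): reject when |B| > crit, i.e. |beta_hat| > crit * sigma."""
    return two_sided_power(beta, sigma, crit * sigma)

def power_effect_size_rule(beta, sigma, beta0=0.05):
    """Rule (b): reject when |beta_hat| > beta0."""
    return two_sided_power(beta, sigma, beta0)

def power_combined_rule(beta, sigma, crit=2.3, beta0=0.05):
    """Both conditions must hold, so the binding cutoff is the larger one."""
    return two_sided_power(beta, sigma, max(crit * sigma, beta0))

# A tiny standard error mimics a very large sample.
for beta in (0.03, 0.08):
    print(beta,
          round(power_p_value_rule(beta, 0.001), 3),
          round(power_effect_size_rule(beta, 0.001), 3),
          round(power_combined_rule(beta, 0.001), 3))
```

With a population effect below the 0.05 threshold, the p-value rule rejects almost always while the effect-size and combined rules almost never do; with an effect above the threshold, all three rules reject almost always.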
Simulation Study
Simulation Setup
In the simulation study, a test form of 75 items is considered. All the items follow the IRT 3PL model with guessing parameter . Items 1 to 25 are items having no DIF , Items 26 to 50 are items having crossing DIF , and Items 51 to 75 are items having unidirectional DIF . For each item, the reference group discrimination parameter value was chosen by a positive random draw from a normal distribution with mean 1.2 and variance 0.1 (i.e., SD = 0.32); the reference group difficulty parameter was drawn from a normal distribution with mean 0 and standard deviation 1. For non-DIF Items 1 to 25, the focal group difficulty parameter was set equal to , and the discrimination parameter was set equal to . For crossing DIF Items 26 to 50, was set equal to , where was chosen from the following five values: 0.1, 0.2, 0.3, 0.5, 1, with five items per value. For unidirectional DIF items 51 to 75, was set equal to , where is chosen from the following five values: −0.1, 0.1, −0.2, 0.2, −0.3, with five items per value. The item parameter values were recorded and the same items were used in 10,000 replicated simulation runs. In each run, a sample of examinees from the reference group and examinees from the focal group were simulated by drawing their ability parameter from a distribution. Then for each examinee in the reference and focal groups, the dichotomous response to each of the 75 items was simulated from the IRT 3PL model. The simulated dataset was then analyzed by CSIB to detect DIF in each item, where the total score of the first 25 non-DIF items was used as the matching variable. Two ways of deciding statistical significance with level 0.05 were conducted and results compared: (a) randomization test as described in Li and Stout (1996), and rejecting the null if ; (b) rejecting the null if , the cutoff value as suggested by the empirical null distribution. In the analysis, USIB were also conducted and results were reported alongside the CSIB results. 
The decision rule for USIB was to reject the null if $|B| > 1.96$. Based on the 10,000 replicated simulation runs, the rejection rate (the number of rejections of the null divided by 10,000) was summarized for each item. This study used the source code for Crossing SIBTEST and SIBTEST written by the original authors of Li and Stout (1996), Shealy and Stout (1993), and Jiang and Stout (1998), which was freely available for public download (DIF-Pack, n.d.); the source code, written in FORTRAN, was linked to R (R Core Team, 2017), and the simulation study and analysis were conducted in R.
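The data-generation part of such a simulation can be sketched as follows. The study itself used FORTRAN code linked to R; this Python sketch is only an illustration. It assumes a 3PL model with D = 1.7, a guessing value of 0.2, and standard normal abilities, none of which should be read as the study's exact settings, and the function names are hypothetical.

```python
import math
import random

def icc_3pl(theta, a, b, c, D=1.7):
    """3PL response probability (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def draw_reference_items(n_items, rng, c=0.2):
    """Discrimination: positive draws from N(1.2, variance 0.1);
    difficulty: N(0, 1). The guessing value c = 0.2 is an assumption."""
    items = []
    while len(items) < n_items:
        a = rng.gauss(1.2, math.sqrt(0.1))
        if a > 0:
            items.append((a, rng.gauss(0.0, 1.0), c))
    return items

def simulate_responses(n_examinees, items, rng):
    """One row of 0/1 responses per examinee; abilities assumed N(0, 1)."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < icc_3pl(theta, a, b, c) else 0
                     for (a, b, c) in items])
    return data

rng = random.Random(2019)
matching_items = draw_reference_items(25, rng)      # non-DIF matching subtest
reference = simulate_responses(1000, matching_items, rng)
focal = simulate_responses(1000, matching_items, rng)
matching_scores = [sum(row) for row in reference]   # stratification variable
print("matching score range:", min(matching_scores), "to", max(matching_scores))
```

For a DIF item, the focal group would be simulated with its own item parameters while the matching subtest keeps identical parameters in both groups.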
Simulation Results
For Items 1 to 25 with no DIF, the rejection rate is an estimate of the Type I error rate, which is expected to be close to 0.05. For the DIF Items 26 to 50 (crossing DIF) and 51 to 75 (unidirectional DIF), the rejection rate is an estimate of the true power, which is used to check the validity of the power formulas. Figure 4A shows that the rejection rates for the non-DIF items (Type I error) are close to or below 0.05, which indicates that the Type I error is well controlled by both the randomization test and the $|B| > 2.3$ rule. Figure 4B shows the rejection rates, that is, the power, of CSIB for each of the crossing DIF and unidirectional DIF items. Notably, the $|B| > 2.3$ rule consistently has higher power than the randomization test for those DIF items. This suggests that the randomization procedure loses some power for DIF items compared with the rejection rule based on the Monte Carlo null distribution. A closer examination of the null distribution constructed by the randomization procedure on a DIF item reveals that it has longer tails (i.e., larger variability) than the Monte Carlo null distribution simulated under the null. The rejection region defined by the randomization reference distribution (beyond its 97.5th percentile for $\alpha = .05$) therefore lies outside the cutoff value of 2.3, so the null is less likely to be rejected. The difference between the randomization null distribution and the Monte Carlo null distribution disappears when the randomization is conducted on a non-DIF item. Recall that in the CSIB randomization procedure, randomization samples are generated by randomly changing the signs of the difference scores. For a non-DIF item, the randomization does not change the variability of the estimates in the randomization samples, as given by Equation (5), which can be considered “within-matching-score” variability, since each difference score already has an expected value of 0. For a DIF item, however, the randomization introduces additional “between-matching-score” variability: the trend along the matching scores (i.e., the expected value of the difference scores) is non-zero, but the values are now randomly shuffled above and below 0, and this part is not accounted for by the “within-matching-score” variability in Equation (5). This explains why the randomization distribution obtained from a DIF item has longer tails than the true null distribution.
Figure 4. Observed rejection rate by CSIB in simulation data: (A) CSIB Type I error and (B) CSIB power.
For the DIF items, the mean of the estimated DIF effect size over the 10,000 replicates is compared with the theoretical value of $\beta_C$ calculated from Equation (2), and there is good agreement (Figure 5A). Theoretical values of the power calculated from Equation (11) are compared with the observed rejection rates for the randomization test (Figure 5B) and the $|B| > 2.3$ rule (Figure 5C). The observed rejection rates for the randomization test are slightly lower than the theoretical power values, which is due to the additional variability introduced by randomization that causes the loss of power. In contrast, the CSIB test based on the $|B| > 2.3$ rule has rejection rates close to the theoretical values given by the power formula.
Figure 5. Comparisons of Monte Carlo results with theoretical values in the CSIB test: (A) DIF effect size, (B) CSIB power (randomization test), and (C) CSIB power ($|B| > 2.3$ rule).
A Comparison Between CSIB and USIB
The simulation results show that CSIB is capable of detecting both crossing DIF and unidirectional DIF. Figure 6 shows the difference between the rejection rates of CSIB and USIB on the DIF items. For most of the items with crossing DIF (Items 26–50), CSIB has higher power than USIB, and in many cases USIB lacks the power to detect the DIF. For the items with unidirectional DIF (Items 51–75), both CSIB and USIB have power to detect the DIF, but the power of CSIB is consistently lower than that of USIB. In the case of crossing DIF, CSIB has a larger effect size β than USIB because it accounts for the unsigned total area between the two ICCs, which leads to larger power than USIB for detecting the DIF. In the case of unidirectional DIF, CSIB and USIB have the same population parameter for the effect size and the B statistic, but the critical value for CSIB is larger than that for USIB, which makes the probability of rejecting the null smaller for CSIB.
Figure 6. A comparison of rejection rates: CSIB versus USIB.
Discussion
In this article, a formula for calculating the power of the Crossing SIBTEST procedure is derived. The derivation is based on the asymptotic distributions of the CSIB statistic under the null and alternative hypotheses, assuming the sample size is large and the test length is long. A comparison of the theoretical power from the formula and the observed rejection rate in the simulation study with moderate length (25 matching items) and moderate sample size (1,000 for each group) showed that the formula provides a good approximation to the power of CSIB. The factors that influence the power of the CSIB statistic in general and under a specific IRT 3PL model have been discussed. Power formulas for decision rules based on DIF effect size have been derived in a similar fashion, and the formulas clearly explain why the introduction of the effect size based rule helps the robustness in DIF detection when the sample size is extremely large.
The power formula helps explain the difference between CSIB and USIB. CSIB improves the power of DIF detection when there is crossing DIF by estimating the crossing DIF effect $\beta_C$, which is larger in absolute value than the unidirectional DIF effect $\beta_U$. This improvement comes at the price of less power than USIB for unidirectional DIF. Because the null distribution for CSIB is more variable than the null distribution for USIB, the critical value for CSIB has to be larger than that for USIB to achieve the same Type I error level, which results in less power for detecting unidirectional DIF with CSIB. A combined procedure using both CSIB and USIB is possible, but care should be taken to control the Type I error (see the online Supplemental Appendix B for a discussion of the combined procedure).
There is no closed form for the null distribution of the CSIB statistic, so Li and Stout (1996) used a randomization test to calculate the p-value. The reference distribution constructed by the randomization test tends to have longer tails than the null distribution from direct simulation. This leads to a slight loss of power for the randomization test compared with the value from the power formula. One possible solution to this problem is to use the empirical null distribution from Monte Carlo simulation to calculate the p-value, or equivalently, to use the cutoff value for the CSIB test statistic (e.g., 2.3 for the .05 level) from the empirical null distribution.
The power formulas developed in this article are based on asymptotic distributions where large sample size and long test are assumed. It is not expected the formulas will work for short tests (e.g., test with five matching items).
Chalmers (2018) introduced new ideas in the crossing SIBTEST method and proposed a new method (CSIB2). In CSIB2, the p-value is calculated conditional on the location of the estimated crossing point: (a) when it is within the range of the matching score, suggesting crossing DIF, the statistic is calculated separately in the two regions below and above the crossing point, the sum of the two squared statistics is used as the test statistic, and the chi-square distribution with 2 degrees of freedom, $\chi^2_2$, is used as the null distribution to obtain the p-value; (b) when it is outside the score range, suggesting unidirectional DIF, the statistic is calculated over the whole region, the squared statistic is used as the test statistic, and $\chi^2_1$ is used as its null distribution to obtain the p-value. For a level .05 test, the rejection region is therefore (a) test statistic values above 5.99 in the first case and (b) values above 3.84 in the second case. By using $\chi^2_2$ and $\chi^2_1$ as the null distributions in the two conditions, CSIB2 appears to avoid the difficulty with the null distribution of the CSIB statistic, so the randomization test is no longer needed to calculate the p-value. However, the simulation study (in online Supplemental Appendix C) shows that the empirical null distribution of the CSIB2 test statistic is not $\chi^2_2$ or $\chi^2_1$ in those two conditions. Interestingly, despite the discrepancy between the nominal null distribution and the empirical null distribution, the empirical Type I error of CSIB2 is close to 0.05. Some explanations for this are given based on observations in simulation data (see online Supplemental Appendix C), but further investigation is needed to understand the null distribution of CSIB2, and the development of a power formula for CSIB2 remains possible.
Supplemental Material
Supplemental material, Supplemantary_Appendix for The Power of Crossing SIBTEST by Zhushan Li in Applied Psychological Measurement
Acknowledgements
The author thanks two anonymous reviewers and Associate Editor Dan Bolt for constructive comments and suggestions that helped improve the work.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD
Zhushan Li
Supplemental Material
Supplemental material for this article is available online.
References
Chalmers, R. P. (2018). Improving the crossing-SIBTEST statistic for detecting non-uniform DIF. Psychometrika, 83(2), 376–386.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129–145). Lawrence Erlbaum.
Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational Statistics, 23(4), 291–322.
Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647–677.
Li, Z. (2014). A power formula for the SIBTEST procedure for differential item functioning. Applied Psychological Measurement, 38(4), 311–328.
Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22(4), 719–748.
R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159–194.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370.