A simulation study was conducted to investigate the heuristics of the SIBTEST procedure and how they compare with the ETS classification guidelines used with the Mantel–Haenszel procedure. Prior heuristics have been used for nearly 25 years, but they are based on a simulation study that was restricted by computer limitations and that modeled item parameters from estimates of ACT and ASVAB tests from 1987 and 1984, respectively. Further, the suggested heuristics for data fitting a two-parameter logistic (2PL) model have essentially gone unused since their original presentation. This simulation study incorporates a wide range of data conditions to recommend heuristics for both 2PL and three-parameter logistic (3PL) data that correspond with ETS’s Mantel–Haenszel heuristics. Levels of agreement between the new SIBTEST heuristics and Mantel–Haenszel heuristics were similar to those for the prior SIBTEST heuristics for 2PL data and higher for 3PL data. The new recommendations provide higher true-positive rates for 2PL data. Conversely, they displayed decreased true-positive rates for 3PL data. False-positive rates, overall, remained below the level of significance for the new heuristics. Unequal group sizes resulted in slightly larger false-positive rates than balanced designs for both prior and new SIBTEST heuristics: with unbalanced designs, rates were below the alpha level when ability distributions were equal and slightly above it when ability distributions were unequal.
This article presents results from a simulation study investigating the heuristics for the SIBTEST (Shealy & Stout, 1993) procedure that correspond to the ETS differential item functioning heuristics for the Mantel–Haenszel (MH) procedure as modified by Holland and Thayer (1986). Differential item functioning (DIF) is the term used when individuals from different groups (e.g., gender, ethnicity, educational level, geographical location) who have the same ability level respond differently to an item (Elliot et al., 2016; Hambleton & Swaminathan, 1985). SIBTEST and Mantel–Haenszel are two procedures commonly used when studying bias at the item level. The original SIBTEST article and procedure for DIF detection (Shealy & Stout, 1993) were innovative for their time and included data modeled using two standardized assessments. However, the computing power available at the time may have limited the conditions and number of replications that were feasible. Further, there may have been some confusion in the selection of heuristics from the original article when values were adapted to changing ETS classifications (Roussos & Stout, 1996; Shealy & Stout, 1993). Last, a newer calculation has been implemented since SIBTEST’s initial introduction (Jiang & Stout, 1998) that could improve alignment with Mantel–Haenszel criteria. Therefore, it was decided that a study to revisit the associated dichotomous heuristics that correspond with ETS DIF classifications was warranted. Many different terms (e.g., cutoffs, cut-scores, criteria) have been used to describe the MH and SIBTEST values used for determining the magnitude of DIF present. When professional judgment is used to determine a minimum score for inclusion or exclusion, the term heuristic is commonly used. Therefore, this study will use the term heuristic when discussing the values used for classifying the magnitude of DIF.
In addition, this study will use the term criteria to indicate the combined use of a DIF procedure’s hypothesis test and heuristics for DIF classification.
Having the ability to detect whether items display bias against a specific subgroup is important in developing a test that is equally valid for all participants within a population. With standardized achievement testing affecting many aspects of individuals’ lives, having a valid way of determining whether items function the same way across subgroups is vital. Item bias has been investigated for decades, and there has been substantial work in determining statistical procedures with significance testing and providing a corresponding classification of the magnitude of bias in dichotomous items (e.g., Holland & Thayer, 1986; Jodoin & Gierl, 2001; Shealy & Stout, 1993; Zumbo, 1999). This study adds to the body of DIF research by reanalyzing the heuristic mapping between SIBTEST’s beta-uni ($\hat{\beta}_{uni}$) and delta Mantel–Haenszel ($\Delta_{MH}$), both of which are elaborated in the following sections. The delta Mantel–Haenszel procedure has been considered the industry standard in educational testing (Roussos et al., 1999; Zwick, 2012); however, it has been shown that under non-Rasch models and when ability distribution variances differ, $\Delta_{MH}$ has more inflated Type I error rates than SIBTEST (Pei & Li, 2010; Roussos & Stout, 1996). Therefore, mapping $\hat{\beta}_{uni}$ to $\Delta_{MH}$ gives researchers and practitioners an additional tool, one with verifiable levels of agreement, to assess DIF under conditions in which SIBTEST performs better than $\Delta_{MH}$.
Mantel–Haenszel
Holland and Thayer (1986) described how to use the Mantel–Haenszel procedure for DIF detection. The MH procedure uses a chi-square significance test of the null hypothesis $H_0\colon \alpha_{MH} = 1$, where $\alpha_{MH}$ is the common odds ratio. The formula for calculating $\alpha_{MH}$ is

$$\alpha_{MH} = \frac{\sum_j A_j D_j / T_j}{\sum_j B_j C_j / T_j}, \tag{1}$$

where $A_j$ and $B_j$ are the frequency counts of correct and incorrect responses at score category j for the reference group, respectively, and $C_j$ and $D_j$ are the frequency counts of correct and incorrect responses at score category j for the focal group, respectively. $T_j$ represents the total sample size at score category j. Holland and Thayer (1986) also standardized the metric of analysis by calculating a logarithmic transformation, $\Delta_{MH} = -2.35\ln(\alpha_{MH})$, that changed the hypothesis test from $H_0\colon \alpha_{MH} = 1$ to $H_0\colon \Delta_{MH} = 0$. The new metric converted the MH odds ratio to a DIF measure that is interpreted as the difference in item difficulty on a scale that ETS uses to designate heuristics for dichotomous DIF classification, with negative values favoring the reference group and positive values favoring the focal group (Holland & Thayer, 1986). Table 1 shows the heuristics and hypothesis test criteria that are associated with negligible, moderate, and large DIF for use with the MH procedure. These heuristics are the ones recommended by Nancy Peterson via a memo in 1987 (as cited in Zwick, 2012). Zwick (2012) also reported (via Dorans’s personal communication) that the values of 1 and 1.5 were developed based on unrounded differences, with the idea that rounded values greater than or equal to 2 need to be avoided, but a rounded value of 1 can be tolerated.
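Table 1. DIF Classification Criteria for the Mantel–Haenszel Procedure.

| DIF classification | Mantel–Haenszel | Hypothesis test |
|---|---|---|
| A (negligible) | $\lvert\Delta_{MH}\rvert < 1$ | or not statistically significant |
| B (moderate) | $1 \le \lvert\Delta_{MH}\rvert < 1.5$ | and statistically significant |
| C (large) | $\lvert\Delta_{MH}\rvert \ge 1.5$ | and statistically significant |

Note. DIF = differential item functioning.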
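The odds ratio, its delta transformation, and the A/B/C labels described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the example counts are hypothetical, the significance test is assumed to have been run elsewhere and is passed in as a flag, and the thresholds of 1 and 1.5 are the ETS values cited in the text.

```python
import math

def mh_odds_ratio(strata):
    """Common odds ratio from (A, B, C, D) counts per score category.

    A, B = reference correct/incorrect; C, D = focal correct/incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        total = a + b + c + d
        if total == 0:
            continue  # an empty score category contributes nothing
        numerator += a * d / total
        denominator += b * c / total
    return numerator / denominator

def mh_delta(alpha_mh):
    """Delta-scale transformation: -2.35 * ln(alpha_MH)."""
    return -2.35 * math.log(alpha_mh)

def ets_category(delta, significant):
    """A/B/C label from |delta| and a precomputed significance flag."""
    size = abs(delta)
    if not significant or size < 1.0:
        return "A"
    if size < 1.5:
        return "B"
    return "C"

# Hypothetical counts for three score categories (not data from the study).
strata = [(40, 10, 30, 20), (30, 20, 20, 30), (20, 30, 10, 40)]
alpha = mh_odds_ratio(strata)   # 25 / 10 = 2.5
delta = mh_delta(alpha)         # negative, so the reference group is favored
category = ets_category(delta, significant=True)
```

With these made-up counts the reference group answers correctly more often at every score category, so the odds ratio exceeds 1 and the delta value is negative, matching the sign convention in the text.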
SIBTEST
SIBTEST is also used to detect DIF in dichotomous items. SIBTEST is a nonparametric and multidimensional method that also uses a magnitude difference and hypothesis test to determine whether DIF is present in an item. Because the assumption of similar target ability distributions across all groups is untenable in practice (Shealy & Stout, 1993), SIBTEST uses a regression correction based on classical test theory linear regression to help reduce bias in the estimation. The regression correction is completed using a Taylor series approximation and a nonlinear technique (Jiang & Stout, 1998). This regression correction has been shown to reduce Type I error rates compared with other similar DIF procedures such as Mantel–Haenszel and logistic regression (Jiang & Stout, 1998; Li & Stout, 1996; Roussos & Stout, 1996). SIBTEST tests the null hypothesis $H_0\colon \beta_{uni} = 0$. The null hypothesis is tested by an a priori contrast (Shealy & Stout, 1993, p. 175),

$$\hat{\beta}_{uni} = \sum_{k=0}^{n} \hat{p}_k \left(\bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk}\right), \tag{2}$$

where n is the total possible test score, the proportion of all examinees at score category k is $\hat{p}_k$, and the average bias-corrected score for group g examinees at score category k is $\bar{Y}^{*}_{gk}$ (g = R, F). R and F represent the reference and focal groups, respectively. Positive values favor the reference group and negative values favor the focal group. Details on the computations for the bias-corrected average scores can be found in Shealy and Stout’s (1993, pp. 191-193) seminal work and in Jiang and Stout’s (1998) improved Type I error work.
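As a rough illustration of the weighted contrast described above, the following Python sketch sums the score-category proportions times the reference-minus-focal difference in adjusted item means. It assumes the bias-corrected means have already been computed; the regression correction itself is not implemented here, and the function name and input values are hypothetical.

```python
def beta_uni(proportions, adj_mean_ref, adj_mean_foc):
    """Weighted sum of reference-minus-focal adjusted item means.

    proportions[k]  : share of examinees at matching score k
    adj_mean_ref[k] : bias-corrected mean item score, reference group
    adj_mean_foc[k] : bias-corrected mean item score, focal group
    """
    if not (len(proportions) == len(adj_mean_ref) == len(adj_mean_foc)):
        raise ValueError("all inputs must cover the same score categories")
    return sum(
        p * (ref - foc)
        for p, ref, foc in zip(proportions, adj_mean_ref, adj_mean_foc)
    )

# Hypothetical values for three score categories (not data from the study).
p_k = [0.2, 0.5, 0.3]
ref = [0.30, 0.55, 0.80]
foc = [0.25, 0.45, 0.75]
estimate = beta_uni(p_k, ref, foc)  # positive, so the reference group is favored
```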
After investigating the relationship between the $\Delta_{MH}$ values for classifying moderate and large DIF and the $\hat{\beta}_{uni}$ values from the SIBTEST procedure, Shealy and Stout (1993) originally mapped ETS heuristics of 0.5 and 1.0 onto $\hat{\beta}_{uni}$ to propose a set of heuristics. However, after a reference to Zwick and Ercikan’s (1989) study identifying different ETS heuristics of 1.0 and 1.5, Roussos and Stout (1996) recommended two new sets of heuristics to correspond to the ETS heuristics (see Table 2).
One set of heuristics was provided for 2PL data and the other for 3PL data; however, on further investigation, the prediction models appear to have used $\Delta_{MH}$ to predict the uncorrected $\hat{\beta}_{uni}$ rather than the regression-corrected $\hat{\beta}_{uni}$. The original study was also limited by the restricted combination of item parameters simulated and the number of replications per condition (100), and it did not include the calculation for $\hat{\beta}_{uni}$ that has been updated since the original 1993 study. Further, a review of the literature suggests that the recommended heuristics for a 2PL model have rarely been used since their original presentation, even with noncognitive data with expected guessing parameters of zero (e.g., Sohn, 2001). Therefore, the purpose of this study is to reexamine the $\hat{\beta}_{uni}$ heuristics that would be mapped to ETS heuristics for a variety of item characteristics and sample size conditions and to use a large number of replications per condition for estimating true-positive and false-positive rates.
Method
This simulation study was conducted to determine the $\hat{\beta}_{uni}$ values that correspond to the DIF classification for dichotomous items using ETS’s heuristics cited in Zwick (2012). The statistics behind how to assign items to the levels of moderate or large DIF were originally described by Nancy Peterson beginning in 1987 (as cited in Zwick, 2012). Four sets of procedures were used to conduct this study. First, the relationship between $\hat{\beta}_{uni}$ and $\Delta_{MH}$ values was evaluated. Second, $\hat{\beta}_{uni}$ was predicted using the $\Delta_{MH}$ heuristics that correspond to moderate and large DIF. Third, the false-positive and true-positive rates for identifying moderate- and large-DIF items were compared across the two sets of SIBTEST DIF classification criteria and the MH DIF classification criteria. Last, agreement between the MH DIF classification criteria and the two sets of SIBTEST DIF classification criteria was assessed using Cohen’s kappa coefficient.
Data Generation
R version 3.6.3 (R Development Core Team, 2020) and associated packages were used to simulate item parameters, ability distributions, and the corresponding response strings for all simulations. Specifically, item parameters for the matching subtest, the items used as a proxy for latent ability, were generated using the rlnorm, rnorm, and rbeta functions within R. The probability of a correct response to a dichotomous item followed the 3PL or 2PL model in this study. The probability function for the 3PL model is given by

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}, \tag{3}$$

where $c_i$ is guessing, $b_i$ is item difficulty, and $a_i$ is item discrimination for item i, and $\theta$ is the latent ability. When $c_i$ is constrained to zero, Equation (3) becomes the probability function for the 2PL model. Next, a random value (U) was drawn from a standard uniform distribution, and the item was coded as correct (1) if $U \le P_i(\theta)$ and incorrect (0) if $U > P_i(\theta)$.
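The response-generation step described above can be sketched as follows. The study itself used R; this Python version is only an illustration, with hypothetical function names, and it assumes the logistic form without a scaling constant (the text does not say whether one was used).

```python
import math
import random

def p_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response; c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_response(theta, a, b, c=0.0, rng=random):
    """Draw U ~ Uniform(0, 1) and score 1 when U <= P(theta), else 0."""
    u = rng.random()
    return 1 if u <= p_correct(theta, a, b, c) else 0

# One examinee of average ability answering a hypothetical item.
rng = random.Random(2020)
response = simulate_response(theta=0.0, a=1.0, b=0.0, c=0.2, rng=rng)
```

Passing a seeded `random.Random` instance keeps a simulated data set reproducible across runs.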
Simulation Design
A 41-item test was simulated, consisting of one suspect item and a 40-item matching subtest. Several variables were manipulated in this study: the magnitude of the difference in item difficulty between the reference and focal groups on the suspect item; suspect item difficulty for low-, medium-, and high-difficulty items (bi); and item discrimination values considered to be low, medium, and high (ai). The magnitude of the difference in item difficulty on the suspect item ranged from 0 to 0.8 (see Table 3).
Table 3. Simulation Conditions for Reevaluating Heuristics.

| Conditions | Values |
|---|---|
| Sample size (NR/NF), equal | 500/500 |
| Sample size (NR/NF), unequal | 750/250 |
| Suspect item discrimination (ai) | 0.5, 0.9, 1.25 |
| Suspect item difficulty (bi) | −1, 0, 1 |
| Suspect item guessing (ci) | 0, 0.2 |
| Difference in suspect item difficulty | 0, 0.2, 0.4, 0.6, 0.8 |
| Ability distributions (Reference/Focal) | N(0, 1)/N(0, 1); N(−0.5, 1)/N(0.5, 1) |
Suspect item difficulties were −1, 0, and 1 and item discrimination ranged from 0.50 to 1.25. The crossing of the three difficulty and discrimination parameters was designed using Narayanan and Swaminathan’s (1994) study that allows one to test a variety of possible conditions that would be seen in practice. Two sample size conditions were included: equal (NR = 500, NF = 500) and unequal (NR = 750, NF = 250). Two sets of ability distributions were included for the reference and focal groups: equal (both groups drawn from a standard normal distribution) and unequal (the reference group was drawn from a N(−0.5, 1) distribution, and the focal group was drawn from a N(0.5, 1) distribution). Prior research has indicated the use of the same heuristics for both 1PL and 2PL data (Wright & Oshima, 2015). Thus, modeling after Roussos and Stout’s (1996) original study, two data models were selected for simulation conditions. The first analysis used data generated with a 2PL model (ci = 0). The study was repeated with 3PL generated data (ci > 0). A summary of conditions is provided in Table 3.
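A quick way to see the size of the design is to enumerate the crossed conditions. The sketch below is illustrative, with hypothetical variable names; it assumes the factors are fully crossed within each data model (2PL or 3PL) and uses the 5,000 replications per condition reported for this study.

```python
from itertools import product

# Factor levels as listed in the simulation design.
sample_sizes = [(500, 500), (750, 250)]                # (NR, NF)
discriminations = [0.5, 0.9, 1.25]                     # suspect item a
difficulties = [-1, 0, 1]                              # suspect item b
difficulty_differences = [0, 0.2, 0.4, 0.6, 0.8]       # reference vs. focal
ability_distributions = [
    ((0.0, 1.0), (0.0, 1.0)),    # equal: both groups N(0, 1)
    ((-0.5, 1.0), (0.5, 1.0)),   # unequal: reference N(-0.5, 1), focal N(0.5, 1)
]

conditions = list(product(
    sample_sizes,
    discriminations,
    difficulties,
    difficulty_differences,
    ability_distributions,
))

replications = 5000  # per condition
n_conditions = len(conditions)
n_datasets = n_conditions * replications
```

Under these assumptions there are 180 conditions per data model and 900,000 simulated data sets, which agrees with the full-sample size reported alongside the scatterplots.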
Variables held constant included the test length (41 items, with a 40-item matching subtest and one suspect item), number of replications (5,000), and number of response options (2). New sets of item parameters for the matching subtest were randomly generated with each replication to maximize the generalization of the results. Using prior research (Walker et al., 2011), discrimination parameters were randomly drawn from a log normal distribution with a mean of 1.0, and a standard deviation of 0.5. The guessing parameters were constrained to 0 for 2PL data and were generated from a beta distribution with a mean of 0.16 and a standard deviation of 0.004 for 3PL data. The difficulty parameters were generated from N(0, 1).
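The parameter draws for the matching subtest can be sketched with a method-of-moments conversion. This is an assumption-heavy illustration: it treats the reported means and standard deviations as moments of the parameter distributions themselves (not of the log scale), the helper names are hypothetical, and the study's own R calls to rlnorm and rbeta may have been parameterized differently.

```python
import math
import random

def lognormal_params(mean, sd):
    """Log-scale (mu, sigma) giving a lognormal with this mean and SD."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)

def beta_params(mean, sd):
    """Shape parameters (alpha, beta) giving a beta with this mean and SD."""
    common = mean * (1.0 - mean) / sd ** 2 - 1.0
    return mean * common, (1.0 - mean) * common

def draw_matching_subtest(n_items=40, three_pl=True, rng=random):
    """Return a list of (a, b, c) tuples for one replication."""
    mu, sigma = lognormal_params(1.0, 0.5)       # discrimination
    alpha, beta = beta_params(0.16, 0.004)       # guessing (3PL only)
    items = []
    for _ in range(n_items):
        a = rng.lognormvariate(mu, sigma)
        b = rng.gauss(0.0, 1.0)                  # difficulty ~ N(0, 1)
        c = rng.betavariate(alpha, beta) if three_pl else 0.0
        items.append((a, b, c))
    return items

items_3pl = draw_matching_subtest(rng=random.Random(1))
items_2pl = draw_matching_subtest(three_pl=False, rng=random.Random(1))
```

Because a new parameter set is drawn on every call, calling `draw_matching_subtest` once per replication mirrors the regeneration of matching-subtest parameters described above.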
Analysis
Analysis was completed using R Version 3.6.3 (R Development Core Team, 2020). To calculate the $\hat{\beta}_{uni}$ value for the SIBTEST procedure, custom functions that incorporate the Jiang and Stout (1998) regression correction were written in R and validated against the SIBTEST Version 1.7 software (Jiang & Stout, 1998; Shealy & Stout, 1993). Calculation of $\Delta_{MH}$ was accomplished using the difR package (Magis et al., 2010). To reassess the SIBTEST heuristics for DIF as they compare with the heuristics used by ETS, the relationship between the two sets of values was investigated first. After a linear relationship was established, $\hat{\beta}_{uni}$ was regressed on $\Delta_{MH}$. This regression analysis was conducted to determine the $\hat{\beta}_{uni}$ heuristics that are associated with moderate- and large-DIF $\Delta_{MH}$ values. A second study was conducted to compare the false-positive and true-positive rates for $\Delta_{MH}$, the former SIBTEST criteria, and the newly identified SIBTEST heuristics. The false-positive and true-positive rates were calculated at both the moderate- and large-DIF levels. An item is considered a false positive when it is simulated to be free of DIF yet is identified as displaying DIF. An item is considered a true positive when it is simulated to display DIF and is identified as displaying DIF. Items that are a false positive or true positive at the moderate level must satisfy two criteria. The first criterion is that the hypothesis test is statistically significant, and the second criterion is that the magnitude of the statistical procedure’s index (e.g., $\hat{\beta}_{uni}$ and $\Delta_{MH}$) is greater than or equal to the moderate-level DIF heuristic. Similarly, items are a false positive or true positive at the large-DIF level if the statistical procedure’s index is greater than or equal to the large-level DIF heuristic and the hypothesis test is significant. The use of both index magnitude and statistical significance has been recommended when conducting DIF analyses (Roussos & Stout, 1996). Finally, a supplemental analysis using Cohen’s kappa coefficients was conducted to compare the alignment of the SIBTEST heuristics with $\Delta_{MH}$.
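The two-part flagging rule and the resulting rates can be sketched as below. The function names, the threshold of 0.06, and the four replication results are hypothetical placeholders; the hypothesis test is assumed to have been computed elsewhere and is passed in as a boolean.

```python
def is_flagged(index, significant, heuristic):
    """Flag an item only when the test is significant AND |index| >= heuristic."""
    return significant and abs(index) >= heuristic

def flag_rate(results, heuristic):
    """Proportion of replications flagged at the given heuristic.

    results: list of (index_value, significant) pairs, one per replication.
    On no-DIF replications this is a false-positive rate;
    on DIF replications it is a true-positive rate.
    """
    flags = [is_flagged(index, sig, heuristic) for index, sig in results]
    return sum(flags) / len(flags)

# Hypothetical replications of an item simulated WITHOUT DIF.
no_dif_results = [(0.01, False), (0.07, True), (0.03, True), (-0.08, False)]
false_positive_rate = flag_rate(no_dif_results, heuristic=0.06)
```

In this toy example only the second replication satisfies both conditions: the third is significant but too small, and the fourth is large enough but not significant.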
Results
The results from this study are presented in three parts. The first part presents a reestimation of the $\hat{\beta}_{uni}$ heuristics that correspond to the heuristics used by ETS. The second part compares the false-positive rates and the true-positive rates (at both moderate- and large-DIF levels) for the two sets of criteria for SIBTEST (the former and new heuristics with hypothesis testing) and the set of criteria for the MH procedure (the $\Delta_{MH}$ heuristics with hypothesis testing). Finally, the level of agreement between the MH procedure criteria and both sets of SIBTEST criteria is reported.
Reassessing $\hat{\beta}_{uni}$
Scatterplots were used to investigate the relationship between $\hat{\beta}_{uni}$ and $\Delta_{MH}$ (see Figure 1). Data were combined to include both sets of ability distributions. The scatterplots show that for both 2PL and 3PL data, there were strong, negative, linear relationships between the two sets of scores. The correlations between $\hat{\beta}_{uni}$ and $\Delta_{MH}$ for 2PL and 3PL data were −0.936 and −0.926, respectively. The relationship is negative because a negative $\Delta_{MH}$ value indicates that the reference group is favored, whereas a positive $\hat{\beta}_{uni}$ value indicates that the reference group is favored.
Figure 1. Relationship between $\hat{\beta}_{uni}$ and $\Delta_{MH}$ for 2PL and 3PL Data. Data represented here are a random sample (n = 800). The full sample (n = 900,000) does not allow for a visual inspection.
Two linear regression analyses were conducted to investigate the relationship between $\hat{\beta}_{uni}$ and $\Delta_{MH}$, one for 2PL data and the other for 3PL data. The results of a simple linear regression analysis for 2PL data (see Table 4) showed that $\Delta_{MH}$ was a significant predictor (t = −2524.3, p < .001) of $\hat{\beta}_{uni}$. The slope of the predictor was −0.055, indicating that $\hat{\beta}_{uni}$ decreases by 0.055 when $\Delta_{MH}$ increases by one. The adjusted R2 value was .88, indicating that the model accounted for 88% of the variation in $\hat{\beta}_{uni}$. Similarly, regression analysis showed that $\Delta_{MH}$ was a significant predictor (t = −23267.7, p < .001) of $\hat{\beta}_{uni}$ for the 3PL data (see Table 4). As $\Delta_{MH}$ increases by one, $\hat{\beta}_{uni}$ decreases by 0.066. The adjusted R2 value was .86, indicating that the model accounted for 86% of the variation.
With both models indicating a strong, significant relationship, it was appropriate to continue the reassessment of the 2PL and 3PL heuristics for $\hat{\beta}_{uni}$ that correspond to the $\Delta_{MH}$ heuristics for ETS classification. Because the data were simulated in favor of the reference group for all conditions and the signs of $\hat{\beta}_{uni}$ and $\Delta_{MH}$ are interpreted in opposite directions, $\Delta_{MH}$ values of −1 and −1.5 were used to calculate the new $\hat{\beta}_{uni}$ heuristics for moderate and large DIF for both 2PL and 3PL data. The $\hat{\beta}_{uni}$ heuristics for 2PL data were 0.062 (95% confidence interval [CI] = [0.062, 0.063]) and 0.090 (95% CI = [0.090, 0.090]) for moderate and large DIF, respectively. For 3PL data, the heuristics for moderate and large DIF were 0.069 (95% CI = [0.069, 0.069]) and 0.102 (95% CI = [0.102, 0.102]), respectively. A summary of these heuristics with classification criteria can be found in Table 5. To further investigate the consistency of the heuristics, comparisons were made across the different simulated conditions. The variance in the average heuristics ranged from 0.0002 to 0.0005 for moderate- and large-DIF levels with 2PL and 3PL data.
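The reported numbers can be checked for internal consistency with a small sketch. The regression intercepts are not given in this section, so the code back-solves them from the rounded slopes and the rounded moderate-DIF heuristics; these implied intercepts are an assumption made for illustration, not values taken from Table 4, and the function names are hypothetical.

```python
def implied_intercept(slope, heuristic, delta_mh):
    """Intercept implied by one (delta_mh, heuristic) point on the fitted line."""
    return heuristic - slope * delta_mh

def predict_beta(intercept, slope, delta_mh):
    """Predicted SIBTEST index for a given delta MH value."""
    return intercept + slope * delta_mh

# Rounded slopes and moderate-DIF heuristics as reported in the text.
slope_2pl, slope_3pl = -0.055, -0.066
intercept_2pl = implied_intercept(slope_2pl, 0.062, -1.0)
intercept_3pl = implied_intercept(slope_3pl, 0.069, -1.0)

# Predicted large-DIF heuristics at delta MH = -1.5.
large_2pl = predict_beta(intercept_2pl, slope_2pl, -1.5)
large_3pl = predict_beta(intercept_3pl, slope_3pl, -1.5)
```

Under these assumptions the predicted large-DIF values fall within rounding of the reported 0.090 (2PL) and 0.102 (3PL), so the slopes and both pairs of heuristics are mutually consistent.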
Newly Recommended DIF Classification Criteria for the SIBTEST Procedure.
The second part of the study compared false-positive and true-positive rates for the previously recommended SIBTEST heuristics, the newly calculated SIBTEST heuristics, and the MH heuristics. For ease of reading, the previously recommended heuristics (see Table 2) are referred to as the former heuristics, and the newly calculated heuristics (see Table 5) are referred to as the new heuristics. To begin, false-positive rates are compared for both 2PL and 3PL data at both the moderate- and large-DIF levels. Next, true-positive rates are compared at both the moderate- and large-DIF levels.
False-Positive
When the ability distributions were equal, the false-positive rates were below the level of significance for all three sets of DIF criteria at both the moderate- and large-DIF levels (see Figure 2). When the data were 2PL, the former heuristics and MH had more similar false-positive rates than the new heuristics and MH. Conversely, when data were 3PL, the new heuristics and MH had false-positive rates that were more similar.
Figure 2. False-positive rates when ability distributions are equal.
When data were 2PL, the new heuristics had the highest moderate-DIF false-positive rates at 2.02% and 3.76% for equal and unequal sample sizes, respectively (see Table 6). When data were 3PL, the former heuristics had the highest moderate-DIF false-positive rate for both sample size conditions. These false-positive rates were 3.66% and 4.72% for equal and unequal sample sizes, respectively. Regardless of moderate-DIF level, large-DIF level, or data type, the three sets of DIF criteria maintained false-positive rates below the level of significance across the varying item types.
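Table 6. False-Positive Rates (%) for Equal Ability Distributions.

| Factor | Mod. 2PL Former | Mod. 2PL New | Mod. 2PL MH | Mod. 3PL Former | Mod. 3PL New | Mod. 3PL MH | Large 2PL Former | Large 2PL New | Large 2PL MH | Large 3PL Former | Large 3PL New | Large 3PL MH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample size: Equal | 0.82 | 2.02 | 1.53 | 3.66 | 1.66 | 1.04 | 0.01 | 0.09 | 0.05 | 0.26 | 0.05 | 0.04 |
| Sample size: Unequal | 2.40 | 3.76 | 2.66 | 4.72 | 3.72 | 1.84 | 0.11 | 0.51 | 0.26 | 1.00 | 0.32 | 0.11 |
| Item type: Low b | 1.33 | 2.51 | 2.37 | 2.87 | 1.34 | 2.58 | 0.06 | 0.26 | 0.21 | 0.25 | 0.06 | 0.20 |
| Item type: Medium b | 2.19 | 3.73 | 1.36 | 4.67 | 3.12 | 0.90 | 0.08 | 0.42 | 0.03 | 0.68 | 0.20 | 0.01 |
| Item type: High b | 1.30 | 2.42 | 2.55 | 5.03 | 3.61 | 0.83 | 0.04 | 0.23 | 0.22 | 0.96 | 0.28 | 0.01 |
| Item type: Low a | 2.90 | 4.30 | 0.93 | 4.70 | 3.37 | 0.83 | 0.14 | 0.63 | 0.02 | 0.83 | 0.27 | 0.02 |
| Item type: Medium a | 1.25 | 2.70 | 2.07 | 4.02 | 2.50 | 1.50 | 0.03 | 0.21 | 0.16 | 0.63 | 0.16 | 0.07 |
| Item type: High a | 0.67 | 1.67 | 3.28 | 3.84 | 2.21 | 1.99 | 0.00 | 0.07 | 0.28 | 0.44 | 0.12 | 0.13 |

Note. Mod. = moderate-DIF level; Large = large-DIF level; Former = former heuristics; New = new heuristics; MH = delta Mantel–Haenszel; Equal = equal sample size condition with reference and focal group sample sizes of 500; Unequal = unequal sample size condition with a reference group sample size of 750 and a focal group sample size of 250; DIF = differential item functioning.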
False-positive rates when the ability distributions were unequal were below the level of significance for all three sets of DIF criteria when the data were 2PL (see Figure 3). When the data were 3PL, the former heuristics had slightly inflated false-positive rates, while the new heuristics and MH were below the level of significance. All three DIF criteria had false-positive rates below the significance level at the large-DIF level when data were 3PL. False-positive rates at the moderate-DIF level for all three criteria were below the level of significance when the sample sizes were equal and data were 2PL (see Table 7).
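Table 7. False-Positive Rates (%) for Unequal Ability Distributions.

| Factor | Mod. 2PL Former | Mod. 2PL New | Mod. 2PL MH | Mod. 3PL Former | Mod. 3PL New | Mod. 3PL MH | Large 2PL Former | Large 2PL New | Large 2PL MH | Large 3PL Former | Large 3PL New | Large 3PL MH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample size: Equal | 2.11 | 3.82 | 2.58 | 6.15 | 3.74 | 2.16 | 0.08 | 0.38 | 0.20 | 0.84 | 0.23 | 0.18 |
| Sample size: Unequal | 5.11 | 5.86 | 3.48 | 7.61 | 6.96 | 2.98 | 1.02 | 2.59 | 0.72 | 3.49 | 1.54 | 0.60 |
| Item type: Low b | 4.93 | 6.30 | 3.45 | 8.27 | 5.71 | 4.60 | 0.80 | 2.18 | 0.93 | 1.81 | 0.64 | 1.11 |
| Item type: Medium b | 3.91 | 4.96 | 2.42 | 6.57 | 5.35 | 1.70 | 0.64 | 1.69 | 0.12 | 2.58 | 1.14 | 0.05 |
| Item type: High b | 1.99 | 3.27 | 3.23 | 5.80 | 4.98 | 1.42 | 0.20 | 0.58 | 0.34 | 2.10 | 0.87 | 0.02 |
| Item type: Low a | 4.74 | 5.75 | 1.85 | 6.68 | 5.85 | 1.52 | 0.83 | 2.17 | 0.06 | 2.73 | 1.14 | 0.04 |
| Item type: Medium a | 3.41 | 4.81 | 3.00 | 6.85 | 5.24 | 2.55 | 0.46 | 1.37 | 0.32 | 2.05 | 0.87 | 0.30 |
| Item type: High a | 2.68 | 3.97 | 4.25 | 7.11 | 4.95 | 3.65 | 0.35 | 0.91 | 1.01 | 1.71 | 0.64 | 0.83 |

Note. Mod. = moderate-DIF level; Large = large-DIF level; Former = former heuristics; New = new heuristics; MH = delta Mantel–Haenszel; Equal = equal sample size condition with reference and focal group sample sizes of 500; Unequal = unequal sample size condition with a reference group sample size of 750 and a focal group sample size of 250; DIF = differential item functioning.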
Figure 3. False-positive rates when ability distributions are unequal.
When sample sizes were unequal, the moderate-level DIF false-positive rates for both SIBTEST criteria were slightly inflated. When data were 3PL, the new heuristics and MH maintained moderate-DIF-level false-positive rates below the significance level for equal sample sizes, whereas the former heuristics had an inflated false-positive rate. When sample sizes were unequal for 3PL data, the former and new heuristics had inflated moderate-level-DIF false-positive rates at 7.61% and 6.96%, respectively. The MH rate remained below the significance level for all conditions analyzed. At the large-DIF level, all three sets of DIF criteria had false-positive rates below the significance level.
When data were 2PL, the former heuristics and MH maintained false-positive rates below the level of significance regardless of item type at the moderate-DIF level. The new heuristics had slightly inflated false-positive rates for low-difficulty items and low-discrimination items. When data were 3PL, the MH false-positive rates were below the significance level for all item types. The new heuristics maintained false-positive rates near the level of significance, with the highest inflated rate being 5.85% for the low-discrimination items. The former heuristics had inflated false-positive rates across all item types of 3PL data when using moderate-DIF criteria. The lowest false-positive rate for the former heuristics was 5.80% for high-difficulty items. False-positive rates at the large-DIF level were below the level of significance for all three sets of DIF criteria.
True-Positive
When the ability distributions were equal, the true-positive rates for all three sets of DIF criteria at the moderate-DIF level ranged from 70% to 75% for 2PL data and from 55% to 65% for 3PL data (see Figure 4). When data were 2PL, the new heuristics had the highest moderate-DIF-level true-positive rate at 74.59% and the former heuristics had the lowest true-positive rate at 70.26%.
Figure 4. True-positive rates for equal ability distributions and aggregated over sample size.
When data were 3PL, the former heuristics had the highest moderate-DIF-level true-positive rate at 65.68% and MH had the lowest true-positive rate at 54.10%. At the large-DIF level, the three sets of DIF criteria ranged from 49% to 60% for 2PL data and from 30% to 48% for 3PL data. Overall, when the ability distributions were equal, the former heuristics and MH had similar true-positive rates at both the moderate- and large-DIF levels for 2PL data, but the new heuristics and MH had similar true-positive rates when data were 3PL.
As the magnitude of the between-group differences in difficulty parameters () increased, true-positive rates increased for all three sets of DIF criteria (see Figure 5). When data were 2PL, all three sets of DIF criteria had similar true-positive rates for moderate DIF across the different levels of . and had more similar true-positive rates at the moderate-DIF level for 3PL data. At the large-DIF level, when data are 2PL, and have similar true-positive rates at the lower end of the between-group differences in difficulty parameters ( 0.4), but and have similar true-positive rates at the upper end of the between-group differences in difficulty parameters (0.6). When data are 3PL, and have similar true-positive rates at the large-DIF level, but has slightly higher true-positive rates. had substantially higher true-positive rates at the large-DIF level than and . and had similar true-positive rates at the moderate-DIF level for both 2PL and 3PL data.
Figure 5. Equal ability distribution true-positive rates by between-group differences in difficulty parameter.
When data were 2PL, true-positive rates for the new heuristics were the highest among the three sets of DIF criteria for all ranges of the difficulty parameter and when the discrimination parameters were at the low and medium levels (see Table 8). MH had higher true-positive rates for high-discrimination data. These results hold for both moderate- and large-DIF levels for 2PL data. When investigating the true-positive rates by item type when data were 3PL, the former heuristics had the highest true-positive rates across all item types at both moderate- and large-DIF levels. The new heuristics had higher true-positive rates than MH across all items except for the low-difficulty items at both moderate- and large-DIF levels.
Table 8. True-Positive Rates (%) for Equal Ability Distributions.

| Factor | Mod. 2PL Former | Mod. 2PL New | Mod. 2PL MH | Mod. 3PL Former | Mod. 3PL New | Mod. 3PL MH | Large 2PL Former | Large 2PL New | Large 2PL MH | Large 3PL Former | Large 3PL New | Large 3PL MH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample size: Equal | 70.55 | 75.79 | 71.50 | 67.94 | 61.13 | 54.17 | 49.22 | 59.29 | 52.03 | 47.45 | 37.71 | 30.30 |
| Sample size: Unequal | 69.96 | 73.39 | 70.62 | 63.42 | 61.26 | 54.02 | 49.45 | 59.31 | 52.23 | 49.17 | 40.00 | 31.09 |
| Item type: Low b | 67.43 | 72.19 | 70.71 | 65.94 | 59.47 | 64.20 | 44.98 | 55.75 | 51.81 | 45.49 | 35.55 | 43.24 |
| Item type: Medium b | 75.68 | 78.92 | 71.59 | 71.00 | 67.71 | 57.60 | 57.87 | 66.32 | 52.15 | 56.58 | 48.14 | 34.25 |
| Item type: High b | 67.67 | 72.66 | 70.88 | 60.10 | 56.40 | 40.48 | 45.15 | 55.82 | 52.42 | 42.85 | 32.87 | 14.59 |
| Item type: Low a | 59.78 | 64.07 | 50.42 | 53.09 | 48.77 | 33.49 | 34.57 | 46.54 | 21.88 | 34.24 | 23.78 | 8.39 |
| Item type: Medium a | 73.79 | 78.20 | 77.34 | 69.76 | 65.17 | 60.03 | 53.98 | 63.48 | 60.32 | 52.60 | 43.26 | 35.62 |
| Item type: High a | 77.20 | 81.50 | 85.41 | 74.19 | 69.63 | 68.76 | 59.44 | 67.87 | 74.18 | 58.08 | 49.52 | 48.08 |

Note. Mod. = moderate-DIF level; Large = large-DIF level; Former = former heuristics; New = new heuristics; MH = delta Mantel–Haenszel; Equal = equal sample size condition with reference and focal group sample sizes of 500; Unequal = unequal sample size condition with a reference group sample size of 750 and a focal group sample size of 250; DIF = differential item functioning.
True-positive rates for 2PL data, when the ability distributions were unequal, follow the same overall trends when the ability distributions were equal (see Figure 6). The true-positive rates were slightly lower at the moderate- and large-DIF levels compared with the equal ability condition, and had slightly higher rates than at the moderate-DIF level. The 3PL data true-positive rates followed the same trends as the equal ability distribution conditions; however, and had essentially the same true-positive rates and had slightly lower true-positive rates at the moderate-DIF level. and had similar but lower true-positive rates than at the large-DIF level.
Figure 6. True-positive rates for unequal ability distributions and aggregated over sample size.
As the magnitude of increased, true-positive rates increased for all three sets of DIF criteria (see Figure 7). and had similar true-positive rates at the moderate-DIF level for both 2PL and 3PL data. All three sets of DIF criteria had similar true-positive rates for moderate-DIF across the different levels of for both 2PL and 3PL data. Comparing true-positive rates at the large-DIF level for 2PL data, had lower rates than and for all magnitudes less than 0.8. had the highest true-positive rates across all magnitudes for 3PL data at the large-DIF level.
Figure 7. Unequal ability distribution true-positive rates by between-group differences in difficulty parameter.
When data were 2PL, MH had the highest true-positive rates at the moderate-DIF level for all item types except the medium-difficulty items (see Table 9). The new heuristics had slightly higher rates than MH for medium-difficulty items at the moderate-DIF level. At the large-DIF level for 2PL data, the new heuristics had the highest true-positive rates for items with low and medium difficulty and items with low and medium discriminations. MH had the highest rates for high-difficulty and high-discrimination items. Overall, the true-positive rates were similar for the new heuristics and MH when data were 2PL.
Table 9. True-Positive Rates (%) for Unequal Ability Distributions.

| Factor | Mod. 2PL Former | Mod. 2PL New | Mod. 2PL MH | Mod. 3PL Former | Mod. 3PL New | Mod. 3PL MH | Large 2PL Former | Large 2PL New | Large 2PL MH | Large 3PL Former | Large 3PL New | Large 3PL MH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sample size: Equal | 67.62 | 72.08 | 71.71 | 59.97 | 55.16 | 54.70 | 46.54 | 56.50 | 52.61 | 41.99 | 32.84 | 30.33 |
| Sample size: Unequal | 62.98 | 64.49 | 69.96 | 51.63 | 51.34 | 55.80 | 46.82 | 55.98 | 52.98 | 44.63 | 36.23 | 33.50 |
| Item type: Low b | 60.59 | 63.21 | 68.55 | 51.61 | 47.50 | 57.53 | 42.43 | 52.17 | 51.58 | 36.31 | 27.52 | 38.70 |
| Item type: Medium b | 70.95 | 72.90 | 72.23 | 62.16 | 60.18 | 58.89 | 55.25 | 63.56 | 52.93 | 52.06 | 43.99 | 35.53 |
| Item type: High b | 64.36 | 68.75 | 71.71 | 53.62 | 52.07 | 49.34 | 42.36 | 53.00 | 53.88 | 41.56 | 32.09 | 21.50 |
| Item type: Low a | 53.15 | 55.45 | 56.40 | 41.94 | 39.85 | 39.85 | 32.44 | 43.45 | 27.80 | 29.96 | 21.08 | 12.28 |
| Item type: Medium a | 69.07 | 72.34 | 75.38 | 59.98 | 57.35 | 59.89 | 51.01 | 60.24 | 59.47 | 47.27 | 38.34 | 36.33 |
| Item type: High a | 73.68 | 77.07 | 80.72 | 65.48 | 62.54 | 66.02 | 56.59 | 65.04 | 71.12 | 52.70 | 44.18 | 47.13 |

Note. Mod. = moderate-DIF level; Large = large-DIF level; Former = former heuristics; New = new heuristics; MH = delta Mantel–Haenszel; Equal = equal sample size condition with reference and focal group sample sizes of 500; Unequal = unequal sample size condition with a reference group sample size of 750 and a focal group sample size of 250; DIF = differential item functioning.
When data were 3PL, the former SIBTEST heuristics had the highest true-positive rates at the moderate-DIF level for all types of items except low-difficulty items and high-discrimination items. When items had low-difficulty or high-discrimination parameters, ΔMH had the highest true-positive rates at the moderate-DIF level. Similarly, at the large-DIF level for 3PL data, the former SIBTEST heuristics had the highest true-positive rates for all item types except the low-difficulty items, for which ΔMH had the highest true-positive rate. The true-positive rates for the three DIF criteria were similar at the moderate-DIF level, but the former SIBTEST heuristics had substantially higher rates at the large-DIF level.
Level of Agreement
Combining both equal and unequal ability distribution data, Cohen’s kappa coefficients (κ) were used to compare the level of agreement of ΔMH with the original and the new SIBTEST heuristics. Values of κ exceeding 0.8 are interpreted as a strong level of agreement, and values greater than 0.9 can be interpreted as nearly perfect (McHugh, 2012). Frequency tables are provided for both moderate- and large-DIF heuristics (see Tables 10 and 11) to show the distribution of items classified similarly by ΔMH and the two SIBTEST heuristics. Higher values in the diagonals represent more agreement between the two measures. At the moderate level of DIF, κ for 2PL data was 0.85 for both the former and the new SIBTEST heuristics with ΔMH.
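Table 10. Frequency Distributions for Moderate-DIF Classification.

| Data | SIBTEST heuristics | ΔMH classification | SIBTEST: Not flagged | SIBTEST: Flagged for DIF |
|---|---|---|---|---|
| 2PL | Former | Not flagged | 361,866 (40%) | 22,704 (3%) |
| 2PL | Former | Flagged for DIF | 45,431 (5%) | 469,999 (52%) |
| 2PL | New | Not flagged | 349,055 (39%) | 35,515 (4%) |
| 2PL | New | Flagged for DIF | 29,634 (3%) | 485,796 (54%) |
| 3PL | Former | Not flagged | 429,138 (48%) | 73,595 (8%) |
| 3PL | Former | Flagged for DIF | 23,581 (3%) | 373,686 (42%) |
| 3PL | New | Not flagged | 447,419 (50%) | 55,314 (6%) |
| 3PL | New | Flagged for DIF | 33,358 (4%) | 363,909 (40%) |

Note. Former = former SIBTEST heuristics; New = new SIBTEST heuristics; ΔMH = delta Mantel–Haenszel.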
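Table 11. Frequency Distributions for Large-DIF Classification.

| Data | SIBTEST heuristics | ΔMH classification | SIBTEST: Not flagged | SIBTEST: Flagged for DIF |
|---|---|---|---|---|
| 2PL | Former | Not flagged | 493,761 (55%) | 27,957 (3%) |
| 2PL | Former | Flagged for DIF | 60,048 (7%) | 318,234 (35%) |
| 2PL | New | Not flagged | 457,230 (51%) | 64,488 (7%) |
| 2PL | New | Flagged for DIF | 25,226 (3%) | 353,056 (39%) |
| 3PL | Former | Not flagged | 553,112 (61%) | 121,088 (13%) |
| 3PL | Former | Flagged for DIF | 14,550 (2%) | 211,250 (23%) |
| 3PL | New | Not flagged | 604,370 (67%) | 69,830 (8%) |
| 3PL | New | Flagged for DIF | 30,479 (3%) | 195,321 (22%) |

Note. Former = former SIBTEST heuristics; New = new SIBTEST heuristics; ΔMH = delta Mantel–Haenszel.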
Similarly, at the large-DIF level, the former and the new SIBTEST heuristics had identical κ values with ΔMH, both being 0.80 for 2PL data. For the 3PL data, the new SIBTEST heuristics have higher levels of agreement with ΔMH, specifically for large-DIF classifications. At the moderate-DIF level, κ for the new heuristics was 0.80 compared with 0.78 for the former heuristics. At the large-DIF level, κ for the new and the former heuristics were 0.72 and 0.65, respectively. When data were 3PL, the new heuristics classified approximately 1% more items in agreement with ΔMH at the moderate-DIF level than the former heuristics. At the large-DIF level, the new heuristics classified approximately 4% more items in agreement with ΔMH than the former heuristics. When data were 2PL, the new heuristics classified only 0.33% more items in agreement with ΔMH at the moderate-DIF level than the former heuristics. At the large-DIF level, the new heuristics classified 0.19% fewer items in agreement with ΔMH than the former heuristics.
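The agreement statistics reported above can be checked directly from the frequency tables. The following is a minimal sketch in Python rather than the authors' own code. It uses the 3PL large-DIF counts from Table 11 and assumes that rows are the ΔMH classification and columns are the SIBTEST classification, with the former heuristics listed before the new ones. The function and variable names are illustrative only.

```python
def observed_agreement(a, b, c, d):
    """Proportion of items on the diagonal of a 2x2 table.

    a = not flagged by either method, d = flagged by both,
    b and c = the two kinds of disagreement.
    """
    return (a + d) / (a + b + c + d)


def cohen_kappa_2x2(a, b, c, d):
    """Cohen's kappa: chance-corrected agreement for a 2x2 table."""
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from the row and column marginals.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)


# 3PL data, large-DIF classification (counts as reported in Table 11).
# Assumed cell order: (neither flags, SIBTEST only, delta-MH only, both flag).
former_3pl_large = (553_112, 121_088, 14_550, 211_250)
new_3pl_large = (604_370, 69_830, 30_479, 195_321)

kappa_former = cohen_kappa_2x2(*former_3pl_large)
kappa_new = cohen_kappa_2x2(*new_3pl_large)
agreement_gain = (observed_agreement(*new_3pl_large)
                  - observed_agreement(*former_3pl_large))

print(f"kappa, former heuristics: {kappa_former:.2f}")
print(f"kappa, new heuristics:    {kappa_new:.2f}")
print(f"gain in raw agreement:    {agreement_gain:.1%}")
```

Under these assumptions the sketch reproduces the reported κ values of 0.65 and 0.72 and the gain of roughly 4% in items classified in agreement with ΔMH.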
Discussion
The new SIBTEST heuristics provided in this study have slightly higher overall agreement with the ETS heuristics than the formerly recommended SIBTEST heuristics (Roussos & Stout, 1996). Both sets of 2PL SIBTEST heuristics are in the strong level of agreement category with ΔMH (McHugh, 2012). However, for 3PL data, the level of agreement between ΔMH and the new SIBTEST heuristics is higher than the level of agreement with the former SIBTEST heuristics, specifically for large-DIF classifications. This corresponds with the similar true-positive and false-positive rates for ΔMH and the new SIBTEST classifications. The higher level of agreement was possibly due to two factors: MH and the former SIBTEST criteria having a lower level of alignment when analyzing 3PL data, and the former SIBTEST heuristics having been developed before Jiang and Stout’s (1998) regression correction was implemented for SIBTEST. The current study was able to capitalize on the newer regression correction in the calculation of the SIBTEST statistic. Further, the current study had the benefit of increased computing capacity that allowed for the simulation of a larger range of item-level characteristics and an increased number of replications.
False-positive rates were slightly inflated at the moderate-DIF level when the sample size was small (<300) and the ability distributions were unequal. Even with the inclusion of Jiang and Stout’s (1998) adjustment to the calculation of the SIBTEST statistic, the conclusion of prior research that SIBTEST has inflated false-positive rates when the focal group sample size is smaller than 300 still holds (e.g., Narayanan & Swaminathan, 1994). Under certain conditions at the moderate-DIF level, the three sets of DIF criteria had false-positive rates well below the level of significance. Additionally, regardless of condition, all three sets of DIF criteria had false-positive rates well below the level of significance for large DIF, with rates for the newly suggested heuristics being slightly smaller than those for the previous heuristics when data were 3PL. It is not uncommon to see false-positive rates controlled well below the level of significance when combining both DIF magnitude heuristics and statistical testing (Turner et al., 2011).
Based on the results from this study, two new sets of SIBTEST heuristics are recommended for use with 2PL and 3PL data for researchers wanting higher consistency with ETS guidelines (Zwick, 2012). For data with 2PL model characteristics, researchers may want to consider the heuristics of 0.062 and 0.090 for moderate and large DIF, respectively. These values are smaller than those recommended by Roussos and Stout (1996) for 2PL data. Similarly, if the data follow a 3PL model, researchers may want to consider the heuristics of 0.069 and 0.102 for moderate and large DIF, respectively. These values are larger than those recommended by Roussos and Stout (1996). The set of 3PL heuristics suggested here provides slightly more conservative criteria for flagging an item as having large DIF, whereas the set of 2PL heuristics suggested here provides slightly more liberal criteria. Both sets of heuristics provide a classification that would be more aligned with the Mantel–Haenszel criteria recommended by work conducted at ETS (Zwick, 2012).
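As an illustration of how the recommended values might be applied, the following Python sketch classifies a single item from its SIBTEST effect size estimate. It is not the authors' implementation. The function name, the use of the absolute value, the treatment of values exactly at a cutoff as belonging to the higher category, and the optional significance flag (reflecting the practice of combining magnitude heuristics with statistical testing) are all assumptions made for the example.

```python
# Recommended (moderate, large) cutoffs from this study, by IRT model.
HEURISTICS = {
    "2PL": (0.062, 0.090),
    "3PL": (0.069, 0.102),
}


def classify_dif(beta_hat, model, significant=True):
    """Classify DIF magnitude for one item (illustrative sketch).

    beta_hat    : SIBTEST effect size estimate for the studied item
    model       : "2PL" or "3PL", selecting the set of cutoffs
    significant : assumed flag for whether the SIBTEST hypothesis test
                  rejected; a non-significant item is treated as negligible
    """
    moderate_cut, large_cut = HEURISTICS[model]
    size = abs(beta_hat)  # assumption: direction of DIF is ignored
    if not significant or size < moderate_cut:
        return "negligible"
    if size < large_cut:
        return "moderate"
    return "large"


# The same estimate can fall in different categories under the two models.
for value in (0.050, 0.065, -0.095, 0.110):
    print(value, classify_dif(value, "2PL"), classify_dif(value, "3PL"))
```

Because the 3PL cutoffs are larger than the 2PL cutoffs, an estimate such as 0.065 is classified as moderate under the 2PL heuristics but negligible under the 3PL heuristics, which is why the choice of model matters when applying these values.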
Overall, the results from this study indicate that the SIBTEST heuristics suggested by Roussos and Stout (1996) had fairly high levels of agreement with ΔMH at the moderate-DIF level. However, at the large-DIF level, Roussos and Stout’s (1996) SIBTEST heuristics resulted in lower true-positive rates than ΔMH with 2PL data and substantially higher rates than ΔMH with 3PL data. The SIBTEST heuristics suggested from this study provide an adjusted set of values that result in both true-positive and false-positive rates that are closer in alignment with ΔMH values than the previous SIBTEST criteria, especially for the large-DIF condition. Therefore, researchers who want their DIF analyses to be better aligned with ETS are advised to use the suggested moderate- and large-DIF heuristics of 0.062 and 0.090 when data are 2PL and 0.069 and 0.102 when data are 3PL.
Limitations
This study did not include analyses to investigate which SIBTEST heuristics maximize true-positive rates while controlling for false-positive errors. Further, this study was limited to a fixed test length of 40 items, and it did not investigate the heuristics for more than one suspect item. The intent was to reanalyze the mapping of corrected SIBTEST values to ETS ΔMH DIF classification heuristics for a single item, given limitations identified in important SIBTEST studies conducted under a condition of changing ETS criteria (Jiang & Stout, 1998; Roussos & Stout, 1996; Shealy & Stout, 1993). With the opportunity of increased computing power, a wider range of conditions was possible to provide a predictive model between the two DIF procedures. Further, there was the opportunity to include the regression correction added to SIBTEST (Jiang & Stout, 1998) in this simulation study, which was not incorporated at the time of the original study. The purpose of the study is not to argue against the use of the current SIBTEST heuristics, but to show that prior heuristics do not align with the ETS DIF classifications using ΔMH as well as the new heuristics identified in this research do, especially when data fit a 3PL model. Therefore, the use of ETS guidelines for the creation of new heuristics for the SIBTEST procedure can be perceived as a strength for those valuing the ETS classifications, or a limitation for those who do not consider the ETS classifications an ideal for comparison. The ETS guidelines for classification of DIF with ΔMH heuristics are widely used; however, the development of the heuristics used for ΔMH was only found through a secondary source via personal communication (Zwick, 2012). Future research is recommended to evaluate how the new heuristics compare with ΔMH under a variety of additional simulated sample size and test length conditions in order to increase generalizability.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
References
Holland, P. W., & Thayer, D. T. (1986). Differential item performance and the Mantel-Haenszel procedure [Paper presentation]. American Educational Research Association Annual Meeting, San Francisco, CA, United States.
Jiang, H., & Stout, W. (1998). Improved Type I error control and reduced estimation bias for DIF detection using SIBTEST. Journal of Educational and Behavioral Statistics, 23(4), 291-322. https://doi.org/10.2307/1165279
Jodoin, M. G., & Gierl, M. J. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14(4), 329-349. https://doi.org/10.1207/S15324818AME1404_2
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847-862. https://doi.org/10.3758/BRM.42.3.847
Narayanan, P., & Swaminathan, H. (1994). Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning. Applied Psychological Measurement, 18(4), 315-328. https://doi.org/10.1177/014662169401800403
Pei, L. K., & Li, J. (2010). Effects of unequal ability variances on the performance of logistic regression, Mantel-Haenszel, SIBTEST IRT, and IRT likelihood ratio for DIF detection. Applied Psychological Measurement, 34(6), 453-456. https://doi.org/10.1177/0146621610367789
R Development Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org
Roussos, L. A., Schnipke, D. L., & Pashley, P. J. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24(3), 293-322. https://doi.org/10.2307/1165326
Roussos, L. A., & Stout, W. F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel Type I error performance. Journal of Educational Measurement, 33(2), 215-230.
Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. https://doi.org/10.1007/BF02294572
Sohn, W. J. (2001). Using differential item functioning (DIF) procedures for investigating the cross-cultural equivalence of personality tests [Unpublished doctoral dissertation]. University of Illinois at Urbana-Champaign.
Turner, R. C., Gitchel, W. D., & Keiffer, E. A. (2011, April). Comparing Type I error and power rates in DIF analyses when combining significance tests with effect size criteria [Paper presentation]. National Council on Measurement in Education Annual Conference, New Orleans, LA.
Walker, C. M., Zhang, B., Banks, K., & Cappaert, K. (2011). Establishing effect size guidelines for interpreting the results of differential bundle functioning analyses using SIBTEST. Educational and Psychological Measurement, 72(3), 415-434. https://doi.org/10.1177/0013164411422250
Wright, K. D., & Oshima, T. C. (2015). An effect size measure for Raju’s differential functioning for items and tests. Educational and Psychological Measurement, 75(2), 338-358. https://doi.org/10.1177/0013164414532944
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Directorate of Human Resources Research and Evaluation, Department of National Defense. https://faculty.educ.ubc.ca/zumbo/DIF/handbook.pdf
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. ETS Research Report Series, 2012(1). https://doi.org/10.1002/j.2333-8504.2012.tb02290.x