Improving Mantel–Haenszel DIF Estimation Through Bayesian Updating

Abstract

This study demonstrates how the stability of Mantel–Haenszel (MH) DIF (differential item functioning) methods can be improved by integrating information across multiple test administrations using Bayesian updating (BU). The authors conducted a simulation that showed that this approach, which is based on earlier work by Zwick, Thayer, and Lewis, can yield more accurate DIF estimation and improve the detection of DIF items, even when compared to other approaches that aggregate data across administrations. The authors also applied the method to data from several college-level tests. The BU approach provides a natural way to accumulate all known DIF information about each test item while mitigating the undesirable bias toward zero that affected the performance of two previous Bayesian DIF methods.

Keywords

differential item functioning, test fairness, Bayesian methods

The Mantel–Haenszel (MH) method of screening items for differential item functioning (DIF), developed by Holland and Thayer (1988), is used routinely by many large-scale testing programs. The MH results can be used to designate items as “A” (negligible or nonsignificant DIF), “B” (slight-to-moderate DIF), or “C” (moderate-to-large DIF) using criteria developed at Educational Testing Service (ETS) that involve both the magnitude of the MH value and its statistical significance (Zieky, 1993, p. 342). Because MH results are unstable in small samples, these analyses are typically performed only when certain sample size requirements are met. The enforcement of sample size criteria may mean that DIF analyses are not performed for certain test administration modes, such as computerized adaptive tests, in which some items may be administered to only a small number of individuals. Stringent sample size requirements may also mean that DIF analyses are not performed for certain ethnic groups, such as Native Americans. The challenge presented by small demographic groups is likely to be exacerbated by the trend toward more specific definition of ethnic and racial groups. For example, the 2000 U.S. Census reported data for 63 racial categories for certain purposes—6 main categories and 57 combinations (www.census.gov/census2000/raceqandas.html). Because of this phenomenon, it may be advantageous to adopt a DIF analysis method that can provide stable small-sample results.

Zwick, Thayer, and Lewis (1999, 2000; Zwick & Thayer, 2002) developed an empirical Bayes (EB) approach to MH DIF analysis, intended to produce more stable results in these situations. In the EB approach, a prior distribution for the DIF parameter ω is assumed. As described below, data from the current test administration, consisting of MH statistics and their standard errors, are used as a basis for estimating the mean and variance of the prior. Because the prior distribution and the likelihood function are both assumed normal, the posterior distribution of ω is also normal and its mean serves as the EB estimate of DIF. The EB approach was found to produce more stable DIF estimates than the ordinary MH method. Also, a loss-function-based DIF detection rule that made use of the EB results was better able to identify DIF items than the ETS “ABC” classification system. Superiority of the EB approach was most evident in small samples, as would be expected from theory.

In a modification of this approach, Sinharay, Dorans, Grant, and Blew (2009) used data from previous administrations, rather than the current data set, to estimate the prior mean and variance. They referred to their procedure as a fully Bayes (FB) approach. In their study, the performance of the FB method tended to be virtually identical to that of the EB method,¹ again providing more stable results than the ordinary MH statistic.

However, a particular objection was raised in connection with the EB and FB versions of MH DIF analysis: that they lead the user to pay less attention to DIF. In these Bayesian approaches, each MH statistic is “shrunk” toward the prior mean. As explained below, the EB and FB methods estimate the prior mean using the average MH from one or more test administrations. Because of the self-norming property of MH DIF statistics (the constraint that the MH values for an administration sum to approximately zero when the matching variable is the test score itself), the EB and FB methods will always cause the MH statistics to shrink toward zero.

Bayesian updating (BU) of DIF results was proposed (but not implemented) by Zwick et al. (2000) as a possible elaboration of their EB approach that could be applied for items administered more than once. The BU approach provides a natural way to accumulate all known DIF information about each test item while mitigating the undesirable bias toward zero demonstrated by the EB and FB approaches. As described below, the BU procedure generally involves shrinkage toward zero at the first administration only. Our study demonstrates how the stability of MH DIF methods can be improved by integrating information across multiple test administrations using BU.

In the next section, we provide the formulas associated with the EB, FB, and BU methods, as well as a simplified example of their application. In the subsequent two sections, we describe a simulation study that compares the BU DIF method to several competing methods in terms of the properties of the DIF estimators and the performance of DIF detection rules, respectively. In the fourth section, we discuss our application of the method to data from a college-level test in literature. In the final section, we summarize our results, discuss the limitations of the study, and outline possible future research directions.

The EB, FB, and BU Methods: Formulas and an Example

The BU approach builds on the EB approach, which is described first. The relation of the FB to the EB method is also explained. Using the fact that the log of the MH odds ratio has an asymptotic normal distribution (Agresti, 1990), we assume that

$${MH}_{i} \mid ω_{i} \sim N(ω_{i}, σ_{i}^{2}),$$

where ${MH}_{i}$ is the Mantel–Haenszel DIF statistic for item i, $σ_{i}^{2}$ is the sampling variance of ${MH}_{i}$, and $ω_{i}$ is the unknown parameter value corresponding to ${MH}_{i}$. We further assume the following prior distribution for the DIF parameter $ω_{i}$:

$$ω_{i} \sim N(μ, τ^{2}). \tag{1}$$

Under these assumptions, we can use standard Bayesian calculations (e.g., Box & Tiao, 1973) to show that the posterior distribution of $ω_{i}$, given the observed DIF statistics, is as follows:

$$ω_{i} \mid {MH}_{i} \sim N[W_{i} {MH}_{i} + (1 - W_{i}) μ, \; W_{i} σ_{i}^{2}], \tag{2}$$

where

$$W_{i} = \frac{τ^{2}}{σ_{i}^{2} + τ^{2}}. \tag{3}$$

The estimate of the posterior mean, $W_{i} {MH}_{i} + (1 - W_{i}) μ$, is the EB estimate of DIF for item i. Thus, the DIF estimate is a weighted combination of the observed MH statistic and the prior mean. The smaller the sampling error of ${MH}_{i}$, the larger the weight it receives.
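The posterior calculation in Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `eb_posterior` and the input values are assumptions chosen for demonstration.

```python
import math

def eb_posterior(mh, se, prior_mean, prior_var):
    """Posterior mean and SD of the DIF parameter under the normal-normal model."""
    sigma2 = se ** 2                          # sampling variance of MH_i
    w = prior_var / (sigma2 + prior_var)      # weight W_i (Equation 3)
    post_mean = w * mh + (1.0 - w) * prior_mean
    post_sd = math.sqrt(w * sigma2)           # posterior variance is W_i * sigma_i^2
    return post_mean, post_sd

# Illustrative inputs: MH = 1, SE = .6, prior mean 0, prior variance .5
mean, psd = eb_posterior(mh=1.0, se=0.6, prior_mean=0.0, prior_var=0.5)
print(round(mean, 4), round(psd, 4))  # 0.5814 0.4575
```

As the sampling variance grows relative to the prior variance, the weight falls and the estimate moves closer to the prior mean.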

Estimation of $σ_{i}^{2}$, $μ$, $τ^{2}$, and $W_{i}$

As described in Zwick et al. (1999, 2000), the observed squared standard error of ${MH}_{i}$, $SE^{2}({MH}_{i})$ (see Phillips & Holland, 1987), is used as an estimate of $σ_{i}^{2}$. The prior mean is estimated by averaging the ${MH}_{i}$ statistics for the test administration, and the prior variance is estimated by subtracting the across-item average of the $SE^{2}({MH}_{i})$ values from the across-item variance of the ${MH}_{i}$ statistics:

$$\hat{μ} = \frac{1}{n} \sum_{i=1}^{n} {MH}_{i} \quad \text{and} \quad \hat{τ}^{2} = \frac{1}{n-1} \left[\sum_{i=1}^{n} {MH}_{i}^{2} - \frac{1}{n} \left(\sum_{i=1}^{n} {MH}_{i}\right)^{2}\right] - \frac{1}{n} \sum_{i=1}^{n} SE^{2}({MH}_{i}), \tag{4}$$

where n is the number of items in the test. The weight $W_{i}$ is estimated by substituting the estimates of $σ_{i}^{2}$ and $τ^{2}$ into Equation 3.
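The prior estimates in Equation 4 can be computed directly from the item-level MH statistics and standard errors. The following is a sketch only: the function name `estimate_prior` is hypothetical, and truncating a negative variance estimate at zero is an assumption made here for illustration, not a step described in the text.

```python
def estimate_prior(mh_values, se_values):
    """Estimate the prior mean and variance from one administration (Equation 4)."""
    n = len(mh_values)
    total = sum(mh_values)
    mu_hat = total / n                                    # average MH statistic
    sum_sq = sum(x * x for x in mh_values)
    var_mh = (sum_sq - total ** 2 / n) / (n - 1)          # across-item variance of MH
    mean_se2 = sum(s * s for s in se_values) / n          # average squared SE
    tau2_hat = var_mh - mean_se2
    # Assumption: a negative difference is truncated at zero
    return mu_hat, max(tau2_hat, 0.0)

# Made-up values for four items
mu_hat, tau2_hat = estimate_prior([1.0, -1.0, 2.0, -2.0], [0.5, 0.5, 0.5, 0.5])
print(mu_hat, round(tau2_hat, 4))
```

For the FB variant described below, the same function could be applied to the pooled statistics from previous administrations.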

Zwick, Thayer, and Lewis conducted a series of validity studies of the EB method. For example, to investigate the sensitivity of the method to the assumption that the prior was normal, they experimented with alternative priors, such as truncated normal distributions. They also compared the estimator of $τ^{2}$ in Equation 4 to the estimator used by Longford, Holland, and Thayer (1993) and the one proposed by Camilli and Penfield (1997). None of these alternatives led to a consistent improvement in results (Zwick, Thayer, & Lewis, 1997, 1999).

The FB method is identical to the EB method except in terms of the data used to obtain the estimates of μ and $τ^{2}$ in Equation 4. In the FB method, pooled data from multiple previous administrations of the test are used instead of data from the current administration. Therefore, in the FB method, the summations in Equation 4 are over multiple administrations. Although Sinharay et al. (2009) found the performance of the FB and EB to be extremely similar, we included the FB in our study for further comparison.

The BU approach is an elaboration of the EB method that can be applied when test forms are administered multiple times. Each time an item is administered, the BU approach uses the estimates of the posterior mean and variance obtained from the immediately preceding administration (see Equations 2 and 3) as the prior mean and variance in the current application. Conceptually, the approach is simple, but it does introduce additional computational complexity in that each item has “its own” prior mean and variance. This is not the case in the EB and FB methods, which use a prior that is common across items (see Equation 1).

To see that the BU procedure can lead to results very different from the EB or FB approaches, consider the following simplified example. Suppose that an item is administered 4 times, each time resulting in an MH statistic of 1 and an $SE({MH}_{i})$ of .6. Suppose further that the prior mean and variance, as defined in the EB procedure, are 0 and .5, respectively, in each administration. Then, using Equations 2 and 3, the EB procedure would produce a DIF estimate of .5814 on each administration, with a posterior standard deviation (${PSD}_{i}$) of .4575. The FB procedure (if applied in Administrations 2–4 using all previous administrations’ data to estimate prior parameters) would result in exactly the same DIF results as EB in this example. The results for the BU approach are given in Table 1. The first BU estimate is the same as the EB estimate of .5814 by definition and has a ${PSD}_{i}$ of .4575. However, after four administrations, the BU estimate is .8475, with a ${PSD}_{i}$ that has been reduced to .2762. Thus, the BU procedure offers the increased stability of Bayesian estimates, without the degree of shrinkage toward zero that results from the EB or FB methods. The disparity between the BU estimate of .8475 and the other estimates occurs because the EB and FB procedures are “oblivious” to the fact that the item has repeatedly shown a DIF value of 1, while the BU procedure takes this into account.

Table 1

Prior and Posterior Means and Standard Deviations for the Bayesian Updating (BU) Procedure Applied to the Example Data

Administration	Prior Mean	Prior SD	Posterior Mean (BU DIF Estimate)	Posterior SD
1	0	.7071	.5814^a	.4575
2	.5814	.4575	.7353	.3638
3	.7353	.3638	.8065	.3111
4	.8065	.3111	.8475	.2762

^aIt is assumed that ${MH}_{i} = 1$ and $SE({MH}_{i}) = .6$ in all four administrations. In this example, the empirical Bayes and fully Bayes methods would produce a DIF estimate of .5814 in each administration.
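The updating recursion can be sketched as a loop in which each administration's posterior mean and variance (Equations 2 and 3) become the next administration's prior. The function name `bayesian_update` is hypothetical; the inputs are those of the example (MH = 1 and SE = .6 at each of four administrations, with an initial prior mean of 0 and prior variance of .5).

```python
import math

def bayesian_update(mh_by_admin, se_by_admin, prior_mean, prior_var):
    """Apply Equations 2 and 3 sequentially, carrying the posterior forward."""
    results = []
    mean, var = prior_mean, prior_var
    for mh, se in zip(mh_by_admin, se_by_admin):
        sigma2 = se ** 2
        w = var / (sigma2 + var)              # weight given the current prior
        mean = w * mh + (1.0 - w) * mean      # posterior mean
        var = w * sigma2                      # posterior variance
        results.append((mean, math.sqrt(var)))
    return results

for admin, (post_mean, psd) in enumerate(
        bayesian_update([1.0] * 4, [0.6] * 4, prior_mean=0.0, prior_var=0.5), start=1):
    print(admin, round(post_mean, 4), round(psd, 4))
```

Because the prior variance shrinks at every step, later observations receive progressively smaller weights, yet the estimate keeps moving toward the repeatedly observed value rather than back toward zero.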

A natural question that arises concerning the BU approach is whether comparable performance could be achieved using non-Bayesian methods of aggregating the data from multiple administrations. To investigate this, we studied two additional DIF statistics based on aggregation: the Average MH and the Combined-data MH. For item i, the Average MH statistic, $\overline{MH}_{i}$, was defined as the weighted average of the MH statistics from multiple administrations, computed as

$$\overline{MH}_{i} = \frac{1}{\sum_{a=1}^{A} N_{a}} \sum_{a=1}^{A} N_{a} {MH}_{ia}, \tag{5}$$

where A is the number of administrations, $N_{a}$ is the number of respondents in Administration a, and ${MH}_{ia}$ is the MH statistic for item i in Administration a. The standard error of $\overline{MH}_{i}$ is computed as

$$SE(\overline{MH}_{i}) = \sqrt{\left(\sum_{a=1}^{A} N_{a}\right)^{-2} \sum_{a=1}^{A} N_{a}^{2} \, SE^{2}({MH}_{ia})}. \tag{6}$$

The Combined-data MH was obtained by simply combining the data from multiple administrations and computing the MH statistics and their standard errors on the combined data.
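The Average MH statistic and its standard error can be computed from stored per-administration results alone. A minimal sketch follows; the function name `average_mh` and the sample sizes in the usage lines are illustrative assumptions.

```python
import math

def average_mh(mh_by_admin, se_by_admin, n_by_admin):
    """Sample-size-weighted average of MH statistics and its standard error."""
    total_n = sum(n_by_admin)
    avg = sum(n * mh for n, mh in zip(n_by_admin, mh_by_admin)) / total_n
    # Variance of a weighted sum of independent statistics
    se = math.sqrt(sum((n * s) ** 2 for n, s in zip(n_by_admin, se_by_admin))) / total_n
    return avg, se

# Four administrations with equal sample sizes (made-up values)
avg, se = average_mh([1.0, 1.0, 1.0, 1.0], [0.6, 0.6, 0.6, 0.6], [250, 250, 250, 250])
print(round(avg, 4), round(se, 4))  # 1.0 0.3
```

With equal sample sizes and equal standard errors, the standard error of the average is the single-administration value divided by the square root of the number of administrations.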

The Current Project

The analyses for this project consisted of three major components. First, a simulation study of the properties of the MH, EB, FB, and BU DIF estimates, as well as the Average and Combined-data MH statistics, was conducted. Second, several decision rules for identifying DIF, based on these six kinds of DIF estimates, were compared using the simulated data. Third, the DIF methods were applied to data from several college-level tests. Each of these three studies is presented in turn.

Simulation Study of Properties of DIF Estimators

Method

Four simulation conditions were included; in all four, the reference group had a standard normal ability (θ) distribution. The focal group distribution and group sample sizes were as follows:

Condition 1: N(0, 1) focal group θ distribution; $n_{R} = 500$; $n_{F} = 500$,

Condition 2: N(0, 1) focal group θ distribution; $n_{R} = 200$; $n_{F} = 50$,

Condition 3: N(−1, 1) focal group θ distribution; $n_{R} = 500$; $n_{F} = 500$, and

Condition 4: N(−1, 1) focal group θ distribution; $n_{R} = 200$; $n_{F} = 50$.

Note that Conditions 2 and 4 represent situations in which DIF analyses would not be conducted under ETS rules, which require at least 200 members in the smaller group.

In terms of sample size and ability distributions, the simulation conditions in the present study were the same as those used by Zwick et al. (1999, 2000). In the current project, however, the simulation was extended to correspond to a situation in which a set of 34 items (representing a test form) was administered four different times to independent samples. The True DIF (i.e., the value of the DIF parameter ω) for each item was assumed to remain constant over administrations. Within each of the four administrations, responses to the items were generated using the three-parameter logistic (3PL) model. Item parameters were fixed throughout the study.

DIF was modeled as a difference between focal group difficulties $(b_{iF})$ and reference group difficulties $(b_{iR})$, with $b_{iF} = b_{iR} - d_{i}$. The item discrimination parameters were assumed to be common across groups $(a_{iF} = a_{iR} = a_{i})$ and the guessing parameters were set to $c_{iF} = c_{iR} = .15$ for all items. The $b_{iR}$, $d_{i}$, and $\ln(a_{i})$ values were originally obtained as draws from normal distributions whose parameters were derived from actual test forms.
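The data-generation step can be illustrated with a small sketch that draws abilities from a normal distribution and scores each item under the 3PL model. The helper names are hypothetical, the logistic scaling constant D = 1.7 is an assumption (the text does not state one), and the seed is arbitrary. The two items use the parameter values listed for Items 1 and 2 in Table 2.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant; not stated in the text

def p_3pl(theta, a, b, c=0.15):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_group(n, theta_mean, a, b, c=0.15, rng=None):
    """Draw n examinees from N(theta_mean, 1) and score each item 0/1."""
    if rng is None:
        rng = random.Random()
    rows = []
    for _ in range(n):
        theta = rng.gauss(theta_mean, 1.0)
        rows.append([int(rng.random() < p_3pl(theta, ai, bi, c))
                     for ai, bi in zip(a, b)])
    return rows

# Focal difficulty is b_F = b_R - d
a = [0.49, 0.67]
b_ref = [0.39, 1.66]
d = [-0.06, 0.51]
b_foc = [br - di for br, di in zip(b_ref, d)]

rng = random.Random(12345)                          # arbitrary seed
ref = simulate_group(200, 0.0, a, b_ref, rng=rng)   # Condition 4 sample sizes
foc = simulate_group(50, -1.0, a, b_foc, rng=rng)
print(len(ref), len(foc))
```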

To express the true amount of DIF in the same metric as the Mantel–Haenszel delta difference of Holland and Thayer (1988), the following formulation (see Zwick, Thayer, & Lewis, 2000) was used:

$$ω_{i} = -2.35 \int \ln\left[\frac{P_{iR}(θ) / Q_{iR}(θ)}{P_{iF}(θ) / Q_{iF}(θ)}\right] f_{R}(θ) \, dθ,$$

where $P_{iG}(θ)$ is the item response function for Group G, $Q_{iG}(θ) = 1 - P_{iG}(θ)$, and $f_{R}(θ)$ is the reference group ability distribution. The True DIF ($ω_{i}$) values ranged from .09 to 3.12 in absolute value and were considered the targets of estimation. The item parameters and True DIF values for each item are shown in Table 2. Based on the ETS definitions, these items can be categorized according to their true status. Those with True DIF values less than 1 in magnitude are true A items, those with values greater than 1, but less than 1.5 in magnitude are true B items, and those with values greater than 1.5 in magnitude are true C items. The true DIF status of each item is also included in Table 2. (The right-most column of Table 2 is discussed later.)

Table 2

Item Parameters and True DIF Values

Item	$d_{i}$	$b_{iR}$	$b_{iF}$	$a_{i}$	True DIF	True Classification	$|\text{True DIF}| > \sqrt{1.5}$
1	−0.06	0.39	0.45	0.49	−0.09	A
2	0.51	1.66	1.15	0.67	0.76	A
3	−0.23	−0.35	−0.12	1.00	−0.76	A
4	−0.17	1.37	1.54	0.64	−0.25	A
5	0.32	0.30	−0.02	1.28	1.18	B+
6	−0.63	0.27	0.90	0.54	−1.04	B−
7	−0.56	0.02	0.58	0.81	−1.36	B−	√
8	0.11	−1.17	−1.28	0.69	0.30	A
9	−0.18	1.32	1.50	0.67	−0.27	A
10	0.05	−1.51	−1.56	0.82	0.15	A
11	−0.18	−1.27	−1.09	0.67	−0.45	A
12	−0.58	0.22	0.80	0.63	−1.10	B−
13	0.10	−0.50	−0.60	0.77	0.28	A
14	−0.70	−1.48	−0.78	1.01	−2.64	C−	√
15	−0.04	0.18	0.22	1.30	−0.15	A
16	0.45	−0.27	−0.72	1.17	1.83	C+	√
17	0.21	0.33	0.12	0.87	0.56	A
18	−0.05	0.40	0.45	1.00	−0.14	A
19	−0.39	1.98	2.37	0.64	−0.40	A
20	−0.23	−0.66	−0.43	1.57	−1.23	B−	√
21	−0.83	−0.23	0.60	0.70	−1.85	C−	√
22	0.46	−1.37	−1.83	1.15	2.03	C+	√
23	0.22	−1.64	−1.86	0.69	0.58	A
24	0.26	0.82	0.56	0.74	0.52	A
25	−0.55	1.09	1.64	0.87	−0.92	A
26	0.78	0.06	−0.72	1.20	3.12	C+	√
27	0.59	1.69	1.10	1.11	1.02	B+
28	0.60	0.81	0.21	0.70	1.23	B+	√
29	−0.46	1.50	1.96	0.76	−0.61	A
30	−0.05	0.29	0.34	0.76	−0.12	A
31	0.69	0.98	0.29	1.04	1.80	C+	√
32	−0.27	2.25	2.52	0.68	−0.24	A
33	0.36	−1.63	−1.99	0.48	0.67	A
34	−0.63	2.73	3.36	1.09	−0.20	A

Note: All guessing parameters were set to .15.
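The True DIF integral can be approximated numerically. The sketch below uses the trapezoid rule over a standard normal reference distribution and relies on the 3PL identity ln(P/Q) = ln(c + e^z) − ln(1 − c), where z = D·a·(θ − b). The scaling constant D = 1.7, the integration range, and the function names are assumptions, so the output need not reproduce the Table 2 values exactly.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the text

def log_odds_3pl(theta, a, b, c):
    """ln(P/Q) for the 3PL model, written as ln(c + e^z) - ln(1 - c)."""
    z = D * a * (theta - b)
    return math.log(c + math.exp(z)) - math.log(1.0 - c)

def true_dif(a, b_ref, b_foc, c=0.15, lo=-6.0, hi=6.0, steps=2400):
    """Trapezoid approximation of the True DIF integral under N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        theta = lo + k * h
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        diff = log_odds_3pl(theta, a, b_ref, c) - log_odds_3pl(theta, a, b_foc, c)
        weight = 0.5 if k in (0, steps) else 1.0   # end points get half weight
        total += weight * diff * density * h
    return -2.35 * total

# Item 26 parameters from Table 2
print(round(true_dif(a=1.20, b_ref=0.06, b_foc=-0.72), 2))
```

When c = 0 the integrand is constant and the result reduces to 2.35·D·a·d; a nonzero guessing parameter attenuates the value toward zero.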

The number of replications per simulation condition was 500. Hence, within each simulation condition, each item was “administered” 2,000 times (4 Administrations × 500 Replications). The DIF statistics that were computed are listed below, along with an explanation of how, if at all, the statistic was modified across administrations.

MH statistic and standard error: These statistics were computed in all four administrations. Since the statistics are based on data from only the current administration, the four administrations within each simulation condition essentially represent additional replications. That is, other than sampling error, there is no reason to expect the MH performance to vary across the four administrations within a condition.

EB DIF statistic and PSD: These statistics were computed in all four administrations. As is the case with the MH statistics, there is no reason to expect the performance of the EB statistics to vary across administrations within a condition.

FB DIF statistic and PSD: These statistics were computed in Administrations 3 and 4, using the pooled data from all previous administrations to estimate the prior mean and variance, as outlined earlier. In Administration 3, the data from Administrations 1 and 2 were used; in Administration 4, the data from Administrations 1–3 were used.

BU DIF statistic and PSD: These updated statistics were computed in Administrations 2–4. Note that the EB statistics from Administration 1 serve as the first step in computing the subsequent BU statistics (see Table 1). In updating the DIF results for an item, data from the “same” replication in the previous administration were used. For example, the posterior mean and PSD from Replication 300 of Item 7 in Administration 3 were used as the prior mean and standard deviation in obtaining DIF results for Replication 300 of Item 7 in Administration 4.

Average MH statistic and standard error: These statistics were computed at Administration 4 only, based on data from all four administrations.

Combined-data MH statistic and standard error: These statistics too were computed at Administration 4 only, based on data from all four administrations.

For each item in each replication of each condition and administration, the various DIF statistics were computed. To summarize the performance of each method, squared bias $(B^{2}(\hat{ω}))$, variance $(Var(\hat{ω}))$, and root mean square residual $(RMSR(\hat{ω}))$ statistics were computed for each item according to the following formulas (with “i” subscripts omitted for simplicity):

$$B^{2}(\hat{ω}) = (\bar{\hat{ω}} - ω)^{2},$$

$$Var(\hat{ω}) = \frac{1}{R} \sum_{r=1}^{R} (\hat{ω}_{r} - \bar{\hat{ω}})^{2},$$

and

$$RMSR(\hat{ω}) = \sqrt{\frac{1}{R} \sum_{r=1}^{R} (\hat{ω}_{r} - ω)^{2}} = \sqrt{B^{2}(\hat{ω}) + Var(\hat{ω})},$$

where $\hat{ω}_{r}$ represents the DIF estimate (MH, EB, FB, or BU) for replication r, $\bar{\hat{ω}} = \frac{1}{R} \sum_{r=1}^{R} \hat{ω}_{r}$ is the average of $\hat{ω}_{r}$ across replications, ω is the True DIF value, and $R = 500$ is the number of replications.
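These summary statistics can be sketched as follows. The variance uses the divisor R and no square root, which is what makes the squared RMSR equal to the sum of squared bias and variance. The function name `summarize` and the sample estimates are illustrative assumptions.

```python
import math

def summarize(estimates, true_value):
    """Squared bias, variance, and RMSR of DIF estimates across replications."""
    r = len(estimates)
    mean_est = sum(estimates) / r
    sq_bias = (mean_est - true_value) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / r   # divisor R
    rmsr = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / r)
    return sq_bias, variance, rmsr

# Made-up estimates for one item whose True DIF is 1.0
sq_bias, variance, rmsr = summarize([0.8, 1.0, 1.2, 1.4], true_value=1.0)
print(round(sq_bias, 4), round(variance, 4), round(rmsr, 4))
```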

One way in which the DIF analyses in this study differed from those of Zwick et al. (1999, 2000) was that no refinement was used in the present study. (Sinharay, Dorans, Grant, and Blew [2009] did not conduct refinement in their study either.) In DIF refinement as applied to the Mantel–Haenszel procedure, items that are found to have DIF in an initial round of analysis are deleted from the matching variable in the final analysis. (An exception to this is that the studied item is always included in the matching variable.) Our decision to forgo refinement was influenced by the fact that the testing program whose data we analyzed in the third phase of our study does not use refinement in its own analyses. To investigate the sensitivity of our simulation results to this decision, we conducted both refined and unrefined analyses of the data from Conditions 1 and 4. We found that, averaging across items, the unrefined results were slightly more accurate than the refined ones. This outcome confirmed our decision to conduct our primary analyses without refinement. (The unexpected refinement results led us to conduct a comprehensive study of DIF refinement, which is nearing completion.)

Results

Table 3 shows the minimum, average, and maximum $RMSR(\hat{ω})$, $B^{2}(\hat{ω})$, and $Var(\hat{ω})$ across items for each condition for Administration 3. Table 4 includes the corresponding information for Administration 4. These two administrations are the most informative of the four. As explained above, Administration 1 does not include BU or FB results; Administration 2 does not include FB results. For the MH and EB procedures, the four administrations represent interchangeable sets of simulation data, since analyses are based on within-administration data only for these methods.

Table 3

Minimum, Average, and Maximum RMSR, Squared Bias, and Variance for Administration 3

	RMSR			Squared Bias			Variance
Method	Min	Mean	Max	Min	Mean	Max	Min	Mean	Max
a. Condition 1
MH	0.331	0.390	0.619	0.000	0.005	0.036	0.105	0.151	0.358
EB	0.300	0.372	0.723	0.000	0.027	0.297	0.090	0.120	0.226
FB	0.298	0.370	0.717	0.000	0.028	0.307	0.088	0.118	0.207
BU	0.175	0.226	0.437	0.000	0.009	0.097	0.030	0.045	0.094
b. Condition 2
MH	0.830	1.046	1.646	0.000	0.007	0.044	0.688	1.121	2.709
EB	0.544	0.746	1.661	0.000	0.295	2.360	0.216	0.341	0.477
FB	0.541	0.730	1.632	0.001	0.299	2.426	0.162	0.311	0.380
BU	0.391	0.525	1.145	0.000	0.095	1.058	0.146	0.209	0.344
c. Condition 3
MH	0.368	0.473	0.943	0.000	0.073	0.644	0.129	0.165	0.246
EB	0.320	0.469	1.120	0.000	0.129	1.094	0.102	0.124	0.162
FB	0.317	0.468	1.119	0.000	0.130	1.100	0.100	0.121	0.153
BU	0.197	0.334	0.945	0.000	0.091	0.819	0.038	0.048	0.075
d. Condition 4
MH	0.923	1.097	1.430	0.000	0.083	0.664	0.849	1.135	1.638
EB	0.507	0.808	2.115	0.000	0.505	4.127	0.225	0.279	0.486
FB	0.487	0.794	2.091	0.000	0.511	4.152	0.191	0.249	0.299
BU	0.409	0.630	1.555	0.000	0.264	2.160	0.152	0.196	0.347

Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Min, mean, and max values are based on distributions over 500 × 34 = 17,000 replications.

Table 4

Minimum, Average, and Maximum RMSR, Squared Bias, and Variance for Administration 4

	RMSR			Squared Bias			Variance
Method	Min	Mean	Max	Min	Mean	Max	Min	Mean	Max
a. Condition 1
MH	0.311	0.390	0.648	0.000	0.005	0.035	0.096	0.151	0.385
EB	0.297	0.372	0.753	0.000	0.029	0.327	0.082	0.120	0.239
FB	0.295	0.370	0.745	0.000	0.029	0.338	0.081	0.117	0.217
BU	0.155	0.199	0.397	0.000	0.008	0.085	0.023	0.034	0.073
Average MH	0.160	0.201	0.327	0.000	0.005	0.033	0.024	0.037	0.085
Combined-data MH	0.160	0.198	0.330	0.000	0.005	0.039	0.024	0.035	0.080
b. Condition 2
MH	0.854	1.051	1.708	0.000	0.007	0.049	0.729	1.135	2.918
EB	0.543	0.747	1.650	0.002	0.289	2.320	0.239	0.346	0.482
FB	0.531	0.732	1.622	0.001	0.300	2.404	0.165	0.312	0.361
BU	0.357	0.472	1.059	0.000	0.072	0.888	0.125	0.173	0.270
Average MH	0.416	0.528	0.843	0.000	0.006	0.027	0.173	0.281	0.710
Combined-data MH	0.384	0.484	0.842	0.000	0.004	0.028	0.147	0.237	0.702
c. Condition 3
MH	0.364	0.472	0.916	0.000	0.074	0.549	0.121	0.163	0.290
EB	0.327	0.469	1.087	0.000	0.128	0.996	0.098	0.122	0.189
FB	0.327	0.467	1.087	0.000	0.130	1.010	0.096	0.119	0.176
BU	0.180	0.312	0.905	0.000	0.087	0.756	0.030	0.037	0.062
Average MH	0.186	0.307	0.833	0.000	0.073	0.621	0.032	0.041	0.073
Combined-data MH	0.184	0.305	0.838	0.000	0.074	0.632	0.031	0.040	0.070
d. Condition 4
MH	0.923	1.103	1.551	0.000	0.080	0.506	0.853	1.158	2.199
EB	0.501	0.809	2.042	0.004	0.491	3.828	0.223	0.287	0.456
FB	0.492	0.791	2.024	0.004	0.503	3.888	0.196	0.246	0.298
BU	0.380	0.583	1.430	0.000	0.230	1.821	0.122	0.166	0.301
Average MH	0.469	0.590	0.965	0.000	0.075	0.511	0.206	0.286	0.478
Combined-data MH	0.443	0.566	1.005	0.000	0.095	0.624	0.180	0.241	0.386

Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Min, mean, and max values are based on distributions over 500 × 34 = 17,000 replications.

All procedures performed best in Condition 1, when samples were large $(n_{R} = 500; n_{F} = 500)$ and the two groups had the same ability distribution, and worst in Condition 4, when sample sizes were smaller $(n_{R} = 200; n_{F} = 50)$ and the groups’ ability distributions differed by 1 SD. The finding that the DIF methods performed worse when ability distributions differed is consistent with previous theoretical and empirical results on observed-score DIF methods (e.g., see the discussion in Zwick et al., 1997). In these circumstances, measurement error impairs the accuracy with which reference and focal group members are matched.

It is striking that in Conditions 2 and 4, the mean RMSR for the MH statistic exceeded 1 in both administrations, indicating that, on average, ${M H}_{i}$ values differed from their target values by a substantial amount. In contrast, the mean RMSR for the BU statistic in these conditions ranged from .47 to .63. For the two other Bayesian approaches, the mean RMSR ranged from .73 to .81.

Table 4 shows that the Average MH and Combined-data MH statistics had mean RMSRs that were very similar to each other and to those of BU except in Condition 2, where the mean RMSRs for BU (.47) and for the Combined-data MH (.48) were notably smaller than the RMSR for the Average MH (.53). The minimum RMSR values for the Average and Combined-data MH statistics were always larger than those of the BU statistic, but the maximum RMSR values were smaller than those of BU.

In all conditions and administrations in Tables 3 and 4, BU had the smallest mean variance; BU and the two aggregated MH methods (i.e., the Average and Combined-data MH) had similar mean RMSR values that were smaller than those of the competing procedures. As expected, the performance of BU in Administration 4 was better than in Administration 3. (Similarly, the Administration 3 results were superior to those of Administration 2 [not shown].) The three methods that used data from multiple administrations showed a substantial advantage over MH, EB, and FB even in the large-sample conditions (1 and 3). In Condition 1, Administration 4, for example, BU and the aggregated MH methods had mean RMSRs of .20, compared to .37 for EB and FB and .39 for MH. The mean squared bias of the BU procedure was always less than that of the two other Bayesian approaches, but greater than that of the MH and aggregated MH methods.

The performance of the EB and FB procedures was largely indistinguishable, as found by Sinharay et al. (2009). The FB results for Administrations 3 and 4 were very similar to each other as well, indicating that whether the prior was estimated based on two or three administrations did not have much impact on the results.

Table 5 shows, for Administration 4 in each of the four conditions, the average $SE({MH}_{i})$, $SE(\overline{MH}_{i})$, or ${PSD}_{i}$ values, as appropriate, for each of the DIF methods over items and replications. (Although $SE({MH}_{i})$, $SE(\overline{MH}_{i})$, and ${PSD}_{i}$ all express the accuracy of their corresponding DIF statistics, the MH standard errors do not estimate the same quantity as ${PSD}_{i}$, which estimates the standard deviation of the distribution of the DIF parameter, given the data.) In all conditions, BU yielded much more precisely determined results than the MH, EB, and FB methods and slightly more precise results than the two aggregated MH methods. EB and FB offered a substantial advantage over the MH in the small-sample conditions (2 and 4) but only a slight advantage in the large-sample conditions (1 and 3).

Table 5

Average Standard Error (for MH-based Methods) or Posterior Standard Deviation (for EB, FB, and BU) for Administration 4 in the Four Simulation Conditions

DIF Method	Condition 1	Condition 2	Condition 3	Condition 4
MH	0.382	1.027	0.400	1.038
EB	0.362	0.758	0.372	0.703
FB	0.361	0.765	0.372	0.716
BU	0.188	0.461	0.196	0.448
Average MH	0.191	0.516	0.200	0.520
Combined-data MH	0.189	0.481	0.198	0.486

Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Averages were computed over 500 × 34 = 17,000 replications.

Simulation Study of Properties of DIF Decision Rules

Method

The simulation data were also used to compare various rules for DIF flagging. The ETS rules are based on those currently used at ETS. The C rule flags items if they meet the ETS criteria for C status, which state that the absolute value of ${MH}_{i}$ must be at least 1.5 and must be statistically greater than 1 at $α = .05$. The B rule, as implemented in this study, flags items if they at least meet the ETS criteria for B status: The absolute value of ${MH}_{i}$ must be at least 1 and must be statistically different from zero at $α = .05$. Thus, items meeting either the ETS B or C criteria are flagged under the B rule.

At Administration 4, we also applied versions of the B rule and C rule to the two aggregated MH statistics. The decision rules based on the Average MH (which, as far as we know, have not been previously used) have the advantage that they can be implemented using only the ${MH}_{i}$ and $SE({MH}_{i})$ values from past administrations. No new DIF analyses are needed. The statistic in Equation 5 was used as the basis for applying the magnitude criteria for the ETS ABC rules (e.g., $\overline{MH}_{i}$ had to exceed 1.5 in magnitude in order for item i to be considered a C item). The statistical criteria were implemented by dividing $\overline{MH}_{i}$ by its standard error, given in Equation 6.² The Combined-data MH rule was applied by simply combining the data from the four administrations, computing the MH statistics and their standard errors on the combined data, and then implementing the B and C rules. Because some low-volume testing programs do, in fact, accumulate data over time before conducting DIF analysis, this rule parallels actual practice.
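The B and C rules described above can be sketched as a function of a DIF statistic and its standard error. The significance tests are implemented here as simple z tests (two-sided 1.96 against zero, one-sided 1.645 against 1); those test forms and critical values are assumptions made for illustration, and operational procedures may differ. The function name `ets_flags` is hypothetical.

```python
def ets_flags(mh, se, z_two_sided=1.96, z_one_sided=1.645):
    """Return (b_flag, c_flag) for one item under z-test versions of the rules."""
    abs_mh = abs(mh)
    # C rule: magnitude at least 1.5 and significantly greater than 1
    c_flag = abs_mh >= 1.5 and (abs_mh - 1.0) / se > z_one_sided
    # B rule as implemented in the study: flagged if B or C criteria are met
    b_flag = c_flag or (abs_mh >= 1.0 and abs_mh / se > z_two_sided)
    return b_flag, c_flag

print(ets_flags(2.0, 0.4))   # (True, True)
print(ets_flags(1.2, 0.4))   # (True, False)
print(ets_flags(1.2, 1.0))   # (False, False)
```

The same function can be applied to the Average MH statistic and its standard error or to the Combined-data MH results.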

All the remaining rules were based on the results of the three Bayesian estimation procedures. These include an approach to item flagging based on loss functions that was outlined by Paul Holland in unpublished memos (January 27, 1987; February 11, 1987). A rule based on this approach was implemented by Zwick et al. (2000) and included in the study by Sinharay et al. (2009). The loss associated with keeping (failing to flag) item i, $L_{i}(\text{keep})$, is assumed to be quadratic in $ω_{i}$, the true MH DIF value for item i. That is, $L_{i}(\text{keep}) = C ω_{i}^{2}$, where C is a constant. As Zwick et al. (2000) noted, “[t]his means that the seriousness of the loss is assumed to be proportional to the squared distance of the value of $[ω_{i}]$ from its null value of zero. In his second memo, Holland proposed that the loss associated with flagging an item be regarded as constant. The rationale (developed in consultation with ETS staff involved in the DIF review process) is that neither the resources needed for reviewing an item nor the costs associated with eliminating an item depend on the true degree of DIF in the item” (p. 231). That is, $L_{i}(\text{flag}) = K$ for all i.

Holland proposed that $|ω_{i}| = 1$ be used as an indifference point, reasoning that, at $|ω_{i}| = 1$ , we are indifferent as to whether the item is kept or flagged. The loss associated with keeping the item can therefore be set equal to the loss associated with flagging the item at $|ω_{i}| = 1$ , leading to the conclusion that $C ω_{i}^{2} = C = K$ . A rule can then be derived indicating that an item should be flagged if the posterior expected loss of keeping the item exceeds the loss of flagging the item:

Flag if $E (K ω_{i}^{2} |{M H}_{i}) > K$ , which can be restated as

Flag if $\mathrm{Var} (ω_{i} \mid {MH}_{i}) + {[E (ω_{i} \mid {MH}_{i})]}^{2} > 1$. (7)

To implement this rule, we substituted our estimates of the posterior mean and variance into Equation 7. In the current study, we labeled this rule “EB Loss Function–Liberal” (EB LF-L) because Zwick et al. (2000) found that, while DIF detection rates were good for this rule, Type I error rates tended to be high in certain conditions. We therefore included a variant of the rule in the present study:

Flag if $\mathrm{Var} (ω_{i} \mid {MH}_{i}) + {[E (ω_{i} \mid {MH}_{i})]}^{2} > 1.5$. (8)

This can be shown to be equivalent to using a slightly higher indifference point of $\sqrt{1.5} \approx 1.225$. We labeled this rule “EB Loss Function–Conservative” (EB LF-C).
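The link between the flagging threshold and the indifference point can be written out explicitly. In the sketch below, $c$ denotes a generic indifference point; this symbol is introduced only for illustration and is not part of the original notation. Setting the two losses equal at $|ω_{i}| = c$ gives $C c^{2} = K$, so that

$$
\begin{aligned}
E (C ω_{i}^{2} \mid {MH}_{i}) > K
&\iff E (ω_{i}^{2} \mid {MH}_{i}) > K / C = c^{2} \\
&\iff \mathrm{Var} (ω_{i} \mid {MH}_{i}) + {[E (ω_{i} \mid {MH}_{i})]}^{2} > c^{2} .
\end{aligned}
$$

The second line uses the identity $E (ω_{i}^{2} \mid {MH}_{i}) = \mathrm{Var} (ω_{i} \mid {MH}_{i}) + {[E (ω_{i} \mid {MH}_{i})]}^{2}$. Taking $c = 1$ gives the threshold of 1 in the liberal rule, and taking $c = \sqrt{1.5}$ gives the threshold of 1.5 in the conservative rule.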

It is worth noting that the EB LF-L rule is expected to function in essentially the same way as the ETS B rule in very large samples, where statistical significance is virtually assured. In this situation, the ETS B rule (which, as implemented here, identifies items that meet either the B or C criteria) will flag items with ${M H}_{i}$ values of at least 1 in magnitude, which is the goal of the EB LF-L rule as well. The EB LF-C rule is expected to be stricter than the ETS B rule (i.e., flag less often), but not as conservative as the ETS C rule.

Each of the LF rules was implemented using the EB, FB, and BU estimates of the posterior mean and standard deviation of the ω distribution. In addition, we implemented rules based directly on the posterior distribution of ω. In these rules, items were flagged if their posterior probability of C status exceeded certain thresholds. Because these posterior density rules either performed similarly to the LF rules or performed less well, they are not discussed further. The ETS rules and loss function rules are provided in Table 6.
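As a minimal sketch of how the two loss function rules operate on a posterior mean and posterior standard deviation, consider the following. The function name `lf_flag` and the constants `LF_L` and `LF_C` are hypothetical; the input values are the Administration 4 BU estimates and posterior standard deviations reported for Items 1 and 5 in Table 10.

```python
def lf_flag(post_mean, post_sd, threshold):
    """Flag if posterior variance plus squared posterior mean exceeds threshold."""
    return post_sd ** 2 + post_mean ** 2 > threshold

LF_L = 1.0   # liberal rule threshold (indifference point of 1)
LF_C = 1.5   # conservative rule threshold (indifference point of sqrt(1.5))

# (posterior mean, posterior SD) for Administration 4, from Table 10.
item1 = (1.090, 0.198)
item5 = (-1.263, 0.172)

for name, (mean, sd) in {"Item 1": item1, "Item 5": item5}.items():
    print(name, lf_flag(mean, sd, LF_L), lf_flag(mean, sd, LF_C))
```

With these inputs, Item 1 exceeds the liberal threshold but not the conservative one, whereas Item 5 exceeds both, which agrees with the BU LF-C flags shown in Table 10.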

Table 6

Definitions of DIF Decision Rules

Name of Rule	DIF Statistics to Which Rule Applies	Rule for Rejecting H₀ (Null Hypothesis of no DIF)
ETS B-rule	MH	Absolute value of ${M H}_{i}$ is at least 1 and is statistically different from zero at $α = .05$
ETS C-rule	MH	Absolute value of ${M H}_{i}$ is at least 1.5 and is statistically greater than 1 at $α = .05$
Loss Function rule–liberal (LF-L)	EB, FB, BU	$\widehat{\mathrm{Var}} (ω_{i} \mid {MH}_{i}) + {[\hat{E} (ω_{i} \mid {MH}_{i})]}^{2} > 1$. See Equation 7
Loss Function rule–conservative (LF-C)	EB, FB, BU	$\widehat{\mathrm{Var}} (ω_{i} \mid {MH}_{i}) + {[\hat{E} (ω_{i} \mid {MH}_{i})]}^{2} > 1.5$. See Equation 8
Average MH-B rule	Average of MH statistics across administrations $(\overline{{M H}_{i}})$	Absolute value of Average ${M H}_{i}$ is at least 1 and is statistically different from zero at $α = .05$ . See Equations 5 and 6
Average MH-C rule	Average of MH statistics across administrations $(\overline{{M H}_{i}})$	Absolute value of Average ${M H}_{i}$ is at least 1.5 and is statistically greater than 1 at $α = .05$ . See Equations 5 and 6
Combined-data MH-B rule	MH on data pooled across administrations	Absolute value of ${M H}_{i}$ based on pooled data is at least 1 and is statistically different from zero at $α = .05$
Combined-data MH-C rule	MH on data pooled across administrations	Absolute value of ${M H}_{i}$ based on pooled data is at least 1.5 and is statistically greater than 1 at $α = .05$

Note: MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating.
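The ETS rules in Table 6 might be coded roughly as shown below. This is a sketch under stated assumptions rather than the operational ETS implementation: it assumes normal-theory tests, a two-sided critical value of 1.96 for the B rule, and a one-sided critical value of 1.645 for the C rule. The function names are hypothetical. The example inputs are MH estimates and standard errors from Table 10.

```python
def ets_b_flag(mh, se, z_crit=1.96):
    """B-or-C flag: |MH| >= 1 and MH significantly different from zero.

    Assumes a two-sided normal test at alpha = .05.
    """
    return abs(mh) >= 1.0 and abs(mh) / se > z_crit

def ets_c_flag(mh, se, z_crit=1.645):
    """C flag: |MH| >= 1.5 and |MH| significantly greater than 1.

    Assumes a one-sided normal test at alpha = .05.
    """
    return abs(mh) >= 1.5 and (abs(mh) - 1.0) / se > z_crit

# (MH, SE) pairs for Administration 4, from Table 10.
item1 = (1.634, 0.343)
item4 = (1.520, 0.418)

print(ets_c_flag(*item1), ets_c_flag(*item4))
print(ets_b_flag(*item1), ets_b_flag(*item4))
```

Under these assumed critical values, Item 1 meets the C criteria and Item 4 does not, matching the asterisks in Table 10; both items meet the B criteria.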

Results

Results for all four conditions are provided for a total of 12 procedures for flagging items:

ETS B (Administration 1 only; results are interchangeable across administrations)

ETS C (Administration 1 only, as above)

EB LF-L (Administration 1 only, as above)

EB LF-C (Administration 1 only, as above)

FB LF-L (Administrations 3 and 4)

FB LF-C (Administrations 3 and 4)

BU LF-L (Administrations 2–4)

BU LF-C (Administrations 2–4)

Average MH-B rule (Administration 4)

Average MH-C rule (Administration 4)

Combined-data MH-B rule (Administration 4)

Combined-data MH-C rule (Administration 4)

Table 7 shows the percentage of correct flagging decisions for the true A, B, and C items for each flagging rule in each simulation condition, for each relevant test administration. For the true A items, the correct decision is not to identify the item as a DIF item. For all other items, flagging was considered the correct decision for purposes of constructing this table. For ease of comparison, the flagging rules are divided into two categories: liberal rules, which include the various versions of the ETS B rule and LF-L rule, and conservative rules, which include the ETS C rule and LF-C rules.

Table 7

Percentage of Correct Decisions for DIF Flagging Rules

	Condition 1			Condition 2			Condition 3			Condition 4
Rule (Admin)	A	B	C	A	B	C	A	B	C	A	B	C
Liberal rules
ETS B rule (1)	89.2	71.4	97.0	94.8	19.9	45.3	91.6	57.3	88.2	95.5	14.3	36.4
EB LF-L (1)	90.3	68.0	97.3	66.6	53.9	79.7	93.5	51.0	85.8	77.2	36.8	58.0
FB LF-L (3)	90.4	69.5	97.4	68.0	56.2	82.5	93.9	51.3	85.8	79.0	36.5	57.8
FB LF-L (4)	90.5	68.3	97.4	67.1	53.9	82.3	93.8	49.9	86.9	79.3	35.9	61.5
BU LF-L (2)	93.6	75.9	99.4	79.4	56.3	86.3	96.2	54.1	91.6	86.2	37.3	65.7
BU LF-L (3)	94.9	80.8	99.9	84.9	58.8	90.1	97.3	56.4	94.1	90.9	39.8	70.7
BU LF-L (4)	95.5	82.9	100.0	87.7	61.4	92.8	98.0	56.4	95.4	92.9	39.6	74.0
Average MH-B rule (4)	95.1	84.5	100.0	84.8	65.9	92.2	97.6	59.8	96.3	88.8	50.9	82.7
Combined-data MH-B rule (4)	95.3	84.4	100.0	86.7	66.1	92.0	97.7	58.8	95.9	90.6	50.5	82.1
Conservative rules
ETS C rule (1)	99.6	15.7	76.9	98.4	7.4	29.6	99.7	10.5	54.1	98.6	5.6	21.4
EB LF-C (1)	96.9	41.8	92.5	84.1	34.9	62.6	97.8	28.2	72.5	90.0	20.8	40.1
FB LF-C (3)	97.0	42.6	92.2	85.8	35.5	63.0	98.4	27.5	72.0	92.8	19.4	37.1
FB LF-C (4)	96.8	41.5	93.0	85.4	33.5	63.5	98.2	26.8	72.2	92.7	17.5	39.4
BU LF-C (2)	98.7	42.0	97.4	90.8	35.9	74.1	99.4	27.4	78.0	94.7	21.6	50.4
BU LF-C (3)	99.2	41.9	98.9	93.8	37.2	80.1	99.7	27.2	80.7	96.7	21.9	55.4
BU LF-C (4)	99.4	43.0	99.4	95.3	38.1	83.5	99.8	26.4	82.8	97.6	21.3	58.9
Average MH-C rule (4)	100.0	8.4	95.7	99.1	12.7	63.5	100.0	8.4	63.0	99.6	8.2	44.6
Combined-data MH-C rule (4)	100.0	8.1	95.3	99.4	12.4	64.4	100.0	8.3	62.5	99.6	7.7	44.8

Note: FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-L = Loss function-liberal; LF-C = loss function-conservative. The parenthesized administration numbers refer to the administration when the DIF decision was made. The A, B, and C designations in the headings refer to the true DIF status of the items. There were 21 A items, 7 B items, and 6 C items. For a set of K items, the number of replications on which the tabled percentages are based is 500K.

Like the results of Tables 3 and 4, the results of Table 7 show that in general, the DIF methods performed best when sample sizes were large and when the reference and focal groups had the same ability distribution. The results for the liberal rules also demonstrated that accumulating information across administrations using the BU approach led to better DIF identification than that obtained with the ETS B rule or the LF-L rules based on EB or FB. Among the conservative rules, BU LF-C performed very well relative to the ETS C rule and the LF-C rules based on EB and FB. For example, in Administration 4, it had a Type I error rate (the complement of the percentage of correct decisions, averaged over the A items) of less than 5% for all four conditions and had DIF detection rates ranging from 21% to 43% for true B items and from 59% to 99% for true C items. By contrast, the ETS C rule had Type I error rates of less than 2% in all four conditions, but had very poor detection rates, ranging from 6% to 16% for true B items and from 21% to 77% for true C items.
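The Type I error rates quoted above are simply the complements of the percentages of correct decisions for the true A items. The short sketch below shows the arithmetic for BU LF-C in Administration 4, using the Table 7 entries; the variable names are hypothetical.

```python
# Percent correct for true A items, BU LF-C, Administration 4 (Table 7),
# keyed by simulation condition.
pct_correct_a = {1: 99.4, 2: 95.3, 3: 99.8, 4: 97.6}

# Type I error rate = 100 minus the percentage of correct (non-flag) decisions.
type1 = {cond: round(100.0 - pct, 1) for cond, pct in pct_correct_a.items()}
print(type1)
```

All four values fall below 5%, with the largest occurring in Condition 2.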

Comparing the performance of the BU approach to the aggregated MH methods produced a somewhat more complex picture. Among the liberal rules, the two aggregated MH B rules and the BU LF-L method from Administration 4 performed similarly, with a slight advantage going to the aggregated MH rules. However, Type I error rates for all three of these procedures were high for Condition 2 (reaching 15% for the Average MH B rule) and, to a lesser degree, for Condition 4. The results for the conservative rules were quite different. Here, the BU LF-C approach had considerably higher detection rates than the aggregated MH methods, though its Type I error rate was close to theirs. In particular, consider the results for Condition 3, where the BU LF-C analyses from Administrations 2–4 all had Type I error rates near 0. Their detection rates ranged from 78% to 83% for C items and from 26% to 27% for B items. By contrast, the two aggregated MH C rules had detection rates of about 63% for C items and 8% for B items.

A complication that arises in interpreting the results of Table 7 is that the criterion for evaluating the Bayesian approaches (EB, FB, and BU) is not strictly appropriate. These methods are not based on the ABC classification system. Instead, they are intended to detect DIF above a particular indifference point. In the case of the LF-L methods, this presents no particular interpretation problem: Because these methods have an indifference point of 1, the set of items that have nonignorable DIF (i.e., True DIF with a magnitude greater than 1) coincides exactly with the set of true B and C items. For the LF-C rules, the situation is more complicated. Some items with true B status (see Table 2) are considered non-DIF items from the perspective of the LF-C rules because their True DIF values are less than $\sqrt{1.5}$ . The far right column in Table 2 indicates which items have True DIF values exceeding $\sqrt{1.5}$ .
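To make the classification disparity concrete, the sketch below labels a few items as having ignorable or nonignorable DIF under each indifference point. The True DIF values are hypothetical and are not taken from Table 2; the function name is likewise an illustrative assumption.

```python
import math

def nonignorable(true_dif, indifference_point):
    """True if the magnitude of True DIF exceeds the indifference point."""
    return abs(true_dif) > indifference_point

LF_L_POINT = 1.0              # indifference point for the LF-L rules
LF_C_POINT = math.sqrt(1.5)   # indifference point for the LF-C rules

# Hypothetical True DIF values (not from Table 2).
for dif in (0.4, -1.1, 1.3, -2.0):
    print(dif, nonignorable(dif, LF_L_POINT), nonignorable(dif, LF_C_POINT))
```

An item with True DIF of −1.1, for example, would count as a DIF item under the LF-L rules but as a non-DIF item under the LF-C rules, even though its magnitude lies above 1.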

To investigate the impact of this disparity in item classification on the interpretation of the flagging results, we evaluated the performance of all DIF rules using the indifference points of 1 and $\sqrt{1.5}$ , respectively. A portion of these results is shown in Table 8, which displays the percentages of correct decisions for six conservative DIF rules in Conditions 1–4. Percentages are shown separately for items with True DIF less than $\sqrt{1.5}$ (25 items) and those with True DIF greater than $\sqrt{1.5}$ (9 items).

Table 8

Percentage of Correct Decisions for Items With True DIF Below and Above the Indifference Point of $\sqrt{1.5}$

	Condition 1		Condition 2		Condition 3		Condition 4
Rule (Admin)	Below	Above	Below	Above	Below	Above	Below	Above
ETS C rule (1)	97.9	58.8	97.6	22.6	99.1	42.6	98.2	17.0
EB LF-C (1)	92.0	79.2	81.5	54.6	95.7	63.5	88.9	35.3
FB LF-C (4)	91.8	79.1	82.8	54.8	96.4	63.3	91.8	34.2
BU LF-C (4)	94.8	86.6	91.0	71.4	99.1	73.6	96.0	50.3
Average MH-C rule (4)	99.6	69.4	97.8	48.3	100.0	48.5	99.1	34.6
Combined-data MH-C rule (4)	99.7	69.0	98.1	48.6	100.0	48.1	99.2	34.6

Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-C = loss function-conservative. The parenthesized administration numbers refer to the administration when the DIF decision was made. There were 25 items with True DIF below $\sqrt{1.5}$ and 9 items with True DIF above $\sqrt{1.5}$ . For a set of K items, the number of replications on which the tabled percentages are based is 500K.

In general, BU LF-C had a higher rate of detecting DIF items than other rules with similar Type I error rates (i.e., incorrect decisions for items with True DIF less than $\sqrt{1.5}$ ). In Condition 1, BU LF-C had a Type I error rate not far from those of the ETS C rule, EB LF-C, and FB LF-C, but had a higher DIF detection rate (87% vs. 59–79%). The aggregated MH C rules were very conservative in this condition, with Type I error rates near 0 and correspondingly lower detection rates. In Condition 2, BU LF-C had a higher detection rate than its competitors but had a higher Type I error rate than the three C rule methods. In Condition 3, BU LF-C was one of four methods with Type I error rates of less than 1%. The others were the ETS C rule and the aggregated MH C rules. BU LF-C had a much higher detection rate than the other three (74% vs. 43–49%). In Condition 4, BU LF-C had a Type I error rate of only 4% and a detection rate of 50%, higher than the rates for the other methods. However, the two aggregated MH C-rule procedures, which had detection rates of 35%, had superior Type I error control. In general, viewing the results in terms of whether items’ True DIF exceeded $\sqrt{1.5}$ (rather than in terms of true A, B, or C status) did not alter the previous conclusions about the performance of the BU LF-C rule.

Results are displayed in a different way in Figures 1–4, which show, for Conditions 1–4, respectively, the flagging rates (not the percentage of correct decisions) for the B, C, EB LF-C, and BU LF-C rules in Administration 4. The flagging rate for each item was plotted against its True DIF value; these flagging rates were then connected for readability. The figures show that within each condition, the shape of the flagging rate plots is similar across flagging rules; this results from the fact that all the rules are ultimately based on Mantel–Haenszel results. Disparities across rules in rates of flagging are much greater in the small-sample conditions (2 and 4) and are particularly evident for extreme-DIF items, for which the BU LF-C rule shows a much higher detection rate than the other rules. An anomaly that is evident in all four conditions is that, for all DIF rules, detection is quite poor for Item 22, which has a True DIF value of 2.03 (see Table 2). We are investigating the reasons for this finding.

Figure 1.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 1, Administration 4.

Figure 2.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 2, Administration 4.

Figure 3.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 3, Administration 4.

Figure 4.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 4, Administration 4.

Figures 5–8 display, for Conditions 1–4, respectively, the flagging rates for the two Average-MH rules, the two Combined-data MH rules, and the BU LF-C rule in Administration 4. The figures show that the Average MH-B rule and the Combined-data MH-B rule produced nearly identical results; the same is true for the Average MH-C rule and the Combined-data MH-C rule. The two B rules had consistently higher flagging rates than BU LF-C. While this is advantageous for items with large DIF, it is undesirable for items with True DIF near zero. The two C rules had consistently lower flagging rates than BU LF-C, as indicated in Table 9. (As in Figures 1–4, results for Item 22 appear anomalous.)

Figure 5.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 1, Administration 4.

Figure 6.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 2, Administration 4.

Figure 7.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 3, Administration 4.

Figure 8.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 4, Administration 4.

Table 9

Number of DIF Items in One Form of a Literature Test

	Administration 1		Administration 2		Administration 3		Administration 4
	n_F = 704, n_M = 364		n_F = 580, n_M = 310		n_F = 502, n_M = 313		n_F = 1131, n_M = 551
Rule	F	M	F	M	F	M	F	M
ETS B rule	23	22	18	24	22	28	24	23
ETS C rule	8	3	4	2	5	7	11	6
EB LF-L	15	14	9	6	9	20	21	18
EB LF-C	8	6	5	2	6	8	14	10
FB LF-L	_	_	_	_	8	16	18	16
FB LF-C	_	_	_	_	5	8	13	8
BU LF-L	_	_	12	14	13	18	16	18
BU LF-C	_	_	8	3	7	7	11	9
Average MH-B rule	_	_	_	_	_	_	18	21
Average MH-C rule	_	_	_	_	_	_	7	5
Combined-data MH-B rule	_	_	_	_	_	_	19	17
Combined-data MH-C rule	_	_	_	_	_	_	7	5

Note. FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-L = Loss function-liberal; LF-C = loss function-conservative. “F” refers to counts of items that showed DIF in favor of female test-takers; “M” refers to counts of items that showed DIF in favor of male test-takers. There were 230 items in the test form.

Application of DIF Methods to Actual Test Data

Method

We analyzed data from college-level tests in Literature (four administrations of one test form and three administrations of another), Psychology (two administrations of each of two forms), and Biology (three administrations of one form). Because the sample sizes were not adequate for DIF analyses based on ethnic groups, we conducted only male–female DIF analyses. For Psychology and Biology, the number of DIF items was small. Using the ETS C rule for purposes of comparison, Psychology and Biology had 0–2 DIF items per form, each of which consisted of roughly 200 items. For Literature, however, the number of DIF items was much higher, particularly in the form for which data from four national administration dates were available (6 to 17 of 230 items). Therefore, in our discussion, we focus on this Literature form.

As in the operational analyses, we included only test-takers who stated they were U.S. citizens and that English was their best language. Because these tests are formula-scored in practice, DIF analyses were performed using formula score as the matching variable.³ On the four administrations of this form, the mean formula score for men exceeded the mean for women by .2 to .3 standard deviations.

Of particular interest to us was the performance of the BU method across administrations, as well as any differences in conclusions that might result from using the BU method instead of the ETS B or C rules. For Administration 4, we also included the aggregated MH rules in the comparison.

Results

Table 9 summarizes the DIF results for the various rules and administrations and also includes the sample sizes for men and women. Several aspects of the results are noteworthy. First, in general, more DIF was detected in Administration 4, most likely because of the considerably larger samples that were available for this administration. Second, there is a considerable amount of DIF in each direction. To help us understand the nature of DIF in this literature form, we obtained a copy of the test form itself from the testing program. Examination of the test form showed that (regardless of DIF method) the items that showed DIF in favor of women were almost invariably about female authors or literary characters. Those that showed DIF in favor of men tended to be about male authors or about philosophical or political issues. The pattern of DIF suggests that it may occur in this test because men and women have differential degrees of interest in certain questions.

A somewhat surprising finding about the EB and FB methods, given that the results of Tables 4 and 7 suggest they function almost identically, was that the EB approach led to slightly higher detection rates than the FB method. For example, in Administration 4, EB LF-C detected 24 DIF items, compared to 21 for FB LF-C. Results showed that the EB statistics were slightly more variable across items (SD = .76) than were the FB statistics (SD = .72).

Despite the fact that it had a Type I error roughly comparable to the ETS C rule in the simulation, the BU LF-C rule detected more DIF items on the Literature form. In Administration 4, BU LF-C detected 20 DIF items versus 17 for the ETS C rule. The aggregated MH B rules and the EB LF-L, FB LF-L and BU LF-L rules all flagged about the same number of items (34–39). The aggregated MH C rules were the most conservative of all the methods, detecting DIF in only 12 items.

It was of particular interest to compare the BU LF-C rule to the ETS C rule. Four items that were flagged by BU LF-C were not flagged by the ETS C rule, and one item that was flagged by the C rule was not flagged by BU LF-C. Detailed results for these five discrepant items are shown in Table 10. The items were arbitrarily numbered from 1 to 5, where Item 1 is the item flagged by the C rule and not the BU LF-C rule, and Items 2–5 were flagged by the BU LF-C rule, but not the C rule.

Table 10

Item Histories for Items With Discrepant DIF Results In Administration 4 of a Literature Test

Item	DIF Rule	Administration 4 Results (n = 1,682)		Item History: Administration 3 (n = 815)		Item History: Administration 2 (n = 890)		Item History: Administration 1 (n = 1,068)
		Estimate	SE/PSD	Estimate	SE/PSD	Estimate	SE/PSD	Estimate	SE/PSD
1	ETS C rule	1.634*	0.343	0.458	0.471	0.678	0.457	1.474	0.414
	BU LF-C	1.090	0.198	0.820	0.242	0.949	0.282	1.116	0.358
	Average MH-C rule	1.189	0.206
	Combined-data MH-C rule	1.233	0.197
2	ETS C rule	−1.377	0.324	−2.869*	0.501	−1.138	0.464	−1.606	0.419
	BU LF-C	−1.507*	0.197	−1.584*	0.248	−1.167	0.285	−1.185*	0.361
	Average MH-C rule	−1.657*	0.205
	Combined-data MH-C rule	−1.638*	0.191
3	ETS C rule	−1.216	0.270	−1.408	0.399	−1.077	0.395	−1.612*	0.348
	BU LF-C	−1.247*	0.165	−1.265*	0.209	−1.211*	0.245	−1.295*	0.313
	Average MH-C rule	−1.318	0.170
	Combined-data MH-C rule	−1.255	0.157
4	ETS C rule	1.520	0.418	1.354	0.539	1.537	0.538	2.524*	0.555
	BU LF-C	1.512*	0.237	1.509*	0.287	1.570*	0.340	1.593*	0.438
	Average MH-C rule	1.734*	0.253
	Combined-data MH-C rule	1.640*	0.236
5	ETS C rule	−1.317	0.276	−1.714*	0.416	−1.429	0.413	−1.026	0.372
	BU LF-C	−1.263*	0.172	−1.230*	0.219	−1.044	0.258	−0.798	0.330
	Average MH-C rule	−1.342	0.177
	Combined-data MH-C rule	−1.205	0.167

Note. MH = Mantel–Haenszel, BU = Bayesian updating; SE = standard error; PSD = posterior standard deviation.

Asterisks indicate results that were flagged. Item 1 showed DIF according to the ETS C rule, but not BU Loss Function-Conservative (BU LF-C), in Administration 4. Items 2–5 showed DIF according to BU LF-C, but not the C rule, in Administration 4. By definition, the BU results for Administration 1 are the same as the EB results.

The table shows the histories for these items, including the MH and BU estimates, $S E ({M H}_{i})$ values, and ${P S D}_{i}$ values from all four administrations. Results for the aggregated MH C rules are included here as well. An asterisk is used to indicate which items were flagged in each administration. Item 1 had a large MH statistic (1.63) in Administration 4 and was flagged by the C rule. However, two of its previous ${M H}_{i}$ values were very small (.68 and .46), leading to a BU value of only 1.09 in Administration 4. By contrast, Items 2–5 had ${M H}_{i}$ values exceeding 1 in magnitude in all four administrations, which are manifested in their BU results for Administration 4. The ${P S D}_{i}$ for these items also tended to decrease over the four administrations. Even though the BU estimates were actually smaller in magnitude than the MH estimates for Items 4 and 5 in Administration 4, the ${P S D}_{i}$ was small enough to lead to flagging by the BU LF-C rule. It appears that the BU LF-C rule is more likely than the current ETS criteria to flag items that consistently show moderately high DIF values across several administrations.
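The way Item 1's history pulls its BU estimate down can be traced with a small sketch. The code assumes the standard normal-normal updating form, in which the posterior from one administration becomes the prior for the next and the new ${M H}_{i}$ value is weighted by the ratio of the prior variance to the sum of the prior variance and $S E^{2} ({M H}_{i})$. The formal equations appear earlier in the article and are not reproduced here, so the exact form is an assumption, and the function name `bu_update` is hypothetical. Inputs are the rounded Item 1 values from Table 10.

```python
import math

def bu_update(prior_mean, prior_sd, mh, se):
    """One normal-normal update; the previous posterior serves as the prior."""
    prior_var, samp_var = prior_sd ** 2, se ** 2
    w = prior_var / (prior_var + samp_var)   # weight on the new MH value
    post_mean = w * mh + (1.0 - w) * prior_mean
    post_sd = math.sqrt(w * samp_var)
    return post_mean, post_sd

# Item 1 (Table 10): begin with the Administration 1 result (EB = BU),
# then update with the MH estimate and SE from Administrations 2, 3, and 4.
mean, sd = 1.116, 0.358
history = [(0.678, 0.457), (0.458, 0.471), (1.634, 0.343)]
for mh, se in history:
    mean, sd = bu_update(mean, sd, mh, se)
    print(round(mean, 3), round(sd, 3))
```

Under this assumed form, the successive results come out close to the BU estimates and posterior standard deviations tabled for Item 1, ending near 1.09 and 0.198 in Administration 4. The final MH value of 1.634 receives a weight of only about one third because the accumulated posterior variance is already small.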

Table 10 also confirms the fact that BU does not always lead to the same results as the other two aggregation methods. For Items 1, 2, and 4, the three methods did lead to the same conclusion, but for Items 3 and 5, the BU approach led to flagging, while the two aggregated MH methods (like the ordinary MH method) did not. These differences are discussed further in the next section.

Discussion

Our study of the BU DIF procedure comprised three phases, involving properties of competing DIF estimators, properties of competing DIF hypothesis-testing procedures, and results of applying competing DIF methods to actual test data.

Properties of the BU estimator were investigated using item response data simulated under four conditions. In the simulation model, the same test was “administered” 4 times, with the DIF parameters remaining constant across administrations. The BU statistic was compared to the Mantel–Haenszel DIF statistic, the empirical Bayes (EB) DIF statistic (without updating) of Zwick et al. (1999, 2000), the fully Bayes (FB) DIF statistic of Sinharay et al. (2009), and two MH statistics based on multiple administrations—the Average MH statistic and the Combined-data MH statistic. Averaged over items and replications, the BU statistic was found to have the smallest variance of all the methods. The BU statistics and the aggregated MH statistics had the smallest average RMSR values. The average bias of the BU statistic exceeded that of MH and aggregated MH statistics but was smaller than that of EB and FB.

Several rules for flagging DIF items were also investigated using the simulated data: two rules each for the BU, EB, and FB statistics, as well as the ETS rules for identifying items as B and C items. The ETS rules were applied to the MH, the Average MH, and the Combined-data MH statistics. Results showed that the BU approach for accumulating information across administrations enhanced DIF identification, although the Average MH and Combined-data MH B rules performed about as well as BU LF-L. Among the more conservative flagging rules, BU LF-C outperformed all other methods including the aggregated MH approaches.

Finally, the BU approach was compared to the competing approaches using data from several college-level tests. Results from four administrations of a particular Literature test form suggest that the BU approach is more likely than either the current ETS approach or the aggregated MH methods to flag items that consistently show moderately high DIF values across several administrations.

It is useful to consider the reasons that the BU approach performs differently from the aggregated MH methods: The DIF estimator is different, and the decision procedure is different.

First, consider the estimator. The BU and Average MH statistics represent alternative methods of weighting the MH statistics from the four administrations. (The BU statistic also includes an additive term corresponding to the prior mean from Administration 1.) The weighting scheme for the BU statistic operates such that if the current estimate of the prior variance (i.e., the posterior variance from the previous administration) is small relative to the current value of $S E^{2} ({M H}_{i})$ , then the current ${M H}_{i}$ value will receive only a small weight relative to the current estimate of the prior mean (i.e., the posterior mean from the previous administration). However, if $S E^{2} ({M H}_{i})$ is small relative to the prior variance, the reverse is true. For the Average MH statistic, the weights were simply based on sample size (Equation 5), although alternative weights (e.g., based on the inverse of $S E^{2} ({M H}_{i})$ ) could be used. The Combined-data MH differs from the BU and Average MH statistics in that the aggregation process takes place before the MH is computed. In a test with K possible scores, the $2 \times 2 \times K$ tables from the four administrations are combined and a single MH statistic is computed. Although the three aggregation-based DIF estimators had similar average RMSRs in the simulation, where the true DIF and the sample size were constant across administrations (Table 4), this finding is unlikely to hold in more realistic situations. As Table 10 shows, the three approaches can lead to quite different DIF estimates, and their ordering is not consistent over items.
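To illustrate how the Combined-data MH differs in the order of operations, the sketch below pools the $2 \times 2 \times K$ tables first and then computes a single statistic. It assumes the commonly used MH D-DIF definition, $-2.35 \ln (\hat{α}_{MH})$, with $\hat{α}_{MH}$ the Mantel–Haenszel common odds ratio; the cell counts are hypothetical, and the function names `mh_d_dif` and `pool` are illustrative, not taken from the study.

```python
import math

def mh_d_dif(tables):
    """MH D-DIF from (A, B, C, D) counts, one tuple per score level.

    A, B = reference group right/wrong; C, D = focal group right/wrong.
    Assumes every score level is non-empty and the denominator is nonzero.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return -2.35 * math.log(num / den)

def pool(*administrations):
    """Add cell counts score level by score level across administrations."""
    return [tuple(map(sum, zip(*level))) for level in zip(*administrations)]

# Hypothetical counts for K = 2 score levels in two administrations.
admin1 = [(30, 20, 10, 15), (40, 10, 20, 10)]
admin2 = [(25, 25, 12, 18), (45, 5, 22, 8)]

combined = pool(admin1, admin2)
print(combined)
print(round(mh_d_dif(combined), 3))
```

By contrast, the BU and Average MH statistics would start from `mh_d_dif(admin1)` and `mh_d_dif(admin2)` computed separately and then weight those two values.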

The second reason that the BU and the aggregated MH procedures can produce different conclusions is that the rules for flagging items are quite different. For the aggregated MH rules, we applied the ETS B and C rules, which involve both a statistical significance criterion and a magnitude criterion. For the BU method, we used the loss-function-based rules described in Equations 7 and 8. It is possible that the aggregated MH statistics would perform better if different decision criteria were applied.

Overall, our findings on the performance of the BU procedure were very favorable, indicating that it can provide improved information on items that are administered more than once as part of intact test forms. By yielding more accurate DIF estimates, the procedure could facilitate the interpretation of DIF findings. The BU and other Bayesian approaches could also allow the performance of DIF analysis in smaller demographic groups than is currently typical, a feature that may be particularly important as definitions of race and ethnicity become more specific and as adaptive testing (which can produce sparse data) increases in popularity. Further exploration of the method’s performance in groups and administrations of varying size would be fruitful, as would an investigation of the method’s functioning when the true DIF varies over administrations. Extensions of the BU method could include application to items that are readministered, but in new contexts, to polytomous items, to DIF statistics other than the Mantel–Haenszel, and to other item statistics, such as difficulty and discrimination.

Footnotes

Acknowledgments

We appreciate the assistance we received from Diane Cruz, Edward Kulick, and Mei Su and the manuscript reviews provided by Edward Kulick and Sandip Sinharay.

Notes

References

Agresti, A. (1990). Categorical data analysis. New York, NY: Wiley.

Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.

Camilli, G., & Penfield, D. A. (1997). Variance estimation for differential test functioning based on Mantel-Haenszel statistics. Journal of Educational Measurement, 34, 123–139.

Holland, P. W. (1987, January 27). Expansion and comments on Marco’s rational approach to flagging items for DIF. ETS internal memorandum. Princeton, New Jersey.

Holland, P. W. (1987, February 11). More on rational approach item flagging. ETS internal memorandum. Princeton, New Jersey.

Holland, P. W. (2004, February 9). Comments on the definitions of A, B and C items in DIF. ETS internal memorandum. Princeton, New Jersey.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer, H., & Braun, H. I. (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.

Longford, N. T., Holland, P. W., & Thayer, D. T. (1993). Stability of the MH D-DIF statistics across populations. In Holland, P. W., & Wainer, H. (Eds.), Differential item functioning (pp. 171–196). Hillsdale, NJ: Lawrence Erlbaum.

Phillips, A., & Holland, P. W. (1987). Estimation of the variance of the Mantel-Haenszel log-odds-ratio estimate. Biometrics, 43, 425–431.

Sinharay, S., Dorans, N. J., Grant, M. C., & Blew, E. O. (2009). Using past data to enhance small sample DIF estimation: A Bayesian approach. Journal of Educational and Behavioral Statistics, 34, 74–96.

Zieky, M. (1993). DIF statistics in test development. In Holland, P. W., & Wainer, H. (Eds.), Differential item functioning (pp. 337–347). Hillsdale, NJ: Lawrence Erlbaum.

Zwick, R., & Thayer, D. T. (2002). Application of an empirical Bayes enhancement of Mantel-Haenszel DIF analysis to a computerized adaptive test. Applied Psychological Measurement, 26, 57–76.

Zwick, R., Thayer, D. T., & Lewis, C. (1997). An investigation of the validity of an empirical Bayes approach to Mantel-Haenszel DIF analysis (ETS Research Report No. RR-97-21). Princeton, NJ: Educational Testing Service.

Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Mantel-Haenszel DIF analysis. Journal of Educational Measurement, 36, 1–28.

Zwick, R., Thayer, D. T., & Lewis, C. (2000). Using loss functions for DIF detection: An empirical Bayes approach. Journal of Educational and Behavioral Statistics, 25, 225–247.

Zwick, R., Thayer, D. T., & Mazzeo, J. (1997). Descriptive and inferential procedures for assessing DIF in polytomous items. Applied Measurement in Education, 10, 321–344.
