Abstract
This study demonstrates how the stability of Mantel–Haenszel (MH) DIF (differential item functioning) methods can be improved by integrating information across multiple test administrations using Bayesian updating (BU). The authors conducted a simulation that showed that this approach, which is based on earlier work by Zwick, Thayer, and Lewis, can yield more accurate DIF estimation and improve the detection of DIF items, even when compared to other approaches that aggregate data across administrations. The authors also applied the method to data from several college-level tests. The BU approach provides a natural way to accumulate all known DIF information about each test item while mitigating the undesirable bias toward zero that affected the performance of two previous Bayesian DIF methods.
The Mantel–Haenszel (MH) method of screening items for differential item functioning (DIF), developed by Holland and Thayer (1988), is used routinely by many large-scale testing programs. The MH results can be used to designate items as “A” (negligible or nonsignificant DIF), “B” (slight-to-moderate DIF), or “C” (moderate-to-large DIF) using criteria developed at Educational Testing Service (ETS) that involve both the magnitude of the MH value and its statistical significance (Zieky, 1993, p. 342). Because MH results are unstable in small samples, these analyses are typically performed only when certain sample size requirements are met. The enforcement of sample size criteria may mean that DIF analyses are not performed for certain test administration modes, such as computerized adaptive tests, in which some items may be administered to only a small number of individuals. Stringent sample size requirements may also mean that DIF analyses are not performed for certain ethnic groups, such as Native Americans. The challenge presented by small demographic groups is likely to be exacerbated by the trend toward more specific definition of ethnic and racial groups. For example, the 2000 U.S. Census reported data for 63 racial categories for certain purposes—6 main categories and 57 combinations (www.census.gov/census2000/raceqandas.html). Because of this phenomenon, it may be advantageous to adopt a DIF analysis method that can provide stable small-sample results.
Zwick, Thayer, and Lewis (1999, 2000; Zwick & Thayer, 2002) developed an empirical Bayes (EB) approach to MH DIF analysis, intended to produce more stable results in these situations. In the EB approach, a prior distribution for the DIF parameter ω is assumed. As described below, data from the current test administration, consisting of MH statistics and their standard errors, are used as a basis for estimating the mean and variance of the prior. Because the prior distribution and the likelihood function are both assumed normal, the posterior distribution of ω is also normal and its mean serves as the EB estimate of DIF. The EB approach was found to produce more stable DIF estimates than the ordinary MH method. Also, a loss-function-based DIF detection rule that made use of the EB results was better able to identify DIF items than the ETS “ABC” classification system. Superiority of the EB approach was most evident in small samples, as would be expected from theory.
In a modification of this approach, Sinharay, Dorans, Grant, and Blew (2009) used data from previous administrations, rather than the current data set, to estimate the prior mean and variance. They referred to their procedure as a fully Bayes (FB) approach. In their study, the performance of the FB method tended to be virtually identical to that of the EB method, again providing more stable results than the ordinary MH statistic.
However, a particular objection was raised in connection with the EB and FB versions of MH DIF analysis: that they lead the user to pay less attention to DIF. In these Bayesian approaches, each MH statistic is “shrunk” toward the prior mean. As explained below, the EB and FB methods estimate the prior mean using the average MH from one or more test administrations. Because of the self-norming property of MH DIF statistics (the constraint that the MH values for an administration sum to approximately zero when the matching variable is the test score itself), the EB and FB methods will always cause the MH statistics to shrink toward zero.
Bayesian updating (BU) of DIF results was proposed (but not implemented) by Zwick et al. (2000) as a possible elaboration of their EB approach that could be applied for items administered more than once. The BU approach provides a natural way to accumulate all known DIF information about each test item while mitigating the undesirable bias toward zero demonstrated by the EB and FB approaches. As described below, the BU procedure generally involves shrinkage toward zero at the first administration only. Our study demonstrates how the stability of MH DIF methods can be improved by integrating information across multiple test administrations using BU.
In the next section, we provide the formulas associated with the EB, FB, and BU methods, as well as a simplified example of their application. In the subsequent two sections, we describe a simulation study that compares the BU DIF method to several competing methods in terms of the properties of the DIF estimators and the performance of DIF detection rules, respectively. In the fourth section, we discuss our application of the method to data from a college-level test in literature. In the final section, we summarize our results, discuss the limitations of the study, and outline possible future research directions.
The EB, FB, and BU Methods: Formulas and an Example
The BU approach builds on the EB approach, which is described first. The relation of the FB to the EB method is also explained. Using the fact that the log of the MH odds ratio has an asymptotic normal distribution (Agresti, 1990), we assume that
Under these assumptions, we can use standard Bayesian calculations (e.g., Box & Tiao, 1973) to show that the posterior distribution of ω_i, given the observed DIF statistics, is as follows:
The estimate of the posterior mean
Estimation of
,
,
, and
As described in Zwick et al. (1999, 2000), the observed squared standard error of
Zwick, Thayer, and Lewis conducted a series of validity studies of the EB method. For example, to investigate the sensitivity of the method to the assumption that the prior was normal, they experimented with alternative priors, such as truncated normal distributions. They also compared the estimator of
The FB method is identical to the EB method except in terms of the data used to obtain the estimates of μ and
The BU approach is an elaboration of the EB method that can be applied when test forms are administered multiple times. Each time an item is administered, the BU approach uses the estimates of the posterior mean and variance obtained from the immediately preceding administration (see Equations 2 and 3) as the prior mean and variance in the current application. Conceptually, the approach is simple, but it does introduce additional computational complexity in that each item has “its own” prior mean and variance. This is not the case in the EB and FB methods, which use a prior that is common across items (see Equation 1).
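The equations referred to above are not reproduced in this text. As a hedged reconstruction only, the block below writes out the standard normal-normal conjugate update and the BU recursion it implies. The symbols σ_i² (sampling variance of MH_i), τ² (prior variance), W (shrinkage weight), and the administration index t are labels chosen here for illustration and may not match the original notation or equation numbering.

```latex
% Assumed model (standard normal-normal form; notation is illustrative)
\begin{align*}
MH_i \mid \omega_i &\sim N(\omega_i,\ \sigma_i^2), &
\omega_i &\sim N(\mu,\ \tau^2) \\[4pt]
% EB/FB posterior with a prior common to all items
\omega_i \mid MH_i &\sim N\bigl(W_i\,MH_i + (1 - W_i)\,\mu,\ W_i\,\sigma_i^2\bigr), &
W_i &= \frac{\tau^2}{\sigma_i^2 + \tau^2} \\[4pt]
% BU recursion for administrations t = 2, 3, \dots (item-specific prior)
\mu_{i,t} &= W_{i,t}\,MH_{i,t} + (1 - W_{i,t})\,\mu_{i,t-1}, &
W_{i,t} &= \frac{\tau_{i,t-1}^2}{\sigma_{i,t}^2 + \tau_{i,t-1}^2} \\
\tau_{i,t}^2 &= W_{i,t}\,\sigma_{i,t}^2
\end{align*}
```

Under this form, a prior mean of 0, a prior variance of .5, an MH statistic of 1, and a standard error of .6 give W = .5/(.36 + .5) ≈ .5814.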
To see that the BU procedure can lead to results very different from the EB or FB approaches, consider the following simplified example. Suppose that an item is administered 4 times, each time resulting in an MH statistic of 1 and an SE(MH_i) of .6. Suppose further that the prior mean and variance, as defined in the EB procedure, are 0 and .5, respectively, in each administration. Then, using Equations 2 and 3, the EB procedure would produce a DIF estimate of .5814 on each administration, with a posterior standard deviation
Prior and Posterior Means and Standard Deviations for the Bayesian Updating (BU) Procedure Applied to the Example Data
aIt is assumed that
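The simplified example can be traced numerically with a short Python sketch. It assumes the standard normal-normal conjugate update (posterior mean as a weighted average of the MH statistic and the prior mean, posterior variance equal to the weight times the squared standard error). The function names normal_update and bayesian_updating are hypothetical and are not taken from the original study.

```python
import math

def normal_update(prior_mean, prior_var, mh, se):
    """One conjugate normal update (assumed form of the posterior equations).

    Returns the posterior mean and posterior variance of the DIF parameter.
    """
    sampling_var = se ** 2
    weight = prior_var / (prior_var + sampling_var)  # weight on the observed MH
    post_mean = weight * mh + (1.0 - weight) * prior_mean
    post_var = weight * sampling_var
    return post_mean, post_var

def bayesian_updating(mh_values, se_values, prior_mean, prior_var):
    """Apply the update once per administration, carrying the posterior
    forward as the next prior. Returns a list of (posterior mean, PSD)."""
    history = []
    mean, var = prior_mean, prior_var
    for mh, se in zip(mh_values, se_values):
        mean, var = normal_update(mean, var, mh, se)
        history.append((mean, math.sqrt(var)))
    return history

# EB: the same common prior (mean 0, variance .5) is used at every administration.
eb_mean, eb_var = normal_update(0.0, 0.5, 1.0, 0.6)

# BU: the common prior is used at Administration 1 only.
bu_history = bayesian_updating([1.0] * 4, [0.6] * 4, 0.0, 0.5)

print(f"EB estimate at every administration: {eb_mean:.4f}")
for admin, (mean, psd) in enumerate(bu_history, start=1):
    print(f"BU, Administration {admin}: mean = {mean:.4f}, PSD = {psd:.4f}")
```

Under these assumptions the EB estimate stays at .5814 in every administration, whereas the BU estimate starts at the same value and moves toward the observed MH value of 1 as administrations accumulate, with a shrinking posterior standard deviation.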
A natural question that arises concerning the BU approach is whether comparable performance could be achieved using non-Bayesian methods of aggregating the data from multiple administrations. To investigate this, we studied two additional DIF statistics based on aggregation—the Average MH and the Combined-data MH. For item i, the Average MH statistic,
where A is the number of administrations,
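As an illustration of the Average MH idea, the sketch below averages an item's MH statistics over A administrations. The standard error shown is an assumption: it treats the administrations as independent samples, so the variance of the mean is the sum of the squared standard errors divided by A squared. The function name average_mh is hypothetical.

```python
import math

def average_mh(mh_values, se_values):
    """Average an item's MH statistics over administrations.

    The standard error assumes independent samples across administrations
    (an assumption of this sketch, not a formula quoted from the source).
    """
    a = len(mh_values)
    if a == 0 or a != len(se_values):
        raise ValueError("need one MH statistic and one SE per administration")
    avg = sum(mh_values) / a
    se = math.sqrt(sum(s ** 2 for s in se_values)) / a
    return avg, se

# Four administrations of one item, using the values from the simplified example.
avg, se = average_mh([1.0, 1.0, 1.0, 1.0], [0.6, 0.6, 0.6, 0.6])
print(f"Average MH = {avg:.2f}, SE = {se:.2f}")
```

Only the per-administration MH statistics and standard errors are needed as inputs, so this kind of aggregate does not require access to the item response data.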
The Current Project
The analyses for this project consisted of three major components. First, a simulation study of the properties of the MH, EB, FB, and BU DIF estimates, as well as the Average and Combined-data MH statistics, was conducted. Second, several decision rules for identifying DIF, based on these six kinds of DIF estimates, were compared using the simulated data. Third, the DIF methods were applied to data from several college-level tests. Each of these three studies is presented in turn.
Simulation Study of Properties of DIF Estimators
Method
Four simulation conditions were included; in all four, the reference group had a standard normal ability (θ) distribution. The focal group distribution and group sample sizes were as follows: Condition 1: N(0, 1) focal group θ distribution; Condition 2: N(0, 1) focal group θ distribution; Condition 3: N(−1, 1) focal group θ distribution; Condition 4: N(−1, 1) focal group θ distribution;
Note that Conditions 2 and 4 represent situations in which DIF analyses would not be conducted under ETS rules, which require at least 200 members in the smaller group.
In terms of sample size and ability distributions, the simulation conditions in the present study were the same as those used by Zwick et al. (1999, 2000). In the current project, however, the simulation was extended to correspond to a situation in which a set of 34 items (representing a test form) was administered four different times to independent samples. The True DIF (i.e., the value of the DIF parameter ω) for each item was assumed to remain constant over administrations. Within each of the four administrations, responses to the items were generated using the three-parameter logistic (3PL) model. Item parameters were fixed throughout the study.
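A minimal sketch of how responses to one item might be generated under the 3PL model is given below. It is not the simulation code used in the study. The scaling constant of 1.7, the example item parameters, the group sizes of 500, and the function names are assumptions made for illustration; the guessing parameter of .15 follows the note to the item parameter table.

```python
import math
import random

def p_correct_3pl(theta, a, b, c=0.15, scale=1.7):
    """Probability of a correct response under the 3PL model.

    The scaling constant 1.7 is an assumption of this sketch.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))

def simulate_responses(n, theta_mean, a, b, c=0.15, rng=None):
    """Draw n abilities from N(theta_mean, 1) and return 0/1 item responses."""
    if rng is None:
        rng = random.Random()
    responses = []
    for _ in range(n):
        theta = rng.gauss(theta_mean, 1.0)
        p = p_correct_3pl(theta, a, b, c)
        responses.append(1 if rng.random() < p else 0)
    return responses

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical item: DIF is represented as a larger difficulty for the focal group.
reference = simulate_responses(500, 0.0, a=1.0, b=0.0, rng=rng)
focal = simulate_responses(500, -1.0, a=1.0, b=0.5, rng=rng)
print(sum(reference) / len(reference), sum(focal) / len(focal))
```

Repeating this for every item, administration, and replication, with independent samples each time, would mirror the general design described above.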
DIF was modeled as a difference between focal group difficulties
To express the true amount of DIF in the same metric as the Mantel–Haenszel delta difference of Holland and Thayer (1988), the following formulation (see Zwick, Thayer, & Lewis, 2000) was used:
Item Parameters and True DIF Values
Note. All guessing parameters were set to .15.
The number of replications per simulation condition was 500. Hence, within each simulation condition, each item was “administered” 2,000 times (4 Administrations × 500 Replications). The DIF statistics that were computed are listed below, along with an explanation of how, if at all, the statistic was modified across administrations.
MH statistic and standard error: These statistics were computed in all four administrations. Since the statistics are based on data from only the current administration, the four administrations within each simulation condition essentially represent additional replications. That is, other than sampling error, there is no reason to expect the MH performance to vary across the four administrations within a condition.
EB DIF statistic and PSD: These statistics were computed in all four administrations. As is the case with the MH statistics, there is no reason to expect the performance of the EB statistics to vary across administrations within a condition.
FB DIF statistic and PSD: These statistics were computed in Administrations 3 and 4, using the pooled data from all previous administrations to estimate the prior mean and variance, as outlined earlier. In Administration 3, the data from Administrations 1 and 2 were used; in Administration 4, the data from Administrations 1–3 were used.
BU DIF statistic and PSD: These updated statistics were computed in Administrations 2–4. Note that the EB statistics from Administration 1 serve as the first step in computing the subsequent BU statistics (see Table 1). In updating the DIF results for an item, data from the “same” replication in the previous administration were used. For example, the posterior mean and PSD from Replication 300 of Item 7 in Administration 3 were used as the prior mean and standard deviation in obtaining DIF results for Replication 300 of Item 7 in Administration 4.
Average MH statistic and standard error: These statistics were computed at Administration 4 only, based on data from all four administrations.
Combined-data MH statistic and standard error: These statistics too were computed at Administration 4 only, based on data from all four administrations.
For each item in each replication of each condition and administration, the various DIF statistics were computed. To summarize the performance of each method, squared bias
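The summary measures can be illustrated with a short sketch. It assumes that RMSR denotes the root mean squared residual of the estimates around the True DIF value, so that the squared RMSR equals squared bias plus variance (with the variance computed using the number of replications as the divisor). The function name summarize_estimator and the example values are hypothetical.

```python
import math

def summarize_estimator(estimates, true_dif):
    """Return (RMSR, squared bias, variance) for one item's DIF estimates
    across replications, under the assumed definitions in the lead-in."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    squared_bias = (mean_est - true_dif) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    rmsr = math.sqrt(sum((e - true_dif) ** 2 for e in estimates) / n)
    return rmsr, squared_bias, variance

# Made-up estimates for one item whose True DIF is 1.2.
rmsr, squared_bias, variance = summarize_estimator([0.5, 1.0, 1.5], 1.2)
print(rmsr, squared_bias, variance)
```

The decomposition makes the trade-off visible: a shrinkage estimator can accept some bias toward the prior mean in exchange for a reduction in variance, and the RMSR reflects the net effect.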
One way in which the DIF analyses in this study differed from those of Zwick et al. (1999, 2000) was that no refinement was used in the present study. (Sinharay et al. [2009] did not conduct refinement in their study either.) In DIF refinement as applied to the Mantel–Haenszel procedure, items that are found to have DIF in an initial round of analysis are deleted from the matching variable in the final analysis. (An exception to this is that the studied item is always included in the matching variable.) Our decision to forgo refinement was influenced by the fact that the testing program whose data we analyzed in the third phase of our study does not use refinement in its own analyses. To investigate the sensitivity of our simulation results to this decision, we conducted both refined and unrefined analyses of the data from Conditions 1 and 4. We found that, averaging across items, the unrefined results were slightly more accurate than the refined ones. This outcome confirmed our decision to conduct our primary analyses without refinement. (The unexpected refinement results led us to conduct a comprehensive study of DIF refinement, which is nearing completion.)
Results
Table 3 shows the minimum, average, and maximum
Minimum, Average, and Maximum RMSR, Squared Bias, and Variance for Administration 3
Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Min, mean, and max values are based on distributions over 500 × 34 = 17,000 replications.
Minimum, Average, and Maximum RMSR, Squared Bias, and Variance for Administration 4
Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Min, mean, and max values are based on distributions over 500 × 34 = 17,000 replications.
All procedures performed best in Condition 1, when samples were large
It is striking that in Conditions 2 and 4, the mean RMSR for the MH statistic exceeded 1 in both administrations, indicating that, on average,
Table 4 shows that the Average MH and Combined-data MH statistics had mean RMSRs that were very similar to each other and to those of BU except in Condition 2, where the mean RMSRs for BU (.47) and for the Combined-data MH (.48) were notably smaller than the RMSR for the Average MH (.53). The minimum RMSR values for the Average and Combined-data MH statistics were always larger than those of the BU statistic, but the maximum RMSR values were smaller than those of BU.
In all conditions and administrations in Tables 3 and 4, BU had the smallest mean variance; BU and the two aggregated MH methods (i.e., the Average and Combined-data MH) had similar mean RMSR values that were smaller than those of the competing procedures. As expected, the performance of BU in Administration 4 was better than in Administration 3. (Similarly, the Administration 3 results were superior to those of Administration 2 [not shown].) The three methods that used data from multiple administrations showed a substantial advantage over MH, EB, and FB even in the large-sample conditions (1 and 3). In Condition 1, Administration 4, for example, BU and the aggregated MH methods had mean RMSRs of .20, compared to .37 for EB and FB and .39 for MH. The mean bias of the BU procedure was always less than that of the two other Bayesian approaches, but greater than that of the MH and aggregated MH methods.
The performance of the EB and FB procedures was largely indistinguishable, as found by Sinharay et al. (2009). The FB results for Administrations 3 and 4 were very similar to each other as well, indicating that whether the prior was estimated based on two or three administrations did not have much impact on the results.
Table 5 shows, for Administration 4 in each of the four conditions, the average
Average Standard Error (for MH-based Methods) or Posterior Standard Deviation (for EB, FB, and BU) for Administration 4 in the Four Simulation Conditions
Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating. Averages were computed over 500 × 34 = 17,000 replications.
Simulation Study of Properties of DIF Decision Rules
Method
The simulation data were also used to compare various rules for DIF flagging. The ETS rules are based on those currently used at ETS. The C rule flags items if they meet the ETS criteria for C status, which state that the absolute value of
At Administration 4, we also applied versions of the B rule and C rule to the two aggregated MH statistics. The decision rules based on the Average MH (which, as far as we know, have not been previously used) have the advantage that they can be implemented using only the
All the remaining rules were based on the results of the three Bayesian estimation procedures. These include an approach to item flagging based on loss functions that was outlined by Paul Holland in unpublished memos (January 27, 1987; February 11, 1987). A rule based on this approach was implemented by Zwick et al. (2000) and included in the study by Sinharay et al. (2009). The loss associated with keeping (failing to flag) item i,
Holland proposed that
Flag if
To implement this rule, we substituted our estimates of the posterior mean and variance, into Equation 5. In the current study, we labeled this rule “EB Loss Function–Liberal” (EB LF-L) because Zwick et al. (2000) found that, while DIF detection rates were good for this rule, Type I error rates tended to be high in certain conditions. We therefore included a variant of the rule in the present study:
This can be shown to be equivalent to using a slightly higher indifference point of
It is worth noting that the EB LF-L rule is expected to function in essentially the same way as the ETS B rule in very large samples, where statistical significance is virtually assured. In this situation, the ETS B rule (which, as implemented here, identifies items that meet either the B or C criteria) will flag items with
Each of the LF rules was implemented using the EB, FB, and BU estimates of the posterior mean and standard deviation of the ω distribution. In addition, we implemented rules based directly on the posterior distribution of ω. In these rules, items were flagged if their posterior probability of C status exceeded certain thresholds. Because these posterior density rules either performed similarly to the LF rules or performed less well, they are not discussed further. The ETS rules and loss function rules are provided in Table 6.
Definitions of DIF Decision Rules
Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating.
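For readers unfamiliar with the ETS classification, the sketch below shows one way the A, B, and C categories could be assigned from an MH statistic and its standard error. It rests on the commonly cited form of the criteria: C requires an absolute value of at least 1.5 that is significantly greater than 1, A applies when the absolute value is below 1 or not significantly different from zero, and B covers everything else. The critical values (1.96 two-sided, 1.645 one-sided) and the function names are assumptions of this sketch, not definitions copied from the table.

```python
def ets_category(mh, se):
    """Classify an item as 'A', 'B', or 'C' from its MH statistic and SE.

    Thresholds and critical values are assumptions (see lead-in).
    """
    abs_mh = abs(mh)
    significant_vs_zero = abs_mh / se > 1.96            # two-sided test at .05
    significantly_above_one = (abs_mh - 1.0) / se > 1.645  # one-sided test at .05
    if abs_mh >= 1.5 and significantly_above_one:
        return "C"
    if abs_mh >= 1.0 and significant_vs_zero:
        return "B"
    return "A"

def flag_b_rule(mh, se):
    """B rule as described in the text: flag items meeting the B or C criteria."""
    return ets_category(mh, se) in ("B", "C")

def flag_c_rule(mh, se):
    """C rule: flag only items meeting the C criteria."""
    return ets_category(mh, se) == "C"

print(ets_category(2.0, 0.3), ets_category(1.2, 0.3), ets_category(2.0, 1.5))
```

The last call illustrates the small-sample problem: a large MH value with a large standard error is classified as A because neither significance criterion is met.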
Results
Results for all four conditions are provided for a total of 12 procedures for flagging items:
ETS B (Administration 1 only; results are interchangeable across administrations)
ETS C (Administration 1 only, as above)
EB LF-L (Administration 1 only, as above)
EB LF-C (Administration 1 only, as above)
FB LF-L (Administrations 3 and 4)
FB LF-C (Administrations 3 and 4)
BU LF-L (Administrations 2–4)
BU LF-C (Administrations 2–4)
Average MH-B rule (Administration 4)
Average MH-C rule (Administration 4)
Combined-data MH-B rule (Administration 4)
Combined-data MH-C rule (Administration 4)
Table 7 shows the percentage of correct flagging decisions for the true A, B, and C items for each flagging rule in each simulation condition, for each relevant test administration. For the true A items, the correct decision is not to identify the item as a DIF item. For all other items, flagging was considered the correct decision for purposes of constructing this table. For ease of comparison, the flagging rules are divided into two categories: liberal rules, which include the various versions of the ETS B rule and LF-L rule, and conservative rules, which include the ETS C rule and LF-C rules.
Percentage of Correct Decisions for DIF Flagging Rules
Note. FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-L = loss function-liberal; LF-C = loss function-conservative. The parenthesized administration numbers refer to the administration when the DIF decision was made. The A, B, and C designations in the headings refer to the true DIF status of the items. There were 21 A items, 7 B items, and 6 C items. For a set of K items, the number of replications on which the tabled percentages are based is 500K.
Like the results of Tables 3 and 4, the results of Table 7 show that in general, the DIF methods performed best when sample sizes were large and when the reference and focal groups had the same ability distribution. The results for the liberal rules also demonstrated that accumulating information across administrations using the BU approach led to better DIF identification than that obtained with the ETS B rule or the LF-L rules based on EB or FB. Among the conservative rules, BU LF-C performed very well relative to the ETS C rule and the LF-C rules based on EB and FB. For example, in Administration 4, it had a Type I error rate (the complement of the percentage of correct decisions, averaged over the A items) of less than 5% for all four conditions and had DIF detection rates ranging from 21% to 43% for true B items and from 59% to 99% for true C items. By contrast, the ETS C rule had Type I error rates of less than 2% in all four conditions, but had very poor detection rates, ranging from 6% to 16% for true B items and from 21% to 77% for true C items.
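The bookkeeping behind these percentages can be sketched as follows, using the definition given earlier: not flagging is the correct decision for true A items, and flagging is the correct decision for true B and C items. The function name percent_correct and the toy data are hypothetical.

```python
def percent_correct(flags, true_status):
    """Percentage of correct flagging decisions by true DIF status.

    flags: booleans, True if the rule flagged the item in that replication.
    true_status: 'A', 'B', or 'C' for each corresponding entry in flags.
    """
    counts = {"A": [0, 0], "B": [0, 0], "C": [0, 0]}  # [correct, total]
    for flagged, status in zip(flags, true_status):
        correct = (not flagged) if status == "A" else flagged
        counts[status][0] += int(correct)
        counts[status][1] += 1
    return {s: 100.0 * c / n for s, (c, n) in counts.items() if n > 0}

# Toy data: five item-by-replication decisions.
result = percent_correct(
    [False, True, True, False, True],
    ["A", "A", "B", "B", "C"],
)
type_i_error = 100.0 - result["A"]  # complement of percent correct for A items
print(result, type_i_error)
```

For true B and C items the percentage correct is the detection rate, so a single table of percentages carries both the Type I error rate and the detection rates.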
Comparing the performance of the BU approach to the aggregated MH methods produced a somewhat more complex picture. Among the liberal rules, the two aggregated MH B rules and the BU LF-L method from Administration 4 performed similarly, with a slight advantage going to the aggregated MH rules. However, Type I error rates for all three of these procedures were high for Condition 2 (reaching 15% for the Average MH B rule) and, to a lesser degree, for Condition 4. The results for the conservative rules were quite different. Here, the BU LF-C approach had considerably higher detection rates than the aggregated MH methods, though its Type I error rate was close to theirs. In particular, consider the results for Condition 3, where the BU LF-C analyses from Administrations 2–4 all had Type I error rates near 0. Its detection rate ranged from 78% to 83% for C items and from 26% to 27% for B items. By contrast, the two aggregated MH C rules had detection rates of about 63% for C items and 8% for B items.
A complication that arises in interpreting the results of Table 7 is that the criterion for evaluating the Bayesian approaches (EB, FB, and BU) is not strictly appropriate. These methods are not based on the ABC classification system. Instead, they are intended to detect DIF above a particular indifference point. In the case of the LF-L methods, this presents no particular interpretation problem: Because these methods have an indifference point of 1, the set of items that have nonignorable DIF (i.e., True DIF with a magnitude greater than 1) coincides exactly with the set of true B and C items. For the LF-C rules, the situation is more complicated. Some items with true B status (see Table 2) are considered non-DIF items from the perspective of the LF-C rules because their True DIF values are less than
To investigate the impact of this disparity in item classification on the interpretation of the flagging results, we evaluated the performance of all DIF rules using the indifference points of 1 and
Percentage of Correct Decisions for Items With True DIF Below and Above the Indifference Point of
Note. MH = Mantel–Haenszel; FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-C = loss function-conservative. The parenthesized administration numbers refer to the administration when the DIF decision was made. There were 25 items with True DIF below
In general, BU LF-C had a higher rate of detecting DIF items than other results with similar Type I error rates (i.e., incorrect decisions for items with True DIF less than
Results are displayed in a different way in Figures 1–4, which show, for Conditions 1–4, respectively, the flagging rates (not the percentage of correct decisions) for the B, C, EB LF-C, and BU LF-C rules in Administration 4. The flagging rate for each item was plotted against its True DIF value; these flagging rates were then connected for readability. The figures show that within each condition, the shape of the flagging rate plots is similar across flagging rules; this results from the fact that all the rules are ultimately based on Mantel–Haenszel results. Disparities across rules in rates of flagging are much greater in the small-sample conditions (2 and 4) and are particularly evident for extreme-DIF items, for which the BU LF-C rule shows a much higher detection rate than the other rules. An anomaly that is evident in all four conditions is that, for all DIF rules, detection is quite poor for Item 22, which has a True DIF value of 2.03 (see Table 2). We are investigating the reasons for this finding.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 1, Administration 4.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 2, Administration 4.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 3, Administration 4.

Flagging rates for the B, C, EB LF-C, and BU LF-C rules in Condition 4, Administration 4.
Figures 5–8 display, for Conditions 1–4, respectively, the flagging rates for the two Average-MH rules, the two Combined-data MH rules, and the BU LF-C rule in Administration 4. The figures show that the Average MH-B rule and the Combined-data MH-B rule produced nearly identical results; the same is true for the Average MH-C rule and the Combined-data MH-C rule. The two B rules had consistently higher flagging rates than BU LF-C. While this is advantageous for items with large DIF, it is undesirable for items with True DIF near zero. The two C rules had consistently lower flagging rates than BU LF-C, as indicated in Table 9. (As in Figures 1–4, results for Item 22 appear anomalous.)

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 1, Administration 4.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 2, Administration 4.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 3, Administration 4.

Flagging rates for the Average MH-B, Average MH-C, Combined MH-B, Combined MH-C, and BU LF-C rules in Condition 4, Administration 4.
Number of DIF Items in One Form of a Literature Test
Note. FB = fully Bayes; EB = empirical Bayes; BU = Bayesian updating; LF-L = loss function-liberal; LF-C = loss function-conservative. “F” refers to counts of items that showed DIF in favor of female test-takers; “M” refers to counts of items that showed DIF in favor of male test-takers. There were 230 items in the test form.
Application of DIF Methods to Actual Test Data
Method
We analyzed data from college-level tests in Literature (four administrations of one test form and three administrations of another), Psychology (two administrations of each of two forms), and Biology (three administrations of one form). Because the sample sizes were not adequate for DIF analyses based on ethnic groups, we conducted only male–female DIF analyses. For Psychology and Biology, the number of DIF items was small. Using the ETS C rule for purposes of comparison, Psychology and Biology had 0–2 DIF items per form, each of which consisted of roughly 200 items. For Literature, however, the number of DIF items was much higher, particularly in the form for which data from four national administration dates were available (6 to 17 of 230 items). Therefore, in our discussion, we focus on this Literature form.
As in the operational analyses, we included only test-takers who stated they were U.S. citizens and that English was their best language. Because these tests are formula-scored in practice, DIF analyses were performed using formula score as the matching variable. On the four administrations of this form, the mean formula score for men exceeded the mean for women by .2 to .3 standard deviations.
Of particular interest to us was the performance of the BU method across administrations, as well as any differences in conclusions that might result from using the BU method instead of the ETS B or C rules. For Administration 4, we also included the aggregated MH rules in the comparison.
Results
Table 9 summarizes the DIF results for the various rules and administrations and also includes the sample sizes for men and women. Several aspects of the results are noteworthy. First, in general, more DIF was detected in Administration 4, most likely because of the considerably larger samples that were available for this administration. Second, there is a considerable amount of DIF in each direction. To help us understand the nature of DIF in this literature form, we obtained a copy of the test form itself from the testing program. Examination of the test form showed that (regardless of DIF method) the items that showed DIF in favor of women were almost invariably about female authors or literary characters. Those that showed DIF in favor of men tended to be about male authors or about philosophical or political issues. The pattern of DIF suggests that it may occur in this test because men and women have differential degrees of interest in certain questions.
A somewhat surprising finding about the EB and FB methods, given that the results of Tables 4 and 7 suggest they function almost identically, was that the EB approach led to slightly higher detection rates than the FB method. For example, in Administration 4, EB LF-C detected 24 DIF items, compared to 21 for FB LF-C. Results showed that the EB statistics were slightly more variable across items (SD = .76) than were the FB statistics (SD = .72).
Although its Type I error rate in the simulation was roughly comparable to that of the ETS C rule, the BU LF-C rule detected more DIF items on the Literature form. In Administration 4, BU LF-C detected 20 DIF items versus 17 for the ETS C rule. The aggregated MH B rules and the EB LF-L, FB LF-L, and BU LF-L rules all flagged about the same number of items (34–39). The aggregated MH C rules were the most conservative of all the methods, detecting DIF in only 12 items.
It was of particular interest to compare the BU LF-C rule to the ETS C rule. Four items that were flagged by BU LF-C were not flagged by the ETS C rule, and one item that was flagged by the C rule was not flagged by BU LF-C. Detailed results for these five discrepant items are shown in Table 10. The items were arbitrarily numbered from 1 to 5, where Item 1 is the item flagged by the C rule and not the BU LF-C rule, and Items 2–5 were flagged by the BU LF-C rule, but not the C rule.
Item Histories for Items With Discrepant DIF Results In Administration 4 of a Literature Test
Note. MH = Mantel–Haenszel; BU = Bayesian updating; SE = standard error; PSD = posterior standard deviation. Asterisks indicate results that were flagged. Item 1 showed DIF according to the ETS C rule, but not BU Loss Function-Conservative (BU LF-C), in Administration 4. Items 2–5 showed DIF according to BU LF-C, but not the C rule, in Administration 4. By definition, the BU results for Administration 1 are the same as the EB results.
The table shows the histories for these items, including the MH and BU estimates,
Table 10 also confirms that BU does not always lead to the same results as the other two aggregation methods. For Items 1, 2, and 4, the three methods led to the same conclusion, but for Items 3 and 5, the BU approach led to flagging, whereas the two aggregated MH methods (like the ordinary MH method) did not. These differences are discussed further in the next section.
Discussion
Our study of the BU DIF procedure comprised three phases, involving properties of competing DIF estimators, properties of competing DIF hypothesis-testing procedures, and results of applying competing DIF methods to actual test data.
Properties of the BU estimator were investigated using item response data simulated under four conditions. In the simulation model, the same test was “administered” 4 times, with the DIF parameters remaining constant across administrations. The BU statistic was compared to the Mantel–Haenszel DIF statistic, the empirical Bayes (EB) DIF statistic (without updating) of Zwick et al. (1999, 2000), the fully Bayes (FB) DIF statistic of Sinharay et al. (2009), and two MH statistics based on multiple administrations—the Average MH statistic and the Combined-data MH statistic. Averaged over items and replications, the BU statistic was found to have the smallest variance of all the methods. The BU statistics and the aggregated MH statistics had the smallest average RMSR values. The average bias of the BU statistic exceeded that of MH and aggregated MH statistics but was smaller than that of EB and FB.
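The accumulation of information across administrations can be illustrated with a minimal sketch. It assumes a normal prior and a normal sampling distribution for the MH statistic, so that the posterior from one administration serves as the prior for the next. The function names and all numeric inputs are hypothetical and are not taken from the study; the Administration 1 prior mean and variance, which would be estimated from data in an EB analysis, are simply supplied as arguments here.

```python
from typing import List, Tuple


def bayesian_update(prior_mean: float, prior_var: float,
                    mh: float, se: float) -> Tuple[float, float]:
    """One conjugate normal update of a DIF parameter (illustrative).

    mh is the observed MH statistic for the administration and se is
    its standard error (both hypothetical inputs in this sketch).
    """
    s2 = se ** 2
    # Weight on the new MH value: large when the prior is diffuse
    # relative to the sampling variance, small when the prior is tight.
    w = prior_var / (prior_var + s2)
    post_mean = w * mh + (1.0 - w) * prior_mean
    post_var = w * s2
    return post_mean, post_var


def bu_history(mh_values: List[float], ses: List[float],
               prior_mean: float, prior_var: float
               ) -> List[Tuple[float, float]]:
    """Apply the update sequentially; each posterior becomes the next prior."""
    history = []
    mean, var = prior_mean, prior_var
    for mh, se in zip(mh_values, ses):
        mean, var = bayesian_update(mean, var, mh, se)
        history.append((mean, var))
    return history


if __name__ == "__main__":
    # Hypothetical item: four administrations with made-up MH values and SEs.
    for admin, (m, v) in enumerate(
            bu_history([-1.4, -1.1, -1.6, -1.3], [0.6, 0.7, 0.6, 0.3],
                       prior_mean=0.0, prior_var=0.5), start=1):
        print(f"Administration {admin}: posterior mean={m:.3f}, PSD={v ** 0.5:.3f}")
```

Under these assumptions the posterior variance can only decrease as administrations accumulate, which is in keeping with the small variance reported for the BU statistic.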
Several rules for flagging DIF items were also investigated using the simulated data: two rules each for the BU, EB, and FB statistics, as well as the ETS rules for identifying items as B and C items. The ETS rules were applied to the MH, the Average MH, and the Combined-data MH statistics. Results showed that the BU approach for accumulating information across administrations enhanced DIF identification, although the Average MH and Combined-data MH B rules performed about as well as BU LF-L. Among the more conservative flagging rules, BU LF-C outperformed all other methods, including the aggregated MH approaches.
Finally, the BU approach was compared to the competing approaches using data from several college-level tests. Results from four administrations of a particular Literature test form suggest that the BU approach is more likely than either the current ETS approach or the aggregated MH methods to flag items that consistently show moderately high DIF values across several administrations.
It is useful to consider the two reasons that the BU approach performs differently from the aggregated MH methods: The DIF estimator is different, and the decision procedure is different.
First, consider the estimator. The BU and Average MH statistics represent alternative methods of weighting the MH statistics from the four administrations. (The BU statistic also includes an additive term corresponding to the prior mean from Administration 1.) The weighting scheme for the BU statistic operates such that if the current estimate of the prior variance (i.e., the posterior variance from the previous administration) is small relative to the current value of
The second reason that the BU and the aggregated MH procedures can produce different conclusions is that the rules for flagging items are quite different. For the aggregated MH rules, we applied the ETS B and C rules, which involve both a statistical significance criterion and a magnitude criterion. For the BU method, we used the loss-function-based rules described in Equations 7 and 8. It is possible that the aggregated MH statistics would perform better if different decision criteria were applied.
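To illustrate how a rule that combines a significance criterion with a magnitude criterion operates, the following sketch classifies an item as A, B, or C from an MH D-DIF value and its standard error. The magnitude thresholds of 1.0 and 1.5 follow the commonly cited ETS criteria; the use of normal-approximation z tests and the particular critical values are simplifying assumptions made for this example and may differ from the operational procedure. The function name is hypothetical. The loss-function-based rules of Equations 7 and 8 are not reproduced here.

```python
def ets_category(mh_d_dif: float, se: float,
                 z_crit_zero: float = 1.96,
                 z_crit_one: float = 1.645) -> str:
    """Approximate ETS A/B/C classification (illustrative sketch).

    z_crit_zero and z_crit_one are assumed critical values for the
    tests against 0 and against 1, respectively.
    """
    abs_d = abs(mh_d_dif)
    # Significance criterion 1: is the statistic different from zero?
    sig_zero = abs_d / se > z_crit_zero
    # Significance criterion 2: does its magnitude exceed 1?
    sig_above_one = (abs_d - 1.0) / se > z_crit_one

    if abs_d >= 1.5 and sig_above_one:
        return "C"  # moderate-to-large DIF
    if abs_d >= 1.0 and sig_zero:
        return "B"  # slight-to-moderate DIF
    return "A"      # negligible or nonsignificant DIF


if __name__ == "__main__":
    # Hypothetical (MH D-DIF, SE) pairs; not values from the study.
    for d, s in [(2.0, 0.3), (1.2, 0.3), (1.8, 0.9), (0.5, 0.1)]:
        print(d, s, ets_category(d, s))
```

Because the same MH value can fall in different categories depending on its standard error, a rule of this kind and a loss-function rule can reach different decisions even when applied to similar estimates.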
Overall, our findings on the performance of the BU procedure were very favorable, indicating that it can provide improved information on items that are administered more than once as part of intact test forms. By yielding more accurate DIF estimates, the procedure could facilitate the interpretation of DIF findings. The BU and other Bayesian approaches could also allow the performance of DIF analysis in smaller demographic groups than is currently typical, a feature that may be particularly important as definitions of race and ethnicity become more specific and as adaptive testing (which can produce sparse data) increases in popularity. Further exploration of the method’s performance in groups and administrations of varying size would be fruitful, as would an investigation of the method’s functioning when the true DIF varies over administrations. Extensions of the BU method could include application to items that are readministered, but in new contexts, to polytomous items, to DIF statistics other than the Mantel–Haenszel, and to other item statistics, such as difficulty and discrimination.
Acknowledgments
We appreciate the assistance we received from Diane Cruz, Edward Kulick, and Mei Su and the manuscript reviews provided by Edward Kulick and Sandip Sinharay.
