Abstract
There has been growing use of ideal point models to develop scales measuring important psychological constructs. For meaningful comparisons across groups, it is important to identify items on such scales that exhibit differential item functioning (DIF). In this study, the authors examined several methods for assessing DIF on polytomous items generated by an ideal point process. Two paradigms (i.e., null hypothesis significance testing [NHST] and effect size quantification) were utilized, and three test statistics (i.e., the log-likelihood ratio [LR], the Akaike information criterion [AIC], and Lord’s chi-square) and two approaches to DIF testing (i.e., the constrained and free baseline methods) were evaluated. In addition, the authors investigated three levels of impact. The results revealed that DIF effect sizes were moderately large for the .50 uniform DIF conditions and small for nonuniform DIF; moreover, the LR test in general yielded the best results. When there was small to moderate impact, the free baseline approach combined with an item linking implementation produced the most satisfactory results.
The measurement of individual differences is fundamentally important for psychological research, educational and clinical decision making (R. J. Cohen & Swerdlik, 2002), and personnel selection (Hough & Oswald, 2000). Accurately measuring and comparing individual differences underpins individual difference theory (Ackerman & Humphreys, 1990), and these processes are vital to the study of domains such as job performance (Motowidlo, Borman, & Schmit, 1997) and work motivation (Mitchell & Daniels, 2003). Recent advancements in measurement science have converged on the notion that self-reported typical behaviors (e.g., personality, attitudes, and vocational interests) are most appropriately characterized by an ideal point response process (Drasgow, Chernyshenko, & Stark, 2010a, 2010b; Roberts, Laughlin, & Wedell, 1999; Stark, Chernyshenko, Drasgow, & Williams, 2006; Tay & Drasgow, 2012; Tay, Drasgow, Rounds, & Williams, 2009).
Ideal point response processes, also termed the unfolding technique (Coombs, 1964), contrast sharply with dominance response processes and their associated traditional psychometric models, which include classical test theory, factor analysis, and logistic item response theory (IRT). Dominance models assume that the probability of a correct or affirmative response to an item is positively related to $(\theta_j - b_i)$, whereas ideal point models assume that the probability is negatively related to $|\theta_j - b_i|$, where $\theta_j$ denotes the standing of person j on the latent trait and $b_i$ denotes the extremity of item i. Thus, ideal point models posit that individuals are most likely to endorse items that are closest to their latent trait standing, and that the probability of endorsement is nonmonotonic in the trait, decreasing as the item’s location becomes more discrepant from the person’s standing in either direction.
The rationale and importance of using ideal point models to characterize self-reported typical behaviors converge from both theoretical and empirical perspectives. Theoretically, Thurstone (1928) argued that individuals use introspective cognitive processes when answering items, asking themselves, “Does this statement closely describe me?” Therefore, the closer an item’s location on the latent trait continuum to the individual’s standing, the greater the probability the individual endorses the item, and the probability of endorsing an item decreases as items’ locations are further away from an individual’s ideal point (Thurstone, 1929). For example, the item “Compulsory military training in all countries should be reduced but not eliminated” from the Thurstone–Droba War scale (cited by Likert, 1932, p. 34) might be disagreed with by both a strong militant and a strong pacifist. Empirically, Chernyshenko, Stark, Chan, Drasgow, and Williams (2001) fit several IRT models to data from the Sixteen Personality Factor (16PF) Questionnaire and found the best model was not monotonic, which is the hallmark of dominance models, but, instead, found nonmonotonicities in the probabilities of endorsement, suggesting an ideal point response process. In addition, a study by Carter and Dalal (2010) has demonstrated that, because of its unique flexibility, an ideal point model fits work satisfaction data better than dominance models. Tay, Ali, Drasgow, and Williams (2011) further conducted a simulation study and found that ideal point models are not simply more flexible (e.g., fitting many types of response processes): A goodness-of-fit index had substantial power to detect model misspecification when fitting an ideal point model to dominance data and vice versa.
If an ideal point model is required to appropriately characterize responses to assessments of psychological constructs using introspection, it is crucial to devise methods to assess whether items measure similarly across relevant groups. For legal and scientific reasons, measurement equivalence has been on the center stage of psychological measurement research (Thissen & Wainer, 2001). Common measurement equivalence procedures—such as multigroup confirmatory factor analysis (CFA) and differential item functioning (DIF)—are often used to ensure that group differences on individual difference measures are attributable to actual differences in the trait assessed (Stark, Chernyshenko, & Drasgow, 2004, 2006) rather than a flaw in the assessment instrument.
A recent study by Carter and Zickar (2011) initiated research on the evaluation of DIF for ideal point models (see also Roberts & Gordon, 2008). These authors examined a log-likelihood ratio (LR) method and Raju’s differential functioning of items and tests (DFIT; Raju, van der Linden, & Fleer, 1995) and found the LR approach to function more effectively.
Many fundamental questions about DIF detection with ideal point models remain unanswered. For example, how do different approaches to DIF assessment (e.g., the constrained baseline and free baseline approaches) and different statistical testing methods (e.g., the log-likelihood test, Akaike’s information criterion [AIC], and Lord’s chi-square) influence the efficacy of DIF detection? Are different forms of DIF (uniform and nonuniform DIF) detectable? What is the effect of impact (i.e., mean differences in the trait across reference and focal groups)? What is the magnitude of DIF when it is quantified by a statistic like Cohen’s d?
To begin answering these important questions, the authors utilized two paradigms: (a) the null hypothesis significance testing (NHST) paradigm and (b) the effect size quantification paradigm. NHST has been the most common approach to the study of DIF. Here two models (a null model and an alternative model) are constructed and compared, and a test statistic is computed. If the test statistic is statistically significant, the null hypothesis is rejected and the studied item is flagged for DIF. The authors adopted two approaches to constructing NHST models, the constrained baseline approach and the free baseline approach, and used three test statistics: the LR test, the AIC (Akaike, 1973, 1974), and Lord’s (1980) chi-square test. In addition, their DIF detection performance was examined in situations where there was impact.
An alternative to NHST is the effect size paradigm where the magnitude of differences is quantified (J. Cohen, 1992). For DIF, an effect size measure indexes the magnitude of differential functioning between reference and focal populations (Meade, 2010). A compelling virtue of the use of DIF effect size is that it is independent of sample size, and thus, large Ns do not inevitably lead to many significant differences.
In the following, the authors begin by introducing the most commonly used ideal point IRT model—the generalized graded unfolding model (GGUM)—and then discuss the approaches to DIF testing with the NHST paradigm. Next, they elaborate on the effect size paradigm and introduce the DIF quantification method utilized in this study. Then these approaches are used in a simulation study.
The GGUM and DIF Testing Models
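Although multiple ideal point IRT models have been proposed, including the hyperbolic cosine latent trait model (Andrich & Luo, 1993) and the normal probability density function model (Maydeu-Olivares, Hernandez, & McDonald, 2006), the GGUM (Roberts, Donoghue, & Laughlin, 2000, 2002) is perhaps the most widely used. The GGUM has the form
\[
P(Z_i = z \mid \theta_j) = \frac{\exp\left\{\alpha_i\left[z(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\} + \exp\left\{\alpha_i\left[(M - z)(\theta_j - \delta_i) - \sum_{k=0}^{z}\tau_{ik}\right]\right\}}{\sum_{w=0}^{C}\left(\exp\left\{\alpha_i\left[w(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\} + \exp\left\{\alpha_i\left[(M - w)(\theta_j - \delta_i) - \sum_{k=0}^{w}\tau_{ik}\right]\right\}\right)}, \tag{1}
\]
where i denotes the ith item; $Z_i$ represents a random variable denoting the response to the ith item; z = 0, 1, 2, … , C, where z is the observed response, 0 represents the strongest level of disagreement, and C represents the strongest level of agreement; $\theta_j$ denotes the location of the jth individual on the latent continuum; $\delta_i$ represents the location of the ith item on the latent continuum; $\alpha_i$ represents the discrimination of the ith item; $\tau_{ik}$ denotes the kth subjective category threshold parameter associated with the ith item, with $\tau_{i0}$ fixed at 0 by convention; C is the number of observable response categories minus 1; and M = 2C + 1.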
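As an illustration of how the GGUM category probabilities can be evaluated, the following is a minimal sketch in Python. It assumes the standard GGUM parameterization of Roberts et al. (2000) with $\tau_{i0} = 0$; the function name `ggum_probs` and the example parameter values are hypothetical and are not taken from the study.

```python
import math

def ggum_probs(theta, alpha, delta, taus):
    """Return [P(Z=0|theta), ..., P(Z=C|theta)] under the GGUM.

    `taus` holds the C thresholds tau_1..tau_C; tau_0 = 0 is prepended here
    (assumed convention). All names are illustrative, not from the study.
    """
    C = len(taus)
    M = 2 * C + 1
    tau = [0.0] + list(taus)
    numerators = []
    cum_tau = 0.0
    for z in range(C + 1):
        cum_tau += tau[z]  # running sum of tau_0..tau_z
        from_below = math.exp(alpha * (z * (theta - delta) - cum_tau))
        from_above = math.exp(alpha * ((M - z) * (theta - delta) - cum_tau))
        numerators.append(from_below + from_above)
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical five-category item located at delta = 0
example_taus = [-1.3, -1.0, -0.7, -0.4]
at_item = ggum_probs(0.0, 1.0, 0.0, example_taus)
far_above = ggum_probs(3.0, 1.0, 0.0, example_taus)
far_below = ggum_probs(-3.0, 1.0, 0.0, example_taus)
```

In this sketch, a respondent located at the item endorses the highest category with the greatest probability, and respondents equally far above or below the item receive identical category probabilities, which is the single-peaked behavior that distinguishes ideal point from dominance models.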
The authors discuss two approaches—the constrained baseline approach and free baseline approach—to constructing the LR test and the AIC test. Previous research has revealed that these approaches can yield very different results (Stark, Chernyshenko, & Drasgow, 2006).
The Constrained Baseline Approach
The constrained baseline approach has been commonly used for DIF analysis. This approach constrains the parameters of all items to be equal across reference and focal groups in its baseline model. The alternative model freely estimates the parameters of the studied item while holding the parameters of all other items equal across groups. Then, a test statistic (e.g., the LR) is computed from the baseline and the alternative models. If a given item truly has DIF, the difference of log-likelihood chi-square statistics is expected to be greater than a critical value, and the AIC value from the alternative model will be smaller than that from the baseline model.
The Free Baseline Approach
The free baseline approach assumes there may be DIF on some items and begins by freely estimating item parameters in the reference and focal groups. Its alternative model assumes no DIF on the studied item and constrains its parameters across the reference and focal groups. Then the log-likelihood chi-square statistic and AIC indices are calculated as in the constrained baseline approach (note that DIF is indicated when the AIC is smaller for the baseline model). Statistical theory (Maydeu-Olivares & Cai, 2006) indicates that the free baseline approach should be preferred. Moreover, a simulation study (Stark, Chernyshenko, & Drasgow, 2006) found much lower Type I error rates and higher power with the free baseline approach for dominance IRT and CFA models. However, no study has compared the two approaches for ideal point models.
Three Statistical Testing Methods for DIF Detection
LR Test
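The LR test is a popular testing method for comparing nested models. The test statistic is
\[
\chi^2_{LR} = -2\ln\left(\frac{L_C}{L_A}\right) = -2\ln L_C - (-2\ln L_A), \tag{2}
\]
where $L_C$ and $L_A$ are the maximized likelihoods of the more constrained (compact) model and the less constrained (augmented) model, respectively. Under the null hypothesis of no DIF, the statistic is asymptotically chi-square distributed with degrees of freedom equal to the difference in the number of estimated parameters between the two models.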
AIC
The AIC is an index of model fit based on information theory. According to Akaike (1973, 1974),
\[
AIC = -2\ln L_m + 2h. \tag{3}
\]
Here, Lm is the maximum likelihood of the estimated model and h is the number of parameters estimated for the model. The AIC index is typically used to evaluate the fit of alternative models. In this study, the authors utilized the AIC index for NHST: The AIC indices of two models (i.e., a null model and an alternative model) were compared to determine whether a model assuming DIF was a better fit. Because a smaller AIC index indicates better fit, the studied item was identified as containing DIF if the AIC index of the DIF model was smaller. To the best of the authors’ knowledge, little research has examined the use of the AIC index to detect DIF, and its effectiveness for this purpose has not previously been explored. The authors included this method because the AIC index is conveniently generated by GGUM and would offer a computational advantage if it proved effective for DIF analysis.
Lord’s Chi-Square
Lord’s (1980) chi-square method is well known and widely used for assessing DIF. For this method, item parameters are first estimated separately for each group, and the latent trait scales are linked so that all the parameters are put on the same metric. Then item parameter estimates for different groups are compared; a significant difference between estimates indicates DIF. Lord’s chi-square method has been extensively studied for dominance models and has been found to be robust and powerful. For example, by conducting simulation studies, McLaughlin and Drasgow (1987) and Lim and Drasgow (1990) found that Lord’s chi-square produced accurate DIF results when marginal maximum likelihood or Bayes modal estimates were used. In an empirical study with a 45-item vocabulary test, Raju, Drasgow, and Slinde (1993) found that Lord’s chi-square test yielded DIF detection results that closely agreed with the results of other methods such as the Mantel–Haenszel (MH), the signed area measure, and the unsigned area measure. Indeed, Donoghue and Isham (1998) reported Lord’s chi-square test as the most effective in identifying DIF items. However, little is known about the robustness and power of this method for ideal point models. One of the objectives was to compare Lord’s chi-square test with the LR test and AIC method.
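For reference, Lord’s statistic for a studied item i is commonly written in the form below. The notation is introduced here for illustration and is an assumption about presentation rather than the authors’ own: $\hat{v}_{iR}$ and $\hat{v}_{iF}$ denote the vectors of linked item parameter estimates in the reference and focal groups, and $\hat{\Sigma}_{iR}$ and $\hat{\Sigma}_{iF}$ denote their estimated sampling covariance matrices.

```latex
\chi^2_i = (\hat{v}_{iR} - \hat{v}_{iF})' \, (\hat{\Sigma}_{iR} + \hat{\Sigma}_{iF})^{-1} \, (\hat{v}_{iR} - \hat{v}_{iF})
```

Under the null hypothesis of no DIF, the statistic is referred to a chi-square distribution with degrees of freedom equal to the number of item parameters being compared.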
The Impact Issue for Ideal Point Models
The impact issue is an important consideration when assessing DIF detection. It occurs when there is a mean difference between the reference and focal groups. Donoghue, Holland, and Thayer (1993) and Mullis, Dossey, Owen, and Phillips (1993) reported that it is quite common for the reference and focal groups to have a mean discrepancy of one standard deviation. The unequal latent means between two groups may pose a problem for DIF detection. This problem has been carefully studied for dominance models. For example, Chang, Mazzeo, and Roussos (1996) simulated a reference sample from a N(0,1) distribution and a focal sample from a N(−1,1) distribution. Their Simultaneous Item Bias Test (SIBTEST) procedure yielded appropriately small Type I error rates for the impact conditions. Using similar manipulations, but with the MH method, Wang and Su (2004a) found small Type I error rates in the impact conditions for the one-parameter logistic (1PL) model but inflated Type I error rates for the two-parameter logistic (2PL) and three-parameter logistic (3PL) models. It has been found that under impact conditions, Type I error rates increase as the number of parameters increases, and the common DIF assessment methods (e.g., MH, SIBTEST, and LR test) perform better when there is no impact than when there is impact (Finch, 2005; Wang & Su, 2004a, 2004b).
The authors suspect that the impact issue is prevalent for constructs that require ideal point models. These constructs include a variety of personality traits, attitudes, and vocational interests, which may vary substantially among subpopulations (e.g., genders, cultures, socioeconomic status [SES] classes, age subgroups). For example, Westerners are significantly more extraverted than non-Westerners, and Hong Kong Chinese and Japanese are significantly more neurotic than mainland Chinese and South Koreans, indicating that subcultures can differ even in the same geographic area (McCrae et al., 2010). However, the common DIF detection methods have not been examined under impact situations for ideal point models. By incorporating impact conditions in this study, the authors sought to examine DIF detection under a range of conditions commonly encountered.
DIF Effect Size
Although the NHST paradigm has been widely used to detect DIF, it does not come without criticism. One criticism is that significance tests only provide a yes-or-no result and thereby fail to provide useful information such as the magnitude, value, or importance of an effect (Kirk, 2006). Importantly, the result of a significance test is highly dependent on the sample size: Even a trivial effect can reach statistical significance if the sample size is large enough.
The effect size paradigm overcomes these limitations by quantifying the magnitude of effect. In this study, two concerns were taken into consideration when the authors selected a measure of DIF effect size. First, the DIF areas for uniform DIF of ideal point models are almost completely symmetric with opposite signs. The signed area and related effect size measures (e.g., A. S. Cohen, Kim, & Baker, 1993; Penfield, 2010; Raju, 1988) will be close to zero due to cancellation. Second, because this study examined several forms of DIF, a calculation method is preferred that generates a standardized effect size that can be compared across conditions.
With these two considerations in mind, the authors selected a new method developed by Nye (2011). The Nye method first squares the difference between conditional expected scores, which avoids the cancellation problem and thus it is sensitive to different forms of DIF. More importantly, the Nye method standardizes the DIF effect size by dividing it by the pooled standard deviation of the reference and focal groups’ responses to a given item. This standardization puts all the DIF effect sizes on the same metric and this statistic—like Cohen’s d—is comparable across different DIF conditions and studies. This new method has been empirically tested in both dichotomous and polytomous dominance IRT contexts and has proved effective.
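Nye’s (2011) DIF effect size is
\[
d_{\mathrm{DIF},i} = \frac{1}{SD_{iP}} \sqrt{\int \left[ES_{iR}(\theta) - ES_{iF}(\theta)\right]^2 f_F(\theta)\, d\theta}. \tag{4}
\]
Here, $f_F(\theta)$ is the normal distribution latent trait density function for the focal group; $ES_{iR}(\theta)$ and $ES_{iF}(\theta)$ are the expected scores for item i given a latent trait value of θ in the reference and focal groups, respectively, and are given by
\[
ES_{iR}(\theta) = \sum_{k=0}^{C} k \, P_{iR}(x = k \mid \theta) \tag{5}
\]
and
\[
ES_{iF}(\theta) = \sum_{k=0}^{C} k \, P_{iF}(x = k \mid \theta) \tag{6}
\]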
for polytomous items that are scored x = 0, 1, … , C, and SDiP is the pooled standard deviation of reference and focal groups and is used to standardize the effect. SDiP is computed as
Here, NR and NF are the sample sizes of the reference and focal groups, respectively; Var iR and Var iF are the variances in the reference and focal groups, respectively, and they are given by
and
where PiR(x = k|θ) and PiF(x = k|θ) are the probabilities of an examinee in the reference and focal groups, respectively, with ability level θ selecting category k of item i. In this study, these probabilities were calculated using the GGUM model in Equation 1. In practice, this integral may be approximated numerically.
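The numerical approximation mentioned above can be sketched as follows. This is a hedged illustration rather than the authors’ implementation: it assumes a simple equally spaced quadrature grid, assumes the pooled standard deviation is the square root of the (N − 1)-weighted average of the two group variances, and assumes each group variance is the marginal variance of item responses when θ follows N(mean, 1). All function names and parameter values are hypothetical.

```python
import math

def ggum_probs(theta, alpha, delta, taus):
    """GGUM category probabilities (tau_0 = 0 assumed)."""
    C = len(taus)
    M = 2 * C + 1
    tau = [0.0] + list(taus)
    nums, cum = [], 0.0
    for z in range(C + 1):
        cum += tau[z]
        nums.append(math.exp(alpha * (z * (theta - delta) - cum))
                    + math.exp(alpha * ((M - z) * (theta - delta) - cum)))
    total = sum(nums)
    return [n / total for n in nums]

def normal_weights(grid, mean):
    """Normalized N(mean, 1) density weights on the grid."""
    w = [math.exp(-0.5 * (t - mean) ** 2) for t in grid]
    total = sum(w)
    return [x / total for x in w]

def expected_score(theta, item):
    """Conditional expected item score: sum over k of k * P(x = k | theta)."""
    return sum(k * p for k, p in enumerate(ggum_probs(theta, *item)))

def marginal_variance(item, mean, grid):
    """ASSUMPTION: variance of item responses in a group with theta ~ N(mean, 1)."""
    m1 = m2 = 0.0
    for t, w in zip(grid, normal_weights(grid, mean)):
        probs = ggum_probs(t, *item)
        m1 += w * sum(k * p for k, p in enumerate(probs))
        m2 += w * sum(k * k * p for k, p in enumerate(probs))
    return m2 - m1 ** 2

def dif_effect_size(item_ref, item_foc, mean_ref=0.0, mean_foc=0.0,
                    n_ref=500, n_foc=500, n_points=401):
    """Standardized root mean squared expected-score difference,
    weighted by the focal-group trait density."""
    grid = [-6.0 + 12.0 * i / (n_points - 1) for i in range(n_points)]
    msd = sum(w * (expected_score(t, item_ref) - expected_score(t, item_foc)) ** 2
              for t, w in zip(grid, normal_weights(grid, mean_foc)))
    var_r = marginal_variance(item_ref, mean_ref, grid)
    var_f = marginal_variance(item_foc, mean_foc, grid)
    # ASSUMPTION: pooled SD from (N - 1)-weighted variances
    pooled_sd = math.sqrt(((n_ref - 1) * var_r + (n_foc - 1) * var_f)
                          / (n_ref + n_foc - 2))
    return math.sqrt(msd) / pooled_sd

# Hypothetical item as (alpha, delta, taus); focal versions shift delta
taus = [-1.3, -1.0, -0.7, -0.4]
reference_item = (1.0, 0.0, taus)
es_none = dif_effect_size(reference_item, reference_item)
es_small = dif_effect_size(reference_item, (1.0, 0.25, taus))
es_large = dif_effect_size(reference_item, (1.0, 0.50, taus))
```

Because the expected-score difference is squared before integration, positive and negative differences on either side of the item location do not cancel, which is the property that motivated the choice of this effect size for ideal point models.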
The Present Study
Using simulation procedures, the present study was designed to systematically examine DIF detection for ideal point IRT models under the NHST and effect size paradigms. Under the NHST paradigm, the authors examined three DIF detection methods and compared two approaches to DIF testing model construction. In addition, they investigated DIF detection under various conditions by manipulating five data characteristics: (a) type of DIF, (b) level of impact, (c) scale length, (d) percentage of DIF items, and (e) sample size per comparison group. The DIF effect size paradigm was used to overcome limitations of the NHST paradigm and help better understand DIF detection by offering a precise measure of DIF effect magnitude.
Method
Basic Simulation Setup for the GGUM Model
Item Parameter Generation
The authors followed the procedures of Roberts et al. (2002) for the basic model generation in the simulation, because those authors proposed the GGUM and their simulation procedures have been followed in other studies (e.g., Tay et al., 2011), which facilitates comparisons across studies. Specifically, $\alpha_i$ was generated from a uniform distribution U(0.5, 2). The location parameters, $\delta_i$, were evenly spaced over the interval [−2, 2], with the first and nth items always located at −2 and 2, respectively. The responses to each item had five categories: 0 to 4, where 0 represented the strongest disagreement and 4 represented the strongest agreement. Therefore, each item required four thresholds ($\tau_{i1}$, $\tau_{i2}$, $\tau_{i3}$, $\tau_{i4}$); $\tau_{i4}$ was generated from a uniform (−1.4, −0.4) distribution and the other thresholds (i.e., $\tau_{i1}$, $\tau_{i2}$, $\tau_{i3}$) were calculated recursively using Equation 10:
\[
\tau_{i,k-1} = \tau_{ik} - 0.25 + e_{i,k-1}, \quad k = 2, 3, 4, \tag{10}
\]
where eik− 1 represents a random error term generated from a N(0,0.04) distribution.
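The generation scheme described above could be sketched as follows. This is an illustrative reading, not the authors’ script: the threshold recursion is assumed to follow Roberts et al. (2002), in which each lower threshold equals the next higher threshold minus 0.25 plus a random error, and N(0, 0.04) is assumed to specify a variance of 0.04 (a standard deviation of 0.2). The function name and dictionary keys are hypothetical.

```python
import random

def generate_item_parameters(n_items, n_categories=5, rng=None):
    """Generate GGUM item parameters for one simulated scale (illustrative)."""
    rng = rng or random.Random()
    C = n_categories - 1
    items = []
    for i in range(n_items):
        alpha = rng.uniform(0.5, 2.0)              # discrimination ~ U(0.5, 2)
        delta = -2.0 + 4.0 * i / (n_items - 1)     # evenly spaced on [-2, 2]
        taus = [0.0] * (C + 1)                     # index 0 is a placeholder
        taus[C] = rng.uniform(-1.4, -0.4)          # highest threshold
        for k in range(C, 1, -1):
            # ASSUMPTION: recursion and error SD of 0.2 (variance 0.04)
            taus[k - 1] = taus[k] - 0.25 + rng.gauss(0.0, 0.2)
        items.append({"alpha": alpha, "delta": delta, "taus": taus[1:]})
    return items

# Example: a 10-item scale with a fixed seed for reproducibility
scale = generate_item_parameters(10, rng=random.Random(2011))
```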
DIF Item Manipulations
The number of DIF items for the focal group was determined by the product of the scale length (10 or 20 items) and the percentage of DIF items in the scale (0%, 20%, or 40%). DIF items were randomly selected for each replication. For large and small uniform DIF conditions, δ i values of the selected DIF items were incremented by .50 and .25, respectively; for nonuniform DIF conditions, α i values of the selected DIF items were increased by .50.
Theta Generation and Impact Manipulation
In this study, all latent trait values for the reference group were sampled from a N(0,1) distribution. For the conditions of no impact, the latent trait values for the focal group were also sampled from a N(0,1) distribution. For the conditions of .25 SD and .50 SD impact, the latent trait values for the focal group were sampled, respectively, from N(−0.25,1) and N(−0.50,1) distributions.
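The DIF and impact manipulations described in the two preceding subsections can be illustrated with a short sketch. The function names, the condition labels, and the item dictionary layout are hypothetical; only the shift sizes (.50 and .25 for δ, .50 for α) and the focal-group means (0, −0.25, −0.50) come from the text.

```python
import random

def apply_dif(reference_items, dif_proportion, dif_type, rng):
    """Return focal-group item parameters and the indices of the DIF items."""
    n_dif = round(len(reference_items) * dif_proportion)
    dif_idx = sorted(rng.sample(range(len(reference_items)), n_dif))
    focal_items = [dict(item) for item in reference_items]  # copy each item
    for i in dif_idx:
        if dif_type == "uniform_large":
            focal_items[i]["delta"] += 0.50
        elif dif_type == "uniform_small":
            focal_items[i]["delta"] += 0.25
        elif dif_type == "nonuniform":
            focal_items[i]["alpha"] += 0.50
        else:
            raise ValueError("unknown DIF type: " + dif_type)
    return focal_items, dif_idx

def sample_thetas(n, impact_sd, rng):
    """Sample latent traits from N(-impact_sd, 1); impact_sd = 0 for the reference group."""
    return [rng.gauss(-impact_sd, 1.0) for _ in range(n)]

# Example: 10 items, 20% large uniform DIF, .50 SD impact
rng = random.Random(7)
ref_items = [{"alpha": 1.0, "delta": -2.0 + 4.0 * i / 9} for i in range(10)]
foc_items, dif_items = apply_dif(ref_items, 0.20, "uniform_large", rng)
theta_ref = sample_thetas(20000, 0.0, rng)
theta_foc = sample_thetas(20000, 0.50, rng)
```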
Response Data Generation
After item parameters were generated for the reference and focal groups, Equation 1 was used to compute the probability that each category of a given item was endorsed by a respondent with a simulated theta value (θj). Then, the cumulative probability from the first category to the fifth category of an item (i.e., a vector
Constructing DIF Tests
Constrained Baseline Analysis
Here, item parameter estimates were constrained across the reference and focal groups. The corresponding alternative model was similar to the baseline model except that the studied DIF item was freely estimated. Each item in turn was compared with the baseline model to determine whether the studied item contained DIF by using the LR test and the AIC (Roberts & Gordon, 2008).
Free Baseline Analysis
In this analysis, the responses of the reference and focal groups were separated, and item parameters were freely estimated across the two groups. Next, the parameters of each studied item in turn were constrained to be equal across the two groups, and DIF was evaluated by the LR test and the AIC (Stark, Chernyshenko, & Drasgow, 2006). This worked appropriately for conditions with no impact, because fitting models with the latent trait distributions specified as N(0,1) places item parameter estimates on the same metric. However, when there was impact between the reference and focal groups, it was necessary to take steps to place parameter estimates on the same scale. To do so, the authors used linking items (i.e., anchor items) in the estimation process. Specifically, 1, 2, or 4 non-DIF items were constrained to be equal across groups to link the latent trait metrics.
The linking items were randomly selected from non-DIF items. Random selection was preferred because a preliminary study showed that the DIF effect size was largely independent of theta location, and in practice, it is difficult to know a priori the characteristics of the linking items; thus, random selection improves the generalizability of the results. To implement 1-item linking, one item was randomly selected from the items with positive delta (δ i ) values (because the deltas were symmetric around 0, the results would be the same regardless of whether the item was selected from the negative or positive delta items). For 2- and 4-item linking, half of the linking items were randomly selected from the negative delta items and the other half were randomly selected from the positive delta items. After the linking item(s) were selected, a certain number of items (i.e., the product of the DIF percentage and scale length) were randomly selected from the remaining items for the DIF manipulation.
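The selection procedure for linking and DIF items can be sketched as follows. This is an illustration of the sampling logic only; the function name and arguments are hypothetical, and the sketch assumes no item has a delta of exactly 0, which holds for the evenly spaced 10- and 20-item designs described earlier.

```python
import random

def select_linking_and_dif_items(deltas, n_linking, n_dif, rng):
    """Pick linking (anchor) items first, then DIF items from the remainder."""
    negative = [i for i, d in enumerate(deltas) if d < 0]
    positive = [i for i, d in enumerate(deltas) if d > 0]
    if n_linking == 1:
        # A single anchor is drawn from the positive-delta items
        linking = rng.sample(positive, 1)
    else:
        # Anchors are split evenly between negative- and positive-delta items
        half = n_linking // 2
        linking = rng.sample(negative, half) + rng.sample(positive, half)
    remaining = [i for i in range(len(deltas)) if i not in linking]
    dif = rng.sample(remaining, n_dif)
    return sorted(linking), sorted(dif)

# Example: 10-item scale, 4 linking items, 40% DIF items
deltas = [-2.0 + 4.0 * i / 9 for i in range(10)]
linking_items, dif_items = select_linking_and_dif_items(
    deltas, n_linking=4, n_dif=4, rng=random.Random(3))
```

Selecting the anchors before the DIF items guarantees that no linking item is itself subjected to the DIF manipulation.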
Lord’s Chi-Square Method
After generating response data for the reference and focal groups, the authors first ran GGUM2004 (Version 1.1; Roberts & Shim, 2008) separately for each group. Then they used GGUMLINK (Version 1.0; Roberts, 2002) to link item parameters of the focal group to the reference group. Then Lord’s chi-square statistic was computed.
Calculating DIF Effect Sizes
The calculation of DIF effect size followed Equation 4. A preliminary study showed that the item-level DIF effect size was independent of scale length and sample size. Therefore, the authors only report DIF effect sizes for the 10-item scale in nine conditions: three types of DIF (i.e., .50 uniform DIF, .25 uniform DIF, and .50 nonuniform DIF) × three levels of impact (i.e., 0 SD, .25 SD, and .50 SD).
Summary
For the NHST approach, the authors investigated six factors: type of DIF (no DIF, nonuniform DIF, large uniform DIF, and small uniform DIF) × level of impact (no impact, .25 SD impact, and .5 SD impact) × scale length (10 and 20 items) × percentage of DIF items (0%, 20%, and 40%) × sample size (250, 500, and 1,000) × testing method (LR, AIC, and Lord’s chi-square). In addition, the constrained and free baseline approaches were nested in the log-likelihood test and AIC methods, and three forms of anchor items (1-, 2-, and 4-linking items) were nested under the .25 SD and .5 SD impact levels for the log-likelihood test and AIC methods. For the effect size paradigm, the authors examined two factors: type of DIF (no DIF, nonuniform DIF, large uniform DIF, and small uniform DIF) × level of impact (no impact, .25 SD impact, and .5 SD impact). The authors replicated 100 times for each condition in the NHST paradigm and replicated 1,000 times in the effect size paradigm.
A shell script was written in the R programming language (R Development Core Team, 2011) to automate the entire process, including data generation, DIF model construction, parameter estimation, output extraction, and DIF effect size calculation. In the item estimation process, the initial signs, which were determined by the simulated delta values of each item before the DIF manipulation, were added to improve estimation accuracy (see Roberts & Shim, 2008).
Analysis
Analysis for the NHST
The analyses for the NHST primarily focused on Type I error rates and power. The Type I error rate was calculated as the percentage of items that were falsely flagged as DIF items among the non-DIF items across 100 replications for each condition. Similarly, power was calculated as the percentage of items that were correctly flagged as DIF items among all the truly DIF items over the 100 replications.
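The two outcome measures can be computed by pooling flags over replications, as in the following minimal sketch. The data layout (one tuple of flagged indices, true DIF indices, and scale length per replication) and the function name are hypothetical.

```python
def error_rate_and_power(replications):
    """Pool item flags over replications.

    Each replication is (flagged, true_dif, n_items), where `flagged` and
    `true_dif` are sets of item indices. Returns (type_i_error_rate, power).
    """
    false_positives = non_dif_total = hits = dif_total = 0
    for flagged, true_dif, n_items in replications:
        for i in range(n_items):
            if i in true_dif:
                dif_total += 1
                hits += i in flagged             # correctly flagged DIF item
            else:
                non_dif_total += 1
                false_positives += i in flagged  # falsely flagged non-DIF item
    type_i = false_positives / non_dif_total if non_dif_total else float("nan")
    power = hits / dif_total if dif_total else float("nan")
    return type_i, power

# Two toy replications of a 10-item scale with DIF on items 0 and 1
toy = [({0, 1, 5}, {0, 1}, 10), ({0}, {0, 1}, 10)]
type_i_rate, power = error_rate_and_power(toy)
```

In the toy example, one of 16 non-DIF item tests is a false positive and three of four DIF item tests are hits, giving a Type I error rate of .0625 and power of .75.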
LR Test
In this study, the df for the χ2 statistic was 6, because each five-category item has six parameters (one α, one δ, and four τ thresholds). For a significance level of .05, the critical value of the χ2 statistic is 12.59.
Akaike Information Criterion (AIC)
A smaller AIC value indicates a better data-model fit when jointly considering fit and parsimony. For the constrained baseline approach, a studied item was flagged for DIF when the AIC value from the baseline/null model was greater than its corresponding alternative model (i.e., AICbaseline− AICalternative > 0); for the free baseline approach, a studied item was flagged for DIF when the AIC value from the baseline/null model was smaller than its corresponding alternative model (i.e., AICbaseline− AICalternative < 0).
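The flagging rules for the LR test and the AIC can be expressed compactly, as in the sketch below. It is an illustration under stated assumptions: the LR statistic is taken to be the usual difference in −2 log-likelihoods between the more constrained ("compact") and less constrained ("augmented") models, the chi-square tail probability uses the closed form that is valid only for even degrees of freedom, and all function names and log-likelihood values are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = total = 1.0
    for k in range(1, df // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def lr_flag(loglik_compact, loglik_augmented, df=6, alpha=0.05):
    """Return the LR statistic and whether the studied item is flagged."""
    g2 = -2.0 * (loglik_compact - loglik_augmented)
    return g2, chi2_sf_even_df(g2, df) < alpha

def aic(loglik, n_params):
    """Akaike information criterion: -2 ln L + 2h."""
    return -2.0 * loglik + 2.0 * n_params

def aic_flag(aic_baseline, aic_alternative, approach):
    """Apply the sign rule for the constrained or free baseline approach."""
    diff = aic_baseline - aic_alternative
    if approach == "constrained":
        return diff > 0   # freeing the studied item improved the AIC
    if approach == "free":
        return diff < 0   # constraining the studied item worsened the AIC
    raise ValueError("approach must be 'constrained' or 'free'")

# Hypothetical log-likelihoods for one studied item (constrained baseline)
g2, flagged_lr = lr_flag(loglik_compact=-5010.0, loglik_augmented=-5000.0)
flagged_aic = aic_flag(aic(-5010.0, 60), aic(-5000.0, 66), "constrained")
```

Note that in the free baseline approach the compact model is the one the text calls the alternative model, so the same `lr_flag` call applies with the roles of baseline and alternative exchanged.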
Lord’s Chi-Square Analysis
Calculation of the chi-square statistic for each item strictly followed Lord (1980) and an item was flagged for DIF if the obtained chi-square statistic was greater than 12.59.
Inferential Statistical Analysis
Besides the descriptive analyses, the authors also conducted inferential statistical analyses of Type I error rates and power to better understand the effects of different factors on the dependent variables. Specifically, they adopted De Ayala’s (De Ayala & Sava-Bolesta, 1999) method to analyze Type I error rates and power to examine the proportion of the variability accounted for by different factors.
Analysis for the DIF Effect Size
Across 1,000 replications, effect sizes were computed when each item was subjected to a DIF manipulation. Means and standard deviations of the effect sizes were then computed for each item in each condition. For each condition, the effect sizes for each item were about the same (as suggested by the small standard deviations of the 10 DIF effect sizes), so the authors computed the grand mean of all the 10 items and used the grand mean as the DIF effect size index for each condition.
Results
The results of the significance testing methods are summarized in this section and partly presented in Tables 1 to 3. DIF effect sizes were examined for nine DIF conditions, and each condition was based on 1,000 replications. The results for DIF effect sizes are presented in Table 4.
Table 1. Type I Error Rates and Power of DIF Detection When There Was No Impact.
Note: DIF = differential item functioning; AIC = Akaike information criterion. The numbers without parentheses were produced by the constrained baseline approach, and the numbers in parentheses were produced by the free baseline approach. The small uniform DIF conditions (.25 manipulation) are omitted here because of space limitations. The complete table is available from the first author upon request.
Table 2. Results of DIF Detection When the Impact Was .25 SD.
Note: DIF = differential item functioning; AIC = Akaike information criterion. The results were based on 4-item linking conditions; the numbers without parentheses were produced by the constrained baseline approach, and the numbers in parentheses were produced by the free baseline approach. The complete table is available from the first author upon request.
Table 3. Results of DIF Detection When the Impact Was .50 SD.
Note: DIF = differential item functioning; AIC = Akaike information criterion. The results were based on 4-item linking conditions; the numbers without parentheses were produced by the constrained baseline approach, and the numbers in parentheses were produced by the free baseline approach. The complete table is available from the first author upon request.
Table 4. DIF Effect Sizes for Different DIF Conditions for a Scale Length of 10 Items.
Note: DIF = differential item functioning; Mgrand = mean of DIF effect sizes based on the 1,000 replications; SDgrand = standard deviation based on all the 10 items over the 1,000 replications; SDitem = standard deviation of the means of DIF effect sizes of the 10 items.
Results From the NHST
Comparison of the Constrained and Free Baseline Approaches and Influence of Impact
The results revealed that the free baseline approach generally outperformed the constrained baseline approach when there was no impact. For example, when uniform DIF was simulated (Table 1), the Type I error rates were generally well controlled at around .05 with the free baseline approach, but they were highly inflated with the constrained baseline approach, especially when the sample size was large. Specifically, as shown in Table 1, for large uniform DIF conditions, the Type I error rates provided by the free baseline approach ranged from .03 to .07 (M = .05). In contrast, the Type I error rates provided by the constrained baseline approach ranged from .14 to .90 (M = .47). Power was high for the free baseline approach (M = .87). Because it failed to control the Type I error rate, the power values for the constrained baseline approach are not interpretable.
However, when there was moderate impact (e.g., .25 SD), the constrained baseline approach outperformed the free baseline approach for the 1- and 2-item linking conditions, but the free baseline approach performed better than the constrained baseline approach for the 4-item linking conditions; when there was high impact (e.g., .50 SD), neither the free baseline approach nor the constrained baseline approach performed well, regardless of the number of linking items. One important finding was that the constrained baseline approach seemed to be more resistant to impact than the free baseline approach: The results for the constrained baseline approach were almost unchanged across Tables 1 to 3, whereas the results for the free baseline approach changed dramatically from no impact to large impact.
Overall, the results from Tables 1 to 3 showed that impact exerted a dramatic effect on Type I error rates and power. When there was no impact (Table 1), low Type I error rates and high power were generally obtained with the free baseline approach. However, it became more difficult to obtain satisfactory results with increased level of impact. When the impact increased to .25 SD (Table 2), Type I error rates dramatically increased, especially for the 1- and 2-item linking conditions. When the impact was .50 SD, the authors found that none of the methods provided satisfactory results (i.e., low Type I error rates and high power), especially for the large uniform DIF conditions: The log-likelihood test with the constrained baseline approach provided varying results for different DIF types and the constrained baseline approach produced unacceptably high Type I error rates for conditions of 20% and 40% DIF.
Comparison of the Three Significance Testing Methods
Across Tables 1 to 3, the LR test generally performed best, although in some conditions the AIC method performed as well as the LR test. For the constrained baseline approach, the power of the AIC was similar to the power of the LR test in all conditions. However, the Type I error rates for the AIC were slightly higher than those for the LR test in all conditions. Under the AIC method, the free and constrained baseline approaches showed the same pattern that was observed for the LR test.
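The two likelihood-based statistics compared here can be sketched as follows. This is a generic illustration of a nested-model LR test and of the AIC, not the authors' implementation: the function names are hypothetical, the log-likelihoods would come from whatever estimation software is used, and the chi-square tail probability uses a closed form that holds only for an even number of degrees of freedom (an assumption made to keep the example free of external libraries).

```python
import math
from typing import Tuple


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i   # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def lr_test(lnl_compact: float, lnl_augmented: float, df: int) -> Tuple[float, float]:
    """Return (G2, p) for a nested comparison.

    lnl_compact: log-likelihood with the studied item's parameters
                 constrained equal across groups.
    lnl_augmented: log-likelihood with those parameters freed.
    df: number of parameters freed (assumed even in this sketch).
    """
    g2 = max(0.0, 2.0 * (lnl_augmented - lnl_compact))
    return g2, chi2_sf_even_df(g2, df)


def aic(lnl: float, n_params: int) -> float:
    """Akaike information criterion; the model with the lower value is preferred."""
    return 2.0 * n_params - 2.0 * lnl
```

In a DIF comparison, a small p value from `lr_test`, or a lower `aic` for the augmented model than for the compact model, would point toward DIF on the studied item. The exact decision rule used with the AIC in the study is not restated in this section, so the comparison rule here is an assumption.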
The results for Lord’s chi-square method are presented in the right two columns of Tables 1 through 3. This method produced satisfactorily low Type I error rates for nonuniform DIF conditions but inflated Type I error rates for the uniform DIF conditions. The Type I error rates for uniform DIF conditions varied with sample size, DIF percentage, and scale length. The pattern was very similar to the one observed for the log-likelihood test and AIC methods with the constrained baseline approach; that is, Type I error rates tended to increase with sample size, scale length, and percentage of DIF items in the scale. Even with inflated Type I error rates, the power of Lord’s chi-square was substantially lower than that of the free baseline log-likelihood method.
Results From the Inferential Statistical Analysis
The inferential analysis examined the variability of the Type I error rates and power by the LR test method accounted for by six factors: level of impact, baseline (constrained and free), DIF type, DIF percentage, scale length, and sample size. All the two-way interaction terms were also included in the analysis. The results showed that impact accounted for the biggest proportion for both Type I error rates and power (
DIF Effect Size
The results of the DIF effect sizes for the nine DIF conditions are presented in Table 4. The grand means (Mgrand) and grand standard deviations (SDgrand) are presented for DIF effect sizes for each condition, followed by the standard deviations of the 10 items’ means of DIF effect sizes (SDitem). The grand means and standard deviations are based on all 10 items over the 1,000 replications (i.e., 10,000 DIF effect sizes).
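The three summary statistics described above can be computed as in the following sketch. The function name and the replication-by-item layout are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_effect_sizes(effects: List[List[float]]) -> Tuple[float, float, float]:
    """Return (M_grand, SD_grand, SD_item).

    effects[r][i] is the DIF effect size of item i in replication r.
    """
    # Pool every item from every replication (10 x 1,000 values in the study).
    pooled = [value for rep in effects for value in rep]
    # Mean effect size of each item across replications.
    item_means = [mean(column) for column in zip(*effects)]
    return mean(pooled), stdev(pooled), stdev(item_means)
```

A small SD_item relative to SD_grand indicates that most of the variability comes from replication-to-replication noise rather than from differences between items.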
The DIF effect sizes showed several patterns. First, uniform DIF had a much larger effect size than nonuniform DIF for the same DIF manipulation (e.g., .50); the effect size for .50 nonuniform DIF was even smaller than that for .25 uniform DIF (M = .640 for .50 uniform DIF; M = .324 for .25 uniform DIF; M = .169 for .50 nonuniform DIF). According to J. Cohen (1992), an effect size around .20 is considered small, around .50 medium, and around .80 large. Thus, the DIF effect sizes in this study were moderately large for the .50 uniform DIF conditions and small for the nonuniform DIF conditions. Second, and interestingly, impact had almost no influence on DIF effect size. For each DIF type, increasing impact decreased the DIF effect size only very slightly, if at all (e.g., for the .50 uniform conditions: M = .640 with no impact, M = .640 with .25 SD impact, and M = .626 with .50 SD impact). Third, the standard deviations of the 10 items’ mean DIF effect sizes were quite small, ranging from .004 to .047, suggesting that the DIF effect size was largely independent of the items’ extremity and consistent across the entire latent trait continuum. This result also confirmed the appropriateness of randomly selecting items for DIF manipulations.
Discussion
As more researchers have recognized the value of ideal point IRT models for psychological measurement, DIF assessment has become increasingly important. Against this background, this study systematically assessed DIF for polytomous responses generated by an ideal point process. It contributes to the existing DIF literature in several ways. First, this is the first study to quantify DIF effect size for unfolding models. The authors overcame the challenges of the traditional method of calculating DIF effect size in ideal point models and used a new method to investigate the DIF effect size under various conditions. By quantifying DIF effect sizes for various data conditions, this study advances the understanding of DIF and its detection. Second, this is the first study to investigate the impact problem for the GGUM. As discussed previously, impact is quite prevalent for psychological attributes that require ideal point measurement models. This study examined three levels of impact and investigated their influence on both DIF detection and DIF effect sizes. The authors found that impact severely affected DIF detection under the NHST paradigm, especially when the impact was greater than .25 SD, although it had almost no influence on the DIF effect size. They also found that, when there was impact, the free baseline approach combined with an item linking implementation produced the most satisfactory results. These findings not only increase knowledge of DIF phenomena for the GGUM but also have practical implications. Third, the multiple approaches to DIF testing and the 126 DIF conditions examined in the study extend the initial work of Carter and Zickar (2011), who examined 12 conditions with two popular DIF testing methods, the LR test and Raju’s DFIT, and found that the LR method was superior to DFIT.
In the present study, the LR method was compared with the AIC and Lord’s chi-square methods using free and constrained baseline approaches under different levels of impact and with varying numbers of linking items. The authors found that the AIC method was almost as good as the LR method when there was no impact or the impact was small. When the impact was .50 SD, none of the three methods performed well. In addition, the authors used De Ayala’s method (De Ayala & Sava-Bolesta, 1999) to further analyze the effects of data characteristics (e.g., sample size, DIF type, DIF percentage, scale length) on DIF detection. DIF type and sample size were found to exert considerably greater influence on DIF detection than DIF percentage and scale length. These new findings improve the understanding of DIF detection for ideal point models.
Applied Implications and Recommendations
Checking Ideal Point Models and Verifying Testing Assumptions
The present simulations were entirely based on ideal point IRT models for DIF detection. However, in reality, the most appropriate model may not be clear in advance and thus the authors recommend that researchers and practitioners check model fit before adopting the current methods for DIF analysis. Tay et al. (2011) presented procedures for assessing the relative fit of both ideal point and dominance models.
In addition to the parametric form, the ideal point model (or, at least, GGUM) assumes a unidimensional latent trait and, consequently, local independence. Checking the assumption of unidimensionality is difficult because factor analysis of ideal point data can produce artificial factors (Davison, 1977; Tay & Drasgow, 2012). At the very least, the survey items should be examined, making sure that all the items are answered independently without influence from the preceding items, that is, there are no order effects or priming effects.
The Impact Problem
One of the important findings of this study is the impact problem for DIF detection with the GGUM. When there is impact, accurately identifying DIF items is especially critical because the observed group difference may result from measurement nonequivalence.
The findings of this study indicate that although impact did not affect the magnitude of the DIF effect size, it did undermine DIF detection. In particular, when the impact was as large as .50 SD, none of the testing methods or linking strategies detected DIF effectively, which poses a serious challenge for practitioners. Although this problem clearly requires further research, this study provides some recommendations for use until an effective solution is found. First, the authors recommend calculating the standardized mean difference of latent trait estimates between the two groups of interest. If the calculated impact is substantially greater than .25 SD, there seems at present to be no effective method for DIF detection. However, if the impact is of a more modest size, researchers may proceed by carefully selecting a few (e.g., 4) non-DIF items, using them as anchor items, and conducting a DIF analysis with the LR method and the free baseline approach.
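The first recommended step, computing the standardized mean difference of latent trait estimates, can be sketched as below. The function names are hypothetical, the pooled-standard-deviation form is one common definition that the text does not explicitly specify, and the .25 cutoff is only a rough screening value, because the text speaks of impact "substantially greater than" .25 SD without giving an exact threshold.

```python
import math
from statistics import mean, variance
from typing import Sequence


def standardized_mean_difference(focal: Sequence[float], reference: Sequence[float]) -> float:
    """Focal-minus-reference mean difference in pooled-SD units.

    Inputs are latent trait (theta) estimates for each group; each group
    needs at least two values and the pooled variance must be nonzero.
    """
    n_f, n_r = len(focal), len(reference)
    pooled_var = ((n_f - 1) * variance(focal) + (n_r - 1) * variance(reference)) / (n_f + n_r - 2)
    return (mean(focal) - mean(reference)) / math.sqrt(pooled_var)


def exceeds_impact_guideline(d: float, cutoff: float = 0.25) -> bool:
    """True when |d| is above the screening cutoff (assumed to be .25 SD here)."""
    return abs(d) > cutoff
```

If `exceeds_impact_guideline` returns True by a wide margin, the DIF results from any of the methods examined should be interpreted with caution.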
Of course, selecting non-DIF items is another challenge. However, this study suggests that the constrained baseline approach is a good starting point for finding them. The constrained baseline approach produced high power and inflated Type I error rates; that is, it tends to flag too many items rather than too few. Therefore, items classified as non-DIF by this approach are likely to be truly non-DIF. Thus, the authors recommend first using the constrained baseline approach to identify a few non-DIF items and then using those items as linking items with the free baseline approach to detect DIF.
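The anchor-selection step of this two-stage recommendation can be sketched as follows. The function name and the input format are hypothetical, and ranking the unflagged items by p value (largest first) is an added assumption; the text only recommends choosing a few items that the constrained baseline approach classifies as non-DIF.

```python
from typing import Dict, List


def select_anchor_items(p_values: Dict[str, float], alpha: float = 0.05, n_anchors: int = 4) -> List[str]:
    """Pick up to n_anchors items not flagged by constrained-baseline tests.

    p_values maps an item label to the p value of its constrained-baseline
    DIF test. Items with p >= alpha are treated as non-DIF candidates, and
    those showing the least evidence of DIF (largest p) are chosen first.
    """
    not_flagged = [(p, item) for item, p in p_values.items() if p >= alpha]
    not_flagged.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in not_flagged[:n_anchors]]
```

The selected items would then serve as the linking items in a free baseline analysis of the remaining items.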
DIF Model Construction and Testing Methods
The present findings revealed that both the free baseline approach and the constrained baseline approach have strengths and weaknesses, although the free baseline approach seems generally superior, which confirms previous findings by Stark, Chernyshenko, and Drasgow (2006). Although the free baseline approach outperformed the constrained baseline approach when there was no impact, it was more susceptible to impact, which highlights an important advantage of the constrained baseline approach. Therefore, the constrained baseline approach might be useful in situations with large impact. Similarly, Lord’s chi-square method showed less susceptibility to impact. Thus, it may also be useful to apply Lord’s chi-square method for DIF detection when there is substantial impact.
Limitations and Future Directions
Detecting DIF for ideal point models is a fairly new topic, and clearly more research is needed for researchers and practitioners to better understand this topic. In this study, the authors only investigated the GGUM model. It is possible that results may be different for other models, such as the normal probability density function model (Maydeu-Olivares et al., 2006). Moreover, although simulation is useful for assessing DIF detection methods, the authors acknowledge that not all the assumptions of the psychometric model may be satisfied in real situations. Thus, research on the robustness of DIF detection methods under assumption violation is clearly needed.
Having said that, the authors believe this study has identified important questions for future research. Perhaps the most important is the impact problem, which can be approached from several perspectives. First, as one of the reviewers speculated, GGUM2004 may have some metric indeterminacy issues, as it appeared to encounter problems in simultaneously estimating parameters of reference and focal groups when there was impact. To address this concern, the authors assigned delta signs for the items as advised by Dr. James Roberts and checked model fit. Nonetheless, further research into metric indeterminacy is needed. A second issue for further research concerns the number of item parameters estimated. Previous research has shown that Type I error rates increase as the number of parameters in the model increases when there is impact (e.g., Finch, 2005; Wang & Su, 2004a, 2004b). The GGUM involved six parameters per item in this study, which may have contributed to the impact problem. This speculation can be tested by fitting GGUMs with different numbers of response categories. A third topic for research focuses on the ideal point model itself. It is possible that the impact problem is not unique to the GGUM but also affects other unfolding IRT models. Thus, DIF detection needs to be examined with other ideal point models, such as the normal probability density function model (Maydeu-Olivares et al., 2006). Fourth, some combination of testing method and linking procedure may yet prove effective for DIF detection under impact conditions. For example, iteratively linking the latent trait metrics (Candell & Drasgow, 1988) might be considered for Lord’s chi-square method.
Conclusion
Impact is perhaps the biggest barrier to DIF detection for ideal point models. When there was impact, the authors observed many false positives; consequently, a significant DIF statistic may not truly indicate differential functioning. Therefore, it is important to calculate the standardized mean difference of latent trait estimates as a first step to gain insight into how subsequent analyses should be interpreted. If impact is close to zero, the LR test with the free baseline approach should prove effective; if impact is small to moderate (e.g., .25 SD), the authors recommend using the free baseline approach with a few anchor items. Detecting DIF when impact is .50 SD or greater is problematic for the methods that were examined. Clearly, more research is needed, and the authors believe this is an important direction for future work.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
