Abstract
The growing use of scales in survey questionnaires makes it necessary to address how polytomous differential item functioning (DIF) affects observed scale score comparisons. The aim of this study is to investigate the impact of DIF on the type I error rate and effect size of the independent samples t-test on the observed total scale scores. A simulation study was conducted, focusing on variables potentially related to DIF in polytomous items: DIF pattern, sample size, DIF magnitude, and percentage of DIF items. The results showed that DIF patterns and the number of DIF items affected the type I error rates and effect sizes of the t-test. The results highlight the need to analyze DIF before making comparative group interpretations.
Item bias and Differential Item Functioning (DIF) have been widely studied in psychological and educational testing since the 1980s. As indicated in the last edition of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 2014), there is a broad consensus about how DIF can undermine the validity of test score interpretations. As it is considered a threat to the validity of group comparisons based on test scores, psychometricians and statisticians have worked intensively to develop statistical tests, effect size measures, and criteria that not only identify DIF items but also help users make decisions on whether to keep or remove these DIF items from tests. For recent reviews on DIF research, readers should refer to the work of Hidalgo and Gómez-Benito (2010), Osterlind and Everson (2009), Sireci and Rios (2013), and Zumbo (2007).
Interest in item bias and DIF in survey research has been uneven. In general, scant attention has been paid to how DIF can affect survey data quality, with the exception of some studies, most of them in the health survey field, aimed at detecting DIF in particular scales included in survey questionnaires. Even within the total survey error paradigm (Biemer 2011), no systematic efforts have been made to evaluate the impact of DIF. This lack of attention may lead to the impact of DIF being underestimated in later statistical decisions based on survey data that include DIF items.
The growing use of psychological scales in education, public opinion, health, and quality-of-life surveys at national and international levels makes it necessary to address how DIF can affect statistical decisions based on scale scores included in survey questionnaires. For example, imagine that a survey researcher is interested in gender differences on one of the versions of the PROMIS scales (e.g., Cella et al. 2010) frequently used in health surveys. One research question could be: What is the impact of gender-based DIF on the eventual statistical decision about whether the group means (men vs. women) of the observed scores, or trait estimates, on the PROMIS scales are equal? The same question applies to other scales generally included in health surveys, such as the KIDSCREEN questionnaires (Ravens-Sieberer et al. 2001). For example, Robitail et al. (2007) found 5 items showing uniform DIF and another 10 items flagged with nonuniform DIF in a cross-cultural study of KIDSCREEN-27 across 13 countries. Given the wide usage of KIDSCREEN and similar scales in the health context and in other fields, it would be useful to know how much DIF can be tolerated without affecting statistical conclusions about real differences between groups.
While some recent studies in the testing field have examined how DIF affects statistical decisions (e.g., Li and Zumbo 2009; Oliveri et al. 2014), the previously mentioned issues have not been systematically studied in survey research. For example, Teresi et al. (2008) reviewed DIF concepts and methods for detecting DIF in patient-reported outcome measures. These authors noted the need to develop methods that would allow researchers to analyze DIF magnitude and its effects.
The existence of DIF in scales included in any survey questionnaire could be due to the fact that (a) DIF analyses are not often performed when survey questions and scale data are analyzed or (b) scales are considered a “valid instrument” for which no additional research is needed. The second reason often leads, in survey research, to the parallel version of the decision to keep DIF items within educational tests, considering content validity issues. For example, Varni et al. (2015) found a small number of DIF items in PROMIS Parent Proxy Report Scales that were retained with the advice to avoid their use when comparing younger and older children.
This study addresses these issues by aiming to answer the general research question “How does polytomous DIF affect observed scale score comparisons?” The objective is to identify the variables related to the presence of DIF (e.g., group sizes, number of DIF items, or DIF patterns) that can affect the type I error rate and effect size of the independent samples t-test on the observed total scale scores. To attain this goal, the present study follows the framework developed by Li and Zumbo (2009) for analyzing the impact of dichotomous DIF on statistical conclusions based on observed test scores. Li and Zumbo’s framework was adapted to cover the specific characteristics of polytomous items, such as DIF patterns that cannot be explored in dichotomous items, and more simulation conditions were added to increase the relevance of the study for survey research.
Method
A simulation study was conducted using R software version 2.15.2 (2012). Different conditions were manipulated in order to examine the DIF effect on between-group mean differences. In this study, data were simulated based on a polytomous item response theory (IRT) model.
Data Generation
Item responses were generated using the IRT graded response model (Samejima 1969). One of the assumptions of this model is that the response to a polytomous item is a continuous variable divided into m ordered categories, one for each response option. For each category, it is possible to specify an item category characteristic curve (ICCC), which describes the probability $P_{jk}(\theta)$ that a respondent with a particular level of “ability” (θ), that is, the underlying construct measured (e.g., opinion, attitude, self-reported health outcome, etc.), chooses response option k to item j as a function of θ. As the ICCCs for polytomous items have different forms for each response category, Samejima (1969) defined the “boundary” category characteristic curves (BCCCs), or boundary response functions, to represent the cumulative probability $P^{*}_{jk}(\theta)$ of a response above category k. For a polytomous item having, for instance, four response categories, there are three BCCCs. The logistic form of the BCCC is given as:

$$P^{*}_{jk}(\theta) = \frac{\exp[a_j(\theta - b_{jk})]}{1 + \exp[a_j(\theta - b_{jk})]}$$

Each of the BCCCs describes the probability that a respondent advances to a response category equal to or greater than k, where $a_j$ is the discrimination parameter for item j, which is constant across all categories for the same item, $b_{jk}$ is the location parameter for item j in the boundary category k, and θ is the ability parameter. An item j may be characterized by a vector of location parameters $(b_{j1}, \ldots, b_{j,m-1})$, one for each boundary between adjacent categories.
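To make the relation between boundary curves and category probabilities concrete, the following Python sketch computes both for a single five-category item. It is an illustration only, not the authors' R program: the function names, the parameter values, and the absence of a scaling constant in the logistic function are assumptions.

```python
import math


def boundary_prob(theta, a, b):
    """Cumulative probability P*_jk(theta) of passing the boundary located at b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probs(theta, a, locations):
    """Category probabilities for one item under the graded response model.

    `locations` holds the m - 1 boundary locations in increasing order.
    """
    # Pad with P* = 1 below the lowest boundary and P* = 0 above the highest.
    cumulative = [1.0] + [boundary_prob(theta, a, b) for b in locations] + [0.0]
    # Each category probability is the difference of two adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(locations) + 1)]


# Hypothetical item: discrimination 1.5 and four symmetric boundaries.
probs = category_probs(theta=0.0, a=1.5, locations=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

Because the category probabilities are differences of adjacent cumulative curves, they always sum to 1, which is a quick check on any implementation.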
In this study, the simulated test consisted of 20 items with five categories, where the
Data were generated in two steps: (a) latent trait parameters were randomly generated from a Gaussian distribution using R software, and (b) item response data were generated for the two groups using a computer program written by the authors. This program implemented the procedure described by Hambleton and Cook (1983), modified to take as input the latent trait values generated in this study and the defined item parameters.
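The two steps can be sketched in Python as follows. This is a generic way of drawing graded responses, comparing one uniform random number with the boundary curves; it is not the authors' program, and it may differ in detail from the Hambleton and Cook (1983) procedure. The item parameters, the seed, and the 0 to 4 scoring are hypothetical.

```python
import math
import random


def simulate_item_response(theta, a, locations, rng):
    """Draw one graded response (0 .. m-1) for a respondent with trait level theta."""
    u = rng.random()
    score = 0
    for b in locations:  # boundary locations in increasing order
        p_star = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        # The respondent passes every boundary whose cumulative probability exceeds u.
        if u < p_star:
            score += 1
    return score


def simulate_group(n, items, rng, mean=0.0, sd=1.0):
    """Step (a): draw theta from a Gaussian; step (b): generate item responses.

    Returns the observed total score of each simulated respondent.
    """
    totals = []
    for _ in range(n):
        theta = rng.gauss(mean, sd)
        totals.append(
            sum(simulate_item_response(theta, a, locs, rng) for a, locs in items)
        )
    return totals


rng = random.Random(2012)  # arbitrary seed for reproducibility
# Hypothetical scale: 20 identical five-category items.
items = [(1.5, [-1.5, -0.5, 0.5, 1.5])] * 20
reference = simulate_group(100, items, rng)
print(min(reference), max(reference))
```

Using a single uniform draw per item guarantees that the probability of scoring k or higher equals the corresponding boundary curve, which is the defining property of the model.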
For simplicity, the terms “reference” and “focal” groups are used below to describe the experimental conditions. Both terms are common in DIF testing literature and can also mean “majority” and “minority” group, or the mainstream group and possible cultural–linguistic groups in the target population, respectively.
Experimental Conditions
Four independent variables were manipulated in this study: sample size, amount of DIF, DIF patterns, and the percentage of DIF items in the test.
Sample size
Four sample sizes were used for the reference and focal groups: 100/100, 250/250, 500/500, and 1,000/1,000. These conditions reflect situations that are likely to occur in survey practice and involve small and large sample sizes for group comparison. It is also common to find unbalanced sample sizes across reference and focal groups. In these cases, equal sample sizes can be reached by extracting a random subsample from the majority group, which also allows the cross-validation of DIF results. Moreover, the mean “abilities” of the reference (μR) and focal (μF) groups were set to be equal (no impact condition); thus the mean difference was 0 (μd = μR − μF = 0).
Magnitude of DIF
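Two levels of differences in location parameters (s = .2 and s = .4) were manipulated, where s is the shift applied to the focal-group location parameters of the DIF items.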
Percentage of DIF items
The proportion of DIF items in the test was also manipulated. Four levels were considered: 0 percent, 10 percent (2 DIF items, items 1 and 2 in Table 1), 20 percent (4 DIF items, items 1–4 in Table 1), and 40 percent (8 DIF items, items 1–8 in Table 1). The 10 percent condition represents a plausible situation for scales included in survey questionnaires, given that it is common in achievement or aptitude tests to find between 10 percent and 15 percent of DIF items (Narayanan and Swaminathan 1994). The proportion can be even higher in questionnaire adaptations, with more than 20 percent of DIF items when comparing different linguistic or cultural versions of scales (Gierl, Gotzmann, and Boughton 2004). Following Li and Zumbo’s (2009) work, it was expected that a larger magnitude of DIF would increase the differences in item responses between groups and, hence, that the combined DIF effect across items might result in greater type I error rates. Likewise, the type I error rate might be affected by the proportion of DIF items in the scale: the more DIF items, the higher the type I error rate of the t-test. In this study, the magnitude of DIF in the test was indirectly manipulated through the percentage of DIF items and the DIF direction, “favoring” the focal or the reference group, depending on each specific condition.
Table 1. Parameters for DIF Items in Reference and Focal Groups When the Amount of DIF Was 0.4 and the Percentage of DIF Items Was 40 Percent.
Note: DIF = differential item functioning.
DIF patterns
Five DIF patterns were manipulated: constant, balanced, shift-low, shift-high, and constant-item/balanced-test (Su and Wang 2005).
Constant DIF pattern
The item parameters of the two groups depicted the magnitude of DIF: $b_{Fjk} = b_{Rjk} + s$, where s is a positive number. Thus, in this condition, all the location parameters within a DIF item were larger for the focal group than for the reference group by s, which takes values of .2 or .4, depending on the condition. Larger location parameters for the focal group mean that reference and focal group members with the same level of ability do not have equal probability of advancing in each response category, and that the difference between groups is the same across all boundary categories. The constant DIF pattern can provoke an amplification effect at the test level (Li and Zumbo 2009). Previous studies provide evidence on how constant DIF can come from an item-level property such as unfamiliar content, translation problems, or contextual differences between groups (Benítez and Padilla 2014; Penfield and Lam 2000).
Balanced DIF pattern
The location parameters of the two groups were $b_{Fj1} = b_{Rj1} + s$, $b_{Fj2} = b_{Rj2}$, $b_{Fj3} = b_{Rj3}$, and $b_{Fj4} = b_{Rj4} - s$. This pattern inserted DIF in the extreme boundary categories, leaving the intermediate boundary categories free of DIF. Overall, the magnitudes of DIF within items were balanced between groups, so that contaminations within items were cancelled out between groups.
Shift-low DIF pattern
The location parameters of the two groups were $b_{Fj1} = b_{Rj1} + s$ and had the same values for the other categories. That is, DIF was added only in the lowest boundary category.
Shift-high DIF pattern
The location parameters were set to $b_{Fj4} = b_{Rj4} + s$, while being equal for the other categories. This pattern is similar to the previous one but places DIF in the highest boundary category. There is evidence that when DIF affects isolated boundary categories (as in the balanced, shift-low, and shift-high DIF patterns), the response category itself is the factor responsible for DIF (Penfield, Alvarez, and Lee 2009). Such factors could be associated with translation problems with the category labels, different relations between item stems and response categories, and so on.
Constant-item/balanced-test pattern (cancellation)
DIF was constant within items as in the first pattern, but the effect of the items was cancelled at the test level by changing the DIF effect across items. Specifically, half of the DIF items had a positive s value, and the other half had a negative s value, so that the contaminations within the test were cancelled out between groups.
In terms of the type of DIF pattern, the balanced, shift-low, and shift-high patterns represent situations of nonconstant DIF: some boundaries between categories require more ability from the focal group than from the reference group in order to advance to the next response category, while for other boundaries the required ability is equal across groups. In the constant and constant-item/balanced-test patterns, by contrast, the shift is the same for every boundary within a DIF item.
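The five patterns can be summarized as transformations of the reference-group location parameters. The Python sketch below follows the definitions given above for a five-category item (four boundaries). The function name and the example values are hypothetical, and the rule that alternates the sign of s across DIF items in the cancellation pattern is an assumption: the text states only that half of the DIF items had a positive s and half a negative s.

```python
def focal_locations(reference, pattern, s, item_index=0):
    """Return focal-group boundary locations for one five-category DIF item."""
    b1, b2, b3, b4 = reference
    if pattern == "constant":
        return [b1 + s, b2 + s, b3 + s, b4 + s]
    if pattern == "balanced":
        return [b1 + s, b2, b3, b4 - s]
    if pattern == "shift-low":
        return [b1 + s, b2, b3, b4]
    if pattern == "shift-high":
        return [b1, b2, b3, b4 + s]
    if pattern == "cancellation":
        # Assumption: even-indexed DIF items get +s and odd-indexed items get -s,
        # so that the shifts cancel at the test level.
        sign = 1 if item_index % 2 == 0 else -1
        return [b + sign * s for b in reference]
    raise ValueError(f"unknown DIF pattern: {pattern}")


reference_item = [-1.5, -0.5, 0.5, 1.5]  # hypothetical reference-group locations
for name in ["constant", "balanced", "shift-low", "shift-high", "cancellation"]:
    print(name, focal_locations(reference_item, name, s=0.4))
```

Summing the shifts within an item gives 4s for the constant pattern, 0 for the balanced pattern, and s for the two shift patterns, which matches the ordering of effects one would expect at the total-score level.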
A total of 4 × 2 × 3 × 5 DIF conditions plus four non-DIF conditions were manipulated and 1,000 replications were made under each one. Table 1 shows the specific parameters for each simulated DIF pattern.
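As a check on the size of the design, the conditions can be enumerated with a short Python sketch. The variable names are hypothetical, and the percentage levels are taken as 10, 20, and 40 percent (2, 4, and 8 of the 20 items).

```python
from itertools import product

sample_sizes = [100, 250, 500, 1000]  # respondents per group
magnitudes = [0.2, 0.4]               # shift s in location parameters
percentages = [10, 20, 40]            # percentage of DIF items
patterns = ["constant", "balanced", "shift-low", "shift-high", "cancellation"]

# Fully crossed DIF conditions: 4 x 2 x 3 x 5.
dif_conditions = list(product(sample_sizes, magnitudes, percentages, patterns))
# One non-DIF baseline per sample size.
non_dif_conditions = [(n, 0.0, 0, None) for n in sample_sizes]

total_conditions = len(dif_conditions) + len(non_dif_conditions)
replications = 1000
print(len(dif_conditions), total_conditions, total_conditions * replications)
```

The crossing yields 120 DIF conditions and 124 conditions overall, so 1,000 replications per condition amount to 124,000 simulated data sets.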
As Table 1 shows, items for the reference group maintained the values of location parameters across conditions, while DIF manipulations were applied for item location parameters in the focal group, as explained above.
Data Analysis
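The effect of DIF on the observed total scores was analyzed by comparing the total scores of the reference and focal groups using independent samples t-tests. In addition, the standardized mean difference was calculated with Cohen’s d (Cohen 1988) to assess the effect size of the difference between means, as follows:

$$d = \frac{\bar{X}_R - \bar{X}_F}{S_{pooled}}$$

where $\bar{X}_R$ and $\bar{X}_F$ are the mean total scores of the reference and focal groups, and $S_{pooled}$ is the pooled standard deviation of the two groups.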
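For readers who want to reproduce the two statistics, a minimal Python sketch of Cohen's d with a pooled standard deviation and of the equal-variance independent samples t statistic is given below. The function names and the toy scores are hypothetical, and taking the difference as reference minus focal is an assumption about the sign convention.

```python
import math
from statistics import mean, variance


def pooled_sd(x, y):
    """Pooled standard deviation of two independent samples."""
    nx, ny = len(x), len(y)
    return math.sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))


def cohens_d(x, y):
    """Standardized mean difference between samples x and y."""
    return (mean(x) - mean(y)) / pooled_sd(x, y)


def t_statistic(x, y):
    """Independent samples t statistic assuming equal variances."""
    nx, ny = len(x), len(y)
    return (mean(x) - mean(y)) / (pooled_sd(x, y) * math.sqrt(1 / nx + 1 / ny))


reference_scores = [2, 4, 6, 8]  # toy total scores, reference group
focal_scores = [1, 3, 5, 7]      # toy total scores, focal group
print(round(cohens_d(reference_scores, focal_scores), 4))
print(round(t_statistic(reference_scores, focal_scores), 4))
```

The two statistics differ only by the factor sqrt(1/n_R + 1/n_F), which is why a fixed d produces larger t values, and hence more rejections, as the sample size grows.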
Two dependent variables were calculated: (a) the type I error rate, as the proportion of rejected t-tests over the 1,000 replications in each of the manipulated conditions, and (b) the standardized mean difference (d) effect size measure for each replication. For this effect size measure, the mean and standard deviation across replications were obtained in each condition. Moreover, 95 percent confidence intervals were calculated for the mean d in each condition across all replications (Kelley and Preacher 2012). For type I error rates, a value of .05 was used as the nominal significance level, which meant that the type I error rate was defined as the proportion of times that a true null hypothesis was falsely rejected at the .05 level. Following Bradley’s (1978) indications for interpreting type I error rates, the present study assumed a fairly stringent criterion for robustness, which requires the empirical type I error rate to lie between .045 and .055. However, the results obtained were also compared with the moderate criterion (type I error rate between .040 and .060) and the liberal criterion (type I error rate between .025 and .075).
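The empirical type I error rate and its classification under Bradley's (1978) criteria can be sketched in Python as follows. The function names, the labels returned, and the example p values are hypothetical; the interval bounds are those stated above.

```python
def type_i_error_rate(p_values, alpha=0.05):
    """Proportion of replications in which the t-test rejects at level alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def bradley_criterion(rate):
    """Return the strictest robustness criterion that the empirical rate satisfies."""
    if 0.045 <= rate <= 0.055:
        return "stringent"
    if 0.040 <= rate <= 0.060:
        return "moderate"
    if 0.025 <= rate <= 0.075:
        return "liberal"
    return "not robust"


# Toy p values from four replications; two fall below .05.
print(type_i_error_rate([0.01, 0.20, 0.04, 0.50]))
# Rates of the kind reported in the Results section.
for rate in [0.050, 0.058, 0.067, 0.102]:
    print(rate, bradley_criterion(rate))
```

Because the three intervals are nested, checking them from the narrowest to the widest returns the strictest criterion met by a given rate.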
Results
Type I Error Rate
Table 2 shows the type I error rates of the independent samples t-test of mean differences in each of the conditions. In the non-DIF condition, the type I error rates were, as expected, close to the nominal level across all sample sizes, ranging from .04 to .05.
Table 2. Type I Error Rate for Independent Samples t-Test across DIF Conditions.
Note: DIF = differential item functioning; RG = reference group; FG = focal group.
As Table 2 shows, the effect of the presence of DIF on mean differences depended on the manipulated DIF pattern as well as on the number of DIF items and the amount of DIF in these items. For example, when the manipulated pattern was “constant,” type I error rates were higher than .05 in all conditions, except when the sample size was 100 in both groups. In this condition, differences were observed only when 20 percent of items or more showed DIF, regardless of the amount of DIF.
In terms of the percentage of DIF items, type I error rates were above the nominal level of significance when the percentage of DIF items was equal to or greater than 20 percent, even with small sample sizes. More specifically, type I error rates ranged from .06 to .81 when the percentage of DIF items was 20 percent, coming closer to 1 when the sample size was larger (1,000/1,000). In general, type I error rates were closer to 1 when the percentage of DIF items was higher (40 percent) and the sample size was larger.
Results for the balanced DIF pattern (the pattern exhibiting DIF only in extreme categories) analysis showed that the type I error rates were lower than or equal to .05 for all conditions. Furthermore, when the simulated DIF pattern was “cancellation,” that is, constant across all item categories (within items) but balanced in test, the type I error rates were lower or slightly higher than the nominal level. This fact was evident with higher sample sizes (1,000/1,000) and with higher percentage of DIF items (40 percent). In any case, type I error rates were lower when a liberal criterion was considered.
Finally, when DIF was manipulated across categories, similarly to what Penfield et al. (2009) called constant nonpervasive DIF, and specifically when DIF was in low categories (shift-low pattern), it was found that type I error rates were between .040 and .058 when the amount of DIF was low for sample sizes from 100 to 500. In this situation, none of the type I error rates were inflated, using Bradley’s (1978) moderate criterion. However, when the sample size was increased to 1,000, the type I error rates were between .034 (lower than the moderate criterion) and .079 (higher than the liberal criterion). In addition, when the sample size was larger than 100 examinees per group, the type I error rate did not reach the moderate degree of robustness in the condition of 40 percent of DIF items, as the type I error rate was .067. Therefore, the type I error rate was inflated when higher percentages of DIF items and larger sample sizes were analyzed, reaching values over the liberal criterion.
On the other hand, when DIF was manipulated in high categories (shift-high pattern), the stringent criterion for the nominal level was not met in 16 of the 24 manipulated conditions, as type I error rates were either lower than .045 (in nine cases) or higher than .055 (in the remaining seven). Nevertheless, the liberal criterion was satisfied in all conditions except the one with the largest sample size (1,000/1,000), the largest amount of DIF (.4), and the highest percentage of DIF items (40 percent). In this case, the type I error rate was .102.
Effect Size Measure
Means and standard deviations of the standardized mean differences, as well as confidence intervals, are reported in Table 3. Regarding the measure of effect size for the total scores, d was largest in the constant DIF pattern (d̄ = .109), followed by the shift-low DIF pattern (d̄ = .028), while the lowest values were obtained in the balanced DIF pattern (d̄ = .001). The effect size of the differences between the reference and focal groups was not affected by sample size, although it was affected by the percentage of DIF items: the higher the number of DIF items in the test, the larger the standardized mean difference between groups. For the constant DIF pattern, d̄ was .039 when only 10 percent of the items had DIF, but it increased to .104 when 20 percent of the items in the test showed DIF, and to .185 for 40 percent. For the other DIF patterns, only small effects on d̄ values were found.
Table 3. Mean, Standard Deviation (SD), and Confidence Interval (CI) for d Effect Size Measure across Different Conditions.
Note: DIF = differential item functioning; RG = reference group; FG = focal group.
Variations in the standardized mean differences were related to the magnitude of DIF across DIF patterns. Higher standardized mean differences were observed when higher magnitudes of DIF were included in the constant, shift-low, and shift-high patterns, but these effects were not found for either the balanced or the cancellation pattern. The effect was the same for all sample sizes; as expected, increasing the sample size only reduced the standard deviations.
Discussion
The aim of the study was to investigate conditions under which the presence of DIF affected observed scale score comparisons. The objective behind the study was to find out under what circumstances the DIF effect can be disregarded, given how difficult removing DIF items from scales in surveys can be. In survey research, group comparisons are often made without conducting DIF analyses, and when DIF is analyzed, there is rarely the option of removing DIF items due to content validity reasons (Langer et al. 2008; Reeve et al. 2007; Teresi et al. 2007). Therefore, this study provides arguments based on simulated results for survey researchers to better interpret DIF effects on scales included in survey questionnaires.
The results of the study suggest that the presence of DIF items affects statistical conclusions based on comparing means, but this effect is linked to specific aspects. The most relevant of them is the DIF pattern, followed by the number of DIF items in the test and the magnitude of DIF in these items. Thus, when the manipulated DIF pattern was constant, the type I error rates exceeded the nominal level and an amplification effect occurred at the test level. However, when the sample size was small (100/100) and the percentage of DIF items was low (10 percent), type I error rates were in the expected range, that is, DIF did not affect the mean comparisons between groups. Therefore, the effect of DIF was moderated by small sample sizes and small percentages of DIF items. The situation changed when the percentage of DIF items was 20 percent or higher (often found when assessing DIF between different cultural or linguistic versions). In these cases, the presence of DIF would threaten the validity of group comparisons based on total scale scores. Therefore, in terms of the effect on group comparisons, the worst scenario seems to be when the pattern of DIF is constant across the location parameters and unidirectional (“favoring” only the reference group), the percentage of DIF items is 20 percent or higher, and the sample size is above 100 respondents per group.
As previously mentioned, results from the present study are in line with those reported by Li and Zumbo (2009) for tests consisting of dichotomous items. These authors found that, when the manipulated DIF was unidirectional (amplification effects), the inflation of the type I error rates increased as the number of DIF items and the sample size increased. The results obtained in the present study also indicated that the effect size differences increased as the percentage of DIF items in the test increased, which means that the observed mean differences were spuriously inflated by the presence of more DIF contamination in the test. Finally, the amount of DIF and the percentage of DIF items provoke undesired differences across groups, but clear criteria cannot be established by considering the two variables separately, as both seem to affect the scale jointly. Therefore, decisions about using the DIF items should take into account both content validity considerations and equivalence issues when comparing different linguistic and cultural groups.
For survey researchers and practitioners, some guidelines can be proposed based on the main results of the simulation study: (a) before making comparisons across groups, DIF analysis should be performed together with empirical item characteristic curve representations to describe possible DIF patterns; (b) if the DIF patterns found are constant, the item stems should be examined to identify unfamiliar contents or stimuli, translation problems, and so on; and (c) when the DIF pattern found is not constant, possible differences in the understanding of the category meaning (labels, numbers, etc.) should be investigated.
The present study has some limitations, mainly related to the conditions examined. Future research should address the issues that remain unclear, such as the effect of test length on the DIF effect, unbalanced sample sizes, impact between groups (i.e., when mean group differences are higher than zero, μd > 0), or the extension of results to other magnitudes of DIF and sample sizes (i.e., determining the approximate sample size after which the DIF effect is clear). Despite these limitations, this study provides a useful framework for understanding the conditions that increase the impact of DIF, which improves our knowledge of how DIF works.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partially funded by the Andalusia Regional Government under the Excellent Research Fund (Project n° SEJ-6569) and by Seneca Foundation, Agency for Science and Technology in the Region of Murcia under Research Projects in the Humanities and Social Sciences Fund (Project 11917/PHCS/09).
