Abstract
The purpose of this study was to apply two types of Differential Item Functioning (DIF), net and global DIF, as well as the framework of Differential Step Functioning (DSF) to real testing data to investigate measurement invariance related to test language. Data from the Program for International Student Assessment (PISA)–2006 polytomously scored science items for four countries with different test languages were used, where French and English represented the reference languages. The findings showed that many items exhibited both types of DIF, although, in most cases, the results were inconsistent for the two source languages. In addition, net and global DIF tests did not always yield the same results depending on the DSF effect pattern. Furthermore, the DSF analysis provided valuable information over and above that provided by the net DIF analysis concerning the nature and the location of the DIF effect.
In polytomous items (e.g., performances that cannot be scored as simply correct or incorrect), differential item functioning (DIF) is present when individuals belonging to two different groups, but having the same level of ability, have differing probabilities of obtaining each score level of the polytomous response variable (Potenza & Dorans, 1995). The increasing use of polytomous item formats has led to the development of various methods for assessing DIF in polytomous items (Penfield & Camilli, 2007). Penfield, Alvarez, and Lee (2009) introduced two conceptions of DIF in polytomous items, net DIF and global DIF. This distinction between net and global DIF is unique to polytomous items, because a polytomous item contains three or more score levels and all score levels are used in the estimation of ability.
Net and Global DIF
Global DIF statistics are based on the unsigned conditional between-group difference across all score levels of the polytomous item (Penfield, Alvarez, & Lee, 2009). Examples of global polytomous DIF statistics include the generalized Mantel–Haenszel statistic (Somes, 1986), item response theory (IRT) likelihood ratio tests (Kim & Cohen, 1998), polytomous logistic regression approaches (French & Miller, 1996), and the simultaneous step-level (SSL) test of DIF (Penfield, 2007).
In contrast to global DIF, net polytomous DIF statistics are based on the signed conditional between-group difference across all score levels. Because the between-group differences can vary in sign across the score levels of the item (e.g., DIF favoring the reference group for some score levels but favoring the focal group for others), the signed differences can cancel, yielding a net DIF effect of 0 (or near 0) despite the presence of sizable effects within particular score levels (Penfield, 2010). Numerous statistics for evaluating net polytomous DIF exist, including Mantel’s chi-square (Mantel, 1963), the standardized mean difference index (Dorans & Schmitt, 1993), polytomous SIBTEST (Chang, Mazzeo, & Roussos, 1996), and the cumulative common log-odds ratio (Penfield & Algina, 2003).
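The contrast between signed and unsigned aggregation can be illustrated with a toy calculation. The sketch below is only an illustration: the step-level values are hypothetical, and plain sums stand in for the weighted, standardized statistics cited above.

```python
# Toy illustration of net (signed) versus global (unsigned) aggregation.
# The step-level effects are hypothetical; positive values favor the
# reference group, negative values favor the focal group.

def net_effect(step_effects):
    """Signed aggregate: effects of opposite sign can cancel."""
    return sum(step_effects)

def global_effect(step_effects):
    """Unsigned aggregate: every nonzero step effect contributes."""
    return sum(abs(e) for e in step_effects)

same_sign = [0.6, 0.6]        # both steps favor the reference group
opposite_sign = [0.6, -0.6]   # the steps favor different groups

for label, effects in [("same sign", same_sign), ("opposite sign", opposite_sign)]:
    print(label, "net:", net_effect(effects), "global:", global_effect(effects))
```

In the opposite-sign case the net aggregate is zero although each step shows a sizable effect, which is the situation in which a net test can miss DIF that a global test detects.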
One practical implication of the distinction between these two forms of DIF is that net and global DIF represent different violations of measurement invariance; therefore, it is crucial to identify clearly the form of measurement invariance of interest for a particular assessment. For example, a particular testing program may use the standard of no net DIF as a reasonable threshold for invariance to hold. Other testing programs, however, may be concerned with detecting a lack of measurement invariance occurring anywhere within the item and use the standard of no global DIF as necessary evidence of measurement invariance (Penfield, 2010).
Differential Step Functioning (DSF)
Penfield (2007) stated that the current standard practice in conducting DIF analyses with polytomous items is to conduct a single item-level test that examines invariance across all score levels simultaneously; such tests are thus considered omnibus tests of DIF. However, the DIF effect associated with each score level can be investigated using the framework of differential step functioning (DSF).
Using the framework of DSF in examining between-group measurement equivalence in polytomous items has several advantages over the omnibus measures of DIF (Penfield, 2007). First, tests of DSF can be more powerful than net tests of DIF when the magnitude and/or the sign of the DSF effect varies across the steps underlying the polytomous response variable. A second advantage of the DSF framework is that it allows the DIF analyst to determine precisely which score levels (or steps) are responsible for an observed DIF effect.
The framework of DSF is based on the concept of the step function. Several forms of the step functions were proposed, and two of these forms are the cumulative step function form and the adjacent-categories form. For a polytomous item having r (r > 2) ordinal score levels, the probability of observing each score level, given a particular level of ability, is determined by a set of J = r − 1 step functions. Under the cumulative approach, each of the J step functions specifies the probability of successfully advancing (i.e., stepping) from a lower score level to a higher score level as a function of ability. Given this description of the concept of DSF, net DIF can be conceptualized as the aggregated DSF effects across the J steps (Penfield, Gattamorta, & Childs, 2009). For investigating DSF, three general approaches have been described in the literature: an IRT approach (Cohen, Kim, & Baker, 1993), a logistic regression approach (French & Miller, 1996), and an odds ratio approach (Penfield, 2007). One of the advantages of the odds ratio approach over the other two approaches is that it is not hindered by the assumption of model fit (Penfield, Alvarez, & Lee, 2009).
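The two step-function forms can be made concrete by showing how a single observed score is recoded into J = r − 1 binary step outcomes. This is a minimal sketch that assumes the usual definitions: the cumulative form contrasts scores at or above level j with scores below it, and the adjacent-categories form contrasts level j with level j − 1 only. The function names are illustrative.

```python
def cumulative_steps(score, r):
    """Cumulative form: step j succeeds when score >= j, for j = 1..r-1."""
    return [1 if score >= j else 0 for j in range(1, r)]

def adjacent_steps(score, r):
    """Adjacent-categories form: step j compares level j with level j-1.
    Responses outside those two levels do not inform the step (None)."""
    outcomes = []
    for j in range(1, r):
        if score == j:
            outcomes.append(1)
        elif score == j - 1:
            outcomes.append(0)
        else:
            outcomes.append(None)
    return outcomes

# A three-level item (scores 0, 1, 2) has J = 2 steps.
for score in (0, 1, 2):
    print(score, cumulative_steps(score, 3), adjacent_steps(score, 3))
```

Under the cumulative form every examinee contributes to every step, whereas under the adjacent-categories form each step uses only the examinees in the two neighboring score levels.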
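Under the odds ratio approach, the null hypothesis of no DSF at the jth step can be tested using the following test statistic:
$$z_j = \frac{\hat{\lambda}_j}{SE(\hat{\lambda}_j)}$$
This test statistic is distributed approximately as standard normal, where $\hat{\lambda}_j$ is the estimated step-level common log-odds ratio for the jth step and $SE(\hat{\lambda}_j)$ is its estimated standard error.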
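A step-level statistic of this kind could be computed as sketched below. The sketch rests on assumptions not stated in the text: the step outcome has already been dichotomized, the common log-odds ratio is estimated with the Mantel-Haenszel estimator across ability strata, and its standard error uses the Robins-Breslow-Greenland variance. It is not the DIFAS implementation, and the function name and input layout are illustrative.

```python
import math

def step_log_odds_z(tables):
    """Each table is (a, b, c, d) for one ability stratum:
    a = reference successes, b = reference failures,
    c = focal successes,     d = focal failures.
    Returns (log-odds ratio, standard error, z).
    Assumes at least one stratum has a*d > 0 and one has b*c > 0."""
    R = S = sum_PR = sum_PS_QR = sum_QS = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum carries no information
        r_k, s_k = a * d / n, b * c / n
        p_k, q_k = (a + d) / n, (b + c) / n
        R += r_k
        S += s_k
        sum_PR += p_k * r_k
        sum_PS_QR += p_k * s_k + q_k * r_k
        sum_QS += q_k * s_k
    log_or = math.log(R / S)  # Mantel-Haenszel common log-odds ratio
    var = (sum_PR / (2 * R ** 2)
           + sum_PS_QR / (2 * R * S)
           + sum_QS / (2 * S ** 2))  # Robins-Breslow-Greenland variance
    se = math.sqrt(var)
    return log_or, se, log_or / se

# Hypothetical counts for one step in three ability strata.
example = [(20, 10, 10, 20), (30, 10, 20, 20), (40, 5, 30, 15)]
print(step_log_odds_z(example))
```

With a single stratum the variance reduces to the familiar 1/a + 1/b + 1/c + 1/d, which gives a quick check on the arithmetic.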
Test Language DIF
The Program for International Student Assessment (PISA) has been translated (and/or adapted) into more than 40 different test languages (Organization for Economic Co-Operation and Development [OECD], 2009). The typical PISA procedures include the development of two parallel source versions (in English and French), with a recommendation that each country develop two separate versions in its language of instruction, one from each source language, produced by two independent translators and then reconciled into a final national version by a third independent translator (Grisay, 2003).
Research studies (Arim & Ercikan, 2005; Ercikan, 1999; Le, 2006; Yildirim & Berberoglu, 2009), however, have indicated that for PISA and other international tests, DIF exists between different versions of the same test when it is administered in multiple languages. Le (2006) conducted a comprehensive exploration of DIF in which English was compared with five other languages in the PISA-2006 field trial data. The IRT-based results indicated the presence of DIF in PISA items across different item formats, with the English group being advantaged over the other language groups. In addition, Yildirim and Berberoglu (2009) found that 24% of PISA-2003 math items in one booklet displayed DIF under three different DIF detection methods when English- and Turkish-speaking examinees were compared.
The present study was motivated by three observations: (a) the lack of published demonstrations of the net versus global distinction of DIF in polytomous items; (b) the lack of published documentation of the DSF framework; and (c) the limited number of published studies exploring language DIF in translated test forms, an issue of growing importance as more and more assessments are translated. Therefore, the purpose of the present study was to examine net and global DIF in the PISA-2006 polytomously scored science items when the two source languages, French and English, were compared with two other languages. In addition, the framework of DSF was used to enhance the analysis of measurement equivalence over that provided solely by traditional omnibus measures of DIF. The study is timely given the growing use of polytomous item formats across a range of measurement contexts (innovative item types, automated scoring, and so on). In addition, net DIF, global DIF, and DSF appear to be important components in advancing the validity of these items.
Method
Data
PISA is conducted by the OECD in selected countries to measure how well 15-year-old students are prepared to meet the challenges of the future. In each assessment, one of the three areas (science, reading, and mathematics) is chosen as the major domain and given greater emphasis; the remaining two areas are assessed less thoroughly. In 2000, the major domain was reading; in 2003, it was mathematics; and in 2006, it was science (OECD, 2009). In PISA-2006, a total of six polytomously scored science items, with three possible scores, 0, 1, and 2, were presented to students in 10 (out of 13) test booklets. Each test item appeared in four of the test booklets, in a different position in each booklet.
The data used in the present study came from the PISA-2006 polytomously scored science items for four countries: France (n = 4,716), Denmark (n = 4,532), the United States (n = 5,611), and the Slovak Republic (n = 4,731). For the purposes of the present study, these four countries were grouped into two sets, such that each set contained a country with one of the source languages as its test language. The first set contained France and Denmark, where French was the source language, while the second set contained the United States and the Slovak Republic, where English was the source language. Denmark and the Slovak Republic were selected so that the mean test scores of the two countries within each set were very similar. France, with a mean score of 495, and Denmark, with a mean score of 496, were classified at the average. However, the United States, with a mean score of 489, and the Slovak Republic, with a mean score of 488, were classified below the average. It should be noted that the comparison of source and target languages was not fully crossed (i.e., no attempt was made to examine French to Slovak or English to Danish).
Overview of the Analyses
The null hypothesis of no global DIF was tested through testing the null hypothesis of no DSF at each of the J steps. If the null hypothesis of no DSF was retained for all J steps, then the null hypothesis of no global DIF was retained. If, however, the null hypothesis of no DSF was rejected for one or more steps, then the null hypothesis of no global DIF was rejected. This approach is referred to as the simultaneous step-level (SSL) test of DIF (Penfield, 2007).
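The decision rule just described can be sketched in a few lines. The sketch assumes that the family-wise error rate is controlled with a Bonferroni adjustment across the J steps and that each step is tested with a two-sided z statistic; the adjustment method and the function name are assumptions made for illustration.

```python
from statistics import NormalDist

def ssl_test(step_z, familywise_alpha=0.05):
    """Reject the null hypothesis of no global DIF when any step-level
    z statistic exceeds the Bonferroni-adjusted two-sided critical value.
    Returns (reject_global, per_step_flags, critical_value)."""
    per_step_alpha = familywise_alpha / len(step_z)  # Bonferroni (assumed)
    critical = NormalDist().inv_cdf(1 - per_step_alpha / 2)
    flags = [abs(z) > critical for z in step_z]
    return any(flags), flags, critical

# Hypothetical z statistics for the two steps of a three-level item.
print(ssl_test([2.5, 0.3]))   # one step exceeds the critical value
print(ssl_test([2.0, 2.0]))   # neither step exceeds it
```

With J = 2 steps the per-step critical value is about 2.24, so two moderate effects of the same sign can each fall short of it even when their aggregate would be detected by a net test.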
For polytomous items, Penfield, Alvarez, and Lee (2009) recommended using at least one global DIF test and one net DIF test for a comprehensive DIF analysis. Therefore, this study used the SSL test as a global DIF test and Mantel’s chi-square test as a net DIF test. If the null hypothesis of no DIF was retained for both global and net DIF, it was concluded that measurement equivalence exists. However, if the null hypothesis of no DIF was rejected for either global or net DIF, a thorough DSF analysis was conducted where the step functions were defined using the cumulative approach. Gattamorta and Penfield (2012) recommended the use of the cumulative step function over the adjacent-categories step function due to its stability.
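For readers unfamiliar with the net DIF test named above, the following sketch computes Mantel's (1963) chi-square from score-level counts within ability strata. The input layout, function name, and example counts are hypothetical, and this is not the DIFAS code.

```python
from statistics import NormalDist

def mantel_chi_square(strata, scores=(0, 1, 2)):
    """strata: list of (ref_counts, focal_counts); each is a sequence of
    counts per score level within one ability stratum.
    Returns Mantel's chi-square (one degree of freedom)."""
    t_sum = e_sum = v_sum = 0.0
    for ref, foc in strata:
        n_r, n_f = sum(ref), sum(foc)
        n = n_r + n_f
        if n < 2 or n_r == 0 or n_f == 0:
            continue  # stratum cannot contribute to the comparison
        m = [r + f for r, f in zip(ref, foc)]            # score-level margins
        s1 = sum(y * mj for y, mj in zip(scores, m))
        s2 = sum(y * y * mj for y, mj in zip(scores, m))
        t_sum += sum(y * r for y, r in zip(scores, ref))  # reference score sum
        e_sum += n_r * s1 / n                             # its expectation
        v_sum += n_r * n_f * (n * s2 - s1 ** 2) / (n ** 2 * (n - 1))
    return (t_sum - e_sum) ** 2 / v_sum

# Critical value for alpha = .05 with one degree of freedom (about 3.84).
critical = NormalDist().inv_cdf(0.975) ** 2

strata = [([5, 10, 15], [15, 10, 5])]   # hypothetical counts
chi2 = mantel_chi_square(strata)
print(round(chi2, 2), chi2 > critical)
```

Because the statistic is built from the signed difference between the observed and expected score sums, step effects of opposite sign can offset each other, which is why it serves as a net rather than a global test.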
To judge the magnitude of the DSF effect at each of the J steps for each item,
Moreover, the taxonomy of DSF forms proposed by Penfield, Alvarez, and Lee (2009) was adopted in the present study to identify and interpret the causes of DIF. This taxonomy categorizes DSF along two dimensions. The first dimension distinguishes between pervasive and non-pervasive DSF. Pervasive DSF corresponds to the situation where all J steps display substantial DSF effects (medium or large in magnitude), suggesting that the factor causing the DIF exerts its influence at the item level. In contrast, the DSF form is labeled non-pervasive when some steps, but not all, display a substantial DSF effect, suggesting that the factor causing the DIF exerts its influence at the level of a particular step.
The second dimension of the DSF taxonomy concerns the consistency of the DSF effect across the affected steps; it distinguishes between constant, convergent, and divergent DSF forms. Constant DSF occurs when the steps displaying substantial DSF effects have effects that are relatively equal in magnitude and sign. Convergent DSF occurs when all steps displaying a substantial DSF effect have the same sign but not the same magnitude. If different steps display DSF effects that have different signs, the divergent DSF form is present. When there are only two steps associated with each item and the DSF form is non-pervasive, it is not possible to distinguish between constant, convergent, and divergent DSF effects. Therefore, in the present study, this DSF form is referred to as “potential non-pervasive,” a term used by Penfield, Alvarez, and Lee (2009).
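The two dimensions of the taxonomy can be expressed as a small classification routine. The sketch is hedged in two respects: the cutoff for a "substantial" effect (`substantial`) and the tolerance for "relatively equal" magnitudes (`equal_tol`) are placeholder values chosen for illustration, not values taken from the text, and sign consistency is evaluated over the substantial steps only.

```python
def classify_dsf(step_effects, substantial=0.4, equal_tol=0.1):
    """Label a set of step-level log-odds ratios using the two taxonomy
    dimensions. Both threshold arguments are hypothetical placeholders."""
    flagged = [e for e in step_effects if abs(e) >= substantial]
    if not flagged:
        return "no substantial DSF"
    pervasive = len(flagged) == len(step_effects)
    if not pervasive and len(step_effects) == 2:
        # With two steps, the second dimension cannot be evaluated.
        return "potential non-pervasive"
    scope = "pervasive" if pervasive else "non-pervasive"
    magnitudes = [abs(e) for e in flagged]
    if len({e > 0 for e in flagged}) > 1:
        form = "divergent"        # substantial effects differ in sign
    elif max(magnitudes) - min(magnitudes) <= equal_tol:
        form = "constant"         # same sign, similar magnitude
    else:
        form = "convergent"       # same sign, different magnitude
    return scope + " " + form

for effects in ([0.70, 0.65], [0.90, 0.45], [0.70, -0.70], [0.70, 0.10]):
    print(effects, "->", classify_dsf(effects))
```

In practice the cutoffs would be replaced by the effect-size guidelines adopted for the analysis.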
PISA uses the imputation methodology usually referred to as plausible values to indicate students’ proficiency levels based on the observed item responses. The plausible values are random draws from the marginal posterior of the latent distribution. Usually, five plausible values are allocated to each student on each performance scale of PISA. It is recommended that statistical analyses be performed independently on each of these five plausible values and that the results be aggregated to obtain the final estimates of the statistics and their respective standard errors (OECD, 2009). In the present study, only one run of DIF and DSF analyses was conducted because there were only six polytomous items (out of 108 items). Even if all six items were flagged as having DIF, such a small number of affected items would be unlikely to have a notable effect on the results. In addition, Magis and Facon (2013) showed that item purification is not always useful and that a single run of the DIF method may return equally suitable results.
To summarize, the following analyses were conducted using Penfield’s (2005) DIFAS computer program (using the plausible values as the stratifying variables and considering the French and the American groups as the reference groups):
A test of the null hypothesis of no net DIF was conducted using Mantel’s chi-square test with a Type I error rate of .05. This statistic is distributed as chi square with one degree of freedom. Therefore, the critical value of this statistic is 3.84.
DSF analysis was conducted using the estimated step-level common log-odds ratio
A test of no global DIF was conducted using the SSL test with a family-wise Type I error rate of .05. SSL tests were conducted using
The aforementioned analyses were conducted five times and then averaged.
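Aggregating results across the five plausible values can be sketched as follows. The point estimate is the simple average described above; the combined standard error follows the standard multiple-imputation rule (average sampling variance plus inflated between-imputation variance), which is an assumption here because the text states only that the results were averaged. The input values are hypothetical.

```python
import math
from statistics import mean, variance

def combine_plausible_values(estimates, standard_errors):
    """Pool one statistic estimated separately with each plausible value.
    Returns (pooled estimate, combined standard error)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # sampling variance
    between = variance(estimates)                       # imputation variance
    total_se = math.sqrt(within + (1 + 1 / m) * between)
    return pooled, total_se

# Hypothetical step-level log-odds ratios from five plausible-value runs.
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
standard_errors = [0.10, 0.11, 0.10, 0.09, 0.10]
print(combine_plausible_values(estimates, standard_errors))
```

The combined standard error is never smaller than the average within-run standard error, because variation between the five runs adds uncertainty.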
Results
To achieve the purposes of this study, the six polytomously scored science items in PISA-2006 were analyzed, and the results of these analyses are displayed in Table 1.
Table 1. Results of Net DIF, Global DIF, and DSF Analyses for PISA-2006 Polytomously Scored Science Items for the Two Source Languages and Across Four Different Booklets.
Note. Standard errors are reported in brackets, and the numbers preceding brackets correspond to the estimated cumulative step-level log-odds ratio
Results of Net and Global DIF Analyses
Only one item (Item S447), out of six items, was free of DIF. This item displayed non-significant tests of both net and global DIF, and small DSF effects across all booklets. However, for each source language, French and English, Table 1 shows that four (67%) polytomous items were flagged as having DIF in the form of net DIF and/or global DIF across at least half of the test booklets. For French as the source language, Items S114, S498, and S519 were flagged as having both net and global DIF across all test booklets. The remaining item (Item S465) displayed significant tests of net and global DIF in two (50%) booklets (Booklets 5 and 11). For English as the source language, Items S114, S465, S485, and S519 were flagged as having net DIF and/or global DIF. However, none of these items displayed both types of DIF across all booklets.
A comparison of the net and global DIF findings for the two source languages showed that more administrations of PISA polytomous items were flagged as having both types of DIF when French was the source language than when English was the source language. Because each of these four items appeared in four different booklets, there were 16 different administrations of these items. Fourteen administrations were flagged as having net DIF where French was the source language, compared with 7 administrations where English was the source language. Similarly, 14 administrations were flagged as having global DIF where French was the source language, compared with 9 administrations for the other source language.
Results of DSF Analysis
For French as the source language, five (83%) items (Items S114, S465, S485, S498, and S519) displayed different DSF effects in almost all booklets. Three items (Items S114, S498, and S519) displayed substantial DSF effects that were either medium or large across the two steps (pervasive form of DSF) in all booklets, indicating that a potentially biasing factor may exist at the item level. Item S465 displayed substantial DSF effects in just two booklets, a pervasive form in Booklet 5 and a potential non-pervasive form in Booklet 11. However, Item S485 showed non-substantial DSF effects in two booklets and a potential non-pervasive form of DSF in the other two booklets, indicating that the potentially biasing factor resides in the lowest score category in Booklet 10 and in the highest score category in Booklet 12.
For English as the source language, four (67%) items (Items S114, S465, S485, and S519) displayed different DSF effects in almost all booklets. For Item S114, DSF effects were of the pervasive form in all but one booklet (Booklet 1), indicating that a potentially biasing factor may exist at the item level. However, for Item S485, the potential non-pervasive form of DSF resulted in all booklets, suggesting that a potentially biasing factor may exist and that it resides in the highest score category. For one of the two remaining items (Item S465), the potential non-pervasive form of DSF resulted in two of the booklets (Booklets 4 and 5), while the item appeared to be free of DSF in the other two booklets (Booklets 11 and 12). By contrast, Item S519 displayed pervasive DSF effects in two booklets (Booklets 2 and 3), but a potential non-pervasive form in Booklet 9 and no DSF in Booklet 5.
It is worth noting that in just one item (Item S519), the DSF effects were of the same sign (positive) for both source languages across all booklets. This indicates that in both sets of countries, this item gave a relative advantage to the source-language group over the target-language group, after conditioning on ability, whichever test booklet was administered. However, for the remaining items, the DSF effects for the two source languages had different signs. The DSF effects for French as the source language were positive across four items (Items S114, S465, S485, and S498), indicating that French examinees were advantaged by those items over Danish examinees with the same proficiency levels. A different finding emerged for the other source language, English; the DSF effects for three items (Items S114, S465, and S485) were negative, indicating that the Slovak group was given a relative advantage over the English group after conditioning on ability. However, Item S498 yielded negligible DSF effects and was therefore considered free of DSF.
Conclusion
Five major conclusions emerged from analyzing PISA-2006 polytomously scored science items for test-language-related net DIF, global DIF, and DSF. First, test-language DIF (net and global DIF) is present in PISA-2006 polytomously scored items. For each source language, 67% of the items displayed one or both types of DIF. The results resemble findings in previous studies that showed the existence of test-language-related DIF in international tests such as PISA (Arim & Ercikan, 2005; Ercikan, 1999; Le, 2006; Yildirim & Berberoglu, 2009).
Second, DIF results differed between the two source languages. The hypotheses of no net DIF and no global DIF were rejected in more administrations of PISA items where French was the source language than where English was the source language. French examinees were given a relative advantage over Danish examinees, after conditioning on ability, whereas the English group was disadvantaged by the items as compared with the other group of examinees (Slovak). The differences may be due to problems in translation (Le, 2006). However, other factors may affect item equivalence across language versions of PISA, such as cultural and curriculum differences between the groups (Le, 2009; Sireci & Berberoglu, 2000; van de Vijver & Tanzer, 2004).
Third, for both source languages and for each item, the findings showed that under the condition of no DSF for each of the two steps, no DIF of either type exists, which agrees with what would be theoretically expected (Penfield, Gattamorta, & Childs, 2009). Examination of such items that perform well (e.g., Items S447 and S498) might prove helpful to item developers.
For items that were flagged for DIF, however, the analysis of DSF provided valuable information concerning the nature of the DIF effect (i.e., whether the DIF is an item-level effect or an effect isolated to specific score levels) and the location of the DIF effect (i.e., precisely which score levels manifest the DIF effect). For some items that were flagged for DIF (e.g., Items S114, S498, and S519 for France vs. Denmark), DSF effects were of the pervasive form, indicating that a biasing factor may exist at the item level, located in the content of the item stem or in the general properties of the item itself. This implies a need for a thorough revision of item content and one-to-one testing to pinpoint the potential causes of DIF. For the remaining items that yielded the potential non-pervasive form of DSF, the DSF analysis showed that a potentially biasing factor may reside in one of the steps, indicating that the biasing factor may well be in the scoring criteria for that step. For example, potential non-pervasive DSF observed in just the first step suggests that the cause of the DIF likely resides in the second lowest score level because one group of examinees is experiencing relative difficulty in making the transition from the lowest score level into a higher score level. This suggests that the scoring criteria should be revised for those items and for the affected steps to clarify the potential causes of DIF and DSF.
Fourth, net and global DIF tests did not always yield the same result. This pattern is what would be theoretically expected: Results can differ depending on whether a net or a global test is used and on the DSF effect pattern. Specifically, when there was significant net DIF but not significant global DIF (e.g., S114, the United States vs. the Slovak Republic), the DSF effects were pervasive convergent, and thus the net DIF test was more powerful than the global test because all DSF effects were in the same direction. However, when there was significant global DIF but not significant net DIF (e.g., S465, the United States vs. the Slovak Republic), the DSF effects were potential non-pervasive, and thus the global DIF test was more powerful. The large DSF effect in one, and only one, step of these items might be diluted by the negligible (near zero) DSF effect of the other step, yielding a relatively small aggregated (net) DIF.
Fifth, in this study, PISA polytomous items had three score levels resulting in just two step functions, which precludes distinguishing between different forms of non-pervasive DSF, that is, non-pervasive constant, non-pervasive convergent, and non-pervasive divergent. Therefore, when there are only two steps and the DSF is non-pervasive, the distinction between constant, convergent, and divergent forms of DSF is irrelevant.
It is suggested that polytomous items with more than three score levels (r ≥ 4, i.e., J ≥ 3) be analyzed so that non-pervasive DSF can be classified as constant, convergent, or divergent. Moreover, this study compared each source language with just one other test language; it would be beneficial to replicate the analyses with data from other PISA participating countries encompassing more divergent test languages, and/or other countries that are statistically different from the average. Finally, it would be interesting to investigate DSF using the other two approaches available in the literature, the IRT approach and the logistic regression approach, and to compare the results with those yielded by the odds ratio approach.
Footnotes
Acknowledgements
The authors would like to thank Professor Randall Penfield at the University of North Carolina at Greensboro and Professor Margret Wu at Victoria University, and the two anonymous reviewers for their valuable comments, suggestions, and edits.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
