Abstract
In this study, differential item functioning (DIF) trends were examined for English language learners (ELLs) versus non-ELL students in third and tenth grades on a large-scale reading assessment. To facilitate the analyses, a meta-analytic DIF technique was employed. The results revealed that items requiring knowledge of words and phrases in context favored non-ELLs in grade 3, whereas items requiring evaluation skills favored ELLs in grade 10. However, inconsistent patterns were found across gender and ethnicity. Educational implications are discussed.
Reading is a foundational, necessary skill for academic achievement in schools and for success in civic life and careers in modern society. Thus building foundational skill in reading, specifically reading comprehension, is of high importance to all students, including those learning to read English as a second language. Despite its recognized importance, we have a limited understanding about how students’ performance on different types of reading-comprehension questions varies as a function of their language-learning status. In the present study, we extend previous studies on performance patterns on different types of reading-comprehension questions for English-only students (those who speak English as a first and primary language) and English language learners (students who have English as a second language; ELLs) using meta-analytic differential-item-functioning (DIF) analyses.
Reading comprehension assessment, particularly in standardized high-stakes assessments, is challenging because items must be appropriate for readers from diverse racial, ethnic, cultural, and linguistic backgrounds (Abedi, Bailey, Butler, Castellon-Wellington, Leon, & Mirocha, 2005). Previous studies have shown that ELLs typically achieve skills in word reading equivalent to those of English-only students, provided that they have been exposed to systematic and explicit instruction in decoding (Lesaux & Siegel, 2003). Despite this comparable achievement in word reading, however, ELLs usually lag behind non-ELLs in text-level reading skills (e.g., Al Otaiba, Petscher, Pappamihiel, Williams, Dyrlund, & Connor, 2009; Kim, Y. -S, 2012; Lesaux, Geva, Koda, Siegel, & Shanahan, 2008) including reading comprehension.
Even on assessments that are not high stakes, ELLs consistently underperform compared to their English-speaking peers. In the national sample of the 2003 National Assessment of Educational Progress (NAEP), fourth-grade ELLs scored 33 points below non-ELLs in reading, and in Florida the gap was 22 points. By 2011 these gaps were 36 points nationally and 32 points within the state of Florida, showing no narrowing over time in the underperformance of ELLs. This gap is larger at higher grades, though comparisons must be made cautiously because the twelfth-grade and fourth-grade scales are not linked directly (based on reports created using the NAEP Data Explorer at http://nces.ed.gov/nationsreportcard/naepdata/dataset.aspx; National Center for Education Statistics, 2003, 2011).
Several factors might contribute to this gap in reading comprehension, such as cultural background (e.g., Briere, 1968), background knowledge (e.g., Erickson & Molloy, 1983; Hale, 1988), and native language (e.g., Alderman & Holland, 1981; Farhady, 1982; Oltman, Stricker, & Barrows, 1988; Politzer & McGroarty, 1985; Spurling & Ilyin, 1985; Swinton & Powers, 1980). Most notably, it has been suggested that students’ limited linguistic knowledge in L2 (their second language, i.e., English) is likely to be an important cause of ELLs’ reading-comprehension lag (Kim, Y. -S, 2012; Lesaux et al., 2008; Mancilla-Martinez & Lesaux, 2010; Nakamoto, Lindsey, & Manis, 2008). Previous studies have found that minor changes of wording in tests can affect the test scores of ELLs differently than those of non-ELLs, and authors have argued that unnecessarily complex language on test items should be reduced to improve the performance of ELLs (Abedi & Lord, 2001; Wolf & Leon, 2009).
One way to examine the appropriateness of tests for ELLs is by using differential item functioning (DIF) analysis. DIF is “a difference in item performance between two comparable groups of examinees; that is, groups that are matched with respect to the construct being measured by the test” (Dorans & Holland, 1993, p. 35). DIF is present when the probability of answering an item correctly differs between groups, controlling for ability level. If a score gap between ELLs and non-ELLs results from differential item performance, it could threaten the validity of scores for the ELLs (Holland & Wainer, 1993). In other words, if particular types of questions are differentially difficult for ELLs, different total test scores may result for the two different groups, who truly have the same ability (Scherbaum & Goldstein, 2008). It is difficult to assess ELLs’ true reading ability from their reading test performance if their scores are based on items that have DIF.
These previous studies examined students’ performance differences on each item, using an item-based DIF approach. Although useful and informative, item-based DIF approaches are limited in that they do not allow us to examine the possible DIF patterns or trends across items that differ in their content or other characteristics. One way to investigate DIF trends for bundles or subsets of test items is the meta-analytic approach. The goal of meta-analytic summaries is to explore variation in effect sizes – here the DIF indices – that may be related to content differences or other features. If DIF values within the same test (e.g., across different types of questions) differ by more than just sampling error, variation in the DIF values can potentially be explained by the use of a moderator that characterizes item or sample features. In meta-analysis, moderators are important characteristics that may explain variation in effects, specifically the DIF values. Meta-analytic DIF summaries model variation in the Mantel-Haenszel (MH) DIF estimates and the reciprocal of the variance of each MH DIF value is used for weighting. Further technical details are found in Appendix A.
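The inverse-variance weighting described above can be illustrated with a minimal sketch. The function name and the three DIF values and variances are hypothetical, chosen only to show the arithmetic; they are not taken from the FCAT data.

```python
import math

def fixed_effects_mean(effects, variances):
    """Inverse-variance weighted mean of DIF effects and its standard error."""
    # Each effect is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean = sum(w * t for w, t in zip(weights, effects)) / total_weight
    # Under the fixed-effects model the variance of the mean is 1 / sum(weights).
    se = math.sqrt(1.0 / total_weight)
    return mean, se

# Hypothetical DIF estimates for three items and their sampling variances.
dif_values = [-0.30, -0.10, -0.20]
dif_variances = [0.01, 0.04, 0.02]
weighted_mean, weighted_se = fixed_effects_mean(dif_values, dif_variances)
print(round(weighted_mean, 4), round(weighted_se, 4))
```

Items with smaller sampling variance (typically those answered by more examinees) pull the weighted mean toward their own DIF value.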
Furthermore, within item sets, moderators such as student characteristics may account for some of the variation in DIF values (Koo, 2012). Again, these moderators are not usually examined in typical item-based DIF analyses. We cannot assume that DIF patterns or trends are consistent across all important demographic groups (e.g., gender or ethnicity). Gender and ethnicity are usually tested directly as sources of DIF rather than as moderators. In this study, however, gender and ethnicity were used as moderators to explain variation in DIF values for ELLs versus non-ELLs. Even when no ELL DIF is detected on a test overall, it can be useful to examine whether DIF magnitudes are consistent across gender or ethnic background. For instance, ELL DIF might be detected for females but not for males, or for Asians but not for Hispanics.
In the present work, we extend previous studies by examining DIF patterns in different types of reading-comprehension questions for children in different grades, and thus in different developmental phases. Even though previous studies have shown that some reading-comprehension questions classified by item content or features favored ELLs or non-ELLs (Kim & Jang, 2009), to our knowledge no studies have been conducted to identify ELL DIF patterns on reading tests for different age groups. Patterns of DIF might vary as a function of children’s developmental level of reading skills. Young children (e.g., third graders) are still rapidly developing their language and cognitive skills, which are critical to reading comprehension, whereas older children’s (e.g., tenth graders’) language (L1 or L2) and cognitive skills are more advanced. Thus, the extent to which DIF exists in various reading-comprehension questions, and the extent to which moderators play a role in the size of DIF, might vary across grades. In summary, we had two primary goals in the present study: (a) to use meta-analytic techniques to examine DIF patterns related to ELL status on a reading-comprehension test across different types of reading-comprehension questions; and (b) to examine variation in DIF across various types of reading-comprehension questions using the student characteristics of gender and ethnicity as moderators. To address these goals, we used data from a large-scale, high-stakes state reading-comprehension test in grades 3 and 10.
Methods
Data sources
Reading test data from the 2009 Florida Comprehensive Achievement Test (FCAT) for third and tenth graders were examined in this synthesis of Mantel-Haenszel (MH) DIF indices. Non-ELLs served as the reference group and ELLs as the focal group. In the present study ELLs were students who had received ESOL (English for Speakers of Other Languages) services in the immediately preceding two years, as noted in the state dataset. Because data on English proficiency were not available, we used receipt of ESOL services as a proxy for proficiency. A total of 173,737 third-grade students were included in the present analyses, of whom 5288 (3%) were considered to be ELLs. In tenth grade, 160,391 students were included, of whom 4293 (2.6%) were considered to be ELLs. Table 1 includes sample sizes for students grouped by gender, ethnicity (White, Black, Hispanic, and Asian), and ELL status. Multi-racial and American-Indian student groups were excluded because of their small sample sizes.
Table 1. Sample size by subgroups for grade 3 and grade 10.
Item classification
All items in the reading-comprehension assessment are classified into four types of reading-comprehension questions, grouped according to related benchmarks from the Sunshine State Standards. The benchmarks for FCAT are developed by a Reading Content Advisory Committee “composed of 15–20 Reading and/or Language Arts professionals from schools, school districts, and universities” (Florida Department of Education, 2004, p. 15). The Sunshine State Standards are reviewed by the Florida State Board of Education as well as by Florida educators and representatives of Florida College System institutions who have expertise in reading. Items concerning “words and phrases in context” (phrases-in-context hereafter) require students “to use strategies to increase vocabulary through structural clues (prefixes, suffixes, roots), word relationships (antonyms, synonyms), and words with multiple meanings” (Florida Department of Education, 2009, p. 23). “Main idea” items require students “to determine the stated or implied main idea, to recognize an organizational pattern, and to identify the author’s purpose and point of view to construct meaning from text” (Florida Department of Education, 2009, p. 23). “Comparisons and cause/effect” (cause–effect hereafter) items require students to recognize cause-and-effect relationships. “Reference and research” (evaluation hereafter) items require students to locate and interpret information for various purposes, to use multiple reference materials, and “to check validity and accuracy of research information” (Florida Department of Education, 2009, p. 23). Appendix B presents excerpts of items as examples of the four item types. Table 2 shows the number of items for each of the question types. Table 3 shows the mean scores (average numbers of correct responses) for the four reading-comprehension question types for each grade by ELL status.
Table 2. The number of items by reading-comprehension question type for third and tenth graders.
Table 3. Mean raw scores for four reading-comprehension question types for non-ELLs and ELLs.
Calculation of effect sizes
For this study, the MH DIF index was used to detect DIF because the MH procedure is one of the most commonly used methodologies to detect DIF (Holland & Thayer, 1988). It serves as the standard among various testing companies such as the Educational Testing Service (ETS). The MH odds ratio,
A matching criterion is required to analyze DIF. Ideally examinees should be matched in terms of their true reading abilities. However, no perfect reading ability measure exists in reality; thus proxies are used in practice. In this study, the total reading test score across the four item categories was used as a proxy for the true reading ability. The four categories may not measure exactly the same reading ability, but they should share a common underlying dimension, and this is interpreted as overall reading ability. Below we discuss confirmatory factor analyses that investigate whether a unitary structure is reasonable for this test.
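To make the role of the matching criterion concrete, the following is a minimal sketch of a Mantel-Haenszel common odds ratio computed over strata defined by total score. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the function names, and the counts are hypothetical, and the sign convention (focal-group odds in the numerator, so that negative log odds ratios favor the reference group) is assumed here to match the direction in which DIF values are reported in this article.

```python
import math
from collections import defaultdict

def mh_log_odds_ratio(records):
    """Log of the Mantel-Haenszel common odds ratio.

    records: iterable of (total_score, group, correct), where group is
    "ref" or "focal" and correct is 1 or 0. Examinees are matched by
    stratifying on total_score. Assumes at least one stratum contains
    both correct and incorrect responses in each group.
    """
    # One 2 x 2 table (group by right/wrong) per total-score level.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for score, group, correct in records:
        tables[score][group][0 if correct else 1] += 1

    numerator = denominator = 0.0
    for cell in tables.values():
        ref_right, ref_wrong = cell["ref"]
        foc_right, foc_wrong = cell["focal"]
        n = ref_right + ref_wrong + foc_right + foc_wrong
        # Assumed convention: focal odds over reference odds.
        numerator += foc_right * ref_wrong / n
        denominator += foc_wrong * ref_right / n
    return math.log(numerator / denominator)

def expand(score, group, n_right, n_wrong):
    """Helper (hypothetical) that turns cell counts into individual records."""
    return [(score, group, 1)] * n_right + [(score, group, 0)] * n_wrong

# Hypothetical counts at two total-score levels.
records = (expand(1, "ref", 10, 10) + expand(1, "focal", 5, 10)
           + expand(2, "ref", 20, 5) + expand(2, "focal", 8, 4))
log_or = mh_log_odds_ratio(records)
print(round(log_or, 4))
```

In this made-up example the focal group's odds of a correct answer are half those of the reference group at each score level, so the pooled log odds ratio is ln(0.5).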
The moderators gender and ethnicity were examined to identify possible differences in DIF magnitudes. Ethnicity has four categories – Hispanic, Black, Asian, and White. An ELL DIF effect size was calculated for each of eight different subgroups: Male-Hispanic, Male-Asian, Male-Black, Male-White, Female-Hispanic, Female-Asian, Female-Black, and Female-White. For this study, after calculating ELL DIF effect sizes for each separate subgroup, we investigated whether the effect sizes varied across subgroups. Thus the total number of effect sizes for each type of reading-comprehension question was eight times the total number of items for each question type.
Statistical analysis
Koo (2012) proposed the idea of applying meta-analysis techniques with DIF indices, arguing that each
Because preliminary analyses suggested that the effect sizes did not all measure a single common effect, random-effects or mixed-effects models were used. To examine differences owing to moderators, weighted mean DIF values were computed for each of the eight different subgroups. Those weighted DIF values were compared using meta-analytic analysis of variance (ANOVA; e.g., Borenstein, Hedges, Higgins, & Rothstein, 2009) to determine whether mean DIF differed systematically between males and females or across the ethnic backgrounds of examinees. Meta-analytic ANOVA allows one to examine differences owing to categorical factors such as gender and ethnicity.
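A random-effects mean adds a between-item variance component to each effect's sampling variance before weighting. The sketch below is illustrative only: the text does not say which estimator of the between-item variance was used, so the method-of-moments (DerSimonian-Laird) estimator is assumed here, and the function name and data are hypothetical.

```python
import math

def random_effects_mean(effects, variances):
    """Random-effects weighted mean, its standard error, and tau^2.

    Assumption: tau^2 (between-item variance) is estimated by the
    method of moments (DerSimonian-Laird); other estimators exist.
    """
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * t for wi, t in zip(w, effects)) / sum_w
    # Homogeneity statistic Q around the fixed-effects mean.
    q = sum(wi * (t - fixed_mean) ** 2 for wi, t in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    # Random-effects weights include the between-item variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    mean = sum(wi * t for wi, t in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return mean, se, tau2

# Hypothetical DIF values for four items with equal sampling variances.
re_mean, re_se, tau2 = random_effects_mean([-0.40, 0.00, -0.20, 0.10],
                                           [0.01, 0.01, 0.01, 0.01])
print(round(re_mean, 3), round(re_se, 4), round(tau2, 4))
```

Because the between-item variance is added to every effect's variance, the random-effects standard error is never smaller than the fixed-effects one, which is why confidence intervals widen when between-item variation is taken into account.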
In the meta-analytic approach, chi-square (Q) statistics are used to examine whether the moderators explain significant variability in the DIF effect sizes. An overall chi-square test Q examines whether all DIF effects arise from one population, which also provides a test of the fixed-effects model. The overall Q statistic has an asymptotic chi-square distribution with m − 1 degrees of freedom, where m is the number of effect sizes. Two additional types of Q statistics were also used: the Q statistic for the model (QM) and the Q statistic for error (QE). QM evaluates the variation in DIF indices explained by the moderators (Borenstein et al., 2009). If QM is large, it suggests that the moderators explain some or all of the between-items and between-groups variation. QE indicates whether the model is correctly specified; that is, whether the moderator explains all the variation in DIF that is not a result of sampling error. If QE is small and QM is large, the moderators are considered to explain all or nearly all between-items and between-groups differences in DIF. In contrast, if QE is large, then the moderators do not explain all DIF variation. In this case, a mixed-effects model that accounts for the residual unexplained variance is needed.
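The partition of the overall Q into QM and QE for a single categorical moderator can be sketched as follows under the fixed-effects model. The function names and the two-group data are hypothetical; the sketch only shows how the three statistics relate to one another.

```python
def q_statistic(effects, variances):
    """Weighted sum of squared deviations from the weighted mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * t for wi, t in zip(w, effects)) / sum(w)
    return sum(wi * (t - mean) ** 2 for wi, t in zip(w, effects))

def q_partition(groups):
    """Return (Q_total, Q_model, Q_error) for one categorical moderator.

    groups maps a category label to (effects, variances).
    Q_error pools the within-category Q values; Q_model is the
    remainder, i.e. the between-category part of Q_total.
    """
    all_effects, all_variances = [], []
    q_error = 0.0
    for effects, variances in groups.values():
        q_error += q_statistic(effects, variances)
        all_effects.extend(effects)
        all_variances.extend(variances)
    q_total = q_statistic(all_effects, all_variances)
    return q_total, q_total - q_error, q_error

# Hypothetical DIF values for two items in each of two categories.
groups = {
    "male": ([-0.10, -0.02], [0.01, 0.01]),
    "female": ([0.05, 0.11], [0.01, 0.01]),
}
q_total, q_model, q_error = q_partition(groups)
print(round(q_total, 2), round(q_model, 2), round(q_error, 2))
```

With m effects in p categories, Q_total is referred to a chi-square distribution with m − 1 degrees of freedom, Q_model with p − 1, and Q_error with m − p.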
Results
Preliminary analysis: Factor structure of reading-comprehension questions in grades 3 and 10
A confirmatory factor analysis (CFA) was conducted to assess the model fit of one-factor and four-factor structures for the four categories of FCAT reading questions. Several fit indices were used to assess the factor structures: the root mean square error of approximation (RMSEA), comparative fit index (CFI), normed fit index (NFI), and chi-square. The four-factor model was specified a priori, based on the item groupings found in the Sunshine State Standards (i.e., phrases-in-context, main idea, cause–effect, and evaluation). Thus, we used CFA to compare the four-subtest (four-factor) model with a unitary (one-factor) model.
Table 4 shows results of confirmatory analyses of one-factor and four-factor structures for the reading questions. In grade 3, RMSEA values for the one-factor and four-factor models were <.01 and .02, respectively. RMSEA values of less than .06 reflect a good fit (Hu & Bentler, 1999). Also, the CFIs of .92 and .93 for the one-factor and four-factor models met the criterion (.90 or larger) for acceptable model fit (Hu & Bentler, 1999). In addition, the NFI values of .92 and .93 for the one-factor and four-factor models met the criterion (.90 or larger) for acceptable model fit (Hu & Bentler, 1999). Thus both models are plausible and fit reasonably well for grade 3.
Table 4. Summary of tests of model fit of four reading question types for grade 3 and grade 10.
Note: **p < .001. df = degrees of freedom. RMSEA = root mean square error of approximation. CFI = comparative fit index. NFI = normed fit index.
To further evaluate model fit of the four-factor model, we examined the chi-square fit test values for grade 3. The chi-square values for the one-factor and four-factor models were 90,120.3 (p < .01) and 88,839.8 (p < .01), respectively. Both were statistically significant, indicating that neither model showed acceptable fit by this criterion. However, a chi-square difference test showed that the four-factor model for grade 3 fit significantly better than the one-factor model (χ² difference = 1280.4 with df = 6).
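The chi-square difference test for nested models can be checked with a short sketch that uses only the values reported above. The helper function is hypothetical and relies on the closed-form upper-tail probability of the chi-square distribution for even degrees of freedom, so no statistics library is needed. The difference recomputed from the rounded chi-square values is 1280.5 rather than the reported 1280.4, presumably because the published values were rounded.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Grade 3 chi-square values as reported in the text (rounded).
chi2_one_factor = 90120.3
chi2_four_factor = 88839.8
difference = chi2_one_factor - chi2_four_factor
p_value = chi2_sf_even_df(difference, df=6)
print(round(difference, 1), p_value < 0.001)
```

The p-value is vanishingly small, consistent with the conclusion that the four-factor model fits significantly better.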
Results were similar for grade 10. Even though the chi-square values for the one-factor and four-factor models (37,815.8 with p < .01 and 37,287.3 with p < .01) were statistically significant, RMSEA (.02), CFI (.96) and NFI (.96) values for both models were identical and met the criteria for good fit. However, the four-factor model was again significantly better than the one-factor model according to the chi-square difference test (χ² difference = 528.46 with df = 6). These results confirmed that both the unitary model and the four-factor model were reasonable for grades 3 and 10. Thus it is not problematic to use the overall test scores as the matching variables for the DIF analyses. However, the chi-square difference tests show that the four reading categories for grade 3 and grade 10 are acceptable as factors and that the four-factor model fits statistically better than the one-factor model for FCAT reading questions in both grades. This suggests that distinctions between item categories in their degrees of DIF may also exist.
Research Question 1: Analysis of overall DIF values for four types of reading-comprehension questions for grades 3 and 10
Table 5 presents the weighted mean DIF values and standard errors under the random-effects model for the four reading question types for grades 3 and 10, and Figure 1 shows the associated confidence intervals of the mean DIF values by question type and grade. In grade 3, two item types showed significant mean DIF values: phrases-in-context and main-idea items. In contrast, the means for cause–effect and evaluation items in grade 3 were essentially zero, indicating that the probability of answering correctly did not differ between grade 3 non-ELLs and ELLs of the same ability level. The weighted mean DIF value for phrases-in-context items for grade 3 was −0.19. Its confidence interval did not include zero, suggesting that non-ELLs tended to perform better than ELLs on phrases-in-context items, after controlling for reading ability. The confidence interval for DIF on main-idea items, with a mean of .046, also did not include zero. However, although the main-idea DIF effect was significant, the fact that the mean and interval endpoints were all very close to zero suggests a minimal DIF effect.
Table 5. Random-effects overall DIF estimates and confidence intervals by grade and item type.
Note: *Significant. k = number of effects.

Figure 1. The random-effects confidence intervals for the weighted ELL DIF mean for four types of reading question in grades 3 and 10.
In grade 10, the confidence intervals of mean DIF for all types of reading questions included zero except that for DIF of evaluation items. The positive mean of .096 and positive confidence interval indicate that ELLs in grade 10 tended to perform slightly better on evaluation questions than non-ELLs after controlling for reading ability.
Even though the magnitudes of DIF were not large in our samples, the patterns of DIF may contribute to different total test-score averages for ELLs and non-ELLs if many small differences accumulate systematically (Scherbaum & Goldstein, 2008).
Research Question 2: Categorical analyses using moderators
Meta-analytic ANOVA was conducted to further investigate differences among effect sizes according to two moderators, gender and ethnicity. All statistics including weighted mean DIF values, standard errors, and Q statistics for grade 3 and grade 10 are shown in Tables 6 and 7, respectively.
Table 6. Categorical analyses across reading-comprehension question types for grade 3.
Note: *p < .05. k = number of effects. b Mixed-effects model. c Fixed-effects model.
The fixed-effects model was used for QE and QM.
Table 7. Categorical analyses across reading-comprehension question types for grade 10.
Note: *p < .05. k = number of effects. b Mixed-effects model. c Fixed-effects model.
The fixed-effects model was used to compute QE and QM.
Initial tests of gender differences in DIF for all reading-comprehension question types were not statistically significant in either grade, except for main-idea items in grade 10. Thus gender did not explain differences among DIF values except for grade 10 main-idea items (QM(1) = 4.4, p < .05). Mean main-idea DIF values for males and females in grade 10 were −0.06 and −0.02, respectively. However, within the two gender groups in grade 10, DIF values varied significantly across items (QE values were 368.13 for boys and 448.35 for girls, df = 68). Mixed-effects confidence intervals show that once between-items variation is incorporated into the uncertainty of the analyses, the gender differences are no longer significant (mixed-effects QM(1) = 0.71, p = .39). Figure 2 shows these mixed-effects confidence intervals for the mean DIF of main-idea items in grade 10.

Figure 2. Confidence intervals for the mean ELL DIF of “main idea” items by gender in grade 10.
Four ethnic groups (Asian, Black, Hispanic, and White students) were also examined for differences in their mean DIF values. The test of between-groups differences was statistically significant in grade 3 only for phrases-in-context items. The weighted mean DIF values for phrases-in-context items differed by ethnicity under the fixed-effects model (QM(3) = 14.21, p < .05), indicating that ethnicity explained some variability in the DIF indices. However, once between-item variation was considered, the ethnicity differences in grade 3 were no longer significant (mixed-effects QM(3) = 1.76, p = .62).
Figure 3 shows the confidence intervals for the mean DIF of phrases-in-context items in grade 3. None of the confidence intervals included 0 and all were negative. Specifically, the confidence interval for DIF of phrases-in-context items for Whites was much lower compared to those for the other ethnicities, indicating that after controlling for reading ability White ELLs tended to perform much worse than White non-ELLs on phrases-in-context items. The lower bound of the confidence interval for phrases-in-context was −0.43, which is in the moderate DIF range (i.e., category B) according to the ETS criteria for DIF. Zwick and Ercikan (1989) recommended that category B items be replaced with equivalent items.
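The link between a value of −0.43 and the ETS categories can be illustrated with a small sketch. It rests on two assumptions that the surviving text does not state explicitly: that the DIF values reported here are on the natural-log odds-ratio scale, and that only the magnitude thresholds of the ETS scheme (1.0 and 1.5 on the delta scale, obtained by multiplying the log odds ratio by 2.35) are applied. The full ETS rules also involve significance tests, which this simplified, hypothetical helper ignores.

```python
def magnitude_band(log_odds_ratio):
    """Magnitude-only approximation of the ETS A/B/C DIF categories.

    Assumption: the input is a log odds ratio; its absolute value is
    rescaled to the ETS delta metric by the factor 2.35. The
    significance conditions of the full ETS rules are not applied.
    """
    delta_size = 2.35 * abs(log_odds_ratio)
    if delta_size < 1.0:
        return "A"  # negligible
    if delta_size < 1.5:
        return "B"  # moderate
    return "C"      # large

print(magnitude_band(-0.43), magnitude_band(-0.19))
```

Under these assumptions, a log odds ratio of −0.43 corresponds to a delta magnitude of about 1.01, just past the threshold for category B, whereas the overall grade 3 mean of −0.19 falls in category A.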

Figure 3. Confidence intervals for the mean ELL DIF of “phrases-in-context” items by ethnicity in grade 3.
The results for the ethnicity effects in grade 10 differed from those in grade 3. Ethnicity explained some differences in DIF for main-idea items for tenth graders with QM(3) = 11.66, p < .05. After incorporating uncertainty owing to between-item variation, ethnicity differences on main-idea items were not significant (mixed-effects QM(3) = 3.42, p = .33). However, as was true for other analyses, significant within-ethnicity variation caused the mixed-model standard errors in Table 7 to be relatively large, so the significant ethnicity effect found under fixed effects should be considered tentative, but worthy of further study. The mixed-effects confidence intervals for the DIF values of main-idea items are presented in Figure 4. The confidence intervals of main-idea-item DIF means for Asians, Blacks, and Hispanics all covered zero, indicating that the performance of Asian, Black, and Hispanic ELLs was similar to that of Asian, Black, and Hispanic non-ELLs on main-idea items after controlling for reading ability. However, the confidence interval of the mean DIF for main-idea items for Whites fell largely in the negative range, indicating White non-ELLs tended to perform better than White ELLs on main-idea items, after controlling for reading ability.

Figure 4. Confidence intervals for the mean ELL DIF of “main idea” items by ethnicity in grade 10.
Ethnicity explained some differences in DIF for evaluation items for tenth graders under both the fixed-effects model (QM(3) = 30.86, p < .05) and the mixed-effects model (mixed-effects QM(3) = 8.76, p < .05). The confidence intervals for the DIF values of evaluation items are presented in Figure 5. The confidence intervals of evaluation-item DIF values for Asians, Hispanics, and Whites were mostly positive, indicating that ELLs in these three groups tended to perform better than non-ELLs on evaluation items after controlling for reading ability in grade 10. For Hispanics the mean DIF value was positive and also significant, indicating that Hispanic ELLs were significantly more likely than Hispanic non-ELLs of the same ability level to answer evaluation items correctly. The opposite pattern was seen for Blacks. The confidence interval of the mean DIF value for evaluation items for Blacks fell largely in the negative range, indicating that Black non-ELLs tended to perform somewhat better than Black ELLs on evaluation items, after controlling for reading ability.

Figure 5. Confidence intervals for the mean ELL DIF of “evaluation” items in grade 10 by ethnicity.
Discussion
In the present study, we examined the extent to which ELL and non-ELL students’ performance on different types of questions in a high-stakes reading-comprehension test varied for students who are in different developmental phases – third and tenth graders. In particular, we investigated whether DIF varied as a function of students’ gender and ethnic backgrounds. First, we found using confirmatory factor analysis that four types of reading-comprehension questions – phrases-in-context, main idea, cause–effect, and evaluation – existed in the FCAT reading comprehension test for third and tenth graders.
Overall, ELL third graders’ performance was lower than that of non-ELLs on the phrases-in-context items, after controlling for ability level. This result is somewhat similar to Kim and Jang’s (2009) finding that ELLs had poorer performance on items involving vocabulary knowledge. Thus, even after ELL students’ overall reading scores were taken into consideration, their performance on words and phrase usage was weaker than that of non-ELL students. The phrases-in-context items included vocabulary knowledge such as morphology, word relations, and polysemy. By definition, the ELLs in the present study were receiving ESOL services as a result of limited English proficiency, for which vocabulary is an important component. Therefore, it is not surprising that ELLs performed poorly on vocabulary-related items. If the goal of reading-comprehension assessment at lower grade levels is largely to assess knowledge of vocabulary and understanding expressions in text, given its critical role in reading and language development (this is true for non-ELLs as well; e.g., Anderson & Freebody, 1983; Stahl, 2003), then the inclusion of vocabulary and expression items (i.e., the words and phrases in context items) is justified. In contrast, if the goal of the reading-comprehension assessment is to measure higher-level inferential and evaluative skills with reading materials, the numbers of such vocabulary and expression items should be limited.
Our finding of significant DIF for phrases-in-context items held only for third graders, and while the effect was strongest on average for White students, all ethnic groups showed negative DIF values. Perhaps ELLs at lower grade levels have not yet developed strategies for learning vocabulary that are as effective as those of older ELLs. This might lead ELLs at lower grade levels to have difficulties with phrases-in-context items.
In contrast, ELL tenth graders’ performance was significantly better than that of non-ELL students on the evaluation items, after controlling for ability level. A similar result was also found in Kim and Jang’s (2009) study where ELLs tended to perform better than non-ELLs on evaluation items, after controlling for ability level. Although further investigation of the question of why tenth-grade ELLs should perform better on evaluation items than non-ELLs is beyond the scope of the present study, we suspect it may be related to ELLs’ prior educational experiences in their native languages. In other words, most tenth-grade ELLs would have had several years of formal education in their native language. Depending on the nature and quality of the education they received in their native languages, these students’ evaluative skills, which involve identifying and interpreting information from various sources, might have been fairly advanced. Although these students are designated as ELLs as a result of their lack of English language proficiency, they may have been able to transfer key skills needed for evaluation items from their native language to English.
An important way in which the present study extends previous studies is by examining the moderators of gender and ethnic background. We found no gender effects in either grade after accounting for between-items uncertainty in DIF magnitudes. When students’ ethnic backgrounds were considered, third-grade ELLs in every ethnic group tended to perform more poorly than non-ELLs on phrases-in-context items, after controlling for ability level. The effect was strongest for White students. The confidence interval of the mean phrases-in-context DIF value for Whites was, in fact, close to the ‘moderate DIF’ category according to ETS criteria. Furthermore, White ELLs in grade 10 tended to perform worse than White non-ELLs on main-idea items, after controlling for ability level. In contrast, tenth-grade ELLs of Asian, Hispanic, and White origin tended to outperform non-ELLs on evaluation items after controlling for ability level. These results suggest that, to achieve fairness and validity, item writers and test developers should be aware that phrases-in-context items in grade 3 and main-idea items in grade 10 are sensitive to DIF for White ELLs. Because most of the White students are of Indo-European origin, these results are likely attributable to their native language and ethnic backgrounds. Ethnicity is also a proxy, to some degree, for the native language and cultural/educational differences of ELL students, and thus may tie in to possible item specifics that could lead to DIF in reading tests (e.g., Asian languages have completely different structures than Indo-European languages).
Overall, the results of the present study suggest that educators need to provide additional sustained help on phrases-in-context items to ELLs at lower grade levels. Although a need in this area for ELLs might be apparent in the beginning stages of language and reading acquisition, studies have indicated that even after ELLs are no longer designated as “limited English proficiency” learners, they still need continued support to succeed in academic school subjects (Francis, Rivera, Lesaux, & Kieffer, 2006). In terms of results for differential performance by gender, studies in reading have consistently shown that male students lag in reading skills and show disproportionately higher levels of reading disability. Recent studies have shown that this is not likely to reflect differential identification and labeling, but may reflect true performance differences (Badian, 1999; Wagner & Schatschneider, 2009; but see Siegel & Smythe, 2005). In our study, after controlling for ability level, gender differences in ELL DIF were not significant in either grade, except for main-idea items in grade 10 under the fixed-effects model; that difference was no longer significant once between-item variation was taken into account.
Future studies and conclusion
Although the findings of the present study are informative, the study has several limitations. First, the results are limited to the reading-comprehension assessment used in the present study and to the racial and gender composition of the student population in Florida. Thus, further investigation with samples from other places would be necessary in order to generalize our findings. Second, more research is needed to explain the mechanisms behind the moderating factors. Specifically, it is not clear why Asian, Hispanic, and White ELLs outperformed non-ELLs on evaluation items in grade 10. Understanding the mechanism is important for pinpointing educational implications. Moreover, the findings related to ethnicity should be interpreted with caution because we did not have information about the language and cultural backgrounds of the students from different ethnic backgrounds. It would be helpful to examine the backgrounds of those who were included in each ethnicity category. For instance, Black ELLs might be from the Caribbean or Africa, and thus could have linguistic and cultural experiences that are quite different. Examining ELLs’ linguistic and cultural backgrounds could help us to better understand differential performance on different types of items.
Appendix A: Calculation of Mantel-Haenszel DIF index and effect sizes in the meta-analysis
where j is the index for level of the total test score and Aj, Bj, Cj, and Dj are the counts given in Table A1. In Table A1, Aj and Cj are the numbers of examinees at score level j who find the correct answer, and Bj and Dj are the numbers of examinees at score level j who do not find the correct answer. Also, mCj is the sum of Aj and Cj and mIj is the sum of Bj and Dj at score level j. Last, nFj, nRj and nj are the total numbers of examinees in the focal group, reference group, and total group at score level j, respectively.
where wi is the reciprocal of the variance of
and the standard error under the fixed-effects model is then simply
where Vi is the within-item sampling variance for item i given above and
where
Appendix B: Example items from the 2006 FCAT released reading test for grade 10
Excerpts of items are presented here as examples of the four item types. Text passages that were associated with the items are not presented in some cases.
Read these lines from the poem “Woman with Flower.”
What is the meaning of the word nurturing as it is used in these lines?
