Abstract

This chapter addresses the psychometric challenges in assessing English language learners (ELLs) and students with disabilities (SWDs). The first section addresses some general considerations in the assessment of ELLs and SWDs, including the prevalence of ELLs and SWDs in the student population, federal and state legislation that requires the inclusion of all students in large-scale assessments, validity considerations in the assessment of ELLs and SWDs, importance of test accommodations in their assessment, and an introduction to the psychometric challenges, which are intricately interwoven with validity and fairness considerations, in assessing ELLs and SWDs. The second section discusses the efficacy of test accommodations and modifications for SWDs and ELLs. The third section addresses the need for invariant measurement for ELLs and SWDs. In the assessment of a diverse student population it is important to examine the extent to which the psychometric properties of a test are invariant across groups of students. This necessitates obtaining evidence of reliability and score precision, internal structure evidence, external structure evidence, and evidence of equating invariance for ELLs and SWDs. The establishment of measurement invariance for ELLs and SWDs is required to make valid and fair score interpretations for these students and for group comparisons. Under the No Child Left Behind Act (NCLB) of 2002, growth measures have been implemented to help determine Adequate Yearly Progress (U.S. Department of Education, 2005); however, the research on the efficacy of models for monitoring change for ELLs and SWDs has been scarce. Such research is crucial given that the federal Race to the Top initiative calls for multiple measures in educator evaluation systems, including measures that assess student progress (U.S. Department of Education, 2010). The last section addresses issues related to including SWDs and ELLs in measures of “growth” and of educator effectiveness.
English Language Learners and Students With Disabilities
English language learners are the fastest growing group of the nation’s student population, and it is projected that 25% of the students nationally will be ELLs by 2025 (National Clearinghouse for English Language Acquisition & Language Instruction Educational Programs, 2007). The definition of an ELL provided by Title IX of the Elementary and Secondary Education Act indicates that an ELL is a student between the ages of 3 and 21 years “whose difficulties in speaking, listening, reading, writing or understanding the English language may be sufficient to deny the individual the ability to meet the State’s proficient level of achievement . . ., the ability to successfully achieve in classrooms where the language of instruction is English . . .” (English Learners Office, E. J. McClendon Educational Center, 2002).
To better understand the challenges in classifying students as ELLs, the reader is referred to Abedi (2008) who addresses the nuances in defining ELLs and provides recommendations for improving the validity of classifying students as ELLs. ELLs develop English language proficiency at different paces, with meaningful differences in their learning trajectories, and new ELLs enter schools each year. As a consequence, these students are variable in terms of both their English proficiency and their academic proficiency. It is important to recognize that ELLs differ from non-ELLs in terms of not only English language proficiency but also cultural and educational backgrounds (Abedi, 2004; Solano-Flores & Trumbull, 2003). There is ample evidence that there is a large performance gap between ELLs and non-ELLs in all content areas, particularly in those with heavy English language demands (Abedi, 2006; Abedi & Gandara, 2006; Solano-Flores & Trumbull, 2003). As described by Abedi and his colleagues (Abedi, 2004; Abedi & Gandara, 2006), there are many factors that contribute to the performance gap, including parent education level and poverty, challenges in second language (L2) acquisition, inequitable opportunities to learn (OTLs), and tests that are not well suited to assessing their knowledge and skills. Based on a three-state study, J. Kim and Herman (2009) suggested that both linguistic barriers and long-term ELL designation may contribute to achievement gaps for ELLs.
There are 6.6 million SWDs who receive special education, making up 13% of public school enrollment. They are disproportionately poor, minority, and identified as ELLs (“ESEA Reauthorization,” 2010). The majority of SWDs, about 80% to 85%, are students without intellectual impairments; they are “students who with specially designated instruction, appropriate access, supports, and accommodations, as required by IDEA, can meet the same achievement standards as other students” (“ESEA Reauthorization,” 2010, p. 39). The classification of students as SWDs and ELLs is not necessarily disjoint. ELLs who are at the lower end of the achievement scale tend to be overidentified as SWDs. Based on data from the 1998–1999 academic year in eleven urban school districts in California, Artiles, Rueda, Salazar, and Higareda (2005) found that ELLs with minimal proficiency were at a greater risk of being identified as learning disabled and being placed in special education programs than any other group of students of similar achievement. Artiles and colleagues (2005) proposed several reasons why ELLs are overidentified as learning disabled, including problems with the screening process, invalid assessment instruments, the belief that language differences constitute a disability, and accountability pressures.
As implied above, SWDs and ELLs are not homogeneous groups, in that there is diversity within each of these groups of students. This diversity within SWDs and ELLs poses many validity and psychometric challenges in their assessment.
Federal and State Legislation for ELLs and SWDs
Federal (NCLB of 2002) and state legislation requires the inclusion of all students, including ELLs and SWDs, in large-scale assessments, holding districts and states accountable to the same standards for all students. To accomplish this, NCLB requires states to disaggregate their test data and report the performance of student subgroups, including SWDs and ELLs. Assessments therefore need to measure the targeted construct equally well for all examinees, regardless of group membership (Thompson, Blount, & Thurlow, 2002). The two assessment consortia, Partnership for Assessment of Readiness for College and Career (PARCC) and Smarter Balanced Assessment Consortium (Smarter Balanced; SBAC), will continue to provide disaggregated test data for student subgroups and have committed to ensuring comparability of test scores across students. The goals of the Smarter Balanced assessment system are to provide accurate measures of achievement and growth for students with disabilities and English language learners. The assessments will address visual, auditory, and physical access barriers—as well as the unique needs of English language learners—allowing virtually all students to demonstrate what they know and can do. (http://www.smarterbalanced.org/parents-students/support-for-under-represented-students)
Similarly, a major goal of PARCC is to provide “all students, including but not limited to, students with disabilities, English learners, and underserved populations with
Apply principles of
Minimize/eliminate features of the assessment that are irrelevant to what is being measured, so that all students can more
Measure the
Leverage technology for delivering assessment components as widely accessible as possible
Build accessibility throughout the test itself with
Use a combination of
Establish
A major goal of such design principles for both PARCC and SBAC is to ensure the comparability of scores across students and subgroups of students, leading to equitable and valid score inferences and uses of test results.
Argument-Based Approach to Validity of Score Inferences
In the assessment of all students it is necessary to consider the evidence that is needed to support the validity of the score inferences and uses of the test results. This requires the specification of the inferences and uses, evaluation of the proposed inferences and their supporting assumptions using evidence, and consideration of plausible alternative interpretations. The argument-based approach to validity, which entails an interpretative and use (IU) argument and a validity argument, provides the foundation for test development and evaluation considerations (Kane, 2006, 2013). An IU argument explicitly links the inferences from performance to conclusions and decisions, including the actions resulting from the decisions. A validity argument provides a structure for evaluating the merits of the IU arguments; it requires the accumulation of theoretical and empirical support for the appropriateness of the claims (Kane, 2006). Each inference in the validity argument is based on a proposition or claim that requires support. The validity argument entails an overall evaluation of the plausibility of the proposed interpretations and uses of test scores by providing a coherent analysis of the evidence for and against the proposed interpretations and uses (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014; Cronbach, 1988; Kane, 1992; Messick, 1989).
Two sources of potential threat to the validity of score inferences are construct underrepresentation and construct-irrelevant variance (Messick, 1989). Construct underrepresentation occurs when an assessment does not fully capture the targeted construct, jeopardizing the generalizability of the score inferences to the larger domain. An example of when construct underrepresentation may occur is when ELLs are not assessed in their dominant language, which limits their access to the construct tested, resulting in test scores that do not represent their proficiency on the tested construct. This implies that test developers need to ensure that the knowledge and skills being assessed by the items and scoring rubrics represent the targeted knowledge and skills and that students have access to the content tested. Construct-irrelevant variance occurs when one or more irrelevant constructs are assessed along with the intended construct, systematically lowering or raising scores for subgroups of students. As an example, SWDs often have barriers that will have an impact on their performance on achievement tests. Potential sources of construct-irrelevant variance for SWDs and ELLs include, but are not limited to, linguistic demands of items, context and format of items, response mode, and rater’s or computer’s attention to irrelevant features of responses.
Test Accommodations for ELLs and SWDs
Test accommodations have an important role in discussions of fairness and accessibility, with direct implications for the validity of score interpretations for SWDs and ELLs. Test accommodations are typically provided to address both construct-irrelevant variance and construct underrepresentation. The purpose of accommodations is to ensure that SWDs and ELLs have full access to the construct the test is measuring and respond in a way that represents the students’ knowledge, skills, and abilities on the intended construct, and to promote valid score interpretations and uses (Tindal & Fuchs, 1999). Accommodations are intended to increase the validity of score interpretations and uses by minimizing the impact of student attributes that are irrelevant to the construct being measured, and therefore allowing access to the construct the test is measuring. This allows for scores for accommodated students to be meaningfully compared to students for whom testing accommodations are not needed. Providing students with appropriate accommodations is considered a way to level the playing field for SWDs and ELLs. Consequently, it is necessary to examine how the accommodations provided to students affect the measurement of the targeted construct, that is, the extent to which the accommodations give SWDs and ELLs a fair and equal opportunity to demonstrate what they know and can do (see Thurlow & Kopriva, 2015, this volume, for a discussion on how accommodations can help ensure that all ELLs and SWDs have access to the content of the assessments).
Psychometric Challenges in the Assessment of ELLs and SWDs
The validity of score interpretations for subgroups at one occasion and across occasions requires equity and comparability of test scores, which implies the need for measurement invariance across subgroups. Underlying the measurement of students is a critical assumption that the test score scale is measuring the same construct with the same precision for all subgroups of students. If that assumption holds, analyses and comparisons of the scores for different subgroups are appropriate and can provide meaningful interpretations. But if that assumption does not hold, such analyses and comparisons are compromised.
The educational testing of SWDs and ELLs as well as other student subgroups poses a number of psychometric challenges. A major concern is the extent to which there is measurement invariance across tested subgroups. Measurement invariance implies that the internal structure (e.g., factorial structure) of the test is similar across subgroups and the items are functioning similarly across subgroups, providing support that the test is measuring the same construct(s) across the subgroups. Measurement invariance also implies that the score scale is comparable across subgroups and that the construct is being measured equally precisely for the subgroups, and consequently an evaluation of the reliability and score precision for the subgroups is needed. When forms of tests are equated, measurement invariance also requires that equating functions derived from different subgroups produce the same results if scores are to be equitable. Last, there are psychometric issues related to including SWDs and ELLs in the measurement of “growth” and educator effectiveness. Growth models have been used in NCLB accountability to monitor states’ progress in closing achievement gaps and to set high expectations for annual improvement for all students. There has been little research, however, on the efficacy of these models for tracking and monitoring change for ELLs and SWDs. Furthermore, the Race to the Top initiative calls for multiple measures in educator evaluation systems, including measures of student growth (U.S. Department of Education, 2010).
A number of psychometric challenges arise when examining measurement invariance across subgroups. Many of these psychometric issues in testing SWDs and ELLs, including reliability, score precision, comparability, and equating, have been discussed in the literature (e.g., Abedi, 2002; Geisinger, 1994; Sireci, 2009; Sireci, Han, & Wells, 2008). Typically, the subgroups of interest tend to perform more poorly than the general population, and it is likely that the test data may be more multidimensional for students performing at the lower end of the score scale because these students have acquired knowledge in some areas but not others (Abedi, 2002). Consequently, lack of measurement invariance may be a result of overall proficiency on the construct rather than ELL or SWD status. This implies that measurement invariance should be evaluated using not only the general population as a comparison group but also students who are scoring within the same score range as the subgroup of interest. A related concern is that indices that are used to examine measurement invariance are affected by restriction of score range. Examining measurement invariance with groups of students within similar score ranges will help alleviate some of these concerns. Furthermore, each of the groups, SWDs and ELLs, is heterogeneous, so when sample size permits, invariance in the reported scores should be evaluated with relevant subgroups of SWDs and ELLs, such as student groups defined by use of particular test accommodations and by language dialect (Sireci, 2009; Solano-Flores, 2006). The precision of the scores also needs to be evaluated for different subgroups of students.
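As a rough illustration of evaluating score precision separately by subgroup, the following sketch computes coefficient alpha within each group. It is a minimal example under stated assumptions, not a procedure taken from the studies cited: the function names and the tiny response matrix are hypothetical, and a real analysis would also compare groups within similar score ranges, as discussed above.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows (one score per item).

    Assumes at least two items and nonzero total-score variance.
    """
    n_items = len(item_scores[0])
    # Variance of each item across examinees
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(n_items)]
    # Variance of the examinees' total scores
    total_var = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def alpha_by_group(responses, groups):
    """Return {group label: alpha} so precision can be compared across subgroups."""
    rows_by_group = {}
    for row, label in zip(responses, groups):
        rows_by_group.setdefault(label, []).append(row)
    return {label: coefficient_alpha(rows) for label, rows in rows_by_group.items()}


# Hypothetical item responses (1 = correct, 0 = incorrect) for illustration only
responses = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
groups = ["ELL", "ELL", "non-ELL", "non-ELL"]
alphas = alpha_by_group(responses, groups)
```

Markedly different values across subgroups would suggest that the construct is not being measured with the same precision for each group, although with real data the comparison should account for differences in score range.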
Efficacy of Test Accommodations and Modifications for ELLs and SWDs
To help alleviate construct-irrelevant barriers to the assessment and construct underrepresentation, test accommodations and test modifications are provided for SWDs and ELLs. Accommodations are changes to test materials and procedures that are assumed not to alter the construct tested, whereas modifications include changes in items (e.g., reducing length of reading passages) or test procedures for which it is not clear whether these modifications affect only access or also affect the construct being measured and, consequently, the score inferences that are drawn (Kettler, Elliott, & Beddow, 2009). As a result, modifications may affect the comparability of test scores for students who received the modification versus those who do not receive the modification. If there is no available evidence that scores remain comparable to the intended construct measured, changes to items or the test are commonly called modifications (S. E. Phillips & Camara, 2006). The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) indicate that test accommodations result in comparable measures that maintain the intended construct, whereas test modifications most likely will result in noncomparable measures that change the intended construct. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) state that the purpose of an accommodation is to “minimize the impact of test taker attributes that are not relevant to the construct that is the primary focus of the assessment” (p. 101). Test accommodations can be categorized into five areas: timing (alternative test schedules), response (alternative ways to respond to the test), setting (changes to test setting), equipment and materials (use of additional references or devices), and presentation (alternative ways to present test materials) (Thurlow, Lazarus, & Christensen, 2013). For a more thorough description of the types of test accommodations, see Thurlow and Kopriva (2015, this volume).
The efficacy of test accommodations has been examined by comparing the mean performance level of either ELLs or SWDs on an accommodated test with the level on a nonaccommodated test. This approach has been criticized because improved performance on the accommodated test may indicate that the accommodated version was simply easier. Because of this, researchers have examined the improvement in average scores for the subgroup as compared to the general population. If accommodations have a positive effect for the subgroup but not the general population, it is assumed that construct-irrelevant variance (e.g., due to limited English proficiency for ELLs) was reduced, which has been termed the interaction hypothesis. Based on a review of studies examining the interaction hypothesis for SWDs, Sireci, Scarpati, and Li (2005) suggested that the interaction hypothesis needs to be refined, that is, gains may be observed for the general population on the accommodated version but the gains are significantly larger for the accommodated SWD group. This phenomenon is referred to as differential boost. The differential boost or disability group-by-accommodation interaction hypothesis refers to an accommodation improving performance of SWDs (or ELLs) more than it improves performance of students without disabilities (SWoDs; or non-ELLs; Fuchs & Fuchs, 2001).
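The distinction between the interaction hypothesis and differential boost can be expressed as a comparison of gains. The sketch below is descriptive only and rests on assumptions: the function names, the tolerance parameter, and the group means are hypothetical, and the studies reviewed here test the group-by-condition interaction statistically rather than by comparing raw mean gains.

```python
def accommodation_gains(swd_std, swd_acc, swod_std, swod_acc):
    """Return (SWD gain, SWoD gain, difference in gains) from four group means.

    *_std is the mean under the standard condition and *_acc the mean under
    the accommodated condition.
    """
    swd_gain = swd_acc - swd_std
    swod_gain = swod_acc - swod_std
    return swd_gain, swod_gain, swd_gain - swod_gain


def classify_pattern(swd_gain, swod_gain, tol=0):
    """Label the descriptive pattern; tol is a hypothetical 'negligible gain' threshold."""
    if swd_gain > tol and swod_gain <= tol:
        # Only the target subgroup benefits: the original interaction hypothesis
        return "interaction hypothesis"
    if swd_gain - swod_gain > tol:
        # Both groups may gain, but the target subgroup gains more
        return "differential boost"
    return "no differential boost"


# Hypothetical mean scores for illustration only
swd_gain, swod_gain, boost = accommodation_gains(40, 48, 55, 58)
pattern = classify_pattern(swd_gain, swod_gain)
```

With these made-up means both groups gain, but SWDs gain 5 points more than SWoDs, which matches the refined, differential boost reading rather than the original interaction hypothesis.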
Experimental studies in which accommodations are given to randomly assigned groups of students, SWDs and SWoDs or ELLs and non-ELLs, allow for more valid inferences regarding the effects of accommodations than inferences based on correlations. However, there are a number of challenges in conducting experimental studies, including small sample sizes and less representative samples. When interpreting the results of such studies, these issues should be considered. Some researchers have examined the effects of accommodations using large-scale assessment data as described below (e.g., Engelhard, Fincher, & Domaleski, 2011), which allows for a large representative sample of students in these subgroups.
Efficacy of Test Accommodations for SWDs
Some early studies examined the effects of accommodations for just SWDs and did not compare them with SWoDs. As an example, the effects of extended time on the Reasoning section of the SAT were examined on scores of students with learning disabilities when they were first tested under standard conditions and then tested with extended time (Camara, Copeland, & Rothschild, 1998). Score gains were 3 times larger for students with learning disabilities who had extended time as compared to SWoDs.
Sireci et al. (2005) provided a review of experimental research studies examining the effects of test accommodations for SWDs as compared to SWoDs. The most common accommodations were extended time and oral presentation, with many of the studies focusing on students with learning disabilities. For most of the studies testing the interaction hypothesis, all students, both SWDs and SWoDs, performed better under the accommodated testing condition. Moreover, SWDs had greater score gains than SWoDs when provided with accommodations, giving support for the differential boost hypothesis (Fuchs & Fuchs, 2001). As an example, a study on the impact of accommodations on performance of elementary school SWDs and SWoDs found that students with learning disabilities had a differential boost from the read-aloud accommodation on a reading comprehension test but not from extended time or provision of large-print text (Fuchs, Fuchs, Eaton, Hamlett, & Karns, 2000). Kosciolek and Ysseldyke (2000) used a counterbalanced design to examine the effects of the oral accommodation on reading performance for a commercially produced test. Based on a repeated-measures analysis of variance (ANOVA), they found that SWDs had much larger gains under the oral administration of the test as compared to SWoDs (effect size = .56). Other research on extra time as an accommodation indicates that SWDs benefit differentially when compared with SWoDs and that extra time does not appear to alter the constructs tested by most state achievement tests (Sireci, Li, & Scarpati, 2003). In an experimental study using a mathematics state test, Johnson (2000) reported that SWDs gained considerably more than SWoDs when tested under standard conditions and then retested with an oral accommodation of the test.
In the 2010 National Center for Educational Outcomes report, Cormier, Altman, Shyyan, and Thurlow reviewed studies published in 2007 and 2008 for read-aloud accommodations for SWDs. Three studies indicated differential boost and three indicated no differential boost for SWDs. Subsequently, the National Center for Educational Outcomes examined studies published in 2009 and 2010 and reported that there was a differential boost in three read-aloud studies for SWDs and parallel boosts in two other studies (Rogers, Christian, & Thurlow, 2012).
For the purpose of this chapter, published studies in peer-reviewed journals evaluating the differential boost hypothesis for SWDs were examined from 2004 to 2013. This time period was chosen because Sireci et al. (2005) reviewed studies examining the effects of test accommodations through the 2003 year. Our review indicated that evidence for differential boost was not independent of type of research design being implemented, age of the students, type of accommodation, construct being measured, and sample size. For the review, individual studies were subgrouped by grade level, accommodation, and construct measured; a study with two accommodations each at two grade levels is described as four separate studies. Of the 11 studies that used an experimental design, 4 showed evidence of a differential boost. Evidence of differential boost was found in 29% (2 of 7) of studies that examined fewer than 500 students (SWDs and general population) and 44% (4 of 9) of studies with more than 500 students. Studies based on large-scale assessments showed evidence of the interaction hypothesis in 4 out of 8 cases. Evidence of differential boost was found in 50% (3 of 6) of the studies that examined elementary school students and 30% (3 of 10) of the studies that examined middle school and high school students. When reading was the measured construct, 4 out of 10 (2 elementary and 2 secondary) studies showed evidence of differential boost compared to 2 out of 6 (1 elementary and 1 secondary) studies when mathematics was the construct measured.
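The subgroup tallies reported in the preceding paragraph can be put on a common percentage scale with a few lines of arithmetic. This is only a convenience sketch: the counts are copied from the text, while the dictionary labels and the helper name are illustrative choices.

```python
# (studies showing differential boost, studies examined), as reported in the text
counts = {
    "fewer than 500 students": (2, 7),
    "more than 500 students": (4, 9),
    "elementary school": (3, 6),
    "middle and high school": (3, 10),
    "reading": (4, 10),
    "mathematics": (2, 6),
}


def percent(found, total):
    """Percentage of studies showing differential boost, rounded to a whole number."""
    return round(100 * found / total)


rates = {label: percent(found, total) for label, (found, total) in counts.items()}

for label, rate in rates.items():
    found, total = counts[label]
    print(f"{label}: {found} of {total} ({rate}%)")
```

Because each category contains ten or fewer studies, a single study shifts these percentages by 10 points or more, so the differences between categories should be read cautiously.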
Elbaum (2007) used a counterbalanced randomized experimental design to test 327 (nSWD = 187) middle school and 316 (nSWD = 204) high school students on two equivalent alternative forms of a mathematics test designed to mimic a statewide test. All students were tested both in the standard test-taking condition and with the aid of an oral accommodation, and the data were analyzed using repeated-measures ANOVA. Test condition was statistically significant, p < .001, as was the main effect of disability status, p < .001. The interaction effect of disability by test condition was also significant, p < .001, and partial η² = .02, indicating that SWoDs benefited more from the read-aloud accommodation than SWDs. Even though SWoDs benefited more from the accommodation, when the two groups were subcategorized into the upper and lower 50% scoring students (under the accommodation), the differences of the subcategory effect sizes within disability status were similar. In other words, the top 50% scoring SWDs and SWoDs had a much larger, and relatively equivalent, effect from the accommodation than those in the lower 50% of their respective group. Elbaum (2007) noted that regardless of disability status, the effect of the accommodation depended on the students’ prior mathematics skill set.
Randall and Engelhard (2010a) examined the differential boost hypothesis for a third- to fourth-grade band and a sixth- to seventh-grade band of students on a state reading test when an oral accommodation or resource guides were provided. Students were nested within schools that were randomly assigned to one of the three conditions. The read-aloud accommodation group included 945 students (nSWD = 459) in third grade whereas the resource guide group included 995 (nSWD = 428) students in sixth grade. Students were pretested at the end of third grade and sixth grade under the standard administration of the state test and then retested in the fourth and seventh grades under their assigned condition. Using repeated-measures ANOVA, the effect size for performance gains for Grade 4 SWDs when oral administrations were provided was .22, whereas it was .02 for SWoDs, indicating that Grade 4 SWDs benefitted more from the oral administration as compared to SWoDs. Grade 4 SWDs who had the use of a resource guide test were negatively affected by this accommodation (effect size = −.12), suggesting that the use of a resource guide can hinder performance. For Grade 7 students, SWDs and SWoDs had a similar boost in mean test scores when they had an oral accommodation (effect size = .17 and .20, respectively). These results provide support for the differential boost hypothesis for elementary-grade students when they receive an oral accommodation on a reading test but not for the middle-grade students. As suggested by the authors, the construct may be altered when providing read-aloud accommodations for elementary students.
Earlier studies that have examined the effects of a read-aloud accommodation for reading tests have indicated no significant gains for SWDs and SWoDs (McKevitt & Elliott, 2003) or similar gains for both students with learning disabilities and students without learning disabilities on reading, science, and math commercially published tests (Meloy, Deville, & Frisbie, 2000). However, these studies had relatively small sample sizes. In addition, differences in item types, tested content, and grade levels as well as the composition of the SWD samples may account for differences in the results.
Other studies have tried to control for student background variables when examining the effect of oral administration accommodations using relatively large sample sizes. Using regression procedures, Huynh, Meyer, and Gallant (2004) examined the effect of oral administration accommodation on student performance on a Grade 10 state mathematics test and reported that after controlling for student background variables, including gender, ethnicity, and Grade 8 mathematics and reading test performance, SWDs under oral administration performed better than SWDs under the standard administration (effect size = .21) and that both of these groups performed more poorly than SWoDs. As they indicated, this effect size would be considered small by Cohen’s criteria; however, if a 50% proficient rate is assumed for all students, the effect size of .21 represents a proficient rate of 58%, indicating an improvement of 8 percentage points for these students.
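The conversion from an effect size of .21 to a proficient rate of 58% can be reproduced with a short calculation. The sketch below assumes normally distributed scores with a common standard deviation and a fixed cut score; that is one plausible reading of the conversion, not a method the passage spells out, and the function name is hypothetical.

```python
from statistics import NormalDist


def shifted_proficiency_rate(baseline_rate, effect_size):
    """Proficient rate after the score distribution shifts upward by effect_size SDs.

    Assumes normal scores, equal variances, and an unchanged cut score.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Cut score, in SD units, that yields the baseline proficient rate
    cut = std_normal.inv_cdf(1 - baseline_rate)
    # Proportion above the same cut once the distribution has shifted
    return 1 - std_normal.cdf(cut - effect_size)


rate = shifted_proficiency_rate(0.50, 0.21)
print(f"{rate:.2f}")  # 0.58
```

Under these assumptions the same effect size translates into a smaller change in the proficient rate when the baseline rate is far from 50%, because fewer students score near the cut.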
In an effort to target students who would be most likely to benefit from an oral administration, Laitusis (2010) examined the differential boost hypothesis for students with reading-based learning disabilities (SRLDs). Using an experimental design with repeated measures, Laitusis (2010) found that 1,181 fourth-grade and 847 eighth-grade SRLDs benefited differentially from read-aloud accommodations on a commercially published reading comprehension test after controlling for reading fluency and ceiling effects, with the differential performance boost greater in Grade 4 than Grade 8. For Grade 4, the effect size for SRLDs was .57 and for non-SRLDs it was .14. For Grade 8, the effect size was .32 for SRLDs and .06 for non-SRLDs. This pattern was also observed when controlling for reading fluency. The oral administration studies have generally found a greater effect for elementary school students than middle and high school students.
Lewandowski, Lovett, and Rogers (2008) explored the effect of the extended time accommodation for SRLDs. Sixty-four students, 32 with disabilities, in Grades 10 to 12 were provided a subtest of the Nelson-Denny Reading Test (Brown, Fishco, & Hanna, 1993) in which they were told to change the pen color at the 13-minute mark and asked to drop their pens after 19.5 minutes. This change was to determine differences under the standard condition (13 minutes) and an extended accommodation (additional 6.5 minutes) in which the number of items correct, number of items attempted, and percentage correct were marked at each time point. Using three 2 × 2 mixed-model ANOVA procedures, no evidence of a differential boost was established when comparing SWoD to students with reading disabilities for items correct (d = 2.68 standard time, d = 3.39 extended time), items attempted (d = 2.39 standard time, d = 3.13 extended time), nor percentage correct (no significant interaction of disability × accommodation), indicating that extended time on a reading test did not affect students with reading disabilities differently than SWoD.
Additionally, H. Li (2013) used hierarchical linear modeling (HLM) to examine the effects of read-aloud accommodations for SWDs and SWoDs through a meta-analysis of 27 studies, published and unpublished. HLM is a valuable statistical method for meta-analyses because it provides a means to explain variations in effect sizes. The majority of the SWDs in these studies had learning disabilities. The variation of the 128 effect sizes across the studies was captured in the Level 1 model and the Level 2 model allowed for the identification of potential sources of this variation, including disability status, construct measured, delivery method, grade level, and extra time. The results indicated that the effect of read-aloud accommodations for SWDs was significantly stronger than the effect for SWoDs (differential boost due to disability status SDs ranged from 0.161 to 1.75). The accommodation effect was also significantly stronger when the subject area was reading and when the test was read by human proctors than when video/audio players or computers were used. Furthermore, the effect of read-aloud accommodations was significantly stronger for elementary school students than middle school students.
Efficacy of Test Modifications for SWDs
Federal legislation allowed states to use modified versions of their general assessment for up to 2% of the total student population to be counted as proficient (U.S. Department of Education, 2007). These were assessments based on modified academic achievement standards for SWDs who had an individualized education program and were unlikely to reach proficient on the general state test. These tests were intended to provide universal access and reduce cognitive load while providing valid score inferences regarding the same constructs measured on the general test. Elliott and others (2010) and Kettler and others (2011) examined the impact of item and test modifications on performance and score precision for SWDs and SWoDs. Three groups of eighth-grade students (n = 755) defined by eligibility and disability from four states were administered original and modified versions of reading and mathematics tests. There were approximately equal numbers of SWDs who were eligible for the modified test, SWDs who were not eligible, and SWoDs. The most common item modifications were removal of a distractor, simplification of language, addition of graphic support, and reorganization of layout. An experimental design was used with students experiencing each of three conditions (original, modified, and modified with reading support) but receiving each item in only one of the three conditions. Using a meta-analytic approach to estimate coefficient alpha, they reported minimal changes in score reliability across groups and conditions (between .88 and .94 for reading and between .85 and .90 for math). A differential boost was also reported when controlling for ability level using item response theory (IRT), that is, there were larger reductions in item difficulty for SWDs who would be eligible to take the modified test than students who would not be eligible. 
In addition, the results supported two specific modifications: shortening the item stem and using visuals for mathematics items. Additional guidelines on how to modify items to improve accessibility are provided by Kettler (2011).
Efficacy of Test Accommodations for ELLs
Research in the early 2000s has shown that the achievement gap between ELLs and non-ELLs is greater on tests that are linguistically complex, that is, tests that have a greater English language demand (Abedi, 2003; Abedi & Lord, 2001). Items with unnecessary linguistic complexity in tests such as mathematics and science are a source of construct-irrelevant variance, introducing unintended multidimensionality, and have fairness implications when assessing ELLs (Abedi, Leon, & Mirocha, 2000; Abedi, 2004). Examinees’ limited proficiency in the language in which tests are administered is a threat to the validity of score interpretations when language is not the construct being measured (AERA, APA, & NCME, 2014). The validity of score inferences can be threatened also by the linguistic skills of individuals who write or adapt items, and it has been proposed that language-based accommodations are needed for those administering tests to ELLs and those who are scoring their written responses (Solano-Flores, 2008).
Linguistic modification as a form of accommodation makes the test more accessible to ELLs. Abedi and his colleagues examined the impact of linguistic modification on ELL performance (Abedi, 2009; Abedi et al., 2000). As an example, an early study by Abedi et al. (2000) examined the performance of Grade 8 students with different accommodations, including modified linguistic structures, extra time, and provision of a glossary, and found that only the linguistic accommodation narrowed the performance gap between ELLs and non-ELLs significantly.
Similar to the section on the efficacy of accommodations for SWDs, published studies in peer-reviewed journals evaluating the differential boost hypothesis for ELLs were examined from 2004 to 2013. Twelve of 16 (75%) studies that implemented an experimental design found evidence of a differential boost. Studies with a sample size less than 500 showed evidence of a differential boost 10 out of 19 (53%) times whereas the 2 studies that used a sample size greater than 500 did not show evidence of differential boost. Results of studies based on large-scale assessments showed evidence of a differential boost in all 8 of the cases. Seven of the 11 studies using elementary students demonstrated differential boost whereas 4 of the 11 studies using secondary-level students showed evidence of differential boost. Seventy-eight percent of the studies using science tests demonstrated differential boost (6 of 7 elementary, 1 of 2 secondary) whereas those using a mathematics test showed differential boost in 40% (5 of 11 elementary, 3 of 9 secondary) of the cases.
Abedi (2009) examined the accessibility and validity of computer-based testing for ELLs and non-ELLs. He examined several accommodations used in mathematics tests administered to ELLs in Grades 4 and 8, including a computer administration with a pop-up glossary, a customized English dictionary, and extended time (Grade 4 only). The math test included publicly released items from the National Assessment of Educational Progress (NAEP) and the Trends in International Mathematics and Science Study. Items were rated in terms of their linguistic complexity on a scale from 0 to 4. A latent composite of multiple measures of students’ English proficiency was used as a covariate and a proportional random sampling method was used to assign both ELLs and non-ELLs to each of the accommodation conditions and a nonaccommodated condition. For both Grades 4 and 8, ELL student performance was highest in the computer condition. For Grade 4, adjusting for initial differences in the level of English proficiency among the ELL groups, a 0.5 SD difference between the means for the computer condition and the nonaccommodated condition was reported. The obtained coefficient of determination, η² = .03, indicated that the computer as an accommodation explains 3% of the test score variance for ELLs. The other statistically significant difference was for the comparison of extended time and the nonaccommodated condition, with an η² = .024. For Grade 8, adjusting for differences in the level of English proficiency among the ELL groups, the adjusted mean of 10.66 for the computer condition was significantly higher than the adjusted mean of 9.11 for the nonaccommodated condition. As a comparison, there were no statistically significant differences between the means of the accommodated conditions and the nonaccommodated condition for non-ELLs in both Grades 4 and 8, providing some evidence for the effectiveness of the computer accommodation for ELLs.
The effects of the accommodations for two levels of linguistic complexity of the items were also examined. Using a multivariate analysis of covariance model, for Grade 4 ELLs all three accommodations made a significant difference on performance on the more linguistically complex items, and two of the accommodations (computer and extra time) made a significant difference on the performance on the less linguistically complex items. For Grade 8 ELLs, the computer accommodation produced significant differences on the more linguistically complex items, and there were no significant differences on the less linguistically complex items. Grade 8 ELLs spent nearly 3 times as much time glossing as compared to non-ELLs, providing additional support for the computer accommodation (Abedi, 2009).
Studies have examined the effects of academic language versus linguistic complexity on ELL performance. Solano-Flores, Barnett-Clarke, and Kachchaf (2013) compared ELL and non-ELL performance on Grades 4 and 5 mathematics content knowledge and academic language tests consisting of multiple-choice items. Pretests were administered prior to instruction, followed by posttests after instruction. The content knowledge items focused on computation and problem solving, and the academic language items focused on terms used to refer to mathematical concepts (e.g., mixed number). The study included 579 (ELLs = 338) and 564 (ELLs = 333) students in Grade 4 and 464 (ELLs = 231) and 484 (ELLs = 254) in Grade 5 for the content knowledge and academic language assessments, respectively. For both ELLs and non-ELLs, the percentage of items correct was higher for the academic language tests than the content knowledge tests and higher for the posttest than the pretest. The authors argued that because the academic language items have a greater number of words, on average, the performance differences can be attributed, in part, to the emphasis on academic language in instruction and not due to the linguistic complexity of the items. Repeated-measures ANOVAs indicated that non-ELLs had significantly higher gain scores than ELLs for both tests and in both grades (partial η² = .029–.176). The gain score differences for the ELL group as compared to non-ELL group tended to be greater for the academic language test than the content knowledge test. In summary, both groups performed better on the academic language test than the content test, but non-ELLs outperformed ELLs on both tests. In terms of score gains, smaller gains were found for the academic language tests, in particular for ELLs.
Based on these results, the authors argued that ELLs may not have developed a meaning-making system prior to instruction to meet the different sets of interpretative demands of the two types of items, and suggested that caution is needed in including ELL students in large-scale testing before they have developed academic language proficiency. They also argued that academic language should be taken into consideration during test development and that the analysis of the semiotic structure of items can help ensure that the language demands of items are consistent with the targeted construct.
Several studies that conducted a meta-analysis to examine differential boost for ELLs were not included in the review described previously. Kieffer, Lesaux, Rivera, and Francis (2009) examined accommodations for ELLs on large-scale assessments. A meta-analysis was performed using 11 studies with 23,999 participants (6,554 of them ELLs) to examine the effects of seven different types of accommodations: simplified English, English dictionaries or glossaries, bilingual dictionaries or glossaries, tests in the native language, dual-language test booklets, dual-language questions for English passages, and extra time. Only English dictionaries and glossaries had a positive and statistically significant average effect size. When the 4 quasi-experimental studies were set aside and only randomized experiments were included, the average effect size was still statistically significant.
In 2012 Kieffer, Rivera, and Francis expanded the 2009 meta-analysis to include more recent studies. Their results indicated that the achievement gap between ELLs and non-ELLs on large-scale assessments was reduced by between 9% and 19% when students were given simplified language, by 11% to 21% when they were given English dictionaries, and by 15% to 31% when they were provided extra time or an untimed assessment. However, when the language of the test was matched to the language of instruction there was no reduction in the performance gap.
Pennock-Roman and Rivera (2011) performed a meta-analysis using 14 studies that either randomly assigned ELLs to test accommodation versus control conditions or used repeated measures in counterbalanced order. In their analysis, they accounted for language proficiency, test format, and time constraints, which allowed them to examine the factors that led to effective accommodations. When students were provided with sufficient time, most accommodations did improve performance of ELLs. When given enough time, the most effective accommodations were dual language, the bilingual glossary, and the English glossary conditions (Glass’s d effect size values were .299, .247, and .229, respectively). The pop-up English glossary was the most effective accommodation under restricted time conditions (Glass’s d was .285). The most effective accommodation for ELLs with low English language proficiency and/or who were receiving instruction in Spanish, their native language, was a Spanish test version as compared to an English test version (Glass’s d effect size values of .95 and 1.45 for the Spanish version as opposed to .13 and .40 for the English version). The Spanish version accommodation, however, was not effective for ELLs with intermediate levels of English language proficiency, students receiving individualized education programs, and students with a home language background other than Spanish. These results suggest that construct-irrelevant variance due to English proficiency was reduced for some accommodations and that the reductions depended on level of English proficiency, time constraints, and test format, indicating the need to account for test format, language proficiency, and test time when examining the effects of test accommodations for ELLs.
Using HLM, a meta-analysis was conducted to investigate the effects of accommodations on the performance of ELLs (H. Li & Suen, 2012). This analysis included data from 19 studies with 85 effect sizes, including journal articles, reports, theses, and conference papers. It also included a wide range of tested subjects. The Level 1 model in HLM examined the variation of effect sizes of the accommodations across studies and the Level 2 model attempted to explain the potential sources of this variation. The variables considered to account for variation of effects of accommodations were ethnicity, grade level, test subject, English proficiency, and accommodation type. The results indicated that, on average, the accommodated ELLs scored 0.157 SDs above the nonaccommodated ELLs, suggesting that accommodations improved their test performance, and the estimated variance of the effect sizes was significant. Of the Level 2 variables examined, only a low level of English proficiency had a significant effect, indicating that those with low levels of English proficiency had higher accommodation effects. The other four variables did not help explain the variability in effect sizes. For example, there were no significant differences between math/science and other subjects in terms of accommodation effects. Furthermore, accommodation type (linguistic simplification, dual-language booklet, Spanish version, dictionary and glossary, extra time) was not statistically significant. The final model indicated that accommodated ELLs who did not have low English proficiency scored, on average, 0.079 SD units above the nonaccommodated groups, whereas accommodated low English proficiency ELLs scored up to 0.569 SD units above their nonaccommodated groups, providing some evidence of the benefits of accommodations for ELLs, especially for those with low English proficiency. 
Also, approximately 38% of the variability in the effect sizes was explained by the students’ English proficiency in the final model. It should be noted, however, that the studies used in these analyses did not include non-ELLs as a control group, and thus further research on the appropriateness of accommodations is warranted.
Efficacy of Test Modifications for ELLs
Shaftel, Belton-Kocher, Glasnapp, and Poggio (2003) examined the effects of item modifications on student performance for ELLs and non-ELLs using a state mathematics test for Grades 4, 7, and 10. Sample sizes ranged from 177 (ELL) to 1,030 (non-ELL) across the three grades. Using a counterbalanced design for a modified version of the state test (simplified language) with an unmodified version, the non-ELL group showed no significant differences. To analyze differences within the ELL group, data were evaluated from ELLs who took the original version of the test in spring 2000 with matched items from the modified version on the spring 2001 assessment. Using a common item anchor block equating design and an analysis of covariance with items as the covariate, it was found that at Grade 7, ELLs taking the original items tended to perform better than the ELLs taking the modified version of the test. At Grade 4, ELLs taking the modified version had a significantly higher adjusted mean score of 0.49 points whereas students in Grade 10 had a significant difference of 0.91 compared to the nonmodified version. The researchers also used IRT procedures to obtain estimates for student ability and item parameters for the modified and unmodified versions of the tests for ELLs. In Grade 4, item discrimination and difficulty estimates were nearly identical using the two versions of the test. For Grade 7 there were differences in the mean item difficulty estimates with the modified version being more difficult but no differences were obtained in ability or discrimination estimates. Finally for Grade 10, the mean ability estimate for the modified version was .303 whereas the original version yielded a mean ability estimate of .003. Even with significant differences using analysis of covariance, the differences were very small with large sample sizes.
Taking this into account with minimal differences in the IRT estimates, the authors concluded that there is no evidence that the modified version yielded differences compared to the unmodified version.
Measurement Invariance
Although the observance of a differential boost can support the use of accommodations for SWDs and ELLs, it can be challenged because better performance of SWDs and ELLs is not equivalent to assessments that provide valid score inferences (Sireci et al., 2005). It is necessary to examine the extent to which the psychometric properties of a test are invariant across groups of students and test administration formats. The establishment of measurement invariance across ELLs and SWDs is required to make valid score interpretations for individual students and for group comparisons. If accommodations function as intended, item and test measurement properties should be similar when the test is administered to ELLs (or SWDs) who require accommodations and receive them and ELLs (or SWDs) who do not require accommodations. Similarly, these item and test measurement properties should be similar to those for non-ELLs and SWoDs. A measure can be considered invariant when members of different subpopulations who have the same standing on the construct being assessed obtain the same score on the test with the same precision.
Measurement invariance studies directly examine the item and test measurement characteristics for ELLs and SWDs. Invariance of the internal structure of the test across subgroups, or factorial invariance, is needed to ensure comparable and valid score interpretations. However, it is important to recognize that factorial invariance alone is insufficient evidence to ensure valid score interpretations and uses. Other aspects of comparability need to be examined across subgroups of students, including the precision of scores and accuracy of classification rates, the relationship between test scores and other measures (i.e., external structure evidence), cognitive processes evoked by the test items, and consequences of test score inferences (AERA, APA, & NCME, 2014).
This section addresses reliability and score precision for ELLs and SWDs, computer-adaptive testing to improve measurement accuracy and precision, language of testing and rater language background as a source of measurement error for ELLs, internal structure evidence for ELLs and SWDs using factor analysis and differential item functioning (DIF), external structure evidence for ELLs and SWDs, and equating invariance for ELLs and SWDs.
Reliability, Measurement Error, and Score Precision
A relatively large percentage of ELLs and SWDs demonstrate lower performance on large-scale assessments (Abedi, Leon, & Kao, 2007; Thurlow, Bremer, & Albus, 2011). This raises concerns in assessing ELLs and SWDs because test scores tend to be less precise at the ends of the score scale, resulting in less reliable scores for many ELLs and SWDs. This was demonstrated for SWDs using data from a fourth- and eighth-grade mathematics test and English language test from one state (Laitusis, Buzick, Cook, & Stone, 2011). The results indicated that SWDs scoring at chance level ranged from 12% to 22%, whereas only 1% to 3% of SWoDs scored at chance level, resulting in less precise scores for many SWDs. Internal consistency indices, such as coefficient alpha, are commonly used as measures of test score reliability. These indices are affected by restricted ranges of performance, which is common in the assessment of SWDs and ELLs. Furthermore, using these indices when some items measure an irrelevant construct in addition to the intended construct for subgroups will produce more measurement error and lower reliability estimates.
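Coefficient alpha equals k/(k − 1) multiplied by one minus the ratio of summed item variances to total-score variance, so it depends directly on how much total-score variance a group shows. A minimal sketch with made-up response patterns (hypothetical data, chosen only to make the arithmetic easy to follow; the function name is an assumption) shows how restricting the sample to lower scorers reduces alpha even though the items are unchanged.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Cronbach's alpha for a list of per-student item-score lists."""
    k = len(scores[0])
    # Variance of each item across students.
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total scores across students.
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses of five students to four items (1 = correct).
full_group = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

# Restricted range: keep only students with a total score of 2 or less.
low_scorers = [row for row in full_group if sum(row) <= 2]

alpha_full = coefficient_alpha(full_group)
alpha_low = coefficient_alpha(low_scorers)
print(round(alpha_full, 3), round(alpha_low, 3))
```

With these illustrative data, alpha falls from .80 in the full group to about .44 among the low scorers, which is the pattern of concern when a subgroup's scores cluster near the bottom of the scale.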
Reliability and Score Precision for ELLs
Abedi (2002, 2003) examined the precision of measurement across examinee groups and reported that internal consistency reliabilities (coefficient alphas) for scores from commercially developed tests were consistently lower for ELLs than non-ELLs in Grades 2 and 9 in reading, mathematics, language, science (Grade 9 only), and social science (Grade 9 only). For Grade 2 the differences in the reliability coefficients for ELLs and non-ELLs ranged from .013 to .062 and for Grade 9 they ranged from .096 to .120. As indicated by the authors, the increase in reliability differences from the lower to upper grades may be due to more complex language structures in tests in the upper grades. For 10th- and 11th-grade students Abedi (2003) reported that on reading, science, and math commercially published tests the coefficient alphas were higher for non-ELLs than for both nonaccommodated ELLs and accommodated ELLs with one exception (coefficient alphas were higher for both ELL groups than the non-ELL group for the Grade 10 math test). As indicated by Abedi (2003) a number of factors, including language background, restriction of range, socioeconomic status (SES), and OTL, may contribute to the observed differences in reliabilities.
Young et al. (2008) reported coefficient alphas for Grade 5 and Grade 8 students taking mathematics and science state tests for NCLB accountability. The coefficients were higher for non-ELLs (.878–.939) than for both accommodated ELLs (.603–.911) and nonaccommodated ELLs (.750–.912), with sample sizes ranging from 183 to 1,246 for the accommodated ELL groups. The accommodations included translated directions and access to glossaries. For the science tests, the reliabilities were higher for the nonaccommodated ELLs than for the accommodated ELLs, whereas this pattern did not hold for the mathematics tests.
Generalizability theory has also been used to examine the precision of ELL scores on large-scale assessments. Generalizability theory (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972) allows for examining the contribution of various sources of measurement error to the generalizability of test scores to the larger construct domain. D. Li and Brennan (2007) examined differential precision of a commercially published reading comprehension test using univariate and multivariate person × item and person × (item: passage/process) generalizability studies. There were eight reading passages, and the items associated with the reading passages were categorized into three process areas: factual understanding, inference and interpretation, and analysis and generalization. Using these designs they were able to examine three sources of error: reading passages, processes, and items nested within the cross-classification of passages and cognitive processes. The data were from 500 ELL and 500 non-ELL students. The results indicated that reading passages had a larger variance for ELLs than for non-ELLs, which was primarily due to the larger variability of one of the cognitive process areas, generalization. The generalizability coefficients from the multivariate design were consistently lower for the ELLs than the non-ELLs; the greatest difference (.18) was for the generalization area. The generalizability coefficients for the three process areas ranged from .754 to .769 for non-ELLs and from .579 to .672 for ELLs. The generalizability coefficient for the composite was .898 for non-ELLs and .831 for ELLs, suggesting differences in the validity of the score interpretations for ELLs and non-ELLs.
Lakin and Lai (2012) examined differential reliability of a commercially published ability test with measures for verbal, quantitative, and nonverbal/figural reasoning across 144 ELLs and 236 non-ELLs in Grades 3 and 4. Such tests are used in conjunction with other information to provide differential instruction to students, typically in the elementary grades. The non-ELL group performed significantly better than the ELL group on the verbal, quantitative, and nonverbal tests (Cohen’s d indices 1.13, .78, and .68, respectively). It is important to note that the standard deviation for the verbal test was nearly twice as large for the non-ELL group as for the ELL group, 14.37 and 7.67, respectively, and about 40% larger for the quantitative test, 12.75 and 9.09, respectively, whereas the standard deviations for the nonverbal test were more similar. Using univariate and multivariate person × item generalizability studies, they observed that verbal and quantitative reasoning skills were measured less precisely for ELLs (generalizability coefficients of .838 and .883, respectively) than they were for non-ELLs (generalizability coefficients of .961 and .953, respectively), which is partly a function of the restricted range of scores for ELLs. From their analysis, they estimated that ELLs would need to respond to more than twice as many quantitative items and more than 3 times as many verbal items in order to obtain comparable precision as non-ELLs. The composite score across the three areas had relatively high generalizability estimates for both ELLs (.95) and non-ELLs (.98). As the authors indicated, although the reliability of the composite scores is high and similar across the two groups, the use of student performance profiles across the three reasoning tests to inform instructional decisions at the student level should be done cautiously for ELLs.
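The "more than twice" and "more than 3 times" figures can be roughly checked from the coefficients reported above. The sketch below assumes a single-facet person × item design with relative error, in which lengthening a test follows the Spearman-Brown relation. This is a simplification and not necessarily the D-study procedure Lakin and Lai used; the function name is an assumption.

```python
def lengthening_factor(current, target):
    """Factor by which test length must grow to raise a coefficient.

    Solves the Spearman-Brown relation
        target = n * current / (1 + (n - 1) * current)
    for n, the multiple of the current number of items.
    """
    return (target * (1 - current)) / (current * (1 - target))


# Coefficients reported in the text: ELL value first, non-ELL value second.
quantitative = lengthening_factor(0.883, 0.953)
verbal = lengthening_factor(0.838, 0.961)

print(round(quantitative, 2), round(verbal, 2))
```

Under this simplified model the factors come out at roughly 2.7 for the quantitative test and 4.8 for the verbal test, which is consistent with the authors' statement that ELLs would need more than twice and more than 3 times as many items, respectively.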
Reliability and Score Precision for SWDs
The reliability of test scores for SWDs in Grade 8 when reading comprehension passages were segmented (with relevant items) was examined by Abedi et al. (2010). Most of the SWDs had specific learning disabilities (107 out of 117). For the original version of the reading test, the researchers reported a .52 coefficient alpha for SWDs and .78 for SWoDs. Similar to the other studies, there was greater variability in the scores for SWoDs (SD = 4.58, n = 302) than for SWDs (SD = 3.32, n = 52). SWDs performed better on the segmented version of the test and had more variability in their scores (SD = 4.20, n = 58), resulting in a higher coefficient alpha, .69, as compared to the original version. Test statistics for SWoDs were similar across versions, with a coefficient alpha of .79 for the segmented version (SD = 4.67, n = 294).
With larger sample sizes, Huynh and Barton (2006) reported coefficient alphas of .875, .892, and .895 for SWDs who received an oral accommodation on a state Grade 10 reading test, SWDs who did not receive an accommodation, and SWoDs, respectively. Although the two SWD group means were lower than the mean for the SWoDs, the standard deviations were not as disparate as in the previously discussed studies (accommodated SWDs SD = 76.3, nonaccommodated SWDs SD = 89.3, and SWoDs SD = 89.6). Huynh et al. (2004) reported similar coefficient alphas for the state’s Grade 10 mathematics test for SWDs who received an oral accommodation (.878), nonaccommodated SWDs (.881), and SWoDs (.895).
For Grade 4 students, Randall and Engelhard (2010b) obtained coefficient alphas of .940 for SWDs who received an oral accommodation on a reading state test and .916 for SWDs who did not receive an accommodation. The coefficient alphas were .976 for SWoDs who received an oral accommodation on the reading test and .960 for nonaccommodated SWoDs. This pattern did not hold for Grade 7 students. For Grade 7 students, the coefficient alphas were .903 for oral accommodated SWDs and .918 for nonaccommodated SWDs; and .898 for oral accommodated SWoDs and .954 for nonaccommodated SWoDs. These results may suggest that oral accommodations lead to increased precision for SWDs at lower grade levels but not for students at upper grade levels.
Computer-Adaptive Testing to Increase Measurement Accuracy and Precision
Stone and Davey (2011) discuss some of the advantages of using computer adaptive testing (CAT) with SWDs, including more precise measurement of SWDs and ELLs (G. Phillips, 2009), capability of tracking which accommodations are being used and when they are employed by students (Thurlow, Lazarus, Albus, & Hodgson, 2010), and allowing for a wider array of accommodations for SWDs (Thurlow et al., 2010). The latter two are advantages for any computer-based test. Computer-based testing, whether it is adaptive or not, can allow for more flexibility in tailoring the access tools (e.g., magnification, color contrast) and supplemental accessibility information (audio, braille, tactile versions of item content, simplified vocabulary) to the individual student so as to help ensure students have access to the targeted test construct (Russell, 2011). Consequently, these accessibility tools and supplemental information can enhance the validity of the test score inferences for students. As indicated by Russell, depending on a student’s access needs, flexible tailoring may require an adaptation to the presentation of item content [e.g., magnifying], the interaction with that content [e.g., masking content to decrease distractions], the response mode [e.g., assistive communication device to produce a response], or the representational form in which content is communicated [e.g., audio or alternate language]. (p. 9)
It is imperative to explicitly specify the knowledge and skills assessed by the item. This will allow for evaluating whether the use of the accessibility tools and information promotes valid score interpretations about student achievement of the targeted construct or whether such use alters the intended construct (see Russell, 2011, for additional challenges to accessible test design).
The tailoring of items to the ability level of the students is attractive in the assessment of SWDs and ELLs because current tests tend to target students in the middle of the score range, resulting in less precise measurement of SWDs and ELLs who tend to perform at the lower end of the score scale. Another attractive feature of CAT is the potential for the administration of fewer items to reach sufficient measurement precision. Smarter Balanced assessment has adopted a CAT system and PARCC will use computer-delivered assessments.
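How a CAT tailors items to a student can be sketched with a common selection rule: administer the unused item that provides the most information at the current ability estimate. The sketch below assumes a Rasch model and a small hypothetical item bank. It is not a description of the Smarter Balanced algorithm, and the function names and difficulty values are illustrative assumptions.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)


def select_next_item(theta, difficulties, administered):
    """Index of the unused item with maximum information at theta."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))


# Hypothetical item bank: difficulties on the logit scale.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]

# A student with a low provisional ability estimate is given an easy item.
first = select_next_item(-1.2, bank, set())
second = select_next_item(-1.2, bank, {first})
print(first, second)
```

Because Rasch information peaks where item difficulty equals ability, a low-scoring student is routed to easier items than a fixed form centered on the middle of the scale would provide, which is the source of the precision gain described above.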
There are also some challenges in using CAT for SWDs because these students have divergent learning profiles and may find items that are typically more challenging easier than some typically less challenging items. As indicated by Thurlow in an ESEA report (“ESEA Reauthorization,” 2010), computer adaptive practices must be transparent enough to detect when a student is inaccurately measured because of splinter skills common for some students with disabilities, for example, with poor basic skills in areas like computation and decoding, but with good higher level skills, such as problem solving, built with appropriate accommodations to address the barriers of poor basic skills. (p. 43)
Stone and Davey (2011) discussed some of these challenges. Item response functions may differ for SWDs due to, for example, accommodation status, divergent learning profiles, less access to computers, and less familiarity with keyboarding. Similar concerns arise when assessing ELLs because of their divergent learning profiles. Consequently, SWDs and ELLs should be included in the calibration sample for the item bank, and if not, subgroup analyses are needed to examine the appropriateness of the item parameters for the different groups. Divergent learning profiles may also lead to less stable and less accurate estimation of SWDs’ standing on the latent construct. Stone and Davey (2011) reiterated the need for the detection of discrepant response patterns for SWDs. As they indicated, if SWDs respond incorrectly to the first few easy items due to divergent learning profiles, they may not be administered more challenging items that are within their proficiency range. Others have argued that the use of CAT will allow for spanning of grade-level content on the assessments designed based on the Common Core State Standards, resulting in more precise measurement of students at the extreme levels, including SWDs and ELLs, although others oppose practices that may resemble off-grade–level testing (Way et al., 2010).
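One simple way to flag a discrepant response pattern is to count Guttman errors, that is, pairs of items in which the easier item is answered incorrectly and the harder item correctly. The sketch below is an illustration under that assumption; it is not the detection procedure proposed by Stone and Davey (2011), and the function name and difficulty values are hypothetical.

```python
def guttman_errors(responses, difficulties):
    """Count item pairs where an easier item is wrong and a harder one is right.

    responses:    list of 0/1 item scores for one student
    difficulties: list of item difficulties (same order as responses)
    """
    order = sorted(range(len(responses)), key=lambda i: difficulties[i])
    errors = 0
    wrong_easier = 0  # number of easier items answered incorrectly so far
    for i in order:
        if responses[i] == 0:
            wrong_easier += 1
        else:
            # Each earlier (easier) incorrect item forms one error pair
            # with this correctly answered harder item.
            errors += wrong_easier
    return errors

# Hypothetical difficulties, easiest to hardest.
difficulties = [-2.0, -1.0, 1.0, 2.0]
expected_pattern = [1, 1, 0, 0]   # easy items right, hard items wrong
splinter_pattern = [0, 0, 1, 1]   # easy items wrong, hard items right
```

A pattern such as `splinter_pattern` yields the maximum count for four items, which is the kind of profile that could lead an adaptive algorithm to underestimate a student's proficiency after the first few easy items.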
Language of Testing as a Source of Measurement Error for ELLs
In an early study examining the effects of testing ELLs in their native, first language (L1) as opposed to being tested in English, their L2, Solano-Flores, Lara, Sexton, and Navarrete (2001) found that native speakers of Spanish, Haitian-Creole, and Mandarin Chinese did not necessarily benefit when math and science items were administered in their native language as compared to English. They conducted a student × rater × item × language of administration generalizability study and found that the largest variance component across the three linguistic groups was the student × item × language, indicating that some students performed on some items better when administered in L1, whereas other students performed better on some items when administered in L2.
In an effort to disentangle the measurement error when testing ELLs and to demonstrate the linguistic heterogeneity of ELLs, Solano-Flores and Li (2006, 2009, 2013) examined language and dialect as sources of measurement error using generalizability theory. In the Solano-Flores and Li (2006) generalizability study, native speakers of two dialects of Haitian-Creole in Grades 4 and 5 were given a set of NAEP math items either in the standard dialects of L1 and L2 or in their local dialect and the standard dialect of L1. They found that the student × item × language interaction accounted for the largest amount of variation in test scores for students tested across languages (39%; English and standard dialect) and the student × item × dialect interaction accounted for the largest amount of test score variation for students tested across dialects (33% for site A and 38% for site B; standard dialect and local dialect). These results indicate that students perform better in dialect A than in dialect B for some items but perform better in dialect B than in dialect A for other items. As indicated by the authors, the interaction of dialect with student and item is as important a source of measurement error as the interaction of language with student and item. Furthermore, they demonstrated that the number of items needed to obtain dependable measures of achievement may vary for students speaking different dialects of Haitian-Creole, implying the need for dialect-level analyses in the testing of ELLs. To further investigate dialect as a source of measurement error, native Spanish-speaking ELLs in Grades 4 and 5 were given the same set of NAEP math items in both their L1 and L2 (Solano-Flores & Li, 2009).
Using student × item × rater × language and student × item × rater × dialect generalizability studies for each site, they found that the student × item × language interaction accounted for the largest amount of total score variation (45% and 48%) and the student × item × dialect interaction accounted for the largest amount of total score variation (41% to 48%), respectively. For both of these studies the sample of items chosen did not include any schematic representations (e.g., graphs, tables, illustrations). It would be of value to conduct a study that includes items with accompanying schematic representations to evaluate whether these results generalize to such items. As Martiniello (2009) suggested, the impact of linguistic complexity may be attenuated when items have schematic representations that help ELLs make meaning of text.
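The logic of these generalizability analyses can be sketched with a simplified, fully crossed student × item × language random design. The variance components below are hypothetical (chosen so that the student × item × language term is the largest, as in the studies above), the rater facet is omitted, and the relative-error formula is the standard textbook one for this design rather than the authors' exact model. The sketch expresses each component as a percentage of total variance and then asks how many items are needed to reach a target generalizability coefficient.

```python
def percent_of_total(components):
    """Express each variance component as a percentage of total variance."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}

def g_coefficient(vc, n_items, n_langs):
    """Generalizability coefficient for relative decisions, p x i x l design.

    Relative error variance = pi/n_i + pl/n_l + pil/(n_i * n_l).
    """
    rel_error = (vc["pi"] / n_items
                 + vc["pl"] / n_langs
                 + vc["pil"] / (n_items * n_langs))
    return vc["p"] / (vc["p"] + rel_error)

def items_needed(vc, n_langs, target, max_items=500):
    """Smallest number of items reaching the target coefficient, or None."""
    for n in range(1, max_items + 1):
        if g_coefficient(vc, n, n_langs) >= target:
            return n
    return None

# Hypothetical variance components: p = student, i = item, l = language;
# "pil" also absorbs residual error.
vc = {"p": 0.20, "i": 0.05, "l": 0.00,
      "pi": 0.10, "pl": 0.02, "il": 0.01, "pil": 0.40}

shares = percent_of_total(vc)
n_items_one_language = items_needed(vc, n_langs=1, target=0.80)
```

Running the same decision study with components estimated separately for each dialect group is what would reveal whether the required number of items differs across groups.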
Rater Language Background as a Source for Measurement Error for ELLs
In addition to students’ limited proficiency in the language in which tests are administered, the raters’ language background can be a source of construct-irrelevant variance and contribute to measurement error. Researchers have examined how rater language background influences the scoring of ELL responses to constructed-response items (Kachchaf & Solano-Flores, 2012). Typically, these researchers examine mean differences of scores given by raters of different language backgrounds and conduct generalizability studies that examine different sources of error contributing to the scores, including error due to the language background of raters. In a study conducted by Kachchaf and Solano-Flores (2012), four native English-speaking and four native Spanish-speaking certified bilingual teachers, who had experience in teaching ELLs, scored responses to mathematics constructed-response items for fourth- and fifth-grade Spanish-speaking ELLs. Students were administered the items in Spanish and English on different testing occasions, with the sequence of items administered randomly determined. There was no significant difference in mean scores due to language of testing or the interaction of language of testing and rater language background. The results of the student × language of testing × item × (rater: language background) generalizability study revealed that the largest amounts of variance were due to the interaction of student and item (36%) and the interaction of student, language of testing, and item (22%), which is consistent with previous studies (Solano-Flores, 2006, 2008). The error variances due to rater language background and the rater nested within rater language background were negligible.
The results of this study are consistent with previous findings that suggest experienced educators, who teach the population of interest (in this case, bilingual teachers of ELLs), can reliably score student responses to mathematics constructed-response items provided that they receive well-developed training materials and procedures (Lane & Stone, 2006).
Internal Structure Evidence Using Factor Analysis and Differential Item Functioning
The internal structure of the test can be examined through exploratory or confirmatory factor analyses (CFAs), including multigroup analyses and IRT analyses. Using exploratory factor analyses, the data are evaluated separately for each group to identify the number of factors underlying test performance. This can be followed by a multigroup CFA to determine whether the same factors underlie test performance across groups (Cook, Eignor, Steinberg, Sawaki, & Cline, 2009). Nested CFA models allow for testing whether specific features of the internal structure of the test are invariant across groups, including the number of factors, factor loadings and errors in their estimation, factor intercorrelations, item intercepts, and item uniquenesses.
DIF, differential bundle functioning (DBF), and differential distractor functioning (DDF) allow for examining the equivalence of subgroup performance at the item level (or at the level of a coherent subset of items). DIF occurs when examinees of equal standing on the measured construct but from different subgroups differ in their probability of responding correctly to an item (Holland & Thayer, 1988). When DIF occurs, it indicates that the item measures some additional construct for one of the subgroups, which negatively affects the validity and comparability of test score interpretations and uses. DIF can be conceptualized as multidimensionality in the test data. DIF can be considered a shift in the distribution of ability along a secondary construct that influences the probability of a correct response (Camilli, 1992). For example, one group may be more able on a secondary construct, such as English reading skills, on a mathematics test.
There are a number of factors that can have an impact on the validity of the results of invariance analyses, including small sample sizes, nonoverlapping proficiency distributions, and lack of measurement precision (Sireci, 2009). The use of large-scale test data and the evaluation of effect sizes help minimize the impact of sample size in interpreting the results. Nonoverlapping proficiency distributions arise because the SWD or ELL groups have distributions that are centered lower on the score scale than the general population. The poorer performance of SWDs and ELLs may be due, in part, to lack of access to the construct being measured. Consequently, these subgroups typically have a restricted range of scores. Differences in distributions affect the results of invariance studies in predictable ways: for example, easy items tend to be flagged for DIF in favor of the focal group (SWDs or ELLs), and the first factor identified in a factor analysis is not as strong for the focal group (Sireci, 2009). In addition, restriction of range can account for lower reliability estimates and poorer predictive validity evidence for these subgroups. To minimize the effect of differential restriction of range across the groups, the reference group (general population) can be selected to have the same distribution as the SWD or ELL group.
Lack of measurement precision can arise from a number of factors, including restriction of range for the SWD and ELL groups, more guessing within these groups when the test is less accessible to them, and the use of individual items instead of item sets or parcels for DIF and factor analyses. A concern with using individual items as the unit of analysis in DIF studies and studies examining the factorial structure of a test is the noise and lack of precision at the item level. Conducting these analyses with meaningful item parcels instead of individual items has been recommended to alleviate this concern (Cattell & Burdsal, 1975). However, as Sireci (2009) has indicated, item parceling may mask any effects if the item-level data are multidimensional. Conducting the analyses at both the item level and the parcel level helps alleviate this concern. Furthermore, results may differ when the analyses are conducted on raw scores versus IRT-scaled scores, because scale transformations are most affected at the lower end of the score distribution and ELLs and SWDs tend to score at the lower end.
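Selecting a reference group with the same score distribution as the SWD or ELL group can be sketched as stratified sampling on the matching score. This is a minimal illustration under stated assumptions: scores are integers, the function name and data are hypothetical, and the source does not prescribe a particular sampling scheme.

```python
import random
from collections import Counter, defaultdict

def match_reference_to_focal(ref_scores, focal_scores, seed=0):
    """Return indices of reference examinees whose score distribution
    mirrors the focal group's (as far as the reference pool allows)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    pool_by_score = defaultdict(list)
    for idx, score in enumerate(ref_scores):
        pool_by_score[score].append(idx)

    chosen = []
    for score, n_focal in Counter(focal_scores).items():
        pool = pool_by_score.get(score, [])
        # Draw as many reference examinees as there are focal examinees
        # at this score, or all of them if the pool is smaller.
        chosen.extend(rng.sample(pool, min(n_focal, len(pool))))
    return sorted(chosen)

# Hypothetical total scores: the focal group is centered lower on the scale.
ref_scores = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5]
focal_scores = [1, 2, 2, 3]

matched_idx = match_reference_to_focal(ref_scores, focal_scores)
matched_scores = [ref_scores[i] for i in matched_idx]
```

The matched reference sample has the same restricted range as the focal group, so differences in reliability or factor strength are less likely to reflect the distributions alone.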
Differential Item Functioning
As previously indicated, DIF occurs when examinees of equal standing on the measured construct but from different subgroups differ in their probability of answering an item correctly. If DIF occurs, it indicates that the item measures some additional construct for one of the subgroups, which affects the validity and comparability of the score interpretations. For those students who need accommodations or modifications and are provided them, if the accommodations or modifications are effective it is more likely that DIF will not be detected. It should be noted, however, that DIF may still occur due to reasons not associated with the accommodations. Accommodations are intended to allow for a “level playing field,” providing support for the comparability and fairness of the assessment without giving undue advantage to the subgroup receiving the accommodation or modification.
Clauser and Mazor (1998) outlined the steps in examining DIF, including specifying the comparison groups, selecting a matching criterion, choosing a statistical approach, and interpreting and making decisions regarding items that are flagged as DIF. As discussed by Buzick and Stone (2011), to ensure valid inferences, scores on the criterion measure need to be reliable, valid, and free of statistical bias (Clauser & Mazor, 1998); standardized conditions are needed for the administration of the criterion measure for the focal and reference groups (Dorans & Holland, 1993); sample sizes need to be sufficient for the reference and focal groups (Zieky, 1993); and focal and reference groups need similar ability distributions, especially when IRT methods are not used (Mazor, Clauser, & Hambleton, 1992).
DIF studies for SWDs and ELLs typically use an internal matching criterion—total test score—and the test is administered to the focal and reference groups under standardized conditions. When an internal matching criterion is used, there is little impact on the validity and reliability of the matching criterion, particularly when purification of the matching criterion is done (i.e., when the total test score excludes the DIF items). However, as previously mentioned, reliability and validity differ for SWDs and ELLs as compared to SWoDs and non-ELLs. Threats to the validity of DIF results are primarily due to the heterogeneity of the samples, sample size, and differences in the ability distributions for the groups.
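A minimal sketch of DIF detection with an internal, purified matching criterion follows. It uses the Mantel-Haenszel common odds ratio and the conventional delta-scale transformation, MH D-DIF = −2.35 ln(α). The data layout, the group labels "ref" and "focal", and the function names are assumptions for illustration, and no significance test or effect-size classification is included.

```python
import math
from collections import defaultdict

def matching_scores(responses, studied_item, flagged=()):
    """Total score over the studied item plus all items not flagged for DIF."""
    n_items = len(responses[0])
    keep = [j for j in range(n_items) if j == studied_item or j not in flagged]
    return [sum(r[j] for j in keep) for r in responses]

def mh_d_dif(responses, groups, studied_item, flagged=()):
    """Mantel-Haenszel D-DIF; negative values favor the reference group."""
    scores = matching_scores(responses, studied_item, flagged)
    # cells[score] = [ref correct, ref incorrect, focal correct, focal incorrect]
    cells = defaultdict(lambda: [0, 0, 0, 0])
    for r, g, s in zip(responses, groups, scores):
        correct = r[studied_item] == 1
        if g == "ref":
            cells[s][0 if correct else 1] += 1
        else:
            cells[s][2 if correct else 3] += 1

    num = den = 0.0
    for a, b, c, d in cells.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        raise ValueError("common odds ratio is undefined for these data")
    return -2.35 * math.log(num / den)

# Hypothetical 0/1 responses to three items; item 0 is the studied item.
ref = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
focal = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 1]]
responses = ref + focal
groups = ["ref"] * len(ref) + ["focal"] * len(focal)

d_dif_item0 = mh_d_dif(responses, groups, studied_item=0)
```

Passing the indices of previously flagged items through `flagged` removes them from the matching score, which is the purification step described above.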
Differential Bundle Functioning
Standard DIF detection and review procedures have not been very useful in explaining why DIF occurs in the flagged items (AERA, APA, & NCME, 2014). To address this problem, Roussos and Stout (1996) developed an approach to test DIF hypotheses that are generated from theory and substantive item analyses. They argue that the DIF analysis approach has suffered from both lack of power, primarily due to exploratory analysis of single items, and inflation of Type I errors (overidentification of DIF) due to inappropriate criteria for matching examinees. They proposed DBF, in which sets of items are bundled according to some organizing principle. An organizing principle for examining DBF for ELLs may be the level of English language demands. DBF examines whether two groups with equal ability have the same probability of answering a bundle of items correctly. DBF is considered to be a confirmatory approach to examining DIF in that hypotheses are stated regarding one group being assessed on the second dimension to a greater extent than its matched group (Douglas, Roussos, & Stout, 1996). The advantages of DBF as compared to DIF analyses are potentially better control of Type I errors and greater statistical power. Moreover, items are grouped into bundles based on a substantive hypothesis about differential performance on the bundle of items for the reference and focal groups. An organizing principle for bundling items for ELL students (i.e., the measurement of a secondary construct) may be the amount and/or complexity of reading or writing required by items. The categorization of items into bundles can be accomplished by expert opinion alone or by dimensionality analysis augmented by expert opinion. One approach is to group items according to different levels of reading complexity and conduct DBF analyses. DBF allows for examining trends across items that differ in content, cognitive skill, reading complexity, or other item characteristics.
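The idea of comparing matched groups on a bundle rather than on single items can be sketched with a simplified, standardization-style index: the difference in mean bundle score between reference and focal examinees at each matching-score level, weighted by the focal group's share of examinees at that level. This is an assumption-laden illustration, not the SIBTEST procedure of Roussos and Stout, which also applies a regression correction to the matching score; the names and data are hypothetical.

```python
from collections import defaultdict

def bundle_difference(responses, groups, bundle, matching_items):
    """Weighted mean difference (reference minus focal) in bundle score,
    conditioning on the total score over the matching items."""
    strata = defaultdict(lambda: {"ref": [], "focal": []})
    for r, g in zip(responses, groups):
        match = sum(r[j] for j in matching_items)
        strata[match][g].append(sum(r[j] for j in bundle))

    # Only strata containing both groups allow a matched comparison.
    usable = [s for s in strata.values() if s["ref"] and s["focal"]]
    n_focal = sum(len(s["focal"]) for s in usable)
    if n_focal == 0:
        raise ValueError("no matching-score level contains both groups")

    total = 0.0
    for s in usable:
        mean_ref = sum(s["ref"]) / len(s["ref"])
        mean_focal = sum(s["focal"]) / len(s["focal"])
        total += (len(s["focal"]) / n_focal) * (mean_ref - mean_focal)
    return total

# Hypothetical data: items 0-1 form the matching subtest; items 2-3 form a
# bundle hypothesized to carry heavier English language demands.
ref = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
focal = [[1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
responses = ref + focal
groups = ["ref"] * 3 + ["focal"] * 3

beta = bundle_difference(responses, groups, bundle=[2, 3], matching_items=[0, 1])
```

A positive value indicates that, among examinees with the same matching score, the reference group scores higher on the bundle, which is the direction predicted when the bundle taps a secondary construct such as reading complexity.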
Researchers have also conducted analyses on the cognitive skills evoked by different subgroups to help explain potential sources of DIF (e.g., Ercikan et al., 2010; Lane, Wang, & Magone, 1996).
Differential Item Functioning for SWDs
Studies have examined DIF for SWoDs compared to SWDs (e.g., Barton & Finch, 2004; Cohen, Gregg, & Deng, 2005) as well as for students with different categories of disability (e.g., Kato, Moen, & Thurlow, 2009; Stone, Cook, Laitusis, & Cline, 2010). Examining DIF across different categories of disability acknowledges the heterogeneity of SWDs and has the potential to provide more meaningful results.
Barton and Finch (2004) examined DIF for accommodated SWDs and SWoDs on both a mathematics test and a reading test. Their results indicated that some items favored the accommodated SWD group and others favored the SWoD group. Several of the items that favored the accommodated SWD group had heavier reading loads than other items. One of the most common accommodations was reading the items aloud, indicating that having the items read to the accommodated SWDs may have provided them better access to the tested construct of comprehension. DIF was examined for 1,250 accommodated SWDs and 1,250 SWoDs on a mathematics test with extended time as the only accommodation using a mixture Rasch model (Cohen et al., 2005). The results suggested that DIF was not related to the accommodation (extended time) but was related to differences in difficulty with types of mathematics content (e.g., word problems, intuitive geometry and measurement, and plane geometry and algebraic or symbolic manipulation), indicating the multidimensionality of test data. This study recognized the need to consider item difficulty and student proficiency level in accommodation validity research, suggesting that students’ accommodation status is not adequate in explaining the occurrence of DIF. Further, none of the group differences were due to reading proficiency. This study was unique in that it examined SWDs with only one accommodation, extended time, which is the most common accommodation.
Finch, Barton, and Meyer (2009) examined DIF for accommodated SWDs and SWoDs in Grades 3 to 8 on mathematics and reading tests using a two-stage analysis. They first used SIBTEST to identify an initial pool of items detected as DIF. Next, a total score was obtained for those items that did not exhibit DIF, and this total score was used for matching in logistic regression analyses of DIF that allowed for the detection of both uniform and nonuniform DIF. Uniform DIF occurs if the item favors one subgroup across the ability scale, whereas nonuniform DIF occurs if the item favors one subgroup on a region of the ability scale but favors the other subgroup on another region of the ability scale. The logistic regression analysis was also conducted for the four most common types of accommodations: Questions Read Aloud, Directions Read Aloud, Alternate Testing Setting, and Extended Time. Overall for the language test items that displayed uniform DIF, SWDs not receiving accommodations were favored. The items that displayed nonuniform DIF at the lower grade levels favored nonaccommodated SWDs at the lower levels of proficiency and favored accommodated SWDs at the higher levels of proficiency. The authors suggested that younger accommodated SWDs may not be able to take advantage of the accommodations to the same extent as older accommodated SWDs. In particular, younger SWDs with lower proficiency levels who receive read-aloud accommodations were at a disadvantage on items that required a heavy navigational load (i.e., integration of text with tables and indices), which was consistent with findings from Barton and Huynh (2003). The mathematics items typically displayed uniform DIF, with 7 of the 11 DIF items favoring nonaccommodated SWDs. The results suggested that accommodations associated with oral administration, extra time, and alternate test settings were associated with mathematics DIF items favoring nonaccommodated SWDs.
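The distinction between uniform and nonuniform DIF in a logistic regression framework can be sketched as follows. The sketch assumes the commonly used model logit P(correct) = b0 + b1·score + b2·group + b3·score·group, with group coded 0 for the reference group and 1 for the focal group. The coefficients are hypothetical rather than estimated from data, and in practice the classification rests on significance tests and effect sizes for b2 and b3, not on a simple tolerance check.

```python
import math

def prob_correct(score, focal, b0, b1, b2, b3):
    """Model-implied probability of a correct response.

    focal is 0 for the reference group and 1 for the focal group.
    """
    z = b0 + b1 * score + b2 * focal + b3 * score * focal
    return 1.0 / (1.0 + math.exp(-z))

def classify_dif(b2, b3, tol=1e-8):
    """Label the DIF pattern implied by the group and interaction terms."""
    if abs(b3) > tol:
        return "nonuniform"   # group difference changes along the scale
    if abs(b2) > tol:
        return "uniform"      # constant log-odds difference at every score
    return "none"

def crossing_score(b2, b3):
    """Matching score at which the two groups' curves cross, if any."""
    if b3 == 0:
        return None
    return -b2 / b3

# Hypothetical coefficients for two items.
uniform_item = {"b0": -2.0, "b1": 0.2, "b2": -0.5, "b3": 0.0}
nonuniform_item = {"b0": -2.0, "b1": 0.2, "b2": -1.0, "b3": 0.1}

cross = crossing_score(nonuniform_item["b2"], nonuniform_item["b3"])
```

For the hypothetical nonuniform item, the focal group is disadvantaged below the crossing score and advantaged above it, which parallels the pattern reported for the lower grade levels.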
Cho, Lee, and Kingston (2012) expanded on these studies by also examining the effects of item characteristics. Using IRT methods, they examined potential causes of DIF for SWoDs and accommodated SWDs, including item difficulty, item discrimination, item type, and item features, as well as accommodation status and ability on a mathematics test for Grade 3 to Grade 8 students. Moreover, for examining DIF for these two groups they used matched samples, with demographic variables and mathematics test scores serving as the matching variables. Overall, they found that mathematics item types (story, explanation, straightforward) and features were significantly related to item difficulty but not to DIF. Story items were not always more difficult than explanation and straightforward items. They found no significant differences among story items, explanation items, and straightforward items for most of the grades, challenging the belief that story items are more challenging because of the required reading (Cho et al., 2012). There were significant differences among item types for two of the six grades: Story items were significantly more difficult than explanation items in Grade 3 (partial η² = .13), and explanation items were easier than story and straightforward items (partial η² = .12) in Grade 5. Across the grades, 73 out of the 470 items had significantly different IRT b-parameter estimates, indicating DIF; 24% of these 73 items favored the accommodated SWDs and 76% favored the nonaccommodated SWoDs. Item type was not related to DIF; that is, DIF was observed similarly among items of different item types.
Using an IRT model, they found that there was no consistent interaction between accommodation status and student ability with respect to DIF. This interaction was significant only for Grades 3 and 5. For these grades not all students in the accommodated SWD and nonaccommodated SWoD groups were consistently advantaged or disadvantaged by the DIF items. As suggested by the authors, additional studies are needed to examine this interaction given that the results in four of the six grades do not support Scarpati, Wells, Lewis, and Jirka’s (2009) conclusion that accommodated SWDs with high performance levels may be better able than those at lower performance levels to make effective use of their accommodations.
To examine context effects of DIF on a mathematics test for SWDs in one state, Randall, Cheong, and Engelhard (2011) used a hierarchical generalized linear model (HGLM), an explanatory model that incorporates item response models into hierarchical models in multilevel settings. They also used the many-faceted Rasch model (MFRM), a descriptive IRT model. They argued for the use of MFRM since it provides fit statistics for all facets (e.g., item, student, disability group, and test condition) so that model fit issues can be addressed before examining DIF. As a fixed effects model, MFRM treats schools as a fixed facet, whereas the HGLM, as a random effects model, can use both student- and school-level variance in the outcome measure to predict the impact of DIF. The authors randomly assigned students with a wide range of disabilities and SWoDs from 74 schools to one of three conditions (resource guide modification, calculator use modification, or standard test administration). Their results suggest that some problem-solving items may not be invariant across disability status and test condition, with some items under the calculator use condition favoring SWDs and others favoring SWoDs.
Kato et al. (2009) examined DIF as well as DDF using a multistep multinomial logistic regression analysis for students with specific categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) on third- and fifth-grade state reading tests. Although a relatively large number of items displayed statistically significant DIF and DDF, they found that only a small number of items displayed substantive DIF and DDF, and they were for students with learning disabilities. As indicated by the authors, this finding emphasizes the importance of treating SWDs as a heterogeneous group, and when sample size permits studies should be conducted on students within specific categories of disabilities. Their results were in contrast to previous DIF analyses with undifferentiated SWDs (Abedi et al., 2007) in that DIF did not increase for items toward the end of the test. The authors suggested that DIF may be due to specific characteristics of items.
Differential Item Functioning for ELLs
The linguistic complexity of tests as a source of construct-irrelevant variance may threaten the validity and fairness of the assessment of all students. Although ELLs may have the content knowledge in areas of mathematics and science, they may not have language skills that allow access to the content of the test items. Consequently, similar to SWDs, DIF may occur for ELL students because of lack of access to the construct being measured.
Abedi and his colleagues (Abedi & Lord, 2001; Abedi, Lord, & Plummer, 1997; Abedi, Lord, & Hofstetter, 1998) have reported a relationship between linguistic features and the difficulty level of content knowledge items, such as mathematics, science, and history, for ELLs; however, the performance of ELLs on these types of items varied across tests and across grades. Of the features studied, only item length has shown consistent negative effects on item difficulty for ELLs. Item length has shown greater difficulty value differences between ELLs and non-ELLs in the eighth-grade NAEP mathematics test (Abedi et al., 1997). The effects of other linguistic features on the difficulty of mathematics items for ELLs have been relatively inconsistent across tests and grades (Shaftel et al., 2006), suggesting that the grade level, academic content, and other factors may contribute to DIF for ELLs. Abedi, Courtney, Leon, Keo, and Azzam (2006), however, have shown that aggregating linguistic features, including both syntactic and lexical features, to form a composite linguistic complexity score resulted in a significant effect for ELLs. A linguistic complexity score based on familiarity of nonmathematical vocabulary, presence of syntactically complex sentences, relative clauses, and abstract format of the item statement significantly predicted the difference between the item difficulty values for ELLs and non-ELLs in the NAEP Grade 8 national test data. This study examined the impact of linguistic features on item difficulty values for ELLs without conditioning on ability level as is done in DIF analyses.
Using a meta-analytic DIF procedure, Koo, Becker, and Kim (2014) examined DIF trends in ELLs and non-ELLs on a state reading test for the 3rd and 10th grades. This approach allowed for examining variation in DIF indices that is related to item content features. Their results indicated that reading items requiring knowledge of words and phrases in context favored non-ELLs in Grade 3 but not in Grade 10. The items requiring knowledge of words and phrases in context were overall more difficult for ELL third graders than for non-ELLs, suggesting that ELLs at lower grade levels may not have acquired sufficient strategies to learn vocabulary (Koo et al., 2014). Items requiring evaluation skills favored ELLs in Grade 10 but not Grade 3. DIF was not observed for main idea items or cause-effect items for ELLs.
Martiniello (2009) examined whether the relative difficulty of mathematics word problems in a fourth-grade state test for ELLs is associated with construct-irrelevant linguistic complexity and the use of nonlinguistic (schematic) representations using IRT DIF methods. Linguistic complexity and schematic representation correlated significantly with DIF statistics (r = .58 and r = −.55, respectively). Items with greater linguistic complexity tended to favor non-ELLs over ELLs, whereas items with schematic representations tended to favor ELLs over non-ELLs. Using ordinary least squares (OLS) multiple regression, the results indicated that linguistic complexity and its interaction with schematic representation accounted for 66% of the variation of DIF statistics across the two groups. The author hypothesized that the impact of linguistic complexity is attenuated when items have schematic representations that may help ELLs make meaning of text. The author suggests that including them may help diminish the negative effect of linguistic complexity on ELL performance.
Wolf and Leon (2009) examined whether the language demands of items on several states’ mathematics and science tests were associated with the degree of DIF for ELLs. A total of 542 items from 11 tests at Grades 4, 5, 7, and 8 from three states were rated for the linguistic complexity. They found a stronger association between the linguistic rating and DIF statistics for ELLs for easier items but not for the more difficult items. General academic vocabulary and the amount of language in an item had the strongest relationship with DIF values, particularly for ELLs with low English proficiency levels. The items were grouped into four bundles to examine more closely the relationship between language demands and ELL student performance. DBF results indicated that as language demands increased more DBF was obtained.
Sinharay, Dorans, and Liang (2011) demonstrated how methods can be used to evaluate whether the inclusion or exclusion of students for whom English is not their first language (NEFL) has an impact on the results of DIF analyses. They conducted DIF analyses on the mathematics section of the PSAT (Preliminary SAT) using Whites as the reference group with various focal groups such as Hispanic and Asian, and using English as the first language (EFL) as a proxy for English language proficiency. DIF analyses were conducted for samples consisting of those examinees who indicated EFL as well as for a combined sample (EFL and NEFL). The correlations of the Mantel-Haenszel DIF statistics between the EFL group and the combined group were above .987, indicating that the ordering of items with respect to DIF is nearly the same across the groups. The same DIF analyses were conducted after simulating higher proportions of NEFLs in the combined sample, ranging from .1 to .9 in increments of .1. Although the DIF results were very similar across subgroups for the proportion of NEFLs in the actual test data (approximately 9.5%), as the proportion of NEFL examinees increased (above .4) the DIF statistics for the EFL and combined groups differed for the ethnic/cultural groups, especially for the Hispanic group. As indicated by the authors, it is of no consequence whether DIF analysis is performed on the EFL group or on the combined group under present conditions; however, as the proportion of NEFL examinees increases over the years it is expected to have an effect on the DIF results.
Factorial Invariance
To have a common test scale for various groups of examinees, there is a need to establish measurement invariance of the construct being tested. Multigroup CFA and IRT models can be used to establish the invariance of the construct being measured across groups. Measurement invariance is typically tested across groups at a set of hierarchically structured levels: factor loadings, intercepts of measured variables and factors, and residual variances of observed variables. Multigroup CFA examines the change in the goodness-of-fit index (GFI) when cross-group constraints are imposed on a measurement model. Standardized root mean square residual, comparative fit index, and root mean square error of approximation are typically examined to evaluate the results.
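The hierarchical testing sequence can be sketched as a comparison of fit across increasingly constrained models. The fit values below are hypothetical, and the cutoff (a drop in the comparative fit index of more than .01 between adjacent models) is one commonly cited rule of thumb adopted here as an assumption, not a criterion stated in the studies reviewed.

```python
def first_noninvariant_level(fits, max_cfi_drop=0.01):
    """Return the first constraint level whose CFI drops by more than the
    cutoff relative to the preceding, less constrained model.

    fits: list of (label, cfi) ordered from least to most constrained.
    Returns None if every step stays within the cutoff.
    """
    for (_, prev_cfi), (label, cfi) in zip(fits, fits[1:]):
        if prev_cfi - cfi > max_cfi_drop:
            return label
    return None

# Hypothetical multigroup CFA results (illustrative values only).
fits = [
    ("configural (same factor pattern)", 0.962),
    ("factor loadings", 0.958),
    ("intercepts", 0.941),
    ("residual variances", 0.936),
]

breaking_level = first_noninvariant_level(fits)
```

In this hypothetical sequence, constraining the loadings costs little fit, whereas constraining the intercepts does not hold, which would argue against comparing group means on a common scale.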
Factorial Invariance for ELLs
Using CFA to examine the internal structure of Grade 9 data from commercially published tests of reading, math, and science, Abedi (2002) reported that correlations of item bundles with the latent factors were consistently lower for ELLs than non-ELLs. For reading, the correlations ranged from .719 to .779 for ELLs and from .832 to .858 for non-ELLs, and for math, the correlations ranged from .657 to .789 for ELLs and from .796 to .862 for non-ELLs. As indicated by Abedi (2003), a number of factors, including language background, restriction of score range, SES, and OTL, may contribute to the observed differences.
In another study examining the internal structure of tests, Abedi, Courtney, and Leon (2003) observed a single factor with 83% of the variance of the item performance explained by the first factor for non-ELLs, whereas only 45% of the variance was explained by the first factor for ELLs. This implies that the test is primarily unidimensional for the non-ELL group, whereas the test is multidimensional for the ELL group, suggesting that more than one construct is being measured for the ELL group.
Factorial Invariance for SWDs
In an early study examining the internal structure using CFA, Rock, Bennett, and Kaplan (1987) reported that the SAT-Verbal and SAT-Mathematical factors were similar for SWDs and SWoDs, but the correlation between the two factors identified in the analyses was slightly lower in some of the SWD groups than in the SWoD group. Rock, Bennett, Kaplan, and Jirele (1988) replicated the Rock et al. (1987) study with GRE data and reported that the fit of the three-factor solution (Verbal, Quantitative, and Analytic) was acceptable for most subgroups. For those receiving the nonstandard administration (i.e., cassette-recorded version of administration), however, a four-factor model was preferred, in which the Analytic factor represented two components, logical reasoning and analytical reasoning, especially for students with physical disabilities and those needing the large-type version. A study comparing the internal structure on the Maryland School Performance Assessment Program for accommodated SWDs and nonaccommodated SWDs reported similar factor structures (Tippets & Michaels, 1997).
Using maximum likelihood factor analysis, Huynh et al. (2004) examined the internal structure of a state Grade 10 math test for SWDs who were tested under standard conditions, SWDs who were given an oral accommodation and SWoDs who were tested under standard conditions. The preliminary principal component analysis and maximum likelihood factor analysis indicated that the internal structure of the test data was similar across groups, especially between the two disability groups. For the two disability groups, the first eigenvalue accounted for 61% of the variation in the test data and for the SWoD group it accounted for 66% of the variation. The second eigenvalue accounted for approximately 13% of the total variation for the two SWD groups and 11% for the SWoDs. In another study, Huynh and Barton (2006) examined the internal structure of a state Grade 10 reading test for SWoDs who were tested under standard conditions, SWDs who were tested under standard conditions, and SWDs who were given an oral accommodation. Using CFA, the root mean square error of approximation, standardized root mean square residual, and GFI indices supported the hypothesis that a common factorial model could be used to represent the test data for all three groups of students. The factor analysis that was performed for descriptive purposes generally supported the hypothesis; however, there were some moderate differences of .158 and .101 in factor loadings for two of the subtests, details and main idea, across the groups.
Using factor analysis, Cook et al. (2009) demonstrated that the reading comprehension construct assessed by a commercially published test was invariant when a read-aloud accommodation (audio CD) was provided for Grade 4 and Grade 8 students with reading-based disabilities and SWoDs. In another study, Cook, Eignor, Sawaki, Steinberg, and Cline (2010) examined whether a state standards–based Grade 4 English Language Arts (ELA) assessment measured the same construct for SWoDs, SWDs who took the test under standard conditions, and SWDs who took the test with accommodations as specified in their individualized education program or 504 plan, and SWDs who took the test with a read-aloud accommodation. Using multigroup CFAs with item bundle scores, the RMSEA, comparative fit index, and GFI indices indicated that the factor structures across the four groups were similar, providing support for measurement invariance across the groups.
D. Kim and Huynh (2010) examined the comparability of scores from a ninth-grade online and paper-and-pencil administration of a state end-of-course English test for SWoDs and SWDs, specifically students with learning disabilities. Computer-based testing can include features that make tests more accessible for all students, including greater flexibility in administration to accommodate different learning styles, standardization of accommodated administration for SWDs, built-in accommodations (e.g., text-to-speech support, video), and built-in tutorials and practice tests (Thompson, Thurlow, Quenemoen, & Lehr, 2002). D. Kim and Huynh (2010) examined four nested levels of factorial invariance: (a) all parameters were freely estimated, (b) factor loadings were constrained to be equal across groups (weak invariance), (c) factor loadings and intercepts were constrained to be equal across groups (strong invariance), and (d) factor loadings, intercepts, and residual variances were constrained to be equal across groups. The results of testing the hierarchical models indicated that the factor structure, factor loadings, intercepts, and error variances were invariant between the online and paper-and-pencil administrations for both SWoDs and students with learning disabilities; that is, the highest level of measurement invariance was confirmed. They concluded that intergroup differences between the means of observed items reflect differences in means of the construct. Other researchers have used both CFA and the Rasch model to assess measurement invariance in high stakes tests (Randall & Engelhard, 2010b).
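The nested sequence of invariance models described above is often evaluated by tracking how a fit index changes as constraints are added. The sketch below computes the comparative fit index (CFI) from chi-square values and reports its change from one model to the next. All chi-square values and degrees of freedom are invented for illustration and are not results from D. Kim and Huynh (2010). The label "strict" for model (d) and the cutoff of about .01 for a meaningful drop in CFI are assumptions taken from common practice, not from the study.

```python
def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index from model and baseline (null) chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    return 1.0 if denom == 0 else 1.0 - d_model / denom


def cfi_changes(models, chi2_base, df_base):
    """Return (name, CFI, change from previous model) for each nested model."""
    rows = []
    previous = None
    for name, chi2, df in models:
        value = cfi(chi2, df, chi2_base, df_base)
        change = None if previous is None else value - previous
        rows.append((name, value, change))
        previous = value
    return rows


# Hypothetical fit results, ordered from least to most constrained.
CHI2_BASE, DF_BASE = 5200.0, 120
MODELS = [
    ("configural", 180.0, 80),  # (a) all parameters free
    ("weak", 192.0, 88),        # (b) equal loadings
    ("strong", 205.0, 96),      # (c) equal loadings and intercepts
    ("strict", 221.0, 104),     # (d) plus equal residual variances
]

rows = cfi_changes(MODELS, CHI2_BASE, DF_BASE)
for name, value, change in rows:
    label = "n/a" if change is None else f"{change:+.4f}"
    print(f"{name:<11} CFI={value:.4f} change={label}")
```

With these invented values, every step lowers the CFI by far less than .01, which is the pattern one would expect when the more constrained models are tenable.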
External Structure Evidence
The external structure of a test refers to the relationships of test scores to other variables, including relationships to measures of the same construct (convergent validity evidence), to future performance (predictive validity evidence), and to measures of different constructs (discriminant validity evidence; AERA, APA, & NCME, 2014). These relationships are expected to be invariant across subgroups within the population, and evaluations of the invariance of the relationship between the measures across groups are warranted, including a comparison of the correlations and the linear relationships across the groups (Linn, 1978).
Using simulated data and real data sets, Kane and Mroch (2010) demonstrate the impact of regression toward the mean, which tends to introduce bias, in differential validity studies that examine convergent and discriminant validity evidence. They demonstrate that although measures can have the same correlation between groups, when both measures contain measurement error, the slopes and intercepts can differ across groups using OLS regression models. This concern is relevant when comparing group-specific regression lines for ELLs and SWDs with those from the general population because even if the true-score relationship between the variables for the subgroup of interest and the general population is the same, the regression lines for the groups will differ due to regression toward the mean. This occurs because the subgroups tend to have lower means on the two measures. This will lead to inaccurate interpretations regarding the relationship between the variables for the student groups. The orthogonal regression approach within principal component analysis for estimating true-score relationships is not subject to regression toward the mean and can more accurately estimate true-score relationships between the two variables as demonstrated by Kane and Mroch (2010).
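The contrast between an OLS line and an orthogonal (major-axis) line can be sketched with a small, deterministic example. The data and function names are hypothetical, and the orthogonal slope below is the generic first-principal-component formula for two variables, which may differ in detail from the procedure used by Kane and Mroch (2010). The two variables are constructed to have equal variance, so a symmetric treatment of error gives a slope of 1, whereas the OLS slope is pulled toward zero.

```python
import math


def sample_moments(x, y):
    """Means, variances, and covariance (n - 1 denominators)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return mean_x, mean_y, sxx, syy, sxy


def ols_line(x, y):
    """Slope and intercept of the OLS regression of y on x."""
    mean_x, mean_y, sxx, _, sxy = sample_moments(x, y)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def orthogonal_line(x, y):
    """Slope and intercept of the major axis (first principal component).

    Assumes a nonzero covariance between x and y.
    """
    mean_x, mean_y, sxx, syy, sxy = sample_moments(x, y)
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, mean_y - slope * mean_x


# Hypothetical scores on two measures with equal variance.
measure_1 = [1, 2, 3, 4, 5]
measure_2 = [2, 1, 3, 5, 4]

ols_slope, ols_intercept = ols_line(measure_1, measure_2)
orth_slope, orth_intercept = orthogonal_line(measure_1, measure_2)

# Predicted score on measure 2 for a low scorer on measure 1.
low_score = 1
ols_prediction = ols_intercept + ols_slope * low_score
orth_prediction = orth_intercept + orth_slope * low_score
```

In this example the OLS prediction for the low scorer is closer to the overall mean than the orthogonal prediction, which is the regression-toward-the-mean effect that can make group-specific OLS lines differ for lower scoring subgroups.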
Using structural equation modeling to examine the external structure of Grade 9 data from commercially published tests of reading, mathematics, and science, Abedi (2002) reported lower correlations between the latent factors underlying test performance for ELLs than for non-ELLs. The correlation between latent factors for math and reading was .782 for non-ELLs as compared to .645 for ELLs, and the correlation between latent factors for reading and science was .837 for non-ELLs and .806 for ELLs. Koretz and Hamilton (2000) provided evidence for a format effect for accommodated SWDs on state tests composed of constructed-response items and multiple-choice items. The correlations for constructed-response items in different subjects were larger than the correlations between constructed-response and multiple-choice items in the same subject for accommodated SWDs but not for other students, in particular, at the lower grades. They provide an example for Grade 7 in which the correlation between constructed-response items in different subjects was .63 for accommodated SWDs and the correlations between constructed-response and multiple-choice items in the same subjects were .55 and .57.
Predictive validity studies have examined college performance of accommodated and nonaccommodated SWDs in comparison with accommodated and non-accommodated SWoDs. Cahalan, Mandinach, and Camara (2002) investigated the predictive validity of the SAT I Reasoning Test for examinees with learning disabilities and extended time accommodations. The use of the SAT I test scores alone to predict freshman GPA tended to overpredict for accommodated male students and accurately predict for accommodated female students. Using OLS regression, a more recent study examined the differential validity of the SAT for students whose best language is not English (Mattern, Patterson, Shaw, Kobrin, & Barbuti, 2008). They reported that the SAT accurately predicts freshman GPA for students whose best language is English, the critical reading and writing SAT sections underpredict for students whose best language is not English (mean standardized residual of .40 and .37, respectively), and the mathematics section provides accurate predictions for students whose best language is not English. For students who indicated that their best language is English and another language, the SAT tends to slightly overpredict freshman GPA (standardized residuals ranging from −.09 to −.02).
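The logic of a differential prediction analysis of this kind can be sketched as follows: a single regression line is fit to the total group, and the mean standardized residual is then computed within each subgroup. A positive mean residual indicates underprediction (actual performance above predicted) and a negative mean residual indicates overprediction. The data, group labels, and function names are hypothetical and are not taken from the studies cited above.

```python
import statistics


def fit_line(x, y):
    """OLS slope and intercept for a single predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_standardized_residuals(x, y, groups):
    """Mean standardized residual per group from a total-group regression."""
    slope, intercept = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sd = statistics.stdev(residuals)
    result = {}
    for label in sorted(set(groups)):
        z = [r / sd for r, g in zip(residuals, groups) if g == label]
        result[label] = statistics.mean(z)
    return result


# Hypothetical admission test scores (in hundreds) and freshman GPAs.
test_scores = [4.0, 5.0, 6.0, 7.0, 4.0, 5.0, 6.0, 7.0]
freshman_gpa = [2.0, 2.5, 3.0, 3.5, 2.4, 2.9, 3.4, 3.9]
group = ["group_1"] * 4 + ["group_2"] * 4

by_group = mean_standardized_residuals(test_scores, freshman_gpa, group)
```

In this invented example the common line overpredicts for `group_1` and underpredicts for `group_2` by the same amount, because the two groups have identical score distributions but different GPAs.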
For K–12 large-scale assessment, the absence of criterion measures has resulted in more emphasis on examining the internal psychometric properties of tests administered to subgroups of students (Koretz & Hamilton, 2006). However, states are currently examining the extent to which their state test scores, especially at the high school level, are related to criterion measures such as SAT and ACT scores as an attempt to provide evidence that the scores are related to college and career readiness. Some states are also using test scores in the middle school grades to predict performance at the high school level.
Equating Invariance
The invariance in reported scores across groups of students defined by subgroup or test accommodations is necessary for the validity of the score interpretations and comparisons across groups. Population invariance requires that the equating functions derived from different subpopulations produce the same results across forms if they are to be equitable. Score equity assessment (SEA; Dorans, 2004) is a psychometric approach for examining fairness and equity in reported test scores by evaluating equating invariance across different groups. The analysis is conducted at the group level. This approach provides information on whether student subgroups can be considered from the same population when equating. If groups of examinees are from different populations, there is a lack of equity in the reported scores. This implies that there is a lack of score comparability, and the validity of score interpretations is hindered.
Equating is a psychometric procedure used to adjust scores for differences in difficulty across two or more forms of a test (Kolen & Brennan, 2004). Raw scores on each form are equated to the same scale before they are reported to help ensure equity. The property of equating invariance is met when subpopulations within the overall population have the same equating relationship from raw to reported scores. Group-level score equity and comparability across groups are achieved when an equating is invariant (Dorans & Holland, 2000; Kolen & Brennan, 2004). Raw scores for tests that exhibit invariant factor structures and item response functions across groups are expected to have equating invariance in SEA analysis. This implies that the construct is being assessed in the same way, and the precision of scores is the same across groups. SEA compares each group in a population to the overall population of examinees.
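One way to make the idea of equating invariance concrete is to compare subgroup equating functions with the total-group function at each raw score. The sketch below uses linear equating for simplicity and summarizes the discrepancies with a weighted root-mean-square difference, standardized by the total-group standard deviation on the reference form. This index is written in the spirit of the population invariance work cited above, but its exact form here, the function names, and all means, standard deviations, and weights are assumptions made for illustration.

```python
import math


def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Linear equating of raw score x on Form X to the scale of Form Y."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)


def rmsd(x, total, subgroups):
    """Weighted root-mean-square difference between subgroup and total equatings.

    total: dict with keys mean_x, sd_x, mean_y, sd_y.
    subgroups: list of (weight, dict) pairs whose weights sum to 1.
    """
    e_total = linear_equate(x, **total)
    squared = sum(w * (linear_equate(x, **g) - e_total) ** 2 for w, g in subgroups)
    return math.sqrt(squared) / total["sd_y"]


# Case 1 (hypothetical): groups differ in level, but the equating is invariant.
total_1 = {"mean_x": 30.0, "sd_x": 8.0, "mean_y": 32.0, "sd_y": 8.0}
groups_1 = [
    (0.9, {"mean_x": 30.5, "sd_x": 8.0, "mean_y": 32.5, "sd_y": 8.0}),
    (0.1, {"mean_x": 25.5, "sd_x": 8.0, "mean_y": 27.5, "sd_y": 8.0}),
]

# Case 2 (hypothetical): the smaller group has a different equating relationship.
total_2 = {"mean_x": 30.0, "sd_x": 8.0, "mean_y": 31.9, "sd_y": 8.0}
groups_2 = [
    (0.9, {"mean_x": 30.5, "sd_x": 8.0, "mean_y": 32.5, "sd_y": 8.0}),
    (0.1, {"mean_x": 25.5, "sd_x": 8.0, "mean_y": 26.5, "sd_y": 8.0}),
]

rmsd_invariant = rmsd(20, total_1, groups_1)
rmsd_shifted = rmsd(20, total_2, groups_2)
```

The first case shows that groups can differ substantially in mean performance and still share one equating function. The second case yields a nonzero index because the conversion for the smaller group differs from the total-group conversion.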
Sinharay et al. (2011) showed how methods can be used to evaluate whether the inclusion or exclusion of students for whom English is not their first language (NEFL) has an impact on equating results. Equating procedures ensure that test scores on different forms are interchangeable. The authors conducted the equating of the PSAT to the SAT on three examinee samples: students for whom English is their first language (EFL), NEFL, and Total (EFL and NEFL group). As the authors indicated, their equating procedures are similar to a score equity evaluation that examines the invariance of equating across subpopulations, addressing fairness at the test score level. Using the unsmoothed chained equipercentile equating method, they reported that the results were similar across the groups. The results were also similar for the simulated subsamples for which the proportion of NEFL examinees increased considerably above the actual proportion of approximately 9.5%. Based on these results, it appears that there is little consequence if equating of the PSAT to the SAT is conducted on the EFL sample or the Total sample (EFL and NEFL) now or in the future when the proportion of ELLs will be greater in the population of students in this country.
The effect of language on the invariance of equating functions with different test formats has also been examined. The population invariance of equating for a teacher certification paper-and-pencil and computer-based tests was examined when equating results were obtained from subgroups defined by English as Secondary Language (ESL) status (Cid & Spitalny, 2013). The results from this study indicated differences between equating conversions of the total and ESL groups at score levels near the cut scores on the scale for both modes of testing, whereas there were no differences between the total group and the group of students whose primary language was English. Because the differences between equating conversions for the total and ESL group comparisons were similar in each mode of testing, the authors indicated that the mode of testing does not differentially affect the degree to which the equating functions of the subgroups were invariant.
Using the SEA approach, equating invariance was examined for a fifth-grade state science test for groups of students defined by SWD status, ELL status, and use of accommodations (Huggins & Elbaum, 2013). Measurement comparability and reported score equity were confirmed for SWDs and ELLs who used accommodations across all score ranges; however, they were not confirmed for SWDs and ELLs who did not use accommodations. The researchers also examined invariance at the high-stakes cut score by comparing the proficiency classifications based on each student’s two equated scores: one score was based on the equating of the overall population and the other on the equating for the student’s subgroup. The equating for SWD and ELL groups with accommodations had a higher classification consistency rate as compared to the equating for SWD and ELL groups without accommodations. In both analyses the differences were small, and as indicated by the authors, these differences may be due to small subgroup sizes, lack of measurement invariance, or both. As an example, for students in the SWD and/or ELL group with accommodations, there was a 96.38% agreement rate between the two proficiency classifications, whereas in the SWD and/or ELL group without accommodations there was a 95.13% agreement rate. With larger samples, PARCC and SBAC will be able to conduct such analyses for students with specific disabilities and ELLs from different language and cultural backgrounds.
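The agreement rate between two proficiency classifications can be computed directly once each student has two equated scores, one from the overall-population equating and one from the subgroup equating. The scores, cut score, and function name below are hypothetical and are not values from Huggins and Elbaum (2013).

```python
def agreement_rate(overall_scores, subgroup_scores, cut_score):
    """Proportion of students classified the same way by both equated scores."""
    if len(overall_scores) != len(subgroup_scores):
        raise ValueError("Each student needs one score from each equating.")
    matches = sum(
        (a >= cut_score) == (b >= cut_score)
        for a, b in zip(overall_scores, subgroup_scores)
    )
    return matches / len(overall_scores)


# Hypothetical equated scale scores for eight students.
overall_equated = [298, 301, 305, 299, 310, 295, 300, 302]
subgroup_equated = [299, 300, 304, 300, 309, 296, 299, 303]

rate = agreement_rate(overall_equated, subgroup_equated, cut_score=300)
```

Only students whose two scores fall on opposite sides of the cut score lower the rate, so small equating differences matter most for students scoring near the cut.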
Issues Related to Including SWDs and ELLs in Measures of “Growth” and of Educator Effectiveness
States have used “growth” models for educational accountability purposes since the Growth Model Pilot Project was initiated in 2005 (U.S. Department of Education, 2005). These models are used in NCLB accountability to monitor states’ progress in closing achievement gaps and to set high expectations for annual improvement for all students, including SWDs and ELLs. As growth measures have been implemented to help determine adequate yearly progress (U.S. Department of Education, 2005), it has become apparent that low- and high-performing students are not being measured as precisely even though they can be accurately classified in a performance level. There has been little research on the efficacy of growth models for tracking change for ELLs and SWDs. Such research is crucial given that the federal Race to the Top initiative calls for multiple measures in educator evaluation systems, including measures of student growth (U.S. Department of Education, 2010). It should be noted that the term growth is used in this chapter because of current practice, but it has been argued that it is inappropriate to use the term in association with these models because they do not actually estimate performance over time (Castellano & Ho, 2013).
While arguing for the need for educator evaluation systems to fairly account for the inclusion of SWDs and ELLs in mainstream classrooms, Jones, Buzick, and Turkan (2013) discussed a number of challenges in including these subgroups, including challenges when estimating growth models used for accountability purposes. They discussed issues associated with value-added scores that are obtained from statistical models that attempt to explain the contribution of individual teachers to student achievement, by accounting for prior student achievement and in some cases student and school characteristics. They described a number of measurement challenges that need to be considered in including SWDs and ELLs when estimating growth models and including them in the evaluation of educator effectiveness. First, there is an inconsistent use of testing accommodations over time, which can increase measurement error and have an impact on change in student test scores (Abedi, Hofstetter, & Lord, 2004; Sireci et al., 2005). Careful attention needs to be paid to identifying students who need accommodations and ensuring that there is consistency for students over time in providing accommodations. Second, a relatively large percentage of ELLs and SWDs demonstrate low performance on state assessments (Abedi et al., 2007; Thurlow et al., 2011), threatening the validity of student changes in achievement as a measure of teacher effectiveness. Because scores at either end of the score scale are not as precise as those in the middle of the score scale, measures of effectiveness for teachers with large numbers of ELLs and SWDs, who tend to score at the lower end of the score scale, will be adversely affected due to the lack of score precision. Students scoring at the lower end of the scale are likely to guess more when responding to items. This was demonstrated using data from a fourth- and eighth-grade math and ELA test from one state by Laitusis et al. (2011). 
They found that the percentage of SWDs scoring at chance level ranged from 12% to 22%, whereas only 1% to 3% of SWoDs scored at chance level. It should be noted that a goal of the two consortia, PARCC and SBAC, is to more precisely measure students who are at the lower end of the score scale. Third, SWDs and ELLs are both heterogeneous groups, varying in terms of student characteristics, special services, OTL, and accessibility. For example, some ELLs may enter school late in the year, making it difficult to isolate teacher effects. Fourth, it is difficult to attribute student progress to individual teachers. As an example, mainstream teachers share responsibility for instruction with special education teachers and ESL teachers.
Other researchers have discussed these challenges as well as additional concerns when including ELLs and SWDs in the measurement of growth and teacher effectiveness (Lakin & Young, 2013; Stevens, Zvoch, & Biancarosa, 2012). There is a potential for more missing data because of high mobility rates, resulting in exclusion of students from accountability indices using growth data. Changes in the use of accommodations from year to year for ELLs and SWDs may have several unintended outcomes such as masking real academic progress or indicating spurious progress, producing differential trajectories of progress for ELLs and SWDs as compared to non-ELLs/SWDs, and leading to differences in proficiency classification and classification accuracy across the growth models for ELLs and SWDs.
Jones et al. (2013) provided suggestions in response to these challenges. First, they proposed that practitioners use a roster validation system, with both the special education and general education teachers being 100% responsible for their shared students. Second, they argued for precision across the score scale so as to reduce the amount of measurement error for low-performing students as well as high-performing students, which is a goal of the two assessment consortia, PARCC and SBAC. A system that accurately assigns, records, and monitors the use of test accommodations should be a priority so that this information can be used when interpreting measures of growth in the evaluation of teachers. Third, they called for more studies examining the validity of value-added modeling and investigating the variables that account for heterogeneity of subgroups and their effects on value-added scores.
A recent study examined variations among three “growth” models: value tables (change in student proficiency categories over time), projection models, and student growth percentiles (SGPs; change in a student’s normative position in an achievement distribution over time). The models were compared in terms of their sensitivity to ELL status with respect to the number of on-track classifications and the predictive accuracy of those classifications for ELLs (Lakin & Young, 2013). The authors used state mathematics and ELA test data from a large California school district for students in the 2012–2015 high school graduating classes. Data from the 2012 year were used as the calibration year for the projection and SGP models. The year at which all students needed to meet the status proficiency goal was set at Grade 7 and students in Grades 4 to 6 were evaluated for growth targets. For the ELA data, the value table model identified the largest number of ELL students on track (42% to 52% dependent on grade level), followed by the SGP model (12% to 15%), and the smallest number was identified by the projection model (1% to 4%). This pattern held for non-ELLs; however, the percentage identified for the value table model and the projection model for non-ELLs differed from the ELL classification rate (38% to 52% and 2% to 6%, respectively). Across both ELA and math test data, the value table model had the largest number of differences in on-track classification rates, indicating that more ELL students than non-ELLs were on track.
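The idea behind a student growth percentile, a student’s standing among peers who started from a similar point, can be sketched in a deliberately simplified form. Operational SGP models are generally estimated with quantile regression over students’ prior score histories; the sketch below replaces that with exact matching on a single prior score and a mid-rank percentile, purely for illustration. The function name and data are hypothetical and do not reproduce the models compared by Lakin and Young (2013).

```python
def simple_growth_percentile(prior_scores, current_scores, student_index):
    """Mid-rank percentile of a student's current score among academic peers.

    Peers are defined (as a simplifying assumption) as students with exactly
    the same prior-year score as the target student.
    """
    target_prior = prior_scores[student_index]
    target_current = current_scores[student_index]
    peers = [
        c for p, c in zip(prior_scores, current_scores) if p == target_prior
    ]
    below = sum(c < target_current for c in peers)
    ties = sum(c == target_current for c in peers)  # includes the student
    return 100.0 * (below + 0.5 * ties) / len(peers)


# Hypothetical scale scores: five students with the same prior-year score.
prior = [300, 300, 300, 300, 300]
current = [310, 320, 330, 340, 350]

median_grower = simple_growth_percentile(prior, current, student_index=2)
top_grower = simple_growth_percentile(prior, current, student_index=4)
```

Because the percentile is normative, it describes a student’s position relative to peers and does not by itself indicate whether the student is on track to reach a proficiency target.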
They also examined the accuracy of the models in predicting the 23% of ELL and 25% of non-ELL students who were not proficient in Grade 3 but were proficient by Grade 7. Although the value table model identified more ELLs as on-track, those identified were less likely to be proficient at Grade 7. The projection model was the most accurate of the three models, but it identified the fewest students as on-track. A regression residual analysis showed that ELA and mathematics scores were more likely to be underestimated for ELLs as compared to non-ELLs, indicating that the projection model underestimated, in the early grades, the future success of ELLs. The SGP model had slightly lower accuracy than the projection model and had similar accuracy for ELLs and non-ELLs. Lakin and Young (2013) suggested that the projection model “may unfairly penalize schools with large numbers of ELL students because it fails to identify accurately all of the ELL students who will later be successful” (p. 22). This result also has implications for using these models for evaluating educators with a relatively large number of ELLs in their classes. It is important to note that their sample comprised nearly 50% ELLs. If these models are applied to samples consisting of a relatively small percentage of ELLs, the rate of errors produced by these models will be greater.
In summary, there are a number of challenges in including ELLs and SWDs in models evaluating educator effectiveness: the number of ELLs and SWDs in classrooms, the inconsistent use of accommodations across years, the mobility of ELLs and SWDs and the subsequent omission of their test scores in these models, and the imprecision in measuring students who perform at the lower end of the score scale. The assessment consortia are addressing some of these concerns, by attempting to design tests with better access for these students and more precision across the scale and by designing systems for assigning and monitoring accommodations used by students.
Conclusion
There are a number of research design and psychometric issues that affect the validity and fairness of assessing SWDs and ELLs. One design issue is related to identifying students who are classified within each of these subgroups and the various classification schemes across districts and states. In discussing the nuances in defining ELLs, Abedi (2008) addressed concerns with various procedures used by states to classify students as ELLs. Who is included in the samples will affect the results of studies examining the validity and fairness of assessing ELLs and SWDs. A clear delineation of the sample is needed to ensure that the results can be interpreted in a meaningful way.
A related issue is the heterogeneity of students within both ELL and SWD subgroups. Heterogeneity needs to be addressed when examining the efficacy of accommodation for SWDs and ELLs as well as when examining the psychometric characteristics of the item and test scores for these subgroups. The experiences of students during the assessment, and consequently their performances, are affected by a complex interaction of test characteristics (e.g., academic content, test language, item type, scoring) and cultural, language, economic, and educational histories of the students (Abedi, 2006; Abedi & Gandara, 2006; Solano-Flores, 2008; Solano-Flores & Trumbull, 2003). English is acquired at different paces by ELLs, and new ELLs enter schools each year, resulting in this group of students being variable in terms of both their English proficiency and academic proficiency. More research on the assessment of ELLs is needed on disaggregated groups. Additional research is also needed on disaggregated groups of SWDs (Sireci, 2009). Combining students with very different disabilities in one group to obtain sample sizes that allow for sufficient statistical power when evaluating SWDs and SWoDs may hide true effects. Although there have been some studies examining the efficacy of test accommodations on performance and measurement invariance within each SWD and ELL group, sample sizes tend to be small. With larger sample sizes for subgroups within ELLs and SWDs, PARCC and SBAC will be in a position to examine test and item properties for student groups with specific disabilities and ELL groups from different cultural and linguistic backgrounds.
In an attempt to address heterogeneity of ELLs when examining DIF, Ercikan, Roth, Simon, Sandilands, and Lyons-Thomas (2014) examined whether students who spoke French at home may contribute to diversity among linguistic minority groups in Canada. Using the PISA (Program for International Student Assessment) data for reading, science, and math, they found that the consistency of DIF identification ranged between 7% and 10% in separate DIF analyses for the students who speak French at home and those who do not speak French at home, whereas the consistency of DIF identification ranged from 24% to 54% for the combined group. As Ercikan et al. state, “This highlights the methodological problems with investigating measurement comparability for groups with great degrees of population heterogeneity . . .” (p. 283). They also stressed the need to identify the sources of DIF. Their review of the items used in the study suggested that the linguistic load of the item, the vocabulary, and the complexity of sentence structure may be factors that disadvantaged the students. These results are in support of the work of Abedi and his colleagues.
Test accommodations are provided to SWDs and ELLs to address both construct-irrelevant variance and construct underrepresentation. Accommodations help ensure that SWDs and ELLs have full access to the construct the test is measuring and respond in a way that represents their knowledge, skills, and abilities on the intended construct (Tindal & Fuchs, 1999). When studying the efficacy of accommodations, however, the presence of multiple and different accommodations for SWDs and ELLs confounds the results, making it difficult to determine the effects of a particular accommodation (Sireci et al., 2003). Although some recent studies that have examined the efficacy of accommodations included SWDs and ELLs who received only one accommodation, additional studies are needed to examine the effects of accommodations for groups receiving just one accommodation so that the effects can be attributed to a given accommodation. The assessment consortia, PARCC and SBAC, will have computer administration of their tests, requiring additional research on the efficacy of online strategies and accommodations on ELL and SWD performance.
Although the observance of a differential boost can support the use of accommodations for SWDs and ELLs, it can be challenged because better performance of SWDs and ELLs is not equivalent to assessments that provide valid score inferences. Evidence is needed for measurement invariance across ELLs and SWDs so as to make valid score interpretations for individual students as well as for group comparisons. Measurement invariance studies examine the stability of item and test measurement characteristics across groups. An evaluation of the stability of the estimated item parameters across groups is an important initial step for examining measurement invariance. Factorial invariance and other test and item characteristics that help ensure comparability, including the precision of scores, accuracy of classification rates, and the relationship between test scores and other measures, have been examined for SWDs and ELLs.
Because test scores tend to be less precise at the lower end of the score scale and many ELLs and SWDs demonstrate relatively low performance on large-scale assessments (Abedi et al., 2007; Thurlow et al., 2011), studies have obtained lower reliability estimates for ELLs and SWDs. Internal consistency indices, such as coefficient alpha, are commonly used as measures of test score reliability. These indices are affected by restricted ranges in performance. Furthermore, using these indices when some items measure an irrelevant construct in addition to the intended construct for subgroups will lead to lower reliability estimates. A number of factors, including language background, restriction of range, SES, and OTL, may contribute to the observed differences in reliabilities for ELLs (Abedi, 2003).
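The effect of a restricted score range on coefficient alpha can be shown with a very small example. The item responses below are hypothetical and follow an idealized cumulative pattern; the point is only that the same items can yield a lower alpha when the sample is limited to low scorers.

```python
import statistics


def coefficient_alpha(responses):
    """Coefficient alpha for a list of examinee response vectors (rows)."""
    n_items = len(responses[0])
    item_variances = [
        statistics.pvariance([row[i] for row in responses])
        for i in range(n_items)
    ]
    total_variance = statistics.pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses to four dichotomous items (1 = correct).
full_sample = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

# Restricted range: only examinees with total scores of 2 or lower.
low_scorers = [row for row in full_sample if sum(row) <= 2]

alpha_full = coefficient_alpha(full_sample)
alpha_low = coefficient_alpha(low_scorers)
```

In this example alpha drops from .80 in the full sample to about .44 among the low scorers, even though the items and the response pattern are unchanged.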
The comparability of the internal structure of the test has been examined through exploratory factor analyses and CFAs, including multigroup analyses and IRT analyses. DIF and DBF have been used to examine the equivalence of subgroup performance at the item level and at the level of a coherent subset of items. When DIF occurs, it indicates that the item measures some additional construct for one of the subgroups, which negatively affects the validity and comparability of test score interpretations and uses. As indicated by Camilli (1992), DIF can be considered a shift in the distribution of ability along a secondary construct that influences the probability of a correct response. One group may be less able on a secondary construct, such as English reading skills on a science test for ELLs. The use of technology-enhanced items by PARCC and SBAC will require studies examining the extent to which these novel items are measuring the same construct for ELLs and SWDs as compared to the general student population.
There are a number of factors that can have an impact on the validity of the results of invariance analyses, including heterogeneity of samples, small and differing sample sizes, differential guessing rates, nonoverlapping proficiency distributions, and lack of measurement precision. The use of large-scale test data and the evaluation of effect sizes help minimize the impact of sample size in interpreting the results. As previously indicated, nonoverlapping proficiency distributions arise because the SWD or ELL groups have distributions that are centered lower on the score scale than the general population. Consequently, these groups typically have a restricted range in scores. Differences in distributions affect the results of invariance studies in predictable ways, such as easy items being flagged for DIF in favor of the focal group (SWDs or ELLs; Sireci, 2009). In addition, restriction of range can also account for lower reliability estimates and poorer predictive validity evidence for these groups as well as differences in factorial structure. To minimize the effect of differential restriction of range across the groups, the reference group (general population) can be selected to have the same distribution as the SWD or ELL group.
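One way such a matched reference group might be drawn is sketched below: for each total score observed in the focal group, the same number of reference group examinees is sampled at that score. This is an illustrative assumption about the procedure rather than a method taken from the studies cited, and the function name and scores are hypothetical.

```python
import random
from collections import Counter, defaultdict

def match_reference_to_focal(reference_scores, focal_scores, seed=0):
    """Return indices of reference examinees whose score distribution
    mirrors that of the focal group (as far as the reference pool allows)."""
    rng = random.Random(seed)
    # Group reference examinee indices by total score
    by_score = defaultdict(list)
    for idx, score in enumerate(reference_scores):
        by_score[score].append(idx)
    selected = []
    for score, needed in Counter(focal_scores).items():
        pool = by_score.get(score, [])
        # Sample without replacement; take fewer if the pool is too small
        selected.extend(rng.sample(pool, min(needed, len(pool))))
    return sorted(selected)

# Hypothetical total scores
reference = [10, 10, 10, 11, 11, 12, 12, 12, 13, 14, 15, 15]
focal = [10, 10, 11, 12]

chosen = match_reference_to_focal(reference, focal)
matched_scores = [reference[i] for i in chosen]
```

Invariance analyses run on the matched reference sample and the focal group then compare groups with similar score distributions, at the cost of discarding reference group data.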
The finding in some research (e.g., Kato et al., 2009) that results differ by disability category “underscores the importance of recognizing the limitations of treating all students with disabilities as a single homogenous group and suggests that the behavior of students with different kinds of disabilities needs to be examined separately whenever possible” (p. 38). The sample sizes for the focal groups (SWDs and ELLs) tend to be much smaller than the sample size for the reference group, and this may affect the statistical values. As an example, when examining DIF and calculating the difference in R² to compare models, the values are based on the entire sample and may be smaller than when group sizes are approximately equal (Kato et al., 2009).
When forms of tests are equated, measurement invariance also requires that equating functions derived from different subgroups produce the same results if scores are to be equitable. Therefore, when sample size permits, the equating functions for different subgroups should be examined when establishing measurement invariance. If vertical scales are developed to measure student progress, the validity of the vertical scales needs to be examined for ELL and SWD groups. Differences in learning profiles and trajectories for these students suggest that the vertical scale developed for the general population may not be appropriate for these students.
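A simple check of equating invariance can be sketched as follows, assuming a random-groups design and linear equating (neither of which is specified in this chapter). The equating function is derived separately for the total group and for a subgroup, and the equated scores are compared at each raw score point. All names and data are hypothetical.

```python
from statistics import mean, pstdev

def linear_equating(x_scores, y_scores):
    """Return a function that places Form X raw scores on the Form Y scale
    by matching means and standard deviations."""
    mu_x, mu_y = mean(x_scores), mean(y_scores)
    slope = pstdev(y_scores) / pstdev(x_scores)

    def equate(x):
        return mu_y + slope * (x - mu_x)

    return equate

def max_equating_difference(total_fn, subgroup_fn, score_points):
    """Largest absolute gap between two equating functions over the score points."""
    return max(abs(total_fn(x) - subgroup_fn(x)) for x in score_points)

# Hypothetical Form X and Form Y scores for the total group and a subgroup
total_fn = linear_equating([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
subgroup_fn = linear_equating([1, 2, 3], [2, 3, 4])

gap = max_equating_difference(total_fn, subgroup_fn, range(1, 6))
```

A gap near zero would be consistent with equating invariance for that subgroup. Larger gaps would need to be judged against whatever difference is considered practically important on the score scale, and against the sampling error expected with small subgroup samples.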
The tailoring of items to the ability level of the students in computerized adaptive testing (CAT) is attractive in the assessment of SWDs and ELLs because current paper-and-pencil tests tend to target students at the middle of the score scale range, resulting in less precise measurement of SWDs and ELLs who tend to perform at the lower end of the score scale. Another attractive feature of CAT is the potential for the administration of fewer items to reach sufficient measurement precision. SBAC has adopted a CAT system and PARCC will use computer-delivered assessments. Stone and Davey (2011) discussed some of the advantages of using CAT with SWDs, including more precise measurement of SWDs. They also addressed some of the challenges of using CAT, including item response functions differing for SWDs due to accommodation status, divergent learning profiles, less access to computers, and less familiarity with keyboarding. Consequently, SWDs and other subgroups, such as ELLs, should be included in the calibration sample for the item bank, and if not, subgroup analyses are needed to examine the appropriateness of the item parameters for the different groups. Divergent learning profiles may also lead to less precise estimation of SWDs’ and ELLs’ standing on the latent construct. Stone and Davey reiterated the need for the detection of discrepant response patterns for subgroups.
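The precision argument for adaptive item selection can be illustrated with a short sketch. It assumes a two-parameter logistic (2PL) IRT model and maximum-information item selection, which are common choices but are not attributed here to SBAC, PARCC, or Stone and Davey (2011). The item bank and function names are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response (a = discrimination, b = difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at proficiency theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, bank, administered):
    """Pick the unadministered item with the most information at theta."""
    candidates = [item for item in bank if item not in administered]
    return max(candidates, key=lambda item: item_information(theta, *bank[item]))

# Hypothetical item bank: item id -> (a, b)
bank = {"easy": (1.0, -1.5), "medium": (1.0, 0.0), "hard": (1.0, 1.5)}

# An examinee with a low provisional proficiency estimate
next_for_low = select_next_item(-1.4, bank, administered=set())
# Information a mid-targeted item would provide for that same examinee
info_medium_for_low = item_information(-1.4, *bank["medium"])
info_easy_for_low = item_information(-1.4, *bank["easy"])
```

Under these assumptions, the item targeted at the middle of the scale provides less information for the low-performing examinee than the easier item the adaptive rule selects, which is the sense in which a fixed form centered on the middle of the scale measures such students less precisely. The sketch also presumes that the item parameters hold for the subgroup in question, which is the calibration concern raised above.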
Research on the efficacy of models for monitoring change for ELLs and SWDs is needed. Such research is crucial given that the federal Race to the Top initiative calls for measures that assess student progress (U.S. Department of Education, 2010). A number of challenges in including ELLs and SWDs in models evaluating educator effectiveness have been identified, including the number of ELLs and SWDs in classrooms, the inconsistent use of accommodations across years, the mobility of ELLs and SWDs, and the imprecision in measuring students who perform at the lower end of the score scale. The assessment consortia are attempting to address some of these concerns by designing tests that are more accessible for these students, in an attempt to achieve more comparable scores, as well as by designing systems for assigning and monitoring accommodations used by students. Empirical evidence will be needed to establish the extent to which the consortia have achieved their goals in providing a more valid assessment of ELL and SWD groups.
