Introduction
The intended ultimate effect of the mandated assessment of English language learner (ELL) students is their successful academic achievement in U.S. public schools. Information yielded by assessments is used to inform state and federal agencies about the efficacy of states’ efforts to support this achievement and, at the individual student level, is used to inform educators how best to respond to the instructional needs (both linguistic and academic) of their students. 1 Absent any official language planning policy in the United States, currently assessment decisions affecting ELL students operate as de facto language policies in the way that they predominantly privilege English proficiency over the maintenance of minority languages for content learning (Menken, 2008). Specifically, “The language policies currently being created in U.S. schools as a byproduct of testing policy occur in an ad hoc way, without careful language planning” (p. 5). We might go so far as to argue that additionally there is no coordinated testing policy; rather, there exists a collection of somewhat disconnected mandated federal and widely varying state-level regulations along with optional initiatives and recommendations offered to school districts for them to implement. Assessment of ELL students therefore involves many components, with different aspects having different purposes directed from different loci of control (federal, state, local) and based on different sources of funding (National Research Council [NRC], 2011). 2 Consequently, we are reluctant to call the suite of ELL assessments a full-fledged system until greater scrutiny of it has been conducted and a defensible argument made in its favor.
However, with many components interacting with one another, a systems view is still needed to both effectively document progression of ELL student language and content learning and coordinate efforts to support and monitor this learning in more strategic and tactical ways. Indeed, many aspects of ELL assessment in the United States need to be considered equally from federal, state, and local perspectives so that meaningful and comparable practices might be implemented with ELL students wherever they reside in the nation. The current lack of systematicity is particularly pernicious because of the high-stakes nature of ELL testing policies and practices, not just for states, districts, and schools in terms of funding but also at the individual student level in terms of student placement, access to core content, and eligibility for gifted and other advanced programs. We add these concerns to other aspects of ELL assessment that have already been called out for greater capacity building by federal, state, and local authorities in the areas of monitoring current and redesignated ELL students, establishing realistic time lines for language proficiency attainment, and expectations for ELL student academic achievement (Hopkins, Thompson, Linquanti, Hakuta, & August, 2013).
The purpose of this chapter is twofold: (1) to provide a detailed review of current language assessment policies and practices with ELL students under the federal requirements of the No Child Left Behind Act (NCLB; 2001) and relevant research in order to evaluate their technical quality and validity, and (2) to examine the intersection of language assessment and academic content assessment in terms of their purposeful interpretation and use by educators in decision making.
We outline a theory of action suggested by the established policies and practices of the putative ELL assessment system and critique the interpretive claims, uses, and asserted outcomes based on the available evidence. 3 Recommendations are offered for further research that focuses on aligning and improving the disparate parts and purposes of the ELL assessment system to effect positive outcomes in ELL education. The four main components of the system function to discriminate between those students who should and those who should not receive services under Title III of NCLB with its focus on ELL language support. The system begins with the ubiquitous yet nonmandated and widely varying practice of administering a home language survey (HLS) to families to identify students as potential ELL students (see Figure 1).

Figure 1. A Systems Overview of English Language Learner Assessment
The next component in the system is the screening tool and/or placement test used to confirm ELL status and instructional placement for Title III services (e.g., any language programming, such as Structured English Immersion or bilingual education, collectively known as Language Instruction Educational Programs [LIEPs]; see Faulkner-Bond et al., 2012). This is followed by the required annual monitoring of student English language growth and proficiency using a standards-based English language proficiency assessment (ELPA). Here we also include the desirable but professional development–dependent use of formative assessment techniques for instructional decision making in the areas of both language learning and academic content instruction with ELL students. The fourth and final component of the system comprises the reclassification of students as English-proficient and their exit from ELL programming based on state and local district formulas with 2-year monitoring post exit. The entire system as such needs to be evaluated, not just a single component or even several components evaluated independently.
In addition to English language assessment, ELL students are included in academic content assessment on an annual basis under the provisions of the Elementary and Secondary Education Act (ESEA) through its reauthorization in the form of NCLB, and there is no reason not to expect this mandate to continue with the reauthorization of ESEA in the near future. NCLB “challenged states to develop an integrated system of ELP standards, assessments, and objectives that are linked to states’ academic content and student achievement standards set in accordance with other parts of ESEA” (NRC, 2011, p. 6). Specifically, “States now must annually assess ELL students’ progress in becoming English language proficient, and they must include these students in annual assessments in all content areas. The states are being held accountable for demonstrating that ELL students are making progress in learning academic subjects” (NRC, 2011, p. 6).
States must expect that all educators will hold ELL students to high academic content standards. However, when students are being assessed for content knowledge in a language they are still learning, fair and valid (i.e., meaningful) interpretations depend on clear measurement of the construct (e.g., avoiding construct-irrelevant variance caused by measuring language abilities rather than the intended mathematics or science knowledge) and on appropriately implemented testing accommodations.
Describing ELL Assessment Needs: English Language and Content Learning
If language testing is tantamount to language policy in the United States, it behooves us to clearly state here how we have inferred the educational goals of the collection of assessments used with ELL students. The current suite of assessments that ELL students encounter is, for the most part, geared toward the educational goal of having students enter or remain in ELL programming when deemed necessary and exit ELL programming when ready to learn grade-level academic content in English without that support. 4 English language proficiency and academic content assessments should be considered concurrently by those who make educational programming decisions for ELL students (Ragan & Lesaux, 2006).
Theory of Action for an ELL Assessment System
In the past decade of rewards and sanctions associated with the implementation of NCLB (2001), individual states and consortia have developed, validated, and refined their assessment instruments piece by piece. It is only a working assumption that these components interact to produce the overall desired outcome of successful academic achievement for ELL students, although efforts have been made to help states foster a “common interpretive argument” for the validity of their ELL assessment systems. This common interpretative argument is representative of the mutual interpretative claims and assumptions inherent in the theories of action or logic models underlying the specified purposes and goals of the states’ respective ELPA and the broader ELL assessment systems into which they fit (Perie & Forte, 2011). According to Forte, Perie, and Paek (2012), the fundamental validity question is “whether a student who is deemed proficient by an ELPA can successfully function without language supports in academic classes taught in English” (p. 8). By combining a theory of inputs, outputs, and eventual outcomes (Bennett, 2010) with an interpretation/use argument (Kane, 2013), the theory of action we infer from established ELL assessment policies and practices includes a level of specificity for each assessment use (i.e., how data and test scores are used at the federal/state and individual student levels) and outlines the serial connections between the components of an overarching ELL assessment system.
The theory of action represented in Table 1 shows the four main components, each with an output serving as prerequisite for the next component. Cycling through these components culminates in the global assertion that language minority students as a protected class of individuals under federal law (those with home language influences other than or additional to English) can be accurately identified and their English language measured at various junctures to receive (or continue to receive) support services to ensure successful completion of K–12 education under Title III of NCLB. To wit, the ELL assessment system should be designed to identify all potential ELL students, identify who will receive Title III services (either Initial Fluent English Proficient [IFEP], i.e., not eligible for Title III services; or Limited English Proficient [LEP], the terminology of the federal law for those eligible for Title III services), measure progress attained within Title III services (for both accountability as well as instructional purposes), monitor progress in content areas under Title I, 5 measure attainment of proficiency and exit from Title III services, and monitor maintenance of proficiency for exited (former ELL) students.
Table 1. Theory of Action With Embedded Interpretive Claims and Uses of the ELL Assessment System
Note. ELL = English language learner; IFEP = Initial Fluent English Proficient; LEP = Limited English Proficient; FEP = Fluent English Proficient; ELP = English language proficiency; R-FEP = Reclassified Fluent English Proficient.
According to the NRC (2011), the federal government currently allocates Title III funding to states based on the U.S. Census Bureau’s American Community Survey data rather than on counts of students found eligible via ELP screening.
These actions are prerequisite to a two-year R-FEP monitoring system, but we limit our review of the ELL assessment system to students’ reaching this phase.
For federal NCLB reporting, states report totals and percentages for Title III accountability, and these results are used to determine allocation of funding. Thus, the theory of action for federal NCLB reporting is that the ELL assessment system facilitates fair and appropriate allocation of Title III funding, which will in turn enable schools to provide adequate Title III services to all identifiable ELL students with the goal of successful K–12 completion for all. 6 Such allocation of funding may be evaluated in comparison with other states; thus, standardization and comparability are key.
For student-level data use, states report scale scores and levels to schools, and schools use these data to determine Title III placement, continuation, and exit for individual students. Thus, the theory of action for student-level data use is that the ELL assessment system facilitates fair and appropriate allocation of Title III services to all identifiable ELL students with the goal of successful K–12 completion for all. Such allocation of funding may be evaluated in comparison to curricular pathways and outcomes for comparable students, and thus accuracy at the individual level is key. The ELL assessment system stands unique in its reliance on large-scale standardized test data for use in individual educational programming, despite the common agreement that multiple sources of evidence are preferred for high-stakes decision making. 7
A student designated LEP (more widely referred to as ELL) is expected to make progress and subsequently reach a point where other language influences are not impeding English progress, at which point the student is considered ready for redesignation (i.e., reclassified as FEP or commonly R-FEP) and exit from Title III services. It is expected that this exit point is calibrated to ensure students’ placement in English-only classes is successful. Whether it be IFEP, LEP, or FEP, the resultant educational programming is intended to ensure that all students successfully complete their K–12 education. Notably absent from this description of a theory of action, yet integral to student success, are measures of instructional quality and opportunity to learn. This is indicative of a system built piecemeal around a specific funded mandate (Title III) and should be considered when further managing or evaluating the ELL assessment system. 8
It is important to note that the ELL assessments as a whole differ from the assessment systems being developed to measure proficiency on the new content standards (e.g., Common Core State Standards Initiative [CCSS Initiative], 2010a, 2010b; the Next Generation Science Standards [NGSS] Lead States, 2013) in that the data generated by ELL assessments are used to create designations (IFEP, LEP, FEP) that affect educational programming at the individual student level. These designations are used to determine a language minority student’s placement into and out of Title III programming as well as the level (e.g., Level 2 “Advanced Beginner”) and intensity (i.e., frequency and duration) of services given in that instructional program. Thus, by design there are cause-effect relationships among data-generating mechanisms, decision-making routines, and instructional components in accordance with the designations of IFEP, LEP, and FEP, which can make data difficult to interpret. Importantly, data cannot act alone: In fact, data do not act at all. Decision makers who interpret these data, alongside other sources of evidence, are prone to biases and limitations, yet their observant use of data is essential to ongoing evaluation and betterment of the system as a whole. With this conceptual grounding, we turn now to the review of assessment practices more specifically, first for ELP followed by current content assessment practices with ELL students.
English Language Proficiency Assessment
Prior to NCLB, ELP assessments were used for educational programming only. The mandates accompanying NCLB mean that measures of ELP are now largely fashioned for accountability purposes. With the strong emphasis on students reaching proficient levels of competence on the content area standards, there has also been a shift in rhetoric from assessing the English language development of students to assessing students’ English language proficiency for standardized testing purposes. This situation is reflected in some states and consortia adopting English language proficiency standards (e.g., WIDA Consortium, 2004, 2007). The most recent manifestation of an intentional difference in the use of the two terms is seen in the Framework for English Language Proficiency Development Standards (ELPD Framework; Council of Chief State School Officers [CCSSO], 2012). Unfortunately this adoption of both terms in one framework has the potential to add further confusion to the already quixotic nomenclature found in the field of education. We do surmise, however, from the use of both proficiency and development in the title of the framework that proficiency in English is prized but couched in a manner that also emphasizes proficiency developing incrementally over time. In this chapter we use the abbreviation ELD/P when referring to English language standards for students acquiring English as a second or an additional language because states continue with standards that are termed either ELD or ELP. 9
Below we review assessment of ELL students organized by six identifiable assessment uses: (1) initially identifying students who may be potential ELL students; (2) confirming ELL status and eligibility for Title III services as well as placing students within programs; (3) periodic assessment that can be further divided into (a) the much-needed emphasis on classroom-level assessment (i.e., formative assessment) of ELD for ongoing instructional purposes at the point students are acquiring both English and new content area material, (b) the monitoring of annual progress en route to fluent or “full” English proficiency, and (c) ongoing proficiency status placement; and (4) reclassifying students to FEP and their exit from Title III services (e.g., G. García, McKoon, & August, 2006; see also the schematic in NRC, 2011, p. 78). 10
Initial Identification of Potential ELL Students
Much relies on the type and quality of instruments states and districts use to identify the initial pool of students who, with further screening or assessment, may prove eligible for Title III services. To identify potential ELL students, the vast majority of states use some form of survey administered to parents or guardians of students at first enrollment in public education no matter their linguistic background (Bailey & Kelly, 2013; Kindler, 2002; NRC, 2011; Wolf et al., 2008). Such surveys may ask parents to identify which language they consider to be their child’s dominant or primary language, which language their child learned to speak first, and which language they, the parents, use with their child. However, as Bailey and Kelly (2013) point out, “The U.S. Department of Education Office for Civil Rights December 3, 1985 Memorandum and 1991 OCR Policy address the requirement to have a program in place for adequately identifying students in need of services, but recognize that this may differ widely due to student demographics. . . . No wording in these memoranda obligates states to specifically enforce the use of an HLS in order to initially identify students” (p. 799).
Nevertheless, in their review, Bailey and Kelly (2013) identified all but 4 states relying on an HLS to make the initial identification of which students should be in the pool for further English language screening or assessment. The majority (23 states and Washington, D.C.) has created a single HLS form and mandates its use in schools statewide. A further 17 states mandate use of an HLS and have created a sample HLS for districts to adopt or substitute with their own version, and 6 states mandate use of an HLS but have created neither a required nor sample HLS, allowing districts to create their own. The 4 states with no mandated HLS recommend options for districts to follow that include an HLS to identify students for further screening but may also list alternate practices such as the use of existing reading scores on state tests and observational scales.
The theory of action we have adopted allows us to clarify the intended purpose(s) of the initial identification instrument. Such an instrument has one apparent objective: to focus in on the pool of students in the general K–12 population who, by dint of their language minority standing, are most likely in need of the language and academic support services to which they are entitled. This group of potential ELLs can then be screened or assessed further to determine actual need. Frequently it is as informative to state what use an instrument is not intended for as it is to state what it is intended for, and in the case of the HLS it bears emphasizing that it is not intended to measure English “proficiency.” However, a secondary purpose may be to yield information on the students’ language backgrounds that may aid schools and teachers programmatically (i.e., offering bilingual education programs in high-incidence languages) and instructionally (e.g., teaching students strategies for transferring reading skills in a first language to their English reading development). As it stands, the HLS is a poor substitute for instruments that could accomplish these measurement tasks.
Even though HLS use is the most common practice in initial identification, it does not necessarily follow that it is a best practice. Other instruments such as interviews, observational protocols, preschool assessment results, and universal screening, although rarely used, have been proposed as alternatives to or for use in combination with HLS as multiple indicators of potential ELL status (Abedi, 2008; Bailey & Kelly, 2013). Moreover, HLS use is a practice based on little research of the quality of the data HLS yield (Bailey, 2010). In fact, there has been little attention by states and the research community to how tight the initial identification instrument or “net” must be in order to be sure of including all students who may be potential ELL students (Linquanti & Bailey, 2014). Consequently, we are hesitant to single out any given HLS to say it is the “best” HLS for the intended purpose of identifying potential ELL students. Students from a language minority background are an extremely heterogeneous group in terms of actual language proficiency; they may know a lot, a little, or no English and may speak a language other than English entirely, be fluently bilingual, or be English-only speaking (while nevertheless being exposed to a minority language by parents and other family members; Bailey & Kelly, 2013). It is likely that an HLS that incorporates this array of information will be most useful to schools and teachers in terms of program planning (e.g., see the tiered approach taken by the Home Language Identification Survey of the New York City Department of Education to elicit information for screening eligibility, instructional planning, and parent information), but it remains an empirical question, and one that could and should be tested, whether such extensive surveys are more effective at identifying the “right” pool of students for further screening and placement.
We do however get a sense of the type of information that is most relevant for determining which students should be considered for further screening or assessment in a rare natural experiment that presented itself in Arizona in 2009 and was subsequently capitalized on by Goldenberg and Quach (2010). For a short period of time, the state reduced a three-question HLS to ask for just the primary language of the student with parenthetical instructions to interpret primary as the language the student used most often. The focus on language dominance in terms of amount of language usage may lead to the underidentification of students who are potential ELL students because students may neither have received extensive exposure to English despite using it the most often nor have reached a level of proficiency sufficient for content learning in English. Indeed, Goldenberg and Quach were able to calculate before and after effects of the change in Arizona surveys because in two districts they worked with, families had completed both the original three-question survey and the replacement one-question survey. They found that among the students who were potential ELL students 11% of kindergarten students in one district and 18% of the K–5 students in a second district would have initially gone unidentified using the single primary language question. Of these students, the vast majority went on to be assessed and found to be eligible for English language services. Had the one-question HLS been the sole criterion, these students would have remained outside the ELL assessment and services systems indefinitely or until their need for language services became undeniable.
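The contrast between the two survey forms can be made concrete with a small sketch. It is an illustration only: the three questions follow the question types described earlier in this chapter, but the "flag if any answer names a language other than English" rule, the class and function names, and the four sample families are all assumptions made for the example, not Arizona's actual decision rule or data from Goldenberg and Quach (2010).

```python
from dataclasses import dataclass

ENGLISH = "English"


@dataclass(frozen=True)
class HLSResponse:
    """One family's answers to a hypothetical three-question survey."""
    primary_language: str  # language the student uses most often
    first_language: str    # language the student learned to speak first
    home_language: str     # language the parents use with the student


def flag_three_question(r: HLSResponse) -> bool:
    """Assumed rule: refer for screening if ANY answer is not English."""
    answers = (r.primary_language, r.first_language, r.home_language)
    return any(lang != ENGLISH for lang in answers)


def flag_one_question(r: HLSResponse) -> bool:
    """Single-question rule: refer only if the primary language is not English."""
    return r.primary_language != ENGLISH


def missed_by_one_question(responses):
    """Count students flagged by the three-question rule but not by the single question."""
    flagged = [r for r in responses if flag_three_question(r)]
    missed = [r for r in flagged if not flag_one_question(r)]
    return len(missed), len(flagged)


# Made-up responses, for illustration only.
responses = [
    HLSResponse("Spanish", "Spanish", "Spanish"),
    HLSResponse("English", "Spanish", "Spanish"),     # uses English most often
    HLSResponse("English", "English", "English"),
    HLSResponse("English", "English", "Vietnamese"),  # home exposure only
]

missed, flagged = missed_by_one_question(responses)
print(f"{missed} of {flagged} potential ELL students missed by the one-question form")
```

In this toy sample the second and fourth students report English as the language used most often, so the single-question form never refers them for screening even though their other answers indicate a language minority background.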
Bailey and Kelly (2013) have attempted to address issues of the technical quality of and practices with HLS by reviewing available empirical research and describing potentially relevant constructs to be included and validated with future HLS (e.g., current language dominance and current language exposure contexts, exposure histories, and degree of literacy in the first language that may signal academic readiness). In light of this review, and recent work by Palermo et al. (2013) at the preschool level that suggests exposure to English at home and with peers is associated with Spanish-speaking children’s English receptive and expressive vocabulary skills, Linquanti and Bailey (2014, p. 3) have articulated constructs for future HLS development in terms of degree of relevance for a student’s current language use:
Essential constructs: for example, student’s current language(s) spoken, frequency of English language use (by student), frequency of English language exposure (provided by parents, peers, others)
Associated constructs: for example, languages spoken among adults to one another in the home, history of student’s language environment such as first language spoken, use of other language(s); years in U.S. schooling, literacy skills in all language(s)
Irrelevant constructs: for example, country of origin, information that may have no bearing on a child’s current language use
With the constructs delineated in such a salient fashion, the usefulness of the information solicited by any given HLS should be easier to evaluate, but the composition of HLS based on these constructs suggested in the literature awaits empirical testing. A further and equally important contextual factor addressed by Linquanti and Bailey (2014) is the potential for an inconsistently or unfairly administered HLS to also undermine the validity of HLS interpretation. The HLS and its purposes need to be thoroughly explained to parents. The HLS needs to be completed by parents (not by administrators’ best guesses) receiving the survey in their native languages or given access to interpreters in cases where they may not be literate in their first language or English. Later we make specific recommendations calling for fundamental research focused on establishing the validity of HLS based on reasoned taxonomies of constructs and principled administrative procedures. Validity here is primarily concerned with devising an HLS that will not misidentify those students who if screened further would be found eligible for Title III services. Missing a screening opportunity may bring an even higher cost later on for those students who need services yet slip through the net; certainly there could be a time lag before classroom teachers refer students they observe struggling to appropriate ELL screening (see Bailey, 2011, for discussion of such a teacher safety net). The ELL assessment system needs to be set to avoid such issues. State and local administrators can err on the side of overidentifying students knowing that immediate further screening should remedy any misidentification of students who do not need services but safely include all those who do. We turn next to what happens to those students who rightly or wrongly are identified by the HLS as potential ELL students.
ELL Status Eligibility and Initial Placement
States have the option to choose one or more assessments or screeners to follow the initial identification stage to determine which students are indeed eligible for Title III services. According to the NRC (2011), some states currently administer the same ELPA used for annual testing (e.g., California and Connecticut) whereas most other states use something briefer, typically a placement test or a screener. These measures are meant to provide information about a student’s level of English proficiency in four domains (speaking, listening, reading, and writing) and differ in test length, item types, and content alignment (see NRC, 2011, for a review). According to the NRC, a majority of states mandate the use of a single test, either a screener (27 states) or their own ELPA (4 states). The remaining states allow districts to choose between their ELPA and a screener (2 states) or to select their own screener (17 states) although a list is generally provided (NRC, 2011, p. 84). Standard-setting procedures applied to these assessments determine a twofold educational placement: Students who meet or exceed the standard are designated IFEP, and receive instruction with non-ELL students, and students who fail to meet the standard are considered eligible for Title III services, designated as LEP, and provided with that programming.
Standard-setting procedures and decision rules applied to one or more sources of data should take into account the implications of Type I and Type II error for both groups of students and conduct validation studies accordingly. These decision rules set into motion the trajectory of instruction provided to students. Perhaps as a consequence of the legal effort that preceded the Title III mandate (Hakuta, 2011), assessments (and their cutoffs, or standards) have been created to place all eligible language minority students in Title III programming. Yet concern over the growing number of students spending extended time in Title III, called “long-term ELLs,” has researchers asking how we place students out of Title III (Gándara, Rumberger, Maxwell-Jolly, & Callahan, 2003; Kim & Herman, 2010; Olsen, 2010; Robinson, 2011). Overdesignation of students in initial placement is a plausible, yet rarely mentioned, antecedent to long-term ELL schooling experience and its consequences, such as lower levels of school persistence and less access to college (Kim, 2011). Such incorrectly identified students may not quickly qualify for reclassification in subsequent assessment because reclassification is not automatically enacted at the district or school level for every grade K–12. For example, a study of reclassification in California found that nearly half or more of all districts report that they do not reclassify in the early grades (K–2): About 30 percent of districts report permitting reclassification in kindergarten, 47 percent in grade 1, and 54 percent in grade 2. (Hill, Weston, & Hayes, 2014, p. 28)
There is some evidence that districts that do not reclassify prior to third grade have lower overall rates of reclassification (Parrish et al., 2006), lending credence to concerns that ready-to-be reclassified students in early elementary grades may experience deleterious effects from remaining in Title III programming. Studies have found a relationship between protracted time designated as ELL and lack of school persistence, including dropping out of high school (e.g., Kim, 2011). However, the relationship among protracted ELL status; the type, quantity, and quality of instruction received; and academic outcomes is likely to vary at state, district, and school levels and would be best interpreted by mixed-method studies, an approach that is seldom used in this field. In reclassification at the higher grades, the exit criteria are far more extensive and challenging than the initial placement criterion, requiring a student to reach a threshold performance in ELA and in some instances additional content areas (see below). In the case of the inverse situation of underidentification, safeguards to protect against missing a student who should receive Title III services vary at state and local levels (e.g., teacher referral), and to date, no systematic review of such practices has been conducted to our knowledge.
Setting the IFEP standard depends on the state. According to available information (NRC, 2011), the classification as IFEP is based on one of the following: (1) a single score (e.g., screener), (2) a composite score (e.g., ELPA), or (3) an aggregated set of scores (e.g., ELPA and screener). Making a high-stakes decision based on one score is strongly discouraged (see Standard 12.10, American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education, 2014). 11 Just adding multiple sources of evidence, however, does not guarantee the benefit of enhanced overall validity. When one decision rule determines two educational trajectories (eligible for or exempt from Title III services), the management of Type I and Type II error rates (false positive and false negative, respectively) is more difficult. How multiple sources are prioritized and aggregated is equally important because decision accuracy can be severely attenuated by the number of measures (Mosier, 1943) and choice of decision rule (Abedi, 2004).
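Why the number of measures and the choice of decision rule matter can be illustrated with a deliberately simple probability sketch. It assumes that each measure classifies a student independently and with the same per-measure pass probability; these are simplifying assumptions for illustration and do not reproduce the analyses of Mosier (1943) or Abedi (2004). The function names and the probabilities .90 and .10 are hypothetical.

```python
def conjunctive_pass_rate(p_pass: float, k: int) -> float:
    """Probability of passing ALL k measures (student must meet every standard)."""
    return p_pass ** k


def disjunctive_pass_rate(p_pass: float, k: int) -> float:
    """Probability of passing AT LEAST ONE of k measures."""
    return 1.0 - (1.0 - p_pass) ** k


# Hypothetical per-measure pass probabilities, assumed independent across measures.
P_PROFICIENT = 0.90      # a truly English-proficient student passes a given measure
P_NOT_PROFICIENT = 0.10  # a student who truly needs services passes a given measure

for k in range(1, 5):
    print(
        f"k={k}: "
        f"conjunctive IFEP rate, proficient = {conjunctive_pass_rate(P_PROFICIENT, k):.3f}; "
        f"conjunctive IFEP rate, needs services = {conjunctive_pass_rate(P_NOT_PROFICIENT, k):.4f}; "
        f"disjunctive IFEP rate, needs services = {disjunctive_pass_rate(P_NOT_PROFICIENT, k):.3f}"
    )
```

Under these assumptions, requiring a student to pass every measure drives down the chance that a student who needs services is wrongly designated IFEP, but it also lowers the chance that a truly proficient student is correctly designated IFEP (from .90 with one measure to about .73 with three). A rule that accepts any one passing score moves both rates in the opposite direction. Adding measures therefore trades one type of error for the other rather than reducing both.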
For criterion-referenced tests, standard-setting techniques are absolute (comparing to the standard), so the cut score is set to minimize either false negatives or false positives. In the context of ELL status eligibility, we consider “false negatives” to be students wrongly identified as LEP when actually IFEP and “false positives” to be students wrongly identified as IFEP when actually LEP. To minimize “false positives” and try to ensure Title III services for all eligible children, standards may be set at a high level to make the screener difficult to pass (e.g., using a response probability [RP] = .80 standard in the Bookmark method; Mitzel, Lewis, Patz, & Green, 2001). Casting this wide net, however, carries the risk of overidentifying English-proficient language minority students as ELLs. Federal law also requires that we minimize “false negatives” to protect children from inappropriate placement in LEP instruction, especially in cases where English is one of the child’s native languages (NCLB, 2001; Title VII, ESEA of 1965; The Civil Rights Act of 1964).
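The trade-off described above can be illustrated with a minimal sketch. The scores, true statuses, cut scores, and function name below are all hypothetical and are not drawn from any state screener. The sketch only shows, under the definitions given in this section, how raising a cut score shifts errors from false positives (LEP students designated IFEP) to false negatives (IFEP students designated LEP).

```python
# Hypothetical illustration only: scores, statuses, and cut scores are invented.
# Each tuple is (screener_score, truly_proficient).
students = [
    (52, False), (58, False), (61, False), (64, True), (66, False),
    (69, True), (71, True), (74, False), (77, True), (83, True),
]

def count_errors(students, cut_score):
    """Return (false_positives, false_negatives) for a given cut score.

    False positive: designated IFEP (score >= cut) but actually LEP.
    False negative: designated LEP (score < cut) but actually IFEP.
    """
    false_pos = sum(1 for score, proficient in students
                    if score >= cut_score and not proficient)
    false_neg = sum(1 for score, proficient in students
                    if score < cut_score and proficient)
    return false_pos, false_neg

lenient = count_errors(students, cut_score=65)  # lower, "readiness"-style cut
strict = count_errors(students, cut_score=75)   # higher, harder-to-pass cut
print("cut=65 (false positives, false negatives):", lenient)
print("cut=75 (false positives, false negatives):", strict)
```

With these invented data, the lower cut yields two false positives and one false negative, whereas the higher cut yields no false positives and three false negatives. No single cut score eliminates both kinds of error at once.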
To err on the side of minimizing false negatives, standards could be set to a “readiness” cutoff (e.g., RP = .67 in the Bookmark method). It is up to each state to negotiate the standard that balances these known risks, thus determining the state (or local) definition 12 of what constitutes eligibility for, or exemption from, Title III programming. This standard-setting choice is especially impactful in states where the majority of initial placements happen prior to third grade. Title III requires two sources of standardized testing data to determine “proficiency,” an idea adopted by states to determine program exit. This means that students in some states have to wait for second-grade standardized testing data, often delivered to districts in the summer prior to third grade, as the second source of data. Thus, students placed in Title III services in kindergarten could be retained in services for 3 years until second-grade state standards-based test data are available for use in redesignation and reclassification. Discussion of time spent in programs designed for ELL students when students may be wrongly identified is not intended to imply that these programs are inferior in terms of access to academic content instruction; rather, it is meant to illustrate yet another example of how federal accountability has affected ELL testing and use of test data. Furthermore, it calls into question whether any placement test used at kindergarten is fit for the task of determining 3-year ELL placement, especially as some studies suggest the limited reliability of kindergarten ELP assessments for even a 1-year placement (e.g., the MI-ELP assessment in Michigan: Winke, 2011; the California English Language Development Test [CELDT] in California: García Bedolla & Rodriguez, 2011).
Language minority students are also subject to differing degrees of reliability in their designation and placement depending on their state, their grade at enrollment, the screener or placement test they are administered, and the cutoff standard used in that academic year. For example, a language minority student entering kindergarten in Illinois would be given the MODEL-K (WIDA Consortium, 2011), which is used for initial placement and consists of two subdomains naturally weighted: listening (50%) and speaking (50%). Illinois, like each state using the MODEL-K, sets its own cutoff score for IFEP placement, and this standard is subject to regular adjustments (“standards reconsideration”). That same language minority student entering kindergarten in Texas would be administered an Oral Language Proficiency Test measuring listening and speaking only but in California would be administered the CELDT, which includes reading and writing tests weighted as follows: listening (45%), speaking (45%), reading (5%), and writing (5%). The differing likelihoods of IFEP designation based on just these factors have yet to be documented. California statewide data suggest that the percentage of IFEP designations has decreased since the inclusion of reading and writing at the kindergarten level, but this trend is difficult to interpret without further evidence. In addition, the extent to which standard setting is affecting fair placement is as yet unknown.
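To make the effect of domain weighting concrete, the following sketch computes a weighted composite under the two weighting schemes described above (50/50 listening and speaking; 45/45/5/5 across four domains). The domain scores are invented and are assumed to sit on a common 0 to 100 scale, which actual screeners need not share; the function name is hypothetical, and the sketch does not reproduce any test vendor's scoring procedure.

```python
# Weighting schemes as described in the text; the scores below are hypothetical.
ORAL_ONLY = {"listening": 0.50, "speaking": 0.50}
FOUR_DOMAIN = {"listening": 0.45, "speaking": 0.45, "reading": 0.05, "writing": 0.05}

def weighted_composite(scores, weights):
    """Weighted sum of domain scores; the weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[domain] * w for domain, w in weights.items())

# An invented kindergartener with strong oral skills and emerging literacy.
scores = {"listening": 80, "speaking": 70, "reading": 20, "writing": 10}

oral = weighted_composite(scores, ORAL_ONLY)
four = weighted_composite(scores, FOUR_DOMAIN)
print(round(oral, 2), round(four, 2))
```

Under these invented scores, the same student's composite falls from 75 to 69 once reading and writing are included, a difference that would matter whenever a cut score lay between the two values.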
The new content standards assessments are not likely to affect initial placement practices. However, if a student is transferring from one district or state to another, it is hoped that the new content standards assessments will in fact aid in appropriate ELL placement. With a common measure of achievement in English language arts, for example, it is hoped that the information a student brings to a new district will be more easily interpreted. Crossing state boundaries and encountering a different ELA assessment, however, continue to present challenges for equivalency in the placement of ELL students (Linquanti & Cook, 2013). This situation is not likely to be completely remedied by establishing two new content assessment consortia to which different states can belong, even with the intended linking efforts between the two consortia. Once placement is established, students will receive instruction (e.g., mainstream plus Title III services), and their English language progress will be measured in various ways and for various purposes as their language learning is assumed to progress.
Monitoring English Language Progress
We divide the use of ELL assessment for monitoring progress into three subsections: (a) assessment for ongoing language instruction, (b) monitoring annual progress, and (c) assessing for ongoing proficiency status and placement.
Assessment for language instruction
Once students are placed in Title III services, teachers are expected to continually monitor the progress of their students’ English language and literacy development. Assessments operate at either a macro level or a micro level in terms of the length of the period covered and the level of detail obtained about learning (Black, Wilson, & Yao, 2011). Although NCLB requires the annual assessment of English language proficiency based on ELD/P standards (see the next section), the inferences drawn from the results of these macro-level assessments may be inadequate for instructional purposes. The language content sampled in such large-scale, standardized assessments cannot provide sufficient information to teachers about any one specific language skill or knowledge of their students (Durán, 2008).
From an assessment use perspective, such assessments were not designed for, and hence cannot be reliably used for, such purposes. For understanding and appropriately responding to student learning in more incremental ways, teachers additionally need a continual flow of information about their students’ language abilities at the micro level. This type of assessment is typically referred to as formative assessment, but this term covers a wide spectrum of approaches and may best be described as assessment used for learning rather than of learning (Black & Wiliam, 1998). Formative assessment has been used to describe the close monitoring that may take place during the act of instruction itself (e.g., Bailey & Heritage, 2008; Bailey, Heritage, & Butler, 2014; Durán, 2008; Heritage, 2010), or the more deliberate collection of information on student performances obtained after instruction has taken place (e.g., teacher-generated classroom assessments, Abedi, 2009, 2010; Llosa, 2008).
The former approach to formative assessment takes place through observing student-to-student talk, regularly conferencing with or interviewing students about their learning, or taking note of the ways in which students formulate their questions, for example, and allows teachers to most flexibly decide suitable next-steps instruction (Torrance & Pryor, 1998). Moreover, formative assessment in a moment-to-moment or “proximal” interactive manner (Erickson, 2007) provides teachers with a way to make contingent instructional responses to students’ immediate learning needs (Heritage & Heritage, 2011; Heritage, Walqui, & Linquanti, 2013) and, we might argue, is perhaps the most challenging and skillful application of the formative approach to assessment. This application of formative assessment may be especially effective for addressing the learning and assessment needs of ELL students; research shows that the immediate feedback that formative assessment can offer is most effective with students with less background knowledge and lower academic performance profiles (McMillan, 2010). Coleman and Goldenberg (2010), in a review of effective practices with ELL students, found that schools and districts that included regular assessment used to inform instruction (along with sustained and coherent leadership, learning goals, consistent curricula, professional development, and ongoing support and supervision) reported higher academic achievement in ELL students.
A key construct concern with assessment for instruction is defining the necessary English language skills and knowledge that are predictably needed for academic achievement (Bailey, 2007). Features of language used in academic settings, including teacher talk, texts, tests, and standards documents, have been analyzed. Researchers have noted commonalities and differences in vocabulary, sentence structures, and language functions (e.g., explanations, descriptions) across different content areas (e.g., Bailey, Butler, Stevens, & Lord, 2007), as well as differences between language use in academic and conversational settings in terms of degree of contextualization and formality (e.g., Snow & Uccelli, 2009). One recent approach has been to identify the “Key Practices and Disciplinary Core Ideas” in the new content standards and the receptive and productive language functions that likely will be required to carry out these practices (ELPD Framework, CCSSO, 2012). For example, in the area of CCSS English language arts/literacy, students are expected to engage in the following practices: “construct valid arguments from evidence” and “critique the reasoning of others,” and the ELPD Framework postulates what language functions students may need in order to carry out these practices such as “Comprehend oral and written classroom discourse about argumentation” and “Providing explanation of an argument through the logical presentation of its steps” (pp. 14–15).
This approach includes high-level descriptions of language uses rather than attempting to specify discrete language structures that provide a foundation for language. Indeed, the following example that we have taken from Kindergarten Earth’s System Standards (K-ESS) of the NGSS (2013) demonstrates the complexities of identifying the English language inherent in the new content standards.
K-ESS2-1. Construct an argument supported by evidence for how plants and animals (including humans) can change the environment to meet their needs.
To successfully meet this standard, kindergarteners are expected to engage in argument (presumably orally) using the language structures and related vocabulary that will allow them to construct an argument with evidence to support a specific scientific claim. To meet the demands of NGSS K-ESS2-1 linguistically, young students will need the discourse skills to first state a claim and provide evidence (descriptive statements) for or against the claim in order to make an argument that either supports the claim or refutes the claim. At a minimum, they will require knowledge of cause and effect vocabulary and sentence structures (e.g., because, if . . . then) and will likely need control of modal verbs to express their stance and the more subtle relations between plants, animals, and the environment (e.g., Tree roots could, should, would be in search of water . . .).
Characterizing the inherent language demands (at the word, sentence, and discourse levels) of the new academic content standards, although challenging, will be necessary to support instructional practices and align ELL assessments (both classroom and large-scale) to the new academic content standards (Bailey & Wolf, 2012). One such attempt has been to build language learning progressions, analogous to the content progressions developed for science and mathematics learning. The Dynamic Language Learning Progressions Project (Bailey & Heritage, 2014) has collected data on student oral and written language in grades K–6 to create empirically based trajectories of student language development in the contexts of explanations of personal routines and mathematical problem-solving tasks. With such progressions, teachers will have access to a features analysis of audio- and text-based samples that gives them the capability of formulating customizable learning progressions taking into account a wide array of information on the linguistic and personal background characteristics of students. To date, educators have had no such system to guide them in the formative assessment of student language growth and next-steps instructional decision making.
Finally, it should also be stressed that many students who are acquiring two languages simultaneously may also be receiving content instruction in both English and another language. To present an accurate profile of their language (first and second) and content learning, these dual-language learners will therefore require ongoing assessment both of and in their two languages (e.g., Valdés & Figueroa, 1995).
Annual progress monitoring
Under NCLB, Annual Measurable Achievement Objective (AMAO) 1 is intended to measure students’ annual progress in learning English. States set targets for growth that are expected of ELL students in the domains of listening, speaking, reading, writing, and comprehension (typically a composite of the listening and reading subtests of the state ELP assessment). Measuring this objective is a key intended purpose of the ELD/P standards-based assessment that each state is required to administer to ELL students on an annual basis under Title III. All but seven states belong to a consortium that currently has a test (the ACCESS for ELLs of the WIDA Consortium) or is developing an ELP assessment for annual assessment (the ELPA21 Consortium). The remaining states have developed their own (the CELDT in California, the New York State English as a Second Language Achievement Test in New York, the Texas English Language Proficiency Assessment System in Texas, the Idaho Language Proficiency Assessment in Idaho), have modified an existing commercial assessment (the Arizona English Language Learner Assessment in Arizona), or use a commercially available assessment (the LAS Links, Connecticut and Indiana).
Growth can be measured in increments of proficiency levels with students required to show an increase in at least one level per annum. Using baseline data from the earliest years of a state’s AMAO ELP assessment cohort data, targets are set for the expected number of ELL students to meet the annual target in any given year. Myriad attendant issues are raised, such as knowing what increments of growth are reasonable to set and whether the target number of students meeting the desired rate of progress should be set higher as states become familiar with meeting the instructional needs of ELL students (see Mayer, 2007) and the psychometric considerations thereof (e.g., Cook, Boals, Wilmes, & Santos, 2008; Kenyon, MacGregor, Li, & Cook, 2011; see also Boals, Kenyon, Blair, Cranley, Wilmes, & Wright, 2015, this volume).
In the remainder of this section, we focus on two key issues affecting the measurement of language growth: (1) the alignment of ELP assessments to state ELD/P standards and relatedly the adequacy of ELD/P standards to capture the relevant language demands of the academic context, and (2) accurately understanding and measuring meaningful growth in English proficiency.
We start with the premise that, inasmuch as ELP assessments are well aligned with ELD/P standards and curricula, their utility in measuring the language necessary to succeed in academic content classes and on content tests is tied to how well the ELD/P standards themselves reflect the tasks, activities, and knowledge of the academic content areas. ELP assessments have undergone considerable revision in recent years to create test items that reflect the academic uses of language at the K–12 level; they were modified or newly created first to reflect the inherent language of academic content standards required under Title I of NCLB and then more recently to reflect the more overt emphasis on the communication of content knowledge found in the CCSS and the NGSS (see Frantz, Bailey, Starr, & Perea, in press, for a recent review of these developments). Alignment between new ELP assessments and new ELD/P standards should be made tenable with the existence of guiding frameworks such as the ELPD Framework developed by language test developers and educational linguists working in tandem (CCSSO, 2012). Earlier ELD/P standards were derived from either ELA standards or hypothetical tasks possibly encountered in content areas rather than from academic content standards (for review, see Bailey & Huang, 2011). In contrast, the ELPD Framework and the ELPA21 Consortium ELP standards informed by it, as well as the WIDA Consortium ELD standards and the independently created standards of states (Arizona, California, Connecticut, New York, and Texas), have all included a focus on the academic contexts of the new content standards in which presumably English language develops for the majority of ELL students.
However, less attention has been given to the language content in the new ELD/P standards, namely, specific features of English (e.g., word-, sentence-, and discourse-level features) and how they develop over time as a result of instruction and experience (Bailey & Heritage, 2014). To date, the education field does not have good descriptions of the trajectories of English language growth in both monolingual English-speaking students and students learning English as a second or additional language (Hoff, 2013). Developmental trajectories or learning progressions are important because they can provide the specificity necessary to guide the kinds of language learning that should occur en route to proficiency in English. Empirically derived language learning progressions based on authentic examples of the features of student language performance at different developmental points could change assessment design by complementing the standards in identifying key characteristics for interim and summative assessment. In the area of formative assessment, they can provide the interpretative framework by which teachers can learn to draw inferences about an ELL student’s current language status during a variety of ongoing activity settings in the classroom (Bailey & Heritage, 2014).
Second, in terms of understanding and accurately measuring language growth, Bailey and Heritage (2014) have attempted to explicate the different ways in which language might be expected to progress in terms of amount, quality, rate, and order of development, and depth and breadth of repertoires, relative to the myriad instructional and experiential characteristics pertaining to students’ backgrounds. Detailed studies of the growth of language in K–12 student populations taking account of different student dimensions are still necessary if such studies are to inform ELP assessment development. Indeed, a recent study of adolescent ELL students suggests that different rates of English language growth do occur as a function of student background characteristics. Specifically, foreign-born ELL students arriving at ninth grade in U.S. schools acquired English proficiency at faster rates than their U.S.-born peers to attain the same levels of English proficiency by the end of high school (Slama, 2012). The same study highlights how U.S.-born ELL students have been on an enormously delayed trajectory to proficiency having already received Title III services for more than 9 years by the start of the study.
Given that the ELP assessment is used to capture annual growth for accountability purposes, in addition to being the instrument for ascertaining English proficient status, and in some circumstances also the instrument of screening and placement (e.g., the CELDT as it is currently used in California, and optionally LAS Links in Connecticut), serious doubts have been raised as to whether one assessment can realistically be expected to fulfill all of these tasks adequately. However, if an assessment were aligned to an underlying learning progression for English language, then it arguably should be able to serve each of these different purposes. In the absence of such test design and development, multipurpose ELP assessments have been criticized. For example, Stokes-Guinan and Goldenberg (2010) examined the use of the CELDT for multiple purposes including monitoring growth and concluded that it is neither valid nor reliable for measuring individual student growth. This ELP assessment was not originally designed to measure individual student growth year on year. Even with the creation of a common scale to compensate for the lack of vertical scaling needed for reporting yearly growth across the different forms of the test for five different grade spans, the authors caution against its use for reporting the language proficiency growth of individual students, because proficiency level designations were as high as 60% inaccurate in any given year. At best it was considered likely sufficiently valid and reliable for interpreting the growth in English proficiency of groups of students.
Stokes-Guinan and Goldenberg (2010), among many others, advise that rather than rely on one assessment for multiple purposes, educators should be relying on multiple assessments for one purpose to counteract the adverse impact of any one poorly devised test. It bears noting that the consideration of multiple sources of evidence in an “all evidence pass” approach (conjunctive) could produce even greater inaccuracy as the decision is based on the least reliable component, as opposed to a “some evidence pass” compensatory approach. Thus the choice of decision rule when aggregating evidence is also crucial (Carroll & Bailey, 2013a, 2013b).
Ongoing proficiency status and placement
Use of a state ELP assessment is also mandated through NCLB in AMAO 2 to account for the number of students attaining a level of English that is deemed “proficient.” The state ELP assessment is also used to indicate readiness to exit Title III services, but in this latter instance it is typically one among other measures depending on the state and district (see also the next section). It is also used to determine continuation in LIEPs, as well as to determine level of placement within those programs. The levels of proficiency determined by an ELP assessment can be the only source of data used to determine programming for students classified at Levels 1 to 5 13 who have not yet reached the cut point for English language “proficient.” These levels should determine to what extent the programming a student receives will differ in intensity, frequency, and curriculum (see Estrada, 2014; Estrada & Wang, 2013, for an example of these curricular streams).
In terms of determining the cut point for English language “proficient” (signaling readiness to exit Title III services on this one indicator), each state determines the way the four subdomains (listening, speaking, reading, and writing) are combined and where the final standard is set. There are four general decision rules (Carroll, 2012; Chester, 2003; Wise, 2011):
Conjunctive: all indicator pass
Compensatory: some indicator pass
Complementary: either/or indicator pass
Mixed: combination of models, for example, conjunctive-compensatory
Choice of model depends on the type of error one wishes to avoid. Scores that fall close to cut points are at high risk for classification error due to measurement error (AERA et al., 2014), so for ELP assessments, compensatory models can be used to increase the reliability of the overall classification. In a state using two or more sources of evidence in a conjunctive approach, the entire decision may hinge on the measurement precision of the least reliable evidence. It is then that opt-in and opt-out safeguards, such as a parent’s right to waive Title III services, protect the rights of individual students.
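The four decision rules listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than any state's actual procedure: the cut points and scores are invented, the compensatory rule is operationalized here as a comparison of mean score to mean cut point (one of several possible operationalizations), and the mixed rule's `tolerance` parameter is hypothetical.

```python
# Hypothetical cut points on an invented common scale.
CUTS = {"listening": 60, "speaking": 60, "reading": 60, "writing": 60}

def conjunctive(scores, cuts):
    """All indicators must meet their own cut point."""
    return all(scores[d] >= cuts[d] for d in cuts)

def compensatory(scores, cuts):
    """Strength in one domain may offset weakness in another.

    Assumption: operationalized as mean score >= mean cut point.
    """
    mean_score = sum(scores[d] for d in cuts) / len(cuts)
    mean_cut = sum(cuts.values()) / len(cuts)
    return mean_score >= mean_cut

def complementary(scores, cuts):
    """Meeting any one indicator is sufficient (either/or)."""
    return any(scores[d] >= cuts[d] for d in cuts)

def mixed(scores, cuts, tolerance=5):
    """Compensatory overall, but no domain may fall more than
    `tolerance` points below its cut (a hypothetical floor)."""
    within_floor = all(scores[d] >= cuts[d] - tolerance for d in cuts)
    return compensatory(scores, cuts) and within_floor

# Invented student who narrowly misses the writing cut point.
student = {"listening": 72, "speaking": 68, "reading": 63, "writing": 57}

results = {rule.__name__: rule(student, CUTS)
           for rule in (conjunctive, compensatory, complementary, mixed)}
print(results)
```

For this invented student, only the conjunctive rule returns a "nonproficient" classification, because the decision hinges entirely on the single subdomain score that falls three points below its cut.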
Although classification validity is routinely reported by test vendors in terms of decision accuracy or consistency, empirical evidence is rarely used to justify the use of a certain classification model or scheme. Empirical studies that have examined models applied to ELP assessments have found that conjunctive classification models can underidentify English language proficient cases (Carroll, 2012; Carroll & Bailey, 2013a, 2013b), yet are preferred to minimize false positives. 14 Evidence suggests that compensatory models are preferable for ELP assessments as subdomain tests (listening, speaking, reading, and writing) have few items, are highly correlated, and often have cut points set at or near the mean, increasing the likelihood of erroneous interpretation (Carroll & Bailey, 2013b). Compensatory models may allow undesirable “uneven” proficiency profiles where one subdomain score is dramatically lower than others. There are contexts where compensatory models have been avoided for this reason (see Abedi, 2004; Clauser, Clyman, Margolis, & Ross, 1996; Hambleton, Jaeger, Plake, & Mills, 2000; Hambleton & Pitoniak, 2006). The actual likelihood of uneven profiles in an ELP assessment context, however, has rarely been investigated. One study by Carroll and Bailey (2012) reanalyzed fifth-grade student ELP assessment performance data (N = 875) using conjunctive and compensatory models. The authors examined the prevalence of uneven profiles in comparison to false negatives (identification of proficient students as nonproficient) by model. The conjunctive model produced no uneven profiles yet a high number of potential false negatives (n = 400), whereas the compensatory model produced some uneven profiles (n = 15) yet far fewer potential false negatives (n = 162). This finding suggests the need for a well-reasoned model choice to mitigate overidentification and underidentification rather than avoidance of models that may yield uneven profiles.
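The counts reported above can be expressed as proportions of the sample for easier comparison. The percentages computed below are derived here from the reported counts and N; they are not figures reported by the study's authors, and the variable names are ours.

```python
# Counts as reported in the text (N = 875 fifth-grade students).
N = 875
counts = {
    "conjunctive": {"uneven_profiles": 0, "potential_false_negatives": 400},
    "compensatory": {"uneven_profiles": 15, "potential_false_negatives": 162},
}

# Convert each count to a percentage of the sample, rounded to one decimal.
rates = {
    model: {outcome: round(100 * n / N, 1) for outcome, n in by_outcome.items()}
    for model, by_outcome in counts.items()
}

# Difference in potential false negatives between the two models.
fewer_false_negatives = (counts["conjunctive"]["potential_false_negatives"]
                         - counts["compensatory"]["potential_false_negatives"])

for model, by_outcome in rates.items():
    print(model, by_outcome)
print("fewer potential false negatives under compensatory:", fewer_false_negatives)
```

On these figures, the compensatory model trades uneven profiles for about 1.7% of students against a drop in potential false negatives from about 45.7% to about 18.5% of the sample (238 fewer students).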
Standard setting for overall performance standards (e.g., overall Levels 1–5) also determines the intensity and duration of educational programming (see Florez, 2012) and is subject to choice of standard-setting procedures (e.g., Bookmark method; Mitzel et al., 2001) and the alignment and/or vertical scaling techniques applied to test forms within the K–12 testing system (Kenyon et al., 2011). As ELP assessment forms are created in grade span clusters (e.g., K, 1–2, 3–5, 6–8, 9–12), standard-setting and alignment procedures are constrained within the assumptions of how students will perform within and between test forms (e.g., a test developer will set a cut point for the fourth-grade higher than the third-grade cut point but lower than the fifth-grade cut point). As ELL students are heterogeneous in many ways that may defy these assumptions (Durán, 2008), more evidence needs to be gathered, preferably from assessments other than ELP assessments, to verify these procedures.
Reclassification
The final component in the ELL assessment system is reclassification, which determines which students will exit from, or remain in, Title III services. Although effective classification is the goal, misclassification in either direction can be detrimental to student achievement. Instances of students being designated R-FEP and exiting Title III too soon have been the subject of federal oversight, and students reclassified during the early elementary grades have been found to experience academic difficulties later on, showing declining performances on standards-based assessments of mathematics and English language arts by fifth grade compared both to their earlier performances and to their non-ELL fifth-grade peers (Slama, 2014). Concern has also mounted for those who may exit too late—misclassification that leads to continuation in Title III services. Although such placements may not prevent access to grade-level content, students who do not need Title III services may spend some of their day receiving instruction below their linguistic capability, instruction that is often paired with additional remedial coursework, a phenomenon named by Estrada and Wang (2013) as an “additive remediation strategy” with “multiple interventions, each of which moves [ELLs] farther from access to the core and full curriculum and the mainstream” (p. 8). Kim (2011) has found that protracted time in ELL status was related to diminished school persistence. This “long-term ELL” epidemic (Gándara et al., 2003) is strongly linked to assessments and procedures associated with reclassification.
In terms of federal accountability, a change in Title III services is an accountability mandate and is subject to certain federal criteria (AMAOs 2 and 3). At the state (or local) level, this change in service affects educational programming and can be subject to additional local criteria, often grade-specific. The evidence used for reclassification (or Title III continuation) decisions includes ELP assessment classifications (overall “proficient” or “nonproficient”), state standards-based assessment of English/Language Arts (typically overall level of “Basic” or above), grades or GPA, teacher recommendation, and consultation with parents to obtain their opinions about a student’s proficiency in English. Typically at the younger grades (K–3), curriculum-based measures of language development (e.g., literacy, reading) are used in addition to, or in lieu of, state standards-based ELA assessment data, which are not collected until the end of second grade. As reclassification criteria are at once double-edged, determining who will exit and who will remain in Title III services, misclassification is difficult to prevent. Effective classification depends on many factors, including the reliability and validity of each measure and criterion, the timeliness of data relative to decision making, and the decision rules used to aggregate the criteria.
The ELP assessment criterion for reclassification is an overall “proficient” classification, whereas “nonproficient” signifies continuation in Title III programs. In examining the validity of inferences from ELP assessment classifications, studies have investigated the test development process (Davidson, Kim, Lee, Li, & López, 2007; E. E. Garcia, Lawton, & Diniz de Figueiredo, 2010), the impact of test formats and measurement models (Zhang, 2010), and the impact of cut scores (Florez, 2012; Wang, Niemi, & Wang, 2007). Standard-setting procedures, including those commonly used in ELP assessment development, have been subject to scrutiny (e.g., the Bookmark method; Hein & Skaggs, 2009; Karantonis & Sireci, 2006), yet these investigations alone are insufficient for assuring validity of procedures used for setting levels within subdomain tests in ELPAs. Classifications from ELPAs have been criticized for being too lenient in comparison to other sources of proficiency (e.g., Stanford English Language Proficiency Test [SELP]; Mahoney, Haladyna, & MacSwan, 2009) or inconsistent in comparison to other ELPAs (Del Vecchio & Guerrero, 1995, as cited in Estrada, 2010), yet analyses have rarely investigated the impact of an overall classification in terms of reliability, fairness, or ability to produce valid inferences for educational programming.
Although ELP assessment classifications are considered prima facie evidence for readiness to exit Title III, the use of multiple measures is preferred (Ragan & Lesaux, 2006). However, the prioritization and consideration of multiple sources of data depend entirely on the decision rules (e.g., conjunctive—all indicator pass; compensatory—some indicator pass). For example, when all reclassification criteria must be attained (i.e., conjunctive decision rule), a classification of “nonproficient” on the ELP assessment can be the single source of evidence used to retain a student in Title III. This is the case in many states, if not all, as illustrated in the commonly used augmented-classification model (Abedi, 2008). Under a compensatory approach, all evidence may be considered with the allowance of one criterion falling slightly below a cutoff, or within a predetermined zone of indecision (e.g., confidence interval). Although rigid cutoffs may be necessary for federal accountability purposes, compensatory approaches are more aligned with evidence-based data use for individual-level decisions as they allow for greater reliability of the overall aggregate classification. In addition, procedural safeguards such as opt-in and opt-out or parent waivers should accompany an intensified scrutiny of the reliability and validity of each inference, knowing that the least reliable indicator could unfortunately become the gatekeeper. For a more comprehensive review of reliability and validity information gathered on ELP measures, see NRC (2011) and Porter and Vega (2007).
The criterion of “English proficient” as measured by a state’s standards-based assessment of English language arts is a matter of state (or local) control. Recommendations have been made for the optimal criterion, such as “above the 35th to 40th percentile” (Gándara, 2000), yet it has been reported that states interpret age/grade appropriate levels “as scoring above the 50th percentile . . . or even at the 32nd percentile” (Abedi, 2008, p. 21), which illustrates the propensity for within- and between-state variation. In addition, state standards assessments have gone through many changes in the past decades, moving past the “minimum competency test” programs of the 1970s and 1980s to NCLB-era assessments that embodied “much higher standards” (Heubert, 2004, p. 220) to the college- and career-ready standards of the current era. 15
With ELL assessment consortia (i.e., ELPA21 and WIDA) and statewide data systems (e.g., the California Longitudinal Pupil Achievement Data System), existing data that come with the student should be more easily interpreted between and within states. Even though states will have their own exit standards attached to these ELP assessments that can include performance on the new content assessments, the potential for more transparency and comparability across states is welcome. See also Linquanti and Cook (2013) and Cook and MacDonald (2014) for the role of reference performance-level descriptors (PLDs) for English proficiency levels in an attempt to create a “common definition of English learner” across states in this regard.
This raises the problem of (un)timely reclassification evidence. In their recent Title III report, Cook, Linquanti, Chinen, and Jung (2012) suggest that policymakers review the timing of the administration of assessments as well as the delivery of assessment data:
Where the ELP is administered several months before the academic achievement assessment, EL students’ actual level of English language proficiency when they take the academic content assessment may be quite different from that indicated by their ELP assessment result. Furthermore, the direction of this difference may vary, in part, according to the exact time of year when each assessment is given, and the linguistic environment of the EL student population. (p. 27)
The timing of assessments, data delivery, and reclassification reporting deadlines are also issues of great importance at the local level (Estrada & Wang, 2013). The practical mechanism of decision making depends on timely, useful data, and this is the joint responsibility of policymakers and state departments of education. When assessments used for federal accountability are also used for local decision making, it is the child-level decisions that should take priority. Data delivery systems should be designed to first fulfill this purpose.
Overall, the determination of readiness for Title III program exit has been the focus of most validation studies investigating ELPAs and mostly with an eye on accountability: specifically, looking at the ability of English language “proficient” classifications to predict future academic success (Ragan & Lesaux, 2006). Such studies have examined the outcomes of reclassified students in comparison to English-only students (Crane, Barrat, & Huang, 2011), in comparison to other reclassified students with higher or lower eligibility levels for R-FEP (Kim & Herman, 2010) and in comparison to students who did not reclassify (Abedi, 2008; Grissom, 2004). Studies focused on policy have explored methods for setting reclassification standards (Robinson, 2011) and performance standards for accountability targets related to reclassification (Cook et al., 2008). Few studies have examined academic outcomes for students remaining in LIEPs despite high levels of reclassification readiness, with a notable exception being the current work of Estrada and Wang (2013). In the year-one findings of this reclassification study, researchers found ELL students at all grades, but especially third grade, who were prevented from reclassification due to ELP assessment evidence despite strong performances on all other criteria. These students were subsequently kept in educational programming below their abilities, limiting their chances to access pathways, or curricular streams (Estrada, 2014), that would more likely ensure college readiness.
The other indicators for Title III exit, such as grades, teacher report, and parent recommendation, are complex and difficult to interpret in their own ways. The shrinking role of these “nonstandardized” indicators is becoming evident as accountability mandates stipulate the exclusive use of standardized measures in federal reporting. However, their importance and usefulness at the individual level cannot be overstated. What remains to be seen is to what extent a prolonged emphasis on standardized measures will change the regard for personal accounts in this process. In this current era of new academic content standards, the types of evidence that make up the ELL assessment system will continue to be, in the words of Elmore (2004), “to say the least, works in progress” (as cited in Fuhrman & Elmore, 2004, p. 278).
Academic Content Assessment With ELL Students
In this section, we review research focused on two key threats to the valid inferences drawn from academic content assessment use with ELL students, namely, construct-irrelevant variance and misuse of test accommodations, and how both may be evolving in light of the changing language demands of the new content standards. Both validity concerns arise in the context of interpreting the meaningfulness of results obtained for AMAO 3 at the nexus of Title I and Title III under NCLB. This third and final mandated objective of ELL assessment focuses on the English language arts and mathematics achievement of ELL students as a subgroup of the general student population for federal accountability reporting purposes (see Lane & Leventhal, 2015, this volume; Sireci & Faulkner-Bond, 2015, this volume). Every ELL student who has been resident in the United States for longer than 12 months must participate in their state’s annual testing program along with their English-speaking peers. However, where cultural and linguistic factors are paramount in ELL student testing (Durán, 2011; Solano-Flores, 2011), there are additional caveats necessary for the interpretation of scores on state standards-based content assessments as accurate indicators of ELL student content knowledge.
Changing Views on Construct Irrelevant Variance
The NRC (2011) warns:
The sizable ELL population is a particular challenge because students are at varying levels of ELP and may not be sufficiently proficient in English to demonstrate proficiency in academic content areas. Because they have the task of learning English and academic content simultaneously, it is not surprising that, as a group, they do not meet the proficient level in academic subjects: the academic gap between the group and the non-ELL population is considerable. (p. 7)
It is also important to remember that the most proficient students in the ELL population are exited from ELL status, meaning that students in the ELL subgroup are necessarily and by definition those students not yet proficient in English.
However, Dutro (2006) has referred to the academic achievement gap between ELL and non-ELL students as the “linguistic gap” because of the language demands placed on ELL students that may eclipse their display of academic content knowledge. In fact, we may argue that we do not have sufficient details about the content knowledge of ELL students under existing testing conditions to validly interpret scores as demonstrating a gap in content knowledge between ELL and non-ELL students. Some proportion of the academic achievement gap may be due not to an ELL student’s lack of content knowledge but to the content assessment’s inability to accurately measure that knowledge when insufficient language proficiency stands in the way. This is the construct-irrelevant variance that has previously been identified as a major threat to the validity and therefore potential usefulness of assessments of the content knowledge of ELL students (Abedi, 2002; Haladyna & Downing, 2004). If the language proficiency level of an ELL student is insufficient for the student to understand the language of a mathematics assessment, for example, then the assessment may, in part, be measuring the wrong construct (i.e., measuring language knowledge rather than mathematics knowledge).
In the past, researchers have documented differential item functioning with ELL students that was thought to be the result of construct-irrelevant variance. For example, in a study of ELL and non-ELL student performances on a state standards-based mathematics test, Martiniello (2009) found that greater lexical and syntactic complexity of math word problems favored the math outcomes of non-ELL students. Furthermore, she found that differential item functioning was attenuated when items included nonlinguistic schematic representations that ELL students could use to make meaning of the mathematics test items. However, the new content standards specify the teaching and assessment of the communication of content knowledge in addition to the content knowledge itself, and it may be pertinent to now consider how effectively students can communicate their mathematics content knowledge, for example, as an additional aspect of the mathematics construct to be assessed (Haladyna & Downing, 2004). In effect, language has become construct-relevant in the era of the new content standards. Furthermore, the four Cs, namely, critical thinking, communication, collaboration, and creativity, that have been articulated by the Partnership for 21st Century Skills are reflected in the new content standards (National Education Association, 2011). In the future, the onus will be on test developers to clearly articulate the content construct so that item writers can include or not include additional verbiage as appropriate to avoid construct-irrelevant linguistic complexity. Unfortunately, the distinction between communication and unnecessary linguistic complexity may have become less determinate with the new standards.
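For readers unfamiliar with how differential item functioning is quantified, the following Python sketch computes one widely used index, the Mantel-Haenszel common odds ratio, together with its conventional rescaling to the delta metric. It is offered as an illustration only: the research discussed above is not described here as having used this particular statistic, and the function name and the counts in the example are hypothetical. Students are first matched on total score (the strata), and within each stratum the odds of a correct response are compared for the reference group (here, non-ELL students) and the focal group (ELL students).

```python
import math
from typing import List, Tuple

# Each stratum holds counts for one matched total-score level:
# (reference correct, reference incorrect, focal correct, focal incorrect)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_dif(strata: List[Stratum]) -> Tuple[float, float]:
    """Return (common odds ratio, MH D-DIF on the delta scale) for one item."""
    num = 0.0
    den = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
    if num == 0 or den == 0:
        raise ValueError("odds ratio is undefined for these counts")
    alpha = num / den
    # Conventional rescaling; negative values mean the item favors the reference group.
    return alpha, -2.35 * math.log(alpha)


# Hypothetical counts for one word problem at three matched score levels.
item = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
alpha, delta = mantel_haenszel_dif(item)
print(f"odds ratio = {alpha:.2f}, MH D-DIF = {delta:.2f}")
```

An odds ratio above 1 (a negative delta value) indicates that, among students with comparable total scores, the reference group is more likely to answer the item correctly, which is the pattern one would expect if linguistic complexity unrelated to the mathematics construct were depressing ELL performance.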
Henceforth students will need to be equipped with the linguistic acumen to take part in classroom interactions that support their deeper content learning. For example, when partnered with other students, they will need familiarity with language practices and routines to negotiate their involvement in activities, solve problems cooperatively, and discuss and support one another’s ideas (CCSS Initiative, 2010a, 2010b). Studies will be needed to document the kinds of language that teachers report students need to participate in various tasks tied to their implementation of the new content standards. For example, one observation that the participating teachers in the Dynamic Language Learning Progressions Project have made has been around the development of a repertoire of modal verbs and of causal embedding in students’ language that appear necessary for negotiating collaborative activities such as building representations of their content learning (e.g., I would like to build . . ., we should do this part first because . . ., If we do this part first then we could . . .; Bailey & Heritage, 2014).
The communication and collaboration emphases in the new content standards will have ramifications for the interpretation of academic content test scores for ELL students, as well as important implications for valid development of both academic content and ELP assessments. First, this change in emphasis in the academic content standards may mean that ELL students will find the new content assessments an even greater challenge than existing assessments in English language arts, mathematics, and science if ELD instruction is not commensurately redesigned and implemented to keep up with the new language demands of the new academic content standards. 16 Potentially, poor performances may be misinterpreted as lack of the deeper content knowledge called for in the new content standards but may be due to either a lack of opportunity to learn content as a result of the new linguistically and communicatively more demanding instructional environment (witness the complexity of the language of modals above) and/or the language demands of the new content assessments, which will act as a barrier to students being able to show what they know.
Second, this change in emphasis implies that the assessments of content knowledge may no longer be unidimensional; rather, they may have both a content knowledge construct and an interactive/communicative ability component that determines how well students can convey their academic content knowledge to others. Moreover, new curriculum-based assessments may need to be developed that also address dimensionality issues so that they are capable of measuring the range of linguistic abilities as well as new communicative competencies aligned with the realities of classrooms configured to teach the objectives of the new content standards.
Implementing Effective Testing Accommodations
Accommodations are provided to address concerns with accessibility of academic content assessments and meaningful interpretation of scores. These testing accommodations are made available to students who need them, including students with certain disabilities and students classified with limited English proficiency. However, fair and equitable accommodation practices differ for ELL students in meaningful ways that affect valid interpretation and use of scores. Furthermore, the use of scores from content area testing affects educational programming and Title III service designations for ELL students, which further magnifies the importance of effective accommodations.
Durán (2008) raises two main concerns in the case of accommodations for ELL students: (1) Does the accommodation adequately facilitate ELL students’ ability to access the information required for problem-solving of content area items? (2) Does the accommodation confer an advantage that could alter the meaning of the underlying construct of the measure? Administrators selecting accommodations for ELLs taking state content assessments may be aware that accommodations are classified by whether they alter the underlying construct being tested (e.g., in the unlikely and extreme case that a dictionary were provided during a spelling test) or not, but this classification system is not currently based on research findings. Although there are ways to address the effectiveness of accommodations psychometrically (e.g., the interaction hypothesis; see Sireci, Li, & Scarpati, 2003), the broad heterogeneity of the ELL student population makes interpretation and generalizability of findings challenging. Kieffer, Rivera, and Francis (2012) have provided an updated compendium of the recommended uses of accommodations on large-scale assessments with ELL students, including specific guidelines applicable to state-level decision makers. Abedi, Hofstetter, and Lord (2004) also raise the concern of feasibility for decision makers: “Is this accommodation strategy practical and affordable, even for large-scale assessments?” (p. 15). Lack of feasibility, or lack of funding to support accommodation practices, is an unfortunate reality for many schools striving to address the needs of ELL students. Furthermore, ELL students with disabilities are entitled to accommodations required by their Individual Education Plans to address their disability as well as accommodations to assist with their English language needs, which may lead to a complex set of choices for educators (see Abedi, 2009, for discussion).
The promise of the new computer-adaptive content assessments is that accommodations will be built into the interactive platform of the tests, thus giving test vendors and state departments of education the opportunity to choose, implement, and monitor the effectiveness, validity, and impact of the accommodations. In addition, the efforts to create reference PLDs common across the states will contribute to increasing the comparability of accommodations use at certain PLDs (Linquanti & Cook, 2013). Such issues of fair, valid, and interpretable inferences from scores resulting from accommodated testing are paramount and are further discussed by Thurlow and Kopriva (2015, this volume; see also Kieffer et al., 2012; Kieffer, Lesaux, Rivera, & Francis, 2009; Young et al., 2008, 2010).
ELL Assessment Components: One System or Many Pieces?
If initial identification practices are sound, then one should expect few students to be identified as IFEP at the next stage of ELL assessment as this would signal that the questions on the HLS overidentified students as potential ELL students when they were not. In other words, HLS questions may have measured construct-irrelevant factors such as having been born outside the United States or even prevented parents from indicating that their child was fluent in both English and another language. The very use of the initial identification phase is to narrow down the population in order to avoid the unnecessary testing of students ineligible for Title III services, a cost both to the student and to the state, but without severely curtailing the chances of including those students who may need services. To date there is no guidance on what proportion of a state’s students should reasonably be expected to test as IFEP and that would give an indication that the “right” students had been filtered through the initial identification stage and on to screening or assessment to determine ELL status eligibility.
Similarly, in the routines of placement in, continuation in, and placement out of Title III services, our theory of action states that the intended ultimate effect of these components is the progress and eventual English language proficiency of ELL students. Yet measurement error abounds, not only within the assessments themselves but in the assignment of cutoffs and proficiency standards for each decision-making purpose. For the system, and each measure within it, to yield useful interpretations that are valid for each use of testing data (Kane, 2013), all stakeholders need access to information that allows interpretation of scores and error at the individual level. Particularly for state ELP assessments that are developed and validated in grade clusters (e.g., for Grades 3–5), progress monitoring relies heavily on the accuracy of cutoffs and proficiency standards.
Turning to the roles of language and academic content assessments within the system, federal reporting requirements integrated with Title I (i.e., AMAO 3) clearly provide a structure by which states and districts are held accountable for the academic progress of the ELL student subgroup. Furthermore, there is integration of the two assessment components through the service delivery model that NCLB supports, namely, Title III services for those identified as ELL are embedded within the wider Title I services (Winke, 2011, Figure 1, p. 631). Less clear, however, is how the content of the two assessment components (English language and academic content) might be connected in a cogent way. The ELPD Framework (CCSSO, 2012) has played the most visible role in this regard and is a device by which to consider the relationship between language and academic content, specifically adopting a view that the new content standards “articulate both disciplinary practices and embedded language practices” (p. 2). The ELPD Framework is predicated on the conceptualization adopted by the Understanding Language Initiative (e.g., van Lier & Walqui, 2012) that students learn language and content simultaneously in complex adaptive systems by responding to “affordances” emerging from dynamic communicative situations. As already mentioned, the purpose of the framework is to guide state ELD/P standards creation and evaluation. However, by its framing of key language practices corresponding with the new academic content standards, the framework may help forge stronger ties between ELD/P and academic content assessment.
For example, its extension to assessment scenarios could provide closer connections between the language observed/monitored at the classroom level and the language students will need to display during interactions in content classrooms, or closer connections between the language tested on new ELP assessments and the language students need for their responses to test items on the new academic content assessments.
Recommendations for System Improvements and Research
The research reviewed here has revealed that all components of the system as well as the system in its entirety have flaws requiring changes and subsequent evaluation if the theory of action we initially laid out is to be fully realized. If language testing policies are functioning as de facto language planning policy, then the research and education communities must critically evaluate the assessment system, not just for the technical quality of the assessments but also for the larger purpose and consequences they have on the education of ELL students. First, in the area of initial identification, the HLS needs to be reinvented as a tool to help gauge the language environments a child is currently exposed to and thus the likelihood a child has acquired English prior to enrolling in a U.S. school. As part of the larger effort by CCSSO to support the move toward a “common definition of English learner” across states (Linquanti & Cook, 2013), Linquanti and Bailey (2014) have proposed the development of new HLS informed by the construct-relevant taxonomy outlined in the section on initial identification. By following this proposal, although states may not develop a common HLS, they will have articulated the same set of underlying constructs and they will have reflectively chosen which constructs to include or exclude. It is anticipated that interpretation of HLS results will therefore be more effective because parent responses to HLS questions will more accurately identify the constructs of interest. 17 Moreover, to improve initial identification, we reiterate the recommendation made by Bailey and Kelly (2013) to states to conduct studies of redesigned HLS instruments with parent groups, administrators, and subsequent language proficiency data to determine if they are valid for the purposes to which they are put (see Standard 12.13 in AERA et al., 2014).
Second, to ensure fair and equitable allocation of Title III services, initial placement tools and the accompanying standards or cutoffs will require ongoing review. Although recent studies have suggested state placement test cutoffs would place many non-ELL students into Title III programs (e.g., 50% of first graders and 75% of kindergarteners in a non-ELL sample; García Bedolla & Rodriguez, 2011), it seems that test developers and state departments of education alike are unable or unwilling to interpret these findings in their own Title III systems. There is little doubt that cutoff scores on initial placement tests affect the prevalence of misclassified students. What seems clear from the research reviewed here is that the solution to misclassification, whatever the extent, will not be psychometric. There is simply no way to achieve acceptable levels of misclassification risk for both false positives and false negatives, especially on grade cluster tests, which already suffer from borderline reliability at each grade level. State- and district-level Title III coordinators would do well to establish wraparound procedures for each system component in order to compensate for the known risks of misclassification. Such procedures could include systematic double-check procedures for students whose scores fall within the confidence intervals of the cutoff and streamlined opt-in and opt-out procedures that include the input of teachers and parents so decisions are made as close to the child as possible. This may include an exploration of the role of bilingual assessment for intake and placement decisions as well. 18 Furthermore, the prevalence and characteristics of the students found ineligible for Title III services (i.e., designated as IFEP) are an area still in need of study along with more robust validation procedures to ensure standard setting and reconsideration are meeting established testing standards (AERA et al., 2014).
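The double-check procedure suggested in this second recommendation can be sketched in a few lines of Python. The sketch assumes the classical test theory formula for the standard error of measurement, SEM = SD × √(1 − reliability), and a 95% band (z = 1.96); the function names, scale, cutoff, standard deviation, and reliability value are hypothetical and would need to come from a given test's technical documentation.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM from the score SD and a reliability estimate."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def needs_second_look(score: float, cutoff: float, sd: float,
                      reliability: float, z: float = 1.96) -> bool:
    """Flag a score whose confidence band (score +/- z * SEM) contains the cutoff."""
    sem = standard_error_of_measurement(sd, reliability)
    return abs(score - cutoff) <= z * sem


# Hypothetical placement test: cutoff 500, SD 40, reliability .84 (SEM of about 16).
for score in (480, 515, 560):
    flagged = needs_second_look(score, cutoff=500, sd=40, reliability=0.84)
    print(score, "review with teacher and parent input" if flagged else "outside band")
```

The lower the reliability at a given grade level, the wider the band and the larger the share of students routed to review, which is one way to see why grade cluster tests with borderline reliability make a purely psychometric solution unlikely.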
Third, to successfully assess the language and content learning of ELL students for instructional purposes, educational administrators need to provide the necessary supports and opportunities for professional development so teachers can learn how to design valid English language assessments for learning and effectively use the information they yield (Bailey & Heritage, 2008; Bailey, Heritage, & Butler, 2014; Briggs, Ruiz-Primo, Furtak, Shepard, & Yin, 2012; Heritage, 2010; Llosa, 2008). The expertise of ELD specialists (including ESL, ELD, and LIEP teachers) and content teachers will need to be combined and used in order to successfully implement classroom assessment of language and academic content knowledge during the course of instruction. The ELPD Framework (CCSSO, 2012), although laying out extensive connections between language and academic content, is not intended as an aid to teachers despite its importance to their work. Consequently, additional collaboration between researchers, policymakers, and teachers is needed to ensure that teachers are supported in their work. 19 For example, teachers could greatly benefit from exposure to authentic exemplars of language performances to which they could hold ELL students, as well as information about how language develops along situated learning progressions that they could scaffold for ELL students (Bailey et al., 2014; Bailey & Heritage, 2014). How teachers may be assisted to use progressions in ongoing formative assessment and how well students learn from changes to teacher assessment knowledge and practices still need to be determined by research. Furthermore, language learning progressions could be used to form the underlying backbone or spine of a unified instructional and assessment system.
Specifically, both instruction and assessment could be “aligned” via the same set of learning progressions articulating the incremental developments in student English language and offering far more specificity than aspirations or expectations found in standards. To date, much of the research on learning progressions has been confined to the fields of mathematics and science (e.g., Sztajn, Confrey, Wilson, & Edgington, 2012), but if the work were to be extended more concertedly to language learning contexts it could help improve both assessment and instruction and help states measure growth and proficiency using a desirable multiple-measures approach.
Fourth, we recommend more validity studies focused on the consequences of ELP assessment like Winke’s (2011) study conducted to better understand how ELP assessment in Michigan played out at the local level in the hands of districts, principals, and teachers. Such studies will help “ensure that large-scale testing programs like ELP assessment are accountable not only to the entities that mandated them, but also to those for whom the tests are intended to serve—students, educators, and the public at large” (p. 654). With an eye to the academic content assessments being developed by Smarter Balanced and the Partnership for Assessment of Readiness for College and Careers, we especially need to consider innovative ways to assess academic content with students who are acquiring English. This can include greater exploration of bilingual assessment of the content areas.
Use of language accommodations on the new content assessment will require evaluation for impact on underlying ELA and mathematics constructs. Not only would this information be necessary for states making accommodation selection decisions, but states would also benefit from additional research that can elaborate on the validity and effectiveness of the use of multiple accommodations in the case of ELL students with disabilities. Related to the use of accommodations is the opportunity for new kinds of assessment formats due to the electronic platforms of the new content assessments. An extension of Obtaining Necessary Parity Through Academic Rigor (ONPAR) assessments from the realm of students with disabilities to ELL students, for example, is one attempt to help remedy language as a confound of content measurement (Kopriva & Albers, 2013; Kopriva, Gabel, & Cameron, 2011). ONPAR assessments are computer-based multisemiotic representations (i.e., visual simulations or animations) of knowledge and can minimize the oral and written language needed by students to display their learning. A combination of verbal and nonverbal methods can be used for ELLs to determine their content knowledge both with and without potential language barriers. Such combinations could address not only construct-irrelevant variance concerns at one level but also the need for the communication of content knowledge (cf. the new routines and activities requiring language and expected of the learning inherent in the new content standards) when language becomes construct-essential for displaying content knowledge. Both the inclusion of technology and the notion of communication of content knowledge take ELL assessment into new realms that will need to be carefully examined for both linguistic and cultural aspects of the validity argument. 
Admittedly these efforts could be cost-prohibitive if undertaken by individual states, although ELP consortia could leverage developments in the content assessment arena for use with the ELL assessment system. Minimally, states can conduct accessibility reviews of current test items to uncover threats to the interpretive claims of ELP assessments. Of course, even with embedded accommodations implemented during computer administration, professional development will still need to address the skills required of teachers and administrators for identifying accommodations that meet individual ELL student needs (including the complexities of accommodations used with students with disabilities).
Fifth, there are concerns that exit standards-setting procedures may be limiting qualified students from achieving FEP designation (Carroll, 2012, 2014). As with initial placement concerns dealt with in the second recommendation, opt-in and opt-out procedures, such as parent waiver, could help mitigate these inaccuracies and help ensure fair and appropriate continued educational services. Currently, each state (or local) system is left to determine, develop, and validate its own opt-in or opt-out system. In the absence of such procedures, students who are underidentified (e.g., Arizona; see Goldenberg & Quach, 2010) or overidentified (e.g., California; see García Bedolla & Rodriguez, 2011) are likely to remain in a mismatched educational setting, perhaps indefinitely. We suggest a more careful, reasoned use of multiple measures and opt-in, opt-out systems, starting with each state or district outlining the decision rules, or proprietary formulas, used at any/all points in the ELL assessment system (see Standard 12.1, AERA et al., 2014). In addition, more careful documentation of multiple sources of validity evidence (beyond coefficient alpha, which is of limited interpretive value in this context; Brown, 2014), including classification consistency, should accompany each state’s ELP assessment for each interpretation and use of scores (Kane, 2013). Finally, we recommend systematic review of the academic performances and socioemotional outcomes of former ELL students who are monitored for 2 years after exiting Title III services. The information this yields can provide further evidence of whether educational programs as a whole are meeting the different needs of ELL students (e.g., Castro-Olivo, Preciado, Sanford, & Perry, 2011; Kim & Herman, 2010; Ragan & Lesaux, 2006).
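As a simple illustration of the classification consistency evidence called for in this fifth recommendation, the Python sketch below compares the proficiency classifications the same students receive on two occasions (e.g., two parallel forms or administrations). It reports the proportion of identical classifications and Cohen's kappa, which corrects that proportion for chance agreement. This is one basic way to summarize consistency, assuming two sets of classifications are available; operational programs often estimate consistency from a single administration using model-based methods not shown here. The function name and the data are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def classification_consistency(first: Sequence[str],
                               second: Sequence[str]) -> Tuple[float, float]:
    """Return (proportion of identical classifications, Cohen's kappa)."""
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two non-empty classification lists of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_1, counts_2 = Counter(first), Counter(second)
    # Agreement expected by chance, from the marginal proportions of each category.
    expected = sum(counts_1[k] * counts_2[k] for k in set(counts_1) | set(counts_2)) / n ** 2
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when only one category occurs
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical classifications for eight students (P = proficient, N = nonproficient).
form_a = ["P", "P", "N", "N", "P", "N", "P", "N"]
form_b = ["P", "N", "N", "N", "P", "N", "P", "P"]
agreement, kappa = classification_consistency(form_a, form_b)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Unlike coefficient alpha, which describes the internal consistency of item responses, these indices speak directly to the question that matters for reclassification: how often a student would receive the same decision if tested again.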
Concluding Remarks
The validity argument for ELL student assessment is based on the premise that the collection of instruments measures what it claims to (e.g., levels of English that suggest students are eligible for Title III services and the relevant language proficiency needed for academic achievement) and is used in intended ways. The theory of action outlined here has framed the review of ELL assessment policies and related research, and it has assisted in evaluating the system as a whole. The system needs improvements at every level, and to be most effective, we make a final recommendation that the ELL assessment system also be improved as a whole; different ELL assessment components need to relate to one another in a more cohesive manner so that ultimately the elicited data and score inferences are meaningful and useful to states in making certain that all students have received the education that the new academic content standards were created to ensure. 20
