Abstract
English language proficiency assessments (ELPA) are used in the United States to measure annually the English language progress and proficiency of English-language learners (ELLs), a subgroup of language minority students who receive language acquisition support mandated and largely funded by Title III (NCLB, 2001). ELPA proficient and non-proficient classifications are determined by applying decision rules to combine the sub-domains of listening, speaking, reading, and writing in a conjunctive, compensatory, mixed or complementary manner in order that an ELP performance standard can be set. Although the ELP performance standard is used to set accountability objectives for federal reporting, it is also used to reveal students’ readiness for exit from English language services. This study operationalizes and tests the ELP performance standard for student-level decision making by describing to what extent students are classified as non-proficient under different models and rules and the effect of these differences on their eligibility for redesignation. Test performances from one state’s ELPA were gathered from a statewide sample of ELL students (n = 875) and a randomly selected sample of native English-speaking students (non-ELL, n = 92) in fifth grade. Findings indicate sizable differences in non-proficient classifications for ELLs, non-ELLs, and a constructed subgroup of academically high-performing students. There were also observed differences in redesignation eligibility in all groups, suggesting that choice of model and decision rule can extend the length of time even high-performing students spend in English language services. Discussion includes implications for validation of high-stakes classification systems.
Background
English-language learner (ELL) is a designation given to school-aged language minority students in the United States who are acquiring the English language necessary to meaningfully participate in or benefit fully from instruction delivered in English. According to the US Department of Education, an estimated 4.4 million students, or 9.1% of the school-aged population, were designated as ELL in 2011–12. The proportion of ELL students in each state varies from 0.7% in West Virginia to 23.2% in California (United States Department of Education, 2014). The ELL designation, designed to be temporary, activates funding for states to provide an appropriate combination of language development and content-based instruction with the goal of raising student English language proficiency (ELP) to a point where it does not hinder success in English-medium instructional environments.
Devising and validating measures of ELP accelerated under No Child Left Behind (NCLB) with all states now in compliance (National Research Council, 2011). Test consortia and individual states have devised ELP assessments (ELPA 1 ) which help determine a language minority student’s eligibility to enter into, receive leveled instruction within, 2 and exit from English language acquisition and development services, henceforth ELD services (Bailey & Carroll, 2015). Classification accuracy for these high-stakes decisions depends not only on items, cut scores, and performance standard validity, but also on models and decision rules for how sub-domains of ELPA are combined which can also vary by state.
From raw scores to redesignation, there are several points where ELPA validation is needed. To determine proficient and non-proficient classifications, test consortia and individual states choose models and decision rules, set sub-domain test weighting, set performance standards on sub-domain tests, create composites of ELP, and set an overall performance standard. To determine readiness for redesignation and exit from ELD services, state policymakers choose criteria, set performance standards, and prioritize criteria in order of decision use. All these choices allow for discrepancies between and even within states, which directly affect the redesignation eligibility of individual students and warrant careful study. Whether overt or tacit, these choices adhere to theories of language acquisition, language development, and proficiency. When applied in a multivariate assessment, classification models and decision rules embody and enforce these theories.
To date, efforts have been made to ensure the psychometric quality of ELPA and, most recently, to move towards a common or comparable definition of “English Learner”, as determined by ELPA performance standards. These efforts have recognized concerns about the lack of remedies for ELL misclassification arising from assessment imprecision, an area in which “current federal law provides no guidance” (Linquanti & Cook, 2013, p. 11). In this paper we use performance data from ELL and non-ELL students in one state – “State A” – to examine a foundational yet seldom studied issue of how ELPA models and decision rules play a role in defining who is “non-proficient” and ineligible for redesignation, that is, who remains an “English Learner”. We begin by defining and describing classification models in relation to ELPA.
A classification model is a guiding principle for combining two or more measures, subtests, or indicators to which a performance standard and decision rule for classification is applied. All decision rules used in classification of ELL students can be organized under one of the following four models:
conjunctive (all indicators at the standard);
compensatory (uneven indicators, some allowed below the standard);
mixed (two models combined);
complementary (one of two possible indicators at the standard).
All decision rules operating under these models are either composite, for combining sub-tests towards an overall classification such as reclassification as ELP “proficient,” or aggregate, for combining separate indicators (e.g., ELPA and academic content assessments) for a multiple-measures decision such as redesignation as Fluent/Fully English Proficient (FEP) required in some states and/or districts. Decision rule choice has been found to affect outcomes based on composites, such as the General Educational Development (GED) test (Douglas & Mislevy, 2010), and aggregates, such as fourth grade promotion based on eight possible indicators including class grades, standardized test scores, and a multidisciplinary project (Chester, 2003). The purpose of this study is to describe the impact of model and decision rule choice for an ELP assessment. We will explore both the composite (i.e., combination of subtests), in terms of which students are classified as ELPA non-proficient, and the aggregate (i.e., use of multiple measures), in terms of which students are ineligible for redesignation as FEP.
Model choice for ELP assessment composites has been mandated by federal policy starting with compensatory under the 1994 revision of the Elementary and Secondary Education Act (ESEA) and conjunctive under the 2001 revision (NCLB). As of 2008, any model can be used, “so long as the State can demonstrate that the composite score meaningfully measures student progress and proficiency in each of the language domains and, overall, is a valid and reliable measure of student progress and proficiency in English, consistent with the purpose for which the assessment is used” (United States Department of Education, 2008).
Evidence to argue the validity of a chosen model and decision rule should be provided by state policymakers and test developers for each use of the classifications. Such evidence should be examined by other test developers or states under peer review agreements with the US Department of Education to ensure the choice of models, decision rules, and performance standards meets the recommendations set forth by the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, National Council on Measurement in Education, 2014). Determining decision rules for ELP assessments is fraught with challenges, such as accounting for the dimensionality of sub-domains (Abedi, 2007) and the controversial nature of standard setting on a measure tied to both federal accountability and student-level decision making (Bailey & Carroll, 2015; see also Chapelle, 2012).
ELP assessment models and decision rules adhere to different theoretical stances on second language acquisition. These assumptions should be made explicit for each intended use of scores and classifications to facilitate peer review and evaluation. To illustrate, we propose theoretical reasoning for each model and provide decision rule examples that each entails in Table 1. In the following paragraphs we describe conjunctive, compensatory, mixed, and complementary models in relation to ELP assessments.
Table 1. Classification models applied to English language proficiency assessments.
Note. Sub-domains typically include listening, speaking, reading, and writing.
Conjunctive models require all sub-domain scores to meet or exceed sub-domain performance standards regardless of the overall score. This requires high accuracy for all cut scores as, despite a high overall score, missing one sub-domain by one point would result in a non-proficient classification. For example, if the sub-domain performance standard is 18 out of 25, students scoring 17 in listening and 25 in speaking, reading, and writing would be “non-proficient”, despite achieving 92% overall. Students scoring 18, 24, 25, and 25 would be “proficient”, also reaching 92% overall, as would students scoring 18, 18, 18, and 18, equaling 72% overall. Proficient and non-proficient classifications at the same high overall score are difficult to justify when sub-domains are highly correlated and could yield false negatives. Yet conjunctive models disallow sub-standard performances to minimize false positives, which is also desirable (Abedi, 2004, 2013). In the current study, a conjunctive model is used by “State A” with its own ELP assessment.
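The worked example above can be checked with a short Python sketch. The function and variable names are illustrative assumptions; only the sub-domain standard of 18 out of 25 and the three score profiles come from the text.

```python
# Illustrative sketch of a conjunctive decision rule; all names are hypothetical.
SUBDOMAINS = ("listening", "speaking", "reading", "writing")  # order of each score tuple
CUT = 18        # sub-domain performance standard used in the example
MAX_SCORE = 25  # maximum score per sub-domain in the example


def conjunctive(scores, cut=CUT):
    """Proficient only if every sub-domain meets or exceeds the cut."""
    return all(score >= cut for score in scores)


def overall_percent(scores, max_score=MAX_SCORE):
    """Overall score as a percentage of the total points available."""
    return 100 * sum(scores) / (max_score * len(scores))


# Score profiles from the text, ordered as in SUBDOMAINS.
profiles = [(17, 25, 25, 25), (18, 24, 25, 25), (18, 18, 18, 18)]

for scores in profiles:
    label = "proficient" if conjunctive(scores) else "non-proficient"
    print(scores, f"{overall_percent(scores):.0f}%", label)
```

The first two profiles share the same overall percentage but receive opposite classifications, which is the pattern the paragraph describes as difficult to justify.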
Compensatory models allow uneven performances where high scores on some sub-domains compensate for low scores on others. This allows students with high overall scores to be classified as proficient to a greater extent than conjunctive models, minimizing false negatives. Even with the possibility of false positives, there is reason to believe that extremely uneven profiles are less prevalent (Carroll & Bailey, 2012), especially when sub-domains are highly correlated. Since other measures could help minimize the likelihood of false positives during redesignation, compensatory models are desirable to minimize false negatives. New Jersey, which uses ACCESS for ELLs™ (WIDA, 2013), applies a compensatory model.
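A compensatory rule can be sketched in the same way. In the minimal Python example below, the overall cut of 72 out of 100 is an assumption chosen only for illustration (it equals four sub-domain scores of 18 out of 25); it is not a value reported for any state or assessment.

```python
# Illustrative sketch of a compensatory decision rule; names and cut are hypothetical.
OVERALL_CUT = 72  # assumed overall standard out of 100 points (4 sub-domains x 25)


def compensatory(scores, overall_cut=OVERALL_CUT):
    """Proficient when the total meets the overall standard, whatever the profile."""
    return sum(scores) >= overall_cut


# Scores are ordered listening, speaking, reading, writing.
slightly_uneven = (17, 25, 25, 25)  # total 92: one sub-domain just under 18
very_uneven = (10, 12, 25, 25)      # total 72: two weak sub-domains offset by two strong ones
just_short = (17, 18, 18, 18)       # total 71: misses the overall cut by one point

for scores in (slightly_uneven, very_uneven, just_short):
    print(scores, sum(scores), compensatory(scores))
```

The second profile shows how a false positive could arise under this model: two low sub-domain scores are fully offset by two high ones.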
Mixed models combine two models, often conjunctive and compensatory. For example, a mixed model could require a certain overall level (compensatory), while requiring three or more sub-domains at or above a certain level (compensatory) and none below a certain level (conjunctive). Alternatively, a mixed model could require an overall performance standard or score (compensatory) plus all sub-domains at or above a certain level (conjunctive). However, any model including a conjunctive component will, to a certain extent, be disregarding the overall score, as it requires even high-scoring performances to meet additional sub-domain requirements. Mixed models are used by California with its own ELP assessment, and by several states using ACCESS for ELLs™ including Alaska, Illinois, New Hampshire, North Carolina, and Montana.
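The two mixed variants described above can be sketched as follows. Every threshold in this Python example (overall cuts, sub-domain level, and floor) is a hypothetical value chosen for illustration and does not represent the rule of any state named in the text.

```python
# Illustrative sketches of two mixed decision rules; all thresholds are hypothetical.

def mixed_three_of_four(scores, overall_cut=72, level=18, floor=15):
    """Overall cut, at least three sub-domains at the level, and none below the floor."""
    meets_overall = sum(scores) >= overall_cut            # compensatory component
    enough_at_level = sum(s >= level for s in scores) >= 3  # compensatory component
    none_below_floor = min(scores) >= floor               # conjunctive component
    return meets_overall and enough_at_level and none_below_floor


def mixed_overall_plus_all(scores, overall_cut=80, level=18):
    """Overall cut plus every sub-domain at or above the level."""
    return sum(scores) >= overall_cut and all(s >= level for s in scores)


# Scores are ordered listening, speaking, reading, writing (each out of 25).
for scores in [(17, 25, 25, 25), (14, 25, 25, 25), (18, 24, 25, 25), (18, 18, 18, 18)]:
    print(scores, mixed_three_of_four(scores), mixed_overall_plus_all(scores))
```

Under the second rule, a total of 92 with one sub-domain at 17 is still non-proficient, which illustrates how a conjunctive component can override a high overall score.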
Complementary models take an either/or approach. For example, prior ELPA scores for students moving from a different state could be used to verify readiness for redesignation. Complementary models could also allow for “proficient” by either conjunctive or compensatory rules on the same ELP assessment. In this way, complementary models allow for several pathways to “proficient”.
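A complementary rule that accepts either pathway on the same assessment could be sketched as below. The sub-domain cut of 18 and the overall cut of 80 are assumptions for illustration only, as are the function names.

```python
# Illustrative sketch of a complementary decision rule; cuts and names are hypothetical.

def conjunctive(scores, cut=18):
    """Pathway 1: every sub-domain meets or exceeds the sub-domain cut."""
    return all(s >= cut for s in scores)


def compensatory(scores, overall_cut=80):
    """Pathway 2: the total meets or exceeds the overall cut."""
    return sum(scores) >= overall_cut


def complementary(scores):
    """Proficient if either pathway is satisfied."""
    return conjunctive(scores) or compensatory(scores)


# Scores are ordered listening, speaking, reading, writing (each out of 25).
for scores in [(18, 18, 18, 18), (17, 25, 25, 25), (17, 18, 18, 18)]:
    print(scores, complementary(scores))
```

The first profile qualifies through the conjunctive pathway only, the second through the compensatory pathway only, and the third through neither.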
Models and decision rules underpin performance standard setting for federal accountability, but the classifications produced are also used to decide which students are proficient or non-proficient. This raises the possibility that poorly validated models and decision rules, not just performance standards, could be contributing to the misclassification of students. This misclassification can result in premature exit from or extended tenure in ELD services.
Timely redesignation and exit are particularly high stakes for fifth-grade students transitioning into secondary school settings. A misclassification of proficient for a student actually non-proficient (false positive) could mean full-time placement in mainstream and gifted instructional settings without language supports. A misclassification of non-proficient for a student actually proficient (false negative) could mean one or more years designated as an ELL with extended time in instructional settings below that child’s linguistic competence and a delay of full-time placement in mainstream and gifted settings. Specialized language instruction can support academic progress and biliteracy when facilitated by well-trained teachers to students who need it (August & Shanahan, 2006; Faulkner-Bond et al., 2012) but is argued to differ in the depth, breadth, pacing, and complexity of content, when compared with mainstream curricula and instruction (Walqui et al., 2010). In addition, literature shows that students with ELL status can often be grouped into low-performing content classrooms, limiting the opportunity to learn and making it difficult for students to reach the academic achievement criteria necessary for redesignation and program exit (Abedi & Herman, 2010; Callahan, 2005; Estrada, 2014; Xiong & Zhou, 2006). Particularly for students also identified for remediation classes in mathematics or other content areas, the clustered scheduling of language, remediation, and low-performing classes moves students with ELL status further away from access to the full curriculum (Estrada, 2014; Estrada & Wang, 2013). 
Despite the known benefits of ELD services, students with ELL status over the long term have less opportunity to engage with grade-level content and vocabulary, less access to proficient language models among grade-level peers including native English speakers, limited access to advanced and college-preparatory courses, and lower levels of school persistence (Gándara, Rumberger, Maxwell-Jolly, & Callahan, 2003; Kim, 2011; Slama, 2014).
Literature review
To date, studies related to ELPA have investigated the test development process (Garcia, Lawton, & Diniz de Figueiredo, 2010), the constructs tested (Bailey, 2007), the impact of test formats and measurement models (Zhang, 2010), the impact of cut scores (Florez, 2012; Wang, Niemi, & Wang, 2007), the predictive ability of classifications as readiness indicators for federal reporting of progress and proficiency (Abedi, 2008b; Robinson, 2011), and the performance of students who have been reclassified as FEP (Abedi, 2008b; Kim & Herman, 2010; Ragan & Lesaux, 2006; see also Solórzano, 2008 for a review). In terms of classifications, ELPA studies have been conducted by test developers and consortia to report technical qualities and validate the inference of “proficient” for state Title III accountability reporting, but not for individual student-level decision making such as ELD instructional placement and redesignation.
Two studies were conducted by a test developer analyzing the performance of ELL and non-ELL students (n = 400) on the Stanford English Language Proficiency assessment (SELP, 2003). In the first study, Stephenson, Johnson, Jorgensen, and Young (2003) used analyses of variance (ANOVA) to compare the performances of ELL and non-ELL students grouped as “proficient” and “non-proficient” by the SELP on the SAT9 reading comprehension subtest. They claimed that validity of SELP classifications was justified by evidence that students classified as “non-proficient” scored lower on average on the SAT9 than students classified as “proficient”. There was no examination of individual-level performance data for anomalies such as a “non-proficient” SELP classification converging with a high SAT9 performance. Further investigation to rule out any such counterevidence will be necessary in order to support the warrant that an ELP classification inference accurately and consistently identifies ELP at the individual level. In a related study, Stephenson, Jiao, and Wall (2004) again used ANOVA to claim validity of SELP classifications based on evidence that ELL students scored lower on average than non-ELL students. However, discriminant analyses by group membership in the same study indicated some non-ELL students were classified as “non-proficient” (by grade cluster: primary, 36%; elementary, 28%; middle, 13%; and high, 17%). The authors cited this as evidence that classifications at primary and elementary levels are “not very reliable” (p. 8). Unfortunately, these findings were not accompanied by recommendations for how classifications should be interpreted when used in decision making for individual ELL students. Again, there was no investigation of the classification model itself or its potential contribution to lower levels of reliability and validity.
The California Department of Education (CDE, 2011) used ELL and non-ELL student performances to investigate validity of reading and writing sub-domains on the California English Language Development Test (CELDT) in kindergarten (non-ELL, n = 1386; ELL, n = 4350) and first grade (non-ELL, n = 495; ELL, n = 3985). Non-ELL students scored higher on average than ELL students, which CDE cited as evidence that CELDT scores are valid for allocating ELL support, even though those decisions depend on CELDT classifications, not scores. In addition, researchers cited group membership predictability as validity of classification accuracy as the majority of ELL students fell below the “early advanced” cut point 3 (K, 94%; 1st, 92%). However, a majority of non-ELL students also fell below the “early advanced” cut point (K, 74%; 1st, 50%), a finding which was not interpreted in the study. The CDE report noted that if the CELDT were used as a stand-alone criterion “a great many more students would be receiving [English language development] services” (p. 46), implying that the K/1 CELDT with its current decision rule can over-identify proficient students as non-proficient. No recommendations were made for interpreting these classifications, even though districts do rely solely on the CELDT to determine which language minority students should receive English language services.
Although some studies have suggested methods for setting an ELP performance standard, there has only been mention of the standards’ sufficiency for state Title III accountability. Validity for individual student-level decisions may be implied, but it has not been assured. Linquanti and George (2007) present California’s approach to establishing annual measurable achievement objectives (AMAOs) to measure and report on progress towards meeting Title III accountability goals. Within the development of AMAO 2 used to measure student attainment of ELP (i.e., percentage of proficiency), the authors report efforts to operationalize and test decision rules to set the state ELP performance standard. Three rules were created: two mixed and one conjunctive. Although the method and sample were unspecified, Linquanti and George reported a “relatively small difference in outcomes” between the mixed rules, which were said to be “substantially more challenging to attain” than the conjunctive (p. 109). Based on this finding, researchers recommended the first mixed rule as “sufficiently rigorous” (p. 109) as an ELP performance standard for use in the development of AMAO 2. The report neither defined nor interrogated the sufficiency of the ELP standard for other uses, such as determining which students should continue to receive ELD services.
Elsewhere, Cook (2009) has outlined how the State of Kentucky set its ELP performance standard for AMAO 2. For this process, ELP was defined as “the point where students’ language proficiency level becomes less related to academic achievement” (p. 4). ACCESS for ELLs™ and the Kentucky Core Content Test (KCCT) tests of reading and mathematics were used to analyze the relationship between ELP and academic achievement, respectively. Kentucky’s overall ELP composite used weights provided by the WIDA consortium (listening*0.15 + speaking*0.15 + reading*0.35 + writing*0.35), creating a continuous variable along which a point could be chosen to define the ELP performance standard. To identify the point at which ELP and reading/mathematics performance became less related, researchers first conducted a correlation and then a decision consistency analysis. Six decision rules were chosen: three mixed and three conjunctive. To test the impact of each decision rule, a logistic regression was conducted to predict the likelihood of reading and mathematics proficiency for students meeting or exceeding the ELP standard set by each rule. Results were presented in clusters for grades 3–5 and 6–8, and individually for grades 10 and 11; students in Kindergarten, grades 1 and 2, and those taking the Tier A test were excluded. Based on the discussion of these results, the expert panel chose a mixed decision rule which classified the fewest students as ELP proficient, 20% fewer than one alternative, yet offered the highest percentage predicted to be proficient in reading and mathematics. However, in relation to impact, students deemed “non-proficient” by the chosen standard were not described, nor was their level of reading or mathematics predicted. This omission suggests the possibility of students with actual reading and mathematics proficiency, and the language ability to achieve that result, being classified as ELP non-proficient in error.
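The weighted composite described above can be illustrated with a brief Python sketch. The weights are those quoted in the text; the two score profiles and the function name are hypothetical and are included only to show how the weighting favors reading and writing.

```python
# Weighted ELP composite using the weights quoted in the text.
# The score profiles below are hypothetical illustrations, not reported data.
WEIGHTS = {"listening": 0.15, "speaking": 0.15, "reading": 0.35, "writing": 0.35}


def composite(scores):
    """Weighted sum of the four sub-domain scores."""
    return sum(WEIGHTS[domain] * scores[domain] for domain in WEIGHTS)


# Two mirror-image profiles with the same unweighted mean.
oral_strong = {"listening": 5.0, "speaking": 5.0, "reading": 3.0, "writing": 3.0}
literacy_strong = {"listening": 3.0, "speaking": 3.0, "reading": 5.0, "writing": 5.0}

print(round(composite(oral_strong), 2))
print(round(composite(literacy_strong), 2))
```

Because reading and writing together carry 70% of the weight, the literacy-strong profile yields the higher composite even though both profiles have the same unweighted average.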
Recent recommendations to policymakers have acknowledged that setting an ELP criterion too high can negatively affect exit decisions. Cook, Linquanti, Chinen, and Jung (2012) offer three approaches to setting an ELP performance standard empirically connected to academic content assessments. They warn that a criterion set too high “will be less predictive of content proficiency at higher ELP levels” (p. 9) and this imprecision adds greater error, which could lead to inaccurate inferences. The authors note that such imprecision might also lead policymakers to establish higher criteria on content assessments, thus raising expectations for ELL exit decisions beyond what is currently attained by mainstream non-ELLs. Of concern to the current study is the impact of these methods on inferences used for decision making at the individual level. While the level of imprecision associated with these models may be deemed tolerable for Title III accountability and reporting at the state level, imprecision leading to the misclassification of individual students in terms of their English language services is a policy outcome that also requires a method of evaluation and accountability. The imprecision may be inevitable, but the damage need not be.
We now propose an argument for validity of inferences related to the use of ELP assessments at the individual level.
Interpretation/use argument
The paucity of peer-reviewed studies investigating validity of classifications for each use emphasizes the need to revisit frameworks for and examples of validation in language assessment. The Standards (AERA, APA, & NCME, 2014) define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). Over the last decade, frameworks for language assessment validation research, using an assessment use argument (AUA) put forth by Bachman (2005), have been proposed for language assessments such as ELPAs (Forte, Perie, & Paek, 2012; Wolf & Farnsworth, 2014; Wolf, Farnsworth, & Herman, 2008) and the Test of English as a Foreign Language (Chapelle, Enright, & Jamieson, 2008). Some authors have provided suggestions for methods and sources of evidence to use (Sireci, Han, & Wells, 2008) and examples in practice, such as Llosa (2008), who applies an AUA validity argument to investigate the validity of inferences from teacher determinations of performance levels based on an ELD classroom assessment.
Kane (2013) has proposed an augmentation to the AUA: an interpretation/use argument (IUA) that includes “the network of inferences and assumptions inherent in the proposed interpretation and use” (p. 1). 4 For this study, we have fashioned an illustrative IUA, which includes the warrants and claims related to the use of proficient and non-proficient classifications for reclassifying ELL students as FEP and eligible for exit from ELD services (see Figure 1). According to Francis and Rivera (2007), the fundamental validity question regarding ELPA and ELL students is “whether a student who scores in the proficient range of the test can function independently in an English-speaking classroom without specific language supports” (p. 20). This study will focus on one aspect of the IUA (Data 3 and Rebuttal 3) for one grade (fifth). ELP assessment classification for individual-level use has two main claims: (1) the inference of proficient signals readiness to receive all instruction in English-only settings; and (2) the inference of non-proficient signals lack of readiness to receive all instruction in English-only settings, thus inferring the continuing need for language services, including those which may supplant grade-level instruction. The warrant for this claim states that the decision rule has been created in a way that minimizes misclassification. The backings for this warrant are based on ELPA data: raw and scale scores, performance levels, and proficiency classifications. The present study focuses on ELPA Data 3, a student’s proficiency classification, and Backing 3, the reliability and validity of model and decision rule, by describing evidence for Rebuttal 3 which states: Classification model and/or decision rule is not appropriate for this construct and/or combining these sub-domains; thus, the extent of misclassification produced may prove to be untenable for each/any classification use.

Figure 1. Sample interpretation/use argument for ELPA classifications for redesignation.
If, for example, a reliability coefficient suggests 20% misclassification, the test developer and state policymaker must choose how to calibrate the decision rule. 5 The estimated amount of misclassification could be deemed untenable for a certain use and adjustments to the decision rule could follow. Although such calibrations are not the focus of this study, we know imprecision cannot be eliminated fully with psychometric techniques. Classification accuracy and consistency must be reported transparently by test vendors so that inferences drawn in service of individual-level decisions can be fully informed.
External criteria for convergent validity
The availability of ELL and non-ELL student performances on both an ELPA and a standards-based achievement assessment (SBAA) provided the unique opportunity to examine classifications using external criteria for convergent validity. Using the substantive argument that a proficient level of English-language proficiency can predict success in an English-only instructional setting, it would reasonably follow that non-ELL students who are currently receiving instruction in English-only settings could be used as a “known-to-be-proficient” comparison group. The extent to which ELPA decision rules classify these students as non-proficient can allow us to draw inferences related to the validity of those classifications. If any non-ELL students are considered by the ELPA to be non-proficient and thus “not likely to be successful in an English-only classroom”, especially non-ELL students who are academically high-performing, it seems logical to explore other decision rules that would be more predictive of actual proficiency. The extent to which different models and decision rules produce more or less potential for misclassification can be examined for this likely-proficient group in order to approximate the extent to which false negatives may be produced for ELL students. The approximation of false positive classifications, although equally important, is not possible with these data.
State A SBAA performances, at some levels, can provide convergent evidence of likely proficiency. State A uses SBAA-Reading performances of “Basic”, “Proficient”, or “Advanced” as criteria for ELL redesignation. Even though SBAA-Reading does not measure all language modalities (e.g., listening, speaking, and writing), we assume that students who perform strongly on this test are utilizing their English skills to do so. Therefore, strong SBAA performances suggest possession of skills consistent with a likely proficient level of English. Performances of “Below Basic” on SBAA-Reading may indicate either low levels of content knowledge or low levels of English language proficiency 6 and cannot stand alone as an indicator of non-proficiency in English (Linquanti, 2001). In point of fact, lack of achievement in SBAA content can be observed in native English speakers within a full range of communicative abilities. In this study, SBAA-Reading performances of “Below Basic” indicate ineligibility for redesignation according to State A guidelines.
To summarize the relationship between ELPA and SBAA data, an ELPA classification of proficient, or redesignation eligible, implies the following: (1) the latent variable of English-language proficiency is present at redesignation-eligible levels; (2) the influence of “Home Language Other than English” has been overcome to the extent that it is no longer a source of error on the ELPA; and (3) any errors on academic content tests administered in English are not likely to be the result of deficits in English-language proficiency. By this same logic, an ELPA classification of non-proficient, or redesignation ineligible, implies the following: (1) the latent variable of English-language proficiency is not present at redesignation eligible levels; (2) the influence of “Home Language Other than English” has not been overcome and is likely a source of error on the ELPA (or, that another influence known to affect language acquisition – e.g., specific learning disability, lack of opportunity to learn, poor attendance, poverty – has not been overcome); and (3) a source of error on tests administered in English is likely to be lack of English-language proficiency, thus predicting inability to show content knowledge and acquisition of standards through SBAA given in English even in cases where testing accommodations have been provided. 7
SBAA performances of ELL and non-ELL students serve as an external criterion related to convergent validity. It is clear that high performance levels on the SBAA are likely to confirm the presence of English-language proficiency at redesignation-eligible levels, but no SBAA performance level(s) can confirm the lack of English-language proficiency. Only measures of English-language proficiency, without construct irrelevant variance, can accomplish that. Therefore, SBAA levels can only predict the convergence of redesignation-eligibility on SBAA (“Basic” or above) and ELPA (proficient). To identify a level of performance where redesignation-eligible English-language proficiency is most defensibly present, the criteria of “Proficient” or “Advanced” in all SBAA domains (Reading, Language Usage, Mathematics, and Science) will be used to construct a subgroup of “academically high-performing” students. By our logic, the ELPA classification of this constructed sub-set should almost certainly be proficient, with no students classified as “non-proficient”, and thus can be used to describe an approximation of the classification sensitivity of decision rules in terms of the ability to predict true negatives.
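The construction of the academically high-performing subgroup, and the share of that subgroup classified as non-proficient, can be sketched as follows. The record layout, field names, and the four student records are invented for illustration; they are not drawn from the State A data.

```python
# Illustrative sketch: build the "academically high-performing" subgroup and
# compute the share classified as ELPA non-proficient. All records are invented.
SBAA_DOMAINS = ("reading", "language_usage", "mathematics", "science")
HIGH_LEVELS = {"Proficient", "Advanced"}


def is_high_performing(sbaa):
    """True when every SBAA domain is at 'Proficient' or 'Advanced'."""
    return all(sbaa[domain] in HIGH_LEVELS for domain in SBAA_DOMAINS)


def non_proficient_share(students):
    """Share of high-performing students whose ELPA classification is non-proficient."""
    group = [s for s in students if is_high_performing(s["sbaa"])]
    if not group:
        return None  # no high-performing students to evaluate
    flagged = sum(1 for s in group if not s["elpa_proficient"])
    return flagged / len(group)


def make_sbaa(reading, language_usage, mathematics, science):
    return dict(zip(SBAA_DOMAINS, (reading, language_usage, mathematics, science)))


students = [
    {"elpa_proficient": True, "sbaa": make_sbaa("Proficient", "Proficient", "Advanced", "Proficient")},
    {"elpa_proficient": False, "sbaa": make_sbaa("Advanced", "Proficient", "Proficient", "Proficient")},
    {"elpa_proficient": False, "sbaa": make_sbaa("Basic", "Proficient", "Proficient", "Proficient")},
    {"elpa_proficient": True, "sbaa": make_sbaa("Advanced", "Advanced", "Advanced", "Advanced")},
]

print(non_proficient_share(students))
```

Under the logic of the paragraph above, any value greater than zero for this share would count as a likely false negative produced by the decision rule being examined.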
Research questions
The current study describes the extent to which decision rules differ in producing non-proficient classifications from a State A ELPA to infer the need for continuation in ELD services. Researchers asked the following questions:
To what extent do decision rules differ in the number of students who would be classified as non-proficient?
To what extent do decision rules classify academically high-performing students as non-proficient?
To what extent do decision rules differ in the number of students who would be classified as ineligible for redesignation?
Method
Data source
Data were collected by the State A Department of Education during annual testing of previously identified ELL students in the spring of 2010 from two groups of K-12 students: ELL (n = 14,513) and non-ELL students (n = 1049). The ELL students in the original data set are all those enrolled in the state. The non-ELL students in this sample were chosen at random from districts that agreed to participate in the state’s data collection initiative. All participants completed the State A ELPA and SBAA.
Sample
For this study, one grade level (fifth) was extracted for analysis (ELL, n = 1119; non-ELL, n = 103). None of these students had been redesignated. We excluded any student who was given the newcomer test form (ELL, n = 70; non-ELL, n = 4) as this test has different sub-domain totals and cut scores requiring additional analyses that would benefit from a larger sample size. We excluded any student who had received testing accommodations on either test or had a special education designation; these data await separate analysis (ELL, n = 153). We excluded one duplicate student (ELL, n = 1), students with missing ELPA or SBAA-Reading data (ELL, n = 12, non-ELL, n = 3), and students with potentially conflicting information about native language (ELLs listed as native English speakers, n = 8; non-ELLs listed as native Spanish speakers, n = 2). The final sample was comprised of 875 ELL students and 92 non-ELL students (n = 967).
Measures
The information in this section was adapted from technical documents by permission of the State A Department of Education (State A DOE, 2009a).
State A English Language Proficiency Assessment (ELPA)
The State A ELPA is comprised of tests in four sub-domains 8 – listening, speaking, reading, and writing – and is administered to each student by grade cluster and test form difficulty. Fifth graders take “Cluster 3–5” (third, fourth, and fifth grades) in either Form 1 (for newcomer students in the United States for 0–12 months) or Form 2 (non-newcomers in the United States for more than 12 months). The State A ELPA is an untimed test that is group administered for some sub-domains (writing, part of reading) and individually administered for others (listening, speaking, and part of reading). Answers to multiple-choice items are recorded by students on test booklets and sent to the test vendor for machine scoring. The constructed-response items in speaking are scored by the examiner in real time. Writing and reading constructed-response items are sent to the test vendor for scoring. Each student response is read and scored by one rater with 20% read by a second rater.
Student performances in each sub-domain are reported in terms of raw score, scale score, and proficiency level. The raw score is the total number of correct answers on multiple-choice items plus the number of points earned on open-ended items. The scale scores for the 2009–2010 ELPA were determined by using a logit difficulty scale from the 2006 administration. For sub-domain scale scores, Advanced Beginning and Early Fluent proficiency level cut scores were set to pre-specified values. For the total scale score, Early Fluent and Fluent level cut scores were set to pre-specified values.
As determined by Bookmark Procedure standard setting (Mitzel, Lewis, Patz, & Green, 2001), the performance levels are as follows: Beginning (1), Advanced Beginning (2), Intermediate (3), Early Fluent (4), and Fluent (5). For each sub-domain test, a proficient performance standard was established at Early Fluent (4) or above. A conjunctive model was used to create the overall performance standard. The decision rule calls for the four sub-domain scale scores to be combined, equally weighted, and summed. According to the State A conjunctive rule, a student is defined as “proficient” in English on the ELPA if the student tests at the Early Fluent or above (EF+) level within each sub-domain assessed on the ELPA.
State A Standards-Based Achievement Assessment (SBAA)
The State A SBAA is a computer-administered, non-adaptive, multiple-choice test, which measures academic content standards in four domains: Language Arts – Reading, Language Arts – Language Usage, Mathematics, and Science. Students complete the test online which the test vendor scores and processes. Automatic and management-initiated audits are in place to ensure the accuracy and reliability of score reports. Performances are reported in raw scores, scale scores, and four proficiency levels: Below Basic, Basic, Proficient, and Advanced.
Procedures
The ELPA was administered to all ELL students in State A as part of usual and federally mandated annual testing for ELL students, as well as a sample of non-ELL students randomly selected from volunteer districts by the state for research purposes during the regularly scheduled testing window (February 22 through April 2, 2010). The SBAA was administered to all ELL and non-ELL students as required by federal mandate during the regular testing window (April 12 through May 7, 2010).
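The interval between a student's ELPA and SBAA administrations follows from these two testing windows. The following is a minimal Python sketch, assuming only the window dates given above (the variable names are illustrative), that computes the shortest and longest possible gap:

```python
from datetime import date

# Testing windows as reported in the Procedures section.
elpa_start, elpa_end = date(2010, 2, 22), date(2010, 4, 2)
sbaa_start, sbaa_end = date(2010, 4, 12), date(2010, 5, 7)

# Shortest possible gap: last ELPA day to first SBAA day.
min_gap_days = (sbaa_start - elpa_end).days

# Longest possible gap: first ELPA day to last SBAA day.
max_gap_days = (sbaa_end - elpa_start).days

print(f"Gap between ELPA and SBAA: {min_gap_days} to {max_gap_days} days")
```

The actual gap for any individual student depends on the dates on which that student sat each test, which are not reported here.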
Data analysis plan
To examine how decision rules differed in producing non-proficient classifications, four decision rules were created to illustrate typical ELPA composites: conjunctive (current State A model), compensatory-1, compensatory-2, and mixed (see Table 2). All four rules were applied to student ELPA performances. Researchers hypothesized that the conjunctive model would produce the most non-proficient classifications.
Classification models and decision rules used for this study.
Note: Listening cut score = 21/25, lower-bound = 17; Speaking cut score = 21/25, lower-bound = 17; Reading cut score = 19/25, lower-bound = 15; and Writing cut score = 16/25, lower-bound = 12.
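To make the four rules concrete, the sketch below expresses them in Python using the cut scores and lower bounds from the note above and the rule definitions summarized in the note to the concordance table in the Results. It is an illustration under stated assumptions, not the state's scoring procedure: the function and variable names are hypothetical, sub-domain scores are taken as raw points out of 25, "overall 80% or higher" is read as a total raw score of at least 80 out of 100, and the overall level for compensatory-2 is taken as an input because the scale-score cut points are not given in this section.

```python
# Sub-domain cut scores and lower bounds (raw points out of 25), from the table note.
CUTS = {"listening": 21, "speaking": 21, "reading": 19, "writing": 16}
LOWER_BOUNDS = {"listening": 17, "speaking": 17, "reading": 15, "writing": 12}


def passed_subdomains(scores):
    """Return the set of sub-domains at or above their cut score."""
    return {d for d, cut in CUTS.items() if scores[d] >= cut}


def conjunctive(scores):
    # Proficient only if all four sub-domains meet their cut scores.
    return len(passed_subdomains(scores)) == 4


def compensatory_1(scores):
    # Proficient if three or more sub-domains meet their cut scores.
    return len(passed_subdomains(scores)) >= 3


def compensatory_2(overall_level):
    # Proficient if the overall ELPA level is Early Fluent (4) or Fluent (5).
    return overall_level >= 4


def mixed(scores):
    # Assumption: "overall 80%" means a total raw score of at least 80 of 100,
    # with every sub-domain at or above its lower bound.
    total = sum(scores[d] for d in CUTS)
    return total >= 80 and all(scores[d] >= LOWER_BOUNDS[d] for d in CUTS)


# Hypothetical student with an uneven profile: strong overall, writing just below cut.
uneven = {"listening": 25, "speaking": 24, "reading": 20, "writing": 15}
print(conjunctive(uneven), compensatory_1(uneven), mixed(uneven))
```

For this hypothetical profile, the conjunctive rule returns non-proficient while the compensatory-1 and mixed rules return proficient, which is the kind of divergence the analysis is designed to quantify.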
To examine the classification of high-achieving students under different decision rules, researchers constructed a subgroup of academically high-performing students as defined by achievement of the performance level of “Proficient” or “Advanced” on SBAA-Reading, Language Usage, Mathematics, and Science. All four decision rules were applied to student performances. According to the intended inference of ELP, researchers posited that all academically high-performing non-ELL students would be classified as ELPA proficient and expected the same for most if not all academically high-performing ELL students. Although the SBAA does not specifically measure listening, speaking, or writing, inferences from high performances on SBAA administered in English are more directly pertinent to readiness for redesignation than inferences from ELPA classifications. Researchers hypothesized that the conjunctive model would create the most non-proficient classifications and a mixed model the fewest.
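The subgroup definition can be expressed as a simple membership test. The sketch below is illustrative only: the domain keys and function name are hypothetical, and treating a missing domain level as not meeting the criterion is an assumption, not a procedure described in this study.

```python
SBAA_DOMAINS = ("reading", "language_usage", "mathematics", "science")
HIGH_LEVELS = {"Proficient", "Advanced"}


def is_high_performing(sbaa_levels):
    """True if the student is Proficient or Advanced in all four SBAA domains.

    Assumption: a missing domain level counts as not meeting the criterion.
    """
    return all(sbaa_levels.get(domain) in HIGH_LEVELS for domain in SBAA_DOMAINS)


# Hypothetical student records.
student_a = {"reading": "Advanced", "language_usage": "Proficient",
             "mathematics": "Proficient", "science": "Advanced"}
student_b = {"reading": "Proficient", "language_usage": "Basic",
             "mathematics": "Advanced", "science": "Proficient"}

print(is_high_performing(student_a), is_high_performing(student_b))
```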
To examine the impact of the four decision rules on redesignation ineligibility, researchers calculated the number and percentage of students meeting both criteria of ELPA “non-proficient” by each model and SBAA-Reading “Below Basic”. It was hypothesized that the conjunctive model would create the largest group of ineligible students, but that no decision rule would classify academically high-performing non-ELLs as ineligible for redesignation.
Results
Descriptive statistics for the full sample and the constructed subgroup of academically high-performing students were compiled (see Table 3). This sample contained 875 ELL and 92 non-ELL students similar in gender distribution (percentage male: ELL, 52.2; non-ELL, 46.7). On other standard demographic variables, differences between ELLs and non-ELLs were expected and were seen in percentage free or reduced-price lunch (ELL, 80.3; non-ELL, 22.8), percentage identified as gifted and talented (ELL, 0.1; non-ELL, 8.7), percentage homeless (ELL, 1.6; non-ELL, 0), and percentage Title I (ELL, 75.1; non-ELL, 40.2). These differences continue to be observed even between subgroups of academically high-performing ELL and non-ELL students. This highlights the similarity of these constructed subgroups with the whole group, even though the non-ELL group has a higher percentage of high-performing students (64.1%, n = 59) than the ELL group (13.1%, n = 115). Due to our exclusion criteria, no students in the sample were identified as receiving special education services or testing accommodations.
Demographic variables by group.
ᵃ Academically High-Performing students achieved “Proficient” or “Advanced” in all SBAA domains (Reading, Language Usage, Mathematics, and Science). ᵇ All values in parentheses are percentages according to row, rounded to the nearest integer.
Thirty-six languages (n = 32, specified; n = 4, unspecified) were reported for the ELL students. The largest native language group was Spanish, comprising 85.9%, followed by North American Indian (2.2%), Russian (1.5%), Arabic (1.1%), and Bosnian (0.9%). None of the non-ELL students retained in this sample reported a native language other than English. The ethnicity reported for most ELL students was Hispanic of any race, comprising 85.1%, followed by White (5.7%), Black/African-American (3.4%), Asian (2.4%), American Indian/Alaskan Native (2.3%), Native Hawaiian/Pacific Islander (0.5%), Multiracial (0.3%), and unspecified (0.2%). For the non-ELL students, the ethnicity most reported was White, comprising 93.5%, followed by American Indian/Alaskan Native (2.2%), Multiracial (1.1%), and unspecified (3.3%).
Descriptive statistics of all performance variables were compiled to determine the means and standard deviations (see Table 4). Missing data were excluded pairwise to take advantage of all possible analyses. Data were analyzed to check that the assumptions of independence, normality, absence of outliers, and homogeneity of variance were not violated.
Means (and standard deviations) of State A ELPA and SBAA by group.
Note: n = 967; ELL (n = 875), non-ELL (n = 92). For the ELPA, raw scores are reported out of 100 total (25 for each sub-domain); for the SBAA, scale scores are reported (maximum scores: Reading = 257, Language Usage = 258, Mathematics = 263, and Science = 253). For Mathematics (ELL, n = 873; non-ELL, n = 92) and Science (ELL, n = 873; non-ELL, n = 89), missing data were excluded pairwise.
In Table 5, ELPA overall levels are reported. When we combine levels 4 and 5, we find that a majority of students achieved an “early fluent or above” standard: 79.3% 9 of ELLs (n = 694), and 94.5% of non-ELLs (n = 87). In Table 6, SBAA levels are reported. A majority of students achieved the SBAA redesignation criterion for State A, which is “Basic” or above in Reading: ELLs, 87%, n = 761; non-ELLs, 98.9%, n = 91. Few students were “Below Basic” in Reading: ELLs, 13%; non-ELLs, 1.1%. However, many more ELLs (23.8%) than non-ELLs (6.5%) were “Below Basic” in Language Usage.
Distribution of overall level on State A ELPA by group.
All values in parentheses are percentages according to row.
Frequency and percentage of students in each SBAA level for each domain.
All values in parentheses are percentages according to row.
Extent of non-proficient by decision rules: Full sample
To answer the first research question, “To what extent do decision rules differ in the number of students who would be classified as non-proficient?”, four decision rules were applied to student performance data and the number of non-proficient cases was calculated, as reported in Table 7. The conjunctive rule classified the most students as non-proficient (ELL n = 498, 56.9%; non-ELL n = 25, 27.2%). Of the remaining decision rules, the mixed rule classified the most students as non-proficient (ELL n = 284, 32.5%; non-ELL n = 8, 8.7%), while the compensatory rules produced the fewest (comp-1: ELL n = 240, 27.4%; non-ELL n = 8, 8.7%; comp-2: ELL n = 181, 20.7%; non-ELL n = 5, 5.4%). The difference between decision rule outcomes is sizable: between the conjunctive and compensatory-2 rules, it amounts to 36.2% of ELL students (n = 317) and 21.7% of non-ELL students (n = 20).
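The percentages and the spread between rules in the preceding paragraph can be reproduced from the reported counts. The short Python sketch below uses only the counts and group sizes stated in the text; the data structure and names are illustrative.

```python
group_sizes = {"ELL": 875, "non-ELL": 92}

# Non-proficient counts by decision rule, as reported in the text.
non_proficient = {
    "conjunctive": {"ELL": 498, "non-ELL": 25},
    "mixed": {"ELL": 284, "non-ELL": 8},
    "compensatory-1": {"ELL": 240, "non-ELL": 8},
    "compensatory-2": {"ELL": 181, "non-ELL": 5},
}


def pct(count, total):
    """Percentage of the group, rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {
    rule: {group: pct(n, group_sizes[group]) for group, n in counts.items()}
    for rule, counts in non_proficient.items()
}

# Spread between the rule with the most and the rule with the fewest classifications.
spread = {}
for group, size in group_sizes.items():
    counts = [non_proficient[rule][group] for rule in non_proficient]
    diff = max(counts) - min(counts)
    spread[group] = (diff, pct(diff, size))

print(rates)
print(spread)
```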
Frequency and percentage of students classified as non-proficient by decision rule and group.
Note: n = 967; ᵃ “Academically High-Performing” is a constructed subgroup of students who achieved “Proficient” or “Advanced” in all SBAA domains (Reading, Language Usage, Mathematics, and Science). ᵇ All values in parentheses are percentages of total group as listed in each column.
Extent of non-proficient by decision rule: Academically high-performing students
To answer the second research question, “To what extent do decision rules classify academically high-performing students as non-proficient?”, a subgroup was constructed of academically high-performing students who had achieved “Proficient” or “Advanced” performances on all State A SBAA domains (Reading, Language Usage, Mathematics, and Science). All four decision rules were applied and the number of students classified as non-proficient was calculated. As with the full sample, the conjunctive rule classified the most students as non-proficient (ELL n = 34, 29.6%; non-ELL n = 6, 10.2%). The remaining decision rules classified very few ELL students as non-proficient (comp-1 n = 6, 5.2%; mixed n = 4, 3.5%; comp-2 n = 2, 1.7%) and no non-ELL students as non-proficient. Findings by group show that academically high-performing ELL students are classified as non-proficient at higher rates (as many as 29.6%) than academically high-performing non-ELL students (as many as 10.2%).
Extent of redesignation ineligibility by decision rule
To answer the third research question, “To what extent do decision rules differ in the number of students who would be classified as ineligible for redesignation?”, ineligibility 10 was calculated by applying the first two State A redesignation criteria: (1) State A ELPA “non-proficient” by each model, and (2) State A SBAA-Reading “Below Basic”. Findings indicate a sizable difference in total ineligibility between decision rules, as seen in Table 8. When applying the compensatory-2 rule instead of the currently used conjunctive rule, there were nearly one-third fewer ELL students found to be ineligible for redesignation (29.6%, n = 259) and nearly one-fifth fewer non-ELL students (18.5%, n = 17). All differences between the current conjunctive rule and the other models and rules were greater than zero. This indicates that inferences about individual students’ redesignation eligibility can differ according to the model and decision rule applied to ELPA scores, even in cases where students are eligible for redesignation according to SBAA criteria.
Frequency and percentage of students ineligible for redesignation by criterion and group.
Note: Full sample of ELL students, n = 875; academically high-performing ELL students, n = 115; ᵃ “Academically High-Performing” is a constructed subgroup of students who achieved “Proficient” or “Advanced” in all SBAA domains (Reading, Language Usage, Mathematics, and Science). ᵇ All values in parentheses are percentages of the total group as listed in each column.
To explore redesignation ineligibility further, ELPA decision rule outcomes were examined at each level of SBAA-Reading for ELL and non-ELL groups (see Table 9). For students who achieved an “Advanced” level on SBAA-Reading, between 0 and 15% were classified as non-proficient by ELPA decision rules. For students who achieved the “Proficient” level on SBAA-Reading, between 6 and 44% were classified as non-proficient by ELPA decision rules. Of all students who achieved “Advanced” or “Proficient” levels on SBAA-Reading, 40.6% of ELLs (n = 208) and 20.5% of non-ELLs (n = 17) were classified as non-proficient by the current State A conjunctive rule. By contrast, only 6.3% of those ELLs (n = 32) and none of the non-ELLs were classified as non-proficient by the compensatory-2 rule. Although not fully interpretable as false-negatives without examining other evidence, this finding illustrates that a sizable and meaningful difference between outcomes by decision rule is possible, even for students who are one or two levels above the SBAA criterion.
Concordance of two criteria for redesignation eligibility by decision rule and group.
ᵃ Incongruence = SBAA-Reading “Basic”, “Proficient” or “Advanced”, and ELPA “Non-Proficient”; or SBAA-Reading “Below Basic” and ELPA “Proficient”. ᵇ Conjunctive = all four sub-domains pass; Compensatory-1 = three or more sub-domains pass; Compensatory-2 = overall level four or five; Mixed = overall 80% or higher plus all sub-domains at or above lower-bound of 95% confidence interval.
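The definition of incongruence in the table note reduces to a disagreement between two yes/no indicators. The following Python sketch is one way to express it; the function name and argument names are hypothetical.

```python
# SBAA-Reading levels that meet the State A redesignation criterion.
SBAA_MEETS_CRITERION = {"Basic", "Proficient", "Advanced"}


def is_incongruent(sbaa_reading_level, elpa_proficient):
    """True when the SBAA-Reading criterion and the ELPA classification disagree.

    Incongruent cases: SBAA-Reading at Basic or above with ELPA non-proficient,
    or SBAA-Reading Below Basic with ELPA proficient.
    """
    sbaa_meets = sbaa_reading_level in SBAA_MEETS_CRITERION
    return sbaa_meets != elpa_proficient


print(is_incongruent("Advanced", False))     # SBAA met, ELPA non-proficient
print(is_incongruent("Below Basic", True))   # SBAA not met, ELPA proficient
print(is_incongruent("Proficient", True))    # both indicators agree
```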
The greatest incongruence was seen for students who achieved “Basic” on SBAA-Reading, as those between 25 and 87% were classified as non-proficient by ELPA decision rules. While these data are not interpretable as false negatives without examining other evidence, this finding also illustrates the difficulty of interpreting inferences from the “Basic” level on SBAA-Reading for this state, a test which notably has only four performance bands.
For students who achieved “Below Basic” in SBAA-Reading, between 14 and 39% were classified as ELPA “proficient”. These data are not interpretable as false-positives without examining other evidence.
Findings for non-ELL students are illustrative rather than indicative, as non-ELL students already receive instruction in mainstream classes by dint of their native speaker status. Differences in decision rule outcomes for non-ELL students can suggest that such differences are not population dependent. For example, 17 non-ELL students who achieved “Proficient” or “Advanced” in SBAA-Reading were classified as non-proficient by the currently used conjunctive rule, whereas only two of these students were classified as non-proficient by the compensatory-2 rule. Notably, the one non-ELL student who achieved “Below Basic” in SBAA-Reading was classified non-proficient by every ELPA decision rule.
In summary, observed differences between decision rule outcomes were greater than zero for all groups. This indicates that, for students at all academic achievement levels, the inference of English language proficiency can differ according to model and decision rule with direct implications for redesignation eligibility.
Discussion
This study explored to what extent “non-proficient” classifications produced inferences valid for deciding which students should remain in or exit from ELD services. In response to prior studies which interpreted mean differences between ELL and non-ELL populations as evidence of classification validity, this study explored the differences by decision rule, the extent of decision rule impact in classifying academically high-performing students as non-proficient, and the extent of decision rule impact on the number of students classified as ineligible for redesignation.
It was found that models and decision rules differed in the number of non-proficient classifications created. Overall, the State A conjunctive decision rule produced the most non-proficient classifications in the full sample, 56.9% of ELLs and 27.2% of non-ELLs, and within the constructed subgroup of academically high-performing students, 29.6% of ELLs and 10.2% of non-ELLs. In the full sample, the mixed decision rule produced fewer non-proficient classifications than the conjunctive rule: 24.4 percentage points fewer for ELL students and 18.5 percentage points fewer for non-ELL students. Compensatory rules also produced fewer non-proficient classifications than the conjunctive rule across and within all groups; however, the two compensatory rules differed from each other by 3.5 to 6.7 percentage points in the ELL groups. Compensatory-2 produced the fewest non-proficient classifications: 20.7% of ELLs, 5.4% of non-ELLs, and 1.7% of academically high-performing ELL students. One instance of concurrence was seen with the academically high-performing non-ELL subgroup: no students were classified as non-proficient by compensatory or mixed rules.
Models and rules also differed in terms of how many students would be classified as ineligible for redesignation. This was observed in the number of students who were found ineligible for redesignation by the SBAA-Reading criterion of “Below Basic” who were also deemed ineligible by ELPA “non-proficient” classifications according to each model and rule. Considering ELPA then SBAA stepwise, according to the State A aggregate decision model, the State A conjunctive rule led to 58.7% of ELLs (n = 514) becoming ineligible for redesignation including 29.6% of the academically high-performing ELLs (n = 34). This was the largest group of ineligible students created by any model or rule. By contrast, the compensatory-2 rule resulted in only 25.8% of ELLs (n = 226) becoming ineligible for redesignation, including only 1.7% (n = 2) of the academically high-performing ELLs.
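The stepwise, ELPA-first aggregate described above can be sketched as a short decision function. This is an illustration under assumptions, not State A's documented procedure: it covers only the first two redesignation criteria discussed in this study, and the function name and status labels are hypothetical.

```python
def redesignation_status(elpa_proficient, sbaa_reading_level):
    """Apply the two criteria stepwise, ELPA first.

    Assumption: a student is ineligible as soon as one criterion is unmet,
    and the SBAA-Reading criterion is consulted only if the ELPA is passed.
    """
    if not elpa_proficient:
        return "ineligible: ELPA non-proficient"
    if sbaa_reading_level == "Below Basic":
        return "ineligible: SBAA-Reading Below Basic"
    return "eligible"


# A hypothetical high achiever on the SBAA who is non-proficient under a given ELPA rule.
print(redesignation_status(False, "Advanced"))
# The same student under a rule that classifies them as ELPA proficient.
print(redesignation_status(True, "Advanced"))
```

Because the ELPA classification is checked first, swapping the ELPA decision rule changes the outcome for the first student even though the SBAA evidence is identical in both calls.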
To summarize, ELPA decision rule choice impacts redesignation eligibility, especially in a state using a conjunctive ELPA composite classification rule and a conjunctive, ELPA-first stepwise aggregate redesignation rule. Furthermore, the difference between the decision rules illustrated in the present study is sizable in terms of the number and academic ability of students who would fall within a zone of uncertainty regarding redesignation eligibility. Nearly one-third of ELL students were found ineligible under the current rule but would be eligible for redesignation under the compensatory-2 rule: 33% of ELL students (n = 288) and 28% of the academically high-performing ELL students (n = 32). This finding demonstrates that conjunctive decision rules do not simply select the highest-performing students for redesignation. Rather, this State A conjunctive rule has created an inference of non-proficient which is at odds with inferences from other indicators of language proficiency such as high levels of achievement in content area tests and native English speaker status. This finding also emphasizes the value of a sampling design which includes ELL and non-ELL students.
In the larger context of redesignation decisions, most states mandate the use of multiple indicators to determine exit, but allow continuation in ELD services based solely on the first or only unmet criterion (see Mahoney & MacSwan, 2005; National Research Council, 2011; Ragan & Lesaux, 2006). In State A, other indicators are not considered unless the ELPA classifies a student as proficient. Thus, the determination of continuance in ELD services is based solely on the ELPA. When ELPA models result in a high number of non-proficient classifications for students already proficient according to other sources of evidence (e.g., SBAA, grades, teacher report, parent report) within this conjunctive aggregate, it is a single classification inference created from the ELPA decision rule that is determining redesignation ineligibility. This stands in opposition to best practices of using multiple measures when making high-stakes decisions, as stated in the Standards (AERA, APA, & NCME, 2014, p. 198): “Standard 12.10: In educational settings, a decision or characterization that will have a major impact on a student should take into consideration not just scores from a single test but other relevant information.”
ELL status should be temporary; yet when no other evidence is used to challenge an ELPA classification of non-proficient, it seems likely that students erroneously classified by the ELPA are at great risk for delay or denial of program exit. At the inception of NCLB, researchers such as Linquanti (2001) and Abedi (2004) heralded the ELPA as superior to the SBAA for redesignation decisions, as many states had unfortunately been weighing SBAA performances too heavily in such decisions. However, the findings of the current study suggest that the reliance on ELPA classifications as sole or primary evidence is also problematic if it disallows the consideration of SBAA performances and other indicators of proficiency. In cases of incongruent ELPA and SBAA data, researchers who take a more nuanced look at sub-domain performances could find that an ELPA “non-proficient” student is within a margin of error of sub-domain cut scores, or that achieving “Proficient” and “Advanced” on SBAA could indicate the presence of content-specific English language proficiency not measured by the ELPA. Arguments that presuppose an ELPA non-proficient student will not be successful in English-only instruction are too reliant on ELPA as a predictive measure. Concurrent performances on state standards-based tests and curriculum-based measures of achievement, along with teacher and parent reports, are crucial to supporting or refuting those claims.
The findings of the current study also suggest that academically high-achieving ELL students can be denied redesignation by a conjunctive decision rule and, to a lesser extent, even by compensatory and mixed rules. While this subgroup of ELL students is not typically considered to be at risk of negative outcomes, misclassification is known to carry consequences, especially for fifth graders transitioning to secondary school. Overall, extended time spent in ELD services when not needed can be detrimental to students’ school persistence (Kim, 2011) and supplant other coursework, making on-time high school graduation more difficult to achieve. In fact, a recent study of districts with higher ELPA requirements has shown lower graduation rates (Hill, Weston, & Hayes, 2014). In terms of psychosocial development, a false classification of non-proficient can limit a student’s exposure to English-only peers and to the kinds of social capital seen as essential for academic success (e.g., important social networks and high-achieving peers, Gándara, Rumberger, Maxwell-Jolly, & Callahan, 2003; see also Castro-Olivio, Preciado, Sanford, & Perry, 2011).
From test development to test score use, it is the inferences which carry the burden of validity – not the assessment (Kane, 2013). Inferences from ELPA classifications are unique in that a non-proficient classification is meant to be temporary, not permanent. Based on claims that most students can acquire English in 4 to 7 years when the conditions are right (Hakuta, Butler, & Witt, 2000), annual measurable achievement objectives under Title III expect one ELPA performance level of growth per academic year; however, as in the case of Kentucky (Cook, 2009), states often set goals that anticipate less than 20% of ELL students to achieve proficiency each year. It is unclear to what extent performance standards can be set to cap the number of students who may achieve proficiency each year. If the choice of model, decision rule, weighting or performance standard were to generate a large number of false negatives annually, a problem would emerge – namely, the backlog of students waiting to be reclassified in the next assessment cycle. The current data cannot address the extent of this backlog or if it is being resolved adequately. Such an inquiry would require access to decision-making practices in schools and standard-setting practices at the state level. Nonetheless, evidence from this study that as many as 33% of ELL students, including 28% of high-performing ELL students, could be false non-masters suggests serious ramifications for ELL instructional programming. Especially for high-performing students transitioning into upper-grade settings with ELD classes replacing other grade-level coursework, false non-masters could be placed in courses below their linguistic competence and could comprise a meaningful percentage of long-term ELLs. Investigations of potential causes of long-term ELL phenomena would do well to carefully review the choice of models, decision rules, sub-domain weights, and performance standards.
This study does not intend to recommend one best model or decision rule for all states and all ELP assessments. Rather, findings are intended to stimulate thinking and discussion about how we interpret the risk and cost of misclassification when individual-level decisions are at stake. The ELPA classification system needs to indicate accurately and consistently when ELL students have reached a level of English-language proficiency that can be supported adequately with the resources of the general education or gifted classroom. Validation studies such as this one, using a modest number of non-ELL students, need not be expensive and can provide such evidence as recommended by the Standards (AERA, APA, & NCME, 2014) and researchers (Wolf, Herman, & Dietel, 2010). Although non-ELL and ELL students differ in important ways owing to the unique processes of first and second language acquisition, it is practical to consider using a known-to-be-proficient population when choosing and calibrating models, decision rules, sub-domain weights, and performance standards.
Differences in decision rule outcomes also suggest that the theoretical bases for classification systems are not inconsequential, although they may be hidden from public view. For ELL assessment systems, as well as other complex systems that rely on multivariate classifications such as special education eligibility, gifted program eligibility, certification, and value-added, these findings illustrate the importance of reporting and evaluating model and decision rule choice in any validation of inferences for high-stakes test score use.
Limitations and future research
The current study utilized a limited sample of students in one grade level for one state’s ELP assessment. Generalization of these findings to other grade levels, other test forms, or other states’ assessments cannot be made without further investigation. Future research might conduct the same analyses for all grade levels, grade clusters, and test forms for other states’ ELP assessments and also explore additional decision rules. Unmeasured factors such as absenteeism, student motivation, and access to high-quality instruction, as well as the time between ELPA and SBAA administration (from 10 to 74 days), may have affected scores and inferences made about them; thus, efforts should be made in future studies to control for these factors. The non-ELL students in this study were a representative random sample drawn by State A from the entire population of non-ELL students in the state, but the sample was smaller than we would have liked. Future studies might include a larger non-ELL sample as well as comparison groups of students receiving special education and students receiving testing accommodations. Additional investigations of misclassification rates could consider the effects of score transformation and standard setting along each step in the test development process from students’ raw scores to final classifications, and the likelihood of misclassification at entrance to ELD programs. Finally, future research might consider the impact of aggregate classification models that use ELPA data before SBAA data in decisions about program exit. Such studies could investigate redesignation decision accuracy in relation to the order in which data are considered and the performance standard setting for each criterion.
Unfair score-based decisions, such as those based on misclassifications, can have profound consequences for test takers as well as society at large by raising concerns about the fairness of the test (Xi, 2010). Research could investigate the effect of incongruent evidence on those perceptions. Especially for districts not using the child study team approach to redesignation as recommended by Mahoney and MacSwan (2005), additional research could document and analyze how students with incongruent data proceed through English learner programs and curricular streams (Estrada, 2014) towards timely graduation. Findings could identify the extent to which decision rules contribute to extended ELL status and known negative outcomes, such as lack of school persistence.
The next-generation ELP assessments are being developed to align better with college and career-ready standards, lending hope that the resultant classifications will provide better inferences of ELP in academic settings (Bailey & Wolf, 2012). Nonetheless, no amount of new content or item types will redress the shortcomings of erroneous classifications if poorly verified approaches prevail. State policymakers and test developers who set performance standards for ELP assessments should ensure that validation studies employ robust methods for verifying the accuracy of classifications for all uses, including student-level decisions. Researchers also need access: when all test consortia, states, and test developers provide adequate transparency about how performance standards were determined, including the choice of models, decision rules, sub-domain weights, and performance standards, robust external evaluations and peer review can be realized fully.
Concluding remarks and significance
In his reflection on the history of language testing, Spolsky (2008, p. 302) observed:
When we realized over 100 years ago the inevitability of error in the measurement of human capacity (Edgeworth, 1890), we set out to try to reduce the size of the error, rather than trying to understand the risk of making decisions about the fate of human beings using erroneous data … Even as we discovered the intricate co-construction of normal conversation, we chose to take the abstract formalization of idealized but non-existent monolingual speakers of standard languages as a norm against which to measure real language use (McNamara & Roever, 2006).
The decision rule based on non-existent monolingual speakers proved to be too difficult for the real monolinguals in our study, even the highest achievers. The authentic, uneven language performances of the academically highest performing non-ELL students in this study indicate that there is still careful thinking to be done about what constitutes non-proficient versus proficient and ready for English-medium instruction. Despite the preference for conjunctive decision rules to minimize false positives, our findings suggest that other rules may be preferable to minimize false negatives. The choice of models and decision rules warrants a substantive, theoretical review based on evidence that goes beyond a methodological discussion of false positives and negatives. Such an effort from the measurement community would help ensure that next-generation assessment systems are more effective and equitable for all students.
Footnotes
Acknowledgements
The authors would like to thank State A for the use of these data, and the editor and anonymous reviewers for their comments and feedback. The authors would also like to thank their colleagues at UCLA for their comments on earlier iterations of this work. Aspects of this work were carried out in partial fulfillment of the first author’s master’s degree at UCLA and presented at the annual meeting of the American Educational Research Association in 2012. The authors are responsible for any errors in this publication.
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author acknowledges support received from the UCLA Graduate Division Graduate Summer Research Mentorship program in 2011. The second author acknowledges funding from the Enhanced Assessment Grant (CFDA 84.368) awarded by the U.S. Department of Education to the OSPI of the State of Washington and Co-PIs Ellen Forte (edCount, LLC), Marianne Perie (formerly of the Center for Assessment), and Alison Bailey (UCLA). The authors accept sole responsibility for the content presented and the views articulated, which do not necessarily represent those of the U.S. Department of Education, OSPI, edCount, the Center for Assessment, UCLA, or CSULA.
