Abstract
The current interest in evaluating teachers and teacher education programs provides an opportunity to consider the education of diverse learners in K-12 schools in the United States. We address teaching English language learners (ELLs), a rapidly growing population. Challenges lie in holding content teachers of ELLs accountable as they are not adequately prepared to teach ELLs. In this article, we outline complexities surrounding the definition of the quality of teaching content to ELLs, provide an overview of existing teacher evaluation instruments, and discuss the difficulties in using ELL test scores to estimate the teacher contributions to academic growth of ELLs.
Introduction
In the United States, there is currently a federal mandate to ensure that teachers provide students with high-quality education, meet rigorous standards, and are ready to teach (U.S. Department of Education [USDE], 2011). The Blueprint for Reauthorization of the Elementary and Secondary Education Act (USDE, 2010) and guidelines for Race to the Top Program funding (USDE, 2009) have renewed interest among state and national policymakers in linking the academic growth of students to their teachers to evaluate and improve teacher effectiveness in U.S. K-12 schools. This call also emphasizes more rigorous teacher training and improvements to teacher education programs based on measures such as teacher observations and student academic gains, which policymakers claim can be linked back to both teachers and their schools of education. Student academic gains are examined through value-added analyses of student test scores. Value-added analyses are used to make inferences about the contributions of teacher effects to student academic growth.
The political emphasis on teacher accountability is not new; research relating student achievement to teaching effectiveness and school effectiveness has a long history (e.g., Brophy, 1973; Doyle, 1977; Gage, 1989; Rosenshine, 1970). The current focus on accountability for teachers and teacher training programs provides an opportunity to address teaching English language learners (ELLs), a population that has been growing rapidly in U.S. elementary and secondary schools (Valdés & Castellón, 2010). This opportunity includes the challenges of holding content teachers accountable for teaching ELLs when in fact many content teachers, although well qualified to teach their areas of specialty, are not adequately prepared to address the unique linguistic needs of ELLs (de Jong & Harper, 2005; Gándara, Maxwell-Jolly, & Rumberger, 2008; Menken & Antunez, 2001). A significant gap exists between being prepared to teach content and being held accountable for outcomes related to ELLs.
This mismatch raises validity concerns because little is known about how existing teacher evaluation instruments and systems can be used to interpret the performance of content teachers who are trying to teach ELLs. Interpretations about teacher performance in the classroom might run the risk of being unfair and of questionable usefulness as validity demands that uses of tests, test scores, and interpretations of those scores be aligned (Kane, 2001). For Kane and his predecessors (Cronbach, 1988; Messick, 1989), the validity of interpretations from assessments is established through a series of arguments. Kane (2001) suggests that an interpretive argument for an assessment “lays out the network of inferences leading from the test scores to the conclusions to be drawn and any decisions to be based on these conclusions” (p. 329). In the context of evaluating content teachers of ELLs, inferences are made from constructs defining teacher quality and quality English as a Second Language (ESL) teaching, assessment instruments based on those construct definitions, and the performance of ELLs on those instruments. However, the inferences made from these three elements need to be investigated to ensure that valid conclusions about teacher performance are made.
Our purpose in this article is to explore issues concerning the evaluation of content teachers of ELLs by outlining the complexities of (a) defining effective content instruction for ELLs, (b) using existing evaluation instruments, and (c) measuring ELLs’ academic growth using large-scale state assessment scores. We first describe the context for teaching content to ELLs to highlight the challenges involved in evaluating content instruction to ELLs. Second, we discuss existing approaches to defining teaching quality with regard to ELLs. Third, we describe the most common observation instruments used to assess and improve the teaching of content to ELLs. Fourth, we describe issues associated with linking ELLs’ academic gains to teacher effects. Finally, we identify opportunities for improving teacher evaluation by taking into account the challenges of assessing content teaching effectiveness with regard to ELLs.
Context of Teaching Content to ELLs
Although there is controversy surrounding the interchangeability of the ELL and limited English proficient (LEP) student labels (Abedi, 2004; Solano-Flores, 2009), the federal government in the No Child Left Behind (NCLB) Act of 2001 defines LEPs as students
who are 3-21 years old, who are enrolled or are preparing to enroll in an elementary or secondary school, and who (1) were not born in the United States or whose native language is a language other than English, and (2) whose difficulties in speaking, writing, reading, or understanding the English language may be sufficient to (a) deny them the ability to achieve academic proficiency on state achievement tests, (b) prevent their success in English-medium classrooms, or (c) limit their opportunities to participate fully in society. (Valdés & Castellón, 2010, p. 25)
The number of students receiving LEP services has increased substantially in recent decades. In 1998, ELLs represented approximately 8% of the total population of K-12 students in the United States. A decade later, they increased to 11% of the total student population (National Clearinghouse for English Language Acquisition, 2011).
There is also evidence that ELLs exhibit lower academic achievement than their peers who are native speakers of English. For example, Grade 8 reading and mathematics scores from the National Assessment of Educational Progress (NAEP) suggest a persistent performance gap between students who are ELLs and those who are not. In both 2005 and 2011, the percentages of ELLs scoring below the basic level and those scoring at the proficient level in mathematics were approximately 70% and 5%, respectively; the percentages of non-ELLs in the same performance categories were approximately 25% and 25%, respectively (USDE, 2012). Similar gaps in performance have been found on NAEP reading assessments and other large-scale reading and mathematics assessments (Abedi, 2002; Young et al., 2008). In fact, it has been suggested that there is a significant relationship between reading skills and performance in mathematics (Beal, Adams, & Cohen, 2010). Although this relationship has been repeatedly pointed out before (Martiniello, 2008), ELLs with low English literacy skills are still placed in mainstream English-speaking classes. What is worse is that ELLs are required to take content assessments based on the assumption that 1 year of academic English instruction is sufficient for them to perform academically and linguistically (Guerrero, 2004).
As the ELL population in the United States continues to increase, educators are being held accountable for addressing the significant achievement gap between ELLs and non-ELLs (Reeves, 2006). Policymakers recognize that teachers play a key role in helping ELLs succeed academically and believe they are most responsible for narrowing the achievement gap. Typically, teachers fall into two general categories: those who specialize in teaching content and those who specialize in teaching ESL. 1 Most content teachers are not adequately trained or do not hold appropriate teaching credentials to teach ELLs (Gándara et al., 2008; Menken & Antunez, 2001). de Jong and Harper (2005) indicate that only about 13% of teachers who reported having ELLs in their classrooms received professional development to specifically help ELLs achieve in academics. Therefore, content teachers commonly turn to available ESL teachers to help ELLs acquire English language proficiency along with academic content knowledge. As such, contributions to student learning and growth in both language and content-area knowledge could be attributable to both content teachers and ESL teachers.
Furthermore, there is evidence to suggest that entry-level content teachers, even those with specialized ESL training, find it challenging to work with ELLs for a variety of reasons. For example, Gándara et al. (2005) found that teachers in California have faced challenges in communicating with parents due to their lack of familiarity with ELLs’ home cultures as well as not having sufficient time and expertise to address the academic and linguistic needs of ELLs. When faced with these kinds of challenges, educators may simply resort to teaching simplified content and procedural knowledge so that ELLs can be “up to par, grade level” as measured by standardized tests (Gersten, 1999, p. 47). Specifically, the pattern of instructional practices of the four teachers teaching grades 4, 5, and 6 in Gersten’s study indicates that they chose to sacrifice some academic content and focus instead on helping ELLs produce grade-level written papers and use grammatically accurate English. In general, when faced with accountability requirements, both entry-level and veteran content teachers may fall back on teaching to the test or reducing the academic rigor of their instruction because they are not prepared to meet the linguistic needs of ELLs.
Teachers’ struggles with ELLs might reflect the shortcomings of teacher education programs or the limited influence that ESL teacher education has on mainstream teacher education programs. According to de Oliveira and Athanases (2007), a key problem is that faculty in teacher education programs may not be prepared to impart essential knowledge and skills for teaching ELLs to future teachers. Also, advocacy for ELL instruction may not be part of a program-wide agenda for teacher education. Individual educators might emphasize pedagogies that help teacher trainees meet the needs of ELLs in their classrooms, but these instructional practices may not be adopted by all the faculty within a given program (McDonald, 2005). Although teacher education programs typically consider diversity in general, they rarely address the integration of language with content instruction (Lucas, Villegas, & Freedson-Gonzalez, 2008; Zeichner, 2003).
It is apparent that there are challenges to ensuring that the growing numbers of ELLs receive quality content instruction that meets their linguistic and academic needs. Content teachers along with ESL teachers need to be involved in closing the achievement gap. To enable greater accountability on the part of content teachers and their teacher training programs to address the unique needs of ELLs, it is essential to clearly define what quality teaching means.
Defining the Quality of Teaching Content to ELLs
Teachers must have a broad range of knowledge and skills to effectively educate students in U.S. elementary and secondary schools. Most importantly, teachers are expected to have a sound grasp of both their content areas and the pedagogy required to foster supportive learning conditions for learners with diverse backgrounds and needs (Cochran-Smith & Zeichner, 2005; Darling-Hammond & Bransford, 2006). However, the evaluation of teacher knowledge and skills is complex, especially in relation to addressing the needs of ELLs. There is no standard, clear, or simple definition of what content teachers need to know to effectively educate ELLs (Janzen, 2008). Some postulate that effectiveness of teaching content to ELLs can be defined as the nexus of teachers’ knowledge of second language acquisition, students, content, and linguistics as it relates to teaching that particular content (Harper & de Jong, 2004; Lucas et al., 2008; Samway & McKeon, 2007; Schleppegrell, 2004). Here, we present two perspectives, (a) systemic functional linguistics (SFL) and (b) sociocultural approaches, for two reasons. First, they constitute the most widely cited body of literature. Second, they are complementary in advocating that teachers should apply their knowledge of linguistics and of students and draw on ELLs’ home language and culture to effectively teach content to ELLs.
One line of research approaches the understanding of ELL teaching quality from a systemic functional linguistics perspective (Christie & Martin, 1997; Fang, 2006; Gebhard, Willett, Pablo, Caicedo, & Piedra, 2011; Halliday, 1985; Schleppegrell, 2004). From this perspective, language is the medium for understanding, using, and teaching the particular discourse of a content area. Knowledge of the discourse of a discipline allows one to socialize into ways of acting (genres) and ways of being (styles) within the particular discipline (Fairclough, 1989). Along these lines, teachers’ knowledge of the discourse is reflected by their abilities to model its features and characteristics and to engage ELLs in using it. The effective teaching of ELLs according to the SFL perspective requires not only understanding of relevant linguistic features but also engaging ELLs in using the academic discourse of the discipline through reading, writing, and speaking.
Experts with a systemic functional linguistics perspective examine the language demands of specific areas of content and offer recommendations for effective instructional practice. These demands are often culled from textbook analyses or the ways that teachers talk to their students. Typical teaching strategies might involve simplifying sentence structures, scaffolding content, and “unpacking” the language needed to make content accessible (Coelho, 2004). By unpacking, effective teachers will explicitly help students with linguistic features commonly demanded by the content. For example, a sentence from a typical science discourse may be challenging to ELLs because the underlined noun clause is densely embedded: “A tornado is a rapidly whirling, funnel-shaped cloud that reaches down from a storm cloud to touch Earth’s surface” (Fang, 2006, p. 501, emphasis added). The teacher could unpack this same sentence by phrasing it differently in everyday language: “A tornado is a kind of cloud. It is shaped like a funnel and moves very quickly. It reaches down from a storm cloud to touch Earth’s surface” (p. 501). Incidentally, the process of unpacking might involve identifying the relationships between linguistic forms and different kinds of meanings, such as “interpersonal,” “textual,” or “experiential” (see Schleppegrell, Achugar, & Oteíza, 2004).
The effective teaching of content to ELLs can also be viewed through a sociocultural lens. Vygotsky (1978) posited that learning occurs on a social plane. Social interactions demand the use of language as both a tool and a system (Khisty, 2001). Thus, active learning happens primarily through interacting with peers while engaging with the language of a particular content area. To cultivate these interactions, teachers would create collaborative learning environments that would allow ELLs to negotiate and communicate meaning around academic content such as mathematics (Furner, Yahya, & Duffy, 2005; Garrison & Mora, 1999; Khisty, 2001). Collaboration encourages comprehensible language use as well as dialogic negotiation of meaning in the classroom (Tharp, Estrada, Dalton, & Yamauchi, 2000).
In addition, effective teachers, according to a sociocultural perspective, should draw on their knowledge of ELLs’ home languages and cultures to build connections between students’ backgrounds and the content area to make the content accessible (Gonzalez, Andrade, Civil, & Moll, 2001; Gutiérrez, 2002; Moll, 1990). For instance, several studies suggest that teachers could scaffold learning by drawing on students’ current knowledge and early literacy experiences by exposing students to age-appropriate books in the native language (Hancock, 2002; Helman & Burns, 2008) and by providing them with the opportunity to use their native language (Reese, Garnier, Gallimore, & Goldenberg, 2000). By permitting the use of students’ native languages in the classroom, teachers respect, affirm, and legitimize the role of those languages in helping students to learn to read and write in English and to engage in discussions of text (Franquiz & de la Luz Reyes, 1998). Effective teachers should also help elicit and use knowledge that ELLs have acquired from outside of school (Gonzalez et al., 2001; Gutiérrez, 2002; Moll, 1990), such as knowledge of farming and animal husbandry related to a student’s rural origins, knowledge about construction and buildings related to urban life, and knowledge about trade, business, and finance (Gonzalez, Moll, & Amanti, 2005). Other examples include teachers’ building on students’ family experiences (Civil, 2002) and incorporating elders’ ways of knowing mathematics (Lipka et al., 2005). Hence, teachers can use ELLs’ home languages and backgrounds as supports for learning.
In general, both approaches offer instructional strategies to make content more accessible to ELLs although with differing emphases on linguistics, culture, and content. Although theoretical constructs for effective content instruction for ELLs are still being defined, the operational definitions of the quality teaching constructs have been tried out in a number of teaching evaluation instruments. Next, we take a look at four of the most commonly known instruments in existence.
Commonly Used ELL Teacher Evaluation Instruments
The following overview of four commonly used evaluation instruments highlights some of the challenges to measuring the quality of content instruction for ELLs. Only in the past decade or so have instruments been developed to facilitate observation of and improvement in ELL teaching practices (Echevarría, Vogt, & Short, 2008; Waxman, Hilberg, & Tharp, 2004). Existing evaluation instruments typically look at the integration of language and content, student behavior, teacher strategies, and classroom characteristics. The goal is to gain information about how well teachers provide ELLs with opportunities to acquire and improve language proficiency along with grade-level academic content. This information could result in professional development opportunities to help content teachers and teacher educators better address the achievement gap faced by ELLs. The four instruments reviewed here all focus on various aspects of classroom instruction in mainstream classrooms. These instruments are the Sheltered Instruction Observation Protocol (SIOP; Echevarría et al., 2008), the Classroom Observation Schedule (COS; Waxman, Wang, Lindvall, & Anderson, 1988), the Teacher Roles Observation Schedule (TROS; Waxman et al., 1988), and the Classroom Observation Measure (COM; Ross & Smith, 1996).
In SIOP, quality of teaching is defined based on the model of Sheltered Instruction (SI), an approach that aims to provide language support to ELLs by sheltering them from the linguistic demands of mainstream content classrooms (Gort, 2003). In sheltered classrooms, teachers provide language instruction targeting linguistic challenges ELLs might confront in the mainstream classrooms. The unit of analysis in SIOP is the teacher. SIOP provides a framework for teachers to deliver lessons in three categories (preparation, instruction, and review/evaluation). The three categories comprise eight components for instruction: lesson preparation, building background, comprehensible input, strategies, interaction, practice/application, lesson delivery, and review/assessment. SIOP also has a professional development aspect, as it aims to help content teachers of ELLs prepare lessons that address both the academic and linguistic needs of ELLs by engaging them in meaningful interactions about the content (Echevarría et al., 2008). Thus, effective SIOP lessons are characterized by students engaging with the teacher and the teacher guiding students to communicate about complex concepts relevant to the content area (Echevarría et al., 2008).
In SIOP, integration of content with language objectives is treated as a criterion for selecting ideas and activities for lessons (Vogt & Echevarría, 2008). However, it is unclear how this criterion is used for specific content areas. That is, SIOP does not address the specifics of how a content-area teacher would unpack the linguistic characteristics specific to the discourse of a particular content area, as suggested by an SFL perspective. Given that content teachers typically do not receive special training in linguistics and second language acquisition processes, such teachers may find it especially difficult to build these connections. In fact, Settlage, Madsen, and Rustad (2005) also suggest that SIOP strategies might sometimes fail to address specific instructional strategies needed for effective instruction in specific content areas. In their study, they found that SIOP strategies were not in alignment with the inductive approach to science that the participating teachers found helpful for their ELLs. Also, teachers misinterpreted the SIOP approach and attempted to enact more explicit vocabulary instruction, which did not appear to be effective for inquiry science. In her response to Settlage, Madsen, and Rustad’s findings, Echevarría (2005) maintains that SIOP is quite distinct in that it offers an instructional framework for teachers to implement best practices for teaching academic English while improving their content understanding.
Several quasi-experimental as well as correlational studies (Ardisana, 2007; Dennis, 2004; Miner, 2006; Short & Echevarría, 1999) indicate the effectiveness of the SIOP with elementary and middle school teachers and their ELLs and show positive relationships between the SIOP model and students’ achievement scores and vocabulary development. McIntyre, Kyle, Chen, Muñoz, and Beldon (2010) similarly examined the relationship between teachers’ implementations of the SIOP model instruction and elementary ELLs’ reading achievements. The study findings show that students who received the SIOP model instruction benefited more than those who did not. One caveat the authors imply, though, is that although SIOP is not geared explicitly toward reading, it is not harmful to ELLs’ reading achievement either.
As compared with SIOP, the other three instruments (COS, TROS, COM) focus more heavily on evaluation (Waxman, Padron, Franco-Fuenmayor, & Huang, 2009). The common ground for these three instruments is that they define quality of teaching based on the degree to which teacher–student interactions or behaviors and learning environments are conducive to student learning. Also, engaging ELLs in language use as an indicator for effective teaching of ELLs is mostly highlighted in SIOP when compared with the other three instruments.
COS (Waxman et al., 1988) is focused on individual students as the units of analysis. It reports their classroom behaviors during classroom instructional processes. It is grounded in the premise that teaching effectiveness can be evaluated by understanding how students learn in classroom environments. This premise allows it to be used to investigate the relationships between school environments and students’ classroom behaviors and manners (e.g., Waxman & Huang, 1997). One of the advantages of using the student as the unit of measurement is that a smaller sample of classrooms is required than for the other instruments, which focus on teachers as the units of measurement (Waxman & Padron, 2004).
More specifically, COS helps to collect data on students’ interactions with the teacher and the materials they use to engage in the type of activities they do. Some of the instructional processes analyzed with COS include setting, interaction, assignment of activities, types of activities, manner, and language used. Setting refers to the location in which students are organized. There are four settings specified: whole class, small group, pairs, and individual. Manner describes students’ uses of classroom time. Types of manner include on task, waiting for teacher, distracted, disruptive, and other (when the former four categories do not apply). COS also captures which language(s) (English, Spanish, or both) students use when interacting verbally and in writing. By focusing on language use, the researcher is able to examine subgroup differences such as bilingual or English-monolingual students. Furthermore, some of the student activities observed include, but are not limited to, working on written assignments, interacting, watching or listening, presenting/acting, tutoring peers, and working with materials/equipment. The focus on instructional activities in COS implies that students’ engagement is inferred from the activities reported from the classroom. In fact, Waxman and Huang (1997) reported that 92% of the time observed students were engaging in the most common activities such as listening or watching as well as interacting and working on written assignments. Overall, COS was found to be reliable with a high interrater reliability (r > .95; Padron & Waxman, 1999).
In TROS (Waxman et al., 1988), teachers’ classroom behaviors are observed, specifically in relation to “(a) their interactions with students, (b) the instructional setting in which the observed behavior occurs, (c) the language used, (d) the purpose of interaction, and (e) the nature of interaction” (p. 73). Both TROS and COS examine setting, language used, and interaction with students; however, COS is more focused on the student as the unit of measurement whereas TROS is more interested in collecting information about teacher behaviors. Similar to COS, in TROS, instructional settings refer to conditions of learning within “whole class, small group, with an individual student, at the teacher’s desk, student’s desk, or while floating” (Doherty, Hilberg, Epaloose, & Tharp, 2002, p. 83). Furthermore, the nature of interaction section comprises items such as “use of questioning, listening, explaining, commenting, cueing/prompting, demonstrating, modeling, and other behavior by the teacher” (Doherty et al., 2002, p. 83). With that, the data collected provide insights into teachers’ uses of collaborative/cooperative learning, direct instruction with the entire class, and independent work. Padron, Waxman, and Huang (1999) confirmed the reliability and validity of this instrument, also finding the interrater reliability (Cohen’s kappa) to be high, with a coefficient of .94.
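For readers less familiar with the statistic, Cohen’s kappa corrects the raw proportion of agreement between two raters for the agreement expected by chance. The following is a minimal sketch of that calculation only; the function name, the category labels, and the sample codings are hypothetical and are not drawn from TROS data or from the studies cited above.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same observation intervals."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of intervals.")
    n = len(rater_a)

    # Observed agreement: share of intervals given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions, summed over codes.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical code throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codings of four observation intervals (illustrative labels only).
rater_1 = ["questioning", "questioning", "explaining", "modeling"]
rater_2 = ["questioning", "questioning", "explaining", "explaining"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this made-up example the raters agree on three of four intervals, yet kappa is lower than the raw agreement rate because some of that agreement would be expected by chance. A coefficient such as .94 therefore indicates agreement well beyond chance, although not perfect agreement.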
COM is another observation instrument designed to collect data on classroom characteristics, teacher/student behaviors, and instructional strategies. It is comprised of six major components:
(1) classroom ecology and resources, (2) classroom makeup and physical environments, (3) interval coding of subject taught, teaching orientation, student engagement, teacher activities, and student behavior, (4) session teacher behaviors (e.g., praises good performance, motivates students, maintains positive climate, and keeps students’ attention), (5) session teaching methods (e.g., uses cooperative learning, direct instruction, tutoring, ability grouping, and seatwork), and (6) field notes. (Waxman & Padron, 2004, p. 146)
COM helps to evaluate whether the teachers use certain instructional strategies such as “sustained writing activities, using technologies, cooperative/collaborative learning, student discussion and self-assessment; questioning; providing feedback; using alternative assessments; and tutoring (by teacher, peer, or aide)” (Doherty et al., 2002, p. 83). Given the components COM is focused on, it could be inferred that in comparison with TROS and COS, COM is more encompassing of the whole classroom while examining classroom characteristics, teacher and student behaviors, and instructional strategies. Also, Waxman et al. (2008) claim that COM is a higher-inference instrument than COS, as the rater needs to make judgments of his or her own about the behaviors observed in the classroom. COM was found to have high interrater reliability and a good estimate of validity (Ross & Smith, 1996).
Although COS, TROS, and COM instruments have been found to be reliable and valid by empirical research, we are aware of only two empirical studies documenting the combined use of these three instruments for evaluation purposes (Padron, 1994; Waxman, Padron, Shin, & Rivera, 2008). First, Waxman et al. (2008) conducted a meta-analysis of the studies that utilized survey and observational data to examine the relationship between Hispanic student resilience and their classroom learning environments. The data collected from teachers and learning environment surveys, as well as student observations, helped to categorize students in reference to their resilience, an attribute needed to face difficult obstacles to become successful at school. Results suggest significant differences between resilient and non-resilient students in terms of their classroom behaviors and learning environments as measured by the COS and COM instruments. Resilient students were engaged in their classroom learning environment more actively and interacted with their peers and teachers more than the non-resilient students. Also, their manners toward learning were more task-oriented and organized than those of non-resilient students. Furthermore, resilient students perceived their teachers to have higher expectations of their performance. These findings suggest that the learning and school environments and teacher behaviors might contribute to the way students perceive themselves and cope with barriers at school.
Second, Padron (1994) combined the use of both COS and TROS instruments to examine the classroom processes when teaching students from diverse backgrounds. In this study, classrooms predominantly with Hispanic ELLs were compared with the other classrooms from inner-city schools with a more culturally diverse student body (Hispanic, African American, White, and Other), but not many ELLs. A total of 166 students
were observed with reference to (a) their interactions with teachers and/or peers and the purpose of such interactions, (b) the settings in which observed behaviors occur, (c) the types of material with which they are working, and (d) the specific types of activities in which they engage. (Padron, 1994, p. 55)
Also, 47 teachers were observed using TROS according to their interactions with students, the settings and materials they were working on, and teacher-assigned or student-selected activity. Padron (1994) found that instructional practices were very passive and students interacted with the teachers less frequently at schools with large numbers of Hispanics and ELLs than at inner-city schools. Neither of the two studies that empirically illustrate the use of the TROS, COM, and COS observation instruments offers sufficient evidence to support the use of the instruments in building causal relationships between teacher effectiveness and student achievement.
The four instruments reviewed serve both professional development (SIOP) and evaluation purposes (COS, TROS, COM; Waxman et al., 2009), each taking a different approach to teaching quality. For instance, SIOP emphasizes the theory of language and culture more saliently than the COS, TROS, and COM combined. Because teacher effectiveness is defined differently across these instruments, none of them appears to be suited for use as a single measure of effectiveness in teaching content to ELLs. For instance, if SIOP were used as the sole measure in evaluating the instructional practice of a content teacher of ELLs, the evaluator might not be able to assess the full spectrum of the knowledge of language that the teacher uses to identify the language demands of the content. In fact, none of the instruments focuses on how teachers unpack linguistic features specific to particular content-area discourse. Likewise, whether used alone or in combination, COS, TROS, and COM would not help to account for teachers’ attention to the linguistic and cultural needs of ELLs, as they are narrowly focused on student–teacher classroom behaviors and learning environments. Last, only SIOP appears to have examined the relationship between student achievement and teacher behaviors. The fact that there is little research connecting the instruments to achievement scores and explaining teaching quality does not help justify the use of the instruments for the purposes of teacher evaluation.
As mentioned above, the observation instruments, such as SIOP, have not been rigorously connected to student achievement and growth. In relation to accounting for student achievement, little is known about which aspects of effective ELL teaching are more important than others for any of the instruments. Also, the degree to which these observation instruments are used varies across and within districts, resulting in a lack of standardization in their use. This lack of standardization might threaten the reliability and validity of research done with the observation instruments. The inadequacy of instruments evaluating the effective teaching of content to ELLs might explain the interest from policymakers in using value-added measures or student academic gains as a proxy measure of effective teaching and quality teacher training programs. In fact, the use of value-added student test scores to evaluate teachers of ELLs has recently gained attention in the teacher accountability literature as well (Darling-Hammond et al., 2010; Jones, Buzick, & Turkan, 2013). The overall message conveyed by this scholarship is that complexities surround the accurate measurement of the academic growth of ELLs and the attribution of ELL student test score gains to specific teachers. We now turn to discussing challenges in using academic gains of ELLs in evaluating educator effectiveness.
Problems Using Value-Added Models to Evaluate Teachers of ELLs
The learning outcomes of K-12 students are being used as another indicator in the evaluation of teachers and teacher education programs (USDE, 2009; see also USDE, 2011, for federal support for using student gains to evaluate teacher education programs and Goldhaber & Liddle, 2011, for an exemplary empirical study). We situate our discussion on using student academic gains in educator evaluation around a commonly used and researched approach: value-added modeling. We argue here that test scores from ELLs cannot be used in isolation to evaluate effectiveness of teaching.
Value-added modeling is an approach that statistically adjusts student test scores for prior academic achievement to estimate the contributions of teachers (or schools) to student academic growth (see Braun, 2005; Harris, 2011). Value-added models, some of which include student demographics, require that student test scores from annual large-scale state assessments across two or more years be linked to each student’s current classroom teacher. Although the technical adequacy of value-added models for the purpose of evaluating individual teachers has been debated for several decades, they are currently being considered by states for use in evaluating teachers (for a summary of state Race to the Top applications, see Learning Point Associates, 2010) and teacher education programs (e.g., Louisiana Department of Education, 2011). Even if value-added models are found to be generally useful in evaluating teachers, their applicability to content teachers with substantial numbers of ELLs may be problematic due to unique challenges in measuring the academic growth of ELLs and attributing such growth to teachers.
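As a rough illustration of the adjustment described above, one simple covariate-adjustment form of a value-added model can be written as follows. The notation is our own assumption for illustration and is not the specification used by any particular state or by the studies cited here: y_{it} is the test score of student i in year t, y_{i,t-1} is that student's prior-year score, x_i is an optional vector of student demographics, \theta_{j(i,t)} is the effect attributed to the teacher j linked to student i in year t, and \varepsilon_{it} is residual error.

```latex
y_{it} = \beta_0 + \beta_1\, y_{i,t-1} + \mathbf{x}_i^{\top}\boldsymbol{\gamma} + \theta_{j(i,t)} + \varepsilon_{it}
```

In a form like this, the estimate of \theta_{j(i,t)} is the quantity interpreted as a teacher's contribution to growth, which is why missing prior scores or poorly measured scores for ELLs bear directly on the inferences made about their teachers.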
In this section, we describe challenges that threaten the validity of inferences about the academic growth of ELLs and, consequently, the validity of inferences drawn from value-added scores about content teachers of ELLs. Because there is little published research specifically focused on using ELL outcomes in teacher evaluation, we draw on research about ELLs’ performance on large-scale assessments and theory about measuring growth to discuss the challenges. Briefly, the two challenges we refer to are (a) threats to the validity of inferences made about ELLs’ academic growth from test scores on large-scale state assessments (for an overview in the context of static performance, see Abedi, 2006) and (b) high mobility.
Regarding ELL student performance on large-scale state assessments, there is uncertainty about the extent to which such assessments are measuring the knowledge and skills that they purport to measure for ELLs (see Young, 2009, for an overview of various types of validity studies for ELLs). The technical quality of measures assessing academic gains for ELLs on large-scale state assessments is suspect for three reasons: ELLs’ low performance (Abedi, 2002; Young et al., 2008), accommodation use (e.g., Wolf, Kao, Rivera, & Chang, 2012), and linguistic accessibility (Abedi, 2011). First, low performance affects the quality of measurements from typical assessments that are aligned with the difficulty level of the general population; it may decrease the accuracy of estimated student academic ability. In addition, true growth may not be reflected because everyone scores low (i.e., floor effects). Second, testing accommodations may affect the measurement of growth, particularly when they are associated with increased test scores (see Pennock-Roman & Rivera, 2011, for a meta-analysis). This can be due to different accommodations assigned to ELLs each year or only ineffective accommodations being available (Abedi, 2006; Rivera & Collum, 2006; Young & King, 2008). Third, limited English language proficiency can be a barrier to accessing some test item content, thereby introducing construct-irrelevant variance into estimates of student ability and academic growth (Abedi, 2006). Wolf and Leon (2009), for example, found that test items with higher linguistic complexity are more likely to function differently, particularly for ELLs with low English proficiency. Young (2009) summarizes additional studies on the relationship between linguistic complexity of assessment items and the psychometric properties of the assessment.
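The floor effect mentioned above can be sketched with a small, deliberately simplified example. All numbers are invented, and the assumption that a test simply reports its lowest obtainable score for any student whose latent achievement falls below it is a simplification made for illustration, not a description of any actual state assessment.

```python
# Hypothetical illustration of a floor effect; all numbers are invented.
FLOOR = 200  # assumed lowest obtainable scale score on the hypothetical test


def observed_score(latent_score):
    """Reported score: latent scores below the floor are reported at the floor."""
    return max(FLOOR, latent_score)


def mean_gain(year1, year2, observe=lambda s: s):
    """Average year-over-year gain after applying an optional reporting function."""
    gains = [observe(b) - observe(a) for a, b in zip(year1, year2)]
    return sum(gains) / len(gains)


# Latent (true) scores for four hypothetical students; each grows by 20 points.
latent_year1 = [170, 185, 195, 230]
latent_year2 = [190, 205, 215, 250]

true_gain = mean_gain(latent_year1, latent_year2)
reported_gain = mean_gain(latent_year1, latent_year2, observe=observed_score)
print(true_gain, reported_gain)  # 20.0 10.0
```

In this sketch every student grows by the same amount, yet the reported average gain is half the true gain because three of the four students start below the floor. A teacher whose class is concentrated near the bottom of the score scale could therefore appear to contribute less growth than actually occurred.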
High student mobility is another factor that presents technical, policy, logistical, and classification problems to the use of value-added growth models. High mobility within districts and states, particularly for migrant Hispanic students (American Federation of Teachers, 2004), contributes to missing data and difficulties in linking ELL student test scores from different districts to teachers and accounting for previous test scores in value-added models. From a technical perspective, student mobility contributes little to bias, based on evidence suggesting that value-added models are not sensitive to missing data from low-scoring students (McCaffrey & Lockwood, 2011). However, from a policy perspective, the use of value-added modeling does not hold teachers accountable for educating ELLs when test scores from many ELL students are not included in the model. As a solution to the mobility issue, highly mobile students can be included in statewide value-added systems if students are likely to move to districts that can be linked within the state. In addition, the new assessments that are being developed to align with the Common Core State Standards may provide an opportunity to better link highly mobile students to their classroom teachers and include prior years’ test scores as multiple states will offer common test items.
There has been little empirical research on the academic growth of ELLs over time due to the complexity involved with measuring both the acquisition of the English language and content knowledge. Likewise, little is known about the performance of initial English proficient students (those who arrive in the United States already proficient in English), long-term ELLs, and reclassified (former) ELLs, although this is beginning to change (see Abedi, 2008, for information on classifying ELLs and Young et al., 2008, for information on subgroup performance and validity). Long-term and former ELLs may have unique characteristics that differ from students classified as ELLs for 1 to 3 years. For example, Olsen (2010) points out that long-term ELLs may have limited access to the full curriculum and may not have received appropriate English language instruction. Reclassification of ELLs also presents a challenge when controlling for student demographic characteristics in value-added models. Reclassified ELL students are in the classroom with English proficient students but may not have had the chance to catch up on the content knowledge they missed while they were ELLs. Because of differences in classification across districts and states (Abedi, 2008), as well as differences in academic experiences because of changes in classification, the tested construct can change over time as content knowledge and English language proficiency change and interact.
The challenges outlined above affect the technical quality of measures of academic growth for ELLs and value-added scores assigned to content teachers of ELLs. These challenges, as well as limited empirical research on how content knowledge learning trajectories may differ for ELLs relative to non-ELLs, call into question the appropriate use of test scores from ELLs in teacher evaluation. For example, academic growth may occur more in later years as English proficiency improves (i.e., improved opportunity to learn and easier access to test items). This could systematically lead to lower value-added scores for educators of ELLs in early grades. Research on the learning trajectories of ELLs would facilitate policy discussions about the meaning of teacher contributions to the growth of low-scoring students relative to average- and high-scoring students. Research may suggest the need for a model that gives different weights to test score gains at different points along the score scale. Given that measuring the academic growth of ELLs is challenging and that the technical adequacy of value-added modeling has not been fully investigated for all students, test scores from ELLs cannot be validly used in isolation as indicators of teacher effectiveness to make high-stakes decisions about teachers.
Discussion
In this article, we have provided an overview of some of the complexities in evaluating the quality of instruction for ELLs. We presented two approaches to defining quality of teaching content to ELLs and suggested that theories of effective content instruction for ELLs are being developed. We presented an overview of existing instruments, arguing that these instruments alone do not serve the sole function of evaluating quality of teaching content to ELLs. Furthermore, we pointed out that there is limited empirical research that warrants the use of these observation instruments for explaining student achievement. We also discussed the difficulties in measuring the academic progress of ELLs and linking ELL academic gains to content teachers.
The challenges mentioned throughout this article bring about opportunities for research and policy considerations. In this section, we discuss opportunities and implications specifically in relation to the challenges raised, including context for teaching content to ELLs, defining effective teaching of content to ELLs, developing new instruments to evaluate teachers, and using both ELL student academic growth and growth in English language proficiency to evaluate effective teaching.
First, in relation to the context for teaching content to ELLs, we highlighted the complexity of relying on content teachers without adequate preparation to teach ELLs. A natural consequence of this challenge is the divided role between ESL and content teachers in helping ELLs acquire both English language proficiency and academic content knowledge. We also pointed out that this division could lead to divided contributions by ESL and content teachers to ELL learning and growth in both language and content-area knowledge. As a solution, we recommend using assessment design to mitigate the challenge of linking teaching effectiveness to ELL student achievement. Student assessments aligned with the Common Core State Standards that share items across states will make available more test scores from prior years for highly mobile students who move across states. This provides an opportunity to include more ELL student test scores in teacher effectiveness indicators. In addition, an adaptive assessment, such as a multistage assessment (see Hendrickson, 2007), in which test takers are routed through two or more item clusters that are each designed for a specific ability range, offers more precise estimates of students’ abilities at the extremes of the test score distribution. This can improve validity for initially low-performing students. Finally, universal design principles (Johnstone, Altman, & Thurlow, 2006) guide the development of accessible tests for all students, including ELLs. Linguistic accommodations, such as a bilingual dictionary, can facilitate access to test item content for ELL students. Such improvements to assessment design would strengthen the validity of inferences about teacher effectiveness in educating ELLs based on student gains on large-scale state assessments.
To further address the problem of divided contributions from ESL and content teachers, scenario-based teacher knowledge measures can assess what teachers know about incorporating both linguistic and content aspects of instructional practice to teach content to ELLs. For instance, in our work, we designed a teacher knowledge measure built around classroom scenarios for mathematics and science teachers of ELLs. Teachers were informed about specific content, learning and language objectives, ELL background characteristics, and ELL language proficiency levels. When we interviewed a subsample of teachers who had taken the test, the emerging pattern was that most mathematics and science teachers were able to visualize the instructional scenario and reason through the most effective instructional strategy. If this method of developing teacher knowledge measures is explored further, linguistic and content aspects of teaching ELLs could be systematically measured as part of a reliable and coherent assessment of teaching effectiveness. This would in turn encourage both content and ESL teachers to be invested in assuring quality instruction for ELLs.
Second, we found that there is no uniform definition of the teaching knowledge and skills necessary to be an effective teacher of ELLs. The existing research within the SFL and sociocultural approaches to defining effective teaching of ELLs is not adequate to determine which teacher knowledge and skills matter more than others in teacher preparation for ELLs. A framework defining the essential teacher knowledge and skills needed for effective teaching of ELLs (see Turkan et al., 2014) could not only guide research systematically but also encourage the design of theoretically grounded instruments for evaluation. With that, the framework would establish the normative and practical knowledge base essential for teaching content to ELLs. The framework would aid the development of a more specialized evaluative measure that captures the specialized practices and skills required for all teachers to teach content to ELLs, addressing both the linguistic and content aspects, and thus eradicating the dichotomy between content teachers and ESL teachers. However, research is needed to support valid and reliable interpretations of teacher performance on the measures of effective ELL teaching. To validate each of the measures, it is important to conduct research examining the four types of inferences as framed in Kane’s (2006) argument-based model of validation. Extrapolation inference becomes especially relevant to using multiple measures of effective teaching as it helps to evaluate the degree to which performance on one measure correlates with performance on other measures (Jones & Brownell, 2014). Different aspects of teaching quality might be captured by multiple measures, and the degree of correlation in teacher performance among various measures builds evidence for validating the measures.
Third, the review of instruments (COS, TROS, COM, SIOP) revealed that the linguistic aspect of teaching ELLs is partially addressed only by SIOP. None of the instruments seemed to be useful for evaluation or improvement of teaching in particular content areas. Three of the instruments (COS, TROS, COM) function more as a general teaching evaluation tool as compared with SIOP. However, they all have varying degrees of potential for professional development or learning opportunities, depending on how they are used in a particular research or teaching context. One limitation across all four instruments is that they have not been researched long enough to identify and validate what aspects of effective ELL teaching work best. The other limitation is that a coherent framework of what it means to teach ELLs effectively does not guide the design of these instruments.
The limitations of existing definitions of ELL teacher effectiveness and observation instruments provide an opportunity for researchers in the fields of content and ESL teaching to synthesize existing research on effective ELL teaching. They also allow researchers to engage in more rigorous research to determine which principles, strategies, and approaches hold the most promise in teacher preparation. The work of designing and validating new instruments can help deepen understandings of effective instructional practice for diverse learners as well as examine and describe instructional practices that might lead to positive learning outcomes (Waxman et al., 2009). Specifically, if the evaluation instruments were based on the synthesis of research and a coherent framework of effective ELL teaching, multiple instruments such as observation protocols and knowledge measures could be designed to assess the practice of teaching content to ELLs, and holistic scores could be assigned to the content teachers. In particular, to render observation instruments more reliable and useful for professional development and evaluation, rigorous research on rating and rater behaviors might be pursued. These instruments, if developed reliably and linked to student growth trajectories, could also help to provide a solution to the challenges of differentiating separate and joint contributions of multiple teachers as well as capturing the interaction of English language acquisition and the learning of content knowledge.
Fourth, we pointed out that the use of value-added models might be problematic, especially for attributing the academic growth of ELLs to their teachers. Some of the challenges involved with measuring the academic growth of ELLs can potentially be alleviated through statistical modeling. For example, when value-added models are used to derive teacher effects from student performance, information about ELLs should be included in the model to control for performance differences that are not related to teacher effectiveness. Specifically, accommodation use can be modeled as a time-varying covariate in value-added models (McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004). Including the percentage of ELLs in a teacher’s classroom can help account for low initial performance (e.g., Newton, Darling-Hammond, Haertel, & Thomas, 2010). Nonetheless, the quality of ELL status as a proxy for unmeasured student characteristics is questionable due to the heterogeneity of the subgroup and the complex interaction of content knowledge and English language acquisition that affects both learning and assessment. This warrants research on the adequacy of ELL status as a statistical control in value-added models. For example, does using a categorical variable (initially English proficient, short-term ELL, long-term ELL, and former ELL) result in different value-added scores than including a dichotomous indicator for whether or not a student is an ELL?
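The question of dichotomous versus categorical coding can be made concrete with a toy calculation. The sketch below is not a value-added model: it uses a crude adjustment (subtracting the mean gain of a student's classification group) in place of a regression, and the teachers, classifications, and gains are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (teacher, ELL classification, test score gain).
RECORDS = [
    ("T1", "short_term", 2), ("T1", "short_term", 6),
    ("T1", "long_term", 10), ("T1", "initially_proficient", 14),
    ("T2", "long_term", 14), ("T2", "former", 6),
    ("T2", "former", 10), ("T2", "initially_proficient", 14),
]


def dichotomous(status):
    """Collapse the four classifications into ELL versus non-ELL."""
    return "ell" if status in ("short_term", "long_term") else "non_ell"


def categorical(status):
    """Keep all four classifications as separate groups."""
    return status


def adjusted_teacher_scores(records, coding):
    """Mean gain per teacher after subtracting each student's group mean gain."""
    by_group = defaultdict(list)
    for _, status, gain in records:
        by_group[coding(status)].append(gain)
    group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}

    by_teacher = defaultdict(list)
    for teacher, status, gain in records:
        by_teacher[teacher].append(gain - group_mean[coding(status)])
    return {t: sum(v) / len(v) for t, v in by_teacher.items()}


dichotomous_scores = adjusted_teacher_scores(RECORDS, dichotomous)
categorical_scores = adjusted_teacher_scores(RECORDS, categorical)
print(dichotomous_scores)  # {'T1': -0.75, 'T2': 0.75}
print(categorical_scores)  # {'T1': -0.5, 'T2': 0.5}
```

Even in this toy case the two codings yield different adjusted scores for the same teachers. Whether such differences are large enough to matter in operational value-added models is the empirical question raised above; the sketch does not answer it.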
ELL status is classified into categories using standardized English language proficiency assessments or more localized instruments such as home language surveys. Emerging policy on English language proficiency assessments shows promise in advocating for the link between students’ simultaneous acquisition of the English language and content knowledge (Van Lier & Walqui, 2012). Without reliable and valid measures of growth in ELLs’ language proficiency and content knowledge, evaluations of multiple teachers contributing to the same students can only be based on weights derived from human judgment. Therefore, we propose interdisciplinary research partnerships to address language and content in validating English language proficiency assessments. One potential area of research is to explore valid ways of measuring English proficiency annually along with content knowledge and incorporating growth in English language proficiency into teacher effectiveness indicators. The measurement and statistical challenges outlined herein highlight the need for policy discussions on the incentives created by the use of student test score gains in holding teachers accountable for the quality of education that students receive. If improving education and learning outcomes for all students is a priority for the nation, then estimates of the contributions of teachers to this effort should specifically include accurate measurements of student gains from subgroups, including ELLs. In the meantime, as states and districts need to move faster than the research in developing evaluation systems with components based on student outcomes, we encourage them to initiate plans to validate their efforts, in particular with regard to ELLs and other special student populations, and to make changes in the future if needed.
Conclusion
Many educators face the challenge of teaching content to ELLs in comprehensible and accessible ways. All teachers—not just ESL teachers—share the responsibility to meet the needs of these students. States and districts are working to respond to the call for adopting valid and reliable ways of assessing teaching quality. Valid interpretations of teacher performance depend on the extent to which we are able to create opportunities for better understanding, identifying, and assessing the quality of teaching content to ELLs, as well as valid and reliable assessment of ELL academic gains on large-scale tests.
In relation to understanding and assessing teaching quality for ELLs, we stressed the importance of developing a framework that is connected to specialized measures that would serve to assess the essential knowledge base for teaching content to ELLs. This knowledge base would be assessed based on a framework that identifies content-area teacher competencies and effective areas of ELL teaching practices situated in the tasks of teaching content. New observation instruments, which would be used exclusively to monitor teaching effectiveness in the classroom, could be developed based on the same holistic framework describing what teachers need to know as they perform the tasks of teaching content to ELLs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
