Abstract
The purpose of this essay is to provide an overview of the challenges of accounting for students with disabilities (SWDs) and English learners (ELs) in the evaluation of mainstream teachers. We focus on the two prominent indicators of teaching quality—classroom observations and value-added scores. We begin by describing each indicator and outlining the specific challenges related to the inclusion of SWDs and ELs in mainstream teacher evaluation. We then suggest recommendations for states and districts to ensure that teacher evaluation systems adequately and fairly account for these students. Finally, we provide researchers with a set of recommendations for improving the evidence base surrounding the validity of teacher evaluation measures with regard to SWDs and ELs.
Within the U.S. K–12 education system, there is consensus among policy makers, administrators, and educators that students’ opportunities for learning depend on the quality of teaching they receive in schools. Existing educator evaluation systems have been criticized for a lack of fidelity and rigor (e.g., Weisberg, Sexton, Mulhern, & Keeling, 2009). Consequently, the federal Race to the Top program has emphasized the need for multiple measures in educator evaluation systems, with a specific emphasis on student growth (U.S. Department of Education, 2010). Emerging teacher evaluation efforts have focused mainly on two indicators of teaching quality—observations and student test scores. States and districts are including local indicators as well to create weighted evaluation systems that avoid overreliance on any one measure, particularly student achievement. Examples of weighted indicators in state/district evaluation systems are displayed in Table 1.
Table 1. Examples of State/District Evaluation Systems (as of August 2012)
Note. These examples focus on components of the evaluation system for teachers in tested grades and subjects. Each evaluation system has other components for evaluating educators who do not have sufficient data from students to compute value-added scores.
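To make the idea of a weighted evaluation system concrete, the following is a minimal sketch of how a composite rating could be computed from several indicators. The component names, the weights, and the 1-4 score scale are hypothetical values chosen for illustration; they are not taken from any state or district system listed in Table 1.

```python
def composite_score(component_scores, weights):
    """Return the weighted average of evaluation components.

    Both arguments are dicts keyed by component name. Scores are assumed
    to already be on a common scale (here, a hypothetical 1-4 scale).
    """
    if set(component_scores) != set(weights):
        raise ValueError("scores and weights must cover the same components")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(component_scores[name] * weights[name] for name in weights)


# Hypothetical weights and scores, for illustration only.
weights = {"observation": 0.50, "value_added": 0.35, "local_indicator": 0.15}
scores = {"observation": 3.0, "value_added": 2.0, "local_indicator": 4.0}

overall = composite_score(scores, weights)
print(round(overall, 2))
```

Because no single component carries all of the weight, a low score on one indicator (here, value-added) lowers the overall rating without determining it.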
Despite advances in research on teacher evaluation (for summaries, see Harris, 2011; Bell et al., 2012), there has been virtually no attention given to whether teachers are effectively educating exceptional populations—namely students with disabilities (SWDs) and English learners (ELs). For all the criticism of No Child Left Behind, one of its achievements was requiring that districts attend to the achievement of subgroups of students, including SWDs and ELs. In contrast, although Race to the Top was designed to ensure that students have highly effective teachers, there is no explicit mention of teachers’ effectiveness at differentiating their instruction. In this essay, we argue that if teacher evaluation systems fail to acknowledge the presence of SWDs and ELs in teachers’ classrooms, it is problematic in terms of validity (as teachers’ evaluations may represent an incomplete and potentially inaccurate picture of teachers’ instruction) and equity (by providing disincentives for attending to these students’ needs).
We discuss these two student populations together for several reasons. First, both represent critical subgroups in U.S. schools. Among K–12 students, approximately 12% receive special education services and 11% are ELs (U.S. Department of Education, 2007, 2008). Most SWDs and ELs are educated in mainstream classrooms: the majority of SWDs spend 80% or more of their time in regular classrooms (Table 2 shows the distribution by disability subtype), and, although there are no national statistics on the time ELs spend in mainstream classrooms, trends suggest that more ELs are being included in regular classroom instruction for longer periods of time (e.g., Zehler et al., 2003). Second, the presence of SWDs and ELs in teachers’ classrooms can meaningfully shape teachers’ practices because these students often require teachers to modify or supplement their instruction. Third, many have questioned whether the performance of SWDs and ELs on standardized assessments supports valid inferences about their learning and academic growth. At the same time, we acknowledge that there is wide variation within and between these two populations in terms of classroom context and accessibility needs and that a small percentage of students both have disabilities and are ELs. We therefore outline recommendations for how evaluation systems can better attend to the specific disability categories of teachers’ students, as well as different levels of students’ English proficiency, and, where appropriate, we differentiate challenges and recommendations that may be unique to either subgroup.
Table 2. Percentage of Time Spent by Students With Disabilities in the General Education Classroom
Note. Percentages include students served under IDEA who attend regular schools. Students in homebound/hospital placement, a correctional facility, or a separate public facility, or those who attend private school but receive publicly funded special education are not included in the table.
Source: U.S. Department of Education, National Center for Education Statistics (2011). Digest of Education Statistics, 2010 (NCES 2011-015), Chapter 2.
The remainder of this essay is organized around two common indicators of teaching quality—student achievement, namely value-added scores, and classroom observations. 1 For each indicator, we first briefly describe the measure and outline the challenges in accounting for SWDs and ELs in mainstream teacher evaluation. We then suggest recommendations for states and districts to ensure that teacher evaluation systems adequately and fairly account for the inclusion of SWDs and ELs in mainstream classrooms. Finally, we provide researchers with a set of recommendations for improving the evidence base surrounding the validity of teacher evaluation measures with regard to SWDs and ELs. These suggestions are also summarized in Table 3.
Table 3. Summary of Suggestions for Researchers and Practitioners in Evaluating Teachers Who Educate Students From Special Populations
The relative importance of the challenges we describe below depends largely on local context and may also depend on the number of teachers with SWDs and/or ELs in a state/district or, for individual teachers, the proportion of students from special populations in a given classroom. Our goal in highlighting these challenges is to encourage states, districts, and researchers to explore them and to consider short- and long-term solutions for those found to be the most threatening to the credibility of their evaluation systems. We advocate for a system that holds teachers accountable for educating all students and contributes to professional development while minimizing unintended consequences, such as discouraging teachers from differentiating instruction or from entering the profession.
Value-Added Scores
Challenges of Accounting for SWDs and ELs in Teacher Value-Added Models
Value-added scores are derived from statistical models that attempt to explain the contribution of individual teachers to student achievement. To estimate a teacher-specific effect on achievement, value-added models take into account student prior achievement on standardized tests and may also control for other student and school characteristics. 2 There are potential benefits to using value-added scores, particularly relative to other indicators based on student outcomes. Value-added scores (a) can provide a standardized, common metric for estimating teacher effects, (b) are intended to support causal inferences about a teacher’s impact on student growth, (c) are based on large-scale, standardized assessments that may have more desirable psychometric properties than other student assessments, and (d) can be evaluated for validity. Nonetheless, their appropriateness for making high-stakes decisions about teachers has been controversial because of a number of serious limitations. One concern is that the year-to-year correlation of value-added scores appears to be “small” to “moderate,” though there is some evidence that some component of performance persists within teachers over time (Goldhaber & Hansen, 2010; McCaffrey, Sass, Lockwood, & Mihaly, 2009). Concerns have also been raised that value-added scores may lead to the misclassification of teachers, or that they do not adequately account for the selection of teachers and students into schools. However, evidence varies regarding the scope of this bias in value-added scores (e.g., Briggs & Domingue, 2011; Chetty, Friedman, & Rockoff, 2011; Rothstein, 2010). There are also a number of logistical challenges that must be addressed, including how to develop a fair evaluation system when only a small percentage of teachers can have individual value-added scores estimated. 3 We refer the reader to Baker et al. (2010) and Glazerman et al. (2010) for more details and additional technical concerns related to value-added models in general.
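The basic logic of a value-added estimate can be illustrated with a deliberately simplified sketch. It assumes a single prior-year score as the only control, fits one pooled regression of current scores on prior scores, and treats each teacher's average residual as that teacher's estimate. Operational value-added models are considerably more elaborate (multiple prior scores, student and school covariates, shrinkage); the function name and the toy data below are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean


def simple_value_added(records):
    """Estimate a crude teacher effect from (teacher, prior, current) tuples.

    Step 1: fit current = intercept + slope * prior across all students.
    Step 2: average each teacher's student residuals (actual - predicted).
    """
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    mean_prior, mean_current = mean(priors), mean(currents)

    sxx = sum((x - mean_prior) ** 2 for x in priors)
    sxy = sum((x - mean_prior) * (y - mean_current)
              for x, y in zip(priors, currents))
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior

    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {teacher: mean(values) for teacher, values in residuals.items()}


# Toy data: students of teacher A gain 10 points, students of B gain 2.
records = [
    ("A", 40, 50), ("A", 50, 60), ("A", 60, 70),
    ("B", 40, 42), ("B", 50, 52), ("B", 60, 62),
]
effects = simple_value_added(records)
print(effects)
```

In this toy example the two estimates are equal in size and opposite in sign, because each teacher's estimate is expressed relative to the average growth of all students in the data rather than on an absolute scale.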
We suggest that in addition to general concerns, SWDs and ELs present unique challenges that can affect the quality of value-added scores. Broadly, these challenges include (a) factors that threaten the validity of inferences about academic achievement over time for SWDs and ELs and (b) the complex instructional contexts in which SWDs and ELs are taught. In any given classroom, as the number of SWDs or ELs increases, these challenges will have greater impact on value-added scores. Although value-added models can statistically control for time-invariant student characteristics, factors outside of the control of the teacher that change over time can present challenges for estimating teacher effects.
One threat to the validity of inferences about academic progress for both SWDs and ELs is inconsistent use of testing accommodations. There is evidence that some accommodations used by SWDs and ELs are associated with changes in performance, though the direction of this impact varies (for ELs: Pennock-Roman & Rivera, 2011; for SWDs: Sireci, Scarpati, & Li, 2005). Although accommodations are intended to remove barriers to students’ ability to access test items, they can be inappropriately assigned or ineffective (e.g., ELs: Abedi, Hofstetter, & Lord, 2004; SWDs: Ketterlin-Geller, Alonzo, Braun-Monegan, & Tindal, 2007), which can increase measurement error and misrepresent students’ true academic growth. Despite the increased emphasis on including ELs and SWDs in large-scale assessments and providing them with testing accommodations, states have not systematically completed the work of assigning appropriate accommodations based on ELs’ history of formal schooling in the United States and in their home country, as well as their literacy skills in English and in their native language (Kopriva, Emick, Hipolito-Delgado, & Cameron, 2007), nor have they resolved debates about controversial accommodations for SWDs that may interact with the measured construct (e.g., read aloud). Uncertainty about which accommodations are appropriate, limited resources, and students’ changing needs can each result in accommodations being delivered inconsistently across years, which can inflate or deflate measures of student growth, potentially affecting value-added scores.
A second measurement challenge is that a large proportion of SWDs and ELs exhibit low performance on state assessments (ELs: Perie, Grigg, & Donahue, 2005; SWDs: Thurlow, Bremer, & Albus, 2011). This puts into question the quality of value-added scores for teachers with large numbers of SWDs or ELs because extreme scores have higher measurement error and are less reliable, properties that are compounded when using multiple test scores. In addition, student learning gains may not be realized because of floor effects. Although these issues are not confined to SWDs and ELs, the systemic and predictable nature of these low-scoring subgroups may engender the perception of unfairness.
In addition to the aforementioned measurement challenges, additional complications for modeling value-added scores include heterogeneity within the subgroups of ELs and SWDs, as well as the context in which SWDs and ELs are taught. Students in both subgroups vary in terms of student characteristics, opportunity to learn, accessibility, and special services. For example, for some students with disabilities, such as those with learning disabilities or emotional and behavioral disorders, movement into or out of special education can change the services students receive and the amount of time spent in the regular classroom, making it more difficult to isolate teacher effects. For ELs, late-arrival students with low English proficiency pose different challenges from those ELs who are designated to be Fluent English Proficient but are underachieving on state content assessments. For the latter group, the proportion of time spent learning content and the English language changes as students progress; this may lower students’ test performance temporarily, independent of a teacher’s efforts. There may also be peer effects associated with nondisabled, English-proficient students who share the classroom with SWDs and/or ELs that may differ depending on disability subtype or level of English proficiency. That is, the performance of all students in a classroom may be affected—positively or negatively—by the presence of a co-teacher, extra funding support for special services, peer behaviors, or other factors not directly related to an individual teacher. 4 Simply including a discrete variable for SWD or EL status may not account for such sources of heterogeneity in a value-added model.
A final challenge for incorporating SWDs and ELs into value-added models is attributing students’ growth to individual teachers. Mainstream teachers commonly share responsibility for instruction with special education teachers, particularly for students with high-incidence disabilities. Similarly, for ELs, there is often shared responsibility between mainstream teachers and English as a second language (ESL) teachers. For SWDs and ELs, academic growth in a given year is likely dependent on the quality (and content) of instruction received in both settings, as well as the degree to which this instruction aligns, factors that cannot be directly estimated by value-added models.
Value-Added Scores: Suggestions for Practitioners and Researchers
There are various strategies for practitioners to attribute SWD and EL performance to individual teachers when responsibility for individual students is shared across mainstream and special education or ESL teachers. We propose that districts adopt a roster validation system, which can increase both the face validity of value-added scores and the accuracy of estimates (Hock & Isenberg, 2011). The Houston Independent School District, for example, uses a system where teachers can regularly log in and verify the accuracy of their rosters. As part of the roster system, Hock and Isenberg (2011) recommend that both the general education and special education teachers receive 100% responsibility for their shared students. Although roster validation will not address the question of whether general education teachers have adequately differentiated their instruction for SWDs or ELs, it will decrease the likelihood that these students are viewed as the sole responsibility of special education or EL teachers.
To improve the quality of value-added scores relative to SWDs and ELs, we suggest that state personnel support accessible assessments that offer precise measurement along the score scale, such as a multistage adaptive assessment, 5 and develop systems to accurately assign, record, and monitor the use of testing accommodations. 6 Also, we recommend that administrators work with teachers to understand the validity of their value-added score given their particular classroom context, in order to minimize unintended consequences and perceived unfairness. As we suggest below, more research is necessary to better understand how decisions about including or excluding SWDs and ELs affect the validity of teachers’ value-added scores. In the short term, we recommend that states and districts balance the unintended consequences of not including test scores from SWDs and ELs (e.g., teachers focusing their reading instruction only on students who will be included in their value-added estimates) against the various threats to validity when scores are included.
For researchers, there is a need to conduct more nuanced validity investigations for value-added modeling, testing various assumptions regarding the presence of scores from SWDs and ELs in mainstream teachers’ value-added scores and exploring variables that account for heterogeneity within these subgroups. Rather than including a single variable for percentage of SWDs and/or ELs in a class, we encourage research on value-added models with respect to inconsistent accommodation use across years, disability subtype, and changes in English proficiency or classifications for ELs (e.g., limited English proficiency, reclassified English proficient). Examples of analyses include computing the correlation of teacher rankings based on several value-added models and estimating whether these finer-grained categories change the inferences made about teachers.
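One of the analyses mentioned above, comparing teacher rankings across value-added models, can be sketched as follows. The sketch assumes two sets of teacher estimates, one from a model with a single SWD/EL indicator and one from a model with finer-grained categories, and computes the Spearman rank correlation between them. The estimates are invented for illustration, and the helper functions are our own; in practice an established statistics library would normally be used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (lowest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical estimates for the same five teachers under two models.
single_indicator_model = [0.30, 0.10, -0.05, -0.20, 0.00]
finer_grained_model = [0.25, 0.12, 0.02, -0.22, -0.03]

rho = spearman(single_indicator_model, finer_grained_model)
print(round(rho, 3))
```

A correlation near 1 would suggest that the finer-grained categories leave teacher rankings largely unchanged. Even a high overall correlation can hide large shifts for individual teachers, so it is also worth examining which teachers change rank and by how much.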
We believe that if research finds that individual teachers’ value-added scores are robust to these theoretically important variables, SWDs and ELs should be included in the models in practice, which would promote the perception that educators in mainstream classrooms are being held accountable for the quality of instruction for SWDs and ELs. Alternatively, if the proposed research finds that additional information on SWDs and ELs improves the validity of value-added scores, we encourage states/districts to use variables specific to special populations in their value-added models, with attention given to the interpretation and practical consequences of their use.
Observations of Classroom Instruction
Challenges of Accounting for SWDs and ELs in Observations of Teachers’ Instruction
Although there has been considerable research and commentary on the quality of value-added models, there has been comparatively little scrutiny given to observation protocols in the context of teacher evaluation, despite the fact that most new models of teacher evaluation require their use. There is also little research on whether observation systems adequately measure teachers’ effectiveness with regard to SWDs and ELs. In the following, we describe approaches for observing classroom instruction, discuss considerations related to SWDs and ELs, and provide recommendations for practitioners and for future research.
Teacher observation systems are commonly based on a set of theoretical dimensions that are intended to define critical aspects of teaching (e.g., classroom climate, quality of feedback to students); all operate with a working definition of what constitutes “good teaching.” Observation protocols typically fall into one of two categories—either they are intended to be used across all settings (e.g., Framework for Teaching [FFT]; Danielson, 2007) or they are developed for use in specific subject areas (e.g., the Mathematical Quality of Instruction assessment; Hill et al., 2008).
With states and districts feeling pressure to simultaneously develop and implement multiple teacher evaluation measures, most districts and states have decided to limit administrative burden by adopting a single observation protocol, with the use of FFT being the most common. 7 This approach eliminates concerns about differences in validity and reliability across various subject-specific and student population–specific protocols; however, it presents challenges for holding teachers accountable for the performance of SWDs and ELs in their classroom. That is, for observation protocols to support valid inferences about teacher effectiveness, they must adequately address whether teachers are using practices that are identified as effective for SWDs and ELs. The observation protocols being considered for teacher evaluation (e.g., FFT) typically do not outline expectations for the instruction provided to SWDs and ELs; rather, the interactions between teachers and their students are described in general terms. Further, as we outline below, there is evidence that the instructional needs of SWDs and ELs differ from those of their peers in important ways. If instructional practices deemed effective for SWDs or ELs are not represented in the observation systems used by states/districts, teachers may have a disincentive to adopt such practices in their teaching.
For SWDs, much of the literature on effective instructional practice has been developed around the inclusion of students with high-incidence disabilities (such as learning disabilities) in general education settings, particularly in the delivery of reading instruction. For example, Vaughn and Linan-Thompson (2003) suggest that the most promising instructional approaches for students with learning disabilities are those that are “characterized as being well specified, explicit, carefully designed, and closely related to the area of instructional need (e.g., reading, spelling, math)” (p. 142). In addition to an emphasis on instruction that is direct and explicit, Brownell and colleagues have emphasized the need for general educators to demonstrate knowledge related to basic reading skills, peer learning, and self-management techniques (Brownell et al., 2009).
Likewise, there are specific instructional practices that benefit ELs—and which may not be accounted for in general observation protocols. This might be due to the multitude of approaches to defining effective instructional practice for ELs. Some argue that ELs benefit from the basic instructional practices that have been proven to be effective for native English speakers but they need modifications to this instruction (Gersten & Baker, 2000). In line with this thinking, ELs often benefit from a sustained instructional emphasis on vocabulary development, including, for example, the multiple meanings of words in English (e.g., August, Carlo, Dressler, & Snow, 2005). It has also been argued that effective teaching of ELs goes beyond vocabulary instruction (Schleppegrell, 2004).
In addition to providing instruction focused on vocabulary and reading skills, successful teachers of EL students are attentive to students’ home language and culture in their instruction (e.g., Paneque & Barbetta, 2006). Others have also argued that effective EL teaching requires specific knowledge, in that teachers of ELs should be able to engage ELs in using the academic language of a particular discipline (Brisk & Zisselsberger, 2011). It is likely that successful teachers of ELs are able to integrate both language and content objectives, as is outlined in the sheltered academic instructional practices proposed by Echevarria, Vogt, and Short (2008). 8 In summary, if the goal of observation systems is to assess the quality of instruction provided to all students, then a major challenge moving forward will be how to account for the kinds of instructional practices described above that appear uniquely beneficial for SWDs and ELs.
One final issue relates to the reliability of observers’ scores. If observation protocols are to support valid interpretations about teachers’ instructional quality for ELs and SWDs, researchers must attend to whether observers themselves can reliably differentiate between teachers who do and do not make use of effective instructional practices for these populations. A handful of recent large-scale research studies suggest it can be challenging to teach observers to score in reliable ways (Bell et al., 2012). Researchers have identified multiple sources of variation in observation scores for teachers, such as differences across observers or across lessons or class periods (Bill & Melinda Gates Foundation, 2012; Hill, Charalambous, & Kraft, 2012). We argue that in addition to these sources of error, an observer’s training and background specific to ELs or SWDs may lead to additional variation, and there is evidence that principals and other school administrators often lack the expertise necessary to evaluate teachers’ instruction of SWDs (Blanton et al., 2006); the same appears true for ELs.
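One simple way to examine whether two observers score the same lessons consistently is to compare their ratings directly. The sketch below computes Cohen's kappa, a common chance-corrected agreement statistic. It is offered as one illustrative option and is not the reliability method used in the studies cited above, which rely on more elaborate designs that separate observer, lesson, and teacher variation. The ratings are hypothetical scores on a 1-4 rubric.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same lessons."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of lessons")
    n = len(rater_a)
    # Proportion of lessons on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, given each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores from two observers for the same eight lessons.
observer_1 = [1, 2, 2, 3, 3, 3, 4, 4]
observer_2 = [1, 2, 3, 3, 3, 2, 4, 4]

kappa = cohens_kappa(observer_1, observer_2)
print(round(kappa, 3))
```

Computing such a statistic separately for lessons with and without SWDs or ELs would be one way to check whether observer agreement differs when these students are present.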
Observations: Suggestions for Practitioners and Researchers
Observation systems should ideally reflect teachers’ responsiveness to their students’ needs, as effective teachers would make different instructional decisions (e.g., grouping practices, emphasis on beginning reading skills) depending on the characteristics of the SWDs and ELs in their classroom. One option for districts would be to adopt observation protocols designed specifically for use in classrooms with SWDs and ELs; the literature provides several examples. 9 However, given the enormous implementation challenges that come with introducing even a single protocol (e.g., training and certifying school and district personnel to reliably score the protocol, calibrating raters’ scoring over time, scheduling observations around administrators’ many time commitments, combining observation scores with other sources of evaluation data), it is unlikely—at least in the short term—that districts and states would be in a position to adopt multiple observation systems. A second option, which we believe is more likely to be adopted by districts/states, is to supplement an existing observation protocol (e.g., FFT) with a subset of items specific to teaching SWDs and ELs. 10 Alternatively, existing response categories on observation protocols could be adapted to more appropriately reflect teachers’ interactions with SWDs and ELs.
One viable short-term solution would be to develop “scoring support documents” to assist observers in the scoring process, with an emphasis on the kinds of evidence-based practices that have proven to be effective for each of these populations. In the absence of such direction, it will be unlikely that observers will examine whether teachers attend to the needs of SWDs and ELs. If districts are to adopt such modified rubrics, they would benefit from guidance from researchers regarding how to help develop such assessments and whether these revisions have implications for the validity and reliability of the instruments. Thus far, however, researchers have not addressed these issues.
To improve observer familiarity with instruction for SWDs and/or ELs, districts could ensure that observers have some training or background specific to special student populations. Without such preparation, it may be difficult for observers to attend to whether teachers are providing appropriate instruction for all of their students. There is also a need for research that more closely examines observer performance specific to SWDs and ELs. For example, future research could investigate whether observer experience with SWDs and ELs improves their ability to assess teachers’ practice reliably, whether there are strategies that districts can adopt to train observers to attend to teachers’ effectiveness in educating SWDs and ELs, and whether modified versions of existing protocols improve observers’ reliability in scoring teachers’ instruction for SWDs and ELs.
Conclusion
The challenges related to developing useful measures of teaching effectiveness have received significant attention from researchers and practitioners. In this essay, we have presented challenges and potential solutions for using indicators of teacher effectiveness with two subgroups of students that have received little attention in teacher evaluation. As states and districts develop and modify teacher evaluation systems, we encourage them to attend to the unique challenges associated with including SWDs and ELs in each of the indicators. Not accounting for these challenges would undermine the validity of inferences about teachers’ effectiveness, particularly in cases of teachers who have a large proportion of either SWDs or ELs in their classroom, and it would be counterproductive to the goal of providing a high-quality education to all students.
