Abstract
As states and districts implement more rigorous teacher evaluation systems, measures of teacher performance are increasingly being used to support instruction and inform retention decisions. Classroom observations take a central role in these systems, accounting for the majority of teacher ratings upon which accountability decisions are based. Using data from the Measures of Effective Teaching study, we explore the extent to which classroom composition influences measured teacher performance based on classroom observation scores. The context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ measured performance. Furthermore, the intentional sorting of teachers to students has a significant influence on measured performance. Implications for high-stakes teacher accountability policies are discussed.
Introduction
Interest in understanding the role of school context in teachers’ professional development has recently emerged. For example, Kraft and Papay (2014) find that school context—including supportive and collaborative environments, effective principal instructional leadership, and positive school culture—can facilitate greater returns to teacher experience or the extent to which teachers become more effective over time. Other studies have pointed to the important role that school composition plays in shaping teacher effectiveness. In one study, Loeb, Beteille, and Kalogrides (2012) found that teachers improve more rapidly in more effective schools—those more able to raise the test-score performance of their students. In a second, Jackson and Bruegmann (2009) found that students of less-experienced teachers realize larger achievement gains when their teachers have higher performing colleagues. Sass, Hannaway, Xu, Figlio, and Feng (2012) show that teachers improve more quickly in schools serving fewer low-income students.
Other work suggests that the characteristics of a teacher’s students play a meaningful role in determining teacher ratings. New evidence finds that classroom observation scores tend to be lower among teachers whose students are more disadvantaged and have lower incoming achievement (Chaplin, Gill, Thompkins, & Miller, 2014; Whitehurst, Chingos, & Lindquist, 2014). These lower scores may reflect the nonrandom sorting of teachers to classes of students (Monk, 1987; Steinberg & Garrett, 2015) and the tendency of schools to assign novice, less-effective teachers to more disadvantaged students, while more experienced and more effective teachers are more likely to work with higher achieving students (Clotfelter, Ladd, & Vigdor, 2006; Kalogrides & Loeb, 2013; Kalogrides, Loeb, & Beteille, 2013; Monk, 1987).
Furthermore, evidence suggests that observation-based measures of a teacher’s performance tend to be weakly correlated over time. Indeed, research using data from the Measures of Effective Teaching (MET) study finds that a teacher’s observation scores are only moderately correlated across 2 consecutive years (Garrett & Steinberg, 2015), on the order of the year-to-year, within-teacher correlation of student test-score-based measures of teacher performance found in other settings (Glazerman et al., 2010; Goldhaber & Hansen, 2010; McCaffrey, Sass, Lockwood, & Mihaly, 2009). These findings suggest not only that a teacher’s classroom practice may vary over time but also that it may be influenced by classroom composition—the students and classes to which teachers are assigned.
As researchers and policymakers begin to better understand how different measures may differentially reflect distinct aspects of a teacher’s instructional performance, we need to further place these relationships into context by understanding how the composition of the measures may be sensitive to the composition of the class itself. Therefore, a closer investigation is warranted to more fully understand how classroom composition may influence observation-based measures of teacher performance, and whether these influences differentially affect teachers in different settings (e.g., math or reading subject areas, classroom teachers vs. subject specialists).
In this article, we focus on understanding the relationship between the characteristics of a teacher’s class and a teacher’s performance, as measured by observations of his or her classroom instruction and management. Given existing evidence that teachers who are judged to be more effective tend to be assigned higher achieving students, particular attention is paid to the relationship between students’ incoming achievement and measured teacher performance based on the Danielson Framework for Teaching (FFT) observation protocol. We leverage variation in student composition over time to estimate the influence on measured teacher effectiveness, allowing us to control for fixed characteristics of teachers (e.g., teacher quality endowments) and isolate the idiosyncratic influence of incoming student achievement. We further assess whether these relationships are more or less sensitive when considering specific aspects of classroom management and instruction and the extent to which the influence of incoming achievement differs for classroom generalists compared with subject-matter specialists. Moreover, we address the nonrandom sorting of teachers to classes that has been shown elsewhere to bias student achievement-based measures of teacher effectiveness (Rothstein, 2009, 2010). To do so, we take advantage of a hallmark of the MET study project—the randomization of teachers to classes that occurred just prior to the beginning of the MET study’s second year (i.e., 2010–2011 school year). Specifically, we exploit the randomization of teachers to classes within randomization blocks (i.e., grade-subject combinations of classes within a school) to generate experimental estimates of the effect of incoming achievement on measured teacher performance. Finally, we implement an instrumental variables (IV) strategy to formalize the nature of the bias in observation scores that is due to the nonrandom sorting of teachers to classes. 
We then compare estimates generated by the teacher fixed-effects (FE) approach with ordinary least squares (OLS) and IV estimates to lend insight into whether the nonrandom matching of teachers to classes is a concern in the context of classroom observation-based measures of teacher performance.
We find that teacher performance, based on classroom observation, is significantly influenced by the context in which teachers work. In particular, students’ prior year (i.e., incoming) achievement positively affects a teacher’s measured performance captured by the FFT. Students’ incoming achievement is more strongly associated with observed English Language Art (ELA) instruction than math instruction. Although the influence of incoming achievement is concentrated among subject specialists, who work with multiple classes of students within a single school day, compared with classroom generalist teachers who work with the same classroom of students all day, every day, we are unable to uniquely attribute these differences to either departmentalized (compared with generalist) class status or grade-level differences. Furthermore, incoming student achievement significantly influences aspects of teacher performance that capture teachers’ interactions with their students, whereas core instructional practices are not prone to this influence. Finally, we offer evidence that the intentional, nonrandom matching of teachers to classrooms due to unobserved time-invariant teacher effects biases the relationship between incoming student achievement and observed classroom performance.
In the context of newly implemented teacher evaluation systems that incorporate multiple measures of teacher performance (including value-added measures [VAMs], student learning objectives [SLOs], and observation scores) into teachers’ summative evaluation ratings, recent evidence has indicated that principals’ human capital decisions around teacher effectiveness are based predominantly on their observations of teacher practice (Goldring et al., 2015). Moreover, some have argued that classroom observations provide a more transparent and objective means of evaluating and measuring teacher effectiveness and, in doing so, avoid many of the validity concerns that tend to be associated with VAMs (Goldring et al., 2015). However, our findings indicate that observation scores as measures of teacher effectiveness present similar validity concerns, as these scores reflect both the systematic, nonrandom assignment of teachers to classes and the incoming achievement of a teacher’s students. Therefore, our results suggest that greater caution must be taken by policymakers and practitioners when making high-stakes personnel decisions based largely (if not entirely) on teachers’ classroom observation scores.
In the next section, we discuss the evolving landscape of teacher evaluation reform and the evidence on measures of teacher effectiveness being incorporated into new evaluation systems. Next, we specify a model relating teacher and classroom characteristics to observed measures of teacher performance. We then describe the data and empirical approach used to capture the influence of incoming student achievement on teacher performance. Finally, we present our findings and discuss their implications for high-stakes personnel decisions made in the context of newly developed and implemented teacher evaluation systems.
Teacher Evaluation Reform Landscape
In the wake of the federal Race to the Top (RTTT) initiative, education policy efforts have recently led to dramatic revisions to teacher evaluation systems and substantial effort toward improving measures of teacher effectiveness. At the state and local levels, policymakers are revising and implementing new evaluation systems to incorporate more rigorous performance evaluations through the use of multiple measures of teacher performance, among them value-added measures (VAM) and standards-based classroom observations, and multiple teacher ratings categories. By the start of the 2014–2015 school year, 78% of all states and 85% of the largest 25 school districts and the District of Columbia had revised and implemented teacher evaluation reforms (Steinberg & Donaldson, in press).
Revisions to teacher evaluation systems and the measures upon which teacher evaluation is based have been spurred by evidence in support of the critical role teachers play in improving student outcomes (Goldhaber, 2002; Rivkin, Hanushek, & Kain, 2005; Rockoff, 2004), while also acknowledging both the substantial within-school heterogeneity in teacher effectiveness (Aaronson, Barrow, & Sander, 2007; Rivkin et al., 2005) and the inability of traditional teacher-evaluation systems to differentiate among teachers in their contribution to student learning (Weisberg, Sexton, Mulhern, & Keeling, 2009). Efforts to reform teacher evaluation are underscored by two fundamental goals of personnel evaluation in education: to systematically support improvements in instructional quality and to identify low-performing teachers for remediation or dismissal. The latter goal aims to inject more rigorous accountability into teacher evaluation systems that historically did a poor job of identifying and removing underperforming teachers (Weisberg et al., 2009). Early evidence on the effectiveness of teacher evaluation reforms for improving teaching and learning and identifying and removing the lowest performing teachers is promising (Dee & Wyckoff, 2013; Sartain & Steinberg, in press; Steinberg & Sartain, 2015; Taylor & Tyler, 2012), suggesting that more rigorous teacher evaluation can meaningfully improve teacher quality and student achievement while satisfying the accountability function of newly developed evaluation systems.
As states and districts reform teacher evaluation and begin to make high-stakes personnel decisions, increasing attention is being given to the measures being incorporated into new evaluation systems. Eighty percent of both states and the largest districts implementing new evaluation systems are using one or more measures of teacher performance based on student test score data; these include VAM and student growth percentiles (SGP; Steinberg & Donaldson, in press). However, evidence suggests that there is substantial within-teacher variability in a teacher’s annual VAM scores (Goldhaber & Hansen, 2010; McCaffrey et al., 2009) and that this variability depends on the nature of the student assessment being used (Papay, 2011). This year-to-year variability constrains the efficacy of new evaluation systems to serve the accountability function of identifying effective teachers (Rothstein, 2009, 2010). Moreover, VAM scores are limited in their ability to inform teachers’ instructional practice, as these measures are generated and provided to teachers after the conclusion of the school year, and offer no guidance for how teachers may improve their instructional practice.
Although educators and scholars alike point to the limitations implicit in test-score based measures of teacher effectiveness, classroom observations of teacher practice—designed to provide formative feedback to teachers to improve their instructional practice as well as to more carefully evaluate instructional performance—offer promise as a complementary measure for differentiating teacher performance (Kane, McCaffrey, Miller, & Staiger, 2013). The role of classroom observation in measuring teacher performance is particularly notable given that observation-based scores constitute the majority, and often the entirety, of summative evaluation scores (Steinberg & Donaldson, in press), with nearly 70% of teachers nationwide teaching in nontested grades and subjects where student-test-score data—on which VAM scores are based—are unavailable (Watson, Kraemer, & Thorn, 2009). As noted in the introduction, however, new evidence suggests that observational measures of teacher practice alone are unable to identify teacher effectiveness (Garrett & Steinberg, 2015). Therefore, the reliance on observation-based measures of teacher performance raises concerns about the ability of newly implemented evaluation systems to both accurately capture teacher performance and satisfy the accountability goal of teacher evaluation reform.
Production of Teacher Performance
How teachers perform in the classroom depends on a number of factors. Consider the following model where measured teacher performance is a function of a teacher’s endowed instructional ability and the student composition of his or her classroom:
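One general form consistent with the definitions that follow is sketched below. The function symbol f is an assumption introduced for illustration; the text specifies only that measured performance depends on these inputs.

```latex
\begin{equation*}
\text{Perform}_{jct} = f\left(\theta_{j},\, X_{ct},\, \varepsilon_{jct}\right)
\tag{1}
\end{equation*}
```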
In Equation 1, Perform_jct is teacher j’s measured performance, based on his or her classroom observation scores, in classroom c in school year t; θ_j captures the stable, time-invariant aspects of instructional skill (e.g., teacher quality) that a teacher brings to the classroom, which include a teacher’s ability to design lessons, provide instruction based on those lessons, and design and implement classroom management strategies to maximize the efficacy of instructional lessons within a classroom’s idiosyncratic environment. The endowment of teacher j’s instructional skills (captured by θ_j) is considered independent of the students in his or her class, such that endowed teacher quality is not influenced by the classroom (c) to which the teacher is assigned. The variable X_ct represents the characteristics of students in classroom c to whom a teacher is assigned in school year t; such characteristics may vary across years and likely influence how teachers perform from one year (and class) to the next. The variable ε_jct is an idiosyncratic error term capturing all other inputs to measured teacher performance (e.g., curriculum, teacher policy, observer/rater quality).
Just as the composition of teacher j’s classroom may influence measured performance, teachers may also alter their instruction based on the composition of students in their classes. That is, a teacher’s instruction and classroom management may differentially respond to classroom composition and the students to whom he or she is assigned. We can further decompose the variable ε_jct into an aggregate match effect, γ_jc, and all other inputs to measured teacher performance, µ_jct. We can then rewrite Equation 1 as follows:
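Substituting this decomposition, the model can be sketched as below. As before, the function symbol f is an illustrative assumption rather than notation confirmed by the text.

```latex
\begin{equation*}
\text{Perform}_{jct} = f\left(\theta_{j},\, X_{ct},\, \gamma_{jc},\, \mu_{jct}\right)
\tag{2}
\end{equation*}
```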
This aggregate match effect (γ_jc) accounts for processes that occur both before teachers enter the classroom and those that occur during the course of teachers’ interactions with their classes. Specifically, the match effect captures (a) the process that assigns teachers to classrooms, which occurs prior to measured teacher performance and has been shown to incorporate historical teacher and student performance into the assignment decision (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Monk, 1987; Steinberg & Garrett, 2015) and (b) the extent to which teacher j’s instructional and classroom management skills interact with and are more (or less) effective with a particular mix of students in classroom c.
In light of evidence that teachers with higher observation scores tend to work with higher achieving students, this article extends the existing literature on the influence that classroom composition (in particular, incoming student achievement) has on measured teacher effectiveness. We investigate whether the composition of students (described by the variable X_ct) to whom a teacher is assigned accounts for variation in measured classroom performance, beyond what is attributable to the teacher alone (θ_j). We explore this relationship by using consecutive years of data to examine how incoming achievement influences teachers’ observation scores on the most widely used measure of instructional performance, the Danielson FFT. This teacher fixed-effects approach, described in more detail in the following sections, allows us to account for the contribution of a teacher’s endowed ability to his or her observation scores.
As formalized in Equation 2, we recognize that the purposive sorting of teachers to students may exert its own, independent influence on measured teacher performance, thereby confounding our ability to uniquely capture the influence of classroom composition on teacher effectiveness. We, therefore, further test the relationship between incoming achievement and observed measures of teacher performance by leveraging the random assignment of teachers to classes, a prominent feature of the MET study. In the second year (2010–2011) of the MET study, a subset of teachers was randomly assigned to classes of students.¹ In principle, the random assignment of teachers to classes disrupts the assortative matching process that can confound measures of teacher effectiveness (Clotfelter et al., 2006; Garrett & Steinberg, 2015; Rothstein, 2009, 2010; Whitehurst et al., 2014). However, an unintended consequence of the Year 2 randomization in the MET study was significant noncompliance among students, who ended up with teachers other than those who were randomly assigned to their classes (Kane et al., 2013; White & Rowan, 2012).
In prior work, we found that classroom-level noncompliance with the teacher random assignment in the MET study can be characterized in three ways: (a) full compliance classrooms are those where all students, within a classroom to which teacher j was randomly assigned, remained with teacher j; (b) partial-compliance classrooms are those where at least one student, within a classroom to which teacher j was randomly assigned, did not comply with the random teacher assignment and instead ended up in another teacher’s class; and (c) full noncompliance classrooms are those where no student, within a classroom to which teacher j was randomly assigned, remained with the randomized teacher. Full noncompliance classrooms were those for which principals likely assigned teachers to classes prior to receipt of the random teacher assignments, therefore (intentionally or not) ignoring the random assignment of teachers to classes as part of the Year 2 randomization sample in the MET study (Steinberg & Garrett, 2015; White & Rowan, 2012). These particular consequences of ex post, nonrandom sorting on classroom-level compliance with randomization allow for additional insight into the influence of the nonrandom assignment of teachers to classes on measured teacher performance.²
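The three compliance categories amount to a simple classification rule, which the following Python sketch illustrates. The function name, the teacher identifiers, and the input layout (one entry per rostered student giving the teacher that student actually ended up with) are hypothetical and are not the MET study's data structures. The sketch also assumes the categories are mutually exclusive, so that "partial compliance" means some, but not all, students left the randomly assigned teacher.

```python
def classify_compliance(assigned_teacher, actual_teachers):
    """Classify one randomly assigned classroom by compliance status.

    assigned_teacher: hypothetical ID of the teacher randomly assigned to the class.
    actual_teachers:  list holding, for each student on the randomized roster,
                      the ID of the teacher the student actually ended up with.
    """
    if not actual_teachers:
        raise ValueError("classroom roster is empty")

    # Count students who remained with the randomly assigned teacher.
    stayed = sum(1 for teacher in actual_teachers if teacher == assigned_teacher)

    if stayed == len(actual_teachers):
        return "full compliance"       # every student remained
    if stayed == 0:
        return "full noncompliance"    # no student remained
    return "partial compliance"        # some, but not all, students remained


# Hypothetical rosters for three classrooms assigned to teacher "T1".
statuses = [
    classify_compliance("T1", ["T1", "T1", "T1"]),
    classify_compliance("T1", ["T1", "T2", "T1"]),
    classify_compliance("T1", ["T2", "T3", "T2"]),
]
```

A teacher-level compliance flag would then follow from the status of the class to which that teacher was randomly assigned.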
Data and Sample
We use data from the MET study, which was carried out over 2 school years (2009–2010 and 2010–2011) and across six districts.³ The six participating districts were Charlotte-Mecklenburg Schools (NC), Dallas Independent School District (TX), Denver Public Schools (CO), Hillsborough County Public Schools (FL), Memphis City Schools (TN), and the New York City Department of Education (NY; White & Rowan, 2012).
Teacher Sample
The teacher-level analytic sample consists of 834 teachers in Grades 4 to 9, of whom 354 taught ELA, 304 taught math, and 176 taught both subjects.⁴ For the 834 teachers in our sample, we incorporate the following background characteristics: gender, race/ethnicity, degree status, and years of teaching experience within the district. We also observe whether a teacher was a generalist or departmentalized teacher. Generalists are subject-matter generalists who taught both ELA and mathematics to a single class of students; departmentalized teachers are subject-matter specialists who taught exclusively ELA or math, and worked with multiple classes of students. Furthermore, each teacher record includes unique district, school, and classroom (section) identifiers, variables identifying the grade taught, and the grade-subject combinations (e.g., fifth-grade math). We are able to use this unique identifying information to link students to their randomly assigned teacher as well as to their actual classroom teacher, allowing us to identify the compliance category of the class to which a teacher was randomly assigned in Year 2 (i.e., full compliance, partial compliance, or full noncompliance).
Table 1 summarizes the characteristics of all teachers in our sample and by subject-specific sample (i.e., ELA or math). In addition to the full sample of teachers, we construct two additional analytic samples by subject area. The IV sample includes teachers in Grades 5 to 9, and from this sample, we construct IV estimates (of the effect of incoming achievement on measured teacher performance), which we compare with teacher FE estimates for an assessment of the extent of bias in observation scores due to the nonrandom assignment of teachers to classes. The randomization sample includes those randomization blocks (i.e., grade-subject combinations of classes within a school) in which all MET study teachers were in full compliance classrooms. We use the randomization sample to generate experimental estimates of the effect of incoming achievement on teacher performance.⁵
Table 1. Teacher Characteristics
Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Of the 834 teachers, data on a teacher’s gender is available for 799 teachers, 798 report their race, 797 report years of experience (in the district), and 618 report degree attainment (e.g., master’s or higher). Of the 530 teachers in the ELA sample, data on a teacher’s gender and race are available for 503 teachers, 501 report years of experience (in the district), and 381 report degree attainment (e.g., master’s or higher). Of the 480 teachers in the math sample, data on a teacher’s gender is available for 462 teachers, 461 report their race, 462 report years of experience (in the district), and 333 report degree attainment (e.g., master’s or higher). ELA = English Language Art; IV = instrumental variables; MET = Measures of Effective Teaching.
Among our sample of 834 teachers, 46% were in full compliance classrooms, 36% were in partial-compliance classrooms, and 18% were in full noncompliance classrooms. Table 2 summarizes the teacher characteristics by classroom compliance status. Among the ELA sample, teachers differed significantly by years of teaching experience (in district), master’s degree attainment, and subject-matter generalist status. Indeed, fewer teachers in full compliance classes were subject-matter generalists teaching fourth or fifth grade, suggesting that much of the noncompliance with the Year 2 randomization took place in the earlier grades among the MET sample of ELA teachers. We also observe very similar patterns among the sample of math teachers.
Table 2. Teacher Characteristics, by Classroom Compliance Status
Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Full Comply refers to teachers in full compliance classrooms; Partial Comply refers to teachers in partial compliance classrooms; and Noncomply refers to teachers in full noncompliance classrooms in the second year of the MET study. ELA = English Language Art; MET = Measures of Effective Teaching.
Differences by classroom compliance status, by subject, statistically significant at the *10%, **5%, and ***1% levels.
Nearly a quarter of all teachers in our sample were subject-matter generalists teaching multiple subjects to one class of students. Table 3 summarizes the characteristics of teachers by generalist and departmentalized status. In both the ELA and math samples, generalists were significantly less likely to be male and White, although significantly more likely to be Black and have a master’s degree. Generalist teachers had approximately 2.5 fewer years of teaching experience than departmentalized teachers, with nearly all (98%) generalists teaching in either Grade 4 or 5. Furthermore, we find that early-grade, generalist teachers were significantly more likely to be in full noncompliance classrooms (and, by extension, less likely to be in full compliance classrooms) than departmentalized teachers. This suggests that it may have been more difficult for school leaders to intentionally ignore the teacher random assignment among departmentalized teachers and purposively match entire classes of students to these teachers. In contrast, it appears that the purposive matching of teachers to classes (and the abandoning of the teacher random assignment) was more easily accomplished among the elementary grade (e.g., fourth and fifth grade) teachers teaching in self-contained classrooms.
Table 3. Teacher Characteristics, by Generalist/Departmentalized Status
Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Departmentalized teachers are subject matter specialists who taught ELA or math to more than one class of students. ELA = English Language Art; MET = Measures of Effective Teaching.
Differences by generalist/departmentalized status, by subject, statistically significant at the *10%, **5%, and ***1% levels.
Teacher Performance Measure: Classroom Observation Scores
We use the measure of teacher effectiveness that is among the most widespread in newly developed teacher evaluation systems: observation scores of a teacher’s classroom practice using Charlotte Danielson’s FFT. Measures of a teacher’s instructional practice are captured through FFT scores generated by MET raters using videos of subject-specific (e.g., math or ELA) lessons.⁶ Generalist classroom teachers were videoed, on average, on 4 separate days throughout the year, with each day producing one ELA and one math lesson video. In the first year of the study, departmentalized teachers were videoed on 2 separate days, capturing two different sections taught by the teacher. We randomly selected one of a departmentalized teacher’s two sections from the first year for inclusion in our analysis.⁷ In the second year of the study, departmentalized teachers were videoed on 4 different days and focused on one of the teacher’s sections. Video recordings were spread over time to capture greater representativeness of teacher instruction. MET raters watched the lesson videos remotely, coding them using eight components from two of the FFT domains: classroom environment and instruction (see White & Rowan, 2012 for a detailed discussion of the process of video recording and scoring teachers’ lessons). Each component was scored on an integer scale from 1 (unsatisfactory) to 4 (distinguished). The scores for each component were then averaged within subject across all segments (i.e., lesson videos) to create section-level aggregate scores for each component. These aggregate scores represent the average FFT scores for a teacher’s subject-specific instruction for a single class (section) of students.
For our measure of teacher effectiveness, we created a composite performance measure in the following manner. Using the section-level means (from multiple observation scores) of each of the eight FFT components, we averaged across the eight section-level aggregate component scores within a section, by subject. For example, each of the eight FFT component means based on ELA instruction from a teacher’s class section were averaged for a final composite ELA FFT score. Previous research using the MET data found this to be an appropriate approach for aggregating teacher effectiveness measures based on classroom observation scores (Garrett & Steinberg, 2015; Kane et al., 2013; Mihaly, McCaffrey, Staiger, & Lockwood, 2013).
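The composite step can be illustrated with a short Python sketch. It assumes the averaging is a simple arithmetic mean. The component labels (`component_1` through `component_8`), the function name, and the example values are hypothetical placeholders, not MET variable names or data.

```python
from statistics import mean


def composite_fft(component_means):
    """Average eight section-level FFT component means into one composite.

    component_means: dict mapping a component label to that component's
                     section-level mean score for one subject and section.
    """
    if len(component_means) != 8:
        raise ValueError("expected eight FFT component means")
    for label, value in component_means.items():
        # Each mean of 1-4 integer ratings must itself lie between 1 and 4.
        if not 1 <= value <= 4:
            raise ValueError(f"{label} is outside the 1 to 4 scale")
    return mean(list(component_means.values()))


# Hypothetical section-level means for one teacher's ELA section.
example_means = {
    f"component_{i}": score
    for i, score in enumerate([3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 3.5, 2.0], start=1)
}
ela_composite = composite_fft(example_means)
```

The same function would be applied separately to math and ELA sections, so that a generalist teacher receives one composite per subject.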
Incoming Achievement and Classroom Composition
The MET data contain information on student achievement for Grades 4 to 8 provided by each district for its state accountability tests in ELA and mathematics. Student achievement information was provided for the 2 years of the study (i.e., 2009–2010 and 2010–2011) and up to 3 years prior to the study, as available. As the state assessments differ across locations, scores were standardized (i.e., converted to z scores) at the student level within district, grade, and subject to enable comparability across sites. The MET data also capture key background characteristics of a teacher’s students that we include as controls for classroom composition, including class size, student race/ethnicity, gender, age, special education status (SPED), free or reduced-price lunch status (FRPL; as a proxy for student poverty), gifted status, and whether a student is an English language learner (ELL; see Table 4). Both student background and achievement characteristics were aggregated to the classroom level. The aggregated, classroom-level achievement represents the average (ELA or math) achievement for all students in a given class.
Table 4. Classroom Composition
Note. Each cell reports the mean (standard deviation) proportion of students, by characteristic, in a teacher’s classroom, except for class size, age, and achievement. Y1 represents the first year of the MET study (2009–2010 school year) and Y2 represents the second year of the MET study (2010–2011 school year). For class size, we report the mean (standard deviation) number of students in a teacher’s class. Achievement represents the average incoming achievement (in standard deviation units) among students in a teacher’s class (e.g., Y1 achievement, in ELA and math, is for the 2008–2009 school year, and Y2 achievement is for the 2009–2010 school year), standardized at the subject/grade/district level. ELA = English Language Art; FRPL = free or reduced-price lunch status; ELL = English language learner; SPED = special education status; MET = Measures of Effective Teaching.
Importantly, the average prior-year achievement of all students in the class captures the academic skills students bring with them to a class, a focal consideration for assessing the influence of classroom composition on measured teacher performance.
Empirical Approach
Our aim is to understand how the incoming academic achievement of a class influences a teacher’s performance as measured by his or her classroom observation scores. Assuming additive separability of inputs to teacher performance, we can write Equation 1 as follows:
where Perform_jct is the observation score for teacher j with class c in year t; PriorAchieve_ct is the average prior-year (i.e., incoming) achievement of teacher j’s class c in year t; and X_ct captures other observable characteristics of students in teacher j’s classroom in year t, including class size, the average age of students in the class, the proportion of students receiving FRPL, the proportion of students who are ELLs, the proportion of students receiving special education services (i.e., SPED), the proportion of gifted students, and the percentage of minority (i.e., Black or Hispanic) students; θ_j is a teacher FE and ε_jct is a random error term. Accounting for a teacher’s endowed ability in this teacher fixed-effects setup will enable us to deconfound the relationship between incoming achievement and teacher performance that may be due to fixed performance characteristics of teachers. We cluster the standard errors at the teacher level.
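The display equation for this model is not reproduced in the text as it stands. A form consistent with the variable definitions above, and with the conditional-mean assumption the article later attributes to Equation 3, is sketched below. The coefficient symbols β₀, β₁, and Γ are assumed notation, not taken from the source.

```latex
\begin{equation}
\mathit{Perform}_{jct} = \beta_0 + \beta_1\,\mathit{PriorAchieve}_{ct}
  + X_{ct}\Gamma + \theta_j + \varepsilon_{jct}
\end{equation}
```

Under this reading, β₁ is the parameter of interest: the change in a teacher's observation score associated with a 1 standard deviation increase in the class's incoming achievement, holding fixed teacher quality (θ_j) and the other classroom characteristics constant.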
We estimate Equation 3 separately for each of the two subject-specific (math and ELA) teacher samples to examine the extent to which incoming achievement may differentially influence measured teacher performance across content areas. In addition to teachers’ aggregate classroom observation scores, we examine the relationship between incoming achievement and individual teacher practices across two domains—instruction and classroom management. Furthermore, we estimate the extent to which incoming achievement differentially influences the observation scores for subject-matter generalists and departmentalized, subject-matter specialists.
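The teacher fixed-effects estimator can be illustrated with the within transformation: demean the outcome and the regressor within each teacher, then regress one on the other. The sketch below is a simplification under stated assumptions. It uses a single regressor (incoming achievement) and omits the other classroom controls and the clustered standard errors, and the toy data are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def within_slope(rows):
    """Teacher-FE (within) slope of y on x.

    rows: iterable of (teacher_id, x, y) tuples, where x stands in for
    incoming class achievement and y for the observation score.
    """
    by_teacher = defaultdict(list)
    for teacher, x, y in rows:
        by_teacher[teacher].append((x, y))

    num = den = 0.0
    for obs in by_teacher.values():
        x_bar = mean(x for x, _ in obs)
        y_bar = mean(y for _, y in obs)
        for x, y in obs:
            # Demeaning removes anything fixed within teacher (theta_j).
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den


# Toy panel: two teachers, two years each; y = theta_j + 0.1 * x exactly.
toy_rows = [
    ("A", -0.5, 2.0 + 0.1 * -0.5),
    ("A", 0.5, 2.0 + 0.1 * 0.5),
    ("B", 0.0, 3.0 + 0.1 * 0.0),
    ("B", 1.0, 3.0 + 0.1 * 1.0),
]
print(round(within_slope(toy_rows), 3))
```

Because each teacher's fixed level drops out in the demeaning step, the slope is identified only from year-to-year changes in the achievement of the classes a given teacher is assigned.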
As previously discussed, measured teacher performance may reflect, in part, the extent to which a teacher’s instruction and classroom management interact with the students to whom he or she is assigned, as well as the process by which teachers are assigned to classrooms. As formalized in Equation 2, the ex ante assignment process matching teachers to students and the ex post interaction of a teacher’s instruction with the particular composition of the class may be captured by an aggregate match effect, γ_jc. To distinguish between these ex ante and ex post contributions to the aggregate match effect, we characterize the influence of nonrandom teacher-classroom matching as
Experimental Estimates of Incoming Achievement
One approach for addressing the concern that time-varying shocks may be correlated with the assignment of teachers to classes and bias the FE estimates is to leverage the randomization of teachers to classes that took place in the second year of the MET study. As previously described, a subset of all MET study teachers was randomly assigned to classes within randomization blocks, or grade-subject combinations of classes consisting of at least two participating MET teachers. If the random assignment of teachers to classes (within randomization block) was implemented faithfully, all teachers would be in full compliance classrooms in the 2010–2011 school year. Then, via OLS, we could estimate the causal effect of incoming achievement (which, conditional on the randomization block, should be exogenous to the teacher). However, in our full ELA sample, 51% of teachers were in either partial or full noncompliance classrooms; among our full math sample, 61% of teachers were in classes with some noncompliance.
Even though we observe extensive classroom-level noncompliance with the teacher random assignment, we identified a subset of randomization blocks in which all MET study teachers were in full compliance classrooms (see Table 1 for summary statistics on the randomization sample). Of the 260 randomization blocks in the ELA sample, we identified 58 where all MET study teachers were in full compliance classes; for the math sample, 38 of the 244 randomization blocks included only full compliance classes. Given that only a subset of the randomization blocks was included in our randomization sample, we first assessed the fidelity of the randomization. Specifically, if the randomization held among teachers (and their classes) in our randomization sample, then the classroom composition variables jointly should not predict a teacher’s prerandomization measured performance (i.e., 2009–2010 FFT scores). To conduct this covariate balance test, we estimate the following model for both the ELA and math samples:
where Perform_j,t−1 is teacher j’s prerandomization classroom observation score from year t − 1 (i.e., 2009–2010 school year); PriorAchieve_ct is the average prior-year (i.e., incoming) achievement of teacher j’s class c in year t (i.e., 2010–2011 school year); and X_ct captures other observable characteristics of students in teacher j’s classroom in year t, as in Equation 3. Given that the randomization was conducted at the randomization block level, we include randomization block FE (υ_r); ε_j,t−1 is a random error term, and we cluster the standard errors at the randomization block level. To determine whether the characteristics of teacher j’s class c in the 2010–2011 school year jointly predicted his or her prerandomization measured performance from the 2009–2010 school year, we assess the F statistic (and associated p value) from the joint test of the null hypothesis for Equation 4. If classroom composition was distributed randomly across teachers within randomization blocks, then we will be unable to reject the joint null hypothesis. 8 In such cases, OLS estimates of Equation 3 for the randomization year (2010–2011), replacing only the teacher FE (θ_j) with the randomization block FE (υ_r), will produce unbiased estimates of the effect of incoming achievement on measured teacher performance. We present both OLS and teacher FE estimates for the ELA and math randomization samples, and discuss the validity of these estimates in light of the covariate balance test described above.
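The covariate balance model (Equation 4 in the text) is likewise not reproduced here. A form consistent with the definitions above is sketched below. The coefficient symbols α₀, α₁, and Λ are assumed notation introduced for illustration.

```latex
\begin{equation}
\mathit{Perform}_{j,t-1} = \alpha_0 + \alpha_1\,\mathit{PriorAchieve}_{ct}
  + X_{ct}\Lambda + \upsilon_r + \varepsilon_{j,t-1}
\end{equation}
```

In this notation, the joint null hypothesis is that α₁ = 0 and Λ = 0, that is, that none of the Year 2 classroom characteristics predicts a teacher's Year 1 observation score within a randomization block.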
Examining Bias Due to Nonrandom Matching
Prior evidence suggests that when teachers are nonrandomly assigned to classes, systematic assortative matching occurs such that higher performing teachers are assigned to higher performing students (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Steinberg & Garrett, 2015). In addition to the systematic matching of teachers to classes on observed performance characteristics, policymakers and practitioners should also be concerned that teachers are assigned to classes based on unobserved characteristics. If teacher-class assignment depends, in part, on unobserved time-varying and time-invariant characteristics of teachers, then observation-based measures of teacher performance will be prone to bias.
Our aim is to uncover the extent to which matching bias, arising from teacher-class sorting on unobserved, time-invariant teacher quality, may influence teachers’ measured performance. To do so, we compare estimates from the teacher FE approach with Year 2 OLS estimates based on Equation 3 and estimates from an IV approach (see the appendix for details on the IV approach and the formal comparison between FE and IV strategies). Neither the OLS nor the IV estimates will provide unbiased causal estimates of the effect of incoming achievement. Rather, the OLS estimates (based on Year 2 data) are compared with the FE estimates (based on 2 consecutive years of data) to reveal whether omitting time-invariant teacher quality biases estimates of the effect of incoming achievement on measured teacher performance. Furthermore, the IV estimates are compared with the FE estimates to examine whether omitting the correlation between time-invariant teacher quality and incoming classroom achievement biases estimates of the effect of incoming achievement on measured teacher performance. Therefore, both the OLS and IV comparisons to the FE estimates aim to shed light on how omitting fixed teacher quality biases the effect of incoming achievement on measured teacher performance. Most critically for policy and practice, these comparisons will reveal whether bias in teacher observation scores may be present due to the nonrandom assignment of teachers to classes based on fixed teacher quality.
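The logic of the OLS-versus-FE comparison can be illustrated with a small simulation. This is a sketch on synthetic data only: every parameter value (the true effect of 0.1, the sorting strength, the noise levels) is a hypothetical choice for illustration and is not estimated from the MET data. When higher-quality teachers are systematically assigned higher-achieving classes, pooled OLS absorbs part of fixed teacher quality into the achievement coefficient, while the within-teacher estimator does not.

```python
import random
from collections import defaultdict
from statistics import mean


def ols_slope(rows):
    """Pooled OLS slope of y on x, ignoring teacher identity."""
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def fe_slope(rows):
    """Within-teacher (fixed-effects) slope of y on x."""
    by_teacher = defaultdict(list)
    for teacher, x, y in rows:
        by_teacher[teacher].append((x, y))
    num = den = 0.0
    for obs in by_teacher.values():
        x_bar = mean(x for x, _ in obs)
        y_bar = mean(y for _, y in obs)
        num += sum((x - x_bar) * (y - y_bar) for x, y in obs)
        den += sum((x - x_bar) ** 2 for x, _ in obs)
    return num / den


def simulate(n_teachers=2000, beta=0.1, sorting=0.5, seed=0):
    """Two years per teacher; all parameter values are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for j in range(n_teachers):
        theta = rng.gauss(0, 1)  # fixed, unobserved teacher quality
        for _ in range(2):
            # Positive sorting: better teachers draw higher-achieving classes.
            x = sorting * theta + rng.gauss(0, 1)
            y = beta * x + 0.3 * theta + rng.gauss(0, 0.2)
            rows.append((j, x, y))
    return rows


sim_rows = simulate()
print(f"pooled OLS: {ols_slope(sim_rows):.3f}")
print(f"teacher FE: {fe_slope(sim_rows):.3f}")
```

With positive sorting, the pooled estimate sits above the true effect and the FE estimate sits close to it. Setting `sorting=0` removes the gap, which is the pattern one would expect among classes where random assignment held.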
Results
To what extent does measured teacher performance, based on classroom observation scores, reflect the incoming achievement of the students to whom teachers are assigned? To answer this question, we begin by showing the distribution of teacher effectiveness, measured by classroom observation scores, among teachers who teach students with varying levels of academic performance. We present results from the 2009–2010 school year only because all teachers were nonrandomly assigned to classes in the first year of the MET study. We find that ELA teachers were more than twice as likely to be rated in the top performance quintile if assigned the highest achieving students compared with teachers assigned the lowest achieving students; math teachers were more than 6 times as likely (see Figure 1). Furthermore, approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.

Figure 1. Measured teacher performance by incoming student achievement.
The patterns observed in Figure 1 likely reflect a number of factors—both observable and unobservable characteristics of teachers and their students—that influence how teachers perform in the classroom and their subsequent performance ratings. To lend further insight into the influence of student achievement on measured teacher performance, we next present our main estimates of the effect of incoming student achievement on teachers’ classroom observation scores. Among the full sample of ELA and math teachers, estimates from the teacher FE approach reveal that incoming achievement has a large and significant effect on measured teacher performance (see column 2, Table 5). For the ELA sample, the coefficient of 0.11 FFT points suggests that if a teacher were to be assigned to a class with 1 standard deviation higher incoming achievement, that teacher would realize one third of a standard deviation increase in measured teacher performance. For math teachers, the effect of incoming math achievement, although smaller in magnitude, suggests that a 1 standard deviation increase in incoming student math achievement would generate approximately a 0.2 standard deviation increase in measured performance.
Table 5. Incoming Achievement and Measured Teacher Performance
Note. Each column (within a panel) reports a separate regression. For pooled OLS and IV estimates, coefficients reported (with robust standard errors clustered at the school level); for teacher FE estimates, coefficients reported (with robust standard errors clustered at the teacher level); for OLS (Y2) estimates, coefficients reported (with robust standard errors clustered at the randomization block level). Columns 2, 4, and 6 include teacher FE; column 3 includes district FE, and column 5 includes randomization block FE. The IV sample includes teachers in Grades 5 to 9 during the 2010–2011 school year. Prior Achievement is the average incoming subject-specific (i.e., ELA or math) achievement among students in a teacher’s class. For the IV estimates, Prior Achievement estimates the effect of incoming classroom achievement on measured teacher performance when instrumenting with the class’s once-lagged prior achievement (i.e., twice-lagged achievement). The Covariate Balance Test reports the F statistic (p value) from the joint test of the null hypothesis from a regression of prerandomization (i.e., 2009–2010) measured teacher performance (Perform_j,t−1) on Prior Achievement, controlling for classroom characteristics and randomization block FE. All regressions control for the following classroom characteristics: class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. IV = instrumental variables; OLS = ordinary least squares; FE = fixed effects; ELA = English Language Arts; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
When we restrict the sample to teachers in Grades 5 to 9 (IV Sample), the FE estimates again suggest a substantive and statistically significant effect of incoming achievement on measured teacher performance. For ELA teachers, a 0.14 FFT point effect is equivalent to increasing measured teacher performance by 0.4 standard deviations (with a commensurate 1 standard deviation increase in incoming student ELA achievement); for math teachers, a 0.08 FFT point effect is equivalent to increasing measured teacher performance by 0.24 standard deviations. We note that the FE estimates for both ELA and math teachers in the IV sample are larger in magnitude than the FE estimates among the full sample, suggesting that the influence of incoming achievement may be less salient for teachers in earlier grades, as Grade 4 teachers were excluded from the IV sample. We return to this point when we examine the effect of incoming achievement among generalist and departmentalized teachers.
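The conversion between FFT points and standard deviation units used above can be checked with simple arithmetic. The sketch below backs out the standard deviation of FFT scores implied by each reported pair. These implied values are a back-of-the-envelope inference from the figures quoted in this paragraph, not statistics reported directly in this section.

```python
def implied_sd(coef_points, effect_in_sd):
    """SD of the FFT score implied by a coefficient (in FFT points)
    and its reported equivalent in standard deviation units."""
    return coef_points / effect_in_sd


# (coefficient in FFT points, reported effect in SD units), IV sample
reported = {
    "ELA": (0.14, 0.40),
    "Math": (0.08, 0.24),
}
for subject, (coef, sd_units) in reported.items():
    print(f"{subject}: implied SD of FFT score is about "
          f"{implied_sd(coef, sd_units):.2f} points")
```

Both pairs imply an FFT standard deviation of roughly one third of a point, so the two conversions are mutually consistent.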
As previously discussed, the FE estimates will be biased in the presence of time-varying shocks to teacher-class assignment. To address this concern, we present experimental estimates of the effect of incoming achievement on measured teacher performance (see column 5, Table 5). For the randomization sample of ELA teachers, the covariate balance test indicates that the randomization was preserved. However, the randomization does not appear to have held for the sample of math teachers, as we can reject the joint null hypothesis (p value = .03) that classroom composition was uncorrelated with prerandomization measured teacher performance in math. As a result, only the experimental estimate for the sample of ELA teachers may be considered a valid causal effect of incoming achievement on measured teacher performance. For ELA teachers, we find a substantively large (0.17 FFT points), although not precisely estimated, effect of incoming achievement. The FE estimate (0.16 FFT points) is nearly identical to the experimental estimate and more precisely estimated, which supports the identifying assumption of the FE model, that is, E(ε_jct | PriorAchieve_ct, X_ct, θ_j) = 0 from Equation 3. We further note that the FE estimate for the randomization sample of math teachers (0.07) is remarkably similar to the FE estimates for the full and IV math samples. Taken together, the FE and experimental estimates suggest that incoming achievement plays a significant role in determining measured teacher performance.
Does Incoming Achievement Influence Specific Teacher Practices?
To shed further light on the specific teacher practices that may be differentially influenced by incoming achievement, we next examine eight teacher practices across two domains—classroom environment and instruction—for which external observers in the MET study rated teacher performance. 9 We first examine the extent to which the incoming achievement of an ELA teacher’s students influences these specific teacher practices (see Panel A, Table 6). We note that all estimates are net of teacher FE. We find consistent evidence that incoming achievement influences each of the four practices related to the classroom environment. These findings may not be particularly surprising. The composition of a classroom is likely to have an influence on measures of teacher practice that require teachers and students to coconstruct learning activities and collaborate on classroom procedures, while higher achieving students may be more engaged in learning, shaping the overall behavioral climate of the classroom. These findings therefore suggest that, independent of fixed teacher quality, teachers who are assigned higher achieving students will be judged to perform better on measures that evaluate their capacity to create and sustain a higher functioning learning environment.
Table 6. Incoming Achievement and Teacher Practices
Note. Each column (within a panel) reports a separate regression of an individual FFT component (i.e., teacher practice) on classroom composition. Coefficients (with robust standard errors clustered at the teacher level) reported. Prior Achievement is the average incoming ELA or math achievement among students in a teacher’s class. Domain 2 (classroom environment) includes the following FFT components: (2a) creating an environment of respect and rapport, (2b) establishing a culture for learning, (2c) managing classroom procedures, and (2d) managing student behavior. Domain 3 (instruction) includes the following FFT components: (3a) communicating with students, (3b) using questioning and discussion techniques, (3c) engaging students in learning, and (3d) using assessment in instruction (see Garrett & Steinberg, 2015, for more detail on each of the eight FFT components). Classroom characteristics include class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. ELA = English Language Arts; FE = fixed effects; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
Moreover, evidence from Table 6 indicates that aspects of a teacher’s instructional practices that may be more sensitive to interactions between teachers and their students are influenced by the achievement of their students, independent of a teacher’s own instructional ability. Indeed, teachers with higher achieving students are rated higher in two areas of instruction—communicating with students (3a) and engaging students in learning (3c). Notably, however, measures of instructional performance that depend more on strategies or instructional tools that teachers may bring with them to the classroom—using questioning techniques (3b) and assessment (3d) to drive instruction—are invariant to the level of achievement of their students.
We find that, unlike with ELA teachers, the measured performance of math teachers on individual practices is largely unrelated to the incoming achievement of their students (see Panel B, Table 6). The only exception is for a math teacher’s performance in establishing a culture for learning in the class (FFT component 2b). These results are not surprising given that a math teacher’s aggregate performance on the Danielson FFT has only a modest and marginally statistically significant relationship with prior student achievement (see column 2, Panel B of Table 5). It is possible that the ways in which math instruction is delivered may play an important role in insulating teachers from the influence of incoming student achievement on measured performance. Although math instruction tends to rely more on direct instruction, ELA teachers tend to use more opportunities to coconstruct literacy and reading instruction through, for example, balanced literacy approaches (such as group read alouds and small-group instruction). In the current policy environment, however, this distinction between approaches to instruction across ELA and math subject areas may become less pronounced. In particular, the Common Core State Standards in mathematics (CCSS-M) reflect a key shift in mathematics instruction that is focused more on the coconstruction of knowledge in the classroom than on direct instruction. For example, the CCSS-M requires students to “understand and use stated assumptions, definitions, and previously stated results in constructing arguments” and to “justify their conclusions, communicate them to others, and respond to the arguments of others” (National Governors Association Center for Best Practices & Council of Chief State School Officers, 2010, pp. 6–7). 
With the adoption of the CCSS-M and the commensurate shift toward a more constructivist approach to instruction, the influence of student achievement on measured teacher performance in mathematics may, over time, come to resemble what we observe among ELA teachers in the MET study sample.
Does Incoming Achievement Differentially Influence Subject-Matter Generalists and Subject-Specific Teachers?
As we have shown, incoming achievement may exert a differential influence on measured teacher performance depending on the subject a teacher teaches. To what extent, then, are teachers’ observation scores sensitive to whether they teach multiple subjects to the same class of students (i.e., subject-matter generalist teachers) or whether they teach one subject (math or ELA) to more than one class of students (i.e., departmentalized teachers)? Table 7 summarizes these results. 10
Table 7. The Influence of Incoming Achievement, by Generalist/Departmentalized Status
Note. Each column reports a separate regression. Coefficients (with robust standard errors clustered at the teacher level) reported. Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students; departmentalized teachers are subject-matter specialists who taught ELA or math to more than one class of students. We note that three ELA teachers and three math teachers were generalists in the 2010–2011 school year and departmentalized in the 2009–2010 school year; we treat these teachers as departmentalized for the purposes of the teacher FE models. Chow tests confirm that the process by which measured teacher performance depends on classroom composition is statistically significantly different across generalist and departmentalized teachers for both the ELA, χ²(13) = 37.49, p value = .0003, and math, χ²(13) = 35.70, p value = .0007, samples. Classroom characteristics include class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. ELA = English Language Arts; FE = fixed effects; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
We find that incoming achievement has no significant influence on the measured performance of ELA generalist teachers (teachers who teach ELA in addition to math and other subject areas). However, departmentalized ELA teachers whose students’ incoming achievement is 1 standard deviation higher realize higher observed performance, on the order of 0.15 FFT points (or 0.41 standard deviations). For math generalists (those who teach math in addition to ELA and other subject areas to the same class of students), incoming achievement is likewise not significantly related to measured performance. Again, however, departmentalized math teachers realize higher observed performance (0.08 FFT points, or approximately 0.22 standard deviations) with higher performing students.
Why might there be a systematic difference in the influence of incoming achievement for subject generalists and their subject-specific counterparts? The difference might arise because subject generalists teach fewer students, facing just one classroom of students each school day, while departmentalized teachers teach the same subject to multiple classes of students. As a result, subject generalists have more time with each student, which may provide more opportunities to adjust to students’ learning needs. However, we cannot definitively attribute these differences to generalist versus departmentalized status, as we also find significant grade-level differences between generalist and departmentalized teachers (see Table 3). Indeed, 98% of generalist teachers (in both the ELA and math samples) taught either Grade 4 or Grade 5, whereas only 20% of ELA (and 18% of math) teachers in departmentalized classes taught these grades. It is possible that the observed differences between classroom generalists and subject specialists are driven by differences in instruction for earlier grades compared with later grades, and thus, we are ultimately unable to determine empirically whether these differences stem from class status (generalist vs. departmentalized) or from grade level.
Does the Nonrandom Assignment of Teachers to Classes Bias Teachers’ Measured Performance?
As we have shown, the subject to which teachers were assigned, and whether or not they were assigned to teach a single class of students, had different consequences for teachers’ measured performance. In addition to the influence of the incoming achievement of a teacher’s class, consideration must also be given to the potential influence that the nonrandom matching of teachers to classes based on fixed, unobserved teacher characteristics may exert on measured teacher performance (as formalized in Equation 2). To examine the extent to which the systematic, nonrandom assignment of teachers to classes may bias estimates of measured teacher performance, we compare estimates from the FE approach with OLS and IV estimates, by classroom compliance status.
If the randomization held among teachers in full compliance classes, then unobserved, time-invariant teacher heterogeneity should be uncorrelated with incoming student achievement. 11 In such cases, the OLS, IV, and FE estimates should not differ (see the appendix for a formalization of this result for the IV and FE approaches). For the ELA sample, this is what we find (see Full Comply column, Panel A of Table 8). Among math teachers in full compliance classes, however, the OLS, IV, and FE estimates differ, and this difference is due to the fact that we find significant nonrandom selection of math teachers into full compliance classes (i.e., classroom composition predicted prerandomization measured teacher effectiveness). As a result, we focus our attention on the ELA teachers in partial compliance classes (where endogenous student sorting is most relevant) for an assessment of the extent (and direction) of bias due to nonrandom teacher assignment to classes.
Table 8. Examining Bias Due to Matching: Comparison of OLS, IV, and FE Estimates
Note. Each column (within a panel) reports a separate regression. For OLS estimates, coefficients reported (with robust standard errors clustered at the school level); for teacher FE estimates, coefficients reported (with robust standard errors clustered at the teacher level); for IV estimates, coefficients reported (with robust standard errors clustered at the school level). FE regressions include teacher FE; OLS and IV regressions include district FE. The sample includes teachers in Grades 5 to 9 during the 2010–2011 school year (excluding Grade 4 teachers without twice-lagged student achievement test scores). Full Comply refers to teachers in full compliance classrooms; Partial Comply refers to teachers in partial compliance classrooms; and Noncomply refers to teachers in full noncompliance classrooms in the second year of the MET study. For the IV estimates, Prior Achievement estimates the effect of incoming classroom achievement on measured teacher performance when instrumenting with the class’s once-lagged prior achievement (i.e., twice-lagged achievement). All regressions control for the following classroom characteristics: class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. OLS = ordinary least squares; IV = instrumental variables; FE = fixed effects; ELA = English Language Arts; FFT = Framework for Teaching; MET = Measures of Effective Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
Among ELA teachers in partial compliance classes, we find evidence of matching based on fixed, unobservable teacher characteristics. Specifically, the OLS (0.21 FFT points) and IV (0.17 FFT points) estimates are biased upward relative to the FE estimate (0.14 FFT points). This suggests that, by not accounting for time-invariant teacher heterogeneity in the teacher-class matching process, estimates of the influence of incoming class achievement on measured teacher performance will be overstated. This result further suggests that higher performing students were endogenously sorted into the classes of higher performing teachers (as has been shown in prior work with the MET data; see, for example, Steinberg & Garrett, 2015). Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observation scores, than may actually be true.
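To make the size of the gap concrete, the three point estimates quoted above can be compared directly. The sketch below treats the FE estimate as the benchmark, following the text's framing of OLS and IV as biased upward relative to FE. It works with the rounded point estimates only and ignores their sampling uncertainty, so the percentages are rough illustrations rather than reported results.

```python
def relative_gap(estimate, benchmark):
    """Proportional difference between an estimate and a benchmark."""
    return (estimate - benchmark) / benchmark


# Point estimates (FFT points) for ELA teachers in partial compliance classes.
fe_est, ols_est, iv_est = 0.14, 0.21, 0.17

for label, est in (("OLS", ols_est), ("IV", iv_est)):
    gap_points = est - fe_est
    gap_share = relative_gap(est, fe_est)
    print(f"{label}: {gap_points:+.2f} FFT points vs. FE ({gap_share:+.0%})")
```

On these rounded figures, the OLS estimate exceeds the FE estimate by about half, and the IV estimate by about a fifth.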
Discussion
High-stakes accountability systems aim to provide more information about teacher performance for the purposes of improving classroom instruction and satisfying the accountability objective of newly developed teacher evaluation systems—identifying, remediating, and, if necessary, removing underperforming teachers from the classroom. However, when information about teacher performance does not reflect a teacher’s practice but rather the students to whom the teacher is assigned, such systems are at risk of misidentifying and mislabeling teacher performance. The misidentification of teachers’ performance levels has real implications for personnel decisions and fundamentally calls into question an evaluation system’s ability to effectively and equitably improve, reward, and sanction teachers.
In this article, we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time. Moreover, incoming achievement matters differently for teachers in different classroom settings. Specifically, incoming student achievement exerts a larger influence on the measured performance of ELA teachers (compared with math teachers) and subject-matter specialists (compared with their generalist counterparts). These results are particularly relevant for policy decisions around newly implemented teacher evaluation reforms, given that classroom observation scores account for the majority, and in some cases the entirety, of a teacher’s summative evaluation rating across many state and district evaluation systems (Steinberg & Donaldson, in press).
When examining specific aspects of classroom management and instruction, we find that, for ELA teachers, measures of teacher performance related to the classroom’s climate and learning culture were more sensitive to student achievement than measures of a teacher’s instructional practices. These findings suggest that teachers who are assigned higher performing students may be inaccurately judged to be better managers of the classroom environment than they actually are. Alternatively, this evidence may indicate that teachers genuinely perform better when assigned higher achieving students. This uncertainty is important for policymakers and practitioners to recognize. Moreover, this evidence also suggests that the goals of policy reforms may be best met through the inclusion of measures that are less sensitive to the student composition of the classroom and that focus more on specific instructional skills, such as the use of questioning and discussion techniques, to provide better signals of teacher performance.
We also find that the nonrandom process by which teachers are often assigned to classes biases estimates of teacher performance based on classroom observation scores. This result is consistent with prior research examining the influence of nonrandom student sorting on measured teacher effectiveness based on value-added scores (Rothstein, 2009). Just as earlier evidence from the MET study demonstrates that observation scores of teacher performance alone are limited in their ability to identify effective teachers (Garrett & Steinberg, 2015), this work shows that the use of classroom observation scores for high-stakes decisions may be further complicated by the fact that such scores reflect, to a large extent, the students to whom teachers are assigned.
Why might incoming student achievement exert an important influence on a teacher’s measured performance based on their classroom observation scores? One potential explanation is that, among the observable student characteristics, test scores may capture information about both students’ cognitive and noncognitive (e.g., behavioral) skills. Our finding that aspects of classroom environment are more influenced by incoming student achievement than are aspects of a teacher’s instructional practices supports this interpretation. Ideally, we would observe measures of student behavior from the previous year, such as disciplinary infractions and/or the number (and extent) of disciplinary consequences, including in-school and out-of-school suspensions. The inclusion of such behavioral measures would enable a more direct assessment of how student behavior affects a teacher’s measured performance, while also distinguishing the influence of noncognitive aspects of student performance from cognitive achievement patterns. Yet even this assessment could be complicated by the fact that student behavior in one context (i.e., a given group of peers matched with one teacher) may shift considerably when students are placed in a different context (i.e., a new group of peers matched with another teacher). Nonetheless, a deeper understanding of how student behavior patterns influence measures of teacher performance based on classroom observation, independent of incoming student achievement on state assessments, is an important area for further research.
To account for the influence that classroom composition has on teachers’ observation scores, researchers have recommended that policymakers and district leaders adjust observation scores based on student demographic characteristics (Whitehurst et al., 2014). If higher achieving students are more easily instructed, then a teacher’s observation scores will, in part, reflect these easier-to-teach students. Ultimately, we are unable to definitively determine whether the incoming achievement effect reflects bias in observation scores due to the composition of a teacher’s class or whether teachers are systematically higher performing when assigned to higher achieving students. However, our findings suggest that even if teachers’ observation scores were adjusted to account for observed differences across classrooms—both demographic and, particularly, achievement characteristics—unmeasured differences due to the nonrandom assignment of teachers to classes based on fixed, unobserved teacher characteristics will continue to bias estimates of teacher performance based on observation scores. Unless observation scores are generated using multiple years of teacher data (and/or across multiple classes within the same year) to adjust for the nonrandom assignment of teachers to classes based on fixed teacher characteristics, annual performance evaluation that relies heavily on observation-based measures will likely mischaracterize teacher effectiveness. We believe this is particularly noteworthy for policymakers and district leaders to consider as they develop and revise teacher evaluation systems and make high-stakes decisions about teachers based, in large measure, on classroom observation scores.
Acknowledgements
The authors thank Matthew Kraft, Rebecca Maynard, John Papay, Jonah Rockoff, Brian Rowan, Andrew Wayne, and participants at the Measures of Effective Teaching (MET) Early Career Scholars Grantee Meeting for helpful comments and suggestions, and Jennifer Moore for editorial assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding from the National Academy of Education MET Early Career Grantee program is gratefully acknowledged.
