Since the mid-1990s, with the advent of the accountability era in some pioneering states, policy makers in the United States have experimented with new forms of public management in education that incorporate principles from the private sector (Aucoin, 1990; Ferlie, Ashburner, Fitzgerald, & Pettigrew, 1996; Hood, 1991; Osborne & Gaebler, 1992; Mintrop & Sunderman, 2009). Beginning in 2008, with the Teacher Incentive Fund initiative (TIF; http://www2.ed.gov/programs/teacherincentive/index.html), the federal government spurred experimentation with a performance management approach that drilled down to individual teacher performance by way of value-added test results and evaluations of teaching with presumed consequences for careers and benefits. Subsequently, a number of states (36 states and Washington, D.C.) adopted teaching evaluations as a component in their performance management systems. TIF stipulated that recipients of bonus pay grants, such as public school districts, charter management, or support organizations, use a mix of indicators to identify educators at various levels of performance and reward or sanction them accordingly. Performance was supposed to be calculated based on standardized tests, where possible on the “value added” of individual teachers, externally scored teaching evaluations, principal ratings of teachers’ progress, and participation in school development activities.
Performance management systems of the TIF-type work according to a logic that links summative performance scores (e.g., teaching evaluation scores) to extrinsic incentives (e.g., bonus pay) in order to spur teachers’ learning and efforts to improve instruction, ultimately resulting in better student outcomes (Glazerman, Chiang, Wellington, Constantine, & Player, 2011). Thanks to evaluations, funded by the U.S. federal government in parallel with TIF grants to a wide swath of educational jurisdictions across the country, we have a corpus of evaluation research that fairly conclusively demonstrates that performance management systems of the TIF-type do not seem to work as intended (see the meta-analysis of the various evaluation studies in Yuan et al., 2012). The purpose of this article is not to add to this corpus of evaluations. Rather, it reports on a study that explores in more depth why the TIF-inspired performance management logic might be faulty. With a focus on one aspect of the system, teaching evaluations, the study does this by disentangling the logic and tracing its elements over a period of 3 years in the lives of schools that volunteered to adopt the system.
While in the logic of performance management, summative and formative purposes of evaluation fuse, so that performance information guides learning and improvement, we treat these purposes as in tension with each other. While the system comes across as an incentive system with substantial performance-contingent bonus money attached to it, we begin from the assumption that at its core, the system induces or obligates procedures (e.g., use of tools or artifacts) in exchange for the receipt of money, and that this exchange might, or might not, generate incentives. Thus, the interplay between teaching evaluations as formative and summative, the use of procedures, tools, and artifacts obligated by the local TIF system, and bonus payments as extrinsic motivators is the focus of the study.
A Conducive Environment for Performance Management
Our cases are public charter schools that volunteered to take on TIF as a new performance management system. TIF was adopted by school-level administrators at the initiative of a local charter support provider. The leaders of these schools may not have known the concrete shape of the system, but they could presumably see the main architecture. Some scholars have argued that evaluation and pay for performance cannot develop their full potential because of political factors (Podgursky & Springer, 2007). Public sector institutional contexts, such as public school districts, have shown political resistance to management reforms along private sector lines (Trujillo, 2013). Unions are known to defend single salary scales (Podgursky, 2006), tenure may make teachers less likely to take evaluations seriously (Millman & Darling-Hammond, 1990), and administrators accommodating teachers’ micro-political power may shy away from potential conflict (Timperley & Robinson, 1998). We chose to study cases in which these types of political dynamics would presumably play less of a role.
Teachers in the charter schools were nontenured, nonunionized, and paid on a salary schedule that allowed for differential pay beyond seniority. Contract renewal from year to year depended on performance and was at the discretion of school administrators and governing boards. Not atypically for these types of schools, the three schools, but especially two of them, tended to recruit from a pool of young novice teachers with uncertain commitment to stay (Ingersoll, 2001; Johnson, 2004). As independent nonnetworked schools, the schools experienced the full weight of accountability, managerial independence, and market competition with district and other charter schools. Thus, unlike typical public schools, the three charter schools existed in a relatively deregulated space, commonly thought to be conducive to performance management. We want to know how, under these auspicious circumstances, external teaching evaluations in conjunction with bonus pay, both novel features for the three schools, shape how teachers engage in efforts to improve instruction.
Conceptual Framework
We develop concepts for the interplay between teaching evaluations, the use of tools and artifacts induced or obligated by the local TIF system, and bonus payments as potential extrinsic motivators with three literatures: literature on teacher evaluation, literature on shared cognition and artifact use, and literature on work incentives.
Teacher Evaluation
Evaluations may serve several main functions. They may render summative, authoritative judgments of work performance that differentiate and make visible distinct performance statuses; provide information for formative feedback that may enable learning and improvement; structure the task itself that is being evaluated through the prism of the system’s tools and artifacts; motivate effort by appealing to workers’ motives related to craft, service, prestige, or rewards and sanctions; and enable instructional leaders to manage teacher performance (Danielson & McGreal, 2000; Darling-Hammond, Wise, & Pease, 1983; Stronge, 2005).
In their summative function, evaluations refer to common quality criteria that enable quantifiable comparisons or classifications across a group of workers. Grades, scores, ratings, or other numerical or verbal categorizations, such as the distinction between an “effective” or “ineffective” teacher, communicate performance statuses in shorthand (Blase & Blase, 1998; Blase & Kirby, 2008; Goldstein, 2007). Workers can be compared with each other and they can compare themselves with others in similar lines of work. Comparisons potentially foster accountability and competition for merit (Buunk & Gibbons, 2007). Formatively, evaluations render diagnostics and feedback in longhand so that employees have sufficient and precise information on how to improve on the evaluated tasks (Glickman, 2002; Hatry, 2013; Hunter & Nielsen, 2013; Kluger & DiNisi, 2006).
While current standards-based reform initiatives attempt to combine formative and summative approaches for the purposes of both teacher professional development and high-stakes salary and personnel decisions (Bill & Melinda Gates Foundation, 2013; Dee & Wyckoff, 2015; Weisberg, Sexton, Mulhern, & Keeling, 2009), formative and summative purposes are difficult to integrate or balance (Chen, 1996; Darling-Hammond, 2013; Darling-Hammond et al., 1983; Millman & Darling-Hammond, 1990). Research on standards-based performance evaluations, implemented over the past few years (Hatry, Greiner, & Ashford, 1994; Kimball, 2002; Milanowski & Heneman, 2001; Odden, Kelley, Heneman, & Milanowski, 2001) shows that standards-based performance evaluation systems have a higher chance to succeed when teachers value the evaluation criteria as an expression of their own internal standards of quality, when they believe that the evaluation system renders judgments that are procedurally and distributively fair, when they believe they have the capacity or efficacy to reach satisfactory performance (Kimball, 2002; Milanowski & Heneman, 2001), when they consider the evaluator credible and trustworthy (Range, Young, & Hvidston, 2013; Stiggins & Duke, 1988), and when the feedback is precise and accurate enough so that teachers know how to improve and achieve better results (Archer, Kerr, & Pianta, 2014).
The literature on instructional leadership and supervision, however, indicates that instructional leaders have been neither strong supervisors willing to make high-stakes decisions based on evaluations (Darling-Hammond et al., 1983; Murphy, Hallinger, & Heck, 2013; Tucker, 1997), nor expert diagnosticians providing powerful feedback (Darling-Hammond, 2013; Milanowski & Heneman, 2001; Scriven, 1974). The literature is rife with complaints that evaluation and instructional supervision have played a largely symbolic role (Darling-Hammond, 2013; Hill, Kapitula, & Umland, 2011; Kane, Kerr, & Pianta, 2014).
Depending on how teachers feel personally implicated in evaluations, they may process the messages of the system in a more expedient and superficial fashion or in a more thorough and systematic one (Gregoire, 2003). In the former case, schools may buffer external evaluative judgment, relegate such judgments to the periphery of their attention, and process evaluation procedures expediently (Gregoire, 2003). In the latter case, evaluations could be leveraged to communicate, and enforce, a shared framework of good teaching. Teachers could adopt external evaluation criteria as guidance and internal cues of competence in their work. Summative performance judgments would, thus, become internalized as a symbolic marker of professional excellence (Milanowski, Heneman, & Kimball, 2011; Sansone & Harackiewicz, 2000).
Structuring Tasks Through Induced or Obligated Procedures
Although the name suggests that the TIF, as an experimental policy instrument proffered by the U.S. federal government, is about “incentives,” for the jurisdictions that are involved in the program, TIF is at its heart an inducement. Oftentimes, the terms inducement and incentive are treated as synonyms, but for our purposes, a clear distinction is necessary. Inducements are transfers of money to agencies in return for the production of certain goods that the government values. Inducements often come with regulations spelling out circumscribed activities that recipients are expected to carry out. Hence, they obligate specific procedures or practices (McDonnell & Elmore, 1987). TIF spells out that monetary bonuses need to be disbursed contingent on performance in multiple dimensions. With respect to teacher evaluations, the TIF grant requirements spell out that teaching evaluations should be summative, done by “objective measures,” and with external scoring (Max et al., 2014).
Contrasting with inducement, an incentive has the connotation of motivating an outcome or a performance, to kindle or encourage hard work or effort as workers anticipate rewards (Mitchell, Ortiz, & Mitchell, 1987). TIF brings resources to participating jurisdictions in return for implementing procedures or practices that locally may, or may not, produce performance-contingent incentives. Whether the TIF inducement transforms into management by incentives largely depends on local organizational managers and leaders. Employees may implement the required practices in return for money, but whether they feel compelled to increase effort and improve practices according to stipulated evaluative indicators is a different matter. Literature on work incentives, discussed in the following section, helps us shed light on this dynamic.
In an inducement logic, schools may conceivably comply with obligated procedures in return for money without actually incentivizing performance. At the most basic level, TIF obligates teachers to repeatedly engage with specific artifacts and procedures in return for resources that accrue to the organization as a whole and individually. Regardless of the motivating function of evaluations, the literature on distributed cognition shows how the presence of artifacts, tools, or instruments in continuous use shapes existing practices of organizations or transforms them (Halverson, 2003; Halverson & Clifford, 2006; Halverson, Kelley, & Kimball, 2004; Hutchins, 1995; Nemeth, O’Connor, Klock, & Cook, 2006; Salomon, 1993). Artifacts are the tools, routines, policies, or symbols that serve to reduce task complexity and structure work behavior in organizations. Artifacts can infiltrate the language with which individuals process and make sense of their tasks (Hutchins, 1995). Relevant artifacts for the purposes of the TIF project are the “Summative Evaluation of Teaching” (SET) tool, the production of lesson videos, the California Standards for the Teaching Profession used by principals as orientations for formative feedback, the TIF data dashboard, an online platform for teachers to check external evaluation scores and raters’ feedback, and finally, protocols used in teacher inquiry groups or professional development sessions to structure conversations about instruction.
The distributed cognition literature explores the interaction between artifacts and their users by looking at the constraints and affordances that artifacts engender (Hutchins, 1995; Salomon, 1993; Zhang & Patel, 2006). In the case of formative and summative evaluations, a lesson observation tool affords an analytical lens through which the complexity of teaching can be captured, but it also constrains this lens by way of the behavioral indicators that the tool privileges for observation, irrespective of the motivational quality of the evaluation situation. A video of a lesson creates visibility of ordinarily shielded instructional practices, but a summative stand-alone video also constrains what is seen. As actors (or users) interact with these tools, the tools or artifacts inscribe themselves individually and collectively into actors’ cognitions. Actors make sense of artifacts in the context of existing beliefs, understandings, knowledge, skills, norms, values, and practices that are shared within the organization. Artifacts are translated and incorporated into these cultural contexts selectively (Hutchins, 1995). Integrating evaluation tools into existing organizational cultures of schools can help reshape attention, make goals more precise, make performances more visible, and, in this way, enhance organizational learning (Halverson, 2003; Halverson & Clifford, 2006; Halverson et al., 2004).
Work Motivation and Monetary Incentives
With respect to monetary incentives attached to evaluation scores, there are currently two lines of thinking about work motivation of public service workers, including teachers. Public service motivation research (Perry & Hondeghem, 2008) contends that service commitments (e.g., compassion, civic duty, public interest, self-sacrifice) remain a strong counterpoint to the increasingly dominant emphasis on extrinsic reward calculation with career, prestige, status, or money. This is echoed by a strand in the teacher work motivation literature that sees teachers as primarily intrinsically motivated, that is, motivated by notions of socially and personally meaningful work (see Firestone & Pennell, 1993, for a review). But with extrinsic performance incentive systems having taken hold in education and other public services during the high-stakes accountability era, studies have shown that goals and incentives may mobilize teachers to increase effort and engage in improvement (Finnigan & Gross, 2007; Kelley, Heneman, & Milanowski, 2002; Kelley & Protsik, 1997; Milanowski, 2000; Mintrop, 2004).
Monetary incentives are a special case. Those who advocate pay for performance claim that bonus pay may motivate teachers to work harder and encourage organizational dynamics to be more goal driven (Lazear, 2003; Podgursky & Springer, 2007). But the pay for performance literature tells us that monetary incentives have had a patchy track record in education as to their impact on student outcomes, effects on teachers’ beliefs and practices, and sustainability over time, both during the 1980s (Murnane & Cohen, 1985) and nowadays (National Research Council, 2011; Springer, 2009; Yuan et al., 2012). Recent studies have shed some light on why this is so. They find that pay for performance is often implemented with serious flaws as to reliability of procedures, validity of measures and payouts, required knowledge, skills, and capacity, and meaningfulness for practice (Glazerman & Seifullah, 2012; Malen et al., 2015; Marsh et al., 2011; for a review, see also Yuan et al., 2012). Pay for performance schemes seem to be more effective when they occur in tandem with activities that enhance capacity (Glazerman & Seifullah, 2012; see also Taut, 2014, for the Chilean case), presumably facilitating the fusing of formative and summative purposes.
The literature on the use of pay for performance in other industries suggests that bonus pay works best when employees are largely extrinsically motivated (Lazear, 2000), for example, in sales, where bonuses are overt features, performance statuses are sharply distinguished based on clear metrics, and performance status exerts a forceful motivational punch (see Banker, Lee, & Potter, 1996). When employees are largely intrinsically motivated, fulfill complex tasks that are difficult to capture with simple metrics, and produce outcomes in teams, bonuses become a double-edged sword (Frey, Homberg, & Osterloh, 2013). Strong extrinsic incentives, for example, fear of sanctions or expectations of being rewarded monetarily for a job well done, can become controlling when they overshadow existing service or citizenship commitments. Frey et al. (2013) describe this phenomenon as a crowding-out effect (Frey, 1994; Frey et al., 2013; Frey & Jegen, 2001; Frey & Osterloh, 2002; Jacobsen, Hvitved, & Andersen, 2014). Crowding out of intrinsic or service commitments may manifest itself, for example, in the phenomenon of demoralization when employees feel controlled by extrinsic incentives, for example, the power of sanctions, and invalidated in their intrinsic motives.
Inquiry Questions
We did not begin the study with a fixed conceptual model. Rather, we generated from the literatures discussed above a set of theory-guided inquiry questions that we pursued over several years:
How might the charter school environment accommodate the TIF-inspired performance management logic?
How might summative judgment and formative learning around evaluations interact with each other over time?
How might the repeated use of evaluation tools and procedures create affordances for learning and improvement efforts?
How might summative evaluative judgments about work become internalized cues for competence?
How might bonus money by itself incentivize the striving for higher evaluation scores?
How might controlling summative judgment reinforced by extrinsic incentives constrain, or crowd out, learning and efforts to improve?
Method and Data
The data were generated in a longitudinal study that began in the 2011-2012 school year. Data from three school years are analyzed. The school year 2011-2012 was the first year of implementation; the school year 2013-2014 was the last school year before the sun-setting of the project. We collected data in three secondary schools that we treated as three separate cases. In the findings section, we report on overall patterns and point out where patterns differed from school to school.
The local TIF project grant was directed by a nonprofit organization. This nonprofit supported the implementation of the TIF grant at the three schools. The director of the nonprofit as well as the principal of School A (at the time of the grant) conceptualized and wrote the TIF grant. Once awarded, the TIF evaluation system was designed by the nonprofit director and school leaders from all three schools. Additionally, feedback from teacher focus groups was incorporated. The local system consisted of multiple student assessment measures, formative and summative teacher evaluation, and participation in leadership and collegial learning activities. The overall number of performance measures across grades and subjects shifted between 20 and 24. Characteristically, depending on subject and grade level, teachers could garner bonus pay through about 12 different measures. A table in Appendix B, available in the online version of the article, lists the main indicators and bonus amounts. In this article, we focus on merely two scores: the Formative Evaluation of Teaching (FET) and the SET scores.
Cases
The three schools are relatively small in size (see Table 1). Two of the schools are high schools, and one combines middle and high school grades. The schools are located in distinctly poor sections of a metropolitan area in northern California. Within the state accountability system in the school year 2011-2012, School C performed strongly for its demographic profile with an Academic Performance Index (API, now defunct) close to 800 (the state’s target). Schools A and B, on the other hand, were poor performers by state standards.
Table 1. Demographics and Performance 2012.
Note. API = Academic Performance Index; “C” indicates that the school had significant demographic changes and will not have any growth or target information.
The three schools have in common that they blend charter school autonomy with a strong avowed commitment to social justice and serving economically marginalized students of color and immigrants.
Data
The study is qualitative. We conducted eight rounds of semistructured interviews. All in all, 52 teachers and 15 administrators participated in data collection over a period of 3 years. A total of 130 interviews were carried out, augmented by 65 hours of observations of faculty meetings and meetings at grade and subject matter levels during which learning around instruction took place (see Table 2).
Table 2. Summary of Interviews and Meeting Observations per School.
Codes for analysis of interviews were developed following the concepts derived from the relevant literatures (Miles, Huberman, & Saldaña, 2013). The interviews were coded using a qualitative data analysis software (Dedoose). We developed 27 distinct codes, the majority of them derived from theory. Some additional codes captured emergent phenomena (see Appendix A, available in the online version of the article). The theory-derived codes were defined, operationalized, and illustrated with representative quotes from interviews. We trained a team of four coders for interrater reliability. Twenty percent of interview excerpts were coded by two coders and the discrepancies were treated collaboratively to clarify the concepts among the coders.
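As a purely illustrative aside, agreement between two coders on a double-coded subset can be quantified with a chance-corrected statistic. The sketch below computes Cohen's kappa; the choice of kappa, the function name, and the code labels are all assumptions made for illustration, since the procedure described above does not state which agreement statistic, if any, was calculated.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same excerpts.

    Illustrative only: the study does not report which agreement
    statistic was used, so kappa is an assumption here.
    """
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of excerpts.")
    n = len(coder_a)
    # Observed agreement: share of excerpts given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's marginal code frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both coders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical code labels for four double-coded excerpts.
coder_1 = ["incentive", "artifact", "incentive", "feedback"]
coder_2 = ["incentive", "artifact", "feedback", "feedback"]
kappa = cohen_kappa(coder_1, coder_2)
```

Discrepant excerpts (here, the third one) are the ones a coding team would discuss collaboratively to sharpen the code definitions.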
The coding proceeded in four steps. First, we coded according to five broad descriptors. Second, we coded with theory-guided subcodes and some emergent codes in each broad bin, initially with two coders and, once interrater reliability was established, with one coder. Third, we sorted the coded interviews according to time or interview round. The “data dashboard release,” that is, the release of performance data and bonus payments each year, marked stages in the life of TIF. These stages roughly followed Years 1, 2, and 3 of implementation. Finally, we compared and looked for patterned differences across time for each code, in the process reducing data with the help of a metamatrix (Miles et al., 2013) that summarizes patterns in concrete descriptive language for each relevant code and each stage.
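For readers unfamiliar with the format, a metamatrix of this kind can be thought of as a grid with one row per code and one column per stage, each cell holding short descriptive summaries. The following is a minimal sketch of that data structure only; the function name, code labels, and placeholder summaries are hypothetical and do not reproduce the study's data or the software actually used.

```python
from collections import defaultdict


def build_metamatrix(coded_excerpts):
    """Group descriptive summaries by code (row) and stage (column).

    coded_excerpts: iterable of (code, stage, summary) tuples.
    Returns a nested dict: {code: {stage: [summary, ...]}}.
    """
    matrix = defaultdict(lambda: defaultdict(list))
    for code, stage, summary in coded_excerpts:
        matrix[code][stage].append(summary)
    # Convert to plain dicts so missing cells raise KeyError instead of
    # silently creating empty entries.
    return {code: dict(stages) for code, stages in matrix.items()}


# Hypothetical entries; codes and summaries are placeholders.
excerpts = [
    ("bonus_pay", "Year 1", "placeholder summary 1"),
    ("bonus_pay", "Year 2", "placeholder summary 2"),
    ("artifact_use", "Year 1", "placeholder summary 3"),
    ("bonus_pay", "Year 1", "placeholder summary 4"),
]
metamatrix = build_metamatrix(excerpts)
```

Reading across a row (one code, several stages) is what supports the comparison of patterns over time described above.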
Findings
In the three charter schools, the adoption period was characterized by consonance, the belief that the TIF-inspired performance management system could be implemented with relative ease and that the benefits of implementation outweighed the costs. In midlife, dissonance set in. Performance contingencies attached to both bonus and external evaluations were perceived as disconfirming the values of the schools. Incentives and status competition were largely rebuffed and relegated to the periphery. Once the power of incentives became latent, a period of resonance set in. During this phase, first administrators and then teachers came to interact with the two main artifacts, videos of lessons and the SET observation tool, in ways that contrasted with the previous period. We report findings in three sections that roughly follow the 3 years of implementation (see Table 3).
Table 3. Three TIF Implementation Phases (Qualitative Metamatrix).
Note. TIF = Teacher Incentive Fund; SET = Summative Evaluation of Teaching.
Consonance
Adoption of the TIF project began when the principals and the nonprofit support provider (the “provider”) received the TIF grant. From the beginning, this group of leaders entertained two motives: to garner additional resources for the schools and teachers and to implement a performance management model that had the potential to improve instruction and learning. The executive director of the provider articulates these dual goals best:
Initially, I looked at it, and I said, “You know, I’m not convinced about paying for test scores, and I’m just not certain that this is up our alley.” And she [school leader] said to me, “You’re thinking about this all wrong. We don’t have to just submit a proposal that’s a simple test scores go up, you get bonuses. We can really use this as an opportunity to really think more broadly about what do we want to incentivize? And how can this be a tool and a lever for improving schools?” So really, a combination of just wanting to get more resources to schools, especially get more resources to teachers, and the opportunity to try to think a little bit differently about how we could design this, is what enticed me to do it. (The provider, Director)
While the TIF money was clearly a strong motive, the management philosophy behind TIF, its stress on evaluation and its intent to improve instruction by rewarding teachers, made sense to the adopting leaders. They did consider the system as a “lever of improvement” as much as a way of attracting more monies to the schools. “Incentivizing” teachers was something the school leaders were familiar with. They paid differentially for a number of organizational purposes, such as teacher leadership functions, teaching in hard-to-staff subjects, rewarding senior teachers, and so on. The principal at School A, who was instrumental in writing the TIF grant and designing the system, expressed her ideas in this way:
Believing we had these two things in designing the way we did, was so that there was some check and balance for the principal and whatever that relationship is with the teacher being subjective. . . . That’s why we created this system where there is still this regular evaluation, and there is a bonus for that. And that was the basis for a lot of the instructional improvement and the professional development in working with principals to work with their teachers in this way. I think something can be said for financial incentives over all. It’s good to be able to give more money to teachers. I think there’s something to be said for having a performance-based bonus system. . . . I don’t think it should be all about test scores by any means. But, it’s good for people to feel respected, rewarded. (Principal, School A)
The early affirmation and excitement supporting the evaluation component of TIF is captured by an instructional leader at School C:
My goals for the evaluation system or what I hope their goal is, is sort of twofold. One is to streamline what we think a [School C] teacher is and what [School C] practices are. This is our tenth year. We have to be able to say to new staff, here’s what [School C] teachers do. Here’s what we’re looking for. . . . And I think the second goal of the evaluation should be consistency of practice. And by that, I don’t mean that every teacher should be teaching the same way because that’s crazy. But instead, that we’re meeting some sort of expectation, that all students are engaged, that our content is rigorous. (School C, 125)
Data, evidence, evaluation, incentives, and rewards, the buzzwords of the new performance management system, held a certain appeal and did not seem to be in conflict with the strong collegial cultures that these leaders had established in their schools and held dear. As described by an instructional coach at School C:
On the whole, everybody at School C is willing to work together and help each other out and has collegiality, truly. And part of the reason I’m excited about having the evaluation system is so we can say here’s what we want. And here’s what we don’t want. Because I think sometimes it’s hard for [our principal] to say that—to just say this isn’t working. (School C, 208)
The core mission, widely held in the three schools, was one of strenuous service and care, which placed the quality of relationships among students and between students and adults in the center. Below are quotes from each school exemplifying this sentiment:
I think most people in our school work really hard. They spend a lot of time caring for the students. . . . I think the faculty at this school is . . . motivated in the sense that this is really personal for them. (School A, 423)
I know that everybody has sort of a shared vision of how much we care about doing the best thing for the students. . . . I think we’re serving a really important population of kids that aren’t served in lots of other places. (School B, 304)
We’ve defined five tenets of what we do . . . and I think we’ve kept really good fidelity to it. We have the rigorous curriculum, keeping high expectations for all students, a component of knowing each student and family well, and meeting the needs of those people and those groups on an individual basis . . . and the last one which is I think the most descriptive—which is knowing, valuing, and trusting teachers as professionals. (School C, 232)
The three schools varied in the degree to which they stressed academics. Common was the idea that the students’ marginal social status in society should not deter them from going to college and that it was the role of teachers to work hard for this goal and to further social justice in society with their unstinting engagement and prosocial commitment to students. One of the schools had a reputation for being strongly focused on academics and it had the above-average test scores for its demographic to show for it. One school connected its focus on relationships with the aim of cultivating critical thinkers and critical citizens. In the third school, the faculty was dedicated to serving students who had failed in regular district school settings. At the outset, a strong service and social justice orientation toward marginalized students did not seem to conflict with the TIF management philosophy of measuring performance and rewarding differentially.
Established adult performance and learning culture
All three faculties functioned with the widely shared assumption that everybody at the schools worked very hard and did indeed go the extra mile. Teachers and administrators held that additional incentives to increase effort were actually not needed. Given faculty attrition, especially high in Schools A and B, but not as much of a problem in School C, principals aimed at cultivating their longer term staff to stay at the school longer than 3 years. Each year, a few teachers, deemed ineffective by the administrators, were not asked to return, but their numbers were very small. For administrators in Schools A and B, getting teachers to stay was a looming challenge in the face of which “weeding out” ineffective teachers paled as a concern. In all schools, salary incentives were used to recruit and reward teachers in hard-to-staff subjects and reward effective senior teachers.
Instructional supervision was handled in the spirit of support and openness. Principals reported that they tended to observe their teachers informally and then debrief with the observed teachers over email or through brief conversations in person. The California Standards for the Teaching Profession were used as a broadly structured framework to generate a metric of accomplishment, but only in a casual way. The most important objective of supervision was to socialize teachers into the culture of the schools. As a result, direct feedback on instruction would be blended with suggestions about collegial learning and fitting in with the “ways of the school.” One principal stated that his purpose in using formative feedback was to communicate the “School C way” of doing things.
When the project started, the teachers described the professional learning culture in the schools as intense, but loose. A teacher in School A described professional development this way:
There’s a little bit of PD, like we’ve targeted certain things—ELL, socio-emotional relationship building, so those are our two things that surfaced in the early spring . . . that was informative for those, and I think on a figuring out level, that would be more systematized. That makes sense. But it wasn’t articulated. It was like, “Oh, that makes sense to do this,” so there wasn’t a whole lot of transparency, not with the intent of trying to be non-transparent, but we’re figuring it out as we go. (School A, 422)
In School B, teachers learned together as a whole faculty once per week, and the meetings were focused on building social emotional relationships between teachers and students. Administrators at the school would also meet with teachers informally throughout the year to provide coaching and feedback on instruction. Once a year, administrators would meet with teachers during a conference to discuss a performance management plan that defined goals for future professional development. The principal described the performance management plan this way:
I mean, it’s clear but it’s not very effective for teachers because there’s only a small section about teaching and it’s too generalized. And so a lot of it has to do with, like, other—like, sort of other kinds—other responsibilities like calling home to parents, which is all part of teaching, but I’m talking about, like, implementing curriculum and instruction with helping in the classroom. There’s very little if anything about that. (School B, 317)
School C had an established approach to professional learning. All teachers regularly met in inquiry groups as grade-level or subject matter teams, and beginning teachers were served by a group of experienced instructional coaches with subject matter expertise. The inquiry groups met to discuss student work and student behavior, collaborated on lesson planning, and strategized for long-term goals, such as getting all students to graduate on time and apply to college. While coaching was established, the observation and feedback structure was loose and informal. An instructional coach at School C drove home this point:
We haven’t received anything like a rubric or what it [an evaluation rubric] will look like in the future. My teachers—we all have room to grow. I certainly do, maybe more than others. I spent a lot of last year being like, okay, is this what you’re looking for? Here is what we’re doing. I’ve never in my life of teaching ever received a formal, real evaluation of more than four minutes . . . (School C, 208)
The adoption of the TIF performance management system unfolded within these loosely structured frameworks of professional values and adult learning, with teachers and leaders at the time of adoption desiring more specific feedback with defined structures and routines to improve teaching and student learning.
Evaluation of teaching
Teachers appreciated that they received feedback in a supportive and constructive spirit: “It’s nice to have another pair of eyes and someone who is far more experienced than I am to help me grow and also just the way that it’s done like in a very loving way that is honest” (School B, 309). Yet they wished that supervisors, coaches, or mentors would look at teaching more closely, that they actually “would encourage [teachers] to change [their] teaching practice” (School A, 403). More specifically, many teachers indicated that they would like more precision when receiving feedback:
What I would hope for it would be just like qualitative feedback from experts about what I was doing well and what could be improved. Maybe just some actionable steps for moving forward in my teaching. (School C, 237)
I think that having the system by which we were being monitored and given the feedback and helped to improve, not just saying, “Here’s where you’re screwing up,” but having you as a teacher identify, “This is something I am struggling with,” and then hope your mentor says, “Okay, let’s look at this and figure out what we can do about it. Let’s figure out what we can try.” (School A, 409)
I’ve never gotten feedback that said, “I saw you doing this. These are my questions about it. Next time I think you should do this.” Usually the feedback I have gotten has been more in the form of like, “I notice. I wonder. I appreciate.” (School C, 222)
In answer to these initial desires for more precise feedback, the TIF provider, school administrators, and university-based researchers collaborated to design an evaluation tool in a consensual manner. Several teacher focus groups gave input. Out of this work came the SET instrument (later renamed Sample of Effective Teaching). Mindful of keeping evaluations straightforward for the many beginning teachers, the tool aimed at simple teaching functions, such as activating prior knowledge and interest, introducing new content through modeling and co-constructing dialogue, checking for understanding, student practice, or giving feedback, and it attached low-inference behavioral indicators to these functions.
The program evaluators conducted independent observations of lessons across the three schools using the SET instruments during the early adoption phase to establish a baseline of instructional quality in the three schools. While a number of teachers taught exemplary lessons, many lessons would have indeed benefited from closer clinical attention. Observations showed, among other things, that many lessons were oriented toward practice with little new stimulating content, new content was introduced through modeling and teacher monologue, little explicit feedback was given, few instances of students learning from error or misconception were observed, and very few lessons attempted closure other than “exit tickets.”
In sum, there was an articulated need for clinical supervision and learning. Teachers clamored for more precision in feedback and were eager to learn. The SET tool seemed consensual. Performance evaluation per se was not considered detrimental to established learning cultures, and visibility of instruction was prized over privacy in the culture of the schools. Evaluation was associated with precise feedback and learning, not primarily with judgment or reward. TIF-inspired teacher evaluations seemed to be off to a good enough start.
Bonus money
“Extra money is always welcome” was the pervasive answer at the beginning of TIF. Most teachers considered monetary rewards as recognition and validation for effort and commitment already expended on their work, and not as an incentive to motivate more effort or new behaviors. Money was an inducement to keep doing what one had been doing all along, and this inducement generated positive feelings. Comments such as these abounded: “It is nice to be recognized. I am thankful for the money.” “Money is a way to compensate us for the hard work we are doing.” Administrators and the provider framed the TIF project in this way as well. For the hard work that teachers were doing, they often stated publicly, they deserved higher salaries than what the schools could pay. And TIF was an opportunity for the schools to attract more funds to augment salaries:
This whole process has been very interesting for me, yeah. There are times where I wonder [given our demographics . . . ] if we were the only high school that scored in the 800s, would I wanna be recognized for that? I know it took work. You know, I know it added all these gray hairs to my head. Would I wanna have some financial recognition for that so that me and my family could take a vacation? Then I’m like, “Yeah, of course.” I see it as a gravy component though; I don’t see it as a driver. I see it as a recognition. I don’t see it as a motivation, if that makes any sense. (School C, 232)
TIF money was a token recognition of collective deservingness.
Dissonance
Unanticipated implementation challenges abounded, not unlike ones documented in the literature on pay for performance (Chiang et al., 2015; Marsh et al., 2011; Milanowski, Witham, Schuermann, Kimball, & Pietryka, 2010; Rice et al., 2012). A far-reaching decision was made early in the project: the substantial funds that the federal TIF grant provided for capacity building around TIF metrics and implementation were rolled over to the schools in a lump sum and were no longer available to the provider. All three schools used good portions of these funds to compensate for coincidental state budget cuts so that they could keep staff or buy essential equipment. In an inducement logic, this decision made sense. As long as the recipient of funds engages in obligated practices, the funds are justified. In an incentive logic, the decision was detrimental because it made it difficult to find time to familiarize faculties with the performance management system and facilitate the internalization of its judgments. Instead, introduction to the new system was curtailed to a few short presentations during faculty meetings, resulting in many teachers feeling left in the dark about the complexity of the system.
Dissonance also set in among the local TIF leaders when it became apparent that data processing and data dashboard design were tasks much more demanding than envisioned. The technical assistance vendor seemed ill-equipped to deal with this complexity, yet no funds were available to procure additional services. The result was that principals were left to carry a large part of the technical side of the performance management system, a role that absorbed much energy for TIF almost to Year 3.
In two schools, different principals took over during the dissonance phase and soon came to guide the implementation of TIF. In School A, a new coprincipal in charge of TIF appeared to be unwilling to support TIF, all the way to the point of refusing basic compliance with formal procedures. In School B, the new coprincipal was simply overwhelmed with multiple duties. These events diminished implementation quality greatly.
Evaluation of teaching
While in the early phases teachers were primarily concerned about the clarity of the metrics and the process, the situation changed dramatically in the fall of 2012, when the first data dashboard release made the first summary performance scores available and teachers had to cope with performance judgments. The scores for the FETs were drawn from the quarterly conferences with instructional supervisors. Supervisors partly drew FET scores from classroom observations, but holistic judgment was their main basis. As a result, formative or FET scores were more in line with teachers’ expectations:
Usually I feel it’s accurate. Of course when only 20 minute observations are being done, sometimes I feel maybe there was something missed or whatever, but in the most part, yeah I think. (School B, 309)
That’s always my most helpful thing is when he observes me, so to get him in here on a set schedule I think was definitely helpful and I always learn from that. (School C, 220)
FET scores were on the whole higher than SET scores, which were summative and externally generated. When SET scores were released, they were considered conspicuously low, despite the fact that 61% of participating teachers received a score of “applying” that qualified them for a monetary award. For some participating teachers, a score below 3 was not troublesome because they considered themselves novices or learners:
The things that are highlighted as problems or things that I need to work on are the same things that I was thinking oh, yeah, that didn’t go well. (School B, 303)
I’m a brand new teacher, this is my third year. So even telling me: you’re not very good at things, well, yeah, I know I’m not very good at things. (School C, 202)
But for more senior teachers who considered themselves effective or had been rated effective by their instructional supervisors in the past, the SET scores provoked disbelief:
When people didn’t get the highest SET score that are used to be regarded as model teachers here or had model classrooms before, then people’s SET scores seemed to be more of a discrepancy or not in line with the [school leaders]. That’s where I saw more of the hurt. (School C, 220).
Two reactions to the performance pattern were conceivable: teachers could revisit their videos and try to understand their score, or they could discard the score as invalid and the whole process as spurious. The latter was the prevalent response.
A host of reasons were mentioned as justification for the SET scores’ lack of validity and usefulness. Interviewees stated that their ideas about good teaching were not aligned with the SET tool because their teaching was more constructivist, a position held by many senior teachers especially at School C; the lesson structure encouraged by the SET tool was so basic as not to deserve attention; external judgment was out of context; and teaching a lesson to a formal template resulted in a sense of what Stephen Ball (2003) has called performativity:
I didn’t—you know, my thing was like I didn’t want to create like a play, like I was rehearsing for a stage thing. Because I was thinking, you know, how useful is it for me to do a fake stage that like I wasn’t interested in that. (School A, 421) Right now it just seems like something we have to do at the end of the year. I feel like a student scrambling to get things done. (School C, 212)
The ratings seemed arbitrary, and without more detailed formative feedback the summative score was seen as useless. Teachers did not “have a strong feel for the distinctions between the different levels” (School C, 219), and they wished that:
Whoever observed the video to then sit down and have a conversation with me. Have a coaching session with me based on what they saw. I would take free coaching in a minute even though it means being observed and evaluated. But to get a piece of paper that says these indicators were present is not as useful to me. (School B, 301)
And so, fine, if you want me to type it into a paper, I will, but I’m doing it for you; I’m not doing it for me. If you were going to sit down with me beforehand, talk me through it, observe me, discuss it afterwards, okay, then I have a chance to benefit from that. But if it’s just me emailing you something so that you can say, “We have this many teachers that did this well in the SET lesson,” like, that’s not really for me. (School C, 223)
For some, the evaluative judgment was upsetting:
I care about being a good teacher. And this year, my score was really low. It was lower than any—like my first year teaching, which, in my head, I was like well, this definitely isn’t a valuable tool to me because I have definitely improved as an educator since my first year of teaching. So I think the tool—I get nervous and I feel vulnerable because I’m being evaluated, so it makes me feel vulnerable. (School B, 309)
In sum, the prevailing sentiment was to question clarity, validity, fairness, and usefulness of the evaluations, and with it, the need to learn from the information which the tool or the evaluations could potentially provide vanished. These responses are in line with what has been recorded in the literature on standards-based evaluations mentioned earlier (Kimball, 2002).
When it came time to ramp up for the Year 2 SET submissions, the skepticism toward the summative evaluation had spread across faculties and become a collective stance that expressed itself in repeated commentary that the SET was largely invalid:
I don’t trust the SET because there’s not like a connection between the person—like who is this person giving me feedback? (School B, 309)
Principals, as well, sensing their teachers’ negative attitudes, did not press the point and backed off the SET by remaining silent. One principal, echoing shared sentiments in a reflective meeting with principals, asserted that she made a point of not highlighting the system (during the phase of dissonance) because she herself no longer felt that the system yielded reliable information and she did not want to upset her faculty.
Originally, the plan was for formative conferences and summative scoring of videos to become increasingly aligned around a common tool, but instructional supervisors retreated from this intention as teachers became more and more disenchanted. As a result, summative judgments seemed disconnected from formative conversations and practice:
I was, like, please tell me how to do a better lesson closing. Please. I would love it. I know that my lesson closings are not effective because my class gets so differentiated that at the end I’m like: oh, God. Not once, zero times, has somebody said to me how about I watch five of your lesson closings and then you and I sit down together for an hour and talk about an alternative which you try. Then, I’ll watch again. Any money that’s being spent on SET. Are you kidding me? Pay that person to come do that with me. I would love that. (School C, 223).
Participating in the SET was voluntary and the principals’ silence reinforced teachers’ discretion in retreating from the summative evaluation. SET video submissions declined during the dissonance phase, as shown in Table 4.
Table 4. Teachers Eligible for SET Reward and Percent Submitted.
Bonus money
The way the three schools dealt with evaluations mirrored the way they dealt with bonus monies. That is, over time acceptance dwindled. While in Year 1, money had the aura of a certain innocence, in Year 2 money was still welcome but also an annoying and discordant feature of organizational life:
In general everybody seemed confused. They were like okay, that’s cool I got money, but I don’t really understand. (School C, 220)
I don’t even know the whole matrix calculation. (School B, 306)
I think it’s just like some based on like luck and other things based on like just how scores were calculated by the state. (School B, 304)
I think people feel like whether they earn their full bonus or not is largely not dependent on what they as an individual do on a day by day basis. (School C, 217)
In the consonance phase, bonus monies were met with a sense of collective deservingness. In the dissonance phase, this belief was undermined. Interviewees hinted at their suspicion, or their knowledge, that payouts were surprisingly unequal across groups of teachers. Principals, as well, reported in the TIF steering committee that it was readily apparent that teachers teaching 12th-grade courses were advantaged because their bonuses depended largely on “easier” internal school measures, while teachers in lower grade levels were assessed on state tests. Evaluative judgments from teaching evaluations carried the greatest sting:
[ . . . ] People were, like, great we got extra money, but was it worth all this extra stuff that we’ve all been talking about and doing? That’s generally how people were acting. Some people seemed put off like I have this traditionally high status and I got one of the lowest amounts. [ . . . ] What’s going on with that? Even with the SET ratings I actually saw people get more upset about their SET scores than about the money itself. When people didn’t get the highest SET score that are used to be regarded as model teachers here or had model classrooms before, that people’s SET scores seemed to be more of a discrepancy [ . . . ] That’s where I saw more of the hurt. The money thing was like we got this money whatever. (School A, 407)
In the dissonance phase, bonus money became a submerged topic of communication. Teachers revealed that administrators encouraged teachers to use discretion. Administrators, in the steering committee, confirmed: “We never spoke about the money. We made sure that the money issue never made it big. We knew what was behind all of this . . . and so we kept quiet.” Other principals concurred with this comment. Money, in contrast to teaching, which fell under norms of visibility, was treated as a personal issue: “I think that the money thing has always been a private thing” (School B, 306). Teachers, for their part, were careful to avoid comparisons, sensitive to people getting different amounts, and careful not to make people feel bad. As one teacher explained:
I think right now it’s been kind of like this air of secrecy and privacy where like we haven’t necessarily been encouraged to talk about [the scores and the money]. I think that’s been a little damaging. . . . Don’t gloat or ask people about [it] is what we say to the kids, right. Like they’ll tell you if they want to tell you. It kind of created this environment, I think, where people were reluctant to talk about [it]. (School C, 217)
This feeling of discomfort was punctured when the notion spread among teachers that TIF was just one big “piñata.” The notion originated in the TIF steering committee convened by the provider. Two things came together. In the transition to the Common Core Standards, the state government had abandoned its state test and the No Child Left Behind–like sanctions regime. The state performance measures, however, were the linchpin of the TIF bonus pay system and were now no longer under serious consideration. The “coup de grâce” came when the bonus monies were paid on time in the fall, but because of glitches in the database, the performance data were not released until months later, making the glaring disconnect between bonus award and performance visible to all. The notion of a “piñata” took the sting out of the differential amounts that the TIF system bestowed on teachers. It returned the system to its original “innocence” during the period of consonance when it was unconnected to performance and was embraced as bringing money to the school in whatever form.
Resonance
As in the two previous periods, new developments originated in the TIF leadership team. The state tests had vanished and the No Child Left Behind sanctions regime had imploded. The technical side of the TIF system was in shambles. Participation in the SET, the only part of the system that may have had any viability left, had shrunk. Paradoxically, new resonance sprouted when the provider, school administrators, and evaluators jointly acknowledged that the incentive function of the performance management system had been a failure. Concurrently, a new director took the helm of the provider organization. At the nadir of the system, this new director changed course. From now on, a portion of every steering committee meeting was dedicated to analyzing teacher-submitted videos and practicing principal–teacher instructional conferences. In this context, watching the videos and, for the first time, actually becoming aware of the quality of the lessons motivated the school leaders to return to the original formative purpose of the SET, that is, encouraging teachers to learn around instruction. Incentives moved to the background, while concerns for clinical learning came to the front.
Evaluation of teaching and bonus money
In Year 3 of implementation, teachers found ways to inure themselves to the divisive or discomforting sentiments generated by the performance management system. Most teachers stopped interacting with the data dashboard altogether and no longer checked their performance scores, as these teachers explained:
I think at some point you just decide that it’s not valid information or something because the information that you get back on it isn’t just—I don’t find it to be especially useful as far as figuring out what I need to work on. (School C, 217) I actually find it to be punitive sometimes rather than encouraging. I think it just picks at things that are happening. (School B, 301)
Performance scores and bonus payments were treated with silence and became a largely private affair. But below outward silence, they exerted their latent presence, not as valid measures of one’s performance, but as an irksome feature that could potentially disquiet one’s self-perception as a good performer and one’s sense of being fairly compensated for one’s work relative to others. Those who had negative experiences with their scores tried to distance themselves:
I told [my supervisor] what would work for me is if for her to see the data and then because she knows how to work with me, she knows how to be gentle or whatever, so if she reads the data and then she could help me through that data like oh, these are the things you did great. (School B, 308)
Despite this distancing, the evaluation remained a latent concern:
I’m curious. I’m curious as to know, what they’re (the external evaluators) going to say and—but I don’t have like, you know, big expectations. (School A, 419) I think with the SET last year it was like—a little bit like, oh—you know, I hadn’t really necessarily tried a ton. But, you know, you’d like to, you know be rated well, and I wasn’t. (School C, 221)
Obligated procedures
While the use of artifacts associated with teacher evaluation, namely the SET observation instrument and the videos, receded for summative and incentive purposes, their use for professional learning in faculty professional development advanced. This was largely due to the role of the provider during the 2013-2014 school year. The provider invited principals and other instructional leaders from the schools to systematically analyze samples of teaching from the SET video submissions and to recognize strengths and weaknesses in submitted lessons. Viewing lessons submitted by their teachers created problem awareness among leaders. Especially at Schools A and B, to varying degrees, professional development became organized around improving lessons as a result. Principals (Schools A and B) and instructional leaders (in all three schools) reframed the function of the SET tool as a way to analyze lessons in a more precise fashion and to learn from this analysis. Given space constraints, we merely give a flavor of the kinds of conversations that resulted. They show that depth of analysis and precision varied. The use of videos and observation tools was fused with existing ways of looking at lessons, which tended to invite open-ended reflections on “wows and wonders.”
School A
At School A, the school principal and the instructional coaches started a professional development cycle that they called “lesson study,” making explicit reference to the Japanese practice of analyzing lessons collectively. In total, they held nine sessions in this cycle during Spring 2014, facilitated by the principal and the coaches (hereafter the coaching team). In the first sessions, the focus of lesson study inquiry was on “communicating criteria for success” as a strategy for students to monitor their own learning. The SET tool and the video were not used as central artifacts during these sessions. Teachers planned lessons together and subsequently used an open-ended observation tool which asked them to state how well they thought students were monitoring their own learning.
This changed after two lesson study cycles when meetings came to address the submission of video lessons. Teachers felt unclear about the SET evaluation. A coach gave the following explanation:
They’re looking for indicators, and we’ll go over those next week. It’s indicators of stuff you do. There are only certain things that can actually be observable. It’s really clear if you’re just like—it’s observable if there’s no opening. It’s super observable if your class is over, and you’re like, “All right. See you tomorrow,” and there’s no closure. You can’t imagine how many times we’ve seen that. You just did a whole lesson, and the kids are like, “All right. Later.” Then you just walk out the door with no sense of what happened, or wrapping it up, and stuff like that. (School A, instructional coach)
Interestingly, the SET observation or scoring guide was not directly used when teachers planned a lesson for the SET submission, but several planning guides were available:
Do you guys all have the lesson plan template that you use?
Just the [SCHOOL] one. Do you recommend like the [PROVIDER] one [i.e., SET tool]?
Not necessarily, I think all of them have the same stuff.
In the next session, coaches handed out a guide that explained the various phases of an effective lesson. Given disorientation among the staff, this handout summarized the main aspects derived from the various tools that were in use in the school. There was still confusion at the end of the session as to what the expectations for good lessons were. In the following and last session, the SET tool was used directly. Teachers studied the ways of introducing new content through teacher modeling or co-constructing dialogue with students, suggested by the instrument. Working in pairs, they compiled a list of elements of good teacher moves (e.g., questions, scaffolds) for modeling or co-constructing. Then, they watched a video lesson and traced each SET indicator for the focal lesson segment. The same day, teachers repeated this exercise in their department teams. In the department meetings, teachers watched their colleagues’ videos and practiced giving feedback according to specific behavioral indicators enumerated on the SET tool.
School B
In School B, attention to the clinical nature of the SET had already begun in the fall of 2013 ahead of the other two schools. Led and organized by the school’s instructional coach, the practice of lesson study was fully structured around the SET tool and video records. Guided by these artifacts, the lesson study was focused on analyzing different phases of effective lessons by watching videos of teaching segments. The leadership team of this school saw value in the precision of the SET tool and they communicated it to their staff:
It’s kind of exciting to me just looking at that data knowing that we did target. With each of you individually, in some way we did talk about the checking for understanding phase and how we were going to improve it. Then there was like pretty substantial—it was almost a full point of growth on a four-point scale. That’s pretty substantial.
We can get better.
We can get better at the things we focus on.
The lesson study cycle usually began with a theoretical review of essentials of each lesson dimension. The teachers asked questions, shared their intuitive understanding, and came to an agreement on what the behavioral indicators of the tool were signifying:
Does somebody have their own way that they think—their own nutshell of what they think of as modeling, like in their own words, based on this or what you had from before? What is the modeling phase really?
I think it’s providing a very clear—you’re teaching a concept. You’re teaching the skill, basically how you show that. Point it out very clearly or organize in some fashion that builds up to being explained.
I feel like it’s doing an example of what I want them to do with slowing it down and explaining it, deconstructing.
I also kind of think it’s about the zone of proximal development. You’re connecting what is known with what is new in a way that builds on people’s shared understanding in a room or kind of like shared capacity in the room. I think it’s like discovering together this is what we know. This is what we’re going to know. You’re kind of modeling for them in that, where those two meet. Then that’s leading to them being able to do it on their own.
This was followed by video analysis. The teachers took turns presenting artifacts. When the group ran out of time, they picked up the task in the next session. Video analysis would begin with the presenting teacher explaining the context up front, after which the group watched the video and analyzed it with the indicators of the SET tool. It was in the moment of feedback that teachers showed the most intense use of the SET. The tone was appreciative. “You always start everyone with a wow and make a suggestion with a wonder” (School B, 321). The feedback made references to indicators of the tool (italics). In this case, the teachers discussed the opening of a lesson:
You really had their attention. No one was side talking. My scene looks very different.
I thought there were really clear expectations on their behavior and what to do all along the way.
It looked like there was a visual aid there too that was probably very helpful.
In the beginning, there were a few students who were engaged when you were talking about what a thesis is, and I heard students responding and asking questions because it seemed like they were curious.
I wrote a list. I love how you had the visual projected. I love how you were talking about the government and something that’s current. You mentioned how the government had just gone back into business, and I noticed the students were interested in that. It was a good hook. I don’t know if they consider it a sentence starter, but when you had them start working, you said, “I believe blank because blank.” That was good structure in case they didn’t know what to do. I also loved that you gave them a time limit. You said, “Three minutes.” It kept the class going. And then, the elbow partners.
I thought you were really clear on stating what the objective was. Not only the expectations for what they should be doing, but where you were going with it and what the objective of the day was and that lesson. Then, being clear and breaking down the topic too.
Sharing one’s teaching with the group and observing it, especially through video, was seen as useful:
This is good stuff. Thanks, guys [ . . . ] Actually, I really enjoyed this process. I think this is better than being observed live because I’m able to hear myself, so that was cool. I feel really safe with you guys. I just wanna say that. I didn’t think I was gonna like the wonders, but I wanna thank you guys for that. I really enjoyed it, and I wanted more. (School B, 306).
Lesson study around the SET was held roughly every 3 weeks, for a total of five sessions. In the middle of Spring 2014, the administration decided to end the lesson study sessions due to a major organizational challenge that they faced that year. However, during the following year, lesson study continued, as the coach described:
So we have it twice a month. So out of our four meetings a month, two are lesson studies. We have it in two groups. So we meet all together. Similar from the first meeting of a phase is reading about that phase. Talking about it. What are some of our best practices are. Looking at the indicators. Then, watching a video from an external source like somebody else’s SET from another school that scored well on that phase. [ . . . ] Then, as we watch, we fill out indicators on our little blank rubric and talk about it. Then, we do a protocol. So then, we break into our two groups. It’s a Math/Science group and a Humanities group. We do a lesson study protocol where we bring the written lesson plan and get feedback from our colleagues. Then, the next week, we go and teach it and video it. Then, we come back and show the video. We do a protocol where they give feedback on the video. (School B, 301).
School C
The story of School C is more akin to School A, where the SET tool was loosely tied to organizational routines, but did not have the analytical power leveraged by School B. In School C, the instructional coaches felt strongly that the SET tool would need to be modified to better suit the needs of their teachers, many of whom were said to teach within a constructivist, rather than direct instruction, lesson framework. A new option for a constructivist lesson was added in the fall of 2013. The revised tool improved the acceptance of the SET in School C, but its use for professional learning was limited. A few teacher inquiry groups picked it up, and coaches thought it was useful for training new teachers. Participation in SET submission increased quite a bit that year, but this was partly due to administrators worrying about TIF money and asking their teachers to comply with the procedures.
Whether an inquiry group took up TIF-inspired artifacts and procedures depended largely on the initiative of the group leader. The Math inquiry group mostly shunned these artifacts and procedures or covered them briefly as a compliance item immediately prior to SET submission. For the Math group leader, the TIF procedures never lost their stain of being summative:
I have never done the SET before even though I was supposed to have done it for the past two years, I just don’t do it. But people take it really seriously because I think people get really anxious about evaluation. And there’s just a huge amount of anxiety about evaluation and my belief is like well, my students are there every day. I’m just as anxious about every day because the kids are all in there; they’re all evaluating it, like that’s who I feel are my evaluators. Not a bunch of people in the room watching a video of me, I feel more—so I don’t—I’m like if somebody wants to evaluate me great. I’m evaluated by 17 year olds every day also. (School C, 223)
The instructional leaders from the humanities and science inquiry groups were both involved in the revision of the SET instrument for School C and saw merit in using the tool:
I definitely used the revised tool this year in a couple of forms. One is most basic starting point, I used it as a baseline for inquiry group to think about just the lesson components that need to be there in terms of kind of backed into it in terms of talking about what needed to be there for their SET, but it allowed for my repeated reminder that it’s good instruction. That we need an opening and a middle and we need to give feedback and check for understanding in the middle and have a closing, and that’s what we need to do, so I use it in that way. (School C, 221)
I ran one coaching cycle with a brand new teacher where we looked at—actually used the tool and the indicators to evaluate someone else’s openings for two weeks of lesson planning and that was pivotal for her. Like that experience of like of going through and saying oh, this opening engages students. This opening, wow, it engages students and it was retrieval. Like that, just that process was really beneficial for her to see to use it in terms of seeing someone else’s practice. (School C, 221).
The humanities inquiry group is typical of how TIF afforded conversations around instruction. Over the course of the spring, the group spent several weeks planning a lesson that they would all teach. In debriefing the lesson, the group used an observation rubric that was a mixture of the SET and a simpler, open-ended observation tool. The subsequent conversations tended to be about noteworthy but isolated episodes, individual student behavior, and “tricks of the trade” rather than about distinct lesson features that may have enabled or disabled a flow of student learning:
I taught a lesson in your class yesterday. Did you see most people having conversations?
Um-hm.
My argument would be that there’s a way of poking at them so that that happens.
So that they think deeply about it? Is that what you mean?
You just give them a little poke and walk away. Then they can’t talk to you about it. They have to talk to each other. You say, “Hey, there’s something missing there. What did we see today that’s missing? Oh, the gas. Yeah. You should figure out what to do with that.” Then you walk away. If you stand there, they’re going to say, “Is it this? Is it this? Is it this?” If you walk away, they only have themselves to talk to. Then I’ll notice so-and-so is not talking. Then I’ll look at their paper. “Oh, that’s a really good idea. You should share.” You’re essentially the counterbalance to the imbalance, if that makes sense.
But analysis of lesson structure, though less frequent, was present as well:
Like some of the moves we’ve made that I think have been really helpful. I think having that check for understanding was really helpful just because the pre-work we did, I’m just trying to think in terms of actual lesson structure. I thought we did a good job changing how we did our opening, the changes we made were helpful. And I felt it was helpful just to produce the formula right away and do the teacher model. (School C, humanities inquiry group).
Differences between schools
Resonance with the SET tool and procedure differed by school. School B embraced it most strongly, School A less strongly, and School C only in parts. This was due partly to the mission and structure of the schools, and partly to micro-politics during the phase of dissonance. Resonance originated from the Steering Committee of the TIF grant where the provider, the principals, other instructional leaders from the schools, and the researchers from the university congregated monthly for the whole duration of the project. School C leaders, distinctly academic in orientation and fully aware of the school’s solid academic reputation and superior performance in terms of the state’s academic performance measures, were certain that they produced a superior product that extended to instructional quality. School C also had a strong core of senior teachers who had been in the school for a long time, some of them dating from the time of its foundation. The principal was not going to upset his relationship with these senior teachers. As participation rates in the summative evaluations indicate (see Table 4), backlash against the system was most severe in School C. School A and School B leaders, by comparison, were proud of the social and learning climate they had created in their schools, but were far less certain of academic rigor and instructional quality. Viewing videos of lessons their teachers taught opened their eyes, and they were eager to learn as instructional leaders how to improve matters. In addition, since turnover was higher in their schools, more teachers were seen as being in need of support for basic lesson quality, and senior teachers were less of a force to reckon with in these schools.
Less experienced teachers, as we saw, were less perturbed by the evaluation scores during the dissonance phase than senior teachers. This dynamic played out especially in School C in which dissonance was felt more strongly than in other schools. School C senior teachers were convinced of their strength, and a less-than-expected evaluation score could not shake this belief. They saw value in the precision of the evaluation procedures and tools for novice teachers, but not for themselves. The principal in turn considered these senior teachers his instructional leaders; hence, he gave them wide discretion. As a result, in School C, resonance depended on the preference of the senior teachers who led specific inquiry groups or coached novice teachers.
In School B, teacher discourse gravitated toward commitments toward service and social–emotional support for students, in the face of which discussion of instruction took second place. As a result of the TIF grant, however, a senior classroom teacher was put in charge of leading and organizing professional development for the faculty. This teacher was convinced by the precision of the SET tool. She framed the SET as a formative tool, and she broke down the complexity of the evaluation tool into easily digestible steps that came to structure the regular after-school professional development sessions for the faculty.
Discussion
The purpose of the article is to understand how a performance management system, with its varied components mandated by the TIF architecture, influences teacher effort to improve instruction. We pursue this purpose by untangling dynamics related to evaluative judgments, monetary incentives, and procedures obligated by the system’s tools or artifacts. We saw from the literature that for proponents of incentive-driven performance management of the TIF type, the various dynamics often fuse: Presumably, evaluation tools spell out what matters for good teaching and diagnose the gap between desired and actual performance. Evaluative judgment compels teachers to take the need to learn and improve seriously, and rewards additionally spur teachers to strive toward desired outcomes. But we also know from the literature that incentive systems have rarely unfolded according to expected program logics. The pattern uncovered in this study is no exception in this regard, and it is consistent with earlier findings, for example, the symbolic and expedient nature of teaching evaluations (Murphy et al., 2013), the preference of educators for egalitarian monetary rewards signaling collective deservingness, the backlash against performance-contingent programs, and the tendency for leaders to shroud these types of programs in mystery (Marsh et al., 2011). But this study contributes something novel that has heretofore been less documented in the literature: Incentives, evaluative judgments, and obligated procedures may intertwine, but also crowd each other out. And discrete elements of the system may resonate differently over the life cycle of the system, given enduring school cultures.
In a nutshell, we saw that the interaction of evaluation, bonus pay, artifacts obligated by the system, and teacher learning around instruction shifted over three distinct periods that we called consonance, dissonance, and resonance. In the consonance phase, all elements fused, but this fusion was dependent on a selective perception and interpretation of what the system was all about. At the outset, for the three charter schools, the ideas associated with performance management did not seem to conflict with their established culture of social justice and espoused self-determination. In the consonance phase, attracting additional money was a strong, if not primary motive, but serving the adult learning needs of the schools through a tighter evaluation system was of importance as well. Incentivizing work was not an unfamiliar idea, but the logic of managing through differential pay for measured performance remained fuzzy. The leaders did make a direct connection between a strong evaluation system and teacher learning, but were not clear about such a direct relationship between incentives and performance. In the dissonance phase, the elements became discordant with each other and with the established values of the schools. The desire to learn in a formative way decoupled from evaluations as teachers’ and administrators’ concerns about the system’s faulty incentive function took over. In the resonance phase, the pattern reversed. The incentive function (summative evaluations, bonus pay) was largely rebuffed by the customary values and norms of the organizations and relegated to the periphery of the schools’ attention. Once the power of incentives became latent, first administrators and then teachers came to interact with the two main artifacts obligated by the system, videos of lessons and the clinically oriented SET observation tool, in novel ways.
In conjunction with professional development supplied by the provider for the instructional leaders, the videos and the clinical observation tool became resonant with internal concerns for higher quality teaching. The artifacts reconnected to the initial concern for precise feedback. Once the tools were overtly decoupled from incentives, they could become affordances for learning, though the obligation to engage with them to secure federal money, and the evaluative discomfort caused by external evaluation and differential bonus awards remained latent forces.
Yet resonance does not mean that the tools and practices obligated by TIF became powerful drivers of teacher learning in all schools. They at first pricked the surface and began to seep into established practices. In all three schools, the procedures advanced the idea of focusing attention on lessons and on studying the evidence from lesson delivery. Only one school truly engaged with the challenge of producing a Sample of Effective Teaching as an occasion for formative learning. The teachers in this school came to appreciate the clinical precision of the SET tool, a precision that heretofore had been absent. One school adapted the system’s lesson observation tool and largely used simplified versions for their professional learning. Another school translated the idea of clinical observation into an open, nonclinical inquiry format that had been their traditional approach, and only used the SET tool for novice teachers. It is important, however, to note that the resonance we found was impossible until faculties in the schools had blunted the power of the incentive function which came to be seen as controlling or endangering valences (we all deserve more money because of our service) and autonomy (we are about growing as persons). It was not until then that the tools’ affordances for learning could come into view and reconnect to initial desires to learn and grow.
We see two substrata in the data, one related to formative learning and feedback, the other to summative evaluation and incentives. As to learning, TIF began with the hope and articulated need among faculty that the evaluative side of TIF may contribute to teacher learning, most notably by making feedback on instruction mandatory, more frequent, and more precise. In the beginning, TIF was not associated with summative evaluation but with formative feedback promoting learning. This made sense given the context of an adult learning culture that valued high visibility, collegiality, and engagement in personal growth. As to incentives, TIF began with the notion of collective deservingness of rewards. TIF monies were not interpreted as performance contingent incentives, but as inducements for good work all around. This made sense, given the strong service ethic and communal orientation of the faculties, including administrators and teachers.
The TIF incentive function had to find ways to enter educators’ initial assumptions and established routines. Summative evaluation (e.g., SET scores) and differential incentives (bonus awards) needed to anchor in the substratum of collective deservingness and desire for formative learning. The literature on performance management, evaluation, and incentives, reviewed earlier, would predict some dissonance in this process. But under the right kind of circumstances, summative evaluation and formative learning may fuse, and differential bonuses may become markers of competence and deserved reward. Alternatively, summative judgment and differential reward may be rejected and the system may be abandoned wholesale. In each case, incentives and procedures either work together or fail together. This is not the case here. In our study, we see separate dynamics and we can disentangle them.
The literature discusses several explanations for the success or failure of evaluation and incentive systems in schools. These explanations apply here as well:
Implementation quality: Our data show that data management was complex and the construction of fair metrics eluded the local TIF leaders. Moreover, capacity building around the new metrics was very limited.
Reward expectancy: Our data show that teachers found it hard to connect monetary rewards to effort and performance.
Clarity and procedural or distributive fairness: Our data show that summative evaluations were viewed as low in all these respects.
But our study demonstrates the salience of two other explanations that have found less attention in the literature. We need to disentangle the power of incentives from the power of artifacts (e.g., tools, procedures), each perhaps contributing independently to the success or failure of performance management systems. In the longitudinal analysis, we saw how the dominance of one element in the lives of schools alternated with the dominance of the other. Evaluation procedures and monetary rewards, rather than being complements, could also be substitutes in relationship to instructional improvement or quality, something that West’s quantitative analysis of the Minnesota Q-Comp program (West, 2012) hints at. In the terms of crowding-out theory (Frey & Osterloh, 2002), controlling incentives in our qualitative study may have crowded out the formative affordances of evaluation procedures (but not the desire to learn and grow) until values of collegiality and self-determination anchored in the deep culture of the three schools crowded out the controlling effect of incentives. Under these circumstances, bonuses placed on evaluation scores diminished the chances of learning from evaluation. Learning could reattach itself to the clinical tools once the controlling effects of incentives were buffered. Crowding-out theory captures these sorts of effects and assumes that they are contingent. They do not depend on the nature of the command, incentive, or obligated procedure per se, but on whether the latter are perceived as controlling, superseding, or invalidating existing autonomy, service, or pleasure motives. It is a testament to the strength of the collegial culture and the intention of the leaders to maintain this culture in the three schools that they could “sort through” the elements of the TIF system and find ways to use the system to their own desired ends.
The organizational autonomy of the three stand-alone charter schools may have facilitated this sorting. It is doubtful that more regulated and controlled public school district environments (Rice et al., 2012; Rice & Malen, 2016) would have given school leaders the discretion which they took advantage of in the environment studied here.
Throughout all phases, whether consonance, dissonance, or resonance, the tools, that is, the SET and the videos, and the procedures, that is, the formative quarterly conferences and external scoring, were always there as obligations to be engaged with, and as a latent nuisance to justify the flow of money. Over time, they seeped into faculties’ shared cognitions. Befitting the established culture of the schools, TIF induced behaviors; it never incentivized performance. Once the TIF obligation is gone and the money is no longer available, would the use of the system tools continue and the idea of regular and precise feedback on teaching maintain its hold?
The narrative of the three charter schools’ engagement with incentives and evaluations is open-ended. No simple lessons for the policy and practice of incentives and evaluation reform can be derived, but two implications for reformers are nevertheless noteworthy. One is an appeal to look at the process and outcome of a complex project such as TIF-like performance management not in a wholesale manner by asking whether “it” worked, but by disentangling the “it” into its distinct dynamics. A longer term view, such as this longitudinal study afforded, helps one see effects emanating from one element of the complex system only to fade away and make room for effects emanating from others. The surprise may be, as it is in our case, that the TIF performance management project generated a variety of dynamics, but none were directly related to its core mission, presumably envisioned by federal policy makers who gave it its name, that is, to enable leaders to improve performance through the power of monetary incentives.
A second implication has to do with the sequence of implementation phases that the project experienced. Implementation began, not unlike many such complex reform projects, with consonance, a phase of initial adoption that draws from a variety of uncertainties, fuzzy motives, and multiple constituencies. The project, then, experienced a period of dissonance once the full scope of what was expected became visible. From the implementation literature, this sequence is not unexpected for projects that are voluntarily adopted. (For those that are forced on implementers, dissonance may set in right away.) Once schools discover incompatibilities between project demands and established goals, norms, values, and capabilities, they either drop or adapt the project. The subsequent period of resonance in the schools could be interpreted as such an adaptation, but one that constituted a “lethal mutation” for the core intent of the TIF program associated with incentives.
A period of dissonance in the implementation sequence should be expected and should not deter reformers from pursuing complex change projects since delayed realizations, insecurities, and conflicts may be “natural” companions of opening up to new challenges and learning new things. But while dissonance in the middle of the innovation life cycle should not unsettle the reformers’ countenance, consonance and resonance should be reversed. What if the implementation sequence had been resonance–dissonance–consonance? Then, the school leaders would have been mindful of the needs and desires associated with the new evaluation system, namely, more regular and precise feedback for learning. They could have used the artifacts linked with evaluation, namely, videos of submitted lessons and the observation tool, to create problem awareness among their teachers and develop a vision of desirable performance. Only then, once resonance with the existing needs and desires among the teachers was established, might they have gradually introduced further elements of the performance management system. Clearly, some aspects of this system would be anxiety provoking for some teachers, whether, as was the case here, for more experienced teachers or, as might be the case, for novices. After some necessary dissonance, consonance (no longer incipient, fuzzy, or facile) may become a solid base on which further experimentation with new practices may rest. Cynthia Coburn (2006) has shown that policy implementation in schools has many characteristics of social movements. Social movements address problems and advocate solutions by creating resonance. Resonance, despite or in the midst of dissonance, may result in subsequent consonant forceful action. The problem is that rarely do policy makers allow for such a dynamic to unfold.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Appendices A and B are available in the online version of the article.