Abstract
Teacher accountability based on teacher value-added measures could have far-reaching effects on classroom instruction and student learning, for good and for ill. To date, however, research has focused almost entirely on the statistical properties of the measures. While a useful starting point, the validity and reliability of the measures tell us very little about the effects on teaching and learning that come from embedding value added into policies like teacher evaluation, tenure, and compensation. We pose dozens of unanswered questions, not only about the net effects of these policies on measurable student outcomes, but about the numerous, often indirect ways in which these and less easily observed effects might arise. Drawing in part on other articles in the special issue, we consider perspectives from labor economics, sociology of organizations, and psychology. Some of the pathways through which these policies operate influence teaching and learning directly and intentionally, whereas others are indirect and unintentional. Although research is just beginning to answer these questions, an initial theme of recent research is that both the opponents and the advocates are partly correct about the influence of these policies.
The development of value-added methodologies and, in particular, their use to evaluate teacher effectiveness, has become an issue of intense interest and concern within the educational community. This interest is well placed. If the arguments on either side turn out to be correct, the use of teacher value-added measures could have a greater influence on classroom instruction than perhaps any single reform in decades—for good and for ill.
Like many “new” policy ideas, this is one with old roots. The term “value added” is relatively new, but the underlying concept of measuring teacher performance based on student test scores goes back more than a century (Reese, 2013). Even the idea of focusing on student growth—the underlying approach of value added—is almost a hundred years old (Lancelot, Barr, & Betts, 1935).
But nothing in the past compares with the wave of value-added-based teacher accountability brought on by President Obama’s Race to the Top. Since the program’s inception in 2009, academic journals have been filled with articles about the validity and reliability of value-added measures. Books and reports have been written that describe the pros and cons of value added (Goldhaber, Harris, Loeb, McCaffrey, & Raudenbush, 2015; Harris, 2011; McCaffrey, Lockwood, Koretz, & Hamilton, 2003). Others have expressed skepticism (American Statistical Association, 2014; Amrein-Beardsley, 2014; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; National Research Council, 2010), whereas prominent foundations and some think tank reports have been more positive (Bill & Melinda Gates Foundation, 2012; Glazerman et al., 2010; Gordon, Kane, & Staiger, 2006; R. Hanushek & Rivkin, 2004).
The various sides look at the same evidence and see a different picture partly because the evidence we have is so far removed from the decisions that have to be made. The issue is not whether value-added measures are valid but whether they can be used in a way that improves teaching and learning. How do educators actually respond to policies that use value-added measures? On this question, we know very little. Therefore, in the debate about these policies, perspective has taken over where the evidence trail ends.
Roughly 6 years after the first national wave of teacher accountability based on value added, this special issue considers the key questions, the best available evidence beginning to emerge on the use of value added, and the large remaining holes in the literature.
Unanswered Questions
Teachers’ responses to value-added-based teacher accountability will depend on a mixture of their perceptions, attitudes, beliefs, information, motivations, and incentives. This leads to several obvious questions: How do the educators—whose practices we are trying to change—make sense of value-added measures? Do they believe they have the ability to change their performance in a way that will show up in better results? Since value-added-based accountability is external to schools, what do they understand about the measures, how they were created, and for what purpose? How are these perceptions shaped by the motives of policymakers as well as the statistical properties of the measures, the complexity of the methods used to create them, and the inconsistency in performance categorization between value-added and other measures?
Ultimately we are interested not in educators’ perspectives per se but in how these perceptions shape behavioral responses. Does the use of value-added measures, as critics theorize, lead to more teaching to the test and shallow instruction, aimed more at developing basic skills than at critical thinking and creative problem solving? Does the approach further narrow the curriculum and lead to more gaming of the system or, following Campbell’s law, distort the measures in ways that make them less informative? In the process of sanctioning and dismissing low-performing teachers, do value-added-based accountability policies sap the creativity and motivation of our best teachers?
Adding to this focus on possible individual perceptions and responses, Johnson (this issue, pp. 117–126) introduces a broader, organizational perspective that leads to still more questions. Does the use of value-added measures reduce trust and undermine collaboration among educators and principals, thus weakening the organization as a whole? Are these unintended effects made worse by the common misperception that when teachers help their colleagues within the school, they could reduce their own value-added measures? Do the measures encourage or discourage the types of schoolwide improvement initiatives that prior research has found promising? Aside from these incentives, does the general orientation toward individual performance lead teachers to think less about their colleagues and organizational roles and responsibilities? Given the interdependence of individuals in schools, does “pulling on one string” in the organization with value-added measures lead to unintended consequences elsewhere? And returning full circle, how do these organization-level factors filter down to the classroom?
The above questions focus on the unintended consequences of using value-added measures, but are there benefits, intended or otherwise? The main underlying theory of these policies is that teacher accountability will motivate teachers to work harder and smarter and help attract and retain only those who are successful. Does this happen in practice? Does the increased scrutiny lead educators to work harder and smarter in helping their students? Does the recognition that comes with high performance ratings encourage a stronger focus on the student outcomes on which the educator performance measures are based? Are teachers more likely to demand and seek out instructional leadership from their principals, peers, coaches, and other sources?
And what about the indirect effects? Has the greater focus on classroom observations spawned by value-added measures led to better instruction, perhaps through more frequent peer feedback? Do these systems, of which value added is just one part, increase cohesion around common goals and expectations at the school level? As Johnson (this issue) points out, cohesion and common goals are key precursors of school improvement.
The supply and the quality of the teaching workforce may also be influenced indirectly by value-added-based teacher accountability. For example, are teacher and administrator preparation programs helping to prepare future educators for value-added-based accountability by explaining the measures and their potential uses and misuses? Are teacher preparation programs focusing more on improving or expanding their focus on pedagogical content knowledge, lesson planning, classroom management, and real-world practicums that are thought to improve academic achievement? Is value-added-based accountability leading education administration programs to focus more on instructional leadership so that future principals are better able to provide constructive feedback and development opportunities for teachers to improve their measured performance? Or like earlier waves of performance-based compensation (Murnane & Cohen, 1986), do preparation programs see value added as another passing fad, not worth the attention?
Building on the idea’s roots in labor economics (Goldhaber, this issue, pp. 87–95), it is equally important to ask, Do value-added-based accountability systems attract more academically talented teachers, raising expectations about the quality of school work environments, professional prestige, and/or potential compensation? Alternatively, might negative perceptions and headlines that reinforce the fears about value added noted above combine to dissuade promising prospective teachers from entering the profession?
Educator responses may also depend on the design and systemwide implementation of value-added-based policies. This may be the most important topic of all. If there is one common theme of the policy implementation literature, it is that policies rarely affect practice as intended (e.g., McLaughlin, 1987; Wilson, 2003). Educators try to make sense of reforms and will often follow through when they have sufficient understanding, support, and capacity. But when these conditions do not hold, educators will tend toward minimal compliance, and school leaders will buffer their teachers from unwanted intrusions (Honig & Hatch, 2004; Rutledge, Harris, & Ingle, 2010).
In this vein, are value-added-based teacher accountability policies more effective when care is taken to communicate with teachers and principals about the purposes and details of the policies? Does it matter how much weight is given to value-added measures relative to other measures such as classroom observations? Is the potential success of the policies stymied by the antagonistic tone of some who support greater teacher accountability? Do schools with greater fidelity of implementation see more positive behavioral responses? Does gaming the system among some educators undermine the credibility of the policies among others who might otherwise support and seek to fulfill their spirit?
In the absence of evidence of how the use of value-added measures has influenced practice, researchers have either sidestepped the questions or speculated about what the answer might be. For example, many researchers have produced hopeful estimates of effects on student achievement from dismissing certain percentages of low-performing teachers based on value-added measures (Gordon et al., 2006; E. A. Hanushek, 2010; R. Hanushek & Rivkin, 2004; Wright, Horn, & Sanders, 1997). But the simulation approach falls short (Goldhaber, this issue). As the authors of these studies usually acknowledge, the simulations are too simplistic. We simply cannot assume there will be no negative effects on the supply of teachers, no gaming the system, no misunderstanding, and no resistance that reduces the fidelity of implementation. Although the authors usually acknowledge the speculation involved and attempt to avoid a false sense of precision, they usually remain confident at least in the “order of magnitude” of their effects, that is, whether they are big or small. But even these broad assessments are questionable. How is it possible to draw conclusions about the order of magnitude of the benefits without some notion of the order of magnitude of the many factors assumed away? Given the sheer number of unanswered questions, we know far too little to predict what these systems will mean for teaching and learning.
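To make this concern concrete, the following is a minimal sketch in Python of the general kind of dismissal simulation discussed above. It is not the model used in any of the cited studies. The function name `simulate_dismissal`, the assumed reliability of 0.4, the 5% dismissal share, and the `replacement_penalty` parameter (a stand-in for any negative effect on teacher supply) are all hypothetical choices made for illustration only.

```python
import random
import statistics


def simulate_dismissal(n_teachers=10_000, dismiss_share=0.05, reliability=0.4,
                       replacement_penalty=0.0, seed=0):
    """Change in mean true teacher effectiveness (in teacher-level SD units)
    after replacing the lowest-rated teachers.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # True effectiveness is standard normal. The measure adds noise so that
    # var(true) / var(measured) equals the assumed reliability.
    noise_sd = ((1 - reliability) / reliability) ** 0.5
    true = [rng.gauss(0, 1) for _ in range(n_teachers)]
    measured = [t + rng.gauss(0, noise_sd) for t in true]

    # Dismiss the teachers with the lowest measured (not true) value added.
    n_out = int(n_teachers * dismiss_share)
    order = sorted(range(n_teachers), key=lambda i: measured[i])
    dismissed = set(order[:n_out])

    # Replacements come from the same distribution, shifted down by the
    # assumed penalty (for example, a weaker applicant pool).
    after = [rng.gauss(-replacement_penalty, 1) if i in dismissed else true[i]
             for i in range(n_teachers)]

    return statistics.mean(after) - statistics.mean(true)


baseline_gain = simulate_dismissal()
gain_with_supply_effect = simulate_dismissal(replacement_penalty=0.5)
```

With a common seed, the two projected gains differ by the dismissal share times the assumed penalty, so the projection is only as good as the guess about the factor that was assumed away. The same logic applies to gaming, misunderstanding, and resistance, none of which this sketch attempts to model.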
With this issue, we try to move past speculation and toward a more informed—and empirically driven—debate, gathering some of the best available evidence on these key questions of educational practice.
The Evidence
If the early returns are any indication, it appears that both the critics and the advocates of value-added-based teacher accountability have valid points.
Perhaps the most common theme of the articles in this issue is that teachers and principals trust classroom observations more than value added. This conclusion emerges from both the Goldring et al. (this issue, pp. 96–104) analysis across eight districts and an in-depth case study of Chicago (Jiang, Sporte, & Luppescu, this issue, pp. 105–116). This stronger support appears rooted in the more proximal and formative nature of classroom observations. From the view of school leaders, what they see in the classroom relates to what teachers actually do, which not only gives them more faith in the measures but provides the types of information educators need to get better. Teachers—especially the better ones—want to know what exactly they are doing well and doing poorly. In this respect, value-added measures are unhelpful.
Combining this finding with other evidence also suggests that the distrust in value-added measures may be partly due to frustration with high-stakes testing generally and/or misunderstandings about the measures. Research on earlier incarnations of accountability, prior to the focus on individual teachers, suggested that educators are less supportive of accountability than are noneducators (Howell, West, & Peterson, 2007), and this may have been undermining the morale of teachers (Jones et al., 1999). There also seem to be misunderstandings about the validity and reliability of value-added measures compared with other measures. Although it is arguably hard to make fair comparisons across these different types of measures, research seems to suggest that value-added measures are about as reliable and stable as classroom observations (Bill & Melinda Gates Foundation, 2012; Harris, 2013) and perhaps just as valid. Whatever the reason for the discrepancies between research findings and educator perceptions, the distrust of value-added measures has serious implications for how they are interpreted and used.
Support for value added also appears stronger among administrators than teachers. This is most evident in the direct comparison of support in the Chicago analysis (Jiang et al., this issue). The stronger support among administrators is perhaps not surprising because the teachers are the ones being evaluated with the measures, not the principals. For administrators, value-added measures provide an additional source of information and authority.
But principals are still somewhat skeptical, not seeing value-added measures as “real data” (Goldring et al., this issue). Does this skepticism affect how they use the measures? Principals seem willing, for example, to reassign teachers to different grades and subjects based on value added (Goldring et al., this issue). This combination of perceptions and decisions might be a case where modest skepticism is translating into modestly aggressive use. Decisions about which teachers are in which grades and subjects are important but arguably lower stakes than decisions about which teachers keep their jobs.
Ballou and Springer (this issue, pp. 77–86) highlight the ways in which the data collection process may unintentionally reduce the validity and credibility of value-added measures. First, teachers often go through “roster verification” that allows them to confirm the students for whom they are responsible. The authors find that, whereas the vast majority of students are assigned to a teacher, those with lower scores are less likely to be claimed (Ballou & Springer, this issue).
Another key data collection issue, one that arises in any test-based accountability system, is who administers the tests. Ballou and Springer (this issue) find that student test scores are higher when teachers administer tests to their own students. Although these findings about test administration and roster verification are not absolute proof of gaming the system (or cheating), and these influences seem concentrated among a very small number of teachers and students in their data, such unhealthy responses could undermine the credibility of the system and prospects for long-term success.
Goldhaber (this issue) describes the complex labor market responses, including potential effects on who enters and exits the profession and how quickly teachers develop. One of his key observations is that most of the research to date on value added (e.g., randomized trials of performance pay) captures only some of the potential effects. He rightly focuses on randomized trials for their high degree of internal validity but also notes that they usually cannot identify long-term outcomes or indirect influences that arise when programs are scaled up. For example, in randomized trials of value-added-based performance pay, the indirect and delayed effects on who enters and exits the profession are missed. More broadly, Goldhaber emphasizes that our knowledge of the full set of potential labor market responses is extremely tenuous.
Policy Implications
Even with more evidence, two factors will continue to complicate interpretation. The first is that value-added measures are almost always bundled with other measures, especially classroom observations. This makes it hard to separate the influence of value-added measures from the other measures.
In addition to the mix of measures, policies vary in how those measures are incorporated into personnel decisions, and even in which policies they are part of. Accountability varies in both its intensity and the types of decisions it can be designed to influence: tenure, certification, compensation, promotion, and dismissal, to name a few. Value-added measures can also be used simply to provide information to teachers, with limited formal accountability. So to say that we are examining “value-added-based teacher accountability” is like saying we are studying the effects of “standards” or “curriculum.” As the research literature moves forward, it should become possible to draw finer distinctions among these policies.
The combination of the mixture of measures and how they are used in specific decisions may influence educators’ perceptions. Recall that support for value-added measures differs both from support for the accountability policies within which they are used and from support for classroom observations. In Chicago, educators are at best modestly supportive of value-added measures, but they are noticeably more positive about the larger teacher accountability systems within which the measures are used. Although there is limited evidence on this point, it seems likely that support for value added among educators will decrease as the stakes increase.
These early-stage strengths and weaknesses of value-added-based policies should also be placed in perspective and considered against a backdrop of the possible policy alternatives (Goldhaber, this issue; Harris, 2009, 2013). Perhaps the two most basic criteria for any personnel-related policy are the following: (a) Does the policy at least attempt to distinguish teachers based on their performance or effectiveness? and (b) Does the policy provide teachers with regular, useful feedback? The combination of certification, tenure, and a single salary schedule has failed in this regard (Harris, 2011). Can value-added-based teacher accountability do any better? Framing the question in this way is not some subtle ploy to justify suspect policies, as some would have it (Collins & Beardsley, 2011). Rather, it is the fundamental question.
The argument for policy comparison also cuts both ways because, in other respects, the evidence here suggests that value-added measures are comparatively worse than other measures. Value-added measures suffer from much higher missing data rates than classroom observations (Jiang et al., this issue). The results here also confirm that the timing of value-added measures—that they arrive only once a year and during the middle of the school year when it is hard to adjust teaching assignments—is a real concern among teachers and principals alike (Goldring et al., this issue). Classroom observations, in contrast, can be made at regular intervals and with almost immediate feedback.
Ultimately, there are three general directions where the value-added-based teacher accountability strategy can go: (a) reduce external teacher accountability intensity and return to a system where teachers work with less test-based pressure and greater job security (in what Johnson [this issue] calls an egg-crate model); (b) maintain accountability intensity, but reduce or eliminate the use of value added and/or student outcomes generally; or (c) make minor tweaks in the current structure, but maintain both the intensity of accountability and the use of value added, classroom observations, and other measures. Although the results in this special issue are just the beginning, they are suggestive about the strengths and weaknesses of these options.
As an example of small fixes, Ballou and Springer (this issue) highlight the potential ways to improve the design of the data collection process. It would be unreasonable not to involve teachers in roster verification. One possible solution is to require that all students be assigned to at least one teacher (Harris, 2011). Likewise, it would be straightforward to have teachers administer tests for other teachers’ students, not their own. Both of these steps involve some administrative burden, but the system cannot succeed without credibility. These seemingly small changes could meaningfully improve implementation and effectiveness.
A more substantial shift involves the use of value added as part of a “screening process” (Harris, 2012, 2013b) in which value-added measures are used to identify teachers whose performance initially appears questionable, but final determinations of performance are based entirely on other information subsequently collected (e.g., classroom observations). By taking value added out of the final personnel decisions, we can avoid the missing data problem and address educators’ other concerns about value-added measures, all while avoiding the concern of accountability supporters that dropping value added would “take the foot off the gas” of accountability and necessarily lead to almost all teachers’ receiving high ratings (Weisberg, Sexton, Mulhern, & Keeling, 2009). With the screening approach, value added becomes one part of a system of checks and balances that prevents the worst excesses and may improve credibility.
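The two-stage logic of the screening approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation of any existing evaluation system. The names `Teacher`, `screen`, and `final_rating`, the cutoffs, and the rating labels are hypothetical, as is the choice to leave teachers with a missing value-added estimate unflagged at the first stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Teacher:
    name: str
    value_added: Optional[float]                # None when no estimate exists
    observation_score: Optional[float] = None   # collected after screening


def screen(teachers: List[Teacher], va_cutoff: float) -> List[Teacher]:
    """Stage 1: flag teachers whose value added falls below the cutoff.

    Assumption: a teacher with no value-added estimate is not flagged here.
    """
    return [t for t in teachers
            if t.value_added is not None and t.value_added < va_cutoff]


def final_rating(teacher: Teacher, obs_cutoff: float) -> str:
    """Stage 2: the rating depends only on subsequently collected
    observation evidence; value added plays no role in this step."""
    if teacher.observation_score is None:
        raise ValueError(f"No observation evidence collected for {teacher.name}")
    return "ineffective" if teacher.observation_score < obs_cutoff else "effective"


# Hypothetical data, for illustration only.
teachers = [Teacher("A", -0.3), Teacher("B", 0.2),
            Teacher("C", None), Teacher("D", -0.5)]
flagged = screen(teachers, va_cutoff=-0.2)      # flags A and D
flagged[0].observation_score = 3.5              # follow-up observations
flagged[1].observation_score = 1.5
ratings = {t.name: final_rating(t, obs_cutoff=2.0) for t in flagged}
```

In this sketch, teacher A is flagged on value added but rated effective once observation evidence is collected, which is the sense in which value added acts as a check rather than as the basis for the final decision.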
Although evidence is lacking on this option as well, the screening example highlights an important fact about the use of value-added measures: that the intensity of accountability is a separate issue from the use of value added. If we want, we can have “strong” accountability without value added or “weak” accountability with value added. Therefore, as we interpret evidence about value-added-based teacher accountability, these two types of policy implications should be distinguished.
An Agenda for Future Research
We have identified a long list of important questions, and the studies in this issue provide preliminary answers to some of them. As researchers move forward, they should consider these new findings as well as prior evidence on related policies. Drawing on theoretical bases in economics (Goldhaber, this issue) and sociology (Johnson, this issue), we have rich literature to work with: the effects of school-level accountability (e.g., Booher-Jennings, 2005; Figlio, 2006; Figlio & Winicki, 2005), use of data (Coburn & Turner, 2012), merit pay (Murnane & Cohen, 1986; Podgursky & Springer, 2013), school-level supports for school improvement (Sebring, Allensworth, Bryk, Easton, & Luppescu, 2006), the organizational factors that reduce turnover (Ingersoll, 2001; Kraft & Papay, in press), professional communities, teacher labor markets (Goldhaber, this issue), teacher sense-making (Coburn, 2005), and policy implementation (McLaughlin, 1987; Spillane, Reiser, & Reimer, 2002; Wilson, 2003). This prior work can help guide our questions and provide ways to answer them.
At the same time, we cannot rely solely on this prior related research because it does not provide complete analogies to present circumstances. Evaluations of the use of teacher value-added measures are likely to be different from traditional school-based accountability, for example. Although both use student test scores, there is much wider acceptance of school-level measures generally, and they have better statistical properties. Similarly, merit pay is only one specific way in which value-added measures could be used. And even though the fidelity of implementation may be found wanting, this by itself does not necessarily make for bad policy. So although there is no need to start from scratch when studying the use of teacher value added, the distinctive features of the measures and associated policies must be recognized in the process.
This special issue makes clear that critics are correct in saying there is little evidence about value-added-based teacher accountability (Ravitch, 2009). But we cannot lose sight of the ample evidence against the traditional model (Harris, 2010). Something ought to change. Value-added-based teacher accountability may be part of the solution, or there may be other, wiser approaches. For researchers, the challenge is to inform this important conversation with rigorous evidence that addresses the many unanswered questions and helps us learn from the experimentation now going on around us. Successful policy design and implementation are ultimately the product of evaluation and adaptation, and value-added-based teacher accountability should be no exception.
