Abstract
The provision of public schooling in the United States has primarily been the states’ responsibility, but states generally lack the capacity to manage day-to-day school operations. Thus, states delegate responsibility to districts while maintaining some oversight. Forms of oversight include regulations and political and market-based accountability. However, these can only do so much in holding schools accountable for providing high-quality schooling. Administrative accountability based on student outcomes and school process measures presents an alternative to complement other accountability mechanisms. Standardized measures of performance used for administrative accountability can better align curriculum with state standards, improve quality, and signal the skills that society wishes for students to build. However, they can be counterproductive if they are not reliable, valid, or comprehensive. We suggest in this article that no measure is perfect and that the usefulness of test-based accountability depends on whether the measures enhance educational opportunities and reflect shared goals with reliability, validity, and comprehensiveness.
Although “accountability” has become a contentious term in contemporary debates about the quality of education, it has a hallowed place in American history. At least from the early nineteenth century, during the period of “common school” reforms, questions about the quality of instruction and the unequal allocation of educational resources were fueled by the evolving American principle of holding public officials accountable to the citizens (Tyack 1974). As one tool used to provide parents and the public with information on how their schools were performing—an example of accountability—standardized testing emerged in the 1830s (Vinovskis, this volume; U.S. Congress Office of Technology Assessment 1992; Kaestle 2013). The phrase “testing and accountability,” therefore, has long and deep roots in U.S. educational history (e.g., McDonnell 2004; Feuer 2012).
Despite the long-standing presence of testing and accountability in the American public school system, discussions of the uses and purposes of testing remain fraught. In this article, we provide a framework for evaluating the current use of standardized measures of performance for providing oversight over schools and explore prospects for the future of testing as a tool for educational policy. We begin by describing models of accountability—regulatory, political, market, and administrative systems—that have evolved in many areas including, but not limited to, education. We then turn to the specific challenges of applying those models to the goal of enabling coherent oversight of a schooling system that is, by design, diffuse and fragmented. Though the provision of schooling in the United States has primarily been the responsibility of states, states generally lack the capacity to manage day-to-day school operations and delegate responsibility to districts—close to 14,000 nationwide—while maintaining some oversight. The complexities of structure and governance of schooling pose challenges to the design and implementation of coherent and effective accountability systems.
Our review of evidence—from research on accountability generally and from decades of trials of test-based accountability specifically—reaffirms a familiar finding, namely, that all measurement systems have imperfections. Consequently, we suggest that the criterion for judging the relative merits of testing as a source of accountability data should be whether, on balance, the benefits outweigh the costs and risks. Our “bottom line” is to favor continued development of reliable, valid, and comprehensive measures for test-based accountability, provided that they advance educational progress and preempt potential negative consequences.
Oversight and Accountability in American Schools
Throughout its history, U.S. education has been decentralized (Kaestle 1983; Feuer 2006). In the seventeenth and eighteenth centuries, local communities created their own schools based on their priorities and values. Today, states remain largely responsible for providing public education and make operational decisions including those related to class size, content standards, teacher licensure, and graduation requirements. Federal sources contribute only approximately 8 percent of elementary and secondary schools’ budgets under the current finance model. 1 Recent passage of the Every Student Succeeds Act (ESSA) reinforced the tradition of a delimited federal role by shifting considerable authority for standards and accountability back to the states (see also Hanushek, this volume; Vinovskis, this volume).
States vary in their capacity, and perhaps their desire, to manage daily operations of schools. Most have consequently delegated a portion, and usually a large portion, of the responsibility for running schools to local educational agencies (LEAs), also known as districts, which have the advantage of being closer to schools and thus having a better understanding of local contextual factors. Still, despite this assumed advantage, unchecked local control can lead to substantial variation across schools and create or exacerbate disparities in educational opportunities for students within (and between) states. Districts and schools differ in their student populations, community goals, and track records of tried and successful implementations (O’Day 2002; O’Day and Smith 2016; Spillane, Reiser, and Gomez 2006).
To reduce such disparities, states exercise some form of oversight over school districts, often by relying on standardized measures of school performance designed to provide relevant information for assessing the extent to which local actors—teachers, principals, leaders, and so forth—are progressing toward state goals. These measures may capture school processes, such as observations from school inspectors; or they may focus on student outcomes, such as standardized tests. However, no measure is perfect. Measures may not accurately reflect true performance; that is, they may be unreliable. They may not measure performance over a domain of true interest; that is, they may not be valid for the goals of the state. They may not cover all valued domains; that is, they may not be comprehensive. The question, then, is, given the shortcomings of particular measures, whether the available metrics yield worthwhile information for decision-making and oversight, and whether additional or alternative measures would be beneficial.
Regulation, Politics, and Markets
Not all forms of oversight require standardized measures of performance. One approach that states have taken to provide monitoring of schools is setting regulations that define legal requirements and resource allocations ex ante. Regulations are primarily preemptive, intended to prevent unintended risks and consequences. For example, regulatory policy can provide strict guidelines for processes such as the hiring and firing of school employees and the appropriate use of funds and can also determine standards for school inputs such as the maximum class size and the necessary credentials for teachers and administrators.
These regulatory forms of monitoring and oversight tend to set a minimum on quality. They are not designed to provide the information needed to better inform future decisions and do not encourage improvement in the quality of the system beyond the floor. Furthermore, such regulations may have unintended consequences; for example, teachers may gravitate toward low-cost, low-quality certification programs to fulfill certification requirements. Some regulations may also be unnecessarily expensive and inefficient for achieving society’s educational goals. Studies of the effects of various regulations on student achievement and educational attainment have found both positive and negative impacts (Hanushek 1997; Hanushek, this volume; Jepsen and Rivkin 2009).
While regulations can usefully set floors on quality, other forms of oversight and accountability aim to improve quality above a floor. These approaches can use political processes, market forces, or administrative data to encourage schools to reach defined goals. In political accountability, elected officials are the actors making decisions about schools on behalf of their electorate. For example, the election of school board members for a particular school district by local residents creates an infrastructure for running schools. In market-based accountability, families act as the choosers of schools for their children. A school-choice system can allow families to withdraw their children and their monetary support for public schools if they are dissatisfied, in theory holding schools accountable for their performance and quality. Neither political accountability (e.g., school boards) nor market-based accountability (e.g., school choice) needs to rely on standardized measures, such as test scores, though often they do.
Political accountability and market accountability are common if not determinant in most democracies. Political accountability to some extent always exists in a democracy because constituents have the opportunity to choose who best represents their preferences (Lindblom 1980). State oversight of schools is a form of political accountability. But local elections and governing bodies, such as school boards, may be better able to use local understanding of needs and opportunities than are state-level bodies. Thus, many states decentralize control to local jurisdictions that have their own elected bodies overseeing schools. Similarly, market-based accountability is present in any society in which individuals have options regarding where they live and whether to send their children to private schools, though the ability to use these mechanisms of choice is often limited to affluent families. Intradistrict choice, interdistrict choice, charter schools, and voucher programs all bring choice to a far wider population. Unlike the preventive nature of regulations, both political and market-based accountability rely on education consumers reacting to policies and performance outcomes that are important to them. These forms of accountability systems are used in other areas, such as health care, where experience underscores the need to consider limitations—for example, if those doing the oversight or making the choices do not possess the ability and skills to “translate sound measurement into wise selection” (Berwick, James, and Coye 2003).
A system based purely on political accountability and market-based accountability may not be sufficient for state oversight of education. Communities vary in technical capacity and preferences, and as a result, local political accountability may not sufficiently reduce variation in educational opportunities across schools within a state. Similarly, market-based accountability relies on decisions reached by parents who vary not necessarily in how much they care about their children’s outcomes but in their understanding of the types of schools that will serve their children best. Unequal access to relevant information is a theoretical and practical impediment to effective markets and market-based accountability. In a more equal society, these considerations may not be an issue; but in the light of existing (and projected) demographic, economic, and social differences, reliance on political and market-based accountability is less promising. Given the limitations of political and market-based accountability, administrative accountability arises as an option to supplement other forms of accountability. We turn now to a discussion of this alternative approach to accountability, which centers on measurement of “administrative” data.
Administrative Systems
Given the limitations of floor-setting regulations, as well as of political and market-based systems, administrative accountability is intended to help decision-makers determine whether schools are building capacities in students that society values and using resources effectively. Administrative accountability creates oversight by measuring school processes or student outcomes and by providing incentives or interventions based on those measures. Process-based measures aim to capture internal school processes and practices, either by sending observers into schools (inspectorates) or by surveying stakeholders such as students, parents, and teachers. Outcomes-based measures record student learning, behaviors, and academic achievements, using data from test scores, attendance, and graduation rates.
Reforms in the United States over at least the past two decades have focused largely on student outcome measures, but many states have expanded that focus to include process measures. Quality rating and improvement systems (QRIS) for early childhood education, for example, have used process measures more than outcomes measures. According to a 2011 OECD report, only the United States and three other countries out of thirty-five OECD countries in the report use any form of test-based accountability at the elementary level. In comparison, the majority of the countries employ process-based measures from inspection systems at the elementary level. Evidence on the benefits of inspection systems in the United States is still sparse and mostly descriptive given the lack of use of inspection systems compared to test-based measures, for which the evidence is more robust.
Unlike regulations and political and market-based accountability, administrative accountability calls for evidence of alignment of schooling with state standards or goals. Conceptualized in the 1980s, standards-based education entails setting standards for what skills and capacities students should attain at various points of their educational career and designing a governance system that focuses on aligning those goals with instructional practices (Smith and O’Day 1990). The approach aims to provide more equal access to quality curricula and instruction, as it determines whether schools meet criteria for resource, practice, and performance standards (O’Day and Smith 1993).
Outcomes-based accountability policies developed in the 1980s and 1990s, starting in Texas and North Carolina and spreading to many other states. With the passage of the No Child Left Behind Act of 2001 (NCLB), all states were required to adopt their own standards for students in their public schools, with the goal of all students reaching “proficiency,” as defined by their state. States could choose their own standards and did (see also Hanushek, this volume). However, since 2010, forty-one states, the District of Columbia, and four territories have adopted the Common Core State Standards (CCSS) for mathematics and English Language Arts (ELA). Despite ongoing political tensions surrounding the CCSS and the lack of strong empirical studies on the effects of the standards (Polikoff 2017), the adoption of CCSS signals a step toward a consistent framework for educating students across the nation.
The usefulness of administrative accountability hinges on whether available measures show state decision-makers the extent to which schools are meeting their standards and goals. Tests aligned to state standards are one commonly used measure in administrative accountability that have been designed and developed to measure this performance. Given that tests are invariably imperfect, the question remains whether they are useful.
Using Tests for Administrative Accountability
Measures of performance generally—and standardized tests in particular—can be used for a variety of purposes. They may provide information that allows local policy-makers and voters, in political models, or parents, in market models, to make informed decisions. For example, test performance measures are commonly provided by realtors to families choosing where to live. These measures theoretically inform parental choice, even in a limited market system without charter schools or voucher programs (for analysis of charter schools and voucher systems see, e.g., Rotberg and Glazer 2018; Murnane et al. 2017). Alternatively, states may hold schools more directly accountable for performance on the measures. For instance, states may intervene in schools with low performance, as required by NCLB, or provide positive incentives such as greater funding, as is the case of some QRIS for early education. In both of these cases of administrative accountability, school leaders may see benefits of improving their performance on the measures, and thus the measures create incentives for schools to focus on this form of improvement.
Measures of performance used for administrative accountability, however, can be problematic. The standards may not reflect true state goals, so that any measures based on standards would not capture valued outcomes. Several studies have found variation across states with respect to the content of standards and expectations (Porter, Polikoff, and Smithson 2009; Finn, Julian, and Petrilli 2006; Wilson and Bertenthal 2005). However, providing causal evidence of standards’ quality and their impact on students is difficult for various technical reasons (Polikoff 2017).
Even with well-defined standards, problems of reliability (whether the test is precise), validity (whether the test actually measures what it is designed to measure), or comprehensiveness (whether the test captures all domains of interest) reduce the value of measures of school progress toward valued goals. These measurement problems arise either because the measures do not sufficiently address the standards or because they do not sufficiently demonstrate schools’ progress toward those standards. Even if a measure accurately indicates student learning in ELA, for example, it may not accurately capture schools’ contribution to this learning.
In terms of reliability, scores with large measurement error provide little information. Even if schools improve on the dimension that the measures aim to address, an unreliable measure would not necessarily capture this improvement. Likewise, if a school did not improve on a given dimension, the measure would not reliably capture the lack of improvement. Measurement error can arise from poorly designed test instruments or from test administration: for example, if tests are taken during class periods, time limitations can affect score reliability (Hargreaves and Braun 2013). Measurement error can also arise from manipulating individual test scores into measures such as those measuring school performance or from combining a set of individual scores into an overall rating. In one recent study, Hough, Kalogrides, and Loeb (2017) found that just by including data from social-emotional learning (SEL) and school culture/climate (CC) surveys, school ranking results changed, indicating an instability or lack of reliability in identifying the lowest performing schools. Again, all measures have some error, but for some measures, the errors are so great that they do not yield useful information. And even if a score provides some relevant information, the error may be frustrating to those being held accountable and create disincentives for them to work toward improvement on the targeted dimension (Heinrich and Marschke 2010).
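The effect of measurement error on identifying the lowest performing schools can be illustrated with a small simulation. The sketch below is not drawn from any of the studies cited; it assumes hypothetical schools whose true performance is normally distributed, adds normally distributed measurement error of a chosen size, and reports the share of the truly lowest performing schools that the noisy measure also places at the bottom. The function name `bottom_k_overlap` and all parameter values are illustrative assumptions.

```python
import random


def bottom_k_overlap(n_schools=200, k=20, noise_sd=0.0, seed=0):
    """Share of the truly lowest-k schools that a noisy measure also flags.

    All inputs are hypothetical: true performance is drawn from a standard
    normal distribution and measurement error from a normal distribution
    with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, 1.0) for _ in range(n_schools)]
    observed = [t + rng.gauss(0.0, noise_sd) for t in true_scores]

    def bottom(values):
        # Indices of the k schools with the lowest values.
        order = sorted(range(n_schools), key=values.__getitem__)
        return set(order[:k])

    return len(bottom(true_scores) & bottom(observed)) / k


if __name__ == "__main__":
    # Compare an error-free measure with increasingly noisy ones.
    for sd in (0.0, 0.5, 1.0, 2.0):
        print(f"noise_sd={sd}: overlap={bottom_k_overlap(noise_sd=sd):.2f}")
```

With no measurement error the two lists coincide by construction; as the error grows relative to the spread of true performance, the overlap typically falls, which is the sense in which a noisy measure becomes less useful for targeting interventions.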
In terms of validity, test scores may distort inferences about students’ performance on specific domains of knowledge and skill or about a school’s contribution to student achievement. Test scores of students in a given school may not be the right measure to judge how much the school (as opposed to other factors) contributes to student learning. The goal might be to measure a student’s skills in algebra; but alternatively, the goal may be to measure how much algebra a student learns after entering the school; or the goal might be to measure how much educators contribute to the student’s learning of algebra, taking out the effects of peers and other factors not necessarily under the control of the school. A measure that yields valid inferences for one definition of the goal may not lead to similarly valid inferences for other goals.
Who uses the information is also a critical consideration. If the state’s goal is determining whether students are learning enough algebra, understanding the level of students’ skills may be most important. For a parent choosing a school for their child, on the other hand, the goal of understanding how much a student is likely to learn may be most important. For the school board determining whether its schools are as effective as other schools, a third validity check would apply. One well-established tenet in the professional measurement community is that the validity of inferences about students’ performance depends on the test that is used; another is that relying on multiple indicators rather than on a single test may help to reduce threats to validity (Linn 2000).
In terms of comprehensiveness, no single measure captures progress toward all stated goals; and even in combination, no set of measures will likely cover all goals. Therefore, choosing which measures to adopt for an accountability purpose implies preferences for some goals over others. Sometimes this preference comes from a particular concern that the goals are not being met or from a belief that the chosen goals are the most important goals. For example, reliance on tests in mathematics and English language arts may stem from the belief that while many goals are valuable, students are unlikely to be able to reach other goals if they are not numerate and literate. Nonetheless, the limited scope of available measures can create potentially unintended consequences. Schools may focus their improvement only on measured dimensions, at the cost of effectiveness in other domains. Parents and policy-makers may make decisions based on available measures when nonmeasured domains are more important to them or more central to the decision in question. If a school’s performance on measured domains reflects its performance on unmeasured domains—which is often the case—then the limited coverage of available measures may not reduce the benefit of standardized data collection. However, if the measured and unmeasured domains are inversely related to each other—which could be the case if incentives lead schools to shift focus away from unmeasured areas—then the lack of coverage could result in outcomes-based accountability debilitating instead of helping decision-making.
All measures have problems of reliability, validity, and comprehensiveness, yet the alternative to imperfect measures would be to resort to techniques uninformed by standardized measures that pose other formidable problems. School board members would not know whether their district is providing the kinds of opportunities that other school districts are or whether there are groups of students learning less in their district than in similar districts. Parents, when choosing where to live, whether to attend a charter school, and, if so, which one, would only have their observations and the views of acquaintances as sources of information. The test of whether the usefulness of the measures outweighs their imperfections is whether they appear to improve educational opportunities for students and lead to better decisions. We turn now to a review of evidence on the usefulness of test-based accountability.
Test-Based Accountability: What the Evidence Suggests
Research suggests that outcomes-based accountability systems have improved average student outcomes, as measured by the National Assessment of Educational Progress (NAEP) mathematics tests (Jennings and Lauen 2016; Carnoy and Loeb 2002; Dee and Jacob 2009; Hanushek and Raymond 2004; Jacob 2005, 2007; Rouse et al. 2007; Lauen and Gaddis 2012). Gains on tests like NAEP that are not incentivized are smaller than those on state tests with direct incentives (Jacob 2007; Jennings and Lauen 2016; see also Fahle, Shear, and Shores, this volume; and for a more critical stance, see Hout and Elliott 2011). In addition, NAEP scores during the NCLB era showed that racial and ethnic achievement gaps were (modestly) closing, with African American students in some grades progressing at a faster rate than white students (Gamoran 2013). Gains in test scores from accountability systems are also evident elsewhere. In Wales, for example, student achievement at the secondary level declined relative to England after school performance tables were no longer published, providing evidence to suggest that outcomes-based accountability aids student achievement (Burgess, Wilson, and Worth 2013).
Empirical studies also have found that accountability systems in which schools receive a letter grade on their quality create incentives for schools to make improvements to scheduling and professional development, as well as to instruction and curricula (Rouse et al. 2007; Chiang 2009; Rockoff and Turner 2010). These gains may be attributable to the combination and alignment of standards, assessments, and funded interventions occurring concurrently; nevertheless, the studies suggest that educational assessments, aligned with state standards and interventions, can play a role in providing relevant information about, and incentives for improving, student and school performance.
Well-designed standardized tests also make it easier to assess whether other forms of oversight, such as floor-setting regulations, are adequate. As one example, bodies of research have utilized test outcomes to examine the question of whether class size reductions improve student performance, with mixed findings (Angrist and Lavy 1999; Krueger 1999; Rivkin, Hanushek, and Kain 2005; Hoxby 2000). Such uses of student test scores can potentially inform future regulatory policy decisions about class size. Educational assessments can also offer key information on whether current credentialing requirements are adequate for teaching the skills and capacities effectively in the classroom. Research has used student test outcomes to determine the association between teachers having certain credentials and students’ math and reading scores. A study in 2000 found that a higher percentage of teachers holding both a subject matter major and a full state certification was positively associated with NAEP math and reading scores (Darling-Hammond 2000; but see also Hanushek, this volume, for an alternative view).
These studies all address the use of tests for the domains they target. The scope and content of the tests dictate what decision-makers and educators understand about the system and serve as a signal of society’s values. These measures are not necessarily comprehensive. Brighouse and colleagues (2018) describe a range of valued educational outcomes in terms of the knowledge, skills, attitudes, and dispositions that individuals develop that contribute to their own flourishing and the flourishing of others.
“Knowledge” may involve being informed about the history of the United States, algebraic formulas, and grammar rules. “Skills” refer to one’s ability to do things, such as being able to handle conflicts, analyze data, and present in front of a class. “Dispositions” are defined as the often implicit and subconscious tendencies to act upon one’s skills and knowledge. For example, courage may be a disposition that spurs people to use their knowledge and skills in times of danger. While similar to dispositions, “attitudes” differ slightly in that they involve conscious ways of thinking that may or may not motivate a certain response to a circumstance. A person may feel a positive attitude toward activities that they enjoy, for instance. The knowledge, skills, dispositions, and attitudes that individuals develop provide them with the capacity for economic productivity, personal autonomy, democratic competence, healthy personal relationships, the treatment of others as equals, and personal fulfillment. However, the specific forms of knowledge and skills, and the particularities of dispositions and attitudes, may change with time and context. For example, physical strength is far less important for economic productivity now than it was many years ago (Deming 2017).
Tests of mathematics and ELA do not measure all valued educational outcomes, and emphasizing those subjects, or even a range of academic subjects, might shift the focus of policy-makers, parents, and educators away from nontested domains. Schools tend to concentrate their attention on the subjects tested and on the grade levels in which test results have the highest stakes (Deere and Strayer 2001; Ladd and Zelli 2002; Stecher et al. 1998). Other studies (e.g., Hamilton, Berends, and Stecher 2005; Jacob and Levitt 2003; Jones et al. 1999; Linn 2000; Stecher et al. 1998) show that teachers and schools tend to narrow the curriculum and shift their instructional emphasis from nontested to tested subjects, while earlier work by Shepard and Dougherty (1991) and Romberg, Zarinnia, and Williams (1989) suggests that teachers focus more on tested content areas within specific subjects (see also Shepard, this volume). In a nationally representative survey of 349 districts from 2001 to 2007, schools reported increasing the time spent on English and math while decreasing the time committed to social studies, art, music, science, physical education, and recess (McMurrer 2007; Rothstein, Jacobsen, and Wilder 2008). However, research on the impacts of state accountability on nontested domains is quite limited, because by definition these domains are not measured on a large scale. As such, test-based accountability may indeed be leading to an inverse relationship between measured and unmeasured goals, but the empirical literature to date does not provide strong evidence of this.
Shifting the Focus of Test-Based Accountability to Teachers
The results that we have described pertain to schools, which until recently have been the main focus of public accountability. Tests were given to individual students each year, but results were reported at the school and district levels to provide evidence of performance of schools and districts but not of individual teachers or students. States initiated, and NCLB continued, this focus on school and district measures for outcomes-based accountability. However, more recent initiatives such as Race to the Top, NCLB waivers, and the Teacher Incentive Fund shifted the attention to teacher accountability. The NCLB waivers, for example, stipulated that if states created a plan for a more rigorous accountability system in which student growth scores are used to evaluate teacher performance, they would not have to reach the NCLB objective that all students attain academic proficiency by 2014 (Collins and Amrein-Beardsley 2014; Shen, Simon, and Kelcey 2016). The Teacher Incentive Fund provided additional money for school initiatives that created performance-based teacher and principal compensation systems in high-need schools, offering states a financial incentive to create teacher accountability. Accordingly, there has been a growing dependence on value-added models to try to measure growth and improvement in inputs such as teacher performance (Roderick, Jacob, and Bryk 2002; Thorn and Meyer 2006; Hanushek, this volume).
Such models estimate how students in a teacher’s classroom would likely score on mathematics or ELA exams during that year with the average teacher, and then take the difference between students’ actual scores and these estimated (predicted) scores. This difference is termed the teacher’s “value-added” to student test performance. While a range of states use these measures, observational measures of teaching are even more common and tend to receive greater weight in teacher accountability programs. The observational measures of performance include those based on the Danielson Framework 2 and the Classroom Assessment Scoring System (CLASS), 3 among others (see Hanushek, this volume, for a discussion of value-added models).
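The logic of a value-added estimate can be sketched in a few lines of code. The example below is a deliberately simplified illustration, not the specification used by any state or study cited here: it assumes that a student's prior-year score is the only predictor, fits a single pooled regression line to stand in for the "average teacher," and averages each teacher's residuals (actual minus predicted scores). The function names and the six student records are hypothetical; operational models typically adjust for many more factors.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def value_added(records):
    """Average residual (actual minus predicted score) for each teacher.

    `records` holds (teacher, prior_score, current_score) tuples. The
    prediction uses prior score only, which is a simplifying assumption.
    """
    records = list(records)
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        predicted = a + b * prior  # expected score with the "average teacher"
        residuals[teacher].append(current - predicted)
    return {teacher: mean(values) for teacher, values in residuals.items()}


# Hypothetical data: two teachers, three students each.
records = [
    ("A", 40, 52), ("A", 50, 61), ("A", 60, 70),
    ("B", 40, 44), ("B", 50, 55), ("B", 60, 66),
]
scores = value_added(records)
print(scores)
```

In this toy data set, students of teacher A score above the pooled prediction and students of teacher B score below it, so A receives a positive estimate and B a negative one. With so few students per teacher, such estimates would be highly sensitive to measurement error, which is one source of the reliability concerns raised for teacher-level measures.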
Outcomes-based measures for teachers have many of the same potential drawbacks as outcomes-based measures for schools—threats to reliability, validity, and comprehensiveness. As an example, teachers’ ratings on value-added measures of performance differ depending on the student test used to create the measure (Goldring et al. 2015; Loeb and Candelaria 2012; Papay 2011; Ballou and Springer 2015). This lack of consistency points to the importance of the validity of inferences from student tests. Similarly, teacher performance can vary substantially from year to year, even on the same test, pointing to the problem of reliability. Yet Jacob and Lefgren (2005) compared value-added measures to principal performance ratings and to measures of qualifications (education and experience) and found that value-added measures were better at predicting future student achievement, with the principals’ subjective assessments predicting future achievement more accurately than teachers’ education and experience.
Studies to date on the effects of programs in the United States that are aimed at creating incentives for teachers based on the performance of their students on standardized tests have found little effect. The National Center on Performance Incentives at Vanderbilt used strong causal research designs to assess the effects of a range of performance incentives across a range of states and found no consistent benefit (see, for example, Springer et al. 2012). The value-added measures in these studies had some validity in terms of capturing valued outcomes as measured by student test performance and adjusting for a range of factors outside of teachers’ control. However, they were unreliable enough to frustrate teachers and did not provide other forms of information that teachers needed to improve.
Results using observation-based classroom evaluations are more promising. Research on teacher-level accountability in Ohio showed improvements in student learning as a result of the observation-based Teacher Evaluation System (Taylor and Tyler 2012). Similarly, the IMPACT program in Washington, D.C., showed teachers improving at least on the incentivized measures of performance (Dee and Wyckoff 2013). Moreover, studies of programs in Tennessee that use observational measures to pair teachers who have stronger skills in a given area with colleagues who are not performing as well in that area also demonstrated positive outcomes stemming from the availability of performance measures (Papay et al. 2016). However, research on the validity and comprehensiveness of observational measures is far less developed than research on value-added measures.
Coping with Imperfection: Determining the Usefulness of Outcomes-Based Accountability
States bear responsibility for providing education to their children and youth, but they rarely have the capacity to manage their schools. As a result, they almost invariably (with Hawaii as the only outlier) delegate the running of schools to school districts. Yet states continue some oversight of schools, through regulations and through local political structures such as school boards. Market-based mechanisms, which rely on choices of residential location or private schools as well as on charter schools, vouchers, and intradistrict and interdistrict school options, also provide some accountability for schools. But states increasingly use administrative accountability approaches based on standardized measures of school processes and outcomes to inform policy-makers and school choosers and, at times, to create incentives for educators to work toward improvement on the measures.
All measures have problems. Thus, the potential advantages of using test-based measures depend on how reliable the results are, how valid the inferences are that emerge from the results, how comprehensive they are at assessing the full range of valued outcomes, and, perhaps most importantly, how much better or worse such measures are than the existing alternatives. The context in which measures are applied determines the extent to which they improve or hinder effective educational decision-making. If a lack of information about performance is not limiting decision-making and educators are working directly toward valued goals, then such measures are superfluous and potentially counterproductive. However, if a lack of information inhibits rational decision-making, then such measures may be beneficial even if they are imperfect (for a discussion of testing policy in terms of benefits and risks, see Feuer 2006). The initial evidence on school-level accountability shows potential gains from the new information. Early evidence on teacher-level accountability based on student test performance is not positive, though the use of observational measures shows more promise.
Standardized measures of school performance—whether those are from observation or student outcomes—are useful to the extent that they lead to improvement. They may lead to improvement by providing information to state policy-makers about where to intervene more directly, given their limited capacity to run every school in the state; or by creating incentives through rewards or punishments for school- and district-level educators to increase their schools’ scores. These mechanisms are examples of direct administrative accountability processes. Alternatively, standardized measures of performance may be beneficial for local policy-makers, such as school board members, in making decisions about their schools; and they may be useful for parents in choosing where to live. However, these measures, especially if they are designed poorly, can also be counterproductive. They can lead states to intervene in the wrong schools or educators to either become frustrated with the oversight or work toward outcomes that are not aligned with shared goals.
Ultimately, outcomes-based measures are imperfect. But context is everything: in a world in which regulations have proven unsatisfying, especially in a diverse and changing economy, and in which local political institutions and school choice have significant limitations, the downside risks of using any accountability system must be weighed against its potential benefits. Evidence to date gives reasons for both optimism and caution about the prospects for creating measures that reflect shared goals for schools with reliability, validity, and suitable comprehensiveness.
Notes
Susanna Loeb is the director of the Annenberg Institute and a professor of international and public affairs and education at Brown University.
Erika Byun is a Stanford University Institute for Economic Policy Research (SIEPR) predoctoral research fellow, focusing on the economics of education.
