Abstract
Public administrators are committed to improving public service delivery, as evidenced by decades of accountability efforts at all levels of government. This movement is especially salient in the public education system, where student standardized test scores are increasingly used as the key performance metric to evaluate the effectiveness of schools, teachers, and, most recently, teacher preparation programs (TPPs). Evaluating TPPs using a single quantitative performance metric at the student level is a complicated endeavor. This paper illustrates a key challenge in this type of accountability system that has not yet been examined in the literature: graduates of individual TPPs tend to cluster in a very small number of districts. We present a case study to show how geographic stratification inhibits the ability of statistical models to disentangle the effect of district and school from that of the TPP on student achievement, particularly in rural states.
The public management literature has long suggested that one means of improving the delivery of public services is through enhanced accountability, particularly through performance measurement. There is widespread agreement with this idea both within the academy and, increasingly, among public leaders. But there exists a potentially deep divide between the idea of accountability and its effective implementation. The choice of metrics used to hold agencies accountable is partly responsible for this divide. Research suggests that selecting or designing appropriate measures is an extremely difficult task for public agencies at all levels (U.S. General Accounting Office, 1997; Wang, 2000). Performance metrics are expected to be most successful when they are closely aligned to the stated goals of the organization (Hatry, 1999), indicative of actual performance (Heckman, Heinrich, & Smith, 1999), and difficult to game (Baker, 2002). In addition, a single piece of information cannot meet the complex needs of multiple decision makers with sometimes conflicting goals (Moore & Braga, 2003; Radin, 2000). Finally, governments use performance metrics to improve organizational performance. Yet, if the metric measures the efforts of individuals, it becomes very difficult to establish a relationship between individual and organizational behavior (DeNisi, 2000; Taylor, 2009).
This article focuses on the challenges of using a single performance metric, K-12 student test scores, to evaluate teacher preparation programs (TPPs) of institutions of higher education. The evaluation of these programs with test scores as the metric has become increasingly popular following the passage of No Child Left Behind. At least four states, including Tennessee, Ohio, Massachusetts, and Louisiana, have undertaken comprehensive efforts to create datasets and projects that will evaluate the effects of TPPs using student achievement data.
Using a case study from the state of Kentucky, we argue that there are inherent problems in using test scores to evaluate TPPs that go beyond the issues previously raised in the public administration literature. These problems are related to characteristics of teacher labor markets that create more than the usual challenges for designing a meaningful performance metric. Teachers tend to have well-defined geographic preferences as to where they want to work after graduation, related strongly to their hometown (Boyd, Lankford, & Loeb, 2005). 1 These geographic preferences create analytical issues that are especially intractable in rural states, where the tendency of graduates of individual TPPs to cluster in a very small number of districts is more pronounced. We refer to this tendency as stratification. The evaluation problems for rural states are often overlooked in the literature, yet the majority (nearly 56%) of public school districts in the U.S. are located in rural areas, and over 10.3 million students attend rural public schools (Provasnik et al., 2007). We will demonstrate that the stratification phenomenon limits the extent to which states can use student outcome data to identify the role of the TPPs in contributing to the effectiveness of a teacher. The implications of our case study extend not only to other states but also to other arenas of the public sector, including the evaluation of higher education disciplines of all types.
Previous Work
Public education provides an excellent context to explore the challenges in performance metric implementation. Education is the single largest expenditure item of state and local governments, 2 and incorrect measures could drive substantial misuse of state funds and misdirection in identifying highly performing public institutions. To the extent student test scores are made public and used, external stakeholders like parents and state policymakers seek to use them to evaluate each level (teachers, schools, districts; Figure 1). The desire to evaluate yet another element in the educational process—the TPP—is motivating state policymakers to extract yet another layer of information out of student test scores: Which TPPs in the state produce teachers who consistently add the most value to their students’ achievements?

Figure 1. Educational accountability of multiple agents based on the single performance measure of student test scores.
Meaningful program accountability requires that the performance metrics are relevant to the program level (DeNisi, 2000; Gormley & Weimer, 1999; Jennings & Haist, 2004). In other words, to hold TPPs accountable for the performance of their graduates, the TPP needs to have a meaningful influence on what the teachers are doing in the classroom. Lipsky’s (1980) classic service-delivery level research uses teachers to exemplify “street-level bureaucrats” who play a critical role in implementation. The multiple sources of discretion that teachers use in the classroom, the multitude of contextual influences (including peer teachers, continuing education/professional development experiences, and curricular and instructional guidance received at the school and district level), and postgraduate training all affect the teachers’ performance. These factors may become more influential for the teachers’ classroom behaviors than the undergraduate training as time elapses following graduation. In particular, a principal who is responsible for the ongoing personnel decisions that affect the teacher may have a higher degree of influence over behaviors that the teacher exhibits in the classroom than the TPP. Performance information in the form of student test scores has a most direct and meaningful link when used by principals. A similar argument can be made that parents and superintendents have a more meaningful ongoing influence on service delivery by teachers than their TPPs.
To accurately evaluate teachers/schools/TPPs, the performance management research suggests that the metrics must reflect the true objectives of the organization as closely as possible (Baker, 2002; Heckman et al., 1999). If the true objective of the organization is to create meaningful learning or to educate productive citizens, then the performance measure or measures used should capture that. However, as in many program evaluations, K-12 education accountability models must rely on proxies for the true desired outcome of actual student learning (Gormley & Weimer, 1999). Standardized student test scores generally serve as the proxy in the models used to measure teacher and school effectiveness. Disjuncture between the proxy and the desired performance objective can become quite problematic. Heckman, Heinrich, and Smith (1997) show that accountability systems that rely on proxies may encourage short-term gains in the proxies at the expense of the long-term gains in the desired outcome. In the education arena, this concern is frequently described as “teaching to the test”: teachers may be engaging in behaviors that result in immediate improvement of student test scores, but students’ long-term learning may suffer if, for example, teachers displace teaching in-depth content with test-taking skills. Studies indicate that teachers do change their classroom behaviors in attempts to maximize their students’ performance on the test (Figlio & Rouse, 2006; Jacob, 2005). This type of behavior is problematic both in individual teacher accountability systems and when trying to assess TPP performance. Rather than identifying true effectiveness in promoting student achievement, the estimates of teacher effect on score gains may simply be indications of the teachers’ success in teaching test-taking skills or narrowing the classroom focus to test content, or of the TPP’s ability to prepare teachers to do this.
The proxy measurement issue is accentuated in this case by the multiple layers of agent and agency responsibility that lie between the student in the classroom and the TPP.
Accountability systems, along with their beneficial effects, are likely to create situations where stakeholders have the incentive to “game” the system (Courty & Marschke, 1997). In other words, agencies act in ways that make it appear as though their agency is improving its performance, when in reality it is not (Heinrich, 2002). Schools may take strategic actions to demonstrate undeserved student gains by excluding certain low-performing students from taking the tests through reclassification or suspension (Cullen & Reback, 2006; Figlio & Getzler, 2006; Jacob, 2005). When schools select the higher achieving test takers during the accountability cycle, they attribute artificial gains to teachers, schools, or—if student test data are used to evaluate them—TPPs.
Conceptual Model and Description of Case Data
The teacher accountability literature provides considerable methodological guidance to researchers who are attempting to build TPP accountability models using student test scores. Research on TPP accountability uses value-added models (VAMs) to estimate the contribution that a TPP makes to student achievement (Boyd, Grossman, Lankford, Loeb, & Wyckoff, 2009; Noell, 2006). The value-added approach is attractive to researchers and decision makers because it nets out unchanging parental and student characteristic contributions to score gains. Because of this property, it is generally the first candidate to be considered when an input to internal evaluation processes is needed. Noell (2006), Noell, Porter, and Patt (2007), Noell, Porter, Patt, and Dahir (2008), and Boyd et al. (2009) used these models to address many of the issues with accountability metrics that are described above. Yet none of these studies has fully addressed the challenge of geographic stratification of TPP graduates, which is the focus of the remainder of this article.
For example, Noell (2006) looks at the New Orleans school district and notes that it employed new teachers from 13 TPPs, the most of any district, but this represented only about 50% of the TPPs in the sample. The stratification of teachers is problematic because the unmeasured characteristics of the district may confound the TPP effect. For example, New Orleans public schools may have unique unmeasured characteristics that are correlated with their measured characteristics, such as being historically low performing (Louisiana Recovery School Districts, n.d.). If new graduates of a particular TPP do not teach in New Orleans schools, then it may be inappropriate to claim that one has measured an average teacher effect of that institution’s graduates. In fact, their effect on students in this unique school district is unknown, as is whether they would be as effective with New Orleans public school students as with students in other districts.
The Boyd et al. (2009) study focuses on the New York City public school system. Like the Louisiana research, this study also does not address whether the distribution of recent graduates is geographically stratified across districts. However, it is less likely that stratification would pose as serious a threat to this type of analysis in that geographic region, given its density and scale. The New York City public school system educates over a million students a year and contains over 1,600 schools (New York City Department of Education, 2010). This type of environment provides so many potential avenues for teachers that New York state TPPs are likely to be represented, to a certain extent, among schools in all five New York City boroughs.
The remainder of this article focuses on the use of VAMs to evaluate TPPs conducted in Kentucky. Kentucky is a particularly valuable state to study because it is relatively rural and relatively poor. With almost 17% of its population in poverty, Kentucky currently is ranked among the four states with the highest poverty rates. It has experienced long-standing achievement gaps between the poorer, rural areas and the higher income urban areas of the state—gaps that many argue are at the root of the huge variation in college attendance, economic prosperity, and health outcomes observed across the state. Kentucky, like other high poverty states, also contains a large number of small, rural school districts: 53% of its school districts and 39% of its students are in rural areas. Although we focus on a single state, the implications for the study apply to many others, particularly to those with large numbers of rural school districts and to those with large gaps in socioeconomic status and student performance.
We contend that the value-added approach does not address the fact that teachers from specific TPPs are not independently distributed across the state: indeed, they show persistent geographic concentration, so that only a small number of TPPs are represented in a given district and school (typically, with a single TPP dominating). Although this may seem a narrow technical issue in a single policy area, we suggest that it is a case of a more general complication of performance measurement in a complex real-world setting, in which a single performance measure is relied on to provide meaningful evaluation for a series of entities that may be nested (teacher/school/district) and in series (teacher/TPP; Figure 1).
In a typical evaluation to estimate the effects of any intervention, preintervention outcome data are compared to postintervention data, controlling for other factors that are expected to affect the outcome. The coefficient on the intervention variable then estimates the magnitude of the program effect. Conceptually, the model applied to preservice college training is represented by the following: 3
$$A_{it} = \beta_0 + \beta_1 TPP_j + \beta_2 A_{it-1} + \beta_3 Stu_{it} + \beta_4 Tch_{ij} + \beta_5 Sch_{kt} + u_{it} \qquad (1)$$
where $i$ indexes students, $t$ indexes timepoints, $j$ indexes teachers, and $k$ indexes schools; $A_{it}$ and $A_{it-1}$ are standardized student test scores; 4 $TPP_j$ is an indicator variable designating the teacher’s preparation program; $Stu_{it}$ represents student-specific characteristics, such as race, gender, and subsidized lunch eligibility; $Tch_{ij}$ includes teacher-specific characteristics, including gender, race, experience, and college entrance scores; and $Sch_{kt}$ refers to dichotomous variables that control for the unmeasured, time-invariant characteristics of schools within a district (this model is estimated separately by district, a constraint imposed by the geographic stratification problem discussed in full below). Finally, $u_{it}$ is a randomly distributed error term. Of primary interest is the estimation of $\beta_1$, the coefficient on TPP, which can be interpreted as the impact of a particular teacher preservice training institution on student scores, controlling for all other included factors that are expected to influence scores, including student prior scores.
Note that this model requires observations for individual students, a match of those students to their individual teachers, observations of these teacher matches with multiple students, observations of these matches over multiple teachers, knowledge of a teacher’s TPP, and a sufficient number of teachers from multiple TPP programs. The same students must be observed over at least two consecutive time periods so that a pre- and postscore can be calculated. If all factors that influence a student’s score between periods in time are included in the empirical model, then any single estimated coefficient measures the contribution of a specific variable to the child’s level of achievement in the same time period. From this perspective, it initially appears that measuring the value added by a given TPP should be no different than measuring the value added by any other policy intervention, with the exception of the additional data requirements necessary for estimation of the effects of the TPP.
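To make the estimation step concrete, the following is a minimal sketch (not the authors' code) of fitting a value-added regression with school fixed effects by ordinary least squares on simulated data for one hypothetical district. The function names `ols` and `simulate_district`, every effect size, the three-school layout, and the omission of teacher characteristics are all assumptions made for illustration.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-10:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def simulate_district(n_students=600, tpp_effect=0.10, noise_sd=0.5, seed=0):
    """Simulate one hypothetical district (all parameter values are assumed)."""
    rng = random.Random(seed)
    # Hypothetical teachers: (school index, 1 if from the dominant TPP else 0).
    # Each school employs teachers from both groups, so TPP is not collinear
    # with the school dummies.
    teachers = [(s, d) for s in range(3) for d in (1, 1, 0)]
    school_effect = [0.0, 0.3, -0.2]
    X, y = [], []
    for _ in range(n_students):
        school, tpp = rng.choice(teachers)
        prior = rng.gauss(0, 1)                    # standardized prior score
        lunch = 1 if rng.random() < 0.25 else 0    # subsidized lunch indicator
        # Columns: constant, TPP, prior score, lunch, school 2 dummy, school 3 dummy
        X.append([1.0, tpp, prior, lunch, float(school == 1), float(school == 2)])
        y.append(0.2 + tpp_effect * tpp + 0.7 * prior - 0.15 * lunch
                 + school_effect[school] + rng.gauss(0, noise_sd))
    return X, y

X, y = simulate_district()
beta = ols(X, y)
print("estimated TPP coefficient:", round(beta[1], 3))
```

The simulation only works because both TPP groups appear in every school. If one program supplied all of a school's or district's teachers, its indicator would duplicate a fixed-effect column and `ols` would report a rank-deficient design.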
In Kentucky, as in most states at this time, the student–teacher matches are not available in a centralized, state location. States typically retain individual student information and individual teacher information but not in a way that enables the researcher to match the two. Rather, these matches can be made only by obtaining classroom rolls from individual schools or districts. With the approval of the state’s teacher licensing agency, the Education Professional Standards Board (EPSB), three anonymous, randomly selected school districts agreed to provide their 11th-grade classroom rolls from the 2005-2006 school year so that the teacher–student matches could be made for this case study.
Before discussing the implementation challenges, it is useful to examine the data in more detail. Table 1 provides summary statistics for the study sample. The smallest district contained three high schools that enrolled 771 11th-grade math students and the largest contained five high schools enrolling 1,699 11th-grade math students. After all missing information was accounted for, District 1 provided 477 complete student observations for use in the regression models, District 2 had 564 student observations, and District 3 contributed 1,137 complete observations. The aggregate number of math teachers in these districts ranged from 23 to 67. Our study sample contained between 21 and 63 teachers. At the school level, 11th-grade math student enrollments ranged from 188 to 440. The number of math teachers whose students took the test ranged from 7 to 18 per high school. Table 1 illustrates that these three districts had roughly similar math performance on the KCCT exam in the 2005-2006 school year.
Table 1. Variable Means and Standard Deviations for Final Sample
Note: TPP = teacher preparation program; KCCT = Kentucky Core Content Test; CTBS = Comprehensive Tests of Basic Skills (national normed standardized test); NCE = normal curve equivalent.
Black student percentage was too low to justify the use of this variable as a predictor in the regression.
Although a simple indicator variable for presence of a disability designation in the administrative data was used, we recognize that a wide range of disabilities may be represented by such a designation. If administrative data permit, greater resolution in this measure would be desirable.
Scaled scores were unavailable for this test, so the normal curve equivalent values were used as the control variable instead. For this reason, the magnitudes of the coefficients should not be compared between the 9th-grade math score and the other two scores used as independent variables (9th-grade science and 10th-grade reading).
The students in Districts 2 and 3 have fairly similar demographic characteristics. About half of the sample students in those districts are female; African American students make up just over 16% of the sample; and about 23% of sample students are in a gifted program. The only major difference between Districts 2 and 3 in terms of student demographics is that District 2 has a higher percentage (25%) of students who receive federally subsidized lunch than District 3 (18.5%). Students in District 1, the smallest of the three districts, look different than those in the other two districts. The district itself has less than 1% each of African American, Asian, and Latino/a students. Our study sample contained only 0.4% African American students, and there were no Latino/a or Asian students with complete information from that district to include in our regression models. Finally, a larger share of District 1 students are classified as gifted (33% vs. 23% in each of the other districts) and a smaller share participates in the subsidized lunch program.
Demonstration of Geographic Stratification
EPSB recognizes 30 institutions of higher education with teacher training programs in the state, which are responsible for producing most of the teachers of roughly 670,000 Kentucky K-12 students. Among these programs, the majority are small, private institutions (each producing a few education graduates per year), while eight are publicly funded institutions with substantial numbers of graduates annually. Table 2 illustrates the clustered distribution of TPP graduates across the study districts. The authors created five TPP categories. TPPs A, B, and C are TPPs of three different Kentucky institutions of higher education and are singled out because each produced the most math teachers in one of the three districts examined for this case in 2005-2006. “TPP, Other KY” indicates the number of graduates of all other Kentucky colleges and universities teaching 11th-grade math in the district. “TPP, Other State” indicates the number of graduates from programs in other states. Teachers who taught five or fewer students with test scores in the sample are not shown. Table 2 shows that in District 1 at least 21 teachers teach one or more high school math classes, and 16 of these teachers received their training from a Kentucky institution. Of these 16 Kentucky graduates, however, 11 were trained in the same TPP (TPP A). No high school math teachers on staff in this district in the study year were trained in TPP C. In District 2, on the other hand, 10 of 23 state-trained teachers received credentials from TPP C, but no teachers were trained in the dominant training institution for District 1, TPP A. District 3, which predominantly hires its teachers from yet another program, TPP B, did not have any high school math teachers who trained in TPP A in the study year and had only one who trained in TPP C.
Table 2. Stratification of High School Math Teachers by Teacher Preparation Program (TPP) in Three Kentucky Districts, in Final Regression Samples
Note: TPP = Teacher preparation program.
Refers to the number of teachers that graduated from the relevant TPP.
Refers to the number of 11th-grade math students taught in sample.
The data from these three districts starkly illustrate an issue that has been raised in other contexts in the teacher training literature. Teachers tend to enroll in training programs near their homes and to take jobs in districts near their institutions of training (Boyd et al., 2005). For the small districts in this sample, the labor market segmentation is sufficiently severe that an evaluation of TPPs effectively implies a comparison of only the two most common programs, or a single dominant program to all others present. But because these dominant programs with sufficient numbers of graduates are not the same across the three districts, each district must be analyzed separately. Enlarging the sample to include data across all Kentucky school districts would expand the set of TPPs that could be evaluated but would not mitigate the stratification problem. The teacher stratification issue appears sufficiently severe to make evaluation of TPPs in rural areas of Kentucky using this type of approach infeasible from a practical perspective, as isolating the TPP effect from the district effect is difficult. It is especially problematic for any high-stakes budgeting or other policy decisions.
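The identification problem described above can be sketched as a simple overlap check: two TPPs can be compared within a district only if both have enough graduates teaching there. The teacher roster below is a toy example and does not reproduce Table 2; the function names and the `min_teachers` cutoff are likewise assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def tpp_counts_by_district(teachers):
    """Count teachers per TPP within each district.

    teachers: iterable of (district, tpp) pairs, one per teacher.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for district, tpp in teachers:
        counts[district][tpp] += 1
    return counts

def comparable_pairs(teachers, min_teachers=3):
    """Map each TPP pair to the districts where both have >= min_teachers graduates."""
    pairs = defaultdict(set)
    for district, by_tpp in tpp_counts_by_district(teachers).items():
        present = sorted(t for t, n in by_tpp.items() if n >= min_teachers)
        for a, b in combinations(present, 2):
            pairs[(a, b)].add(district)
    return dict(pairs)

# Hypothetical roster (NOT the study data): each district is dominated by one TPP.
toy_teachers = (
    [("D1", "A")] * 6 + [("D1", "B")]
    + [("D2", "C")] * 5 + [("D2", "B")]
    + [("D3", "B")] * 7 + [("D3", "C")]
)

print(comparable_pairs(toy_teachers, min_teachers=3))
print(comparable_pairs(toy_teachers, min_teachers=1))
```

With a cutoff of three teachers, no pair of programs in this toy roster can be compared inside any district, so any difference between programs would be indistinguishable from a difference between districts. Lowering the cutoff to one teacher recovers a few pairs, but each contrast then rests on a single teacher.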
Note that the problem of confounding of geographic dimensions with TPP labor markets does not diminish with time, nor is it an artifact arising from our focus on 11th-grade math scores in this case. 5 Consider the pattern shown in Figure 2, which includes information about all teacher graduates (K-12) from the seven largest producers of Kentucky public school teacher bachelor’s degrees from 2001 to 2008. The horizontal axis displays the TPPs from west to east (based on geographic center) and the vertical axis lists the Kentucky counties from west to east. Bars range from 0 to 8 years, indicating the total number of years between 2001 and 2008 in which the given TPP produced 10% or more of that county’s teachers. 6 Thus the figure illustrates that TPP graduates tend to concentrate in geographic proximity to their institution and that this concentration persists through time. Another perspective on the extent of the concentration is shown in Table 3. For each of the seven TPPs, the majority of counties had less than 10% of teaching staff from that program in each of the 8 years examined. Graduates from Northern Kentucky University (NKU) and the University of Louisville (U of L) are the most highly concentrated; only 7% of Kentucky counties have 10% or more of their teaching staffs from these TPPs in at least one of the 8 years studied. Because of this kind of stratification, it will be difficult to disentangle the effect of district and school from that of the TPP on student achievement, since geography and TPP tend to be linked so strongly over time. Geographic stratification by this metric is a function of both the number of graduates a TPP generates and the job locations chosen by those graduates; a program that produces fewer graduates is likely to exacerbate the geographic preference effect.

Figure 2. Persistence of teacher preparation program graduates’ geographic stratification in Kentucky counties over time, for period 2001-2008.
Table 3. Extent of Concentration of Teacher Preparation Program (TPP) Graduates in Kentucky, 2001-2008
Note: TPP = Teacher preparation program. The higher the percentage of counties with less than 10% of its teachers from that TPP in every one of 8 years examined, the greater the persistent concentration of the TPP’s graduates.
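The persistence measure behind Figure 2 and Table 3 can be sketched as follows: for each county and TPP, count the years in which the program supplied at least 10% of the county's teachers, then report the share of counties that never reach that threshold. The input layout (teacher counts keyed by county, year, and TPP), the function names, and the toy numbers are assumptions; the original computation may differ in detail.

```python
from collections import defaultdict

def years_at_or_above(counts, threshold=0.10):
    """Count, per (county, tpp), the years in which tpp's share >= threshold.

    counts: {(county, year, tpp): number of the county's teachers from tpp}
    """
    totals = defaultdict(int)
    for (county, year, _tpp), n in counts.items():
        totals[(county, year)] += n
    years = defaultdict(int)
    for (county, year, tpp), n in counts.items():
        total = totals[(county, year)]
        if total and n / total >= threshold:
            years[(county, tpp)] += 1
    return dict(years)

def share_of_counties_never_reaching(counts, tpp, threshold=0.10):
    """Share of counties where tpp stays below the threshold in every year."""
    counties = {county for (county, _year, _tpp) in counts}
    reached = {c for (c, t) in years_at_or_above(counts, threshold) if t == tpp}
    return 1 - len(reached) / len(counties)

# Hypothetical counts for two counties, two years, and two programs.
toy_counts = {
    ("West", 2001, "X"): 8,  ("West", 2001, "Y"): 2,
    ("West", 2002, "X"): 19, ("West", 2002, "Y"): 1,
    ("East", 2001, "Y"): 10, ("East", 2002, "Y"): 10,
}

print(years_at_or_above(toy_counts))
print(share_of_counties_never_reaching(toy_counts, "X"))
```

In the toy data, program X never appears in the East county, so half of the counties never reach the threshold for X. A high value of this share corresponds to the strong concentration that Table 3 reports.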
This geographic stratification issue is our main focus, but this case also illustrates another challenge which will be severe for predominantly rural states. Noell (2006) and others find that TPP effects are greatest during the first 3 years of the teacher’s classroom experience. After the 3rd year, the TPP effects are diluted by the cohort or peer effects of the school in which the teacher is hired. These peer effects are likely to be larger in schools where the teaching staff has been in place for longer periods of time. In our study sample, the average years of experience in the three districts ranges from 12.3 to 14.5 years (Table 1). Restricting the data to those teachers with experience of 3 years or less or, at most, to those with 5 years or less would cause a loss of more student observations for those schools with greater numbers of more experienced teachers. The average years of experience itself may be associated with the quality of the school, as teacher turnover is expected to be lower in schools with more amenities, such as a higher performing student body. Thus one data choice that would make identifying TPP effects more likely—that is, a focus on recent graduates—would entail (a) a trade-off with sample size and (b) the risk of a school-level selection bias due to the nonrandom distribution of new teachers in schools from which those with more seniority tend to transfer.
The variation in time of graduation from a teacher preparation institution presents another complication to the interpretation of estimates of TPP effects. Table 4 lists individually the five TPP categories and the years in which the training occurred. The data indicate that there are teachers in these districts who received degrees from TPP B as early as 1969 and as recently as 2005. If the policymakers’ goal in identifying a TPP effect is to reward institutions that prepare better teachers, or to identify best curricular practices for diffusion to other institutions, one must make the assumption that the TPP effect represents some consistent educational or (less optimistically) selection practice on the part of the institution.
Table 4. Range of Graduation Years of High School Math Teachers From Various TPPs Across Three Kentucky Districts, 2005-2006
Note: TPP = Teacher preparation program. The wide range of graduation years makes estimating the effect of a given teacher preparation program on student achievement challenging, if the assumption that preparation program quality and content remained constant over the period is not realistic.
District Level Results
Recognizing the deficiencies of the data and the more fundamental implementation challenges described above, the value-added model with school fixed effects (Equation 1 above) was estimated separately for each school district. We do not claim that this estimation solves the problems we identified above: rather, we provide an illustration of an attempt to estimate TPP value added using typical available administrative data and a value-added model. Indeed, our point is that this illustration and similar approaches do not address those fundamental issues—and hence policymakers relying on such approaches with typical administrative data risk making unwarranted judgments about TPP efficacy.
Student test scores for end of period are their 11th-grade math scores on the KCCT exam. All prior high school scores available for the students are included as controls, including science and reading scores in addition to math. The student characteristics include gender, race, special abilities status, and free and reduced lunch status. The teacher’s gender and experience levels are included. All time-invariant school characteristics are captured in the school fixed effects. 7
The results of the regressions are listed in Table 5 and indicate reassuringly that a student’s past scores in both reading and math (and science in the largest district) significantly influence 11th-grade math scores. The 10th-grade reading and 9th-grade math scores are significant and positively related to 11th-grade math performance across all districts. In fact, prior student test scores are the only variables that are consistently statistically significant in explaining student outcomes across all three districts. The significance of the school fixed effects observed in two of the districts indicates strong school-level effects, even controlling for included student and teacher characteristics.
Table 5. Individual Student Achievement Outcomes in Three Districts as a Function of Student Characteristics, 11th-Grade Math Teacher Characteristics, TPP, and Test Score History
Note: TPP = Teacher preparation program; CTBS = Comprehensive Tests of Basic Skills (national normed standardized test); NCE = normal curve equivalent; KCCT = Kentucky Core Content Test. The t statistics, shown in parentheses, accompany the coefficient estimates. Ordinary least squares regression is used with school fixed effects (shown; one school is the omitted category in each district). The TPP indicator variable takes a value of 1 for students of teachers who graduated from the most common TPP of teachers in the district, and 0 otherwise. The coefficient indicates the effect of that TPP on student achievement relative to the effect of graduates of any other TPP present in the district, holding all else constant.
*p < .05. **p < .01. ***p < .001.
Finally, and most important for our purposes, we consider the coefficient on the TPP variable. Recognizing that each district’s equation compares only its most common TPP to all others represented by that district’s high school math teachers, we see that there are no significant differences in student performance that can be attributed to the training institution. We ran an additional model that attempts to further separate the innate characteristics of the teacher from the effects of the TPP. The model includes teachers’ ACT score to correct for some of the observable and unobservable differences in teachers that are accepted into the training programs. One complication with this model is that relatively recent hires are the only teachers in the sample for whom we have reliable ACT scores. To ensure a sufficient sample size, the model containing ACT scores could therefore only be run for the largest district. The District 3 student sample size was reduced from 1,137 to 370 by focusing on students of these recently graduated teachers alone. As in the previous regression, the effect of dominant TPP relative to all others remained nonsignificant. On the other hand, the estimated effect of teacher ACT is positive and statistically significant, even controlling for all the variables listed in Table 1.
There is more than one reason that a TPP effect may not have been identified: at the most basic level, perhaps TPP does not matter once the other included student and teacher characteristics are accounted for. We cannot make such a determination with confidence, however, due to unresolved issues in the structure of these data: if a TPP effect truly exists, the inability to identify a significant effect could be due to the fact that all teachers are included, regardless of year in which they graduated. TPPs may have varied in their selection or curricular practices over time, as discussed above. This would make our TPP signal a very noisy one. Furthermore, and most important for our focus in this case study, the approach of comparing a district’s most common TPP to all others is a very weak one. If the TPPs in the combined reference group have strongly divergent effects on student achievement relative to the most common TPP, it would of course be difficult to show an effect of that TPP in this design. Nonetheless, this admittedly problematic type of comparison is necessary due to the dramatic geographic stratification of TPPs across districts and the relatively small numbers of high school math teachers from the nondominant TPP(s) in each district.
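The identification problem created by geographic stratification can be made concrete with a toy design matrix. The Python sketch below is an illustrative assumption rather than an analysis of the Kentucky data: the district and TPP labels are invented, and the helper `design_rank` is hypothetical. It compares the rank of a pooled design (intercept, district dummies, TPP dummies) when every district hires from a single TPP with the rank when a few teachers cross district lines.

```python
import numpy as np


def design_rank(district, tpp):
    """Return (rank, n_columns) of a design matrix containing an intercept,
    district dummies, and TPP dummies (first category of each omitted)."""
    district = np.asarray(district)
    tpp = np.asarray(tpp)
    cols = [np.ones(len(district))]
    cols += [(district == d).astype(float) for d in np.unique(district)[1:]]
    cols += [(tpp == t).astype(float) for t in np.unique(tpp)[1:]]
    X = np.column_stack(cols)
    return int(np.linalg.matrix_rank(X)), X.shape[1]


# Nine hypothetical teachers in three hypothetical districts.
district = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Perfect stratification: each district hires from exactly one TPP,
# so every TPP dummy duplicates a district dummy.
tpp_stratified = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

# Partial overlap: a few teachers work outside their TPP's "home" district.
tpp_mixed = ["A", "A", "B", "B", "B", "C", "C", "C", "A"]

stratified = design_rank(district, tpp_stratified)
mixed = design_rank(district, tpp_mixed)
print("stratified (rank, columns):", stratified)
print("mixed      (rank, columns):", mixed)
```

When the rank falls short of the number of columns, district and TPP effects cannot be separated by any regression, however large the student sample. With partial overlap the model is technically identified, but the TPP contrasts rest entirely on the handful of teachers who cross district lines, which is the situation that motivates the weaker dominant-versus-other comparison used here.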
Alternatives
Although most agree that all public agencies should be held accountable for their outputs—and ideally, outcomes—this case study illustrates the substantial hurdles to implementation of accountability schemes for TPPs that rely on K-12 student measures. The cost of assembling appropriate data, the problems of separating experienced teachers from novice teachers, and (especially for rural districts) the segmentation of teacher markets pose serious problems for a high-stakes quantitative evaluation of teacher preparation. But these problems do not mean that accountability of the programs that prepare teachers must be ignored. This type of research endeavor still provides useful data to decision makers when trying to improve agency performance (Bloom, Hill, & Riccio, 2001; Heinrich, 2002). The answer is not to give up on the idea of using data to evaluate TPPs—but to develop richer approaches rather than expecting only one metric, student value-added achievement, to bear the entire weight of the performance evaluation enterprise for a state’s educational system. Such an effort will require some creativity in developing alternative modes of TPP evaluation and may entail complementing strictly quantitative measures of performance with more qualitative approaches. We next explore some possible approaches.
The recognition that reliance on a single metric or even class of metrics can distort organizational incentives inspired the influential “Balanced Scorecard” approach to managerial performance measurement (Kaplan & Norton, 1996); though originated for the private sector, this model has had some influence in the public sector as part of the New Public Management movement toward improved use of performance measurement (see, for example, Niven, 2008). Focusing on performance measurement for external stakeholders as opposed to internal management, Gormley and Weimer (1999) also caution against using a single outcome variable in an accountability system when there are multiple important outcomes that an agency wishes to measure. Rather, they argue in favor of using multiple measures that can be combined into a single index (although this practice introduces a new set of concerns related to the building of the index and weighting of individual items). Challenges face even multiple-measure approaches in the public context, with costs more certain than potential benefits (Halachmi, 2005). However, with thoughtful consideration an approach with multiple metrics still seems more likely to provide useful information to decision makers than overreliance on a single measure that is several layers removed from the original agency, and must bear the weight of evaluating each of those layers (Figure 1).
Standards-based evaluation is an example of such a multiple-measure endeavor in education. This type of evaluation moves away from the newer focus on outcomes, returning to the older focus on bureaucratic inputs. Specifically, the evaluation uses a number of rating scales based on the behaviors that teachers exhibit in the classroom as well as a review of lesson plans and samples of student work (Heneman & Milanowski, 2004). Using standards-based evaluation data from four U.S. sites, Milanowski, Kimball, and Odden (2005) provide some early indications that the standards-based evaluation is correlated with student achievement. Additional studies show that principal assessments of teachers, such as those incorporated in standards-based evaluations, are correlated with teacher value-added estimates (Harris & Sass, 2007; Jacob & Lefgren, 2008). A strong candidate for a more balanced accountability system for TPPs might be a combination of standards-based evaluation and the VAMs that have been suggested.
Teacher warranty agreements, in theory, provide another potential mechanism for placing the judge of accountability at the school or district level. If implemented at a state systemwide level (rather than single-institution), teacher warranties could provide central government stakeholders with an approach to quality control that does not depend on data analysis prone to the stratification issue. For example, as part of its P-16 initiatives the Board of Regents of the University System of Georgia guarantees the quality of teachers trained in its TPPs (Kettlewell, Kaste, & Jones, 2000). Any school that hires a teacher trained in a state program can “return” that teacher for additional training, without cost to either the TPP graduate or the school, if not fully satisfied with the quality of the teacher within the first 2 years. Such an approach places the definition of quality at the school and district level, at least in theory. One would expect that schools will look at the test scores of the students of new teachers, but we can assume that schools may also utilize alternative means of assessing quality, such as review of lesson plans, classroom observations, or evaluation of students’ sample work. Guaranteeing graduate quality in a meaningful way would devolve the responsibility for judging quality to the lowest level decision makers closest to the actual performance being judged, and would allow the development of competing measures of quality. These competing measures subsequently can provide information to other schools, districts, and states. However, it is not clear whether the Georgia policy functions more as a symbolic device valued for its salutary effect on TPPs, or as an actual post hoc mechanism for TPP graduate improvement. In his letter opening the Board of Regents’ 2007 report on teacher preparation (2007), Chancellor Erroll Davis notes, “Since instituting the guarantee, we have had no reports of school districts being dissatisfied with the teachers we prepare.” There are obvious and serious principal-agent issues in the warranty approach to TPP accountability as a mechanism of state evaluation of TPP quality, with the institutions/TPPs having an incentive to conceal warranty claims from state-level policymakers. However, if the system were structured with the claims process operating through the state office of educational accountability rather than directly to the TPP or its institutional system, the signal of a developing quality issue might not be lost.
Concluding Comments
This case study uses data from Kentucky school districts to illustrate some of the particular, underrecognized challenges of using the student test performance metric to evaluate TPPs. While many policymakers and scholars agree on the positive value of introducing accountability for process and outcomes in all areas of public service, schools provide an excellent example of some of the recent attempts to make data-driven accountability a reality. Many stakeholders are hoping to use student test scores as the single outcome to measure performance not only of K-12 teachers, schools, and districts but also as a measure of the performance of higher education institutions’ TPPs. But as demonstrated in this case study, linking test scores in a given year to the quality of a specific TPP is problematic at best and may not be feasible in rural regions. In Kentucky, as in many states, rural schools tend to hire disproportionately from a single TPP. Policymakers will have to take a close look at the geographic stratification of teacher training and hiring in their own state if they are to engage in evaluations of teacher training programs relying only on student test scores.
As noted earlier, 56% of the school districts in the United States are in rural areas (Provasnik et al., 2007). This reality means that implementation of teacher preparation evaluations in the majority of school districts will face challenges similar to those we identify related to geographic stratification if such evaluations rely only on value-added models of student achievement. Although it is beyond the scope of this article to recommend a specific alternative form of evaluation, we do suggest that overreliance on the single measure of student value-added achievement gains as a device for measuring the effectiveness of teacher training programs is premature. As research on school effectiveness continues, perhaps we will learn more about the attributes of a good teacher. Some argue good teaching is tied to the aptitude of the teacher. Others argue it is the ability of the teacher to manage a classroom. If it is aptitude, we need not evaluate TPPs at all but merely evaluate their selection of teachers into training programs through such devices as SAT or ACT scores. On the other hand, classroom management can be taught by TPPs and we should be able to evaluate whether programs are effectively training teachers to manage their classrooms without relying exclusively on the K-12 students’ test scores. Peers and professionals can observe classroom management styles and assess whether the teacher is a good manager. In a wider public management context, this work suggests that any single performance indicator that is expected to signal performance of multiple actors and multiple institutions that themselves interact in complex ways may fall short. Our case study here is analogous to judging the performance of public administration academic programs by using a single outcome metric drawn from all budget offices or all city management agencies in which our graduating students are placed. Few in the public administration scholarly world or in those policy arenas would accept such a single-dimensional indicator of performance. Accountability is desirable; yet the way in which it is implemented is critical to its success.
Acknowledgements
The authors are indebted to the Kentucky Education Professional Standards Board for the opportunity to use administrative data that are not publicly available and to Terry Hibpshman for his role in extracting the necessary data. Jacob Fowles provided valuable assistance in manuscript preparation.
The author(s) declared no potential conflicts of interest with respect to the authorship and/or publication of this article.
The author(s) received no financial support for the research and/or authorship of this article.
