Abstract
Despite growing calls for more accountability of teacher education programs (TEPs), there is little consensus about how to evaluate them. This study investigates the potential for using observational ratings of program completers to evaluate TEPs. Drawing on statewide data on almost 9,500 program completers, representing 44 providers (183 programs) in Tennessee across 3 years, we investigate multiple models to estimate TEP quality. Results suggest that using observational ratings to evaluate TEPs has promise. We were able to detect significant and meaningful differences between TEPs, which were fairly robust across modeling approaches. Moreover, TEP rankings based on observational ratings were positively and significantly related to rankings based on student achievement gains.
Introduction
As part of increasing accountability and oversight, policymakers and researchers agree that better systems for evaluating TEPs are sorely needed. Yet, there is little consensus about how best to assess TEP quality. Approaches have generally fallen into two broad categories: (a) input-oriented and (b) output-oriented. The former have focused on the amount and quality of preparation inputs to determine whether they meet certain standards. State certification requirements and accreditation processes have historically been input-oriented. For example, most states in the United States have minimum requirements for length of student teaching experiences; likewise, accreditation agencies often require that clinical experiences are of sufficient duration.
Among the highest profile uses of input-oriented approaches, the National Council on Teacher Quality (NCTQ) developed a method for evaluating all U.S. TEPs based on inputs, many of which are gleaned from information in course syllabi. Based on its review of research, NCTQ created a number of preparation standards, rated all programs against these standards, and publicized ratings in U.S. News & World Report. Input-oriented approaches such as these have the potential to provide preparation programs with usable information for how to improve on many dimensions of preparation. However, their promise for promoting program growth depends on the quality of criteria; in turn, criteria are only as good as the research on which they are based. Unfortunately, a common critique of the body of teacher education research, on which the NCTQ and other preparation standards are based, is that it is thin (Grossman, Ronfeldt, & Cohen, 2011; Wilson, Floden, & Ferrini-Mundy, 2001). Even NCTQ (n.d.) acknowledges that their criteria draw on “a dearth of actionable, gold-standard research in the field” (p. 1), concluding that only 30 articles, out of more than 3,000 reviewed, met their criteria for being well designed and directly linking preparation to teacher effectiveness. Until the research base about which preparation features predict teaching quality is more substantial, it is questionable whether one should draw strong conclusions about input-oriented evaluations of TEPs.
Output-oriented approaches to TEP assessment evaluate the quality of program graduates rather than program features. An advantage of output-oriented approaches is that they do not require consensus about what types of preparation are “good.” A challenge, though, is in disentangling the impact of TEPs on graduate performance from other influences, including school workplace influences and factors prior to formal preparation. 1
Spurred largely by requirements for participation in the Race to the Top competition, the vast majority of output-oriented approaches today use graduates’ value added to student achievement scores (value-added models [VAMs]) to evaluate TEPs. Among studies using output-oriented approaches, evidence is mixed. Some have found meaningful differences between programs, suggesting that preparation can have significant effects on teachers’ later performance (Henry, Purtell, et al., 2014; Bastian & Patterson, 2014; Boyd, Grossman, Lankford, Loeb, & Wyckoff, 2009; Gansle, Noell, & Burns, 2012). Using a school fixed effects (SFE) approach to examine all elementary programs that supply teachers to New York City schools, Boyd et al. (2009) found differences between the average and the most effective programs to be 0.07 standard deviations in both math and English language arts (ELA). The authors argued that this difference was equivalent in magnitude to the average difference in learning between students eligible for free or reduced-price lunch and those who are not. They estimated separate effects for each elementary or early childhood program within an institution (university or alternative provider). When differentiating “program” even further, so as to distinguish elementary graduate-level from elementary undergraduate-level programs, the authors identified larger differences between the most effective and average programs: 0.18 standard deviations in math and 0.10 in ELA. Although the authors claimed the differences between programs to be meaningful, only 3 of the 23 programs were significantly different from the mean in math.
Estimating TEP effects in Louisiana, Gansle et al. (2012) used hierarchical linear modeling (HLM) to account for the nested structure of data (students nested within teachers who are nested within schools). Defining TEPs as programs within institutions (e.g., undergraduate programs, master’s degree programs, certification-only), findings were mixed across content areas. Even so, the authors identified 3 out of 10 programs that graduated teachers who consistently outperformed average new teachers.
Also using HLM, Henry, Purtell, et al. (2014) compared North Carolina institutions (colleges or universities), rather than programs within institutions, based on graduates’ VAM estimates. Compared with graduates from all other pathways combined (alternative in-state, out-of-state, etc.), 10 institutions had more effective graduates on average, 2 institutions had less effective graduates, and 3 institutions had graduates who performed at statistically similar levels.
Other studies based on the VAMs of program graduates have found little to no differences between TEPs, indicating that variation in teacher effectiveness is not explained by differences in preparation (Goldhaber, Liddle, & Theobald, 2013; Koedel, Parsons, Podgursky, & Ehlert, 2012; Mihaly, McCaffrey, Sass, & Lockwood, 2013; Osborne et al., 2012; von Hippel, Osborne, Lincove, Mills, & Bellows, 2014). Defining TEPs at the institution (college or university) level, Goldhaber et al. (2013) estimated TEP effects using several modeling strategies including ordinary least squares (OLS), SFE, and district fixed effects. Rather than estimating TEP effects using only recent graduates, an important contribution of this study is that it used all graduates, regardless of years of experience, and used an innovative method to adjust for the decay of preparation effects over time. The study indicated that preparation programs explain little variation in teacher effectiveness, as measured by VAM; moreover, teachers prepared by Washington state institutions performed no better than those prepared out of the state. Even so, the authors argued that the difference between the best and worst programs was meaningful—the difference in graduate performance was roughly the difference between a first- and fifth-year teacher.
Comparing TEPs in Missouri, Koedel et al. (2012) estimated program effects with and without SFE. The authors argued that prior studies that clustered standard errors at the classroom rather than teacher level, including many of those described above, overstate the independence of observations and, thus, statistical precision. After clustering standard errors at the teacher level, Koedel and colleagues found that differences across programs explain only 1% to 3.2% of the total variation in teacher effects. Because the sample included only teachers from self-contained elementary classrooms, the comparison was apparently between elementary TEPs only, similar to Boyd et al. (2009). The authors found differences between the top and bottom programs to be large (about 0.12 points) and comparable with those reported by Boyd et al. (2009); however, the authors concluded the difference was entirely due to estimation-error variance rather than real differences in preparation. Likewise, in Texas, von Hippel et al. (2014) found differences between TEP estimates to be quite small and noisy. Compounded by the problem of multiple comparisons, the authors concluded, “a TPP’s position near the top or bottom of the rankings is more often due to luck than to quality” (p. 28).
There are many possible explanations for the mixed evidence from studies that have estimated TEP effects based on graduates’ VAM. These studies were done in different districts and states, so it is possible that TEP quality varies more in some labor markets than others. In addition, there is no consensus about how best to model TEP effects based on graduate effectiveness (National Research Council [NRC], 2010). The analytic methods used across studies vary on many features that could lead to different results, including (a) how “program” is defined, (b) the reference category used, (c) how standard errors are (not) adjusted to account for the nested structure of data, (d) the use of fixed effects, and (e) the degree to which scholars consider signal to noise in their analyses.
Although evaluating TEPs based on student achievement is the most common output-oriented approach, many have expressed technical and conceptual concerns over this approach (Hill, Kapitula, & Umland, 2011; Kane & Staiger, 2012; Lincove, Osborne, Dillon, & Mills, 2013; Mihaly et al., 2013; Newton, Darling-Hammond, Haertel, & Thomas, 2010; Rothstein, 2009). Among concerns, students are not assigned randomly to teachers, so using student achievement to evaluate teachers may unfairly penalize both the teachers who work with challenging student groups and the programs that prepare these teachers (Hill et al., 2011; Rothstein, 2009). 2 Because estimates are only available for teachers in tested grades and/or subjects, TEPs that specialize in untested grades (e.g., early elementary) and subjects (e.g., physical education) are excluded from the analysis or estimated with a smaller number of teachers in tested grades/subjects. Moreover, using student achievement to estimate teacher quality conflates teaching with learning in ways that can be conceptually problematic. As Fenstermacher and Richardson (2005) deftly argue, good teaching does not always produce learning because other factors (such as social surroundings and preexisting student motivation) are known to influence, and sometimes complicate, a direct relationship between them. Moreover, there is emerging evidence that the kinds of tests typically used in state/district assessments are insensitive to more ambitious forms of teaching, thus failing to detect differences in instructional quality that may actually exist (Grossman, Cohen, Ronfeldt, & Brown, 2014). 3 Finally, VAM approaches may tell programs how they rank against one another, but do little to guide program improvement (Lincove et al., 2013).
Observation ratings (OR) provide several potential benefits over the use of VAM scores for evaluating TEP effectiveness. First, OR are available for all or most teachers, whereas VAM is available only for teachers in tested grades and subjects. 4 Second, observational rubrics measure instruction directly rather than making inferences about instructional quality based on student outcomes. Given that TEPs are focused primarily on improving their graduates’ instructional quality, direct measures of instruction, such as observational ratings, may be a more appropriate measure than student achievement for evaluating program effects. This is especially the case given that even high-quality instruction may not always yield positive student outcomes (Fenstermacher & Richardson, 2005); if so, student achievement would fail to capture a program’s impact on instructional quality. Finally, observational evaluations typically provide detailed information on performance in different domains of instruction, which has potential to offer TEPs information about specific areas for program improvement. For example, if a program learns that its graduates perform poorly, on average, compared with neighboring programs in the domain of classroom management, then it can adjust its coursework and/or fieldwork to improve opportunities to learn in this domain. Whether or not receiving this kind of information actually spurs program improvement is an empirical question that needs to be tested.
Despite these likely advantages, we are aware of no prior research attempting to evaluate TEP quality based on graduates’ observational ratings. One study used OR to measure the effectiveness of teacher preparation, but this report focused on differences between pathways of entry rather than programs or institutions. Specifically, Bastian and Patterson (2014) evaluated differences between six preparation routes (e.g., public institution, private institution, alternative, etc.), in terms of graduates’ OR, VAM, persistence, and distribution across North Carolina public schools. OR were constructed as a dichotomous outcome—“above proficient” and “proficient or below.” Using logistic regressions with teacher and school controls, the authors found that, compared with traditionally prepared teachers, teachers from most alternative pathways received lower OR and only Teach for America corps received higher OR. Reporting on preparation routes is a useful exercise in assessing whether preparation matters; however, program-level information is more likely to spur program improvement. Moreover, accreditation and approval processes, including the proposed federal regulations, target programs or institutions rather than pathways. Thus, our study investigates whether it is possible to use graduates’ observational ratings to detect differences between programs and institutions, rather than pathways. To do so, however, we first needed to settle on a method for estimating TEP quality. Given little consensus about which modeling approach to use in the VAM literature, and no prior research, to our knowledge, using observational ratings, we begin by investigating a number of modeling approaches and how results are similar and different across them. We ask the following:
To what extent do TEP ratings based on graduates’ observational ratings vary by modeling approach?
Are there differences between TEPs in terms of graduates’ OR?
How do TEP rankings based on graduates’ Tennessee Value-Added Assessment System (TVAAS) scores compare with TEP rankings based on graduates’ observational ratings?
Teacher Evaluation in Tennessee
As a part of Tennessee’s First to the Top Act, the state established and implemented a new teacher evaluation system during the 2011–2012 academic year. Under the new evaluation system, teachers’ evaluations comprise three areas: (a) student test score growth as measured by TVAAS, (b) student achievement on another selected measure, and (c) classroom observation rubrics. The weight placed on each of these areas depends on whether the teacher is in a tested subject. 5 The Tennessee State Board of Education adopted the Tennessee Educator Acceleration Model (TEAM) as the statewide observational rubric, which is an adaptation of the National Institute for Excellence in Teaching’s TAP rubric. However, they also approved three alternative models: Project COACH, Teacher Effectiveness Measure (TEM), and Teacher Instructional Growth for Effectiveness and Results (TIGER). Of the 137 school districts during the 2011–2012 academic year, 123 districts (90%) used the TEAM rubric, 1 district used Project COACH, 1 district used TEM, and the remaining 12 districts used the TIGER rubric.
Teachers in our sample were observed on average 4 times each year, ranging from a minimum of 1 to a maximum of 24 times in a given year. The number of observations varies based on a teacher’s licensure status and prior overall evaluation scores. An apprentice (novice) teacher will typically have four observations each year—one announced and one unannounced observation each semester. An administrator or teacher leader is responsible for observing and rating teachers, and observations typically include pre- and/or postobservation conferences. In addition, administrators provide an end-of-year, summative evaluation rating. All administrators and teacher leaders must undergo annual training and pass a certification exam prior to evaluating teachers. The TEAM rubric includes four domains: instruction, planning, environment, and professionalism with several indicators associated with each of the domains. Raters are required to observe multiple domains during a classroom visit, with the exception of the professionalism domain, which is evaluated only at the end of the year.
In addition to the OR, the state uses TVAAS scores as a different measure of teacher quality. There are several VAMs currently used to estimate teacher effects (for detailed comparisons of VAMs, see Guarino, Reckase, & Wooldridge, 2015; Henry, Rose, & Lauen, 2014; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; Sass, Semykina, & Harris, 2014). TVAAS, developed by William Sanders and colleagues in the late 1990s, is among the most prominent (Sanders, Saxton, & Horn, 1997). TVAAS models use only students’ current and prior test scores to estimate teacher effects. This methodological approach has been criticized for excluding student demographic characteristics that are thought to influence test performance (Kupermintz, 2002; Linn, 2001). However, some studies have shown that adding these covariates has little influence on teacher effect estimates (Ballou, Sanders, & Wright, 2004; McCaffrey et al., 2004).
Method
Data Overview
This study draws on several statewide data sets from the Tennessee Higher Education Commission (THEC) and Tennessee Department of Education (TDOE). First, it uses THEC data on all program completers in the state, including their TEP name, degree awarded, endorsements received, and license type (i.e., alternative or traditional). Among individuals subsequently hired in Tennessee, we linked program completer information to TDOE OR, years of experience, and school-level demographic information about the schools in which they worked. 6 Last, to answer the final research question, we linked the subset of program completers with year-by-subject TVAAS information.
Sample
Our primary analytic sample included program completers who graduated from any Tennessee TEP between 2009–2010 and 2012–2013 and were subsequently employed in Tennessee public schools during the 2011–2012 through the 2013–2014 academic years: 9,482 teachers staffed in 1,553 schools. 7 Some analyses also include all other teachers employed in Tennessee public schools during the study period: 56,254 teachers (non–program completers) employed in 1,726 schools.
We define and analyze TEPs at two different levels—at the “institution” level (e.g., university, college, or provider) and at the “program type” level. Here, we define “program type” as a combination of the institution, degree awarded (graduate or undergraduate), and endorsement area (elementary, secondary, 8 or special education) of each teacher. For example, “Teacher University Undergraduate Elementary Program” and “Teacher University Undergraduate Special Education Program” constitute two different TEPs in our “program type” analyses. Many prior studies using graduates’ VAM scores compare institutions without considering within-institution differences in preparation. However, programs within the same institution can differ substantially in how they prepare teachers (Boyd et al., 2012; Greenberg & Dugan, 2015).
It is important to note that how we define “program type” is not meant to be consistent with the ways that universities or colleges define “programs” within their institutions. Rather, given the substantial variation in preparation experienced by different candidates within institutions, our goal was to group candidates with peers who experienced relatively more similar forms of preparation. Although defining “program” as universities do would identify preparation even more precisely, the small size of many programs, in conjunction with our strategy of keeping only TEPs that can be linked to at least 10 graduates, makes generating estimates for many infeasible. Even with our existing definition for program type, we lose more than 20% of TEPs when we require at least 10 graduates per TEP. In other words, there is a trade-off—the more precisely we define program, the smaller share of programs for which we can estimate effects. Thus, our approach to defining “program type” likely does a better job than “institution” of identifying differences in preparation, but still allows us to generate precise estimates for about 80% of TEPs. For simplicity, we refer to “program types” as “programs” for the remainder of this article.
A limitation of our data is that we are able to identify only teachers’ most recent preparation programs and institutions. For individuals who may have completed preparation across more than one program or institution, we do not have historical data on prior preparation. A concern here is that we may then overestimate the effects of the most recent TEP on a teacher’s quality. Thus, estimates for TEPs that enroll a larger share of teachers with prior preparation will likely be inflated.
Program completers graduated from 44 institutions (university, college, or providers) across the state, representing 183 programs. However, we generated estimates only for institutions and programs that had at least 10 graduates employed in Tennessee across the 3-year analytic period. This reduced the number of unique institutions and programs in our sample to 39 and 118, respectively. For program completers who graduated from TEPs with fewer than 10 graduates across the 3 years of data, we combined these individuals into a single “small TEP” group in our analyses. This maintained our full analytic sample, while allowing us still to test how graduates from small TEPs (combined) compared with graduates from other TEPs. According to Noell, Porter, and Patt (2007), TEP estimates based on 10 to 24 graduates may be unstable, suggesting our threshold may not be high enough. Rather than removing programs and institutions with 10 to 24 graduates, we elected instead to use empirical Bayes methods (described later) to adjust for sampling error.
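The grouping rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the dictionary layout, and the toy program labels are all hypothetical, and only the 10-graduate threshold and the pooled “small TEP” label come from the text.

```python
from collections import Counter

MIN_GRADUATES = 10  # threshold stated in the text


def assign_tep_groups(teacher_to_tep, min_graduates=MIN_GRADUATES):
    """Keep a teacher's TEP label if the TEP has enough graduates,
    otherwise pool the teacher into a single 'small TEP' group.

    teacher_to_tep: dict of teacher id -> TEP label (hypothetical layout).
    A "program type" label is modeled here as a tuple of
    (institution, degree level, endorsement area).
    """
    counts = Counter(teacher_to_tep.values())
    return {
        teacher: (tep if counts[tep] >= min_graduates else "small TEP")
        for teacher, tep in teacher_to_tep.items()
    }


# Toy data (invented): one program with 12 graduates, one with 3.
big = ("Teacher University", "undergraduate", "elementary")
small = ("Teacher University", "undergraduate", "special education")
toy = {f"a{i}": big for i in range(12)}
toy.update({f"b{i}": small for i in range(3)})

groups = assign_tep_groups(toy)
```

Pooling rather than dropping keeps every teacher in the sample, which is the property the paragraph above emphasizes.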
Table 1 summarizes the characteristics of program completers and non–program completers (including more experienced teachers regardless of where they were prepared, as well as early-career teachers who did not graduate from a Tennessee TEP) in our analytic sample, as well as the characteristics of the schools in which they were employed. On average, the program completers had about 1.5 years of teaching experience, which is appropriate given that this sample consists only of recent graduates from Tennessee TEPs. Despite being identified as recent graduates in program completer files, some teachers were identified by state administrative files as having substantial amounts of teaching experience (max value of 26 years). However, according to administrative files, 85% of the program completer sample had 2 or fewer years of experience whereas 98% had 5 or fewer. More than 60% of individuals with 5 or more years of experience were alternatively certified teachers. There are various possible explanations for recent graduates having substantial prior experience. For example, some individuals may have taught in another state and, on moving to Tennessee, entered a certification program for local state certification. Others may have been substitute teachers or previously taught in some capacity that did not require state certification (e.g., private school teacher). To adjust for differences in prior teaching experience, all models control for teaching experience. 9 Non–program completers averaged about 13.5 years of teaching experience; 74% had 6 or more years of experience.
Table 1. Descriptive Statistics on TEP Completers
Note. Secondary Endorsement includes middle grades. Teachers with a traditional license are compared with teachers with an alternative license. Teachers with an undergrad degree are compared with those with a graduate degree, which includes masters and doctorates. TEP = teacher education program.
Table 1 also suggests that just over three quarters of program completers completed traditional certification routes, whereas just under three quarters received an undergraduate degree while pursuing certification. The largest share of program completers received an elementary endorsement (44%), followed closely by secondary (36%); the remaining 10% received special education endorsements. On average, program completers taught in schools with students who were mostly White (69%) and eligible for free or reduced-price lunch (62%); non–program completers taught in schools with a larger majority of White students (74%) and fewer students eligible for free or reduced-price lunch (57%). Among both groups of teachers, only 1% to 2% of students were classified as English language learners (ELL), whereas 9% to 11% were classified as receiving special educational services. The largest share of program and non–program completers worked in elementary schools (36%–37%), whereas the fewest (14%–15%) worked in middle schools.
Measures
Observational Ratings
During each observation using the TEAM rubric, teachers were rated on a scale of 1 to 5 on each indicator within a domain, with 1 being “significantly below expectations,” 3 “at expectations,” and 5 “significantly above expectations.” 10 The aggregate OR used in our analyses were the average of all individual indicator ratings for a given teacher during a given year. 11 Figure 1 shows the distribution of teacher-year OR across the teachers certified in Tennessee. The teachers in the sample received an average observation score of 3.6, which equates to being rated above “at expectation” on the TEAM rubric. However, there is a sufficient amount of variation across the ratings, including 12% of teachers with average ratings below “at expectation.” For interpretability, we standardized the average OR to have a mean of zero and standard deviation of one.
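As a rough sketch of the aggregation and standardization just described, the following averages indicator ratings within each teacher-year and then converts the averages to z-scores. The tuple layout and the toy ratings are invented, and the use of the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def teacher_year_means(ratings):
    """Average all indicator ratings for each (teacher, year) pair.

    ratings: iterable of (teacher_id, year, indicator_score) tuples
    (a hypothetical layout).
    """
    buckets = defaultdict(list)
    for teacher_id, year, score in ratings:
        buckets[(teacher_id, year)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def standardize(values_by_key):
    """Rescale values to mean 0 and standard deviation 1.

    Assumption: population standard deviation (pstdev).
    """
    values = list(values_by_key.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {key: (value - mu) / sigma for key, value in values_by_key.items()}


# Toy ratings on the 1-to-5 scale (invented).
toy_ratings = [
    ("t1", 2012, 3), ("t1", 2012, 5),
    ("t2", 2012, 2), ("t2", 2012, 4),
    ("t3", 2012, 5), ("t3", 2012, 5),
]
raw = teacher_year_means(toy_ratings)
z = standardize(raw)
```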

Figure 1. Distribution of teacher-year aggregate observation ratings.
TVAAS
The TDOE provided us TVAAS estimates for each teacher by subject and year, standardized within grade, subject, and year. Although we would have preferred to construct VAM scores ourselves, we were unable to obtain student-level information. Even so, TVAAS estimates have been shown to be highly correlated with “true” teacher effects (Henry, Rose, et al., 2014). To make our TVAAS and observational evaluation analyses as comparable as possible, we aggregated the TVAAS file at the teacher-by-year level. In other words, for each teacher in each year, we averaged all of her TVAAS scores across subject areas and grade levels. Figure 2 shows the distribution of teacher-year TVAAS across the teachers certified in Tennessee.

Figure 2. Distribution of teacher-year mean TVAAS scores.
Our approach to aggregating TVAAS information to construct a single teacher quality measure for each teacher in each year is admittedly crude, as it assumes comparability across subject areas and grade levels. However, teacher education institutions and programs prepare teachers across a range of subjects and grades, so any estimate of TEP quality needs to aggregate information across these domains. Moreover, observational ratings use the same rubrics across teachers, regardless of subject and grade level. Given that the goal of this analysis was to compare TEP estimates using observational ratings with estimates using TVAAS, aggregating TVAAS information across grades and subjects made these approaches and samples comparable. An alternative could be to construct TEP estimates separately for each subject and grade; however, sample sizes make this infeasible in some subjects. In alternative analyses, we tried constraining our samples to only teachers of mathematics, and separately for reading, and then reproduced comparisons. One reason why we conducted these additional analyses is that the existing value-added literature finds larger teacher effects in math than reading (cf. Kane & Staiger, 2012).
Variation in Teacher Quality Explained by TEPs
Using OLS models to estimate either teacher-by-year OR or TVAAS scores as a function of TEP indicators and no other covariates, we found that institutions explained about 2% of variation in graduates’ TVAAS scores and 4% of variation in OR; programs explained about 4% of variation on both outcomes. These results are similar to those from prior analyses (Goldhaber et al., 2013; Koedel et al., 2012). A possible limitation to these estimates, though, is that the variation in our measures of teacher quality is likely due to other sources beyond teacher quality, including variation in school working conditions and, in the case of OR, rater tendencies/biases. Thus, we adapted an approach used by Koedel et al. (2012) to (a) better isolate variation in teacher quality measures due specifically to teacher quality (rather than other sources) and (b) determine the share of variation in (a) explained by TEPs. Appendix Table S1 (available in the online version of the journal) summarizes results from these analyses. Using our SFE model (described below), the first row summarizes the change in R-squared when TEP indicators were added to the base model. The second row shows the change in R-squared when teacher indicators were added. The third row summarizes the ratio of the change in R-squared in the first row to that in the second row. The results suggest that differences between TEPs explain about 3% of the total variation in teacher effects on OR and 2% to 5% of the variation in teacher effects on TVAAS scores. 12 Although finding TEPs to explain about 3% of variation in teacher quality may seem small, it could reflect meaningful differences between TEPs, so long as this variation is due to signal rather than noise (von Hippel et al., 2014). Otherwise, multiple comparisons between TEPs can yield significant and large differences between estimates even when real differences do not exist.
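The ratio in the third row reduces to simple arithmetic on three R-squared values. The sketch below illustrates that arithmetic only; the function name is hypothetical and the numbers are invented for illustration, not taken from Appendix Table S1.

```python
def share_of_teacher_variation(r2_base, r2_with_tep, r2_with_teacher):
    """Ratio of the R-squared gain from adding TEP indicators to the
    R-squared gain from adding teacher indicators to the same base model."""
    gain_tep = r2_with_tep - r2_base
    gain_teacher = r2_with_teacher - r2_base
    if gain_teacher <= 0:
        raise ValueError("teacher indicators must add explanatory power")
    return gain_tep / gain_teacher


# Hypothetical R-squared values, invented for illustration only.
share = share_of_teacher_variation(
    r2_base=0.30, r2_with_tep=0.309, r2_with_teacher=0.60
)
print(f"TEP share of teacher-level variation: {share:.1%}")
```

Because teacher indicators nest the TEP indicators (each teacher belongs to one TEP), the gain from teacher indicators bounds the gain from TEP indicators, so the ratio falls between 0 and 1.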
Analytic Approach
To understand whether TEP ratings varied by modeling approach, we first estimated TEP effects based on graduates’ observational ratings using various regression modeling approaches including OLS regression, HLM, and SFE. Because model comparisons, rather than comparisons of TEP estimates, were our focus, we kept all teachers in the state to increase statistical power. For each modeling approach, we generated TEP estimates for three groups of teachers: those certified between 2009–2010 and 2012–2013 to teach in the state, out-of-state-certified teachers with 3 or fewer years of experience, and experienced teachers with more than 3 years of experience (regardless of being prepared in or out of state). The latter group was used as the reference category in these models. Estimates from various models were saved and then adjusted using empirical Bayes techniques (see Appendix 1 available in the online version of the journal).
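The details of the empirical Bayes adjustment are given in the online appendix, so the following is only a generic sketch of one common shrinkage form, not necessarily the authors' procedure. It assumes each TEP estimate comes with a standard error, estimates the signal variance as the variance of the estimates minus the mean squared standard error, and pulls each estimate toward the grand mean in proportion to its unreliability. All names and numbers are hypothetical.

```python
from statistics import mean, pvariance


def empirical_bayes_shrink(estimates, std_errors):
    """Shrink noisy estimates toward their grand mean.

    Assumed form: shrunk = grand_mean + reliability * (estimate - grand_mean),
    with reliability = signal_var / (signal_var + se**2).
    """
    grand_mean = mean(estimates)
    noise_var = mean([se ** 2 for se in std_errors])
    # Signal variance cannot be negative, so floor it at zero.
    signal_var = max(pvariance(estimates) - noise_var, 0.0)
    shrunk = []
    for est, se in zip(estimates, std_errors):
        denom = signal_var + se ** 2
        reliability = signal_var / denom if denom > 0 else 0.0
        shrunk.append(grand_mean + reliability * (est - grand_mean))
    return shrunk


# Toy TEP estimates in standard deviation units (invented):
# two precisely estimated programs and two noisy ones.
toy_estimates = [0.30, -0.30, 0.10, -0.10]
toy_std_errors = [0.05, 0.05, 0.20, 0.20]
toy_shrunk = empirical_bayes_shrink(toy_estimates, toy_std_errors)
```

In this form, programs with few graduates (large standard errors) are pulled more strongly toward the mean, which is why shrinkage is a natural response to the instability of estimates based on 10 to 24 graduates.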
OLS models used to estimate program effects were based on the following equation:
Here, the OR in Year t, of Teacher i, in School j, in District k is a function of the TEP fixed effects (TEP), time-varying indicators for years of teaching experience (Exp), a vector for time-varying school characteristics (S; see Table 1 for list of school characteristics), an indicator for whether the district uses the TEAM evaluation rubric or another state-approved evaluation rubric (D), year indicators, and an error term. We clustered the standard errors at the teacher level to account for nonindependence of observations for the same teacher.
Our second approach to estimating program effects used a four-level HLM with observation years at Level 1, teachers at Level 2, schools at Level 3, and districts at Level 4. The reduced form equation is as follows:
Here, the OR in Year t, of Teacher i, in School j, in District k is a function of the same sets of predictors as described in Model 1, but the model also includes mutually independent random intercepts associated with time, teachers, schools, and districts.
Our third analytic approach replaces school covariates, S, in Equation 1 with SFE, φ_j, as summarized below:
Because they would be absorbed by SFE, indicators for whether districts used the TEAM rubric are not included in Equation 3. Similar to Equation 1, we clustered the standard errors at the teacher level to account for nonindependence of observations for the same teacher. In two different specifications, we reproduced models using Equation 3 but (a) added time-varying school characteristics and (b) replaced SFE with school-by-year fixed effects. Although SFE adjust for time-invariant differences between schools, biased estimates may remain due to time-varying factors. Adding time-varying characteristics helps adjust for these factors; however, unobserved, time-varying school characteristics could still bias results. Incorporating school-by-year fixed effects effectively adjusts for observed and unobserved time-varying factors by comparing teachers (and TEPs) only with peers in the same school and year. The results with the addition of time-varying school characteristics and school-by-year fixed effects were very similar; therefore, we report only on our Equation 3 results.
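One plausible way to write the three specifications described above is sketched here. The terms TEP, Exp, S, D, and φ_j follow the text; the coefficient symbols (β), the year effects (γ_t), the error terms, and the random-intercept symbols (v, u, r, e) are assumptions introduced for illustration and may differ from the authors' notation. TEP, Exp, and S stand for vectors of indicators or covariates, so the corresponding β terms are coefficient vectors.

```latex
\begin{align}
OR_{tijk} &= \beta_0 + \beta_1\,\mathrm{TEP}_{i} + \beta_2\,\mathrm{Exp}_{ti}
             + \beta_3\,S_{tj} + \beta_4\,D_{k} + \gamma_t + \varepsilon_{tijk} \tag{1} \\
OR_{tijk} &= \beta_0 + \beta_1\,\mathrm{TEP}_{i} + \beta_2\,\mathrm{Exp}_{ti}
             + \beta_3\,S_{tj} + \beta_4\,D_{k} + \gamma_t
             + v_{k} + u_{jk} + r_{ijk} + e_{tijk} \tag{2} \\
OR_{tijk} &= \beta_0 + \beta_1\,\mathrm{TEP}_{i} + \beta_2\,\mathrm{Exp}_{ti}
             + \phi_{j} + \gamma_t + \varepsilon_{tijk} \tag{3}
\end{align}
```

In this sketch, Equation 2 adds district (v), school (u), teacher (r), and observation-year (e) random intercepts to the predictors of Equation 1, and Equation 3 drops S and D because the school fixed effect φ_j absorbs time-invariant school and district characteristics.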
SFE
Given that we use SFE models (Equation 3) as our preferred model for much of this article, in this section, we discuss possible advantages and limitations; we also consider model assumptions. As described earlier, a major challenge in estimating TEP quality based on graduate performance is in disentangling school from TEP effects on teacher quality, especially given that TEPs do not randomly supply teachers to schools (Mihaly et al., 2013). Although we include school covariates in both OLS and HLM models, both OR and TEP quality could vary with unobserved school characteristics. Including SFE effectively adjusts for time-invariant observed and unobserved school (and district) characteristics that influence OR. Given prior literature suggesting that there may be systematic differences between principals and other raters in terms of how they evaluate teachers on observational instruments (Chaplin, Gill, Thompkins, & Miller, 2014; Ho & Kane, 2013; Sartain et al., 2011), an advantage of SFE models is that they adjust for rater differences because typically the same rater, usually a principal, evaluates all teachers in a given school.
However, there are limitations to SFE models as well. First, SFE models tend to produce estimates with inflated standard errors (Mihaly et al., 2013). In addition, any schools, and teachers in these schools, that do not vary on the outcome measure are excluded from SFE models. This can bias estimates as small schools and schools with less turnover tend to be excluded whereas large schools and schools with higher turnover get oversampled (Mihaly et al., 2013; von Hippel et al., 2014).
Identifiability and homogeneity
To use SFE to estimate TEP effects, Mihaly, McCaffrey, Sass, and Lockwood (2013) argue that two assumptions must be met: identifiability and homogeneity. Identifiability requires that all TEPs be connected to each other, even if indirectly (Mihaly et al., 2013). To be identified in our SFE models, a TEP must have at least one graduate who, across 3 years of data, has a colleague in the same school who graduated from a different TEP. To the degree that excluded teachers, TEPs, and schools are not representative of the larger population, estimates can be biased. Moreover, all TEPs must be connected in a single stratum; if one group of TEPs supplies one stratum of schools whereas another group of TEPs supplies a different stratum, then SFE models are infeasible. Like Mihaly et al. (2013), we find that a window of 3 years of evaluation data allows for TEPs in our sample to be connected in a single stratum. 15 Only 19% of the schools in the sample are staffed with beginning teachers from a single TEP. This percentage is slightly higher than what Goldhaber et al. (2013) report (15%), but still not large enough to pose concerns over TEP connectivity.
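The connectivity requirement can be checked mechanically. The sketch below treats each school as linking all the TEPs whose graduates it employs and then groups TEPs into strata with a union-find structure; the function name, the input format, and the example data are hypothetical and are not drawn from the study's data.

```python
def tep_strata(school_to_teps):
    """Group TEPs into strata: two TEPs share a stratum if a chain of schools links them."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups fast
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for teps in school_to_teps.values():
        teps = list(teps)
        for tep in teps:
            find(tep)  # register TEPs even when they are alone in a school
        for other in teps[1:]:
            union(teps[0], other)  # a shared school connects these TEPs

    strata = {}
    for tep in parent:
        strata.setdefault(find(tep), set()).add(tep)
    return list(strata.values())


# Hypothetical staffing pattern: TEP "D" only supplies a school with no other TEPs
schools = {"school_1": {"A", "B"}, "school_2": {"B", "C"}, "school_3": {"D"}}
print(tep_strata(schools))
```

A single returned stratum corresponds to the situation described above, in which all TEPs are connected; more than one stratum would indicate that some TEPs cannot be compared with others under SFE.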
Regarding homogeneity, if there are highly centralized schools (employing graduates from a large share of TEPs) that are significantly different from less centralized schools, this can produce biased TEP estimates. Building on procedures used by Mihaly et al. (2013) and Goldhaber et al. (2013), Appendix Table S2 (available in the online version of the journal) reports on results from t tests comparing mean characteristics of schools that connect one to three TEPs to mean characteristics of schools that connect four or more TEPs. Results indicate that less centralized schools, with graduates from fewer TEPs, tended to have more students who were White, eligible for free/reduced priced lunch, eligible for special education, and were ELLs; these schools also tended to be elementary schools with smaller enrollments. With the exception of free and reduced price lunch, these descriptive results are consistent with Mihaly et al. (2013). We also ran logistic regressions to estimate whether or not a school was centralized as a function of school characteristics. The results, shown in Column 4, were similar. However, after controlling for enrollment, school level (e.g., elementary) was no longer significant, suggesting that enrollment differences explain why elementary schools were less centralized—because middle and high schools tend to be larger, they hire more teachers, thus increasing TEP connectivity.
Appendix Table S2 (available in the online version of the journal) results suggest that more and less centralized schools indeed differ, raising some concern that the homogeneity assumption is not met and that TEP estimates based on SFE models could therefore be biased. Despite these concerns, when we reproduce TEP estimates using HLM models that adjust for these school characteristics, results are very similar; in fact, TEP rankings based on SFE and HLM models are correlated at between .93 and .97.
Results
Do TEP Ratings Based on Graduates’ Observational Ratings Vary by Modeling Approach?
Because no prior research, to our knowledge, has tried to evaluate TEPs using observational ratings of program graduates, a first task was to consider different analytic approaches and to investigate whether TEP ratings differed across them. Table 2 summarizes institution estimates across three different models likely to be used by state policymakers in evaluating TEPs. 16 To maximize statistical power, we used the full sample of teachers across the state; because we were interested in how recent TEP graduates compare with experienced teachers, we used experienced teachers as the reference category. 17 The many negative and significant coefficients suggest that many institutions graduate teachers who, in their first 3 years, perform significantly worse, on average, than experienced teachers in the state. This is unsurprising, as it would be unrealistic to expect a year or so of teacher preparation to propel graduates to the same levels of teaching performance as colleagues who have been in the classroom 4 or more years. It is notable that, across modeling approaches, three institutions (30, 31, and 44) graduated teachers who, as early-career teachers, significantly outperformed experienced teachers in the state; graduates from two other institutions (23, 28) outperformed experienced teachers based on estimates from OLS models, but not other modeling approaches. Also notable, 14 institutions had nonsignificant (at p < .05 level) estimates, suggesting they graduated teachers who, in their first few years of teaching, perform comparably with experienced teachers—assuming, of course, that the samples are large enough to detect significance.
Table 2. Institution Estimates Based on Various Modeling Approaches
Note. Robust standard errors in parentheses. “Institution 777” includes all graduates from TEPs with fewer than 10 program completers; “Institution 999” includes early-career teachers who did not graduate from a Tennessee provider (e.g., out of state). Small institutions (e.g., Institution 1, 10) have no coefficient and are grouped in Institution 777. The reference category includes all experienced teachers in the state with more than 3 years of experience. OLS = ordinary least squares; SFE = school fixed effects; HLM = hierarchical linear modeling.
p < .1. *p < .05. **p < .01. ***p < .001.
The various modeling approaches are fairly consistent also in terms of the directionality and significance levels of estimates. Institutions that were identified as performing significantly better (or worse) than the reference according to one modeling approach also tended to be identified as doing significantly or moderately (at p < .1 level) better (or worse) according to other approaches. Institutions estimated to be statistically indistinguishable (nonsignificant) from the reference group using one modeling approach are mostly indistinguishable using other modeling approaches as well. 18 Results were similar when investigating program instead of institution estimates, but due to space constraints, we report only on institutions in Table 2.
That model estimates tended to converge in terms of general directionality and levels of significance does not necessarily mean that rankings were similar across models. To test this, we ranked institutions within each modeling approach; we then used Spearman rank correlations to examine how highly correlated rankings were across modeling approaches. Table 3 summarizes correlations between model estimates for institution effects, where coefficients for shrunken estimates (empirical Bayes) are reported below the diagonal. Table 4 reports coefficients for program rather than institution rankings.
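As an illustration of the rank-correlation step, the sketch below computes Spearman's coefficient as the Pearson correlation of the two rank vectors, assigning tied values their average rank. The function names and the example estimates are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # average 1-based rank across the tied block
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical estimates for five institutions under two modeling approaches
sfe = [-0.30, -0.10, 0.05, 0.20, 0.40]
ols = [-0.25, 0.00, -0.05, 0.35, 0.30]
print(round(spearman(sfe, ols), 2))
```

Because the coefficient depends only on ranks, it is unaffected by differences in the scale of estimates across models, which is why it suits comparisons of rankings.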
Table 3. Spearman’s Rank Correlations for Institution Estimates
Note. Adjusted empirical Bayes estimates were used for correlations on the bottom of the diagonal, and unadjusted estimates were used for the correlations on the top of the diagonal; n = 41. OLS = ordinary least squares; SFE = school fixed effects; HLM = hierarchical linear modeling.
Table 4. Spearman’s Rank Correlations for Program Estimates
Note. Adjusted empirical Bayes estimates were used for correlations on the bottom of the diagonal, and unadjusted estimates were used for the correlations on the top of the diagonal; n = 118. OLS = ordinary least squares; SFE = school fixed effects; HLM = hierarchical linear modeling.
Results indicate that HLM and SFE estimates were very highly correlated (unadjusted r = .97 for institutions and r = .93 for programs), suggesting that these two modeling approaches are identifying, with a great degree of consistency, the same institutions or programs as being strong and weak. OLS estimates were strongly correlated with estimates from all other methods (simple mean scores, HLM, and SFE) but tended to have stronger correlations with mean score rankings than with HLM or SFE rankings. Consistent with this point, mean rankings were highly correlated with OLS rankings (r = .81 for institutions; r = .86 for programs) and only modestly correlated with HLM (r = .49 for institutions; r = .57 for programs) and SFE (r = .52 for institutions; r = .57 for programs) rankings.
We also grouped institutions and programs into quartiles and then measured how consistent quartile rankings were across modeling approaches. Table 5 summarizes how programs and institutions ranked in each of the quartiles using SFE (columns)—our focal model for much of the remainder of the article—were distributed across mean quartiles, OLS quartiles, and HLM quartiles. Focusing on programs (right side), the first row demonstrates that 17 out of 30 (57%) programs identified as lowest performing programs (bottom quartile) using mean scores were also identified as lowest performing (bottom quartile) programs using SFE; however, four out of these 30 (13%) were identified as highest performing (top quartile) using SFE. Among the 30 bottom-quartile programs using OLS, two (7%) were in the top quartile using SFE. Assuming states consider using quartiles to categorize TEPs into the four categories in the proposed federal regulations, the choice of model then could mean the difference between being categorized as “exceptional” and “low-performing” for some programs. By contrast, none of the bottom (or top) quartile programs using HLM were in the top (or bottom) quartile using SFE.
Table 5. TEP Switching Quartiles by Modeling Approach
Note. TEP = teacher education program; SFE = school fixed effects; OLS = ordinary least squares; HLM = hierarchical linear modeling.
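The quartile comparison summarized in Table 5 can be sketched as a simple cross-tabulation. The code below assigns quartiles by rank and counts how many TEPs fall in each pair of quartiles under two models; the function names and the eight example programs are hypothetical, and the rank-based quartile rule is an assumption rather than the study's exact procedure.

```python
from collections import Counter


def quartiles(estimates):
    """Assign each TEP a quartile (1 = lowest) from its rank among the estimates."""
    order = sorted(estimates, key=estimates.get)
    n = len(order)
    return {tep: (pos * 4) // n + 1 for pos, tep in enumerate(order)}


def quartile_crosstab(est_a, est_b):
    """Count TEPs in each (quartile under model A, quartile under model B) cell."""
    qa, qb = quartiles(est_a), quartiles(est_b)
    return Counter((qa[tep], qb[tep]) for tep in qa)


# Hypothetical estimates for eight programs under two models
sfe = {"P1": -0.5, "P2": -0.3, "P3": -0.1, "P4": 0.0, "P5": 0.1, "P6": 0.2, "P7": 0.4, "P8": 0.6}
ols = {"P1": 0.5, "P2": -0.4, "P3": -0.2, "P4": 0.0, "P5": 0.1, "P6": 0.2, "P7": 0.3, "P8": -0.6}

table = quartile_crosstab(sfe, ols)
same = sum(count for (qa, qb), count in table.items() if qa == qb)
print(f"{same} of {len(sfe)} programs stay in the same quartile")
print("bottom under SFE but top under OLS:", table[(1, 4)])
```

Cells far from the diagonal, such as (1, 4) or (4, 1), correspond to the cases discussed above in which the choice of model moves a program between the lowest and highest categories.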
Sensitivity Test
Results in this section suggest that program rankings based on SFE are consistent with those based on HLM, whereas OLS rankings are consistent with mean rankings. Because mean rankings do not adjust for differences between school characteristics, the fact they are highly correlated with OLS rankings led us to speculate that OLS may not adequately adjust for differences between schools. Moreover, we began to speculate that SFE and HLM rankings may be highly correlated, at least in part, because both models are similarly and more successfully (as compared with OLS) adjusting for school characteristics. To test this, we saved regression-adjusted estimates from each modeling approach and calculated the difference of these estimates from program means. We then examined whether these differences [(regression estimate) − (mean)] varied by school characteristics.
The results, summarized in Appendix Table S3 available in the online version of the journal, indicate that all regression model approaches adjusted up estimates (relative to means) for TEPs that supplied teachers to schools with more low-income students and to high schools. Given growing evidence that teachers receive lower OR when they teach in schools and classrooms with more lower income and secondary students (Campbell & Ronfeldt, under review; Whitehurst, Chingos, & Lindquist, 2014), our results suggest that all three regression approaches are correcting TEP estimates for bias in teacher OR that might, otherwise, unfairly penalize TEPs that supply more teachers to these kinds of schools. However, the shaded columns demonstrate that SFE and HLM models adjusted up estimates at significantly greater rates than OLS models for TEPs supplying high schools and schools with more low-income students. Coupled with the fact that OLS estimates were more highly correlated with mean scores—which do not adjust for differences between school characteristics—than either HLM or SFE, these results may indicate that SFE and HLM are adjusting estimates for school differences in ways that are similar to one another and likely preferable to OLS or mean scores. 19
The above results suggest HLM and SFE estimates to be comparable, and both preferable to OLS or mean scores. These findings are consistent with goodness of fit analyses, which, based on Akaike information criterion (AIC) and Bayesian information criterion (BIC), indicate both HLM and SFE models fit the data better than OLS. Although both HLM and SFE models appear to adjust for observed school characteristics better than OLS models or simple means, SFE models also adjust for unobserved characteristics, including differences between schools in terms of rater tendencies. Because the same rater (typically the principal) usually completes observational ratings, any rater effects are likely common across teachers in the same school. SFE models should then adjust for rater effects (biases or tendencies). For these reasons, we privilege SFE models for the remainder of the article. Even so, we reproduced analyses using HLM models and the results were very similar.
Are There Differences Between TEPs in Terms of Average Graduates’ OR?
Figure 3 presents caterpillar plots for institution (left) and program (right) estimates using SFE models. Plots on the top present unadjusted estimates, whereas plots on the bottom present empirical Bayes (EB) adjusted estimates. The models used in these analyses differ in important ways from SFE models used earlier (Table 2). First, the sample included only Tennessee program completers; out-of-state program completers and experienced teachers were excluded. Second, the reference category was the mean of all recent program completers (the sample) rather than experienced teachers. 20 Beyond being more policy-relevant, this reference category is also more conservative—fewer TEPs will differ significantly from the mean of recent graduates than from experienced teachers in the state. Being conservative, however, seems appropriate given that TEP evaluations could have important consequences for program approval or accreditation. 21

Figure 3. Caterpillar plots of TEP estimates (SFE models).
Figure 3 demonstrates that 8 out of 40 institutions (20%) and 21 out of 118 programs (18%) graduated teachers who, on average, were significantly better or worse than the mean of all recent program completers in the state. These are far fewer significant TEP estimates than were detected when using experienced teachers in the state as the reference, with between 30% and 40% of estimates being significant. Even so, those programs and institutions identified as doing exceptionally well or poorly here were similarly exceptional in Table 2 analyses. Notably, graduates from all small providers combined (Institution 777) perform significantly worse, on average, than the mean of program completers in the state.
As important as whether differences between TEPs are statistically significant, though, is whether differences in magnitude are meaningful. To test this, we divided institution ratings into quartiles. We then compared institutions in the top and bottom quartiles, finding the top-quartile institutions to have ratings that are, on average, 0.31 standard deviations better than bottom-quartile institutions. In our models, this is comparable with the difference in average observational ratings between first- and second-year teachers (β = .33, p = .011), an amount we know to be meaningful. Differences between programs in the top and bottom quartiles were similar (β = .32, p = .011). These results suggest that top-performing TEPs are graduating teachers who effectively have an additional year of initial teaching experience on the first day of class compared with graduates from the lowest performing TEPs. The difference between the very best and very worst institutions was about one standard deviation.
A concern, however, is that differences between TEPs that are statistically significant and appear to be meaningful in magnitude may actually consist mostly of noise (von Hippel et al., 2014). Following von Hippel et al. (2014), we examine the fraction of the variance in the TEP estimates that is due to heterogeneity in the TEP effects versus noise. 22 The results indicate that the institution estimates were 40% noise, which is substantially better than the 56% to 81% noise that von Hippel and colleagues report. As expected, program estimates were noisier (56% noise), though still at the bottom of the range reported by von Hippel and colleagues. We also compared SFE models with and without TEP indicators based on likelihood-ratio tests and AIC/BIC criteria. Likelihood-ratio tests indicated that adding TEP indicators significantly improved the model (p < .001), suggesting that we can detect differences between TEPs; AIC/BIC criteria also showed that including TEP indicators improved model fit.
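The logic of separating heterogeneity from noise can be illustrated with a simplified calculation: the observed variance of the TEP estimates is treated as the sum of true variance and average sampling variance, so the noise share is the average squared standard error divided by the observed variance. This is a rough method-of-moments sketch, not the full procedure of von Hippel et al. (2014); the function name and example values are hypothetical.

```python
from statistics import mean, pvariance


def noise_share(estimates, std_errors):
    """Approximate share of the variance in TEP estimates attributable to sampling noise."""
    total_var = pvariance(estimates)                  # observed spread of the estimates
    noise_var = mean([se ** 2 for se in std_errors])  # average sampling variance
    if total_var <= 0:
        raise ValueError("estimates must vary")
    return min(noise_var / total_var, 1.0)            # capped at 100% noise


# Hypothetical estimates and standard errors (illustrative only)
print(f"{noise_share([-0.4, 0.0, 0.4], [0.2, 0.2, 0.2]):.1%} noise")
```

Under this simplification, larger standard errors relative to the spread of the estimates raise the noise share, which is consistent with program estimates (based on fewer graduates) being noisier than institution estimates.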
Given that there appeared to be significant and meaningful differences between TEPs, we also wondered whether there was variation in the performance among programs within a given institution. If all the programs within a given institution performed at comparable levels, then institution estimates would suffice and program estimates would be unnecessary. However, we found that 54% of all programs were ranked in different quartiles than the institutions in which they were housed. Table 6 summarizes a set of institution estimates and the corresponding estimates for programs within those institutions. These cases were chosen to highlight a variety of distributions.
Table 6. Select Institution and Related Program Estimates and Quartile Rankings
Note. Institutions are shown in bold. Robust standard errors in parentheses. SFE = school fixed effects; SPED = special education.
p < .1. *p < .05. **p < .01. ***p < .001.
Programs within some institutions do not vary much in performance. Institution 4, for example, is a bottom-quartile institution in the sample, and all the programs within this institution are in the bottom two quartiles; in other words, all programs seem to be contributing to the poor performance of this institution. In this case, a single, institution-level rating would likely be appropriate for all programs. Institution 9 is a fairly average performing institution, as are most of its programs. However, its secondary undergraduate program scores among the lowest in the state. This case illustrates a need perhaps to consider evaluations not just of institutions but also of programs, as an institution rating would not reflect the very weak performance of its secondary undergraduate program. Institutions 31 and 34 underscore this point further. Although Institution 31 is one of the top performers in the state, its strong performance is due largely to one of its programs; in fact, its other program is a bottom-quartile program. Although Institution 34 would likely be at risk of losing program approval in some likely policy scenarios, its elementary graduate program is one of the top-performing programs in the state.
How Do Program Ratings That Use TVAAS Scores Compare?
We also examined the degree to which TEP estimates based on graduates’ observational evaluations are correlated with TEP estimates based on graduates’ TVAAS scores. Results indicated that there is a positive and statistically significant relationship between observation-based and TVAAS-based institution estimates (r = .39; r = .42 shrunken) and between observation-based and TVAAS-based program estimates (r = .32; r = .30 shrunken); Appendix Figure S1 (available in the online version of the journal) displays scatterplots representing these relationships. These correlations are comparable with correlations between value-added and OR when using teacher, rather than TEP, quality estimates. For example, Kane and Staiger (2012) report correlations of between .2 and .3 across a range of observational and VAM measures. In separate analyses, we constrained the sample only to individuals with both observational ratings and TVAAS and reproduced TEP estimates. Correlations were somewhat greater for institutions (r = .47; r = .48 shrunken) and comparable for programs (r = .32; r = .33 shrunken).
Given that there may be reasons to expect differences in production functions between elementary and secondary schools, we separated the sample by elementary and secondary teachers and then reproduced analyses. We found the correlations to be statistically similar to correlations from the full sample. Among elementary teachers, the correlations were positive and statistically significant for institutions (r = .41) and for programs (r = .26). Among secondary teachers, the correlations were also positive and statistically significant for institutions (r = .57) and for programs (r = .32).
We also reproduced these analyses (a) only with teachers who had math TVAAS and (b) only with teachers who had reading TVAAS. Focusing on these subpopulations greatly reduced our sample of teachers and TEPs, which meant that the correlations were based on a different set of TEPs. Correlations between TEP rankings using observational ratings and TVAAS were comparable with, though somewhat lower than, correlations based on TEP rankings that aggregated across subjects. Consistent with prior literature, correlations were stronger in math than in reading. When using the full sample of teachers (including experienced teachers and non–program completers), the Spearman correlations between TVAAS and OR TEP institution rankings were .28 for math and .25 for reading. For program rankings, the correlations were .36 for math and .15 for reading. With the further reduced sample of only program completers, the correlations at the institution level for math and reading were .26 and .28, respectively. At the program level, the correlations for math and reading were .35 and .08, respectively.
Although positive and significant, the relationships between TEP ratings based on different measures are still weak to moderate in magnitude. As a result, TEPs ranked in a given quartile using one outcome measure often are in a different quartile using the other. Table 7 illustrates the distribution of TEP rankings across quartiles using both measures: the left side summarizes estimates for institutions and the right side for programs, and the bottom of the table shows shrunken (EB) estimates. The top row demonstrates that, among the 10 institutions and 27 programs ranked in the bottom quartile using TVAAS, three institutions (30%) and three programs (11%) were ranked in the top quartile using observational ratings. In fact, only about 40% of institutions and programs are in the same quartile based on both TVAAS and OR. 23 Given that these kinds of quartile rankings could have important consequences for TEPs, such discrepancies between outcome measures have important policy implications, which we take up below.
Table 7. TVAAS and OR Quartile Switching by TEP
Note. TVAAS = Tennessee Value-Added Assessment System; OR = observation ratings; TEP = teacher education program; EB = empirical Bayes.
Discussion
To our knowledge, this is the first study to investigate the use of state-level data on teachers’ observational ratings to evaluate preparation programs and institutions. The results indicate that we are able to detect significant and meaningful differences between TEPs. Moreover, TEPs that were identified as performing particularly well or poorly using one modeling approach performed similarly across other modeling approaches. This suggests that observational ratings of graduates can be used to measure TEP performance and that TEP classifications are fairly robust across modeling approaches. Finally, we found evidence for convergent validity—TEP rankings based on observational ratings were positively and significantly related to rankings based on student achievement gains (TVAAS).
This study has also made progress in determining how best to analyze teacher observational ratings to assess TEP performance. Specifically, our analyses indicate that estimating TEP rankings using SFE or HLM is preferable to using OLS or simple means. Rankings based on SFE and HLM are highly correlated with one another but less correlated with OLS; by contrast, OLS is more highly correlated with mean scores (which do not adjust TEP evaluations for differences between schools) than with SFE or HLM. This suggests that SFE and HLM models do a better job of adjusting for school characteristics than OLS, a conclusion that is consistent with the finding that SFE and HLM models adjust up estimates for TEPs that supply teachers to low-income and secondary schools more so than OLS models. Finally, goodness of fit analyses also indicate that HLM and SFE fit the data better than OLS.
Findings from this study also indicate that TEP rankings varied by modeling approach and teacher quality measure. SFE program rankings were correlated with mean rankings at .57, with OLS rankings at .72, and with HLM rankings at .93. Even where correlations were strong for programs overall, rankings for specific programs sometimes differed in meaningful ways. For example, SFE and OLS rankings were correlated at .72 but, among the 29 programs ranked in the top quartile based on SFE, 2 were ranked in the bottom quartile using OLS (4 were ranked in the bottom quartile using mean scores). Although only a small minority of TEPs had extremely divergent quartile rankings across models, the implications for these TEPs could be profound. In some likely policy scenarios, the choice of modeling approach could be the difference between a “top-performing” designation and being at risk of losing state approval.
Likewise, although TEP rankings based on observational ratings and TVAAS are positively and significantly related, correlations are weak to moderate in magnitude (rs = .3–.4). Only about 40% of programs ranked in the top (or bottom) quartile based on observational rating data are also ranked in the top (or bottom) quartile based on TVAAS. Moreover, a number of TEPs ranked in the top quartile using one outcome are in the bottom using the other.
Implications
A central implication of this study is that TEPs, through some combination of training and recruitment, can make a difference. We find that some institutions and programs graduate teachers who are significantly better or worse than the mean of graduates in the state. In addition, these differences are meaningful; graduates from top-quartile TEPs performed as though they had an additional year of initial teaching experience when compared with graduates from bottom-quartile TEPs.
Another implication is that the use of teacher observational ratings to assess teacher preparation programs appears to have promise—it is possible to reliably identify significant and meaningful differences between TEPs. However, when doing so, policymakers may want to go beyond using observational ratings to evaluate institutions to also evaluate programs within institutions. There can be substantial differences in the performance among programs within the same institution; recognizing these differences is especially important when evaluations carry policy consequences for program approval/accreditation. For instance, a high-performing program within a low-performing institution should probably not be penalized by the performance of its institutional peers and vice versa.
This study then suggests that using observational ratings to assess TEP quality is a viable approach. We are not advocating, though, for displacing the evaluation of TEPs based on graduates’ VAMs but believe instead TEP evaluations based on graduates’ observational ratings to be a potentially complementary strategy. This is especially true given that rankings based on observational ratings and TVAAS are positively and significantly related.
However, even where state or district policymakers agree about basing TEP evaluations on observational ratings, there is little consensus about how best to do so. The simplicity of using mean observation scores is appealing and is an approach already used to evaluate teachers (instead of TEPs) in many states. However, as mentioned earlier, there is growing evidence that mean OR are systematically lower for teachers in schools with more low-income students and students of color, and these differences do not appear to be due to actual differences in teacher quality. Consistent with these findings, we find TEP mean scores are differentially and significantly lower for institutions/programs supplying teachers to schools and districts with more low-income students. Based on our analyses, we believe it would then be a mistake to use simple mean scores to evaluate quality, as TEPs should not be penalized for supplying teachers into schools with historically marginalized students. There is already much consensus that a teacher’s value-added score should be adjusted for differences in school, classroom, and student characteristics; consistent with Whitehurst, Chingos, and Lindquist (2014), we believe that observational evaluations of teachers, and evaluations of TEPs that use this information, should also be adjusted accordingly. When adjusting for differences between schools, our findings further suggest that SFE or HLM models are preferable to OLS.
As part of using observational evaluations for TEP approval purposes, some states are considering ways of identifying TEPs that are exemplary and TEPs that are at risk of losing state approval/accreditation. Given that the proposed federal regulations ask states to categorize programs into four groups, some states are considering using quartile rankings to identify these groups. Our results provide initial evidence that using quartile rankings based on observational ratings may be a sound approach. As summarized in Table 5, there is more agreement between modeling approaches about which institutions/programs are in the top and bottom quartiles as compared with the middle quartiles. Given that these top and bottom categories will likely carry with them the greatest policy consequences, it is important to use evaluation strategies that identify these TEPs in reliable ways. To this point, no top (bottom) quartile TEPs using SFE are identified in the bottom (top) quartile using HLM, our other preferred model. In addition, the differences between institutions/programs in the top and bottom quartiles (roughly a year of initial teaching experience) are substantial. Thus, using these categories has potential to capture TEPs with meaningfully different impacts.
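To make the idea of quartile agreement between modeling approaches concrete, the following is a minimal illustrative sketch, not the procedure used in this study. It assumes each model yields one effect estimate per TEP, assigns quartiles by rank, and reports the share of TEPs placed in a given quartile by one model that the other model places in the same quartile. The function names (`assign_quartiles`, `overlap_share`), the TEP labels, and all numeric values are hypothetical and invented for illustration only.

```python
from typing import Dict


def assign_quartiles(estimates: Dict[str, float]) -> Dict[str, int]:
    """Rank TEPs by estimate and split the ranks into four groups.

    Returns 1 for the bottom quartile through 4 for the top quartile.
    Ties are broken by input order (an assumption of this sketch).
    """
    ranked = sorted(estimates, key=estimates.get)  # ascending by estimate
    n = len(ranked)
    return {tep: (i * 4) // n + 1 for i, tep in enumerate(ranked)}


def overlap_share(q_a: Dict[str, int], q_b: Dict[str, int], quartile: int) -> float:
    """Share of TEPs in `quartile` under model A that are also there under model B."""
    in_a = [tep for tep, q in q_a.items() if q == quartile]
    if not in_a:
        return 0.0
    return sum(q_b[tep] == quartile for tep in in_a) / len(in_a)


# Hypothetical estimates for eight made-up TEPs under two models.
sfe_estimates = {"A": -0.30, "B": -0.20, "C": -0.10, "D": -0.05,
                 "E": 0.05, "F": 0.10, "G": 0.20, "H": 0.30}
hlm_estimates = {"A": -0.25, "B": -0.22, "C": 0.06, "D": -0.08,
                 "E": -0.03, "F": 0.12, "G": 0.28, "H": 0.21}

sfe_q = assign_quartiles(sfe_estimates)
hlm_q = assign_quartiles(hlm_estimates)

# Agreement by quartile, from bottom (1) to top (4).
agreement = {q: overlap_share(sfe_q, hlm_q, q) for q in (1, 2, 3, 4)}
print(agreement)
```

In this invented example the two sets of estimates agree fully on the top and bottom quartiles and only partly on the middle quartiles, which mirrors the qualitative pattern described above. The same overlap calculation could in principle be applied to rankings from any two outcome measures.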
We are not, however, advocating that policymakers make program approval decisions using quartile rankings based solely on graduates’ observational ratings. The fact that correlations are moderate suggests that OR cannot fully substitute for TVAAS ratings. To the contrary, we believe that other teacher quality information, such as value-added, and workforce outcome data, such as teacher retention, should also be considered, alongside other possible criteria. However, our analyses suggest further that using multiple outcome measures will not be straightforward, as different measures can lead to drastically different TEP rankings. Namely, finding that only 40% of institutions/programs ranked in the top quartile using observational evaluations are also in the top quartile using TVAAS indicates that state policymakers will need to determine ways of reconciling TEP rankings across various outcome measures. We believe, though, that having to reconcile multiple rankings based on multiple measures can introduce a potentially generative system of “checks and balances” to the TEP evaluation process. For example, policymakers can feel more confident in assigning an “exceptional” categorization to a TEP ranked in the top quartile on TVAAS when that same TEP is ranked in the top quartile using observational ratings. By contrast, were this same TEP to receive a bottom-quartile ranking using observational ratings, policymakers would have good reason to second-guess assigning it an “exceptional” categorization. Using multiple measures can increase thresholds for making high-stakes decisions, which is likely good policy. Moreover, having multiple measures should result in more comprehensive assessments of teacher and TEP quality, which we know to be complex and multidimensional constructs (Boyd et al., 2009; Shulman, 2004).
Finally, there is great potential in using TEP ratings based on observational ratings for formative, program improvement purposes. For example, Papay, Taylor, Tyler, and Laski (2015) conducted a randomized controlled trial, also in Tennessee, where they used dimension-level OR information from the TEAM rubric to match high-performing teachers on specific dimensions to “target” colleagues in the same treatment schools who had lower performance on the same TEAM dimension. After the matched teachers were encouraged to work together on instruction for a year, the authors found treatment schools to have meaningfully greater achievement gains than other schools. This study suggests that formative feedback related to dimensions of observational rubrics can have substantial effects on teacher effectiveness. Policymakers and TEP leaders also tend to agree that information about how graduates from a given program perform relative to graduates from other programs in specific instructional domains (e.g., classroom environment or planning) has great promise for guiding TEP reform efforts. Using this information, programs might, for instance, redesign coursework or clinical experiences to improve preparation in areas in which their graduates tend to be weak. We are currently exploring whether it is feasible to produce TEP effect estimates based on dimension- or domain-level observational ratings. In future work, we hope to extend work by Papay et al. (2015) to test whether dimension/domain-level information can prompt TEP rather than teacher development.
Footnotes
Acknowledgements
We gratefully acknowledge the Tennessee Department of Education for being willing to partner on and support this research. In particular, we appreciate ongoing and thoughtful input from Nate Schwartz, Amy Wooten, Michael Deurlein, and Laura Booker. We also want to acknowledge Dan Goldhaber, Christina Weiland, and our blind peer reviewers for helpful feedback on early drafts of this article. The views expressed should be attributed to the authors; any errors or omissions are also the sole responsibility of the authors.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Authors
MATTHEW RONFELDT is an assistant professor of educational studies at the University of Michigan School of Education. His scholarship sits at the intersection of educational practice and policy and focuses on teacher preparation, teacher retention, teacher induction, and the evaluation of teachers and preparation programs. He received his doctoral degree from Stanford University School of Education, where he was also an IES postdoctoral fellow; before that he taught middle school math and science for eight years.
SHANYCE L. CAMPBELL is a postdoctoral research fellow of educational studies at the University of Michigan’s School of Education. She received her doctoral degree in public policy from the University of North Carolina at Chapel Hill. Her research focuses on understanding and improving educational opportunities to learn for underserved and marginalized student populations. Within the broader topic of opportunities to learn, she focuses on three specific research areas: teacher effectiveness, district and school reforms, and school-community relations.
