Abstract
Most education agencies have implemented new teacher evaluation systems that promise to improve teacher performance. Post-observation performance feedback is a theoretically important driver of this promise as it should ultimately address teacher-specific weaknesses. This is the first large-scale study to use the written feedback provided to early-career teachers during formal post-observation conferences and quantitatively link critical feedback characteristics (CFCs) to measures of teacher human capital. We find that most conferences do not include CFCs, that feedback is typically unidimensional, and that less effective early-career teachers receive higher shares of CFCs. However, goal-setting is the only CFC associated with subsequent teacher performance. Beginning and less-educated teachers, for whom goal-setting may clarify performance expectations, drive this relationship.
Introduction
Although feedback generally improves teacher performance, theoretical and empirical research suggests that certain feedback characteristics are more potent than others. Specifically, research both within and beyond K–12 settings suggests a positive association between improvements in performance and feedback that (a) is aligned with an improvement area, (b) discusses the feedback’s evidential basis, (c) sets specific improvement goals, and (d) includes actionable next steps (Cherasaro et al., 2016; DeNisi & Murphy, 2017; Hattie & Timperley, 2016; Ilgen et al., 1979). We call these four qualities critical feedback characteristics (CFCs). Because students taught by more effective teachers experience better short- and long-term academic and nonacademic outcomes (e.g., Chetty et al., 2014; Doan, 2019; Jackson, 2018), we suspect that policymakers want observation conferences to include performance-enhancing CFCs, especially if their provision is low-cost.
Prior studies explore the prevalence of effective feedback, broadly defined, in current or recently implemented teacher evaluation settings. We extend this work in several ways. Survey and small-scale qualitative research studies report that teachers receive effective feedback (Cherasaro et al., 2016; Donaldson et al., 2014; Finster & Milanowski, 2018; Long, 2019; Sun et al., 2016), but the data collected by these studies are self-reports. Furthermore, survey studies predominantly collect data near the end of a school year while asking participants to recall feedback received over the year, potentially introducing recall bias. Therefore, it is unclear how prior data collection methods have shaped what we know about feedback provision.
Our study departs substantially from prior work. We qualitatively code the written feedback observers provided to approximately 1,200 early-career Tennessee teachers throughout an academic year. We then transform these qualitative codes into quantitative measures of CFCs. As Fesler et al. (2019) detail, access to text data sets such as ours allows researchers to ask questions that were not previously possible. Our data collection methods differ from prior quantitative work in another meaningful manner. Previous survey studies are missing responses from parts of their sampling frames, potentially introducing nonresponse bias. Because we randomly select teachers from administrative data that cover the entire sampling frame without missing records, our study avoids nonresponse bias.
We also extend prior work on feedback presenting CFCs. First, we explore whether teachers receiving high concentrations of one feedback characteristic across their post-observation conferences also receive high concentrations of other characteristics. Second, we assess who is receiving CFCs by drawing on strategic management of human capital theory, which examines how policymakers and school leaders use measures of educator human capital for educator development and other purposes (Odden & Kelly, 2008). Although prior research reports that decision-makers use such measures to strategically manage educator human capital (Cohen-Vogel, 2011; Goldring et al., 2015; Grissom & Bartanen, 2018; Grissom et al., 2017), we are unaware of any studies that examine whether observers issue feedback based on these measures. We hypothesize that teachers with lower human capital (specifically, observation scores, composite effectiveness scores, education level, and experience) receive more CFCs, given that these teachers need more development than teachers with higher levels of human capital.
Finally, we examine the relationships between feedback characteristics and subsequent performance. Prior studies have tried to clarify these connections using participant self-reports. We instead use a unique data set of externally coded feedback to estimate associations between each of the CFCs and subsequent value-added and observation scores. We also explore whether these associations depend on teacher human capital measures.
The purpose of this study is to extend our understanding of teacher evaluation processes as recently applied in field settings, which is critically important to education policy and practice considering the significant financial investments in teacher evaluation and its use as a focal mechanism to enhance teacher practice and effectiveness. Specifically, we ask the following questions:
To what extent do teachers’ early-career post-observation conferences include the CFCs, that is, feedback which (a) is aligned to improvement targets, (b) refers to evidence, (c) sets goals, and (d) is actionable?
To what extent is the receipt of one CFC correlated with the receipt of other CFCs?
To what extent are teacher post-observation conferences exhibiting the examined CFCs associated with teacher baseline observation scores, composite effectiveness scores, education level, and experience?
To what extent are teacher performance outcomes associated with the proportion of teacher post-observation conferences exhibiting the examined CFCs? Which of the four examined teacher measures of human capital moderate these associations?
The article proceeds as follows. First, we review the literature concerning feedback characteristics and employee improvement. We then discuss the strategic management of human capital as a conceptual frame for our work, followed by the study context and our data and methods. Finally, we present and interpret our findings and discuss implications for future research, policy, and practice.
Feedback Characteristics and Performance Improvement
We consult educational, psychological, and management literature, as the latter two fields include decades of established empirical work on the relationships, mediators, and moderators of feedback provided to employees. Research within and beyond the K–12 sector finds positive and negative net associations between changes in employee (teacher) performance and feedback characteristics (e.g., DeNisi & Murphy, 2017; Hattie & Timperley, 2016). Despite this mixed evidence, scholars consistently identify specific feedback characteristics as performance-enhancing. We review prior research to identify feedback characteristics associated with performance improvement, focusing specifically on those characteristics communicated via written feedback. Ultimately, we use the four resulting CFCs to qualitatively code the textual feedback before transforming it into quantitative measures. As displayed and defined in Table 1, the four CFCs are as follows: (a) Alignment with an area of improvement, (b) Evidence, (c) Goal-Setting, and (d) Action.
Qualitative Codes: Feedback Characteristics
Alignment With an Area of Improvement
Scholars argue that feedback is more effective when aligned to one job skill because it allows for in-depth analysis and greater focus during post-observation conferences (Cannon & Witherspoon, 2005; Cornelius & Nagro, 2014; Tuytens & Devos, 2011). For instance, feedback about “questioning” may be too broad; instead, effective feedback might focus on particular questioning techniques such as “the degree to which all students in a class participate in answering questions” or “use of cognitively demanding questions that push student thinking” (Archer et al., 2016). In addition, research has shown that feedback that discusses several areas where improvement is warranted may overwhelm recipients, thus limiting teachers’ responsiveness to observer recommendations (Brinko, 1993; Hattie & Timperley, 2016; Scheeler et al., 2004). We apply these results in two ways. First, we analyze written feedback meant to focus on a single area for teacher improvement (see the Study Context section for details about feedback design). Second, we examine the feedback to determine whether it aligned with the observer-identified area of improvement or not.
Evidence
Feedback referencing evidence from an observation is one of the most consistently identified characteristics of performance-enhancing feedback (Feeney, 2007; Hattie & Timperley, 2016; Hemmeter et al., 2011; Tuytens & Devos, 2011). Previous research and practitioner resources suggest that feedback effectiveness depends on the specificity and objectivity of the evidence referenced in the feedback (Glickman et al., 2018; Hill & Grossman, 2013; Ilgen et al., 1979). Prior work also suggests that referencing evidence promotes teachers’ perceptions of observer credibility and is positively associated with teachers’ acting on the recommendations from the feedback (Archer et al., 2016; Ilgen et al., 1979; Thurlings et al., 2013; Tuytens & Devos, 2013). Evidence obtained from a teacher observation might be a direct quote of something the teacher said to students, something the teacher wrote on the board, or something students wrote in their papers. We define feedback as including evidence if it describes specific teacher or student behaviors or work products.
Goal-Setting
Several feedback studies consider goal-setting a characteristic of performance-enhancing feedback (Cherasaro et al., 2016; DeNisi & Murphy, 2017; Hattie & Timperley, 2016; Hemmeter et al., 2011). In one study, Ivancevich (1982) found that supervisors trained in goal-setting reported that their employees were more likely to take up recommended changes to their work. Goal-setting also signals what supervisors may look for during future observations, clarifying the supervisor’s performance expectations (Ilgen et al., 1979). We characterize feedback as goal-setting if it describes at least one specific behavior for the teacher to change.
Actionable
Actionable feedback specifies methods for achieving evaluator-identified goals. Scholars argue that feedback should offer strategies for goal attainment or approaches for closing performance gaps (Carver & Scheier, 1982; Cherasaro et al., 2016; Kimball & Milanowski, 2009; Sun et al., 2016). Observers might recommend strategies based on their knowledge of the individual and the area for improvement or may refer the teacher to other professional learning sources, such as workshops or peer mentors. We count feedback as actionable if it suggests a mechanism through which the teacher could improve some aspect of her teaching.
Strategic Management of Human Capital
The strategic management of human capital in K–12 education aims to improve student outcomes via educator development and the recruitment and retention of talented educators (Hitt & Tucker, 2016; Odden & Kelly, 2008). Research in this area has primarily focused on how school administrators use management information systems to make hiring and classroom assignment decisions (Cannata et al., 2017; Cohen-Vogel et al., 2015; Goldring et al., 2015; Grissom & Bartanen, 2018; Grissom et al., 2017).
The strategic management of human capital may be applicable when providing teachers with critical feedback. Although all employees deserve high-quality feedback, we hypothesize that school administrators may consider teacher human capital when providing critical feedback. Specifically, we hypothesize that teachers in need of more improvement, that is, teachers with less human capital, will receive more critical feedback. Importantly, we do not assume that school administrators rely exclusively on observable measures to manage teachers. Indeed, prior work indicates that school administrators also consider unmeasured qualities such as commitment and collaborative skills (Grissom & Loeb, 2017; Harris & Sass, 2014; Neumerski et al., 2018).
Measures of Human Capital
Commonly available measures of educator human capital (e.g., experience, educational attainment, test score value-added, and observation scores) vary in their ability to predict student achievement. Recent research on teacher experience and human capital development has demonstrated that teachers see large gains in effectiveness early in their career (Gershenson, 2016; Harris & Sass, 2011; Ladd & Sorensen, 2017; Papay & Kraft, 2015). Educational attainment is a less reliable measure. Advanced degree attainment on its own is not a reliable predictor of future student achievement (e.g., Harris & Sass, 2011; Wayne & Youngs, 2003). However, teachers with advanced degrees in their assigned subject demonstrate higher effectiveness (e.g., Dee & Cohodes, 2008; Wayne & Youngs, 2003).
Alternative methods for assessing educator human capital include measuring teacher effectiveness via (a) value-added modeling (VAM) or (b) subjective performance ratings, such as scores from classroom observations. Research shows that three consecutive years in a highly effective teacher’s classroom (e.g., one with high VAM) can elevate a student’s state standardized test-score ranking from the 25th to the 75th percentile (Sanders & Rivers, 1996). More recent work shows that exposure to high value-added teachers increases students’ chances of attending college, matriculating at higher ranked colleges, and earning higher salaries (Chetty et al., 2014). We use a teacher’s score from the Tennessee Value-Added Assessment System (TVAAS) as a measure of interest, despite some concerns over its operationalization in staffing decisions (Ballou & Springer, 2015; Goldring et al., 2015).
Another widely used approach to measuring teacher quality is classroom observation. Although classroom observations have been used to assess teachers for far longer than VAM (Brophy & Good, 1986), researchers have only recently begun to evaluate the properties of observation scores (Campbell & Ronfeldt, 2018; Hunter, 2020; Steinberg & Garrett, 2016). The depth of research examining the connections between teacher observation scores and students’ outcomes is much more limited. Several recent studies report that classroom observations and student achievement are correlated (Bacher-Hicks et al., 2017; Garrett & Steinberg, 2015; Kane et al., 2011). A recent study from Tennessee found that classroom observation scores not only track teachers’ impacts on students’ K–12, postsecondary, and labor market outcomes but also that the effects of observation scores are at least comparable to, if not greater than, the effects of teacher value-added scores on various student outcomes including reduced student absences and suspensions (Doan, 2019).
Study Context: Background on Teacher Evaluation in Tennessee
This study occurs in Tennessee, an ideal setting. Tennessee’s education policy requires each teacher to receive at least one observation each school year, with early-career teachers often receiving more. There are clear rules regarding the assignment of teacher observations and structured post-observation feedback (Teacher and Principal Evaluation Policy, 2013). Critically, the Tennessee Department of Education (TNDOE) also expects teacher evaluators to input post-observation feedback into a central data management system that links observer feedback to individual teachers, a key feature we leverage in the current study. We describe the observation system in which our study occurs using the framework developed by Liu et al. (2019).
TEAM Observation System
Tennessee policymakers adopted the Tennessee Educator Acceleration Model (TEAM) observation and evaluation system in the early 2010s. Teachers are observed using the TEAM rubric (see Supplemental Appendix A, available in the online version of this article) based on Charlotte Danielson’s widely used Framework for Teaching. Although the TEAM rubric includes four domains, only three are used for classroom observations: instruction, environment, and planning. These three domains have 12, four, and three indicators, respectively. The indicators describe specific aspects of standards-based teaching mapped onto three levels of proficiency: below expectations (=1), at expectations (=3), and above expectations (=5).
Training, Certification, and Accountability
TNDOE annually provides 2 days of training on using the TEAM rubric, facilitating pre- and post-observation conferences, basic knowledge of Tennessee’s evaluation policy, and the characteristics of performance-enhancing feedback (Alexander, 2016). Attendees must pass a certification exam before officially conducting teacher observations (Teacher and Principal Evaluation Policy, 2013). Certified observers need not be school administrators; however, fewer than 20% of observers are district personnel or peer teachers.
Tennessee policy holds certified observers accountable in three ways. First, observers are expected to generate observation scores that are somewhat aligned with the value-added score of teachers of tested subjects (Teacher and Principal Evaluation Policy, 2013). Observers who consistently generate observation scores that are too far above or too far below a teacher’s value-added scores can lose their certification. Second, teachers can file formal grievances if observers do not adhere to policy expectations (Teacher and Principal Evaluation Policy, 2013). For instance, teachers may file a grievance if they do not receive a copy of their observation scores. Third, observers who are school administrators receive their own performance rating (see details below) concerning skills in teacher evaluation and support of teacher professional learning.
Rating Processes
Per year, TEAM policy assigns a minimum of four observations to teachers receiving Tennessee’s lowest teacher effectiveness score, four or two observations to teachers in the middle categories of effectiveness depending on their certification, and one observation to teachers receiving the highest effectiveness score (Teacher and Principal Evaluation Policy, 2013). Although state policy expects the typical observation to last approximately 15 minutes (Teacher and Principal Evaluation Policy, 2013), teachers report that observations tend to last about 30 minutes (Hunter, 2020). School administrators decide which observers will conduct observations of which teachers.
Each observation is followed by a timely, structured face-to-face conference, whereas a conference precedes only some observations. Each observation is either announced to the teacher in advance or not. Conferences do not precede unannounced observations but precede announced observations (Teacher and Principal Evaluation Policy, 2013). Tennessee policy states that teachers should receive their post-observation conference within 1 week of each observation (Teacher and Principal Evaluation Policy, 2013). During post-observation conferences, observers discuss an area of refinement and an area of reinforcement. Each area refers to a single indicator from the TEAM rubric. The area of reinforcement identifies the best aspect of teaching seen during the observation. In contrast, the area of refinement represents the aspect most in need of improvement (Tennessee Department of Education, 2016). Observers are trained to offer suggestions on how teachers might improve their refinement area and to point teachers toward resources that might aid improvement (alignment and actionable feedback). Observers are required to discuss performance ratings across all indicators, not just the areas of refinement and reinforcement. TNDOE expects observers to set improvement timelines with teachers, identifying a timeframe over which the refinement area should improve (goal-setting feedback). Finally, observers discuss scores for each indicator and the basis for each score (evidence-referencing feedback; Alexander, 2016).
Level of Effectiveness
Observation scores and other measures of teacher performance determine each teacher’s level of effectiveness (LOE-Cont), a continuous composite measure of teacher effectiveness that combines teacher observation scores with “growth” and “achievement” scores. The growth component for teachers of tested subjects is their TVAAS score (for details, see SAS Analytics Software, 2015). Growth scores for teachers of untested subjects are based on school- or district-wide student outcomes (e.g., accountability test scores, school-wide TVAAS scores). Achievement measures are grade-, school-, or district-wide student achievement outcomes (e.g., ACT scores and high school graduation rates). A teacher and her school administrators receive the teacher’s observation, growth, achievement, and LOE-Cont scores in advance of her first observation, as these scores partially determine the number of observations assigned by state policy. Moreover, observers can access a teacher’s prior observation scores.
Theory of Action
This study analyzes feedback concerning refinement areas because we are most interested in the characteristics of critical feedback. Reinforcement area feedback emphasizes practices that a teacher should continue; it is not intended to change teaching practices. In contrast, the refinement area, which we classify as critical feedback, is designed to change teachers’ practices. Subsequent references to feedback refer exclusively to the refinement area of feedback.
The TEAM theory of action asserts that teacher performance as measured by the TEAM rubric will improve when teachers receive critical feedback that (a) aligns with areas of refinement, (b) references evidence, (c) explicitly sets improvement goals and times, and (d) includes actionable next steps. Written feedback for performance improvement is expected to exhibit “alignment” to a TEAM indicator because refinement feedback aims to improve some aspect of teaching measured by the TEAM rubric. As observers receive annual training about the importance of referencing evidence, identifying next steps for teacher improvement, and goal-setting, the analyzed feedback is expected to include characteristics (b) to (d) mentioned earlier. The TEAM theory of action assumes that feedback provided at any point of the year, even late in the school year, can improve teaching as measured by observation scores and may be able to improve value-added scores (i.e., TVAAS scores; Alexander, 2016). Emerging evidence corroborates this assumption (Phipps, 2018).
If feedback improves teacher performance as measured by the TEAM rubric, it is expected to improve TVAAS scores. Prior work finds that higher TEAM scores are associated with higher student achievement on standardized tests (Daley & Kim, 2010). Teachers whose teaching improves more may raise their students’ achievement scores by more, translating into even higher TVAAS scores. Similar logic undergirds recently reformed teacher evaluation systems (Steinberg & Donaldson, 2016).
Data and Method
Data
Our data come from administrative records obtained from TNDOE via the Tennessee Education Research Alliance. Table 2 presents descriptive statistics for variables from the 2013–2014 school year, the baseline year. Teachers’ data include years of experience, age, gender, highest degree held, and race/ethnicity. We also obtained teachers’ LOE scores and scores from LOE determinants.
Standardized Differences Between Sample and Population
Note. LOE = level of effectiveness; TVAAS = Tennessee Value-Added Assessment System.
0.40 ≥ |standardized difference| ≥ 0.20, a “small” difference (Cohen, 1988).
TNDOE collects a written version of post-observation feedback regarding a teacher’s area of refinement after each observation. Observers record this written feedback into a TNDOE information management system, and unique identifiers link feedback data to individual teachers. Some teachers received a single observation and post-observation feedback session, whereas others received several. In our sample, the average teacher was observed 3 times and received three post-observation conferences, although some received as many as 10.
We examine written critical feedback entries for 1,219 randomly selected early-career teachers (i.e., teachers with fewer than 5 years of experience) across the state. We focus on early-career teachers because Tennessee policy requires these teachers to be observed at least 4 times per year unless they receive a prior-year effectiveness score of 5, which would place them in the highest category of teacher effectiveness. In addition, prior research suggests that early-career employees are more likely to benefit from on-the-job feedback (Kimball, 2003), implying that if there are associations between feedback quality and changes in performance among any group of teachers, it would be early-career teachers, ceteris paribus. Our randomly chosen sample closely resembles the population of all early-career Tennessee teachers (see Table 2). The only statistically significant difference is the proportion of non-White teachers in our sample (0.05) compared with those in the state (0.13).
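As a hedged illustration of the sample-to-population comparison, the sketch below computes a standardized difference between two proportions. The pooled-variance formula and the function name are assumptions made for illustration; the text does not state which formula underlies Table 2. Only the two proportions (0.05 and 0.13) and the 0.20 to 0.40 "small" band come from the text.

```python
import math


def standardized_difference_proportions(p_sample: float, p_population: float) -> float:
    """Standardized difference between two proportions.

    Assumed form: the raw difference divided by the square root of the
    average of the two binomial variances p * (1 - p).
    """
    pooled_var = (p_sample * (1 - p_sample) + p_population * (1 - p_population)) / 2
    return (p_sample - p_population) / math.sqrt(pooled_var)


# Proportions of non-White teachers reported in the text.
d = standardized_difference_proportions(0.05, 0.13)

# Check the result against the "small" band cited from Cohen (1988).
is_small = 0.20 <= abs(d) <= 0.40
print(f"standardized difference = {d:.2f}, small = {is_small}")
```

Under this assumed formula, the non-White share is the kind of difference that would fall inside the "small" band flagged in the table note.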
Qualitatively Coded Written Feedback
Two analysts independently coded approximately 5,000 individual critical feedback episodes for the characteristics identified in Table 1. If a post-observation conference included critical feedback exhibiting a specific feedback characteristic, analysts flagged the conference accordingly. For example, if one conference included critical feedback making a single reference to evidence, a second conference included critical feedback making several references to evidence, and a third conference no mention of evidence, the first two conferences are coded as one and the last zero. We repeated this process for each of the four feedback characteristics identified for our study.
We adopted two processes to ensure that analysts coded the feedback similarly. First, about 500 conferences were randomly selected and independently coded by each analyst. The mean percentage of agreement in the coding of this subsample was 88% and, as seen in the right-most column of Table 1, it ranged from a high of 91.9% agreement in the “actionable” characteristic to a low of 84.7% in the “goal-setting” attribute. All the percentages of agreement exceed the 80% minimum agreement threshold established by Marques and McCall (2005) and Belur et al. (2021). Where disagreements existed, analysts rereviewed written post-observation feedback entries, discussed differences, and adjusted subsequent coding. Second, to mitigate “coder drift” (Bartholomew et al., 2000), the two analysts conducted check-ins every 2 weeks to maintain their shared understanding of the qualitative codes while sharing examples of the types of text representative of each characteristic (Carey & Gelaude, 2008).
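The percentage-agreement check described above can be sketched as follows, assuming each analyst's codes for the double-coded subsample are stored as 0/1 flags in matching order. The function name and the ten example flags are hypothetical and are not drawn from the study data; only the 80% threshold comes from the text.

```python
def percent_agreement(coder_a: list[int], coder_b: list[int]) -> float:
    """Share of conferences (in percent) on which two coders assigned the same flag."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of conferences.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical goal-setting flags for ten double-coded conferences.
coder_a = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
coder_b = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

agreement = percent_agreement(coder_a, coder_b)

# Compare against the 80% minimum agreement threshold cited in the text.
meets_threshold = agreement >= 80.0
print(f"agreement = {agreement:.1f}%, meets threshold = {meets_threshold}")
```

Percentage agreement does not adjust for agreement expected by chance, so it is best read alongside the base rate of each code.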
Quantitative Analytic Strategies
We convert qualitatively flagged conferences to teacher-level quantitative measures by calculating the share of each teacher’s conferences that included critical feedback exhibiting each of our CFCs. This resulted in four proportions per early-career teacher. Unconditional means and standard deviations (SDs) describe the feedback characteristics of an early-career teacher’s typical (i.e., mean) conference and variation in these characteristics.
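A minimal sketch of this conversion, assuming each coded conference is a record holding a teacher identifier and one 0/1 flag per CFC. The field names, teacher identifiers, and example records are hypothetical; the first teacher mirrors the three-conference evidence example given earlier (two conferences flagged, one not).

```python
from collections import defaultdict

# Hypothetical field names for the four CFC flags.
CFCS = ("aligned", "evidence", "goal_setting", "actionable")


def teacher_cfc_shares(conferences):
    """Return, per teacher, the share of conferences flagged for each CFC."""
    n_conferences = defaultdict(int)
    n_flagged = defaultdict(lambda: dict.fromkeys(CFCS, 0))
    for conf in conferences:
        teacher = conf["teacher_id"]
        n_conferences[teacher] += 1
        for cfc in CFCS:
            n_flagged[teacher][cfc] += conf[cfc]
    return {
        teacher: {cfc: n_flagged[teacher][cfc] / total for cfc in CFCS}
        for teacher, total in n_conferences.items()
    }


conferences = [
    {"teacher_id": "T1", "aligned": 1, "evidence": 1, "goal_setting": 0, "actionable": 0},
    {"teacher_id": "T1", "aligned": 1, "evidence": 1, "goal_setting": 0, "actionable": 1},
    {"teacher_id": "T1", "aligned": 0, "evidence": 0, "goal_setting": 0, "actionable": 0},
    {"teacher_id": "T2", "aligned": 1, "evidence": 0, "goal_setting": 1, "actionable": 0},
    {"teacher_id": "T2", "aligned": 1, "evidence": 0, "goal_setting": 0, "actionable": 0},
]

shares = teacher_cfc_shares(conferences)
for teacher, row in shares.items():
    print(teacher, {cfc: round(value, 2) for cfc, value in row.items()})
```

Because each conference contributes a single 0/1 flag per characteristic, a conference with several references to evidence counts the same as one with a single reference.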
Associations With Baseline Differences
We examine the relationships between each feedback characteristic and the baseline differences in early-career teacher human capital measures using Equation 1:
where
To facilitate interpretation, we standardize continuous measures in
Associations With Early-Career Teacher Outcomes
To investigate the relationships between
The variable
Moderation by Measures of Early-Career Teacher Human Capital
We investigate whether measures of early-career teacher human capital moderate the relationships between
where
Findings
Distributions and Correlations of CFCs
Although there is substantial variation in the concentration of CFCs early-career teachers receive, the modal teacher tends not to receive any of the feedback examined and the mean teacher receives low concentrations of critical feedback (Figure 1). Each panel of Figure 1 shows that some teachers receive no conferences with the CFC, whereas every conference received by other teachers includes it (the range of each characteristic is zero to one). SDs also suggest ample within-characteristic variation, as each SD is approximately 0.24 units. However, there are some important between-characteristic differences. Approximately 50% to 80% of conferences included aligned feedback for a relatively large number of early-career teachers, which is why 66% of the mean teacher’s conferences are aligned (dashed line). Moreover, 80% of the modal teacher’s conferences contained aligned feedback. However, distributions of the remaining CFCs are somewhat disappointing. None of the conferences received by the modal teacher include evidence, goal-setting, or actionable feedback, and no more than one-third of the conferences received by the mean teacher exhibited these CFCs (dashed lines). Goal-setting and actionable feedback are particularly rare, as approximately half of the teachers in our sample received no conferences with these CFCs.

Histograms: Share of early-career teachers’ feedback conferences including feedback characteristics.
Teachers whose conferences document high shares of one CFC tend not to show high shares of any other CFC (Table 3). That is, the data suggest that few teachers receive conferences including more than one CFC over a school year. The largest correlation between the shares of conferences that include any two CFCs is 0.33 (evidence and actionable characteristics), a modest relationship at best. Correlations with the alignment CFC range from 0.18 to 0.28, all of which are low enough to suggest that its provision is largely unrelated to the provision of any other CFC. The weakest correlations belong to the goal-setting CFC, ranging from 0.08 with the actionable CFC to a near-zero −0.03 with evidence.
Correlations Between Critical Feedback Characteristics
Note. N = 1,219. Teachers are the unit of analysis; includes first-year teachers. Pearson correlations.
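As a hedged sketch of how a teacher-level Pearson correlation between two CFC shares can be computed, the example below uses only the standard library. The function name and the five pairs of shares are hypothetical illustrations, not values from the study.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two sequences of equal length with at least two values.")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("Correlation is undefined when a sequence is constant.")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical teacher-level shares of conferences with each CFC (one value per teacher).
evidence_share = [0.00, 0.25, 0.50, 0.33, 1.00]
actionable_share = [0.00, 0.50, 0.25, 0.00, 0.67]

r = pearson_r(evidence_share, actionable_share)
print(f"r = {r:.2f}")
```

Each teacher contributes one pair of shares, so the correlation describes whether teachers with high shares of one characteristic also tend to have high shares of another.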
Taken together, findings in Figure 1 and Table 3 suggest that most teachers receive post-observation conferences that include few, if any, of the examined feedback characteristics. In addition, if a teacher receives a high share of one CFC, they are unlikely to receive a high share of another. Although most early-career teachers’ conferences may not exhibit many characteristics of critical feedback, each of the four CFCs has considerable variation to exploit for the current study. To do so, we first describe who receives higher concentrations of conferences with each of the four CFCs by examining teacher baseline human capital measures.
CFCs and Measures of Early-Career Teacher Human Capital
Table 4 reports associations between CFCs and measures of early-career teacher human capital. The only consistent relationship detected is that teachers with lower prior-year effectiveness scores tend to receive conferences with higher concentrations of each CFC (Table 4). An early-career teacher with a 1 SD lower prior-year effectiveness score is 2 probability points more likely to receive actionable feedback in all their post-observation conferences, conditional on observation scores (Column I). The relationships between prior-year effectiveness and the evidence (Column III) and goal-setting (Column IV) CFCs are similar in magnitude to the relationship with actionable feedback. Prior-year effectiveness relates most strongly to the share of conferences presenting the aligned CFC. A 1 SD decrease in prior-year effectiveness is associated with a 7 probability point increase in the chance that all feedback conferences exhibit aligned feedback (Column II).
Table 4. Associations With Teacher Baseline Differences
Note. Standard errors clustered at school level. All predictors are standardized.
*p < .05. **p < .01. ***p < .001.
Although the relationship with LOE-Cont is statistically significant across models, the largest association is between years of experience and the evidence CFC (Column III). Early-career teachers with 1SD fewer years of experience (i.e., 1 year less experience) are 27 probability points more likely to receive a feedback conference including the evidence-referencing CFC. However, the provision of no other CFC depends on years of experience. Similarly, none of the CFCs depend on prior-year observation scores or education level.
Ultimately, less-experienced early-career teachers are more likely to receive feedback with the evidence CFC, while less effective teachers are slightly more likely to receive conferences with all CFCs. Critically, as observation scores partially determine effectiveness scores and all models control for effectiveness and observation scores, relationships with prior-year effectiveness are effectively based on variation in nonobservation score components (i.e., student outcomes).
CFCs and Early-Career Teacher Outcomes
Although prior research suggests that receiving feedback with the CFCs improves teacher performance, little evidence in our data substantiates this claim (Table 5). A 10 percentage point rise in the share of teachers’ conferences presenting actionable feedback is associated with a near-zero, statistically nonsignificant decline in observation scores of 0.01 SD (0.6 units; Panel A). Because this association is negligible in magnitude and precisely estimated (standard error of 0.01), measurement error is unlikely to drive this null finding; we validate this below. Similar patterns exist between teacher observation scores and the alignment and evidence CFCs in Panel A. Only the goal-setting relationship is statistically significant, but it is negative (−0.02 SD). The bottom panel of Table 5 also tends to show near-zero and null relationships between TVAAS scores and CFCs. Most coefficients are precisely estimated but not statistically significant. Again, only the goal-setting CFC is statistically significant, but this time the relationship is positive: a 10 percentage point increase in the share of conferences with goal-setting feedback is associated with a rise of 0.03 SD in TVAAS.
Table 5. Associations With Teacher Outcomes
Note. Outcomes are standardized. Standard errors clustered at school level. CFCs = critical feedback characteristics; TVAAS = Tennessee Value-Added Assessment System.
*p < .05.
Measurement Error
Measurement error is one of the most fundamental problems in education policy analysis and evaluation, particularly when relying on human coding of written feedback to operationalize complex constructs. Although the percentages of interrater agreement in Table 1 exceed the 80% minimum agreement threshold proposed by some researchers (Belur et al., 2021; Marques & McCall, 2005), error in the coding of feedback characteristics may explain the null associations with actionable, aligned, and evidence-based CFCs. We explore this issue using two conceptually different sensitivity tests, both of which refute this hypothesis.
We first examine the sensitivity of our estimates by applying errors-in-variables (EiV) regression, which adjusts potentially error-prone estimated coefficients and standard errors for additive measurement error. If substantial measurement error exists, we presume that it is “additive” (for a discussion of additive measurement error, see Hardin & Carroll, 2003). This presumption is plausible because qualitative analysts coded randomly selected feedback episodes and knew nothing about the feedback providers, recipients, their schools, or their districts. As recommended by Lockwood and McCaffrey (2020), we apply bootstrapped standard errors. To specify the reliability ratio of the potentially error-prone variable (a critical component in the EiV approach), we use reliabilities calculated from the interrater agreements reported in Table 1. 10
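The logic of the EiV adjustment can be illustrated with a minimal sketch. It is not the estimation code used in the study. It assumes a single error-prone regressor with no covariates, classical additive error, and a known reliability ratio, in which case the naive slope is attenuated by the reliability ratio and can be corrected by dividing by it. All data are simulated, the function names are hypothetical, and the bootstrapped standard errors mentioned above are omitted.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

def eiv_slope(w, y, reliability):
    """Disattenuate the naive slope using an assumed reliability ratio."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return ols_slope(w, y) / reliability

# Simulated illustration (all values are assumptions, not study data).
rng = np.random.default_rng(1)
n = 20000
true_beta = 0.5
x = rng.normal(0.0, 1.0, n)      # error-free regressor, unobserved in practice
u = rng.normal(0.0, 0.5, n)      # additive measurement error, independent of x
w = x + u                        # observed, error-prone regressor
y = true_beta * x + rng.normal(0.0, 1.0, n)

# Reliability ratio = var(x) / var(w) = 1 / (1 + 0.25) = 0.8 by construction.
reliability = 1.0 / (1.0 + 0.5**2)

naive = ols_slope(w, y)                     # attenuated toward zero
corrected = eiv_slope(w, y, reliability)    # rescaled by 1 / reliability
print(f"naive = {naive:.3f}, corrected = {corrected:.3f}")
```

In this simulation the naive slope is pulled toward zero while the corrected slope lands close to the true value. With additional covariates the adjustment is a matrix analogue of this ratio, which is why dedicated EiV routines are normally used in applied work.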
The second sensitivity test uses the simulation extrapolation method (SIMEX) developed by Cook and Stefanski (1994). SIMEX simulates what happens to estimated coefficients and standard errors if the suspected error-prone variable suffered from more additive measurement error. By adding additional measurement error via a resampling approach, SIMEX generates a measurement error trend and then uses that trend to extrapolate back to an error-corrected estimate (i.e., an estimate with no measurement error). In our case, a meaningful difference between the noncorrected regression estimates (and standard errors) and the SIMEX estimates (and standard errors) would suggest that measurement error biased the Table 5 results.
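The simulate-then-extrapolate steps described above can be sketched as follows. This is a simplified illustration, not the study's implementation. It assumes one error-prone regressor, a known measurement error standard deviation (`error_sd`), and a quadratic extrapolant. The function names, the grid of added-error multipliers, and the simulated data are all hypothetical.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))

def simex_slope(w, y, error_sd, lambdas=(0.5, 1.0, 1.5, 2.0), n_sims=100, seed=0):
    """SIMEX estimate of the slope of y on an error-prone regressor w.

    Simulation step: for each lam, add extra noise with variance
    lam * error_sd**2 and average the resulting naive slopes.
    Extrapolation step: fit a quadratic in lam and evaluate it at
    lam = -1, the point that corresponds to no measurement error.
    """
    rng = np.random.default_rng(seed)
    grid = [0.0]
    means = [ols_slope(w, y)]          # lam = 0 is the naive estimate
    for lam in lambdas:
        sims = []
        for _ in range(n_sims):
            w_noisy = w + np.sqrt(lam) * error_sd * rng.normal(size=w.shape)
            sims.append(ols_slope(w_noisy, y))
        grid.append(lam)
        means.append(float(np.mean(sims)))
    coefs = np.polyfit(grid, means, deg=2)
    return float(np.polyval(coefs, -1.0))

# Simulated illustration (assumed values, not study data).
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(0.0, 1.0, n)                  # error-free regressor
w = x + rng.normal(0.0, 0.5, n)              # observed with additive error
y = 0.5 * x + rng.normal(0.0, 1.0, n)        # true slope is 0.5

naive = ols_slope(w, y)
simex = simex_slope(w, y, error_sd=0.5)
print(f"naive = {naive:.3f}, SIMEX = {simex:.3f}")
```

The quadratic extrapolant is a common default that tends to recover most, though not all, of the attenuation. If the SIMEX estimate sat close to the naive one, as the study reports for its own estimates, that would point to little measurement error bias.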
The point estimates and confidence intervals displayed in Figure 2 indicate that the Table 5 near-zero null results are unaffected by biasing measurement error. Each cell in Figure 2 displays the EiV and SIMEX coefficients and 95% confidence intervals for an association between one of the CFCs (actionable, alignment, or evidence) and one of the outcomes (observation or TVAAS score). A separate equation generates each point estimate and confidence interval. Given that the sensitivity tests use bootstrapped standard errors, and our primary analysis used clustered standard errors, we re-estimate Equation 2 with bootstrapped standard errors; these estimates are labeled “primary.” In Figure 2, the point estimates and confidence intervals are remarkably similar within each cell, supporting the conclusion that Table 5 results are unaffected by biasing measurement error. 11
Figure 2. Measurement error sensitivity test results.
Moderated Associations
Observation Scores and Measures of Teacher Human Capital
There is little evidence to suggest that teacher human capital measures moderate the relationship between observation scores and shares of CFCs (Table 6). None of the main or moderated associations with the actionable or aligned CFCs are statistically significant (Columns I and II). Associations between conferences including evidence-referencing CFCs and observation scores depend on prior-year effectiveness scores, independent of information contained in prior-year observation scores. However, these are the only moderated associations detected in Column III. The least effective early-career teachers receiving a 10 percentage point higher share of conferences with evidence-referencing CFCs have lower observation scores (−0.06 SD; Panel A, Column III), and the interaction increases observation scores by 0.01 with each SD increase in prior-year effectiveness score (~ 80 LOE-Cont units). Yet even the total association for early-career teachers with the highest prior-year effectiveness score is 0.01 SD and nonsignificant at the 5% level. We also find that less-educated early-career teachers who receive a 10 percentage point higher share of conferences with the goal-setting CFC have lower observation scores (−0.02 SD), while there is effectively no relationship among more educated early-career teachers (Panel C, Column IV).
Table 6. Associations With Observation Scores Moderated by Teacher Human Capital
Note. The outcome is standardized. Standard errors, in parentheses, clustered at school level. LOE = level of effectiveness.
*p < .05.
TVAAS Scores and Measures of Teacher Human Capital
Thus far, the only association with teacher outcomes corroborating our hypotheses is between the goal-setting CFC and TVAAS scores. Results in Table 7 suggest that less-educated and less-experienced teachers drive this association, as expected (Column IV). Less-educated early-career teachers receiving a 10 percentage point higher share of conferences including the goal-setting CFC have TVAAS scores that are 0.03 SD higher, whereas the least experienced early-career teachers receiving a higher dosage of goal-setting feedback have TVAAS scores that are 0.05 SD higher. No other interactions with education and experience are statistically significant. Furthermore, we conclude that there are no other moderated relationships between TVAAS scores and actionable, aligned, or evidence-referencing CFCs. Although the interaction between experience and actionable feedback is statistically significant, the main association is not, which is why we infer that there is no evidence of moderation in Column I.
Table 7. Associations With TVAAS Scores Moderated by Teacher Human Capital
Note. The outcome is standardized. Standard errors, in parentheses, clustered at school level. TVAAS = Tennessee Value-Added Assessment System; LOE = level of effectiveness.
*p < .05.
Discussion
Prior work suggests that observation processes, specifically post-observation feedback, influence teacher development (Donaldson, 2021). Theoretically, observation processes in recently reformed teacher evaluation systems provide teachers with critical feedback that will improve their performance, directly or indirectly, by pointing them to appropriate professional learning opportunities (Donaldson, 2021). This study examined the prevalence of four CFCs (specifically, evidence-referencing, goal-setting, aligned to improvement area, actionable) in post-observation conferences within Tennessee’s reformed evaluation system. Unlike prior studies, which predominantly rely on self-reports and self-selected participants, we qualitatively coded nearly 5,000 instances of written feedback provided to a random sample of early-career teachers. We then converted codes for quantitative analysis. Ultimately, we created a data set described by others as presenting previously unseen analytic affordances and contributions to teacher evaluation research (Fesler et al., 2019). As such, several of our findings offer new insights on teacher observation processes and challenge the findings of previous feedback research.
Our data suggest that few early-career Tennessee teachers’ conferences include CFCs. Over an academic year, nearly half of the teachers in our sample did not receive any actionable or goal-setting feedback, and nearly one third did not receive feedback referencing evidence. In most teachers’ conferences, the only CFC present was feedback aligned to an identified improvement area, a relatively easy characteristic to manifest in the Tennessee context. Tennessee’s information management system forces observers to identify an area of strength and improvement after every observation, prompting them to provide feedback aligned with the improvement area. Although a crucial policy and practice takeaway might be to increase the provision of CFCs, which we agree with in concept, we are cautious about providing a blanket declarative given the limited associations between our CFC measures and important teacher outcomes, a finding we address in greater detail later in the discussion.
Despite the low concentration of CFCs received by the modal teacher, we found substantial variation in teachers’ shares of conferences presenting each CFC. Using this variation, we conclude that teachers whose conferences exhibit high shares of one CFC tend not to exhibit high shares of other CFCs. We do not know whether the correlational evidence implies that observers are unskilled at providing several characteristics of critical feedback simultaneously or if their choice to provide specific characteristics is strategic; untangling these interpretations warrants additional research.
Although the visual and correlational evidence implies opportunities to improve observation conference implementation, select analyses find that observers issue critical feedback to the early-career teachers who need it most. We examined whether teacher experience, education level, prior-year effectiveness, and observation scores played a role in determining the recipients of high shares of conferences with CFCs. As most of the relationships between CFCs and human capital measures were negative, the evidence partially supports our hypotheses that teachers with less human capital would receive higher shares of critical feedback. Simultaneously, most of these negative relationships were statistically nonsignificant. Tennessee’s de facto teacher composite measure of effectiveness—LOE—was the only human capital measure examined that consistently affected the provision of CFCs. As hypothesized, less effective early-career teachers were more likely to receive higher concentrations of CFCs, extending strategic management of human capital theory by suggesting that teacher effectiveness informs feedback provision.
That teachers with less human capital receive conferences with theoretically performance-enhancing feedback is promising. However, we suspect that policymakers are more interested in subsequent improvements to teaching performance. The evidence is disappointing in this regard, as the receipt of higher shares of critical feedback across conferences is not associated with subsequent teacher performance, on average. Goal-setting is the only CFC associated with subsequent observation or TVAAS scores; the association with observation scores is negative, whereas the association with TVAAS scores is positive. Other research concludes that students taught by teachers with 1.0 SD higher value-added are 0.82 percentage points more likely to attend college, will enjoy an adult income hike of 1.3%, and will score 0.13 SD higher on end-of-year tests (Chetty et al., 2014). Back-of-the-envelope calculations suggest that increasing the share of teachers’ conferences presenting the goal-setting CFC by 1SD (21 percentage points) may increase the likelihood of college attendance by 0.17 percentage points, raise adult income by 0.27%, and increase test scores by 0.03 SD. These are small but meaningful changes.
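The arithmetic behind the back-of-the-envelope figures is not spelled out in the text. The sketch below only checks what scaling of the Chetty et al. (2014) per-SD estimates is consistent with the reported numbers. The scale factor of roughly 0.21 SD of value-added is inferred here from the reported figures and should be read as an assumption, not as a value stated by the authors.

```python
# Chetty et al. (2014) estimates per 1.0 SD of teacher value-added,
# as quoted in the text.
per_sd_value_added = {
    "college_attendance_pp": 0.82,
    "adult_income_pct": 1.3,
    "test_score_sd": 0.13,
}

# Back-of-the-envelope figures reported in the text for a 1 SD
# (21 percentage point) rise in the goal-setting share.
reported = {
    "college_attendance_pp": 0.17,
    "adult_income_pct": 0.27,
    "test_score_sd": 0.03,
}

# Scale factor implied by each reported figure (an inference, not a
# quantity given in the text).
implied_scale = {k: reported[k] / per_sd_value_added[k] for k in reported}

# Rescaling by an assumed common factor of 0.21 SD and rounding to two
# decimals, to compare against the reported figures.
assumed_scale = 0.21
rescaled = {k: round(assumed_scale * v, 2) for k, v in per_sd_value_added.items()}

for k in reported:
    print(f"{k}: implied scale = {implied_scale[k]:.3f}, "
          f"rescaled = {rescaled[k]}, reported = {reported[k]}")
```

The three implied factors agree to within rounding, which suggests that a single conversion from the goal-setting share to value-added units underlies all three reported outcomes.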
Why might CFCs associate with observation scores negatively and TVAAS scores positively? The answer may rest in the specific goals set during feedback conferences. An established body of work suggests that observation and value-added scores capture different aspects of teaching (Bacher-Hicks et al., 2017; Hill & Grossman, 2013; Kraft et al., 2020). Goal-setting CFCs may focus on teaching aspects that are more related to gains in student achievement than teaching practice as evaluated by a classroom observation rubric. Given that TVAAS scores represent a sizable component of a teacher’s effectiveness score, it is plausible that Tennessee observers set goals in written feedback focused on improving student test scores. Furthermore, the negative main and moderated associations with observation scores may reflect bias. Research suggests that an observer’s prior knowledge about a teacher’s performance can influence observation scores (Hunter, 2020). Observers might issue critical feedback to teachers they expect will have difficulty improving; these expectations may downwardly bias subsequently assigned observation scores independently of teachers’ performance. Although such bias affects observation scores, it cannot affect TVAAS scores, which may account for the positive TVAAS associations.
In addition, moderation analyses find that goal-setting feedback provided to less-educated and less-experienced early-career teachers positively correlates with TVAAS scores. These two moderated associations are consistent with the idea that the purpose of goal-setting is to help early-career teachers understand performance expectations. Teachers may develop their understanding of performance expectations through formal education or on-the-job experience. Goal-setting may compensate for the absence of these experiences among the less-educated and less-experienced.
We recognize that our findings are at odds with some prior research. For example, teacher survey results from two states suggest that teachers receive CFCs (e.g., Cherasaro et al., 2016), while Chicago teachers and administrators say that evaluations are helpful, implying effective feedback (Sartain et al., 2020). The uniqueness of our data may largely explain these differences, underscoring the need for further research using similar data.
Our study is not without limitations. First, we only examine four CFCs, but research identifies many feedback characteristics for improvement, including specificity and timeliness (Jawahar, 2010; Kimball, 2003; Kinicki et al., 2004). Future research might directly examine additional characteristics of feedback provided to teachers and relate those characteristics to teacher performance. Prior survey research also finds that employee reactions to feedback mediate associations between feedback characteristics and employee performance (Jawahar, 2010). During the post-observation conference, the words, tone, and body language of an observer may also affect how a teacher responds to their oral and written feedback. It would be helpful to know whether these same mediators and moderators apply to information based on the feedback directly provided to teachers during formal conferences.
Second, the associations between the proportion of teacher conferences exhibiting specific CFCs and subsequent teacher performance and effectiveness may not be causal. That we find associations between teacher baseline human capital measures and the provision of CFCs underscores this limitation. Indeed, the lack of positive associations with teacher performance and effectiveness may be explained by the fact that early-career teachers with less human capital receive conferences with higher shares of the feedback examined, which in turn may introduce negative bias.
Finally, the generalizability of our results is limited. We purposefully selected teachers in their first 5 years on the job; our findings may not generalize to mid- or late-career teachers. In addition, as findings from feedback research may depend on the observation system in which the feedback is provided, our findings may not generalize beyond the TEAM evaluation system. Machine learning techniques may be well-suited to addressing this limitation. Computers can learn how we qualitatively coded written feedback and then apply our coding procedures to other written feedback in new samples, substantially expanding the analytical data set.
Notwithstanding these potential limitations, the evidence suggests that feedback processes within Tennessee’s teacher evaluation system, one of the most mature “next-generation” systems in the United States (Koedel et al., 2019; Steinberg & Garrett, 2016), are not working as intended. We end by discussing how policy and research might improve feedback processes, but we urge caution befitting a first-of-its-kind study like ours. Indeed, such prudence underscores the need for more large-scale research using externally coded feedback episodes from field settings.
Based on prior work, the lack of CFCs and unidimensionality of feedback suggest that policymakers should promote higher levels of CFCs and multidimensional feedback, which theoretically improves teacher performance. However, evidence in the current study suggests that higher concentrations of the examined CFCs do not affect teacher performance, implying that policies pushing for higher concentrations of critical feedback, ceteris paribus, may not be effective. Similarly, other research implies that increasing the number of observations per teacher to increase the total levels of critical feedback received, ceteris paribus, may be unwise without first addressing the quality of the observations themselves. For example, emerging quasi-experimental research finds that more observations do not improve student achievement (de Barros, 2019; Hunter, 2019). Furthermore, while there is mixed evidence concerning the burdens of higher observational loads on evaluators, it seems that evaluators may cope with higher loads by reducing time spent in pre- and post-conferences, potentially undermining the developmental goals of increasing observations (Hunter & Rodriguez, 2021; Kraft & Gilmour, 2016; Rigby, 2015). Ultimately, we believe there is not yet enough evidence to inform policy regarding the provision of CFCs, highlighting the need for more research. Specifically, we urge researchers to expand on work like ours by using externally coded feedback episodes linked to subsequent teacher performance measures.
Our study implies that policymakers might take steps to ensure that certain teachers receive conferences with high concentrations of goal-setting feedback. Although the main associations between goal-setting and TVAAS are significant, they are driven by feedback received by beginning and less-educated teachers, implying that goal-setting benefits these teacher groups. Our findings suggest that teachers’ prior-year effectiveness drives the provision of goal-setting feedback, not teacher experience or education. In light of this, policy-driven supports are likely needed to change how evaluators issue goal-setting feedback. For example, policies might encourage principal preparation programs to train leaders in performance management principles and the role of goal-setting in employee improvement (Gallo, 2011). Another avenue is for state or district policies to support executive coaching to transform school administrator (i.e., evaluator) practices (Hagen, 2012; Huff et al., 2013). At the same time, recent research shows workshop-style in-service is typically ineffective, whether for teachers (Garet et al., 2001; Penuel et al., 2007) or administrators (Kraft & Christian, 2021).
Our study also raises questions about other opportunities for improving feedback and observation processes. The findings suggest that the feedback provided to teachers across their observations is predominantly unidimensional, exhibiting just one of the examined CFCs. Critical feedback may be more effective when it is multidimensional, consistently exhibiting multiple CFCs. Future research should examine the effect of multidimensional critical feedback on teacher performance. As we do not find many substantive associations, researchers might also explore the policy-manipulable factors that might suppress feedback’s effects (e.g., lack of targeted teacher professional learning opportunities supporting formal goal attainment). Although psychological research identifies several employee (teacher) cognitive factors suppressing feedback’s effects (Jawahar, 2010; Kinicki et al., 2004; London & Smither, 2002), we suspect that policy is ill-positioned to change cognitively based suppressors. 12 However, some have intimated that loose or nonexistent connections between evaluation processes and professional development systems suppress teacher performance improvement (Donaldson, 2021; Papay, 2012; Weisberg et al., 2009). Large-scale research exploring the importance of these connections might eventually show that policymakers should purposefully link evaluation and professional development systems, if feedback and other observation processes are to improve teacher performance and, ultimately, the learning opportunity provided to students.
Supplemental Material
Supplemental material, sj-docx-1-epa-10.3102_01623737211062913 for Critical Feedback Characteristics, Teacher Human Capital, and Early-Career Teacher Performance: A Mixed-Methods Analysis by Seth B. Hunter and Matthew G. Springer in Educational Evaluation and Policy Analysis
Footnotes
Acknowledgements
This article is much improved from its original versions, thanks to helpful critical feedback from several parties, including EEPA editors and reviewers, the Tennessee Department of Education, Tennessee Education Research Alliance, Association for Public Policy Analysis and Management, Association for Education Finance and Policy, and Inequality Seminar at The University of North Carolina at Chapel Hill. We are also grateful to Karin Gegenheimer for excellent research assistance. Corresponding author, Matthew G. Springer, can be reached at mgspringer@unc.edu.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
Authors
SETH B. HUNTER is an assistant professor of education leadership at George Mason University. His research focuses on teacher leadership and the policies and practices of educator evaluation.
MATTHEW G. SPRINGER is the Robena and Walter E. Hussman, Jr. Distinguished Professor and chair of the Educational Policy and Organizational Leadership at The University of North Carolina at Chapel Hill. His research focuses on accountability, compensation, and incentives.
References