Abstract
Papay (2011) noticed that teacher value-added measures (VAMs) from a statistical model using the most common pre/post testing timeframe–current-year spring relative to previous spring (SS)–are essentially unrelated to those same teachers’ VAMs when instead using next-fall relative to current-fall (FF). This is concerning since this choice–made solely as an artifact of the timing of statewide testing–produces an entirely different ranking of teachers’ effectiveness. Since subsequent studies (grades K/1) have not replicated these findings, we revisit and extend Papay’s analyses in another Grade 3–8 setting. We find similarly low correlations (.13–.15) that persist across value-added specifications. We delineate and apply a literature-based framework for considering the role of summer learning loss in producing these low correlations.
During the past 10 years, many U.S. states have revised their teacher evaluation systems to incorporate value-added measure (VAM) scores. Federal Race to the Top grants initially spurred these changes, and the more recent Every Student Succeeds Act, although not requiring states to link teacher evaluations specifically to test scores, codified the expectation that districts distinguish teachers based on effectiveness (Berg-Jacobson, 2016). As of 2019, 26 states require teacher evaluations to include student growth data based on standardized tests (Ross & Walsh, 2019), and Steinberg and Donaldson (2016) report that 61% of the nation’s largest school districts include VAM scores in teacher evaluations. At their core, value-added (VA) models attempt to isolate the causal effect of individual teachers on student achievement by predicting students’ end-of-year test scores while taking into account baseline scores from the previous spring. 1 Yet estimating teachers’ causal effects in this way is far from straightforward. As VAM usage expanded, a large body of research 2 emerged to consider whether the statistical properties of VAM scores (e.g., precision, unbiasedness) could support high-stakes teacher talent management decisions such as teacher hiring, retention, compensation, or tenure.
In one such study, Papay (2011) briefly shows that teacher VAM scores from models using the most common pre/post testing timeframe—that is, based on changes in test scores from last spring to current spring—are essentially orthogonal (correlation of -.10) to those same teachers’ VAM scores from a model using test score changes from current fall to next fall. 3 This finding should be of great concern: There is no principled reason to use spring-to-spring over fall-to-fall pre/post timeframes to construct VAMs, and according to Papay’s results, this choice—made solely as an artifact of the timing of statewide testing systems—would lead to an entirely different ranking of teachers’ effectiveness.
This troubling finding warrants further study for several reasons. First, because VAM sensitivity to the choice of pre/post timeframe was not his focus, Papay (2011) did not explore what potential mechanisms might be at work. It is also possible the finding was idiosyncratic to the assessment at hand, the specific district or decade, or the specification of the model used to produce the VAM scores. In fact, a handful of subsequent studies have used Early Childhood Longitudinal Study–Kindergarten Class (ECLS-K) data to explore this question in kindergarten and first grade but have not found similarly low correlations (Gershenson & Hayes, 2018; Hayes & Gershenson, 2018; Palardy & Peng, 2015). Given that Papay’s findings would have serious implications for policies that use VAM scores, but results from later studies substantively differ, we take up this line of inquiry in a new school district with data from a more recent period in the tested subjects and grades in which VAM scores are typically implemented.
Research Questions
Because the current district administered the same assessment for all Grade 3 through 8 students in both the fall and spring for at least 3 consecutive years, we can do what is typically not possible with state standardized test score data: For each teacher, we estimate three VAM scores using spring-to-spring (S→S), fall-to-fall (F→F), and fall-to-spring (F→S) test timeframes. 4 We then examine whether VAM rankings are sensitive to choice of pre/post timeframe and whether findings are specific to only certain VA model specifications. We also consider the role summer learning loss (SLL) may play in producing this result. We pose the following four questions:
(1) Using a VA model as similar as possible to that used by Papay (2011), do we observe the same no-correlation finding now in a different district, during a different time period, and using a different test?
(2) Does the correlation remain low when using updated versions of the VA model?
(3) Do patterns of variation in SLL in the current district help us understand this finding?
(4) Is there heterogeneity in the results across teachers, grades, and schools?
As a preview, our basic results are similar to Papay’s: We find very low correlations between SS- and FF-based VAM scores (.13 for English language arts [ELA], .15 for math). This finding, now apparent in two educational contexts, raises concerns about the expansion of using VAM scores for teacher evaluation during the past decade.
Conceptual Framework
An ideal way to isolate teachers’ causal effects on student test scores would require administration of a reliable assessment at the very start of fall and end of spring in each school year. This FS timing would most directly capture a student’s achievement gains that occurred while assigned to a given teacher. An artifact of administering standardized tests only once annually is that a summer period—either the summer before (SS) or the summer after (FF) a given school year—will be attributed to the teacher, even though the teacher and student do not interact during that time (see Figure 1). Although neither summer misattribution is desirable, incorporating the presummer (SS) time seems more problematic because the teacher has not yet met the student. In sum, we can think of FS-based VAMs as ideal but impractical, SS as pervasive but problematic (includes the prior summer), and FF as an uncommon, but more desirable, alternative to SS.

Figure 1. Simplified illustration of the three pre/post test score timeframes FF, FS, & SS used to estimate the effect of a Grade 4 teacher on one student’s test scores.
Figure 1 illustrates this pre/post timeframe issue. First, note that estimating all three VAMs for the same teacher requires data across a minimum of 3 years (here Grades 3, 4, and 5). Although nine basic patterns of SLL preceding and following a given school year are possible, 5 we highlight two scenarios that would lead SS-, FS-, and FF-based VAMs for the same teacher all to be different from one another. Consider the simplified Scenario 1 (left) in which we estimate the impact of a Grade 4 teacher on one student’s test scores (zig-zagging line, dashed and shaded during the summers). The horizontal lines at the bottom depict SS (black dashed), FS (gray dashed), and FF (gray solid) timeframes and the summers they encompass. The corresponding vertical lines on the right of the Scenario 1 graph illustrate that the student’s growth appears smaller for this teacher when using the SS timeframe and larger when using FF. We also can see that summer gains/losses will affect the ordering of these VAM scores: Spring scores are the same on both the left and right panels of Figure 1; however, Scenario 1 (left) shows that the child lost ground in the summer before Grade 4 but nearly sustained her school year growth rate in the summer after Grade 4. Scenario 2 (right) shows the opposite pattern (gain in presummer, loss in postsummer), and now SS attributes the most growth to the Grade 4 teacher and FF the least. We therefore need to think carefully not only about the unintended inclusion of summers but about how the patterns of summer learning will affect teacher VAMs.
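The ordering in the two scenarios can be checked with simple arithmetic. The sketch below uses made-up scores for one student (they are illustrative assumptions, not values read from Figure 1 or from the district data) and computes the gain that each timeframe would credit to the Grade 4 teacher. Spring scores are held fixed across the two scenarios, as in the figure.

```python
# Hypothetical test scores for one student; all values are illustrative assumptions.
def timeframe_gains(spring_g3, fall_g4, spring_g4, fall_g5):
    """Return the score gain each pre/post timeframe credits to the Grade 4 teacher."""
    return {
        "SS": spring_g4 - spring_g3,  # includes the summer before Grade 4
        "FS": spring_g4 - fall_g4,    # school year only
        "FF": fall_g5 - fall_g4,      # includes the summer after Grade 4
    }

# Scenario 1: loss in the presummer, continued growth in the postsummer.
scenario_1 = timeframe_gains(spring_g3=200, fall_g4=196, spring_g4=206, fall_g5=209)
# Scenario 2: same spring scores, but gain in the presummer and loss in the postsummer.
scenario_2 = timeframe_gains(spring_g3=200, fall_g4=203, spring_g4=206, fall_g5=202)

for name, gains in [("Scenario 1", scenario_1), ("Scenario 2", scenario_2)]:
    ordering = sorted(gains, key=gains.get)
    print(name, gains, "smallest to largest:", ordering)
```

With these numbers, SS credits the least growth and FF the most in Scenario 1, and the ordering reverses in Scenario 2, even though the SS gain itself is identical in both.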
Potential Role of SLL
Three conditions are necessary for SLL 6 to produce low correlations among SS-, FS-, and FF-based VAMs: (1) There must in fact be SLL (
Summer before current-year teacher
At first, it seems intuitive that SLL in the summer before being assigned to a given teacher (the presummer) will be unrelated to that teacher’s effectiveness—indeed, the teacher and student have not yet met. Based on this logic, we would hypothesize that SS-based VAMs that incorporate the prior summer would be less correlated with FS-based VAMs than would FF-based VAMs. However, if SLL in the presummer is correlated with factors associated with the sorting of students to teachers or schools, then this correlation may arise indirectly. For example, student race/ethnicity and socioeconomic status (SES) predict access to high-quality teachers and schools (Boyd et al., 2005; Goldhaber et al., 2019; Isenberg et al., 2016; Lankford et al., 2002). Do these factors also predict SLL? Below, we bring to bear SLL literature and empirical evidence on this question.
Summer after current-year teacher
Because students no longer interact with their teachers or schools in the postsummer, this SLL could be unrelated to teacher effectiveness. However, in contrast to presummer SLL—for which an as-yet-unmet teacher’s effectiveness can have no direct impact—a teacher’s influence could persist into the postsummer. If so, then
Teacher effectiveness sequencing
A final complication arises if a student’s teachers are sequenced across grades in a manner that induces a correlation between effectiveness and summer learning. For example, if school leaders recognize that some students were assigned to a weaker teacher in Grade 3 (so presumably they exhibit lower
SLL Literature
A comprehensive review of the SLL literature is beyond the scope of the current article (see Gershenson, 2013, for an overview). However, several studies are particularly relevant to the three necessary conditions outlined above. For example, Downey et al. (2004) document significant correlations between learning rates in the school year (in kindergarten or Grade 1) and the intervening summer. Assuming school year learning rates are in part produced by teachers, then
Existing Research on Pre/Post Timeframes for VAM
A handful of studies have reported on the correspondence among teachers’ FF-, FS-, and SS-based VAMs (see Table 1 for an overview). We are particularly motivated by the remarkable finding in Papay (2011) of virtually no correlation between teachers’ FF-based and SS-based scores:
Table 1. Correlations Between FF-, FS-, and SS-Based Value-Added Measures (VAMs) Across Known Studies
Note. Data years are reported based on the spring of the school year. All estimates reported from other studies are from the available model most similar to our preferred model, including vectors of student characteristics and, when possible, classroom characteristics and school effects.
ECLS-K:99 and ECLS-K:11 administered direct cognitive assessments designed to measure children’s knowledge and skills in reading and mathematics, partly modeled after more commonly known assessments, including the Peabody Picture Vocabulary Test (PPVT), the Woodcock Johnson, and/or the Test of Early Math Ability (TEMA).
Teacher-level VAMs are sometimes time invariant but sometimes are allowed to vary by year. ECLSK:99 data observe each teacher in only 1 year, so this 1-year teacher effect could be framed as a classroom effect (the language used by Gershenson & Hayes, 2018) or simply as a teacher effect (used by Palardy & Peng, 2015, when using the same data). In the current article, we estimate both teacher time-invariant effects across multiple years (akin to Papay, 2011) and teacher-by-year estimates (shown in parentheses below the teacher effect correlations). However, we do not use the term “classroom” for these latter effects because the two terms are not synonymous in higher grades.
Although our focus is on teacher VAMs, two recent studies examine this question for school VAMs. The issue of summer misattribution is substantively different for school impacts: Because teachers typically work with each student for only 1 year, it seems much clearer that the summer before they meet should not be considered part of the teacher’s effect. Schools, on the other hand, serve many of the same students year after year as they return after each summer to move across grades. A student’s intervening summers, then, are typically situated between two exposures to the same school (except in the case of cross-school transitions). Unlike teacher VAMs, school VAMs based on annual testing do not fully misattribute summers to a school that a student has not yet attended. We therefore would anticipate higher correlations among FF-, FS-, and SS-based school VAMs. 8 Indeed, McEachin and Atteberry (2017) use data from an anonymous southern state and find for school VAMs a
Shifting back to teacher 9 VAMs, three studies employ ECLS-K data to estimate
Although ECLS-K:99 cannot be used 10 to replicate the troubling
The correlations reported from studies using ECLS-K:99 or ECLS-K:11 are notably higher than those from Papay (2011) or, as we will see, the current study. The different findings across data sets could be due to any of several factors. First, the ECLS-K studies follow a single cohort of students, and thus each teacher is observed only once. Second, none of the three data sets contains the same vertically scaled assessments, all of which may differ from one another in terms of scaling, content covered, or intended usage (see von Hippel & Hamrock, 2019, for a particularly relevant discussion of how measurement artifacts can affect inferences about SLL). Third, the VAM scores estimated in the ECLS-K studies are based on an average of 3 students per teacher. In the current study, each teacher is linked to an average of 23 students per year in elementary schools (100 students per year in middle schools). Although smaller samples might be expected to attenuate correlations, one possibility is that empirical Bayes shrinkage on teacher VAM scores with low reliability could overshrink estimates toward a grand mean and make the correlations appear higher. Another speculation, for those ECLS-K analyses that do not shrink estimates, is that although the correlations across teachers’ VAMs are precisely estimated in ECLS-K (because there are around 750 classrooms), the correlation among the underlying VAM scores could be somewhat idiosyncratic to the approximately 3 students (on average) with both spring and fall scores in each classroom. 11
Finally, the biggest difference when using ECLS-K studies is that these correlations can be estimated only for kindergarten and Grade 1 teachers. In practice, districts use teacher VAMs almost exclusively in Grades 4 and above, and it is unclear whether results from the earliest grades also would apply to these later grades. The current data set thus provides a unique opportunity to replicate and extend Papay’s (2011) findings in policy-relevant grades.
Data and Analytic Sample
The current study uses 2010–2011 through 2014–2015 administrative student- and school-level data from one anonymous district located in the southeastern United States. The district covers one of the state’s largest cities and its surrounding area, and it consists of 57 schools serving about 50,000 students annually. It spans a large geographic area—about 250 square miles—which includes a central urban area surrounded by both suburban and rural communities.
The data set includes typical demographic data for its students, including race/ethnicity, sex, free/reduced-price lunch program (FRPL) eligibility, limited English proficiency (LEP) status, and special education (individualized education program [IEP]) status. Roster data link students to their classrooms, teachers, and schools in each grade and school year. Unfortunately, teacher covariates like years of experience are not available. Finally, the data set contains students’ state standardized test scores and—the key to the current study—fall and spring Measures of Academic Progress (MAP) assessment scores from the Northwest Evaluation Association (NWEA), which was administered to students in Grades 3 through 8 12 for up to 5 consecutive years.
About the MAP Assessment
Scaling, intended use, and content
The MAP assessment has different scaling properties and purposes than the state’s standardized achievement test, 13 which would typically be used to construct SS-based teacher VAM scores. Unlike the state standardized assessment, the MAP test is scored to follow a vertical and interval scale, to produce what NWEA calls RIT (Rasch Unit) scale score points. One could hypothesize that MAP’s vertical scaling explains the patterns we observe across FF-, FS-, and SS-based VAM scores. However, we present main analyses on both the original RIT scale scores and a standardized version and find nearly identical results.
The intended use of the statewide end-of-year exam differs from that of MAP, which is designed to be used as a supplementary tool to aid schools in improving their instruction, not as the high-stakes test of record. Moreover, the MAP test is given to students across the United States and in this sense is not curriculum based, although NWEA does attempt to tailor the assessment to a state’s content standards. NWEA has conducted a MAP linking study within the past 3 years with the current state’s assessment and found high proficiency classification consistency rates at or greater than 81% across subjects and grades. 14 Although scaling, intended use, and content may differ, students’ standardized state test scores and spring MAP scores in this district are strongly correlated with one another (.82 in ELA, .78 in math). Despite their differences, these two assessments appear to provide similar information about students’ math and ELA achievement.
Differential effort
One must acknowledge the possibility that MAP testing could be approached differently in the fall than in the spring in ways that could introduce some bias (e.g., certain teachers give it greater focus than others) or imprecision. For instance, MAP tests could be taken less seriously in the fall than in the spring. If, at the extreme, fall scores are essentially noise, then low correlations could be an artifact of that noise. 15 Although it is not possible with this data set, we explore this issue indirectly using a separate, national data set 16 of MAP scores that contains time spent on each test and the number of items attempted. There, we see that students do spend about 6 fewer seconds per item on fall tests relative to spring tests. This is consistent with a possible fall-spring effort differential; however, the magnitude of this difference suggests that fall tests are not generally disregarded. In addition, fall MAP scores in the current district are strongly correlated with both spring MAP scores (
Unknown test dates
Finally, MAP tests are not taken on the very first and last day of school. Therefore, some of the time between the prior spring test and current fall test—what we would label summer—in fact takes place during the end and start of the school year. We use publicly available district calendar data to identify likely MAP testing window dates and then project RIT scores to the first and last day of school. 17 Consistent with projected scores in the national MAP data set, projected and observed scores in this district are correlated at greater than .96. We present analyses using both the observed and projected RIT scores. The pattern of results remains quite consistent, especially among models producing empirical Bayes shrunken teacher estimates.
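To make the idea of projecting scores concrete, here is a minimal sketch. It assumes a simple linear extrapolation of each student's observed school-year growth rate to the first and last day of school; this is an illustrative assumption, and the procedure actually used (described in note 17) may differ. The function name, dates, and scores are all hypothetical.

```python
from datetime import date

def project_scores(fall_score, fall_date, spring_score, spring_date,
                   first_day, last_day):
    """Linearly extrapolate observed fall/spring scores to the school-year endpoints.

    Assumes a constant daily growth rate between the two observed test dates.
    """
    days_between = (spring_date - fall_date).days
    daily_rate = (spring_score - fall_score) / days_between
    projected_fall = fall_score - daily_rate * (fall_date - first_day).days
    projected_spring = spring_score + daily_rate * (last_day - spring_date).days
    return projected_fall, projected_spring

# Hypothetical dates and RIT scores, not taken from the district calendar.
fall, spring = project_scores(
    fall_score=200.0, fall_date=date(2011, 9, 15),
    spring_score=210.0, spring_date=date(2012, 4, 2),
    first_day=date(2011, 8, 26), last_day=date(2012, 6, 1),
)
print(round(fall, 2), round(spring, 2))
```

For a student who gains during the school year, the projected fall score falls below the observed one and the projected spring score rises above it, so any summer change computed from projected scores will look more negative than the change computed from observed scores.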
Analytic Sample
We limit the current analytic sample to students observed with MAP test scores in Grades 3 through 8 and their linked math and ELA teachers. To provide context for the kinds of districts to which our findings might generalize, Table 2 contains basic student- and school-level demographics for the analytic sample in one example school year (2011–2012).
Table 2. Sample Descriptives at the Student and School Level, Academic Year 2011–2012
Note. Descriptive statistics are for the 2011–2012 school year. LEP = limited English proficiency; FRPL = free/reduced-price lunch program. Dashes indicate the SD of a binary, student-level variable (not applicable).
The analytic sample includes about 22,000 students annually, of whom about half are identified as non-White and more than one third are eligible for the federal FRPL. Black students constitute the largest non-White racial/ethnic group (38%); another 8% are Asian and 7% are Hispanic. See Appendix Table A1 for annual analytic sample sizes for students by grade, teachers by subject, and schools. Overall, the 5-year analytic sample includes 44,793 unique students, 1,459 unique teachers, and 57 unique schools.
Analytic Approach
In Equation 1, we first specify a VA model as similar as possible to the one used by Papay (2011) to estimate time-invariant teacher VAM scores, separately for math and ELA:
Like Papay, we allow teacher effects,
We estimate the above model three times, each time changing only how we define t–1 = baseline and t = end-of-period (from SS, to FS, to FF). We generate for each teacher three VAM-based rankings:
We also examine whether the strength of these correlations differs across other specifications of the VA model (e.g., with school covariates in place of school fixed effects). We rerun the analysis using the RIT scores (not standardized) to see whether vertical scaling affects our findings. We also respecify the VA model to produce year-specific teacher VAM scores (as opposed to teacher time-invariant scores). This allows us to explore several sources of potential heterogeneity: We examine how results change when teachers’ scores are allowed to vary from one year to another. We also explore whether those correlations vary based on teachers’ current-year grade assignment or current-year school’s student composition.
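The comparison of rankings across timeframes rests on the Spearman rank correlation (the statistic named in the Table 3 title). As a minimal, dependency-free sketch, the code below computes it as the Pearson correlation of average ranks. The teacher VAM values are hypothetical and the helper names are our own, not part of any published replication code.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical VAM scores for five teachers under two timeframes.
vam_ss = [0.10, -0.05, 0.20, 0.00, -0.15]
vam_ff = [0.02, 0.08, -0.10, 0.15, -0.04]
rho_ss_ff = spearman(vam_ss, vam_ff)
print(round(rho_ss_ff, 3))
```

The same function would be applied to each pair of timeframes (FF with SS, FF with FS, and FS with SS) for each subject.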
Results
Research Question 1: Replication
Like Papay (2011), we find that switching to a different test administration timing dramatically alters the ranking of teachers based on the VA model in Equation 1. In Table 3, we find that teachers’ ELA VAM rankings that are produced utilizing SS versus FF test administration timings are only correlated at .17 (compare this to Papay’s estimate of -.10 for ELA). We extend the analysis into math and find a
Table 3. Replication of Papay (2011) Spearman Rank Correlations Across Value-Added Measures (VAMs) Using FF, FS, and SS Pre/Post Timeframes
Note. Deltas
We also consider the use of the more ideal FS-based timing for VAM scores, in which end-of-year spring test scores are modeled as a function of start-of-year fall test scores, thus isolating both the summer before and the summer after the school year from the teacher effects. We are interested in whether this ideal scenario (
In Table 3, we find the
Research Question 2: Updated Model Specifications
We also consider the possibility that the low correspondence of teacher VAM scores across test timings is an artifact of the specific VA model chosen by Papay (2011) (represented by our Equation 1). For instance, his model includes school fixed effects, and whereas there often are reasons to include these for research purposes, districts are unlikely to use them because they undermine their ability to compare teachers across schools. In addition, the literature on VA modeling specifications has progressed since 2011. We adopt a preferred model based on both these practical considerations and the literature (Chetty et al., 2014a; Guarino et al., 2015; Kane et al., 2013; Koedel et al., 2015; Sass et al., 2012). We therefore make the following changes to Equation 1 for our preferred model: We replace school fixed effects with a vector of time-varying and time-invariant school covariates, 21 use a third-order polynomial in place of the fifth-order polynomial function, 22 include prior scores in both subjects rather than only the same subject, and analyze correlations among the estimates (rather than rankings).
Results from this preferred model are presented in Table 4 (row 1 [preferred]). The general pattern of correlations holds, and if anything, correlations from the preferred model are slightly lower than those from the replication model. The
Table 4. Teacher Value-Added Measure (VAM) Correlations for FF, FS, and SS Pre/Post Timeframes, Across 12 Specifications of the Value-Added Model
Note. Correlations across VAM estimates generated from the preferred model are shown in the first row (1 [preferred] row). All models in this table include the vector of time-varying and time-invariant student demographics and the vector of time-varying classroom controls. All models also include an up to third-order polynomial function of prior test scores in both math and English language arts, and these functions are allowed to vary by grade. Models differ from one another on three dimensions: test scaling, whether teacher effect estimates have been shrunk, and school controls. FF = current fall → next fall; SS = prior spring → current spring; FS = current fall → current spring; S & SY Covs = the model includes the vector of both time-invariant and time-varying school covariates; FEs & SY Covs = the model substitutes school fixed effects in place of the time-invariant school covariates; RIT = Rasch Unit (NWEA scale units).
We again draw the reader’s attention to the lowest correlations in the first row of Table 4 between FF- and SS-based VAMs. To illustrate the troubling nature of these results, we represent this
Table 5. Quartile-Quartile Teacher Transition Matrix for SS Versus FF Test Timings, by Subject
Note. Quartiles are created for value-added measure scores from the preferred specification of the value-added model shown in the first row of Table 4 (1 [preferred] row). Rows show quartiles based on SS timing, and columns show quartiles based on FF timing. Cells contain counts of teachers in a given quartile transition (e.g., for English language arts, 22 teachers were in Q1 according to SS timing but in Q4 according to FF timing). Row percentages are included in parentheses in cells. We also present row and column totals at the margins. SS = prior spring → current spring; FF = current fall → next fall; Q = quartile.
Only about 33% of all ELA teachers appear along the table diagonal—that is, in the same quartile using their SS-based and FF-based VAM score rankings. In addition, about 30% of teachers appear in an FF-based category two or more quartiles away from their SS-based category. In fact, 15% of top-quartile ELA teachers according to SS timing would be considered bottom-quartile teachers according to the FF timing (25% for math). In essence, a nontrivial portion of the seemingly strongest teachers according to the traditional SS timing would be categorized among the weakest teachers using the FF timing. Results are similar for math. This underscores that FF-based teacher VAM score rankings bear little resemblance to those based on the traditional SS-based timing. There is no principled reason to use SS-based VAMs over FF-based VAMs; however, this choice would lead to very different conclusions about teachers’ effectiveness.
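For readers who want to build a comparable transition matrix from their own estimates, the following sketch shows one way to do it. The eight teacher scores are hypothetical, ties are broken by input order (an arbitrary choice made here for simplicity), and the function names are illustrative.

```python
def quartile(values):
    """Assign each value to a quartile (1 = lowest) based on its rank position.

    Ties are broken by original order because Python's sort is stable.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    q = [0] * n
    for position, idx in enumerate(order):
        q[idx] = position * 4 // n + 1
    return q

def transition_matrix(vam_ss, vam_ff):
    """4x4 counts: rows are SS-based quartiles, columns are FF-based quartiles."""
    counts = [[0] * 4 for _ in range(4)]
    for q_ss, q_ff in zip(quartile(vam_ss), quartile(vam_ff)):
        counts[q_ss - 1][q_ff - 1] += 1
    return counts

def share_on_diagonal(counts):
    """Share of teachers placed in the same quartile under both timings."""
    total = sum(sum(row) for row in counts)
    return sum(counts[i][i] for i in range(4)) / total

def share_two_or_more_apart(counts):
    """Share of teachers whose two quartiles differ by at least two."""
    total = sum(sum(row) for row in counts)
    far = sum(counts[i][j] for i in range(4) for j in range(4) if abs(i - j) >= 2)
    return far / total

# Hypothetical VAM scores for eight teachers.
vam_ss = [0.30, 0.25, 0.10, 0.05, -0.05, -0.10, -0.20, -0.30]
vam_ff = [-0.25, 0.20, 0.15, -0.15, 0.00, 0.30, -0.05, -0.30]
matrix = transition_matrix(vam_ss, vam_ff)
print(matrix, share_on_diagonal(matrix), share_two_or_more_apart(matrix))
```

If the two sets of rankings were unrelated, about 25% of teachers would be expected on the diagonal by chance, which gives a useful benchmark for the shares reported above.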
Robustness to other model specifications
In Table 4, we reproduce these three correlations—
Overall, we find that the general takeaway holds across model specifications. In every model, the
Although the primary takeaway from Table 4 is one of consistency, we note some patterns in the variability across specifications. Looking across all 12 models, results from our preferred model (Model 1) are among the least problematic (i.e., somewhat higher correspondence across FF, FS, and SS timings). We also see that estimates produced without shrinkage tend to exhibit lower correlations. This makes sense if the shrunk estimates have eliminated some noise that would otherwise attenuate correlations. For instance, whereas Models 3, 7, and 11 all produce negative ELA correlations between FF- and SS-based scores—similar to Papay’s (2011) -.10 estimate—our correlations are never negative in models with shrinkage.
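To illustrate what shrinkage does to a noisy estimate, the sketch below applies a common textbook form of empirical Bayes shrinkage, in which a teacher's raw estimate is pulled toward the grand mean in proportion to its unreliability. This form is an assumption made for illustration; the exact estimator behind the shrunken estimates in Table 4 is not specified in this section, and all input values are hypothetical.

```python
def shrink(raw_estimate, standard_error, signal_variance, grand_mean=0.0):
    """Empirical Bayes shrinkage of one teacher estimate toward the grand mean.

    reliability = signal variance / (signal variance + sampling variance)
    """
    reliability = signal_variance / (signal_variance + standard_error ** 2)
    return grand_mean + reliability * (raw_estimate - grand_mean)

# Hypothetical inputs: the same raw estimate measured with different precision.
precise = shrink(raw_estimate=0.30, standard_error=0.05, signal_variance=0.04)
noisy = shrink(raw_estimate=0.30, standard_error=0.20, signal_variance=0.04)
print(round(precise, 3), round(noisy, 3))
```

The imprecisely estimated score is pulled much closer to the mean, which is how shrinkage can remove some of the noise that would otherwise attenuate correlations across timeframes.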
Between two otherwise similar models, the one that includes school fixed effects in lieu of time-invariant school covariates often leads to slightly lower correlations. Although this is not always the case, it is always true among models with shrinkage. Finally, correlations are often lower in models using the projected RIT scores, which attempt to separate the time between the prior spring test and current fall test into the actual summer and school year periods. This projection approach, however, does not overturn the basic findings.
Research Question 3: SLL Patterns in District
To understand why FF-, FS-, and SS-based timings produce such different teacher VA scores, we provide a brief analysis of SLL in the current district, and we find results that are consistent with the theoretical discussion earlier in the narrative. Because MAP is vertically scaled and growth norms differ across grades, we conduct this SLL analysis separately by grade. For each student, we estimate a summer learning gain/loss by comparing their fall score in a given grade to their previous spring score. For each grade, we use an unconditional, three-level random effects model to partition the variation in SLL across students, teachers, and schools. This model allows us to estimate mean SLL and student-level variation in SLL. It also allows us to explore the extent to which SLL is correlated with student demographics, sorted across teachers in a given school, or clustered across schools. The greater the (nonrandom) variation in SLL, the lower we think the correlations across test administration timings should be.
The SLL findings are summarized in Table 6, using both observed RIT scores (upper panel) and RIT scores we projected to the first and last day of school (lower panel). Not surprisingly, mean SLL looks more negative when using projected scores because the projection attends to the fact that students may continue to learn after the spring test in April and before the fall test in September. Although estimates of mean SLL are somewhat different based on this choice, the subsequent examination of SLL variation is quite consistent across the two. We therefore focus on results using the original RIT scale. As a point of reference for what follows, NWEA reports that the standard deviation of math RIT scores in Grade 5 is about 16 points, and the Grade 5 NWEA growth norm is 9.9 RIT points (Thum & Hauser, 2015).
Table 6. Statistical Exploration of Variation in Summer Learning Loss (SLL) Using a Three-Level Hierarchical Model, by Score Scaling, Subject, and Grade
Note. Results are from an unconditional, three-level random effects model used to partition the variation in SLL in a given subject/grade across students, teachers, and schools—see Columns 6, 7, and 8. The standard deviation of RIT scale score points ranges between 13 and 17, depending on the subject and grade. This model allows us to estimate mean SLL (Column 4) and student-level variation in SLL (Column 5). We translate student-level variation into a 95% plausible value range for expected SLL across students in a given school (Column 9). In Column 10, we add to the model the vector of student-level demographics (identical to the vector in the value-added models) and report the reduction in Level 1 residual variance, relative to the unconditional model. Analyses are conducted for SLL estimates based on observed RIT scores from the data set, as well as RIT scores we projected to the first and last day of school. RIT = Rasch Unit (NWEA scale units); ELA = English language arts.
Summer learning patterns in the current district mirror findings from prior SLL research and meet the necessary conditions for test administration timing to affect teacher VA estimates. For instance, although students exhibit, on average, very little change in math test scores in the summer after fifth grade (0.10 RIT points)—consistent with prior work—this average masks incredible variability across students in SLL (Atteberry & McEachin, 2020). The 95% plausible value range for math SLL after Grade 5 spans from -13.6 to +13.8 RIT score points. This means that some students return to school in the fall of sixth grade with test scores that are much higher than their observed scores from the prior spring, whereas others return having lost much of what they gained over the course of the previous grade. There is a potential, then, for a teacher to encounter a substantial range in SLL among his or her students at the start of each school year.
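The reported range can be related back to a student-level standard deviation. The sketch below assumes the 95% plausible value range is the usual mean ± 1.96 SD interval under normality (the formula is not stated in this section, so this is an assumption) and backs out the SD implied by the Grade 5 math figures quoted above.

```python
def plausible_value_range(mean, sd, z=1.96):
    """95% plausible value range under an assumed normal distribution."""
    return mean - z * sd, mean + z * sd

def implied_sd(lower, upper, z=1.96):
    """Back out the SD from a symmetric plausible value range."""
    return (upper - lower) / (2 * z)

# Values reported in the text for math SLL after Grade 5.
sd = implied_sd(lower=-13.6, upper=13.8)
low, high = plausible_value_range(mean=0.10, sd=sd)
print(round(sd, 2), round(low, 1), round(high, 1))
```

Under that assumption the implied SD is roughly 7 RIT points, or about 0.7 of the Grade 5 growth norm of 9.9 RIT points cited earlier.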
We also find that the large majority of variation in SLL occurs at the student level (about 90% to 96%). In some ways, it makes sense that out-of-school, summer learning is less clustered by school than learning that occurs during the school year. There is some sorting of SLL across teachers in the same schools and across schools, but the main takeaway here is the incredible variation among students.
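The percentages above are shares of total variance in an unconditional three-level model. As a minimal sketch of that arithmetic, the snippet below divides each level's variance component by their sum. The component values are hypothetical placeholders chosen only for illustration; they are not estimates from Table 6.

```python
def variance_shares(student_var, teacher_var, school_var):
    """Share of total variance attributable to each level of the model."""
    components = {
        "student": student_var,
        "teacher": teacher_var,
        "school": school_var,
    }
    total = sum(components.values())
    return {level: value / total for level, value in components.items()}


# Hypothetical variance components (in squared RIT points), NOT from the paper.
shares = variance_shares(student_var=47.0, teacher_var=2.0, school_var=1.0)
for level, share in shares.items():
    print(f"{level:>7}: {share:.0%}")
```

With these placeholder inputs the student-level share falls inside the 90% to 96% band described in the text, which leaves only a few percentage points for sorting across teachers and schools.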
In an effort to explore the systematic sorting of SLL, we add a full vector of student-level demographics to the student level of the model. Again, results mirror findings documented in prior work: Although we observe a remarkable spread of summer gains/losses across students, the sum of student demographics does little to account for this variation. The set of race/ethnicity, sex, IEP, and LEP status indicators—although sometimes correlated with SLL—together account for less than 3% of the student-level variance in SLL (final column of Table 6).
Taken together, this brief SLL analysis suggests that there is real variation in SLL that could cause FF-, FS-, and SS-based VAM scores to diverge. Given the low correspondence among VAMs using those three timings, this SLL variation must be systematic in some way. However, the sorting among teachers in the same school and across schools appears limited. This may explain why VA models with school fixed effects yield only slightly different correlations. Moreover, although demographics play a small part in accounting for this notable variation across students in SLL, the research community has yet to identify the features of students, families, or their summer experiences that tell us why so much variation exists.
Research Question 4: Heterogeneity in Correlations Across Years, Grades, and Schools
Finally, we explore the potential for heterogeneity in the FF-, FS-, and SS-based VAM correlations by year, by grade, and by school setting. For this research question, we modify the VA model to estimate teacher-by-year VAM scores to more directly consider how absorbing the preceding summer instead of the following summer affects how a teacher is evaluated in a given year. Consistent with the theoretical framework, a teacher’s year-specific VAM scores may be more affected by pre/post timeframes when there is some effectiveness-based sequencing across students’ teachers from one grade to the next. The greater the differential in latent effectiveness between a given teacher and the teacher(s) who taught their students in the preceding year, the more one might expect FF-based, SS-based, and FS-based VAM scores to diverge.
We therefore rerun the preferred specification of the VA model but now substitute teacher-by-year effects (
Teacher-by-Year Value-Added Measure (VAM) Score Correlations: Overall and Separately by the Teacher’s Grade Assignment and the Socioeconomic Status of the Teacher’s School
Note. Deltas
Turning to the grade-specific results in Table 7, there do appear to be substantive differences across grades: In ELA, whereas
In Table 7 we also explore whether the low correspondence among the FF-, FS-, and SS-based VAM scores differs across teachers in schools serving high- and low-SES students. We see some evidence that correlations are slightly less problematic (i.e., higher) for teachers assigned to high-SES schools (less than 25% of students are FRPL eligible) than for teachers in low-SES schools (more than 50% of students are FRPL eligible). We see this pattern for five of the six correlations reported, with an exception in the math
Discussion
We find a very weak relationship (.13 ELA, .17 math) between teachers’ VAM scores produced by an otherwise identical VA model that estimates student growth either from prior spring to current spring (SS—the pragmatically feasible option) or current fall to next fall (FF—a theoretically better alternative to SS). The majority of teachers—about 70%—would be categorized into a different quartile of performance by FF- and SS-based VAMs. Moreover, neither of these options exhibits a strong correlation with VAM score rankings based on a current-fall to current-spring (FS—ideal but impractical) test timing.
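To see how a correlation this low translates into quartile disagreement, the simulation below draws two sets of scores with a chosen correlation and counts how often a simulated teacher lands in different quartiles. It is a sketch under stated assumptions, not the paper's procedure: scores are taken to be bivariate normal, the correlation of 0.15 is an illustrative value between the two reported figures, and the function name and sample size are invented for the example.

```python
import random


def share_in_different_quartile(rho, n_teachers=100_000, seed=0):
    """Simulate two score sets correlated at rho; return the share of
    simulated teachers whose quartile differs between the two sets."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # y = rho * x + sqrt(1 - rho^2) * noise has correlation rho with x.
    noise_sd = (1 - rho ** 2) ** 0.5
    y = [rho * xi + noise_sd * rng.gauss(0, 1) for xi in x]

    def quartiles(values):
        # Rank the values, then map ranks to quartiles 0 (bottom) to 3 (top).
        order = sorted(range(len(values)), key=values.__getitem__)
        labels = [0] * len(values)
        for rank, idx in enumerate(order):
            labels[idx] = 4 * rank // len(values)
        return labels

    qx, qy = quartiles(x), quartiles(y)
    differ = sum(a != b for a, b in zip(qx, qy))
    return differ / n_teachers


# Illustrative correlation, assumed for this example only.
share = share_in_different_quartile(rho=0.15)
print(f"share in a different quartile: {share:.1%}")
```

For reference, two entirely unrelated rankings would place 75% of teachers in different quartiles, so a simulated share close to the roughly 70% reported here indicates that the two VAMs agree only slightly more often than chance.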
Results from the current study generally align with findings from a similar analysis embedded in Papay (2011) and hold up against the additional robustness checks we are able to bring to bear. The results are less well aligned with studies that have used ECLS-K data that find much stronger teacher-level correlations (Gershenson & Hayes, 2018; Hayes & Gershenson, 2018; Palardy & Peng, 2015). As discussed in the literature review, however, there are a number of possible explanations for the discrepancy, most notably the difference in grade level.
Findings have implications for how VAM scores are used to characterize teacher performance in both policy and research contexts. These measures of teacher effectiveness often play a role in high-stakes decisions about teacher retention, placement, and compensation. Low correlations between models that—at least intuitively—should produce somewhat similar rankings call into question the fairness of using SS-based VAM scores in ways that can dramatically affect the lives of teachers. For example, districts that use VAMs as a factor in teacher compensation may not be correctly identifying the teachers who have the strongest positive impact on student achievement outcomes.
A clear candidate for explaining the discrepancies in teacher VAM scores based on FF- versus SS-based test timings lies in the potential misattribution of summer periods. We document incredible variability across students in summer learning gains/losses, which we think is a key to understanding why these different VAM scores diverge. Several studies show that SLL appears to disproportionately occur for low-income and historically marginalized students (Allington et al., 2010; Burkam et al., 2004; Cooper et al., 1996; Downey et al., 2004; Entwisle & Alexander, 1992; Quinn & Le, 2018; Quinn et al., 2016; von Hippel et al., 2018). In our heterogeneity analyses based on school SES, we find some evidence that the VAM scores for teachers working in low-SES schools may be slightly more susceptible to these timing-based VAM discrepancies.
Given that a student’s learning in the summer before a given school year cannot be affected by the current-year teacher’s true effectiveness, one might be somewhat surprised to find that the typically available SS-based rankings exhibit a stronger correlation of .61 with the more “ideal” FS-based rankings than does the theoretically preferred FF-based option (correlation of .32). Such a pattern could arise if SLL effects carry forward into the subsequent school year in ways that influence a teacher’s effectiveness or if teacher assignment patterns are correlated with prior SLL. In practice, the fact that SS-based teacher VAMs are more correlated with FS-based VAMs can be of some comfort because SS-based VAMs are typically the only available option to districts.
Our work also highlights how much we do not yet understand about SLL. Although we clearly can see that students’ summers are very different from one another, some of our most powerful predictors of school year learning inequalities (e.g., student demographics) are only a small part of the story for summer learning rates. As one might expect, SLL is also much less clustered by teachers and schools than is school year learning because students are not in school when these summer gains/losses take place. Nonrandom variation in SLL must exist if SLL is the cause of the discrepancies among FF-, FS-, and SS-based teacher VAM scores. However, the field has not yet been able to explain why SLL varies. Researchers could explore this issue using more detailed data on teacher and school contexts as well as students’ summer experiences.
The current findings contribute to the existing literature base that considers the many factors that drive the same teacher to have somewhat different VA scores (e.g., year-to-year instability, in math versus ELA, different tests, controlling for various sets of observed covariates, including student or school fixed effects, using a random effects framework, a student growth percentile approach). In broad terms, the literature documents correlations across these choices in the general range of .30 to .70 (see Koedel et al., 2015, for a recent, excellent synthesis of points of consensus and contention in the VA literature). In contrast, many of the correlations documented in the current study are in the .10 to .20 range. If these findings truly reflect an unprincipled artifact of test administration timing, then it would suggest that inferences about teacher effectiveness made using SS-based VAMs should be questioned.
Future research could push this topic forward by addressing several important limitations of the current study. A key concern is differential effort on fall MAP administrations, which could lead to greater imprecision in fall scores. If fall scores are simply unreliable or inaccurate measures of students’ math or ELA skills at that point in time, then incorporating them into a VA model would not be advisable. We are unable to fully address this limitation in the current context, although we do observe that fall MAP scores are strongly correlated with students’ high-stakes test scores on the statewide assessment, which occurs just a few months earlier in the preceding spring (e.g., .79 for math). It would be useful to conduct these analyses in settings either where there are compelling reasons to believe the fall and spring test administrations are of equal importance to teachers or where information about the reliability of individual test scores—for example, test score standard errors, seconds spent per item, total minutes spent by each student—is available to the researcher.
The finding of a very low correlation between FF- and SS-based teacher VAMs has now been documented in both a large, urban school district in the northeast (Papay, 2011) and a southeastern school district of similar size that includes urban, suburban, and rural areas. How do these results fit with other research on FF, FS, and SS pre/post timeframes? We can see in Table 1 that much higher correlations have been found when using ECLS-K data or for school-level VAMs. Though we posit some reasonable explanations for these varied results across studies, more work is needed to truly understand the source of these discrepancies. If future studies corroborate the low correlations documented herein, then researchers and policymakers should carefully consider the implications of relying on SS-based VAM scores to characterize teachers’ relative effectiveness. This is particularly troubling given that the theoretically more appropriate approach of estimating teacher VAM scores based on FS or FF timings is simply not an option for most districts.
Footnotes
Appendix Table A1
Student, Teacher, and School Sample Sizes by Year, Grade, and Subject
| | 2010–2011 | 2011–2012 | 2012–2013 | 2013–2014 | 2014–2015 | Total Unique |
|---|---|---|---|---|---|---|
| Students | ||||||
| Grade 3 | 3,861 | 3,850 | 3,929 | 3,901 | 4,154 | 19,649 |
| Grade 4 | 3,811 | 3,880 | 3,842 | 3,948 | 3,928 | 19,382 |
| Grade 5 | 3,921 | 3,844 | 3,846 | 3,822 | 3,981 | 19,405 |
| Grade 6 | 3,826 | 3,969 | 3,929 | 3,955 | 3,959 | 19,511 |
| Grade 7 | 3,901 | 3,870 | 3,964 | 3,963 | 4,041 | 19,561 |
| Grade 8 | 3,713 | 3,922 | 3,888 | 4,055 | 3,991 | 19,347 |
| All grades | 23,033 | 23,335 | 23,398 | 23,644 | 24,054 | 44,793 |
| Teachers | ||||||
| English language arts only | 141 | 134 | 135 | 135 | 138 | 268 |
| Math only | 133 | 139 | 138 | 138 | 131 | 266 |
| Both subjects | 559 | 557 | 556 | 537 | 531 | 967 |
| All teachers | 833 | 830 | 829 | 810 | 800 | 1,459 |
| Schools | ||||||
| Total unique | 56 | 57 | 56 | 56 | 56 | 57 |
