Abstract
Since the early 2000s, educational evaluation research has primarily centered on teachers’, rather than schools’, contributions to students’ academic outcomes due to concerns that estimates of the latter were smaller, less stable, and more prone to measurement error. We argue that this disparity should be reduced. Using administrative data from three cohorts of Massachusetts public school students (N = 123,261) and two-level models, we estimate middle schools’ value-added effects on eighth-grade and 10th-grade math scores and, importantly, a non–test score outcome: 4-year college enrollment. Comparing our results to teacher-centered studies, we find that school effects (encompassing both teaching- and nonteaching-related factors) are initially smaller but nearly as stable and perhaps more persistent than are individual teacher effects. Our study motivates future research estimating the long-term effects of both teachers and schools on a wide range of outcomes.
Education researchers and policymakers have long debated how to measure school and teacher effects on students’ academic outcomes (Everson, 2017). With the 2002 passage of No Child Left Behind (NCLB), standardized test scores became increasingly available and central to such evaluations. The landmark legislation rendered score-based school and teacher assessments more than an academic exercise: Educators’ compensation and job security, as well as school closure decisions and state takeovers of schools, now hinge on the results (Baker et al., 2010; Everson, 2017; Goldhaber & Hansen, 2008; Hanushek & Rivkin, 2012; Isenberg & Hock, 2012). As score-based evaluations have become more consequential, the methods used to translate student scores into teacher and school effectiveness measures have become increasingly sophisticated and scrutinized. Student growth and value-added models enable researchers to evaluate not only student performance levels but also the ability of schools and teachers to improve students’ performance in a given school year. A spate of recent studies employing experimental and nonexperimental data confirmed that value-added models revealed largely unbiased and sizable effects of teachers (Chetty et al., 2014a) and schools (Angrist et al., 2017; Deming, 2014) on students’ end-of-year test scores.
However, educational evaluation research and accountability frameworks have primarily centered on teachers’, rather than schools’, value-added effects on academic outcomes. Ever since the Coleman report, education scholarship has typically characterized school-level differences in academic performance as primarily reflective of student sorting rather than school-induced effects. Within-school differences (e.g., in teacher quality) appeared to play a larger role in stratifying students’ academic outcomes. Subsequent empirical studies, leveraging increasingly sophisticated quantitative models, seemed to bear this out. A considerably greater proportion of the variance in end-of-year test scores resided within, rather than between, schools, even after adjusting for students’ sociodemographic characteristics (Konstantopoulos, 2006, 2007a). Concerns about the trivial size and questionable year-to-year stability of schools’ value-added effects (Kane & Staiger, 2002; Staiger et al., 2002) fueled the perception that school effects were too small and too imprecisely estimated to be of practical value.
We argue that once value-added effects are reframed from short-term existence to long-term persistence and longitudinal student-level data are deployed, this perception may change. Schools offer a treatment different in length and in kind than do individual teachers. Although teachers may have more leverage than schools in shaping children’s cognitive development over the course of a single year, schools provide an opportunity to intervene in a child’s cognitive and socioemotional development in a more sustained way, across multiple years.
Of course, schools may exert effects on students’ outcomes through their differences in teaching-related factors, such as (1) the quality of individual teachers and (2) the quality of the combined teaching workforce. The latter could enhance (or depress) individual teachers’ effectiveness through collective efforts like curriculum alignment and professional development. However, other school-level mechanisms beyond pedagogy are plausibly implicated, as well. Schools encompass the combined effects of: administrators and other school-wide personnel (e.g., principals, counselors, psychologists, and social workers); institutional resources (e.g., computers, libraries, physical infrastructure); neighborhood context (e.g., crime levels, environmental conditions, and toxicity); peer factors (e.g., sociodemographics, grade span, student culture and behavior); policies and programs regarding attendance, student behavior, and discipline; academic resources (e.g., curricula, books, and materials); extracurricular offerings; teacher and classroom assignment protocols; parent engagement strategies; and school community history and reputation.
In this study, we conceive of school effects broadly—as encompassing the aforementioned teaching and nonteaching-related factors—and use two-level hierarchical linear models (HLMs) to estimate their persistence on students’ test score and non–test score outcomes over a 4-year period. We focus specifically on middle school effects for theoretical and methodological reasons. Theoretically, middle schools enroll students at a key transition period in their K–12 schooling careers, when a host of factors with long-term implications, ranging from curricular choices to self-esteem to intrinsic motivation, are revisited and reformed (Anderman & Maehr, 1994; Rathunde & Csikszentmihalyi, 2005; Wigfield et al., 1991). Methodologically, focusing on middle school allows us to leverage both pretreatment test scores and posttreatment outcomes—the former to control for initial student achievement, which prior studies have confirmed is required to reduce bias, and the latter to track student outcomes into high school and beyond.
We use a rich longitudinal data set that follows three cohorts of Massachusetts public school students—123,261 in total—who entered seventh grade between 2004 and 2006 as they move from elementary school through secondary school and into college. In our main analyses, we estimate middle schools’ value-added effects on math scores at the end of the “treatment” (i.e., eighth grade) and 2 years posttreatment (i.e., 10th grade) and estimate the stability of these value-added effects across three cohorts. For one cohort of students, we also examine middle schools’ effects on the likelihood of 4-year college enrollment.
We find that middle schools’ effects on student achievement endure beyond the initial treatment years, lasting more than 4 years after students leave middle school. Compared with existing estimates of individual teacher effects, our school effect estimates appear initially smaller but nearly as stable across years and potentially more persistent when measured over similar timeframes. Moreover, our estimates suggest middle schools may have a stronger influence on desirable non–test score outcomes, such as 4-year college enrollment, than do individual teachers. Importantly, sizable middle school effects on 10th-grade math scores and college enrollment remain, even after accounting for high school and district sorting patterns.
Our findings, based on two-level models, motivate future research that employs longitudinal data sets and three-level models to disentangle what portion of schools’ persistent effects are driven by teaching and nonteaching-related factors. If such analyses confirm sizable, stable, and durable school effects—even when accounting for school-level differences in individual teacher quality—then educational evaluation systems should reduce the disparity in emphasis on teacher versus school effects, examining the size and precise sources of both teachers’ and schools’ long-term impacts on test score and non–test score outcomes.
The Existence and Persistence of Middle School Effects
Evaluating schools’ and teachers’ effects on students’ standardized test scores has been a central occupation of policymakers and education researchers since the late 1990s. In 2002, NCLB codified this orientation into federal law. The far-reaching legislation focused exclusively on schools’ levels of achievement, requiring every state to assess its schools by setting proficiency standards in reading and math. Critics noted that requiring all students in a given state to reach the same absolute level of proficiency imposed a heavier burden on schools serving predominantly low-income, minority students than on schools with more advantaged pupils (Ballou et al., 2004).
Value-added models address this concern by comparing the gains of students in different schools and classrooms who not only have the same initial test scores but also come from families with similar observable sociodemographic characteristics. Absent random assignment, such models cannot control for all the compositional differences across schools and classrooms. Many scholars have mused that differential sorting of students across schools and classrooms on the basis of unobserved factors correlated with future achievement may bias value-added estimates of school and teacher effects (Hanushek & Rivkin, 2012; Reardon & Raudenbush, 2009; Rothstein, 2010).
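The comparison logic described above can be illustrated with a minimal sketch. It is not the estimation procedure used in this study: it assumes a single prior test score as the only control, fits one pooled least-squares line, and averages the residuals within each school. The function name `school_value_added` and the toy records are hypothetical, and the sketch omits sociodemographic controls and any adjustment for estimation error.

```python
from collections import defaultdict
from statistics import mean


def school_value_added(records):
    """Mean residual per school after adjusting for prior scores.

    records: list of (school_id, prior_score, current_score) tuples.
    Simplifying assumption: one pooled linear adjustment, no other controls.
    """
    priors = [p for _, p, _ in records]
    scores = [s for _, _, s in records]
    p_bar, s_bar = mean(priors), mean(scores)

    # Pooled OLS slope and intercept for current score on prior score.
    sxx = sum((p - p_bar) ** 2 for p in priors)
    sxy = sum((p - p_bar) * (s - s_bar) for p, s in zip(priors, scores))
    slope = sxy / sxx
    intercept = s_bar - slope * p_bar

    # Residual = actual score minus the score predicted from the prior score.
    residuals = defaultdict(list)
    for school, prior, score in records:
        residuals[school].append(score - (intercept + slope * prior))
    return {school: mean(r) for school, r in residuals.items()}


# Hypothetical toy data: two schools whose students have identical prior scores.
records = [
    ("A", -1.0, -0.6), ("A", 0.0, 0.3), ("A", 1.0, 1.2),
    ("B", -1.0, -1.0), ("B", 0.0, -0.1), ("B", 1.0, 0.8),
]
va = school_value_added(records)
print(va)
```

In this toy case the two schools enroll students with the same prior scores, so the difference in mean residuals reflects only the difference in subsequent scores. With real data, any unobserved sorting of students across schools would remain in these residuals, which is the bias concern raised above.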
Concerns of sorting-driven bias have likely contributed to education researchers’ disproportionate focus on teacher, rather than school, value-added effects. Some scholars perceive sorting across neighborhoods and schools to be a more strongly selected process, or at least a less statistically tractable one, than is sorting between teachers among students within the same school. Even if between-teacher selection (e.g., via tracking) threatens the validity of teacher effectiveness measures, statistical controls, such as student baseline test scores and student fixed effects, may more easily adjust for it than for between-school or between-neighborhood sorting (see Hanushek & Rivkin, 2012; Jencks & Mayer, 1990). Thus, value-added school effects may be more prone to uncorrected bias than are value-added teacher effects.
Another key factor driving the disparity in research on teacher and school effects is the empirical finding that twice the proportion of the variance in students’ end-of-year value-added gains is attributable to the former versus the latter (Kane & Staiger, 2008; Konstantopoulos, 2006; Nye et al., 2004). The large body of work on teachers’ short-term effects, which spans a wide range of grade levels and geographies, produced fairly consistent estimates of the variance in teachers’ contributions to end-of-year test scores, with an average SD of 0.13 for reading and 0.17 for math when comparing teachers within the same schools (Hanushek & Rivkin, 2012). For reference, this contribution translates to an increase of approximately 6 percentiles in the test score distribution (Raudenbush, 2014), and the Black-White achievement gap is estimated to be about 0.7 to 1 SD (Hanushek & Rivkin, 2012). However, other scholars have estimated smaller teacher value-added effects. For example, Chetty, Friedman, and Rockoff (2016) exploited randomization to estimate that a 1 SD increase in teacher value-added predicts a 0.10 SD increase in student test scores.
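As a rough check on how effect sizes in SD units map onto percentile gains, the sketch below converts an effect into the percentile shift for a student who starts at the median. It rests on two assumptions that are not stated in the sources cited above: test scores are normally distributed, and the student begins at the 50th percentile. The function name `percentile_gain` is hypothetical.

```python
from statistics import NormalDist


def percentile_gain(effect_sd, start_percentile=50.0):
    """Percentile points gained by a student shifted upward by effect_sd.

    Assumes normally distributed scores (an illustrative assumption).
    """
    std_normal = NormalDist()
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_sd) - start_percentile


gain_reading = percentile_gain(0.13)   # average teacher effect SD, reading
gain_math = percentile_gain(0.17)      # average teacher effect SD, math
gain_midpoint = percentile_gain(0.15)  # midpoint of the two

for label, gain in [("reading", gain_reading), ("math", gain_math),
                    ("midpoint", gain_midpoint)]:
    print(f"{label}: {gain:.1f} percentile points")
```

Under these assumptions, effects of 0.13 to 0.17 SD correspond to gains of roughly 5 to 7 percentile points for a median student, in line with the figure of approximately 6 percentiles. The gain would be smaller for students starting far from the median, where the normal density is thinner.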
School-oriented studies, which typically conceive of school effects as encompassing both teaching- and nonteaching-related factors, tend to find short-term school effects that are equal to, or slightly smaller than, individual teacher effect estimates. For example, Jennings et al.’s (2015) analysis of high school effects suggested that a 1 SD increase in school quality generates a 3.5 to 3.7 percentile (~0.14 SDs) value-added boost to end-of-year 10th-grade test scores, averaged across math and reading. Deutsch (2013) also estimated a 0.14 SD difference in fourth to eighth graders’ value-added math and reading test scores between schools, while Deming (2014) generated a smaller school value-added effect, of under 0.10 SD on third to eighth graders’ scores, averaged across math and reading.
But it is not only the disparity in the size of teacher and school effects that matters. Concerns about the stability of school effect estimates may play a role, as well. In the early 2000s, Kane and Staiger (2001, 2002) and Staiger et al. (2002) argued that schools’ effects on end-of-year test scores may vary widely from year to year and therefore may not send a clear signal about school quality or provide useful information for rewarding or penalizing schools based on only a single year of data. Although they did not directly compare the stability of school and teacher effects, their findings may have sown seeds of doubt regarding the former’s utility, contributing to the literature’s path curving away from school effects.
Since then, as test scores have become the taken-for-granted metric of student achievement, and as value-added models have become more widespread and sophisticated, few researchers have revisited the question of whether school effects’ size and stability are sufficiently strong to justify their inclusion in accountability regimes. The empirical studies described above suggest the following two concrete hypotheses:
The Long-Term Persistence of School and Teacher Effects
However, small and perhaps unstable short-term effects do not necessarily translate into small long-term effects. Full persistence in either school or teacher effects’ size over time is likely an untenable assumption (Briggs & Weeks, 2011; Lockwood et al., 2007; McCaffrey et al., 2004). Yet the rates of decay may vary between schools and teachers. This possibility has not received sufficient examination because researchers and policymakers have primarily deployed value-added models to track teacher or school effects on test scores within a single school year rather than over a longer time period. When they do employ a longer time horizon, they tend to track the persistence of individual teachers’, not schools’, effects on students’ test scores.
Studies examining teacher effect persistence showed that the benefit (or cost) induced by a given teacher on end-of-year test scores tends to decay dramatically within a short period of time. Given two teachers who diverge by 1 SD in value-added based on end-of-year scores (~0.10–0.17 SD), 1 year later only 20% to 50% of the difference is still detectable (0.02–0.09 SD; Raudenbush, 2014). Two years later, the difference is estimated to be 15% to 40% (0.02–0.07 SD; Lockwood et al., 2007; McCaffrey et al., 2004), and 3 or more years later, this difference reduces to 10% to 20% (0.01–0.03 SD; Chetty et al., 2014b; Lockwood et al., 2007).
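The remaining-effect ranges above follow from multiplying the initial effect range by the persistence shares. The short sketch below reproduces that arithmetic using only the figures quoted in this paragraph; the variable and function names are hypothetical, and pairing the low initial effect with the low share (and high with high) is an assumption about how the ranges were formed.

```python
# Initial difference between two teachers 1 SD apart in value-added (SD units).
INITIAL_EFFECT_SD = (0.10, 0.17)

# Share of the initial difference still detectable, by years since treatment.
PERSISTENCE = {1: (0.20, 0.50), 2: (0.15, 0.40), 3: (0.10, 0.20)}


def remaining_effect(initial_range, share_range):
    """Pair low with low and high with high to bound the remaining effect."""
    low = initial_range[0] * share_range[0]
    high = initial_range[1] * share_range[1]
    return low, high


remaining = {
    years: remaining_effect(INITIAL_EFFECT_SD, share)
    for years, share in PERSISTENCE.items()
}

for years, (low, high) in remaining.items():
    print(f"{years} year(s) later: {low:.3f} to {high:.3f} SD")
```

The unrounded bounds (for example, 0.085 SD at the top of the 1-year range) roughly reproduce the two-decimal ranges reported above.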
Compared to the growing literature on the persistence of individual teacher effects on test scores, which encompasses over 10 studies (Raudenbush, 2014), the literature on the persistence of school effects remains sparse. Deming’s (2014) analysis of school lottery effects on fourth to eighth graders included an appendix revealing that the value-added estimate of the school a student attended in 2003 significantly predicts the average of their 2004 math and reading scores, net of the value-added estimate of the school attended in 2004, baseline test scores, and demographics. However, the 1-year persistence rate is not directly reported. Briggs and Weeks (2011) focused squarely on school effect persistence among fourth to eighth graders and found that school value-added effects on reading scores persist at a rate of 10% a year later, while school effects on math scores persist at a rate of 48%—a difference reflecting, perhaps, the fact that the learning of the latter subject is more malleable than the former. These results suggest that the 1-year persistence rate of school effects on math scores is at the high end of the range of individual teacher effects’ persistence rate, while for reading, school effects’ persistence rate is on the low end.
School effects plausibly persist at higher rates than do individual teacher effects for several possible reasons. First, schools typically constitute a more robust treatment than do individual teachers both within years (especially in middle and high schools, when students are taught by multiple instructors) and across years (exposure to specific schools typically lasts several years, whereas exposure to specific teachers more often lasts only 1 year). Second, whereas individual teachers appear to exert stronger effects on short-term cognitive skill growth, schools’ nonteaching-related factors may exert stronger effects on socioemotional skill growth (e.g., via exposure to peer networks, counselors, social workers, extracurricular activities). To the extent socioemotional skills are more malleable over the long term, especially after early childhood, and predict subsequent degrees of cognitive skill growth and educational attainment (Heckman & Mosso, 2014), we might expect schools to exert more persistent effects on test scores than do individual teachers.
Generating a complete picture of school and individual teacher effects’ persistence requires attending not only to the durability of their test score effects but also to their effects on long-term, non–test score outcomes. Recent research suggests that conclusions regarding the persistence of teacher and school effects may meaningfully vary based on the particular outcome in question. For example, Chetty et al. (2011) estimated that lingering kindergarten classroom effects on student test scores 7 years posttreatment are small (SD = 1.5 percentiles), representing a decay of over 75%, from 6.3 percentiles. However, these effects can generate significant long-term effects on non–test score outcomes in early adulthood, such as teenage birth rates, college attendance, and earnings, perhaps due to larger, more durable effects on socioemotional skills (i.e., levels of effort, initiative, engagement in class, and whether the student reports valuing school). Concretely, a 1 percentile increase in estimated kindergarten class value-added is associated with a 0.1 percentage point increase in the likelihood of attending college and a $483 increase in annual wages. Another study by Chetty et al. (2014b) suggested that a 1 SD increase in estimated teacher value-added during Grades 4, 5, 6, 7, or 8 is associated with a 0.5 to 1 percentage point increase in the likelihood of attending college.
The temporal dynamic observed for individual teacher effects may also apply to school effects. For example, Jennings et al. (2015) found that between-school differences explain more of the variance in college attendance than in 10th-grade test scores. Several other studies confirmed that educational interventions’ effects on test scores fade out precipitously within a few years but substantial effects emerge when longer-term outcomes are evaluated (Deming, 2009; Heckman et al., 2013). The intuition that school and teacher effects on test scores are less durable than their effects on socioemotional skills and non–test score outcomes is reinforced by Jennings et al.’s (2015) argument that, relative to test scores, which are highly correlated across school years, nonscore outcomes (e.g., college enrollment) depend far more on students’ conscious choices made within school or classroom contexts. School-level factors, such as counselors, school culture and expectations, and peer effects, at least at the high school level, may exert stronger effects on these choices than do individual teachers. Lacking robust evidence on the relative persistence of teacher versus school effects, our theoretical account above yields the following hypotheses:
Can schools that students attend prior to ninth grade also shape non–test score outcomes in high school and beyond? While high school effects on postsecondary outcomes may be discernible because the choices students make in their last year of high school exert considerable influence on their outcomes in the immediate years afterward, elementary and middle school effects on these outcomes may decay by the time students reach high school. Yet recent studies on skill development have suggested that interventions tend to be more effective when they are implemented earlier in a student’s childhood (Cunha & Heckman, 2007; Heckman, 2006; Heckman & Mosso, 2014). Middle school may be a critical intervention point, given that it is still relatively early in a child’s development, and it is a key transition period during which young adolescents face a range of formidable challenges and decisions (e.g., curricular choices) that may affect their long-term trajectories (Anderman & Maehr, 1994; Rathunde & Csikszentmihalyi, 2005; Simmons & Blyth, 1987; Wigfield et al., 1991). Thus, middle school quality may meaningfully stratify students’ postsecondary outcomes.
The Present Study
We believe estimating school effects on longer-term test score and non–test score outcomes is a valuable exercise at a time when theories of children’s skill development are evolving to encompass not only cognitive but also socioemotional skills (Heckman & Mosso, 2014), accountability regimes centered on test scores and teachers are receiving intense scrutiny (Koretz, 2017), and longitudinal administrative data encompassing test score and non–test score outcomes are increasingly available. Our rich dataset follows three cohorts of Massachusetts public school students as they move from elementary school through secondary school and into college. Measuring student outcomes within the K–12 pipeline both before and after middle school provides unusual analytical leverage, enabling us to construct two-level models estimating the size of middle school effects, net of baseline student abilities and high school and district effects, and these effects’ persistence over time.
Yet we would be remiss if we did not acknowledge two important concerns posed by our approach. First, teacher and school effects on non–test score outcomes may be artificially inflated compared to their effects on test scores because models of the latter employ lagged measures of the outcome, purging effect estimates of bias. Models of the former cannot. Moreover, holding teachers and schools accountable for longer term outcomes that may be less accurately estimated and tracked with a substantial time lag poses serious concerns, especially in an era of high educational personnel turnover (García & Weiss, 2019) and a fraught political environment.
Second, as noted at the outset, school effects on test score and non–test score outcomes could theoretically be explained by school-level differences in teaching-related factors (i.e., the quality of the individual teachers and/or the quality of the combined teaching workforce). If true, then school effects would be a compositional artifact of individual and collective teaching quality rather than a distinct driver of student outcomes. Indeed, a body of research articulates the potential importance of school-level differences in teaching-related factors, such as curricula, instructional resources, and professional development for student achievement. Creemers and Reezigt (1996), for example, emphasized school-level “rules and agreements” about how classroom instruction is executed (e.g., curricular materials, grouping procedures, teacher behavior) and evaluation and intervention policies that entail testing, remediation, and counseling, as well as school-level professional development policies. However, they bemoaned the complexity of empirically measuring and disentangling the aforementioned school-level factors from each other and from teacher-level factors. The relatively few studies attempting to isolate school-level instructional factors’ effects on student outcomes suggested that they exert modest effects on student outcomes (see Berends et al., 2010; Goddard et al., 2015; Kyriakides et al., 2010; Lubienski et al., 2008; Von Secker & Lissitz, 1999).
Another line of research informs whether school effects merely reflect individual teacher effects by estimating the proportion of the variance in individual teacher quality that resides between versus within schools. Using administrative data from San Diego, Koedel and Betts (2007) found that despite the conventional wisdom, “essentially all” the variance in teacher value-added quality exists within schools, not between them, and implicated schools’ difficulty in identifying and hiring the highest value-added teachers as a potential explanation. A review of multiple studies suggested that 80% to 90% of the variance in teacher value-added quality resides within, rather than between, schools (Xu & Swanlund, 2013; see also Chetty et al., 2014a). Thus, school effects are unlikely to be explained by between-school differences in individual teacher quality or the collective teaching workforce. We return to this issue and the aforementioned endogeneity concern below.
Data and Methods
Our analyses rely on two data sources: administrative school records from the Massachusetts Department of Elementary and Secondary Education and college enrollment data from the National Student Clearinghouse (NSC). The former source allows us to identify all students enrolled in a Massachusetts public school, including charters, between 2002 and 2010 and track them for as long as they remained in any Massachusetts public school. We can follow students who moved from one public school to another (including charters). However, we cannot follow students who dropped out of school, moved to private schools, or moved to another state.
During the years covered by our data, the Massachusetts Comprehensive Assessment System (MCAS) administered a math test to students enrolled in Grades 4, 6, 8, or 10 and an English language arts (ELA) test to students enrolled in Grades 4, 7, 8, or 10 near the end of the school year. For the cohort of students who entered seventh grade during the 2004–2005 school year (subsequently referred to as the 2004 cohort), we link K–12 records to NSC college enrollment data.
Sample and Variable Specification
Massachusetts schools’ grade configurations vary across local school districts (see Supplemental Appendix in the online version of the journal). For our analytic purposes, “middle school” designates a school that includes Grades 7 and 8, regardless of what other grades it includes. Just over 193,000 students entered the seventh grade of a Massachusetts public school for the first time during the fall of 2004, 2005, and 2006. We impose the following sample restrictions to generate stable estimates of middle schools’ value-added:
Considering seventh and eighth grades as our “treatment” period, we exclude students who did not remain in the same school throughout both grades. 1
Because estimating schools’ value-added effects requires an estimate of students’ initial skills, we restrict our sample to Massachusetts public school students who took the math and ELA tests near the end of fourth grade and the math test near the end of sixth grade.2,3
To gauge the persistence of middle school value-added (MSVA) effects using a measure of students’ skills after they completed eighth grade, we restrict our sample to Massachusetts public school students who took the math tests near the end of 10th grade.
Given that the MCAS tests change somewhat from year to year, we limit our sample to students who took the fourth-, sixth-, eighth-, and 10th-grade tests “on time.” This restriction eliminates students who were held back or skipped grades, but it ensures that every remaining cohort member had spent the same number of years in school between any two tests.
To ensure that our estimate of each school’s impact on seventh and eighth graders is moderately reliable, we drop all schools with fewer than 10 students in any of our three entering cohorts.
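The restrictions listed above can be expressed as a filtering routine. The sketch below is illustrative only: the record layout (`cohort`, `school_g7`, `school_g8`, `tests`), the convention of recording each test by the fall year of the school year in which it was taken, and the order in which the school-size rule is applied (after the student-level filters) are all assumptions, not details taken from the study’s data files.

```python
from collections import Counter

# Required tests mapped to the grade in which each should be taken.
REQUIRED_TESTS = {"math4": 4, "ela4": 4, "math6": 6, "math8": 8, "math10": 10}
MIN_STUDENTS_PER_COHORT = 10


def took_tests_on_time(student):
    """True if every required test was taken in the expected school year."""
    for test, grade in REQUIRED_TESTS.items():
        expected_year = student["cohort"] + (grade - 7)  # cohort = grade 7 fall
        if student["tests"].get(test) != expected_year:
            return False
    return True


def build_analytic_sample(students, cohorts=(2004, 2005, 2006)):
    # Student-level filters: same school in Grades 7 and 8, all tests on time.
    kept = [s for s in students
            if s["school_g7"] == s["school_g8"] and took_tests_on_time(s)]
    # School-level filter: at least 10 remaining students in every cohort.
    counts = Counter((s["school_g7"], s["cohort"]) for s in kept)
    schools = {s["school_g7"] for s in kept}
    large_enough = {
        sch for sch in schools
        if all(counts[(sch, c)] >= MIN_STUDENTS_PER_COHORT for c in cohorts)
    }
    return [s for s in kept if s["school_g7"] in large_enough]


def make_student(school, cohort, school_g8=None, late_test=None):
    """Build a hypothetical student record for the demonstration below."""
    tests = {t: cohort + (g - 7) for t, g in REQUIRED_TESTS.items()}
    if late_test is not None:
        tests[late_test] += 1  # test taken a year late (e.g., grade retention)
    return {"cohort": cohort, "school_g7": school,
            "school_g8": school_g8 or school, "tests": tests}


students = (
    [make_student("A", c) for c in (2004, 2005, 2006) for _ in range(10)]
    + [make_student("B", c) for c in (2004, 2005) for _ in range(10)]
    + [make_student("A", 2004, school_g8="B")]       # changed schools
    + [make_student("A", 2005, late_test="math10")]  # not on time
)
sample = build_analytic_sample(students)
print(len(sample), {s["school_g7"] for s in sample})
```

In the toy data, the student who changed schools and the student who took a test late are removed first, and school B is then dropped because it has no students in the 2006 cohort.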
These restrictions produce an analytic sample of 123,261 students, pooled across three cohorts, and enrolled in 355 middle schools. With an average of nearly 350 students per school, our sample produces precise estimates of schools’ mean value-added. For every student in the analytic sample, the data set includes our core test score outcomes of interest—eighth- and 10th-grade MCAS math percentiles—as well as fourth- and sixth-grade MCAS math scores and fourth-grade MCAS ELA scores, enabling us to construct growth models using two lagged measures of the outcome and an ELA measure in fourth grade. Given concerns that advantaged and disadvantaged students tend to accumulate knowledge at different rates, we supplement these baseline score measures with a rich set of student-level sociodemographic controls measured in Grade 4, including: gender, race/ethnicity, age, country of birth, and eligibility for a free or reduced-price lunch (FRPL; an indicator of whether parents’ income falls below 185% of the poverty line). Then, to address the concern that peer effects bias value-added estimates (e.g., Angrist & Lang, 2004; Barr & Dreeben, 1983; Kane & Staiger, 2008), we also control for school-level averages of each of the aforementioned sociodemographic variables (pooling all three cohorts together) for the middle school each student attended. The final variable we include is a non–test score outcome, available for only the 2004 cohort: an indicator of whether the student enrolled in a 4-year college/university in 2010 (the year they would be expected to graduate from high school if not retained in any grade). A score of 1 indicates they enrolled, and 0 indicates they did not.
Table 1 presents descriptive statistics for the unrestricted sample of 193,259 students, for the 123,261 students in the analytic sample, and for the 38,620 students in the 2004 cohort for whom we also have NSC college enrollment data as well as a valid high school value-added (HSVA) estimate (we describe how these estimates are calculated in the Results section). Math and ELA scores are higher in the two subsamples with complete data than in the full sample, and the percentages of African American, Hispanic, and FRPL students are lower in the former than the latter. These differences derive mainly from two factors. First, low-income and minority students in our sample repeat grades and change schools considerably more often than do higher-income and White students. Second, low-income and minority students are less likely to have taken the 10th-grade tests because they are more likely to have withdrawn from school or stopped attending regularly before the end of 10th grade. To partially mitigate these selection concerns, we reran all analyses with the 10th-grade “on time” restriction lifted. Doing so increased our analytic sample only slightly, from 123,261 to 126,452 (an increase of ~2.6%), and did not substantively change any of our results.
Table 1. Descriptive Statistics on Massachusetts Public School Students Entering Seventh Grade Between 2004–2005 and 2006–2007
Note. Age on September 1 of students’ sixth-grade year. School-level demographics are not weighted by number of students per school. ELA = English language arts; FRPL = free or reduced-price lunch.
Given our sample restrictions, the analyses that follow must be interpreted as capturing middle schools’ effects on the test scores of students who remain in the same middle school for multiple years and who progress “on time.” They cannot speak to middle schools’ role in increasing or decreasing the likelihood that students remain in the same school and progress through grades on schedule. While we recognize the important limitations of this approach, we see a fundamental trade-off between stability/reliability and generalizability when measuring school effects and have opted for the former in this study. Our aim is to estimate the size and persistence of “true” middle school effects, even if on a relatively advantaged student population.
Estimating the Size and Stability of Short-Term School Effects
Education researchers frequently employ HLMs to evaluate the effects of a school or teacher/classroom within a given year (e.g., Adcock & Phillips, 1997; Raudenbush & Bryk, 2002; Reardon & Raudenbush, 2009). These models partition the variance in academic achievement into between- and within-school and/or teacher/classroom components, while accounting for the clustering of certain types of students within schools/classrooms. The between-school/teacher variance component calculated by these models is commonly used to estimate the proportion of the variance in a given academic outcome that is not attributable to compositional (e.g., demographic, prior academic ability) differences among students.
Our data permit us to follow suit by employing two-level HLMs estimating the proportions of the variance in eighth-grade MCAS math scores that reside between, rather than within, middle schools. We also gauge the extent to which these between-school differences in achievement are attributable to between-school differences in students’ prior test scores (“Growth” models), student-level demographic characteristics (“Value-Added” models), and school-level demographic characteristics (“VA + Peer Effects” models). We then use the latter model to calculate MSVA estimates representing how much higher/lower each school’s students performed, on average, compared to students with similar baseline test scores and demographic backgrounds, who attended demographically similar schools.
Concretely, we begin by partitioning the total variance in eighth-grade math scores into between– versus within–middle school components via an unconditional random intercept model:
$$Y_{ij} = \gamma_{00} + r_{0j} + e_{ij} \qquad (1)$$
where the outcome $Y_{ij}$ is the eighth-grade math score (grand mean centered and standardized to have a mean of 0 and an SD of 1) of student $i$ (Level 1), nested within middle school $j$ (Level 2). $\gamma_{00}$ represents the fixed component of the middle school–level intercept (the estimated math score for the average middle school, pooled across three cohorts), $r_{0j}$ is the random error component of the middle school–level intercept (the deviation of the particular middle school’s intercept from the mean middle school’s intercept), and $e_{ij}$ is the student-level error term (the difference between a student’s eighth-grade math score and the average score among students within the same middle school). This model and all that follow assume the random component of the middle school–level intercept ($r_{0j}$) and the individual-level residual ($e_{ij}$) are normally distributed with means of zero and variances of $\tau_{00}$ and $\sigma^2$, respectively.
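We first assess whether the middle school–level intercept’s random component ($r_{0j}$) has a variance that differs significantly from 0, that is, whether eighth-grade math scores vary systematically across middle schools.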
We then determine the extent to which estimated middle school differences in eighth-grade math scores reflect compositional factors: (1) students’ prior test scores, (2) student-level demographic characteristics, and (3) school-level demographic characteristics. Equation 2, our most complete model of eighth-grade math scores, returns to the unconditional random intercept model as a base and adds the full slate of student- and school-level covariates available (all grand mean centered and standardized to have a mean of 0 and an SD of 1), producing a random intercept and fixed slope model:
$$Y_{ij} = \gamma_{00} + \sum_{k=1}^{9} \beta_{k} X_{kij} + \sum_{m=1}^{6} \gamma_{0m} W_{mj} + r_{0j} + e_{ij} \qquad (2)$$
Here $X_{1ij}, \ldots, X_{9ij}$ denote the student-level covariates and $W_{1j}, \ldots, W_{6j}$ the school-level covariates. $\beta_1$, $\beta_2$, and $\beta_3$ represent the fixed slopes quantifying the relationships between the eighth-grade math score of student $i$ within school $j$ and their sixth-grade math score, fourth-grade math score, and fourth-grade ELA score, respectively. $\beta_4$, $\beta_5$, $\beta_6$, $\beta_7$, $\beta_8$, and $\beta_9$ represent the fixed slopes quantifying the relationships between the eighth-grade math score of student $i$ within school $j$ and their age, racial/ethnic, class, and nativity characteristics. $\gamma_{01}$, $\gamma_{02}$, $\gamma_{03}$, $\gamma_{04}$, $\gamma_{05}$, and $\gamma_{06}$ represent the relationships between a given student’s eighth-grade math score and school-level demographic characteristics whose values vary across schools but not among students within schools. If, after accounting for these student-level score differences and student- and school-level demographic differences, variance in the school-level intercept still significantly exceeds 0, then we have strong evidence that true middle school effects exist and that they are not merely a function of student- or school-level demographic differences.
The next step is to quantify these residual (i.e., noncompositional) middle school effects on students’ eighth-grade test scores by calculating a value-added estimate for each of the 355 middle schools (which we subsequently refer to as MSVA). MSVA represents how much higher or lower that school’s students perform, on average, compared to how students with similar baseline scores and demographic characteristics, who attend demographically similar schools, would be expected to perform. To generate the estimates we return to Equation 2, the most complete model of students’ eighth-grade test scores, and calculate the best linear unbiased predictors (BLUPs; Raudenbush & Bryk, 2002) of the middle school random effects (r0j). It is important to note that the variance of the BLUP estimates is negatively biased by a factor equal to the reliability of the school value-added estimates. However, this bias appears negligible when groups (i.e., schools) exceed 40 students (von Hippel & Bellows, 2018), which the vast majority of our schools do.
The random effects (BLUP estimates) are employed as school value-added estimates and assumed to be normally distributed, with a mean of 0 and a variance of $\tau_{00}$, the between-school variance component.
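The shrinkage that BLUPs apply can be sketched as follows. This is a minimal illustration, assuming the simplest random-intercept case, in which a school's BLUP equals its mean student residual multiplied by a reliability weight: between-school variance / (between-school variance + within-school variance / n). The function names are hypothetical, the variance components (0.027 and 0.245) are those reported for the most complete model in the Results section, and the 0.20 SD raw residual is an invented input. It is not the estimation routine used in the study.

```python
def reliability(between_var: float, within_var: float, n_students: int) -> float:
    """Reliability weight for a school's mean residual, given its number of students."""
    return between_var / (between_var + within_var / n_students)


def shrunken_effect(mean_residual: float, between_var: float,
                    within_var: float, n_students: int) -> float:
    """Shrink a school's raw mean residual toward 0 by its reliability."""
    return reliability(between_var, within_var, n_students) * mean_residual


# Variance components from the most complete eighth-grade model (see Results).
BETWEEN_VAR = 0.027
WITHIN_VAR = 0.245

# Hypothetical raw mean residual of 0.20 SD, for schools of different sizes.
for n in (10, 40, 200):
    weight = reliability(BETWEEN_VAR, WITHIN_VAR, n)
    estimate = shrunken_effect(0.20, BETWEEN_VAR, WITHIN_VAR, n)
    print(f"n = {n:>3}: reliability = {weight:.2f}, shrunken effect = {estimate:.3f} SD")
```

Under these assumptions, a school with 40 students has a reliability of roughly 0.8, so its estimate is shrunk only mildly, and the weight approaches 1 as school size grows.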
Estimating Longer Term School Effects
We then shift to evaluating whether middle school effects not only exist but persist over time. To this end, we follow prior analyses, such as Konstantopoulos (2007b), by constructing multivariate regressions predicting student test scores at least 1 year posttreatment, as a function of school-level value-added estimates (in our case, MSVA)—the key predictor of interest—as well as pretreatment test scores and student sociodemographics. After running an unconditional random intercept model to partition the variance in 10th-grade math scores into between- and within-middle school components, we begin with Equation 2 and make three adjustments: changing the outcome of interest from the original, short-term effect (i.e., eighth-grade math score) to the longer term outcome (i.e., 10th-grade math score); adding in the MSVA covariate; and removing the sociodemographic characteristics of the middle school attended by the student. Note that we do not include eighth-grade math score as a predictor because one of the key channels through which middle schools are likely to matter is via eighth-grade math scores and we do not want to “control away” this effect.
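These adjustments yield Equation 3:
$$Y^{(10)}_{ij} = \gamma_{00} + \gamma_{01}\,\mathrm{MSVA}_{j} + \sum_{k=1}^{9} \beta_{k} X_{kij} + r_{0j} + e_{ij} \qquad (3)$$
where $Y^{(10)}_{ij}$ is the 10th-grade math score of student $i$ who attended middle school $j$, $\mathrm{MSVA}_j$ is that school’s value-added estimate, and $X_{1ij}, \ldots, X_{9ij}$ are the student-level prior test scores and demographic characteristics retained from Equation 2. The key parameter of interest in Equation 3 is $\gamma_{01}$. Its significance indicates whether or not middle school quality effects (measured via test scores) on 10th-grade math scores are discernible. Because MSVA is grand mean centered and standardized, with a mean of 0 and an SD of 1, $\gamma_{01}$ can be interpreted as the effect of a 1 SD increase in test score–based middle school quality on 10th-grade scores. By dividing the $\gamma_{01}$ coefficient by the original MSVA estimate generated via Equation 2, we evaluate the percentage of the middle school effect that is still discernible 2 years later (i.e., the persistence rate).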
An important consideration is that middle school effects’ persistence may reflect, in part, the disproportionate propensity of students attending higher value-added middle schools to subsequently attend higher value-added high schools through a sorting process driven by (1) achievement gains induced by the middle school (e.g., in the case of selective high schools) and/or (2) an unobservable set of individual- and/or household-level factors predicting both middle school and high school sorting. We perceive channel (1) to be a valid middle school–induced mechanism shaping longer term outcomes, so we first report MSVA effects on 10th-grade math scores without controlling for any information related to high school of enrollment. However, to address the concern that channel (2) is confounding middle school effects on 10th-grade math scores, we calculate a set of HSVA estimates based on 10th-grade math scores and add them into our model. By comparing MSVA effects with and without an HSVA control, we generate a range of plausible MSVA effect sizes on 10th-grade scores.
To calculate HSVA, we use the same framework as Equation 1 but define the timeframe between the eighth- and 10th-grade tests (roughly ninth and 10th grades) as the treatment period, adding eighth-grade math scores to the model as an additional control. Students are assigned to the high school they entered in ninth grade, without regard to whether they remained in the high school until they took the 10th-grade test. As with middle schools, we limit the sample to high schools with 10 or more eligible students in each cohort, yielding a subsample of 117,932 students enrolled in 314 high schools. (Model output underlying our HSVA estimates is available upon request.)
After generating HSVA estimates, we assign these school-level high school estimates and the school-level middle school estimates we generated previously to individual students and then rerun Equation 3, with one additional control: HSVA. The significance of MSVA and the magnitude of its coefficient, net of HSVA, provide stronger evidence regarding the extent to which middle schools exert direct effects on 10th-grade math scores, beyond sorting processes. Finally, we replicate the exact same analysis (i.e., Equation 3 with and without HSVA) for the 2004 student cohort, replacing the 10th-grade math score outcome with a longer term, non–test score outcome: whether the student enrolled in a 4-year college by 2010 (i.e., the year they would have been expected to graduate from high school if not retained in any grade). Given that this final outcome is binary, rather than continuous, a logistic regression model would typically be recommended. However, in order to preserve consistency across the study’s analytic models and to generate more easily interpretable results, we construct a linear probability model, using HLM for our core analyses of school effects on 4-year college enrollment. As a robustness check, we replicate these analyses using logit models with standard errors clustered by middle school and generate results that are substantively unchanged.
As mentioned above, estimating MSVA effects on the 4-year college enrollment outcome is likely more prone to omitted variable bias in our models than is estimating MSVA effects on test scores because the former outcome, unlike the latter, likely reflects difficult-to-observe factors (e.g., parental wealth, educational aspirations) that cannot be controlled away using a lagged test score measure. Although data limitations preclude us from directly accounting for these broader sets of factors, we rerun our college enrollment analyses using district fixed effects, under the assumption that variation in parental wealth and educational aspirations is likely to be far larger between school districts than within them (Owens, 2016; Owens et al., 2016).
Results
To what extent do students’ eighth-grade test scores diverge between, as opposed to within, middle schools? The unconditional random intercept model produces a middle school–level intercept of −0.050—the average of all Massachusetts middle schools’ grand mean centered eighth-grade math test scores—with a variance of 0.168, which is significantly different from 0 (likelihood ratio = 18957.21, p < .001) and translates to an SD of 0.410. By dividing the random intercept’s variance by the total variance in the outcome (the ICC), we estimate that 17% of the total variance in eighth graders’ math scores resides between, rather than within, the middle schools they attend. Jennings et al. (2015), Konstantopoulos (2006), and Nye et al. (2004) similarly found that approximately 20% of the variance in reading and math test scores resides between, rather than within, schools. We can infer the vast majority of test score variance resides not between schools but between students and classrooms within the same schools.
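The ICC arithmetic above can be reproduced in a few lines. The sketch below assumes the between-school variance reported in this paragraph (0.168) and the within-school variance of the unconditional model reported in the next paragraph (0.848); the function name is illustrative only.

```python
import math


def intraclass_correlation(between_var: float, within_var: float) -> float:
    """Share of total variance that lies between groups (here, middle schools)."""
    return between_var / (between_var + within_var)


between_school_var = 0.168   # random intercept variance, unconditional model
within_school_var = 0.848    # student-level variance, unconditional model

icc = intraclass_correlation(between_school_var, within_school_var)
school_sd = math.sqrt(between_school_var)

print(f"ICC = {icc:.3f}")
print(f"Between-school SD = {school_sd:.3f}")
```

The ICC of about 0.165 rounds to the 17% figure, and the square root of 0.168 gives the SD of 0.410 cited above.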
A large portion of the estimated between-school variance in test score levels may merely reflect differences in the composition of students served. To test this possibility, we first estimate growth models of eighth-grade math scores that incorporate prior test scores from fourth and sixth grades and then estimate value-added models (Table 2, Model 1). In this growth model, both the Level 1 and Level 2 variance components decline dramatically relative to the unconditional model, from 0.848 to 0.247 and from 0.168 to 0.029 (SD = 0.171), respectively, though the random intercept’s variance remains significantly different from 0 (likelihood ratio = 9488.33, p < .001). Over four-fifths of the estimated variance in middle schools’ eighth-grade math scores appears to reflect differences in the baseline (i.e., “pretreatment”) achievement levels of their student bodies. This finding echoes the large literature arguing that school-level evaluation procedures must adjust for differences in baseline aptitude in order to generate informative score-based measures of school effectiveness.
Table 2. Hierarchical Linear Models of Eighth-Grade Math Scores for 2004–2005, 2005–2006, 2006–2007 Cohorts, Student-Level N = 123,261, Middle School–Level N = 355
Note. Dependent and independent variables in models above are grand mean centered and standardized to have a mean of 0 and an SD of 1. ELA = English language arts; FRPL = free or reduced-price lunch; MSVA = middle school value-added.
*p < .05. **p < .01. ***p < .001, two-tailed.
It is also worth noting that not only do sixth-grade math scores significantly predict eighth-grade math scores, as we would expect, but fourth-grade math and, to a lesser extent, fourth-grade ELA scores are also significant predictors. These secondary findings suggest that the inclusion of multiple pretreatment skill measures helps reduce bias in estimating school effects (see also Deming, 2014). Once these baseline score differences are accounted for, the remaining between–middle school variance suggests attending a middle school 1 SD more effective than average is associated with an approximately 0.17 SD gain in a student’s eighth-grade math score relative to the score that would otherwise be expected. Interestingly, the subsequent value-added models that incorporate student- and school-level demographic characteristics do not meaningfully account for between-school differences in eighth-grade math scores, net of baseline test scores. Adding in student-level demographic characteristics when predicting eighth-grade math scores (Table 2, Model 2) reveals effects in the theoretically predicted directions: Black and low-income students score between 0.01 and 0.02 SD lower on eighth-grade math than do other students, net of baseline test scores and other demographics, while immigrants score about 0.02 SD higher. Taken together, student-level demographics reduce the student-level variance component from 0.247 to 0.245, while the school-level variance component remains virtually unchanged at 0.030.
The addition of school-level demographic characteristics (Table 2, Model 3) generates some surprising results: proportions Black, Latino, and immigrant are significantly associated with higher eighth-grade math scores, net of all other covariates, while proportion FRPL is significantly and negatively associated with the outcome. The school-level factors only marginally diminish the random intercept’s variance, to 0.027 (SD = 0.16), suggesting that exclusion of peer effects only slightly biases MSVA effects in this sample. With all variables included, the random intercept’s variance indicates that 10% of the remaining variance in eighth-grade math scores resides between rather than within middle schools. This is just above the analogous percentages calculated by Jennings et al. (2015) when estimating HSVA effects on 10th-grade scores of Massachusetts and Texas students (9% and 6%, respectively).
The most complete model (Table 2, Model 3) also suggests that attending a middle school 1 SD more effective than average is associated with an approximately 0.16 SD boost in a student’s eighth-grade math score compared to students with similar baseline scores and demographic characteristics, who attend demographically similar schools. This estimate is just slightly above the estimated 0.14 SD average reading and math score boost associated with a 1 SD boost to school value-added calculated by Jennings et al. (2015) with regard to 10th graders and Deutsch (2013) with regard to fourth to eighth graders. It is considerably above the 0.10 SD boost calculated by Deming (2014) with regard to fourth to eighth graders’ math and reading score average.
Although our data preclude us from directly comparing the analogous short-term estimate of teacher effects using the same sample, we can benchmark our school effect estimates with the analogous end-of-treatment teacher effect estimate of 0.10 to 0.17 SD on math scores from results of multiple studies, conducted across grade levels and geographies (Chetty, Hendren, & Katz, 2016; Hanushek & Rivkin, 2012). Our estimated end-of-treatment middle school effect size (0.16 SD) may initially seem comparable to the average teacher effect size.
However, there is an important difference. The teacher “treatment” in most teacher effects studies is conceptualized as lasting 1 year, while the middle school “treatment” in our study lasts 2 years. Thus, it may not be appropriate to directly compare the end-of-treatment gains between a 1-year (i.e., teacher) and 2-year treatment (i.e., school). To generate a fairer comparison, we compare our end-of-treatment middle school effect estimate (0.16 SD) to the estimated effect of receiving 2 consecutive years of instruction from a teacher that ranks 1 SD higher than average in value-added. If teacher effects persisted at a rate of 100% 1 year posttreatment (an untenable assumption), then receiving 2 years of instruction from a teacher that is 1 SD higher than average quality would be associated with a 0.20 to 0.34 SD score gain at the end of the 2-year treatment (0.10–0.17 SD from Year 1 instruction + 0.10–0.17 SD from Year 2 instruction). If teacher effects persisted at a rate of 20% to 50%, as empirical evidence suggests, then 2 years of instruction from a teacher that is 1 SD higher than average quality would be associated with an approximately 0.14 to 0.26 SD score gain (0.04–0.09 SD from Year 1 instruction + 0.10–0.17 SD from Year 2 instruction). Integrating these calculations with our estimated 0.16 middle school effect size provides tentative support for Hypothesis 1: School effects on test scores are initially smaller than are teacher effects—especially when adjusted for intervention length.
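The 2-year benchmark can be traced with a short calculation. This sketch assumes, as in the paragraph above, that the Year 1 effect carries forward at a given persistence rate and the Year 2 effect is added in full; the function name and the pairing of the lowest effect with the lowest persistence rate (and the highest with the highest) are illustrative assumptions.

```python
def two_year_gain(one_year_effect: float, persistence: float) -> float:
    """Gain at the end of Year 2 from two consecutive, equally effective teachers.

    The Year 1 effect is carried forward at the given persistence rate and the
    Year 2 effect is added in full.
    """
    return one_year_effect * persistence + one_year_effect


LOW_EFFECT, HIGH_EFFECT = 0.10, 0.17   # one-year teacher effects cited above (SD units)
SCHOOL_EFFECT = 0.16                   # end-of-treatment middle school estimate

# Upper benchmark: Year 1 effects persist fully.
full_low = two_year_gain(LOW_EFFECT, 1.0)
full_high = two_year_gain(HIGH_EFFECT, 1.0)

# Empirical benchmark: Year 1 effects persist at 20% to 50%.
partial_low = two_year_gain(LOW_EFFECT, 0.20)
partial_high = two_year_gain(HIGH_EFFECT, 0.50)

print(f"100% persistence: {full_low:.2f} to {full_high:.2f} SD")
print(f"20% to 50% persistence: {partial_low:.3f} to {partial_high:.3f} SD")
```

Pairing the extremes this way gives the widest possible bracket, from about 0.12 SD to just over 0.25 SD, so the lower bound in particular depends on how the effect-size and persistence ranges are combined. The 0.16 SD school estimate falls toward the low end of that bracket.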
Estimating the Stability of Middle School Effects
If school effects are both smaller in size and considerably less stable across years than are teacher effects, then the educational evaluation research is justified in focusing on the latter rather than the former. We have shown evidence that our short-term school effect estimates may be initially smaller than are widely cited estimates of teacher effects, but are they less stable? To gauge this possibility, we shift from calculating MSVA estimates by pooling data across all three cohorts (as Kane and Staiger, 2002, recommended) to recalculating these estimates, disaggregated by cohort. We then calculate the correlation in each school’s cohort-specific MSVA estimates across each of the 3 years. We find that for our full analytic sample of 355 middle schools, cohort-specific value-added estimates are correlated at approximately .6 in consecutive years. As expected, these year-to-year correlations increase slightly, to approximately .7, when only middle schools with the largest cohort sizes (200+ students per cohort) are included and decrease slightly, to .4 to .5, when only schools with the smallest cohort sizes (fewer than 20 students per cohort) are included (full results available upon request). Prior studies estimate year-to-year correlations of .6 to .8 for teacher value-added effects on math scores, though several have estimated even lower correlations of .2 to .6 (Loeb & Candelaria, 2012). Thus, if all middle schools are considered, the stability of school-level value-added effects appears comparable to, if slightly lower than, that of teacher value-added effects.
Because accountability systems typically target schools at the very top and very bottom of the distribution, we conduct another stability analysis focused on middle schools whose first of three cohorts in our data placed them in the top or bottom quintile of the state distribution. We ask, what percentage of these top- and bottom-quintile middle schools remained in the top or bottom quintile during the subsequent year? The higher the percentage, the more stable MSVA effects are and the more useful their signal is likely to be for policymakers. Our data suggest about 50% of the schools ranked in the bottom and top quintiles based on the first cohort remain in their respective quintiles based on the second cohort. Similar quintile persistence rates are obtained when comparing results from the second to third cohorts (full results available upon request). An analogous set of analyses conducted based on the value-added estimates of San Diego high school teachers (Koedel & Betts, 2007) and Florida elementary school teachers (McCaffrey et al., 2008) generated top- and bottom-quintile 1-year persistence rates of 30% to 50%, depending on the particular model specification used to generate the value-added estimates. Thus, school value-added effects appear to be more valid and stable than previously thought—perhaps even more stable than are teacher value-added effects. Hypothesis 2 is not supported.
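The two stability checks described in this section, year-to-year correlation and quintile persistence, can be sketched as follows. The data are synthetic (a stable school component plus cohort-specific noise, with arbitrary variances), and the function names are hypothetical; the sketch shows the mechanics of the calculation only and does not reproduce the study's estimates.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def quintile_persistence(year1, year2, top=True):
    """Share of schools in the top (or bottom) quintile in year1 that stay there in year2."""
    cuts1 = statistics.quantiles(year1, n=5)   # four cut points
    cuts2 = statistics.quantiles(year2, n=5)
    if top:
        in_q1 = [v >= cuts1[-1] for v in year1]
        in_q2 = [v >= cuts2[-1] for v in year2]
    else:
        in_q1 = [v <= cuts1[0] for v in year1]
        in_q2 = [v <= cuts2[0] for v in year2]
    stayed = sum(a and b for a, b in zip(in_q1, in_q2))
    return stayed / sum(in_q1)


# Synthetic cohort-specific value-added estimates for 355 hypothetical schools.
random.seed(0)
stable_component = [random.gauss(0, 1) for _ in range(355)]
cohort1 = [s + random.gauss(0, 0.8) for s in stable_component]
cohort2 = [s + random.gauss(0, 0.8) for s in stable_component]

print(f"Year-to-year correlation: {pearson(cohort1, cohort2):.2f}")
print(f"Top-quintile persistence: {quintile_persistence(cohort1, cohort2, top=True):.2f}")
print(f"Bottom-quintile persistence: {quintile_persistence(cohort1, cohort2, top=False):.2f}")
```

Quintile membership is defined separately within each year, matching the question of whether a school ranked in an extreme quintile for one cohort is ranked there again for the next.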
Estimating the Persistence of Middle School Effects
Having estimated the size and stability of middle schools’ short-term effects, we shift to their persistence over time. We begin by decomposing the variance in 10th-grade math scores to discern whether between–middle school differences in these scores are evident 2 years posttreatment. If middle school effects on 10th-grade math scores diminished entirely 2 years posttreatment and students attending the same middle school were not systematically more likely to attend higher or lower value-added high schools, there would be no discernible between–middle school variance in this outcome. However, the unconditional random intercept model of 10th-grade math scores produces a middle school–level intercept of −0.053, with a variance of 0.166, which is significantly different from 0; middle school effects remain discernible 2 years later. The ICC suggests 16% of the total variance in 10th graders’ math scores resides between, rather than within, the middle schools they attend—just slightly below the 17% result generated when eighth-grade math scores were considered—providing additional evidence that middle school effects may persist 2 years later.
For a more rigorous assessment of MSVA effects’ persistence, we add in fourth- and sixth-grade test scores and student-level demographics, the estimated MSVA of the particular middle school the student attended, and finally the estimated HSVA of the high school the student attended. For each model, we evaluate the size and significance of the MSVA covariate’s coefficient. Incorporating fourth- and sixth-grade test scores reduces the random intercept’s variance from 0.162 in the previous model to 0.030; 80% of the middle school–level variance in 10th-grade test scores is attributable to baseline aptitude and sociodemographic differences between groups of middle school alums (Table 3, Model 1).
Table 3. Hierarchical Linear Models of 10th-Grade Math Scores for 2004–2005, 2005–2006, 2006–2007 Cohorts, Student-Level N = 117,932, Middle School–Level N = 354
Note. Dependent and independent variables in models above are grand mean centered and standardized to have a mean of 0 and an SD of 1. MSVA estimate is based on Table 2, Model 3 (value-added + peer effects). See text for description of how HSVA is estimated. Estimated persistence rate is calculated by dividing coefficient on MSVA in each model above by MSVA estimate generated by Table 2, Model 3 (0.163). MSVA = middle school value-added; HSVA = high school value-added; ELA = English language arts; FRPL = free or reduced-price lunch.
*p < .05. **p < .01. ***p < .001, two-tailed.
Once we include the MSVA covariate in the model, a clearer picture of middle school effects’ persistence begins to emerge. MSVA’s estimated effect on 10th-grade math scores is strongly significant and substantively large (β = 0.107, p < .001; Table 3, Model 2). In other words, a 1 SD increase in MSVA is estimated to increase a student’s math score 2 years later (i.e., in 10th grade) by about a tenth of an SD, even after accounting for pre–middle school scores and student-level demographic characteristics. By dividing this predicted MSVA effect on 10th-grade math scores by the original MSVA effect based on eighth-grade scores (recall that a 1 SD increase in MSVA translated to a 0.163 SD eighth-grade score increase), we calculate a 2-year persistence rate of 66%, which exceeds the estimated 15% to 40% 2-year persistence rate of teacher effects calculated by recent studies (Chetty et al., 2014b; Lockwood et al., 2007).
To what extent is the persistence of this MSVA effect attributable to post–middle school student sorting? Students from middle schools with high (/low) value-added effects may tend to move on to high schools with high (/low) value-added effects due, perhaps, to a middle school–induced boost to children’s educational performance and/or expectations, which in turn, leads them to attend a higher quality high school than they would have otherwise attended had they attended a lower quality middle school. If this is the case, then high school quality–based sorting is a valid mechanism by which middle schools exert their effects on 10th-grade test scores, and the 2-year persistence rate of 66% is applicable regardless of whether HSVA effects mediate some portion of this lingering effect.
However, another plausible explanation for correlated middle school and HSVA estimates implies that middle school effects on 10th-grade test scores should be estimated independent of high school quality. If high (/low) value-added middle schools tend to feed into high (/low) value-added high schools due to district policies or neighborhood resources that shape the performance of both school types, then the middle schools themselves are doing little to drive the additional test score benefits reaped by attending a high-quality high school, and MSVA effects on 10th-grade test scores should be adjusted accordingly. To account for this scenario, we use the prior model and add in students’ HSVA estimate, which is correlated at .18 (unweighted) with students’ MSVA estimate, indicating a modest degree of sorting. In this model, the HSVA covariate is highly significant (β = 0.128, p < .001), and the MSVA covariate remains strongly significant, though its coefficient attenuates modestly in size, from 0.107 to 0.085 (a reduction of about 20%). In this more rigorous analysis, a 1 SD increase in MSVA is associated with a math score boost 2 years posttreatment of about 0.09 SD and exhibits an estimated 2-year persistence rate of 52%. High school effects on 10th-grade test scores are about 50% larger than are middle school effects on the same outcome, which is plausible given the temporal ordering of the school treatments.
Next we run a model identical to Table 3, Model 3, but add in district fixed effects to partially account for the possibility that difficult-to-observe factors, such as parental wealth and educational expectations, could confound MSVA effects on 10th-grade test scores. We argue that these types of confounders likely vary primarily between rather than within school districts. In this model, the MSVA coefficient modestly diminishes, from 0.085 to 0.071 (p < .001; see Table 3, Model 4), generating an estimated 2-year persistence rate of 44%. For an even more rigorous test of middle school effects, we replace HSVA estimates and district fixed effects with high school fixed effects (full model results available upon request). This model goes even further in netting out school sorting effects by comparing middle school value-added effects on 10th-grade math scores only among students attending the same high school. The MSVA coefficient once again reduces slightly; however, at 0.067, it remains significant and displays a persistence rate of 41%. It is important to remind the reader that the prior three models “control away” the effects of MSVA induced via high school sorting (e.g., by leading students into higher value-added high schools than they otherwise would have attended) and thus reflect a narrower conception of middle school effects.
Figure 1 summarizes the calculations of middle school value-added effects’ persistence rates based on each aforementioned model specification. Every model specification generates a 2-year middle school effect persistence rate that exceeds the range of estimated 2-year teacher effect persistence rates (i.e., 15%–40%), providing tentative support in favor of Hypothesis 3: School effects are more persistent than are individual teacher effects, at least with regard to math scores.

Figure 1. Estimated effect of a 1 SD increase in MSVA on students’ eighth-grade and 10th-grade math scores (in SDs), holding all other covariates at their means.
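The persistence rates discussed above follow from dividing each 10th-grade MSVA coefficient by the eighth-grade effect of 0.163 SD. The sketch below uses the coefficients reported in the text; the dictionary labels and function name are illustrative shorthand rather than the model names used in the tables.

```python
EIGHTH_GRADE_EFFECT = 0.163  # effect of a 1 SD increase in MSVA on eighth-grade scores

# Coefficients on MSVA in the 10th-grade math models, as reported in the text.
tenth_grade_coefficients = {
    "student controls only": 0.107,
    "plus HSVA": 0.085,
    "plus HSVA and district fixed effects": 0.071,
    "high school fixed effects": 0.067,
}


def persistence_rate(later_effect: float, initial_effect: float = EIGHTH_GRADE_EFFECT) -> float:
    """Fraction of the end-of-treatment effect still discernible later."""
    return later_effect / initial_effect


for label, coefficient in tenth_grade_coefficients.items():
    print(f"{label}: {persistence_rate(coefficient):.0%}")
```

These ratios round to 66%, 52%, 44%, and 41%, matching the rates reported for each specification.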
As a final robustness check, we rerun the 10th-grade math score model for only our oldest cohort using middle school value-added estimates based only on the two younger cohorts, and not all three cohorts. Purging the middle school value-added estimate of data from the cohort for whom we are tracking 10th-grade math scores further assuages concerns about the reliability and stability of MSVA effects. This approach yields virtually identical results (available upon request) to those produced by the core model output described above.
Our final core analysis replicates Equation 3 but replaces the 10th-grade math score with a longer term, non–test score outcome: whether or not a member of the 2004 student cohort enrolled in a 4-year college/university “on time” (i.e., in the year 2010). The unconditional random intercept model of 4-year college attendance produces a middle school–level intercept of 0.553, with a variance of 0.031, which is significantly different from 0 and translates to an SD of 0.176. The ICC suggests that 13% of the total variance in middle schoolers’ propensity to attend a 4-year college/university on time resides between, rather than within, the middle schools they attend. Incorporating fourth- and sixth-grade test scores reduces the random intercept’s variance from 0.031 in the previous model to 0.011 (Table 4, Model 1). Thus 65% of the middle school–level variance in 4-year college enrollment is attributed to pre–middle school aptitude and sociodemographic differences across middle school student bodies, which is approximately 15 percentage points less than the analogous school-level variance in 10th-grade math accounted for by the same factors. This reflects one of two possibilities: (1) middle schools exert stronger value-added effects on students’ decisions regarding college enrollment than on test scores, or (2) unobservable factors, such as parental expectations and aspirations, shape both middle school sorting and the likelihood of 4-year college enrollment, net of sociodemographics. In other words, the relatively large remaining between-school variance may not reflect middle school–induced differences. We attempt to account for this possibility shortly.
Table 4. Hierarchical Linear Models of 4-Year College/University Enrollment for 2004–2005 Cohort, Student-Level N = 38,620, Middle School–Level N = 352
Note. Dependent variable is binary, with 1 indicating the student enrolled in a 4-year college/university in the year 2010 and 0 indicating the student did not. All independent variables in models above are grand mean centered and standardized to have a mean of 0 and an SD of 1. The MSVA estimate is based on Table 3, Model 3 (value-added + peer effects). See text for description of how HSVA is estimated. MSVA = middle school value-added; HSVA = high school value-added; ELA = English language arts; FRPL = free or reduced-price lunch.
*p < .05. **p < .01. ***p < .001, two-tailed.
After controlling for between–middle school sociodemographic and test score differences, MSVA’s estimated effect on 4-year college attendance is strongly significant and substantively large (β = 0.036, p < .001; Table 4, Model 2). A 1 SD increase in MSVA is estimated to increase a student’s likelihood of attending a 4-year college or university by about 3.6 percentage points, or an approximately 6% increase relative to the overall average likelihood of enrollment for this analytic sample (57%). The estimate remains significant but falls by 11%, to 3.2 percentage points, when the HSVA control is included. However, again, if one believes a valid pathway by which middle schools shape longer term outcomes is by propelling students to attend a higher quality high school than they otherwise would have, then this model “over controls” for middle school effects. Regardless, it is important to note that MSVA’s effect on 4-year college enrollment is considerably larger than the estimated HSVA effect on the same outcome (β = 0.018, p < .001; Table 4, Model 3).
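The conversions in this paragraph (percentage points to a relative increase, and the attenuation of the coefficient across models) can be reproduced as follows. This is a minimal sketch using only the coefficients quoted in the text; the helper names are hypothetical.

```python
def relative_increase(effect, baseline_rate):
    """Effect in probability units expressed as a share of the baseline rate."""
    return effect / baseline_rate


def attenuation(before, after):
    """Proportional drop in a coefficient between two model specifications."""
    return (before - after) / before


baseline_enrollment = 0.57  # average 4-year enrollment rate in the sample
msva_model2 = 0.036         # MSVA coefficient, Table 4, Model 2
msva_model3 = 0.032         # MSVA coefficient with HSVA control, Model 3
hsva_model3 = 0.018         # HSVA coefficient, Model 3

rel = relative_increase(msva_model2, baseline_enrollment)
drop = attenuation(msva_model2, msva_model3)
msva_to_hsva = msva_model3 / hsva_model3

print(f"Relative increase in enrollment: {rel:.1%}")
print(f"Attenuation after adding HSVA: {drop:.1%}")
print(f"MSVA-to-HSVA coefficient ratio: {msva_to_hsva:.2f}")
```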
As we alluded above, one might counter that our analytical framework is more susceptible to generating biased estimates of MSVA effects on college enrollment than biased estimates of middle school effects on test score outcomes. In this view, accounting for lagged test score measures likely purges bias induced by omitting variables that predict our core test score outcomes of interest, as prior studies have shown (Deming, 2014). However, because there is no comparable lagged measure of college enrollment—and college enrollment may be influenced by difficult-to-observe factors like parental wealth, social networks, and educational aspirations, which are not included in our data set—it is more difficult to purge omitted variable bias from our estimates of MSVA effects on college enrollment.
In an attempt to “net out” the effect of unobservable factors underlying both sorting on middle school value-added and college enrollment decisions, we once again add district fixed effects (see Figure 2). In this model, the estimated effect of MSVA on 4-year college enrollment falls by 50% (from 0.032 to 0.016), but the coefficient remains statistically significant (p < .01; see Table 4, Model 4). We then replicate all HLM specifications from Table 4 using a logit model, with standard errors clustered by middle school (results available upon request). The estimated middle school effects are nearly identical regardless of whether HLM or logit is used.
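To illustrate why fixed effects can shrink a coefficient in this way, the sketch below simulates a simplified setting in which an unobserved district-level factor drives both value-added and the outcome. It is not our estimation code: the data are synthetic, the parameter values (`beta`, `gamma`) are arbitrary assumptions, and a plain OLS slope on group-demeaned data stands in for the district fixed-effects HLM.

```python
import random


def ols_slope(x, y):
    """Bivariate OLS slope: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def demean_by_group(values, groups):
    """Subtract each group's mean (the 'within' transformation)."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def simulate(n_districts=200, per_district=20, beta=0.5, gamma=1.0, seed=0):
    """Synthetic data: district factor u shifts both x and y."""
    rng = random.Random(seed)
    x, y, district = [], [], []
    for d in range(n_districts):
        u = rng.gauss(0, 1)  # unobserved district-level confounder
        for _ in range(per_district):
            xi = u + rng.gauss(0, 1)
            yi = beta * xi + gamma * u + rng.gauss(0, 1)
            x.append(xi)
            y.append(yi)
            district.append(d)
    return x, y, district


x, y, district = simulate()
pooled = ols_slope(x, y)  # ignores districts; absorbs the confounder
within = ols_slope(demean_by_group(x, district), demean_by_group(y, district))

print(f"Pooled slope: {pooled:.3f}")
print(f"Within-district slope: {within:.3f} (true beta = 0.5)")
```

In this toy setup the pooled slope is inflated by the district-level confounder, while the within-district slope lands close to the true value. Whether the drop we observe reflects removed bias or removed true between-district variation in school quality cannot be settled by such a simulation.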

Figure 2. Estimated effect of a 1 SD increase in MSVA on the probability of 4-year college/university enrollment, while holding all other covariates at their means.
Overall, then, our estimate of approximately 1.5– to 3.5–percentage point boosts associated with a 1 SD increase in MSVA exceeds the estimated 0.1– and 1.0–percentage point increases in the likelihood of college attendance spurred by a 1 SD increase in kindergarten class quality (Chetty et al., 2011) and fourth- to eighth-grade teacher quality (Chetty et al., 2014b), respectively. Hypothesis 4 is supported.
Discussion
Spurred in part by NCLB, rigorous evaluations of teacher and, to a lesser extent, school effects on student test scores have become a fixture of the contemporary U.S. educational landscape. Despite ongoing controversy, value-added models have proven quite durable, and recent analyses suggest they yield largely unbiased estimates. However, the vast majority of value-added models estimate individual teacher effects, rather than school effects, and do so based on end-of-year test scores.
In this study, we consider whether schools’ effects on test score and non–test score outcomes deserve a greater share of researchers’ and policymakers’ attention than they have received in recent years. To this end, we estimate middle schools’ value-added effects on both short-term (eighth grade) and longer term (10th grade) math scores among three cohorts of Massachusetts public school students and, within the eldest cohort, gauge their effects on 4-year college enrollment. Our two-level models confirm that schools’ effects (encompassing both teaching- and nonteaching-related factors) on test scores are initially smaller than are existing estimates of individual teachers’ effects on test scores. However, school effects appear nearly as stable across years and more persistent over time, especially when non–test score outcomes are considered.
Theories of skill development, combined with our empirical results and our account of plausible school-level mechanisms that shape children’s development, lead us to speculate that schools induce larger effects on students’ socioemotional skills than do individual teachers, and individual teachers induce larger effects on students’ cognitive skills than do schools. Recent evidence suggests that educational interventions’ effects on cognitive skills, measured via test scores, rapidly fade, whereas meaningful effects on socioemotional skills may persist and, in turn, support long-term life outcomes (Heckman & Mosso, 2014). Although we do not directly measure socioemotional skill growth due to data limitations, the facts that (1) middle school effects on 10th-grade test scores appear smaller than do high school effects and (2) the pattern is reversed when the outcome shifts to college enrollment are consistent with the hypothesis that middle schools may shape socioemotional development. This proposition is theoretically plausible in light of evidence that socioemotional and other skills are more malleable at earlier phases in the lifecycle, and that socioemotional skills exert greater effects on non–test score outcomes, such as college enrollment, than on test scores (Heckman et al., 2006). If true, then schools may matter more than the Coleman report and subsequent education research have suggested, but score-based measures, especially if examined only in the short term, may understate their importance.
Limitations, Extensions, and Policy Implications
Our results should be interpreted with caution, given two important concerns raised at the outset. The first concern centers on the possibility that omitted variable bias affects our results, especially given that the independent variable measures we employ are incomplete. Data constraints preclude controlling for a more granular set of student-, family-, school-, and neighborhood-level characteristics that could theoretically bias school value-added estimates (e.g., continuous measures of family income and wealth, parental education and expectations). Our models include HSVA measures, as well as high school and district fixed effects, which may help mitigate these unobserved factors. Moreover, Chetty et al. (2014b) found that finer grained controls drawn from Internal Revenue Service tax data provide limited additional utility in generating unbiased value-added measures, beyond the set of controls traditionally used in such analyses. However, future research using lotteries or natural experiments in Massachusetts would help determine whether our approach over- or underestimated middle school effects in this geographic context.
Second, one could counter that our estimated middle school effects merely reflect school-level differences in teaching-related factors, such as (1) the quality of individual teachers and (2) the quality of the combined teaching workforce. If our data permitted, we would run supplementary analyses leveraging three-level models that explicitly disentangle school effects from individual teacher effects to examine this proposition directly; instead, we use two-level models and benchmark our school effect estimates against the large body of work that has generated teacher effect estimates.
Yet we believe our results would remain largely consistent even in a three-level framework. As previously mentioned, prior research suggests the vast majority of the variance in individual teacher value-added quality resides within, not between, schools. Moreover, two studies employing three-level models that decompose the variance in high school and kindergarten students’ end-of-year test scores into student-, teacher-/classroom-, and school-level components suggested that the school-level variance component remains similar in magnitude, regardless of whether the teacher/classroom level is explicitly modeled (Konstantopoulos, 2006, 2007a). The kindergarten-based analysis used Project STAR data—which included about 18 students per classroom, four classrooms per school, and 79 schools—and generated a two-level model suggesting that 19.5% of the total variance in test scores resides at the school level. Accounting for the intermediary classroom–level variance via a three-level model reduces the school-level variance component estimate only slightly—to 16.5%, which is nearly identical to the school-level variance estimate we generated based on our two-level model. This same study argued that the larger the number of teachers/classrooms within each school-level data cluster, the more consistent the school-level variance component within a two-level model is predicted to be when the teacher-/classroom-level variance component is excluded (Konstantopoulos, 2007a). In sum, our data set’s exclusion of teacher-/classroom-level data may lead to slightly overstated school effect estimates (if conceptualized separate and apart from individual teacher quality), but the unusually large size of our data set should partially assuage this concern. 
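The intuition that more classrooms per school make the two-level school variance estimate more robust can be sketched numerically. The formula below is our own back-of-the-envelope derivation, not one taken from Konstantopoulos (2007a): it assumes a perfectly balanced design and a method-of-moments (ANOVA) estimator, under which a two-level model attributes a fraction (n − 1)/(Jn − 1) of the omitted classroom-level variance to schools, where n is students per classroom and J is classrooms per school. The variance components used are hypothetical.

```python
def absorbed_classroom_share(students_per_class, classes_per_school):
    """Fraction of classroom variance that a balanced two-level ANOVA
    estimator attributes to schools when classrooms are not modeled."""
    n, j = students_per_class, classes_per_school
    return (n - 1) / (j * n - 1)


def two_level_school_variance(school_var, classroom_var,
                              students_per_class, classes_per_school):
    """Expected school-level variance estimate from the two-level model."""
    share = absorbed_classroom_share(students_per_class, classes_per_school)
    return school_var + classroom_var * share


# Hypothetical variance components (shares of total variance), for illustration.
school_var = 0.15
classroom_var = 0.10

for classes in (2, 4, 10, 40):
    estimate = two_level_school_variance(school_var, classroom_var, 18, classes)
    print(f"{classes:>2} classrooms/school -> two-level school variance "
          f"{estimate:.3f}")
```

Under these assumptions the overstatement shrinks roughly in proportion to 1/J, so data sets with many classrooms per school leave the school-level component close to its three-level value.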
In future work we plan to use three-level models to retest this study’s contentions that schools exert sizable, stable, and persistent effects that are not primarily explained by individual teacher quality, and we encourage other researchers to do the same.
Other limitations remain. In terms of outcomes, our test score analyses examine math rather than ELA achievement, and our non–test score analyses consider college enrollment but ignore other key milestones, such as college completion, labor market performance, and criminal justice encounters. It is also important to note that the patterns found in this study are Massachusetts-specific and cohort-specific, and they may not replicate in other places and time periods. Jennings et al. (2015) estimated that more of the variance in 10th-grade students’ value-added growth resided at the school level in Massachusetts than in Texas (9% vs. 6%). This finding suggests school quality may vary to a greater extent within and between school districts in Massachusetts than in other states. If true, our estimates of school effects’ size and persistence may be at the high end of the U.S. distribution. However, the possibilities that Texas is an outlier and that state-level comparisons would change if middle schools, not high schools, were assessed, cannot be ruled out.
Furthermore, to reliably estimate school effects, we restricted our sample to students who progress from Grades 4 through 10 “on time” and who remained in the same middle school in seventh and eighth grades. As a result, we cannot draw conclusions regarding middle schools’ effects on students who move across schools or who skip or repeat grades. Because these students are disproportionately likely to be disadvantaged, future work will be required to thoroughly analyze the implications of middle schools for race and class inequality, in general, and for disadvantaged students, in particular. Tracking the effects of middle schools on the likelihood of switching schools, repeating grades, and chronic absenteeism would add depth to our understanding of school effects.
Conclusion
Despite its limitations, our study underscores the potential utility of shifting both researchers’ analytic approaches and policymakers’ strategies for creating incentives that facilitate long-term student success. The evaluation literature’s orientation should expand from a primary focus on teachers’ contributions over a 1-year time period to a combination of teachers’ and schools’ contributions over longer time horizons. Disentangling teacher and school effects becomes more difficult in this persistence framework, but we believe both types of effect estimates appear relatively stable over time. Large, ideally national, samples and multiple cohorts would increase estimate stability. Our findings also reinforce the value of expanding the dependent variables of interest from test scores to include a wide range of other non–test score outcomes, including socioemotional skills, as well as college enrollment and completion.
We recognize the practical difficulties of implementing these shifts. Specifically, it may be very difficult to hold teachers and schools accountable for longer term outcomes over which they feel they have no direct control (e.g., college enrollment). Fortunately, our data suggest school value-added effects measured toward the end of “treatment” are predictive of longer term test score and non–test score outcomes, and therefore policymakers could justify incorporating this shorter term measure into accountability regimes, at least as a first step. We echo Kane and Staiger’s contention that ideally this shorter term measure would be calculated over at least 2 to 3 years to maximize reliability and stability and ensure schools are not held “randomly accountable” (Staiger et al., 2002). If future studies confirm that school value-added effects are stable, persistent, and predictive of long-term outcomes, and meaningfully distinct from individual teacher value-added effects in three-level models, then identifying the particular school-level mechanisms responsible for middle school effects, and restructuring schools accordingly, will be critical.
Supplemental Material
Supplemental material, 6._OnlineSupp_Lloyd_Schachner_AERJ_final for School Effects Revisited: The Size, Stability, and Persistence of Middle Schools’ Effects on Academic Outcomes by Tracey Lloyd and Jared N. Schachner in American Educational Research Journal
Supplemental material for this article is available in the online version of the journal.
