Introduction
The accountability reform movement in the educational policy world generally acknowledges the importance of principals in the schooling process and the need for evaluating and holding principals accountable (Pashiardis & Brauckmann, 2009; Portin, Feldman, & Knapp, 2006). Prior to the 2010 Race to the Top (RttT) federal grant opportunities, however, few states had developed comprehensive evaluation systems for school administrators (Davis, Kearney, & Sanders, 2011; Jacques, Clifford, & Hornung, 2012). Indeed, the RttT guidelines revealed that evaluations of state proposals would examine
the extent to which the State, in collaboration with its participating LEAs, has a high quality plan and ambitious yet achievable annual targets to (a) Determine an approach to measuring student growth (as defined in this notice); (b) employ rigorous, transparent, and equitable processes for differentiating the effectiveness of teachers and principals using multiple rating categories that take into account data on student growth (as defined in this notice) as a significant factor; (c) provide to each teacher and principal his or her own data and rating; and (d) use this information when making decisions . . . (U.S. Department of Education, 2011, 37809)
Among those states that neither applied for nor obtained RttT funds, many found themselves adopting performance-based evaluation systems for both teachers and principals to satisfy the requirements for NCLB (No Child Left Behind) waivers (Pashiardis & Brauckmann, 2009; Portin et al., 2006). Similarly, the 2011 Teacher Incentive Fund grants were created by the U.S. Department of Education to fund $1.2 billion of projects to “develop and implement performance-based teacher and principal compensation systems in high-need schools.” This flurry of recent policy activity has focused on the importance of school leaders in improving outcomes for both teachers and students (Jacques et al., 2012). Moreover, the policy initiatives have strongly encouraged states to give significant weight to value-added or growth measures of student achievement. For example, the 2010 RttT grant application process motivated many states to propose and enact principal evaluation systems that incorporate measures of student achievement progress (U.S. Department of Education, 2011).
The focus on accountability for school leaders is not unique to the United States. Other countries, from Canada to Vietnam, are also investing in principal evaluation as a means of improving school quality and effectiveness and are struggling with similar questions about purposes, methods, and sources of evidence (Gaziel, 2008; Kim et al., 2010; Pashiardis & Brauckmann, 2009; Pham, 2011; Thomas, Holdaway, & Ward, 2000).
Little empirical research, however, has examined methods of estimating principal effectiveness, particularly for evaluative purposes. None of the high-quality studies investigating how to accurately assess principal effectiveness were written before RttT and the adoption of principal evaluation in numerous states across the country. By contrast, numerous high-quality studies investigating the estimation of teacher effectiveness preceded RttT and state adoptions of teacher evaluation strategies. Apparently, policy makers simply assumed that if teacher effectiveness could be estimated, then principal effectiveness could be estimated as well, despite the absence of research that would validate such an assumption. Policy makers may, in fact, have assumed that estimating principal effectiveness would be technically easier than estimating teacher effectiveness because of larger sample sizes at the school level and greater data availability.
The purpose of this article is to examine the assumptions underlying efforts to evaluate principal effectiveness in terms of student test scores, to examine research on efforts to estimate principal effectiveness in relation to student test scores, and to discuss the appropriateness of current efforts to evaluate principals with respect to student test scores. Because much of this article discusses some technical issues associated with test scores and statistical methods, we begin by defining some key terms used. We then review the purposes of evaluation and the characteristics of effective personnel evaluation, followed by a discussion of test scores and the appropriate use of test scores in personnel evaluations. Subsequently, we describe in detail the various statistical approaches to estimating principal effectiveness and the critiques of each of these approaches. Finally, we present our findings, discussions, and policy recommendations.
Definitions of Important Terms
In this article, the term school effectiveness refers to the impact a school has on changes in the test scores of students enrolled in that school, whereas “principal effectiveness” refers to the ability of the principal to affect changes in student test scores. Based on a voluminous amount of research in educational leadership, we contend that both school and principal effectiveness encompass far more than just changes in student test scores (Leithwood, Harris, & Hopkins, 2008; Leithwood & Riehl, 2003; Silins & Mulford, 2004). Indeed, we believe that effective principals strive for a number of important student outcomes other than simply improved scores on achievement tests. We use these terms in this way, however, because much of the policy and research focus is on estimating the impact of schools and principals on student test scores, rather than examining the relationship between educator behaviors and other outcomes such as increased student engagement, improved school culture, more effective communication, greater collaboration, increased teacher capacity, greater teacher retention, and so on.
Similarly, we use the term estimates of principal effectiveness in order to denote statistical efforts to quantify the impact a principal might have on the changes in school-level test scores. We use the term principal evaluation to refer to any effort to make judgments about principals based to some degree on statistical estimates of changes in student test scores. “Student growth” refers to changes in student test scores from one year to the next. As we describe below, the methods to assess student growth vary from simplistic subtraction of scores to sophisticated statistical procedures.
Purposes of Evaluation
The basic purpose of evaluation is, according to Fitzpatrick, Sanders, and Worthen (2011), “the identification, clarification, and application of defensible criteria to determine an evaluation object’s value (worth or merit) in relation to those criteria” (p. 7). Therefore, principal evaluation would entail the identification, clarification, and application of defensible criteria to judge the worth or merit of a principal relative to changes in test scores. Critical to this definition is the term defensible criteria, since it is inextricably related to the Joint Committee on Standards for Educational Evaluation’s (2009) recommendation that personnel evaluations should be “ethical, fair, useful, feasible, and accurate” (p. 1). We argue that principal evaluations cannot meet this standard unless the criteria used to judge principals are defensible.
Perhaps the most important objective of personnel evaluations is to provide a clear signal to the employee about her/his performance in order to improve performance (Joint Committee on Standards for Educational Evaluation, 2009). Indeed, every state that has adopted a principal evaluation plan has stated one of the purposes of the effort is to improve principal performance (Davis et al., 2011; Jacques et al., 2012). A positive evaluation signals to the employee to continue current behaviors since she or he is producing positive results. Alternatively, a negative evaluation signals to the employee to work harder or change behaviors because the current effort levels, enacted behaviors, or chosen strategies are not producing the desired results.
For the signaling efforts to have their intended effect, the personnel being evaluated must perceive the system as fair and equitable in order to act on the signals communicated by the evaluation (Grissom, Kalogrides, & Loeb, 2012; Joint Committee on Standards for Educational Evaluation, 2009; Kane & Staiger, 2002). If, in fact, personnel do not perceive the evaluation system to be fair, then they are likely to ignore, subvert, or game the evaluation process (Erdogan, 2002; Kane & Staiger, 2002). Evaluations perceived as unfair, thus, would not accomplish the purpose of either encouraging behavior that would lead to positive outcomes or discouraging behavior that would lead to negative outcomes. We contend, then, that any effort to evaluate principals or hold them accountable must be perceived as fair by those being evaluated, namely school principals. In other words, it must have face validity (Fitzpatrick et al., 2011; Koretz, 2008; Thomas et al., 2000).
Valid Uses for Student Test Scores
A primary assessment standard is that each test is valid for its intended use and that test administrators and policy makers take steps to reduce possible validity threats. According to The Standards for Educational and Psychological Testing developed by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1999), “Validity refers to the degree to which evidence and theory support the interpretations of test scores” (p. 9). There are two forms of validity: construct validity, the use of test scores according to a theoretical construct variable inferred from multiple types of convergent evidence, and validity in use, or how the test scores are used to make decisions, such as determining what content students might still need to learn or where to place a student in a level of instruction appropriate to his or her needs.
Tests that are developed to measure a construct (e.g., reading or math) are designed to be used for a specific purpose. It is then the responsibility of the end user to use the test data to make valid inferences. The American Psychological Association has established strong guidelines for test developers and users on ensuring the validity of these uses and limiting misuse, such as using tests for unintended purposes without prior validation. Specifically, Standard 1.2 of the Standards for Educational and Psychological Testing states that test developers have a responsibility to describe the population the test is designed for and the kinds of interpretations of test scores that are appropriate. Standard 1.4 explains that if a test is used in a new way, it is the user’s responsibility to collect new validity evidence to support the new use. Thus, if those same reading test scores that were collected to determine student instructional placement were now going to be used to evaluate a building principal, supervisors would be responsible for gathering validity evidence to justify this new use.
Use of Student Test Scores for Personnel Decisions
Despite the fact that student achievement tests were not designed to evaluate either teachers or principals, many states now use such test scores to, in fact, evaluate teachers and principals. A 2007 report from the National Comprehensive Center for Teacher Quality lists two serious issues to be considered before using student test scores to assess teacher effectiveness that also apply to efforts to assess principal effectiveness: First, these types of tests were not designed for such purposes and therefore are not sensitive to factors that would allow for an analysis of teacher or principal contributions; and, second, variability in alignment among the tests, the curriculum, and what is taught might mean that student learning is not accurately reflected in the test scores, thus rendering estimates of teacher and principal effectiveness inaccurate (Goe, 2007).
Similarly, the Economic Policy Institute released a 2010 brief titled “Problems With the Use of Student Test Scores to Evaluate Teachers,” written by a group of leading researchers in the field of education testing and measurement (E. L. Baker et al., 2010). In the brief, the authors expressed grave concerns about the technical appropriateness of the use of test scores to evaluate teachers:
There is broad agreement among statisticians, psychometricians, and economists that student test scores alone are not sufficiently reliable and valid indicators of teacher effectiveness to be used in high-stakes personnel decisions, even when the most sophisticated statistical applications such as value-added modeling are employed. (E. L. Baker & Linn, 2002)
What is more,
Poor measurement of the lowest achieving students has been exacerbated under NCLB by the policy of requiring alignment of tests to grade-level standards. If tests are too difficult, or if they are not aligned to the content students are actually learning, then they will not reflect actual learning gains. (E. L. Baker & Linn, 2002, p. 2)
Thus, at best, there is only mixed support in the psychometric community for the use of student test scores that were originally designed to measure student achievement as a method to evaluate teachers and, by extension, principals.
Approaches to Evaluating Principal Effectiveness
Despite the serious issues concerning the use of test scores for unintended purposes as described above, there is great appeal among policy makers in using assessments of student growth as a measure of teacher, school, and principal effectiveness (Betebenner & Linn, 2010; Davis et al., 2011). As such, numerous states have adopted measures of student growth as at least one component of the state’s principal evaluation system (Jacques et al., 2012). Moreover, numerous individual districts, such as the Houston Independent School District, have adopted principal evaluation systems that include some form of student growth. In this section, we examine the various approaches to evaluate principal effectiveness and the specific statistical approaches states and districts might employ under each approach. We then critique the different statistical approaches employed under each general approach to evaluating principal effectiveness.
There are, as noted by Grissom et al. (2012), three basic approaches to evaluating principal effectiveness. The first approach assumes principal effectiveness can be accurately estimated by simply estimating school effectiveness. The second approach assumes that principal effectiveness can be accurately measured only by isolating the effects of a principal on student test scores apart from the effects of schools on student test scores. To isolate the effects of principals, this approach compares the effects of a single principal to the effects of two sets of principals: first, the principals immediately preceding or following the principal in the same school and, second, the set of principals connected to the preceding and following principals in other schools. To clarify this complex comparison, consider the scenario portrayed in Figure 1. Suppose Principal T is a principal in School A, Principal P is the principal that immediately preceded Principal T in School A, and Principal F is the principal immediately following Principal T in School A. Furthermore, suppose Principal P had previously been a principal in School B and that Principal X had preceded Principal P and Principal Y had succeeded Principal P. Also assume that Principal F was previously a principal in School C and was preceded by Principal Q and succeeded by Principal R. Under this approach, the effectiveness of Principal T would be estimated by comparing her or his effectiveness to Principal P as well as Principals X and Y through their associations with Principal P in School B. Moreover, Principal T would also be compared to Principal F and both Principals Q and R through their associations with Principal F in School C. Thus, efforts to assess the effectiveness of Principal T would involve comparisons to the effectiveness of six other principals. The connections required in this approach are described in greater detail below.
The third approach assumes that the only accurate approach to estimate principal effectiveness is to examine school improvement relative to prior achievement levels at the same school during a principal’s tenure.

Figure 1. Portrayal of the comparisons of the effectiveness of one principal through associations with other principals under Approach 2.
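The chain of comparisons in this scenario can be traced with a short sketch. The dictionary of succession records and the function names below are hypothetical constructs introduced only for illustration; they are not drawn from Grissom et al. (2012) or from any state system. The sketch assumes that each school's principals are listed in order of tenure and that the comparison set consists of a principal's immediate predecessors and successors plus the predecessors and successors of those principals.

```python
# Hypothetical succession records: each school's principals in order of tenure.
# Letters follow the scenario in the text; the data structure is an assumption.
successions = {
    "School A": ["P", "T", "F"],
    "School B": ["X", "P", "Y"],
    "School C": ["Q", "F", "R"],
}


def neighbors(principal, successions):
    """Principals who immediately preceded or followed `principal` in any school."""
    found = set()
    for order in successions.values():
        for i, name in enumerate(order):
            if name == principal:
                if i > 0:
                    found.add(order[i - 1])
                if i < len(order) - 1:
                    found.add(order[i + 1])
    return found


def comparison_set(principal, successions):
    """Direct neighbors plus the neighbors of those neighbors, excluding the principal."""
    direct = neighbors(principal, successions)
    indirect = set()
    for other in direct:
        indirect |= neighbors(other, successions)
    return (direct | indirect) - {principal}


comparisons = sorted(comparison_set("T", successions))
print(comparisons)  # ['F', 'P', 'Q', 'R', 'X', 'Y']
```

Under these assumptions, Principal T is linked to six other principals, which matches the count given in the scenario. A principal with no predecessor or successor who served elsewhere would have a much smaller comparison set.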
In adopting specific models and approaches, policy makers must make certain assumptions about testing, accountability, and evaluation and the ability of statistical models to accurately capture effects of schools and principals on student test scores. Figure 2 provides a very simplistic diagram of the factors a principal evaluation system might include depending on the assumptions of the policy makers designing the system and data availability in that particular state or district. In reality, each of the factors has some degree of interdependence with the other factors and, moreover, additional factors would likely be considered. Although overly simplistic, our model is sufficient for making our points about the different approaches and assumptions employed by almost all current efforts to evaluate principals based on test scores. For each approach described below, we provide a version of this model. Boxes with solid lines and no shading indicate the factor is included in the approach, whereas boxes with dashed lines and shading indicate the factor is not included in the approach.

Figure 2. Description of the factors included in various approaches to evaluating principals based on student test scores.
Furthermore, within our discussions below, we note the assumptions held by advocates of particular approaches. In addition, a comparison of these assumptions, descriptions of the variables employed in the models, and both the strengths and weaknesses of each approach are included in Table A.1 in Appendix A.
Approach 1: Principal Effectiveness Is Best Measured by School Effectiveness
Arguably, the most common approach to evaluating principal effectiveness is to simply equate principal effectiveness with school effectiveness (Davis et al., 2011; Jacques et al., 2012). Simply put, if a school is considered to be “effective” by some particular measure, then the principal of the school must also be effective. To make such a claim, advocates of this approach must assume that principals have complete and unilateral control over all of the resources, policies, strategies, and functions of a school. Indeed, if a principal did not have complete control over school factors that influence changes in student test scores such as student demographics, community support for education, school facilities, and access to a pool of quality teachers, then school effectiveness and principal effectiveness would not be equal.
More specifically, there are underlying assumptions about the nature of the relationships between student and school characteristics and student test scores for this approach. In short, these assumptions reflect whether policy makers believe student and school characteristics influence student test scores. The first assumption is that neither student characteristics nor school characteristics influence student test scores, whereas the second assumption is that both student and school characteristics do influence student test scores. We review each of these assumptions and the methods to evaluate principal effectiveness under each of these assumptions.
Approach 1, Assumption A: Student and School Characteristics Have No Influence on Test Scores
If principal effectiveness is assumed to be equivalent to school effectiveness and student and school characteristics have no influence on student growth, then estimating principal effectiveness is relatively simple and straightforward. Indeed, if policy makers assume that student and school characteristics have no influence on student test scores, then evaluation models of principal effectiveness could rely on estimates of school effectiveness that do not control for either student characteristics or school characteristics. This view that only prior student test scores influence test scores is portrayed in Figure 3. The shaded boxes indicate that the particular factor is not included in the approach.

Figure 3. Description of the factors included in Approach 1, Assumption A.
States and districts have generally adopted one of five specific strategies within Approach 1, Assumption A: (a) Changes in Percentage of Students Passing/Proficient, (b) Changes in Scale Scores, (c) Changes in z Scores and Percentile Ranks, (d) Student Growth Percentiles/Median Growth Percentiles, and (e) Simple Value-Added Models (VAMs). An additional strategy is to employ Student Learning Objectives (SLOs). SLOs are goals identified by a teacher in collaboration with a principal or other supervisor that identify expected learning outcomes or growth targets for students on a particular assessment. For the purpose of this article, SLOs can be considered a form of Change in Percentage Passing or Change in Scale Scores.
Changes in percentage of students passing/proficient
A number of programs, policy makers, and even some researchers rely on the change in the percentage of students passing, proficient, advanced, or college-ready as a measure of school effectiveness and, hence, principal effectiveness. For example, the Indiana RISE principal evaluation handbook provides example objectives to measure principal performance, and one such example was as follows: “The bottom 25% of grade 6-8 students, based on last year’s [state test] scores, will increase their [state test] ELA passing rates by 10%” (Indiana Department of Education, n.d.). Similarly, as part of the school accountability system, Texas calculates school growth by simply calculating the change in the percentage of students passing the state-mandated tests (Texas Education Agency, 2011).
Koretz (2008) and a host of other assessment experts, however, have long highlighted the deficiencies of using changes in the percentage of students meeting a particular standard such as passing, proficient, advanced, or college-ready as a measure of student or school progress. Indeed, testing experts and researchers universally accept that using such a metric is problematic for multiple reasons. First, this metric is almost always an inaccurate indicator of student growth, since the change in the percentage of students meeting a particular standard is profoundly influenced by the distribution of student scores around the cut point for the particular standard (Betebenner & Linn, 2010; Glazerman & Potamites, 2011; Koretz, 2008). For example, a school with a relatively large proportion of students scoring within a few questions of meeting a particular standard will inevitably show greater gains in the percentage of students meeting that standard than a school with only a small percentage of students scoring within a few questions of the particular standard. Yet the second school could actually have far greater growth than the first school. In this way, actual growth is often masked by the use of this metric. Even in states with multiple standards such as proficient, advanced, and college-ready, there is a wide range of achievement within each band that masks the true growth of students (Glazerman & Potamites, 2011; Koretz, 2008). Strikingly, schools with real declines in student growth can achieve increases in the percentage of students meeting a particular standard (Glazerman & Potamites, 2011; Koretz, 2008).
Second, ceiling effects can influence the results. For example, a school that increased the passing rate on a test from 95% to 100% and had a scale score change of 200 would be considered less effective than a school that increased the student passing rate from 10% to 20% and had a scale score change of 25 because the first school would only evidence a 5–percentage point change in the percentage of students passing whereas the second school would claim a 10–percentage point increase.
Finally, such a metric ignores the influence of student and school characteristics on student test scores (Betebenner & Linn, 2010; Glazerman & Potamites, 2011; Koretz, 2008). It therefore does not identify the unique contribution of the school to student achievement (Glazerman & Potamites, 2011). In short, such a metric provides an inaccurate estimate of the actual contribution of the school or principal to student growth.
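The sensitivity of this metric to the distribution of scores around the cut point can be illustrated with a small numerical sketch. All values below are invented for illustration: the cut score of 70, the two sets of student scores, and the uniform gains are assumptions and do not represent any actual state test or school.

```python
from statistics import mean

CUT_SCORE = 70  # hypothetical passing cut point


def pass_rate(scores):
    """Percentage of scores at or above the cut point."""
    return 100 * sum(s >= CUT_SCORE for s in scores) / len(scores)


# Hypothetical prior-year scores for the same ten students in each school.
# School 1: most students start just below the cut and gain 2 points each.
school_1_prior = [68, 69, 69, 68, 69, 50, 75, 80, 69, 68]
school_1_now = [s + 2 for s in school_1_prior]

# School 2: most students start far below the cut and gain 15 points each.
school_2_prior = [40, 42, 45, 38, 50, 44, 41, 75, 80, 43]
school_2_now = [s + 15 for s in school_2_prior]

results = {}
for name, prior, now in [
    ("School 1", school_1_prior, school_1_now),
    ("School 2", school_2_prior, school_2_now),
]:
    rate_change = pass_rate(now) - pass_rate(prior)
    mean_gain = mean(now) - mean(prior)
    results[name] = (rate_change, mean_gain)
    print(f"{name}: pass-rate change {rate_change:+.0f} points, mean gain {mean_gain:+.1f}")
```

In this constructed example, School 1 gains 70 percentage points in its passing rate on a mean gain of only 2 scale points, whereas School 2 shows no change in its passing rate despite a mean gain of 15 points. The ranking of the two schools therefore reverses depending on which metric is used.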
Changes in scale scores
To avert the problems associated with using the changes in percentage of students meeting a standard, some researchers use scale scores as a strategy to compare the progress of schools. Student scale scores are derived from a statistical transformation to ensure scores from different versions of the test within the same year or across years are reported on the same scale. Using scale scores is arguably a more accurate method for identifying student growth, because scale scores are sensitive to the changes in achievement of all students, not just to the changes in achievement of those students moving from “not passing” to “passing.” However, to compare scale scores across grade levels and years, the scale scores must be “vertically aligned.” In other words, the scores must be placed on a scale that covers all grades tested and has no ceiling. Creating such a system is incredibly complex (Tomkowicz, Zhang, & Yen, 2010). For example, suppose a state creates a scale in which performance is measured continuously across Grades 3 through 8. This requires “vertically aligned learning standards with considerable grade-to-grade overlap and a systematic, intentional increase in grade-to-grade difficulty” (Tomkowicz et al., 2010, p. 3). Given this difficulty, we not surprisingly found few examples of states or districts using changes in scale scores to assess school growth. 1
Even with vertical alignment, using scale scores to assess growth is also problematic for a number of reasons. First, as was the case with using percentage passing, ceiling effects can make accurate comparisons difficult. Indeed, a school with a large proportion of students scoring at or near the top of the scale may actually appear to have lower growth than a school with an initially lower average scale score. Second, as implied by the first problem, a school’s prior year average can have a profound effect on the overall growth score of a school, unless the scale has no upper bound. Finally, and most important, changes in scale scores over time ignore the influence of student and school characteristics on the changes in scale scores. As such, the metric cannot be used to isolate the effects of the school or principal on student growth. Hence, it cannot be used to identify either school or principal effectiveness.
Changes in z scores and percentile ranks
To circumvent the issues related to scale scores, researchers often convert scale scores into z scores by transforming the scale scores in such a way that the mean of the scores is set to 0 and the standard deviation is set to 1. Similarly, researchers also create percentile rankings by transforming the scale scores into percentile ranks in a manner that places the median scale score at the 50th percentile. In both instances, the transformation allows researchers to avoid the problems associated with using percentage of students passing and comparing scale scores across years and grades. The simple change in z scores or percentile ranks is then employed as an indicator of school or principal effectiveness. For example, a 2012 study of the effectiveness of principals from the New Leaders for New Schools employed variants of both these methods to estimate principal effectiveness (Burkhauser, Gates, Hamilton, & Ikemoto, 2012).
Although using z scores or percentile ranks certainly avoids most of the issues related to the above two strategies, this third strategy is also problematic. First, percentile ranks are ordinal, thus any basic mathematical manipulations such as addition or subtraction can be misleading (Glazerman & Potamites, 2011). More important, simply subtracting the prior year’s average z scores or percentile ranks from the current year does not adjust for the effects of student and school characteristics on test scores. Thus, as was the case with the first two strategies, using this metric to identify school or principal effectiveness would be inaccurate and, hence, inappropriate to use in a personnel evaluation system.
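The two transformations, and the reason that arithmetic on percentile ranks can mislead, can be sketched as follows. The ten scale scores are invented for illustration, and the percentile rank is defined here as the percentage of scores strictly below a given score, which is one common convention among several and is an assumption of this sketch.

```python
from statistics import mean, pstdev

# Hypothetical scale scores for one tested grade, in ascending order.
scores = [300, 340, 350, 355, 360, 365, 370, 380, 420, 480]


def z_score(score, population):
    """Standardize a score so the population has mean 0 and standard deviation 1."""
    return (score - mean(population)) / pstdev(population)


def percentile_rank(score, population):
    """Percentage of scores in the population that fall strictly below `score`."""
    return 100 * sum(s < score for s in population) / len(population)


z_scores = [z_score(s, scores) for s in scores]
ranks = [percentile_rank(s, scores) for s in scores]

# Equal steps in percentile rank can correspond to very different score gaps:
# moving from rank 40 to rank 50 spans 5 scale points here, whereas
# moving from rank 80 to rank 90 spans 60 scale points.
gap_mid = scores[5] - scores[4]
gap_top = scores[9] - scores[8]
print(ranks[4], ranks[5], gap_mid)
print(ranks[8], ranks[9], gap_top)
```

Because a 10-point change in percentile rank represents 5 scale points in the middle of this distribution and 60 points near the top, subtracting one year's percentile ranks from the next treats unequal amounts of growth as if they were equal.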
Student growth percentiles/median growth percentiles
A relatively similar approach to assessing student growth is the calculation of student growth percentiles (SGPs), sometimes referred to as median growth percentiles (MGPs). In states such as New York and Colorado, advocates claim that this approach is similar to a value-added approach that seeks to measure student growth based on current and prior test scores. Although SGPs conceivably measure student growth, they do so in a very different manner than VAMs. A basic SGP simply compares the scale score growth of a student relative to the scale score growth of all other students. To control for prior scores, a student’s growth is compared only to that of other students who had the same initial scale score. Table 1 provides a simplified example of four students and their percentile gains. For use in estimating school or principal effectiveness, the percentile growth for each student is calculated. In a simplistic application, the SGPs are then arranged in ascending order and the median SGP becomes the indicator of student progress for the entire school and for the principal. In a more sophisticated application, states would use “a method called quantile regression to estimate the rarity that a child falls in her current position in the distribution, given her past position in the distribution” (B. D. Baker, Oluwole, & Green, 2013, p. 7).
Table 1. Simplified Example of Student Growth Percentiles.
The primary flaw in the SGP approach is the lack of controls for student and school characteristics. Interestingly, despite multiple states adopting this approach and using the results to assess teacher, school, and/or principal effectiveness (Jacques et al., 2012), the creators of the SGP strategy never intended it to be used in any manner other than as a description of the progress of individual students or groups of students (B. D. Baker et al., 2013). Indeed, the developers of SGP readily admit that the method was not created to attribute effectiveness to teachers, schools, or principals and should not be used in such a manner (B. D. Baker et al., 2013).
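The simplistic application described above can be sketched in a few lines. The eight student records below are invented for illustration and are not the values from Table 1; the sketch also assumes that a growth percentile is the percentage of same-starting-score peers with a lower current score, and it does not attempt the quantile regression used in more sophisticated applications.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (student, school, prior_score, current_score).
records = [
    ("s1", "A", 300, 310), ("s2", "A", 300, 330), ("s3", "B", 300, 320),
    ("s4", "B", 300, 340), ("s5", "A", 350, 355), ("s6", "A", 350, 380),
    ("s7", "B", 350, 360), ("s8", "B", 350, 370),
]

# Group current scores by prior score so that each student is compared only
# with peers who had the same starting point.
peers = defaultdict(list)
for _, _, prior, current in records:
    peers[prior].append(current)


def growth_percentile(prior, current):
    """Percentage of same-prior-score peers whose current score is lower."""
    group = peers[prior]
    return 100 * sum(c < current for c in group) / len(group)


# Collect each school's student growth percentiles and take the median.
by_school = defaultdict(list)
for _, school, prior, current in records:
    by_school[school].append(growth_percentile(prior, current))

median_sgp = {school: median(values) for school, values in by_school.items()}
print(median_sgp)
```

Nothing in this calculation refers to student or school characteristics. Two schools serving very different populations are ranked solely on how their students' current scores compare with those of peers who had the same prior score.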
Simple value-added models
As shown above, initial simplistic efforts to accurately assess student growth were highly problematic even when assuming that student and school characteristics had no influence on student test scores. As efforts to hold teachers and principals accountable for student performance proliferated, researchers developed statistical approaches that theoretically avoided the pitfalls of the more simplistic approaches by identifying the amount of student growth attributable to a teacher, school, or principal independent of the effects of student prior scores. These approaches were labeled VAMs since the overarching purpose was to estimate the additional student growth in test scores exhibited by a student above and beyond the growth in scores explained by the student’s previous test score history.
Often, value-added efforts seek to predict a student’s score based on past scores of the student and then compare the student’s actual achieved score with her or his predicted score. Based on this comparison, the student is then placed into one of three student groups: students whose scores were lower than predicted, students whose scores were not statistically different from the predicted scores, and students whose scores were greater than predicted. The individual scores can then be aggregated to the teacher or school level and the teachers and schools placed into various groups depending on the aggregate value-added scores.
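The predict-and-compare logic of a simple VAM can be sketched as follows. This is a deliberately minimal illustration under several assumptions: the records are invented, the prediction uses a single prior score fitted by ordinary least squares rather than a full testing history, and the step of testing whether a difference is statistically significant is omitted. It is not the model used by any particular state or vendor.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (school, prior_score, current_score).
records = [
    ("A", 300, 318), ("A", 340, 352), ("A", 380, 396), ("A", 420, 431),
    ("B", 300, 305), ("B", 340, 347), ("B", 380, 383), ("B", 420, 428),
]

prior = [r[1] for r in records]
current = [r[2] for r in records]

# Ordinary least squares with one predictor: current = intercept + slope * prior.
prior_mean, current_mean = mean(prior), mean(current)
slope = sum((p - prior_mean) * (c - current_mean) for p, c in zip(prior, current)) / sum(
    (p - prior_mean) ** 2 for p in prior
)
intercept = current_mean - slope * prior_mean

# A student's "value added" here is the actual score minus the predicted score.
residuals = defaultdict(list)
for school, p, c in records:
    residuals[school].append(c - (intercept + slope * p))

# Aggregate to the school level by averaging the student residuals.
school_value_added = {school: mean(values) for school, values in residuals.items()}
print(school_value_added)
```

In this constructed example, School A's students score about 4 points above prediction on average and School B's about 4 points below. The only predictor is the prior score, so any influence of student or school characteristics on growth is absorbed into the school's value-added estimate.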
As with SGPs and the other strategies in this section, the primary flaw in this strategy is the failure to adjust the outcomes for student and school characteristics. Advocates of the approach claim the complete student testing history controls for student characteristics such as race, ethnicity, and economically disadvantaged status. For example, the Pennsylvania Department of Education (n.d.) states, “There is NO relationship between demographic variables, such as socioeconomic status, and growth [and, thus, the state’s VAM] does not need to control for demographics” (p. 3). Almost all researchers, however, disagree with such a statement and almost universally agree that simple VAMs should not be used to isolate the impact of teachers, schools, or principals on student growth (B. D. Baker et al., 2013; Branch, Hanushek, & Rivkin, 2012; Chiang, Lipscomb, & Gill, 2012; Ehler, Koedel, Parsons, & Podgurskey, 2013). As a metric of principal effectiveness, then, simple VAM results would be inaccurate and would be inappropriate for use in a personnel evaluation system.
Approach 1, Assumption B: Student and School Characteristics Influence Test Scores
Despite the numerous states that have adopted school and principal evaluation systems that exclude student and school characteristics from estimates of school and principal effectiveness, almost all researchers agree both student and school characteristics influence student test scores (B. D. Baker et al., 2013; Branch et al., 2012; Chiang et al., 2012; Ehler et al., 2013). This agreement is illustrated in Figure 4, which portrays the factors included in this approach and the assumptions of the approach.

Figure 4. Description of the factors included in Approach 1, Assumption B.
Agreement on their influence, however, has not led to general consensus about the most appropriate statistical strategy to identify these influences. Three general strategies nonetheless appear to be either relatively widely used or to have received substantial empirical support: one-step VAMs, SGP/VAM hybrids, and two-step VAMs.
One-step VAMs
One-step VAMs are similar to simple VAMs in that the method typically relies on ordinary least squares regression. Yet, one-step VAMs move beyond simple VAMs in a very important way: they attempt to “level the playing field” across schools (and principals) by parceling out the influence of factors not under the control of teachers, schools, or principals such as student demographics, school facilities, school size, prior scores, and so on. By leveling the playing field, we mean estimating principal effectiveness in such a way that every principal has an equal chance of being evaluated as effective regardless of the characteristics of the school in which the principal works. The ultimate goal of this strategy, in fact, is to isolate the impact, or value added, of a teacher, school, or principal with respect to changes in student test scores after adjusting the results for prior student scores and the influence of factors not under the control of teachers, schools, or principals. Although one-step VAMs are considered to be far more appropriate than the five strategies described above, there are still important flaws in such approaches. First, one consistent issue raised about VAMs is that the results are unstable: the estimates can fluctuate rather dramatically from one year to the next (E. L. Baker et al., 2010). Moreover, research suggests that these fluctuations are driven more by measurement error than by actual changes in the effectiveness of teachers, schools, or principals (E. L. Baker et al., 2010). Although the instability is more egregious at the teacher level, the same issues with instability arise at the school and principal levels.
Second, there is a growing consensus among current researchers in this area that the inclusion of student and school characteristics in a VAM fails to capture all the differences across schools. For example, including the percentage of students participating in the federal free or reduced-price lunch (FRPL) program does not adequately capture the influence of poverty on student test scores. Indeed, as Berliner (2006) notes, participation rates in FRPL across schools and even within schools mask very large differences in the underlying causal factors associated with poverty that influence student test scores, namely, the number of stressors affecting the child. Thus, inaccuracy of control variables affects the accuracy with which researchers can estimate school and principal effectiveness.
In addition, there are many “unobserved” student and school characteristics that influence test scores. The factors are “unobserved” because there is typically no data available on the factors or the factors are simply impossible to quantify in any accurate manner. For example, school facilities have been found to affect student test scores (Hanushek, 2003), yet most states do not collect or make available such data. The omission of these unobserved characteristics creates inaccurate estimates of school and principal effectiveness.
Finally, even though one-step VAMs attempt to make fair and unbiased comparisons of principals across schools such that any principal has an equal chance to be considered effective, extant research on one-step VAMs strongly suggests that such methods do not, in fact, level the playing field across principals in this way (Ehler et al., 2013). Indeed, Ehler et al. convincingly argue that one-step VAMs rarely, if ever, are able to level the playing field across schools in such a way that all principals have an equal chance of being evaluated as effective because of the very nature of the statistical approach taken. Thus, efforts to estimate school (and, therefore, principal) effectiveness result in biased estimates. That is, the estimates of school (and principal) effectiveness are still correlated with certain school characteristics that are outside the control of schools and principals. For example, estimates could be biased with respect to the percentage of economically disadvantaged students such that results favor schools enrolling low percentages of economically disadvantaged students and high percentages of high-achieving students.
SGP and VAM hybrid
Although SGPs typically do not attempt to control for student or school characteristics, a hybrid approach can be employed in which an SGP strategy is conjoined with a typical VAM strategy by including student or school characteristics in the calculations (B. D. Baker et al., 2013). New York, in fact, has adopted such an approach to employ in estimating teacher, school, and principal effectiveness (B. D. Baker et al., 2013). This approach addresses the serious criticism levied at the SGP strategy, namely, that it ignores the impact of factors outside the control of educators.
The use of such a hybrid approach, however, does not address all of the methodological issues associated with the SGP and VAM approaches. Although there are certainly other serious issues, perhaps the most important criticism of this approach is that the estimates are biased such that the playing field is not leveled across schools or principals. For example, the hybrid approach employed by New York is clearly biased against schools with high concentrations of economically disadvantaged students (King & McIntosh, 2011). In other words, even after the use of the hybrid approach, there remained a correlation between the estimates of student growth and the percentage of economically disadvantaged students in the school. If the approach were, in fact, unbiased, such a correlation would not exist. As Ehler et al. (2013) note, even a one-step VAM does not fully control for factors outside the control of educators and thus does not truly “level the playing field” across schools and principals.
Two-step VAMs
As noted above, one fundamental characteristic of effective evaluations, particularly those used to make high-stakes decisions, is that comparisons of individuals or groups must be conducted in a fair manner. In other words, high-stakes evaluations must be built on a system that levels the playing field across the individuals or entities being judged. Importantly, none of the strategies described above levels the playing field across all schools or principals.
Ehler et al. (2013) tested numerous strategies to estimate school effectiveness and concluded that only a two-step VAM adequately levels the playing field across all schools and, hence, principals. Rather than rely on only one regression equation, a two-step VAM relies on two successive regression equations with the second equation employing the results from the first equation. The first equation controls for a host of student and school factors, including prior achievement, and “partials out differences in test-score performance between students with different characteristics, and in different schooling environments, before estimating the school effects” (p. 9). The second equation uses the results from the first equation to accurately assess school effects in a manner that levels the playing field across all schools. In other words, the first equation is used to control for all of the student and school characteristics that influence student test scores outside of the control of educators, and then the second equation uses the results from the first equation to identify estimates of school effectiveness. This strategy, in the words of Ehler et al., “levels the playing field across schools so that ‘winners’ and ‘losers’ are representative of the system as a whole” (p. 24).
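The two-equation logic can be illustrated with a small sketch. The simplifications here are assumptions made for illustration and are not the specification used by Ehler et al.: the first step is an ordinary least squares regression of current scores on one prior score, a student disadvantage flag, and the school's share of disadvantaged students, and the second step is reduced to averaging the first-step residuals by school, which is what a regression of those residuals on school indicators alone would return. All data and names are invented.

```python
from collections import defaultdict

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical students: (school, prior score, disadvantaged flag, current score).
students = [
    ("A", 62, 0, 68), ("A", 55, 0, 60), ("A", 70, 0, 74), ("A", 48, 1, 50),
    ("B", 50, 1, 57), ("B", 66, 0, 72), ("B", 45, 1, 49), ("B", 58, 0, 65),
    ("C", 42, 1, 43), ("C", 51, 1, 52), ("C", 60, 0, 63), ("C", 47, 1, 46),
    ("D", 53, 1, 51), ("D", 64, 0, 65), ("D", 49, 1, 48), ("D", 57, 0, 58),
]

# School-level control: share of disadvantaged students in each school.
flags = defaultdict(list)
for school, _, disadvantaged, _ in students:
    flags[school].append(disadvantaged)
share = {s: sum(f) / len(f) for s, f in flags.items()}

# Step 1: regress current scores on student and school controls only.
X = [[1.0, prior, dis, share[school]] for school, prior, dis, _ in students]
y = [current for _, _, _, current in students]
beta = ols(X, y)
residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]

# Step 2 (simplified): the school effect is the mean first-step residual.
grouped = defaultdict(list)
for (school, _, _, _), resid in zip(students, residuals):
    grouped[school].append(resid)
school_effect = {s: sum(r) / len(r) for s, r in grouped.items()}

for s in sorted(school_effect):
    print(s, share[s], round(school_effect[s], 2))
```

Because the school's share of disadvantaged students is a first-step control, the student-weighted school effects in this sketch are uncorrelated with that share by construction, which is one way to read the phrase “levels the playing field.”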
Dallas Independent School District employed an approach quite similar to the two-step VAM for multiple years. The first equation controlled for student characteristics, and its variables were referred to as the “fairness variables” because some of them controlled for student background characteristics such as race/ethnicity, sex, age, economically disadvantaged status, English Language Learner status, and prior test scores (Goldstein & Behuniak, 2005). The evaluators then used the residuals from the first equation as the dependent variables in a series of grade-specific hierarchical linear modeling equations that also controlled for school factors. Despite criticism from the research community that the first “fairness index” was not necessary, Thum and Bryk (1997) contended that the strong support from educators for the system was based on the use of the fairness index and explained the longevity of the program. In short, because educators perceived the system had face validity, the educators supported the continuation of the effort.
Although the two-step VAM is arguably the only strategy within this broad approach that levels the playing field across schools, even this strategy is problematic in terms of evaluating principal effectiveness. In general, researchers currently measure student test scores using some variation of the following overly simplistic equation:

$$\Delta\,\text{Student test scores} = \text{Prior test scores} + \text{Student effects} + \text{School effects} \tag{1a}$$
such that changes in student test scores can be explained by prior test scores, student effects, and schooling effects. Thus, Equation 1a is based on the assumption that principals have complete and unilateral control over the schooling process. If we replace school effects with principal effects, then the equation for estimating principal effects becomes the following:

$$\text{Principal effects} = \Delta\,\text{Student test scores} - \text{Prior test scores} - \text{Student effects} \tag{1b}$$
such that principal effects can be calculated by statistically subtracting the effects of prior test scores and student effects from the changes in student test scores. Importantly, then, all the strategies in this section assume a principal has complete and unilateral control over school facilities, the nature of community support for education, peer effects, segregation of schools, characteristics of the families in the school attendance boundary, per pupil funding, teacher quality, teacher turnover, class size, and the plethora of other school factors that affect student test scores.
Yet research has long held that principals do not have complete and unilateral control over all aspects of schooling (Chiang et al., 2012; Davis et al., 2011; Grissom et al., 2012). Indeed, based on sophisticated statistical approaches, a spate of recent high-quality efforts to estimate principal effectiveness has found that most of the schooling effects are not, in fact, the same as principal effects (Branch et al., 2012; Chiang et al., 2012; Coelli & Green, 2012; Dhuey & Smith, 2012; Grissom et al., 2012). Thus, all of the current research uses some variation of the following equation to explain the influence of principals on test scores:

$$\Delta\,\text{Student test scores} = \text{Prior test scores} + \text{Student effects} + (\text{Principal effects} + \text{Other school effects}) \tag{2a}$$
such that changes in student test scores are explained by prior test scores, student effects, and overall school effects that comprise both principal effects and the effects of other school factors not under the control of a principal. To reiterate, in Equation 2a, the influence of the principal is distinct from the influence of other schooling factors and, by extension, principals do not have unilateral control over many of the schooling factors associated with student test scores.
Given that prior research has found that principal and schooling effects are distinct, there is consensus among the research community that those schooling factors not under the control of principals must be identified and removed from estimates of school effectiveness to arrive at an estimate of principal effectiveness (Chiang et al., 2012; Grissom et al., 2012). In short, principal effectiveness is estimated using the following equation:

$$\text{Principal effects} = \Delta\,\text{Student test scores} - \text{Prior test scores} - \text{Student effects} - \text{Other school effects} \tag{2b}$$
In this model, prior test scores, student effects, and the effects of school factors outside the control of the principal must be subtracted from estimates of overall school effectiveness. Although this may seem a relatively straightforward endeavor, isolating principal effectiveness is more difficult than it would appear.
Approach 2: Principal Effectiveness Is Best Measured by Within-School Effectiveness
To isolate principal effectiveness from overall school effectiveness, researchers must first identify the effects of prior test scores, student characteristics (both observed and unobserved), and schooling effects other than those under the control of the principal. To address the issue of the effects of other schooling factors outside the control of principals, researchers must estimate the effects of unobserved schooling factors on the changes in student test scores. Researchers use the term unobserved in the sense that such factors are not systematically collected and made available to researchers or, in some cases, are simply impossible to accurately quantify. To accomplish this, researchers typically employ what is called a school fixed-effects approach in the statistical analysis. This allows researchers to separate out the effects of the unobserved characteristics of schools that influence changes in student test scores, at least those characteristics that are stable over time (Burkhauser et al., 2012). This approach is displayed in Figure 5, with factors included in this approach denoted by solid lines and no shading and factors not included in this approach denoted by dashed lines and shading.

Figure 5. Description of the factors included in Approach 2.
Although school fixed-effects models can control for both the observed and unobserved characteristics across schools, the approach also forces the analysis to compare test scores of students in a school to the test scores of students in the very same school in prior years. Because comparisons are made within the same school, researchers must rely on principal turnover to create comparisons between principals leading the same school. To include this change in leadership in the model, researchers employ a principal fixed-effects model.
This strategy, then, addresses the critique that relying solely on estimates of school effectiveness to measure principal effectiveness inappropriately comingles principal effects with school effects outside the control of principals. If the underlying approach to estimating school effects employs any strategy other than a two-step VAM, however, then the estimates of both school and principal effectiveness will still be inaccurate, because the model will be biased.
Let us assume that efforts to estimate principal effectiveness rely on an initial two-step VAM strategy as well as a strategy that employs both school and principal fixed effects in an effort to control for the unobserved characteristics across schools and to disentangle the influence of schooling factors from the influence of principals. Would employing such a strategy provide estimates of principal effectiveness that would allow valid, reliable, and precise judgments to be made as part of principal evaluation and accountability efforts? The answer, unfortunately, is no. Even after the employment of such a sophisticated approach, there would remain serious methodological issues that render the results unsuitable for use in high-stakes evaluation and accountability efforts.
Although there are other critiques of such an effort, there are four primary, interrelated critiques. First, this approach requires a substantial amount of data over numerous years because the approach relies on comparing the impacts of principals in schools in which leadership changes have occurred. The shorter the time frame used in this approach, the fewer principals could be included in the estimation of effectiveness, because fewer changes in principals at the same school would be included in the analysis. Using Pennsylvania and Texas data, Chiang et al. (2012) and Branch et al. (2012), respectively, found that a substantial number of principals would not be included under such an approach. Even under fairly long time frames such as a decade, not all principals would be included in the analysis. The exclusion of a significant proportion of principals from the analysis would violate the very notion of having a statewide principal evaluation plan.
The second primary critique of this approach is that the comparison group for a principal would necessarily be a very small number of other principals, often only one or two (Chiang et al., 2012; Grissom et al., 2012). For example, suppose Principal A leaves the profession or moves out of state and beginning Principal B takes over as school leader of School X as shown in Figure 6. The use of school and principal fixed effects would limit the comparison group for Principal B to only one other principal, namely, Principal A. This small comparison group would obviously create serious problems in determining the effectiveness of a principal. For example, if Principal A were a highly effective principal with an estimated effect of 0.8, then Principal B would be more likely to appear to be a relatively ineffective principal even if her or his effectiveness was positive. Indeed, under the scenario shown in Figure 6, Principal B would likely be considered ineffective. On the other hand, if Principal A were an ineffective principal, then Principal B would be more likely to appear to be an effective principal. This would be true even if the true effectiveness of Principal B was the same in both scenarios. Thus, even though Principal B would theoretically have the same effectiveness, the actual estimates could differ depending entirely on the effectiveness of the preceding principal (Grissom et al., 2012).

Figure 6. Connected network of comparison principals in one school.
In another example, let us assume Principal A worked in School X and then transferred from School X to School Y to take the place of Principal C, whereas Principal B replaced Principal A in School X (see Figure 7). In such a scenario, the effectiveness of Principals A, B, and C could be compared assuming enough data existed. In this case, Principal B could be compared to two other principals rather than just one, but none of the principals could be compared to any more than the two other principals within the connected network of schools.

Figure 7. Connected network of comparison principals in two schools.
As in the first scenario, the effectiveness of Principal B would be largely determined by whether Principals A and C were effective or ineffective. In Figure 7, Principal B, with an estimated effectiveness of +0.1, might be considered to have average effectiveness because Principal A was more effective (+0.8) whereas Principal C was less effective (−0.4). Importantly, as Grissom et al. (2012) note, any inequitable distribution of principals would exacerbate this issue across schools since less experienced and less qualified individuals are employed in lower performing, high-poverty schools (Clotfelter, Ladd, Vigdor, & Wheeler, 2007; Horng, Kalgorides, & Loeb, 2009).
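The dependence on the comparison group can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each principal's estimate is expressed relative to the average of her or his connected network; the function name is hypothetical, the values 0.8, 0.1, and −0.4 are those used in the examples above, and the weak-predecessor case reuses −0.4 as an invented value for Principal A.

```python
def relative_effects(network):
    """Express each principal's effect relative to the mean of the connected network."""
    center = sum(network.values()) / len(network)
    return {name: effect - center for name, effect in network.items()}

# Principal B's own effect is held at 0.1 in every scenario.
strong_predecessor = relative_effects({"A": 0.8, "B": 0.1})            # one school
weak_predecessor = relative_effects({"A": -0.4, "B": 0.1})             # one school, hypothetical A
two_schools = relative_effects({"A": 0.8, "B": 0.1, "C": -0.4})        # two schools

for label, result in [("strong predecessor", strong_predecessor),
                      ("weak predecessor", weak_predecessor),
                      ("two-school network", two_schools)]:
    print(label, round(result["B"], 2))
```

Under this assumption, the same Principal B looks below average next to a strong predecessor, above average next to a weak one, and close to average in the three-principal network.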
A third critique of this approach is that it does not control for principal experience. Suppose, for example, that Principal A in Figure 7 had 10 years of experience as a principal and Principals B and C were both beginning principals. Principal A would likely be estimated to have greater effectiveness than Principals B or C simply due to greater experience. Because most research has found principal effectiveness increases over time, particularly in the first 3 years of tenure (Branch et al., 2012; Chiang et al., 2012; Coelli & Green, 2012; Dhuey & Smith, 2012; Grissom et al., 2012), comparing the effectiveness of principals with differing levels of experience within the same network of schools is problematic. Indeed, Branch et al. (2012) argue, “The impact of a principal on school quality likely increases with tenure given the persistence of personnel and other decisions” (p. 10). Consequently, Chiang et al. (2012) included principals with 3 or fewer years of experience or principals in networks of connected schools with at least one principal with 3 or fewer years of experience. Similarly, Branch et al. (2012) focused on principals with 3 or fewer years of experience in the same connected network of principals in order to control for the considerable influence experience might have on effectiveness.
Unfortunately, adding this restriction further reduced the overall sample of principals included in the estimation of principal effects to such a degree that both sets of authors concluded that such an approach could not be used to evaluate all principals in a state. Indeed, Branch et al. (2012) summarized,
Importantly, the restriction of the sample to the first three years in a school is not feasible in school fixed-effects models that identify principal effectiveness on the basis of within-school achievement differences, because the numbers of schools with two principals observed in their first three years is quite small. (p. 16)
A fourth critique of this approach is the short time frame employed if principal effectiveness were to be estimated after only 1 or 2 years of leading a school and the estimations were used to make high-stakes judgments about principals. Research has consistently shown that principal effectiveness increases over time, particularly within the beginning years of acting as a school leader; and more important, it strongly suggests that principal effectiveness is not linear but builds on itself over time at the same school (Branch et al., 2012; Chiang et al., 2012; Coelli & Green, 2012; Dhuey & Smith, 2012; Grissom et al., 2012). Using this approach in a principal’s first year to make a high-stakes judgment would possibly underestimate the principal’s effectiveness. Given this problem, Grissom et al. (2012) suggest a more accurate approach would take into account how principal effectiveness builds over time:
Much of what a good principal may do is improve the school through building a productive work environment (e.g., through hiring, professional development, and building relationships), which may take several years to achieve. If so, we may wish to employ a principal effects model that accounts for this time dimension (p. 13).
Approach 3: Principal Effectiveness Is Best Measured by School Improvement at the Same School
Grissom et al. (2012) suggest a third approach—capturing the improvement of a school over time under the same principal. The factors included in such an approach are portrayed in Figure 8. Under this approach, statistical estimates are employed that compare a principal’s effectiveness in Year X to her or his effectiveness in the same school in years X − 1 and X − 2. Such an approach certainly seems to be a commonsense method for fairly estimating the impact of a principal on student test scores. But simple calculations, such as subtracting the percentage of students passing or scale scores from one year to the next, are not enough to make such an estimate an accurate indicator of principal effectiveness. This method, in fact, would call for a sophisticated methodology very similar to the approach used in Approach 2, although it would also add a term in the equation to measure the tenure of a principal at the school.

Figure 8. Description of the factors included in Approach 3.
Because of the similarity of Approach 2 and Approach 3, both approaches adequately address the deficiencies of Approach 1, particularly if a two-step VAM is employed. Moreover, Approach 3 averts the issue of the idiosyncratic comparisons created by the small group of connected principals in Approach 2 by comparing a principal’s effectiveness to only her or his own prior effectiveness at the same school. Despite these advantages, Approach 3 also has serious drawbacks with respect to utilization as a strategy to evaluate all principals in a state or district.
As with the other sophisticated approaches, this approach requires substantial data (Grissom et al., 2012) and requires a principal to enter and remain at a school for at least a 3-year time period. This two-part requirement would substantially limit the percentage of principals in the evaluation in two ways. First, most principals are not new to a school in any given year. Second, a substantial proportion of newly hired principals do not remain at the same school for 3 consecutive years (Fuller & Young, 2009). In the sample used by Grissom et al. (2012), the percentage of principals subject to the estimates of effectiveness under this approach was only 30%. If only a minority of principals could be evaluated each year, then the RttT or NCLB waiver requirement that principals be evaluated every year could not be met.
In addition, as with all estimates of changes in student test scores, there is measurement error, and as noted by Grissom et al. (2012), research suggests that “differencing these imperfectly measured variables to create a principal effectiveness measure increases the error” (p. 22). The measurement error may, in fact, be so large that the estimates are simply too noisy to be useful (Grissom et al., 2012).
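One way to see why differencing inflates error is the following worked step. It rests on a textbook assumption introduced here for illustration and not stated in the source: each year's estimate equals a true value plus independent noise with a common variance $\sigma^2$. The symbols $\hat{\theta}_t$, $\theta_t$, and $e_t$ are likewise introduced only for this sketch.

```latex
\begin{align}
\hat{\theta}_t &= \theta_t + e_t, \qquad \operatorname{Var}(e_t) = \sigma^2, \qquad \operatorname{Cov}(e_t, e_{t-1}) = 0 \\
\hat{\theta}_t - \hat{\theta}_{t-1} &= (\theta_t - \theta_{t-1}) + (e_t - e_{t-1}) \\
\operatorname{Var}(e_t - e_{t-1}) &= \operatorname{Var}(e_t) + \operatorname{Var}(e_{t-1}) - 2\operatorname{Cov}(e_t, e_{t-1}) = 2\sigma^2
\end{align}
```

Under these assumptions, the year-to-year difference carries twice the error variance of a single year's estimate, while the quantity of interest, the true change, is often small by comparison.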
Furthermore, as with Approach 2, the prior pattern of test scores at the school could compromise the accuracy of the estimate of principal effectiveness (Grissom et al., 2012). For example, a newly hired principal at a school whose test score trajectory was already increasing would benefit from the positive influence of the previous principal and appear more effective than she or he would if the school test score profile had been stagnant or decreasing. This could have the unintended consequence of creating a perverse incentive for principals to attempt to move to the schools already showing improvement.
Use of the Various Approaches
As shown above, there are a number of different strategies states could employ to estimate principal effectiveness. Given that this area is extremely new and states are constantly revising their decisions while still developing their models (Davis et al., 2011; Jacques et al., 2012), accurately quantifying the strategies currently in use is extremely difficult. However, we estimate the number and percentage of states using various strategies based on data collected on 35 states by the National Comprehensive Center on Teacher Quality and data collected through our own internet searches on the states not included in the National Comprehensive Center on Teacher Quality data. Based on these data, we found 13 states have adopted some form of SGPs/MGPs, 12 states have adopted some form of value-added (at least 4 states have adopted a simple VAM whereas the others have adopted a one-step value-added approach), 7 states have adopted some combination of strategies or allow districts to adopt their own strategy (all of which rely on strategies described in Approach 1), and 19 states provide no information on principal evaluation systems outside of evaluations conducted by supervisors. Thus, of the 32 states that have adopted and provided a description of a model, at least 24 have adopted models assuming that student characteristics and/or school characteristics do not influence test scores.
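The tallies in this paragraph can be checked with simple arithmetic. The sketch below uses only the counts reported above; the dictionary keys are shorthand introduced here, not categories named in the source data.

```python
# Counts as reported in the text; keys are illustrative shorthand.
counts = {
    "SGP/MGP": 13,
    "value-added": 12,
    "combination or district choice": 7,
    "no information": 19,
}

# States that adopted and described a model.
with_model = sum(n for key, n in counts.items() if key != "no information")

# All jurisdictions covered by the tally.
total = with_model + counts["no information"]

# Share of described models that ignore student and/or school characteristics,
# using the "at least 24" figure from the text.
share_unadjusted = 24 / with_model

print(with_model, total, f"{share_unadjusted:.0%}")
```

The 24 of 32 computed here corresponds to three quarters of the states with a described model.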
Conclusions
Although many states have promised to evaluate all principals in their RttT applications or NCLB waivers, a substantial number of the remaining states have also embarked on efforts to evaluate all principals. Almost all of these efforts were initiated prior to the existence of research on such efforts. Based on the research in this area, we can unequivocally conclude that even the most sophisticated and thoughtful efforts to estimate principal effectiveness are flawed and produce inaccurate results. Indeed, Grissom et al. (2012) conclude their study by stating,
It is important to think carefully about what the measures [of principal effectiveness] are revealing about the specific contribution of the principal and to use the measures for what they are, which is not as a clear indicator of principals’ specific contributions to student test score growth. (p. 34)
If, in fact, states use these estimates, the signals sent to both principals and evaluators of principals would be inaccurate, leading to the unintended consequence of creating incentives for principals to act in unproductive ways and for principal supervisors to make unwise decisions about hiring, firing, rewards, and sanctions.
Even if the estimates of principal effectiveness were accurate, researchers appear to be in agreement that a significant proportion of principals could not even be included in the best approaches (Branch et al., 2012; Chiang et al., 2012; Coelli & Green, 2012; Dhuey & Smith, 2012; Grissom et al., 2012). Importantly, the principals necessarily excluded from such evaluation systems would be principals with 3 or fewer years of experience—the very principals policy makers are most desirous of evaluating.
Perhaps most disheartening is that 75% of the states that have adopted a strategy to estimate principal effectiveness have chosen strategies that are incredibly simplistic. Indeed, policy makers in such states assume that principal effectiveness can be measured by student test scores without adjusting for the influence of other factors. Such approaches are unsubstantiated by research in educational leadership or in measurement and statistics and are generally considered the “worst” of the approaches described in this article, since the strategies make absolutely no attempt to control for the myriad different factors that influence student test scores.
In sum, many, if not most, principals could not be included in the highest quality efforts to estimate effectiveness, the estimates themselves simply do not accurately reflect the independent contributions principals make to student changes in test scores, and most states have adopted the most simplistic of efforts to estimate principal effectiveness. Thus, without question, using student test scores to estimate principal effectiveness is simply building a bridge too far.
Discussion and Policy Recommendations
Our review of the literature points to several important policy recommendations. First, and foremost, states and districts should not employ statistical estimates of principal effectiveness in any high-stakes decisions. At best, even the most sophisticated efforts are very rough estimates of a principal’s contributions to student achievement. At worst, simple efforts used by many states such as SGPs, simple VAMs, and one-step VAMs produce wildly inaccurate results that would be biased against principals in lower performing and/or high-poverty schools.
Some might argue that the negative impact of a flawed statistical estimate of principal effectiveness would be mitigated if the estimate were included in a multiple measures model (MMM) of principal effectiveness. Yet the very notion of an MMM is that different aspects of a construct are combined into one measure. In this case, the overarching construct is principal effectiveness. Given the unreliability and inaccuracy of estimates of principal effectiveness, the inclusion of statistical estimates of principal effectiveness in an MMM would not necessarily result in a reliable and accurate measure of principal effectiveness (B. D. Baker et al., 2013). This is particularly true if the statistical estimate becomes the “tipping point” for making decisions because of the greater weight and/or the greater variation in this particular measure (B. D. Baker et al., 2013).
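The “tipping point” concern can be made concrete with a small numerical sketch. Everything in it is hypothetical: the two measures, the 70/30 weights, the five sets of ratings, and the function names are invented for illustration and are not drawn from any state system. The sketch compares each component's nominal weight with its share of the summed weighted variances, ignoring covariance between components for simplicity.

```python
from statistics import pstdev

def variance_shares(components, nominal_weights):
    """Each component's weighted variance as a share of the summed weighted
    variances (covariance between components is ignored: a simplification)."""
    contrib = [(w * pstdev(scores)) ** 2
               for scores, w in zip(components, nominal_weights)]
    total = sum(contrib)
    return [c / total for c in contrib]

def rank_order(values):
    """Indices of principals sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical ratings for five principals on a 0-100 scale.
observation = [78, 80, 79, 81, 82]   # supervisor ratings cluster tightly
test_based = [20, 50, 90, 35, 75]    # statistical estimate varies widely

weights = [0.7, 0.3]                 # nominal weights: 70% observation, 30% test-based
shares = variance_shares([observation, test_based], weights)

# Weighted composite score for each principal.
composite = [weights[0] * o + weights[1] * t
             for o, t in zip(observation, test_based)]

print("share of spread from test-based measure:", round(shares[1], 3))
print("composite ranking: ", rank_order(composite))
print("test-based ranking:", rank_order(test_based))
```

In this invented example the test-based component carries a nominal weight of 30% yet accounts for roughly 98% of the summed weighted variance, and the composite ranking of the five principals is identical to the ranking on the test-based measure alone.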
What, then, should policy makers do? Based on the available evidence, we strongly suggest policy makers not use statistical estimates of principal effectiveness to rank, judge, or evaluate principals in any high-stakes manner. This caution would include decisions about hiring, firing, salary, and merit pay. If statistical estimates of principal effectiveness are to be used at all, they should be used for discussion purposes, for informational purposes, and/or as a screening device. For example, a principal and her or his colleagues and supervisors might judiciously use the results to discuss the progress of the school relative to other schools in very similar situations. In addition, based on multiple years of data and a VAM approach that controlled for, at a minimum, student and school characteristics, a state or district could use the estimates of principal effectiveness to decide where to spend time and energy obtaining additional evidence on principals who were found to be particularly ineffective or effective. This would be analogous to how hospitals operate—employ some blunt diagnostic tools to provide information about where and how to collect further information about a patient before making decisions about treatment options.
Finally, policy makers simply need to be more patient and thoughtful about this work. Although there is considerable pressure from other state policy makers, philanthropists, and the U.S. Department of Education to move ahead quickly “for the children,” the history of policy implementation is replete with examples of hurriedly implemented policies that caused more harm than good (Tyack, 1974, 1995). A rush to implement principal evaluation policies based on statistical estimates is likely to send the wrong signals to principals about their own behavior and the wrong signals to evaluators of principals. Adopting such efforts, in fact, could create a situation that drives good principals out of the profession or toward particular schools while concomitantly creating a huge disincentive for well-qualified individuals to even enter school leadership positions. In short, reckless efforts to meet RttT or NCLB waiver mandates could easily cause more harm than good.
Footnotes
Appendix A
A Comparison of Models to Assess School Effectiveness.
| Model | Variables Used | Method | Strengths | Weaknesses |
|---|---|---|---|---|
| Approach 1: Principal effectiveness measured by school effectiveness | ||||
| Assumption A: Student and school characteristics do not influence student growth | ||||
| Change in percentage of students passing, proficient, advanced, etc. | Percentage of students meeting a standard in Year X and in Year X − 1 | Simple subtraction | Easy to understand for parents, community members, and educators; calculation can be completed by anyone | Calculations result in incorrect identification of student growth; student growth can be masked; method was designed for descriptive purposes only, making attribution impossible; does not level the playing field |
| Change in scale scores | Average student scale scores in Year X and Year X − 1 | Simple subtraction | Relatively easy to understand for parents, community members, and educators; easy to calculate if data are available | Scales are often not comparable across years, grades, or forms of the test; method was designed for descriptive purposes only, making attribution impossible; does not level the playing field |
| Change in z scores | Average student z scores in Year X and Year X − 1 | Simple subtraction | Allows for comparisons across years, grades, and test forms; provides information relative to all other schools; relatively easy to understand for parents, community members, and educators | Method was designed for descriptive purposes only, making attribution impossible; method does not control for the student or school factors outside the control of educators, thus should not be used for attribution; does not level the playing field |
| Student growth percentiles (SGPs) and median growth percentiles (MGPs) | Student scores in Year X and Year X − 1 as well as the median score or median growth | Simple subtraction and quantile regression | Allows for easy comparison of students and schools to the median student or school; can also provide information about whether progress was greater or less than expected; relatively easy to understand for parents, community members, and educators | Method was designed for descriptive purposes and does not control for student or school factors, thus attribution is impossible; does not level the playing field |
| Simple value-added model (VAM) | Student scores in Year X and scores from as many previous years and test subjects as possible | Ordinary least squares regression (OLS) analysis | Full controls for prior scores; can be very useful at the individual student level and for diagnostic purposes for students, teachers, and schools | Method does not control for the student or school factors outside the control of educators, thus should not be used for attribution; does not level the playing field; can be unstable and very inaccurate for low-enrollment schools |
| Assumption B: Student and school characteristics influence student growth | ||||
| One-step VAM | Student scores in Year X and X − 1; student and school characteristics | OLS or hierarchical linear modeling (HLM) | Controls for prior scores and observable student and school characteristics | Difficult to understand and calculate; does not control for unobservable student and school characteristics; results still tend not to level the playing field across schools; can be unstable and very inaccurate for low-enrollment schools |
| SGP and VAM hybrid | Student scores in Year X and X − 1, median growth, student and school characteristics | Hybrid of SGP and OLS | Controls for prior scores and observable student and school characteristics | Difficult to understand and calculate; does not control for unobservable student and school characteristics; results still tend not to level the playing field across schools; can be unstable and very inaccurate for low-enrollment schools |
| Two-step VAM | Student scores in Year X and X − 1, student and school characteristics | OLS or HLM | Controls for prior scores and observable student and school characteristics; levels the playing field across schools; not currently in use | Difficult to understand and calculate; does not control for unobservable student and school characteristics |
| Approach 2: Principal effectiveness measured by within-school effectiveness | ||||
| One- or two-step VAM with school and principal fixed effects | Student scores in Year X and X − 1, . . . X − 6; student and school characteristics; multiple principals at same school | OLS or HLM with school and principal fixed effects | Compares principal effectiveness within same school over time, thus controls for unobservable student and school characteristics and partitions out school effectiveness from principal effectiveness | Requires extensive longitudinal data and a sufficient degree of principal turnover; some nontrivial percentage of principals could not be evaluated; principals would be compared to a very small number of other principals |
| Approach 3: Principal effectiveness measured by school improvement at same school | ||||
| One- or two-step VAM with school and principal fixed effects and time trend | Student scores in Year X and X − 1, . . . X − 4; student and school characteristics; same principal at school for 3 consecutive years | OLS or HLM with school and principal fixed effects and time trend | Compares the improvement of a school under the same principal over a 3-year period relative to starting point | Requires extensive longitudinal data and a sufficient degree of principal turnover; a substantial percentage of principals could not be evaluated due to data requirements |
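
The contrast between Assumption A and Assumption B in the table can be illustrated with a minimal sketch on synthetic data. All names and numbers below are hypothetical. Two schools are constructed to be equally effective, but School B enrolls more students in poverty, and poverty is assumed to lower one-year growth by 4 points. The sketch averages OLS residuals by school, which is a simplification chosen for brevity and not the estimation procedure of any model in the table or of any state system.

```python
def ols_residuals(y, rows):
    """Residuals from an OLS fit of y on the given predictors plus an intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    # Back substitution for the coefficients.
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (A[i][k] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return [yi - sum(bj * xj for bj, xj in zip(b, x)) for x, yi in zip(X, y)]

def school_means(residuals, schools):
    """Average residual per school, used here as a crude 'school effect'."""
    out = {}
    for s in sorted(set(schools)):
        vals = [r for r, t in zip(residuals, schools) if t == s]
        out[s] = sum(vals) / len(vals)
    return out

# Synthetic students: both schools add the same 5 points of growth,
# and poverty (an assumed effect) subtracts 4 points regardless of school.
schools = ["A"] * 4 + ["B"] * 4
prior = [40, 50, 60, 70] * 2
poverty = [0, 0, 0, 1, 1, 1, 1, 0]   # School A: 1 of 4 in poverty; School B: 3 of 4
current = [p + 5 - 4 * d for p, d in zip(prior, poverty)]

# Assumption A: control for prior score only.
simple = school_means(ols_residuals(current, [[p] for p in prior]), schools)
# Assumption B: control for prior score and the student characteristic.
adjusted = school_means(
    ols_residuals(current, [[p, d] for p, d in zip(prior, poverty)]), schools)

print("prior score only:      ", simple)    # A about +1.0, B about -1.0
print("prior score + poverty: ", adjusted)  # both about 0
```

With prior scores as the only control, the higher-poverty school appears about 2 points less effective than the other even though the two schools were built to be identical; adding the student characteristic removes the gap. Real data would also contain unobservable factors that no observed control can absorb, which is the weakness the table notes for the Assumption B models.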
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
