Abstract
The evaluation of teachers’ performance in the classroom is an important application of educational testing and psychometrics. School districts are using gain scores, such as the average difference between pre- and posttest scores, to characterize teacher performance in the classroom. Consequently, additional research is needed to understand factors that affect the reliability of gain scores to ensure reliable teaching evaluations. This article examines a class of linear gain scores (LGS) with a modified common factor model to understand the effect of latent variable characteristics on the reliability of observed gain scores. The analytic results provide an upper bound for the reliability of a class of LGS and compare simple difference scores and residualized scores. The results suggest that simple difference scores tend to be more reliable than residualized gain scores whenever strong invariance is satisfied, that is, when latent intercepts and loadings are equal over classrooms. However, residualized gain scores are closer to the optimal reliability when classrooms differ in latent measurement intercepts and loadings. In addition, in contrast with previous conjectures, the results imply that student tracking artificially inflates LGS reliability. The results in this article serve as a guide for researchers who develop and refine methods for measuring student learning gains and evaluating teachers.
Standardized tests are increasingly used to calculate gain scores in an effort to measure student learning and evaluate teacher performance. Prior research has examined the reliability of measures of growth and change (e.g., Bond, 1979; Cronbach & Furby, 1970; Tucker, Damarin, & Messick, 1966; Zimmerman & Williams, 1998). Some special cases of a broader class of linear gain scores (LGS) have become popular measures of student learning as researchers and decision makers attempt to quantitatively evaluate the impact of teachers on students (Battauz, Bellio, & Gori, 2011; Lockwood, McCaffrey, Mariano, & Setodji, 2007; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; Woodhouse, Yang, Goldstein, & Rasbash, 1996). The reliability of gain scores is particularly important given that the popular press has echoed concerns about using statistical models as a means to rate teachers (Garland, 2012; Watanbe, 2011; Winerip, 2011). For instance, many teachers do not understand the statistical models that are used in their evaluation and are concerned that highly reliable but imperfect measures of student content knowledge, such as standardized test scores, may distort ratings that are used for high-stakes decisions such as teacher retention or compensation. Researchers have raised numerous methodological issues with models that employ LGS, ranging from attenuation attributed to test score measurement error (Battauz et al., 2011; Culpepper & Aguinis, 2011; Woodhouse et al., 1996), vertical scaling issues (Briggs & Weeks, 2009), and threats to internal validity such as the persistence of school-level effects (Briggs & Weeks, 2011) to the inability to estimate the causal effect of teachers from observational designs (Raudenbush, 2004; Reardon & Raudenbush, 2009; Rubin, Stuart, & Zanutto, 2004) and the consistency of teacher rankings when using different outcome variables (Papay, 2011).
Applied researchers are relying on gain scores to measure student learning and growth, and this study provides a framework for understanding factors that affect the reliability of classroom-level gain scores. In a manner similar to previous investigations on the connection between prediction and measurement bias (Culpepper, 2012; Millsap, 1997, 1998, 2007), this article uses a modified common factor model (CFM) to highlight important theoretical results pertaining to a class of LGS. This study examines the reliability of LGS in the scenario where differences between classrooms are a function of the underlying measurement model, common factor moments, and latent student growth.
It is important to delineate the contributions of this study to previous research on the reliability of gain scores. Schochet and Chiang (2013) examined classification errors associated with least squares and empirical Bayes estimators of classroom gain scores. Schochet and Chiang made several simplifying assumptions, such as random assignment of teachers to classrooms, perfectly reliable student gain scores, and measurement invariance between classrooms, and recommended that future research should examine how these factors impact the reliability of student growth measures. Vautier, Steyer, and Boomsma (2008) proposed a structural equation model (SEM) to measure student growth between two periods and over multiple methods. Vautier et al. derived the reliability of simple difference scores under the assumption of strong invariance where latent intercepts and loadings are constant over time. The current investigation addresses the aforementioned assumptions. This article defines a general class of LGS for investigating the relative reliability of simple difference and residual scores and uses a latent framework to understand how latent variable model characteristics and the presence of student tracking into pathways with differential growth potential impact the reliability of observed LGS. The results in this article also extend the work of Schochet and Chiang by examining the effect of measurement bias (i.e., classroom differences in the relationship between latent and observed variables) on the reliability of observed student gains.
This article includes four sections. The first section introduces the definition of a class of LGS, in addition to relevant theoretical results. The second section provides an overview of the modified CFM for latent change and derives general expressions used to compute the reliability of observed LGS. The third section examines how issues such as student tracking and measurement bias impact the reliability of observed LGS. The last section provides a discussion of the findings, directions for future research, and concluding remarks.
Measuring Student Growth With LGS
The first subsection introduces the general class of LGS and discusses several special cases that will be examined throughout the remainder of the article. The second subsection includes theoretical information regarding the upper bound to reliability for any LGS.
The General Class of LGS
Researchers have proposed evaluating teachers’ contribution toward student learning by examining the average gain that students make in their test scores over the academic year. Several measures of observed change fall within a general class of LGS. Let
where
For the general class of LGS, let
with constants
Let
The reliability of
Applied researchers employ two special cases of Equations 1 and 2. Simple difference scores arise when
One alternative to simple difference scores is the residualized score, which compares posttest scores to a predicted value based on pretest scores (Cronbach & Furby, 1970; Malgady & Colon-Malgady, 1991). Let
where
The results in this article are valid for any LGS and two additional cases are studied for
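The two special cases named in this subsection can be illustrated numerically. The following is a minimal sketch, not the article's own formulation: it assumes a student-level score of the form posttest minus a constant weight times pretest, averaged within classrooms, with a weight of 1 for simple difference scores and the pooled within-classroom regression slope for residualized scores. The function names, the `weight` argument, and the toy data are hypothetical, and the common regression intercept is omitted because it shifts every classroom by the same constant.

```python
import numpy as np


def pooled_within_slope(pre, post, classroom):
    """Pooled within-classroom slope of posttest on pretest (illustrative)."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    _, idx = np.unique(np.asarray(classroom), return_inverse=True)
    counts = np.bincount(idx)
    # Center each score on its own classroom mean so that only
    # within-classroom variation enters the slope.
    pre_c = pre - (np.bincount(idx, weights=pre) / counts)[idx]
    post_c = post - (np.bincount(idx, weights=post) / counts)[idx]
    return float(np.sum(pre_c * post_c) / np.sum(pre_c ** 2))


def classroom_lgs(pre, post, classroom, weight):
    """Classroom means of the student-level score post - weight * pre."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    _, idx = np.unique(np.asarray(classroom), return_inverse=True)
    gain = post - weight * pre
    return np.bincount(idx, weights=gain) / np.bincount(idx)


# Hypothetical toy data: two classrooms with three students each.
pre = np.array([10.0, 12.0, 14.0, 20.0, 22.0, 24.0])
post = np.array([13.0, 14.0, 15.0, 25.0, 26.0, 27.0])
classroom = np.array([0, 0, 0, 1, 1, 1])

# Simple difference scores: the pretest weight is assumed to be 1.
simple = classroom_lgs(pre, post, classroom, weight=1.0)

# Residualized scores: the pretest weight is the pooled within-classroom slope.
slope = pooled_within_slope(pre, post, classroom)
residualized = classroom_lgs(pre, post, classroom, weight=slope)
```

Treating both scores as members of one family indexed by a single pretest weight makes it easy to compare other weights with the same helper.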
An Upper Bound for the Class of LGS
The previous subsection introduced the LGS and several special cases. This section includes two theorems. The first theorem shows that
Proof. Technical details are included in the Online Appendix.
Theorem 1 implies that the maximum reliability for a given scenario is
Theorem 1 demonstrates that
Proof. Technical details are included in the Online Appendix.
Theorem 2 will be useful for establishing conditions where a value of
The results below compare the reliability of
LGS Under the Modified CFM
The purpose of this section is to discuss the general class of LGS in the context of the modified CFM. This section first introduces the modified CFM as a latent variable framework for describing the change between pre- and posttest scores within classrooms and, second, discusses
The CFM With Student Growth
Let
where
where

Figure 1. Common factor model with a latent gain score,
Equation 10 presents the general case where classrooms vary in CFM parameters such as
It is important to briefly define additional parameters that are used throughout the article. This article examines how variability and dependence among the CFM parameters impact the reliability of
respectively. Several parameters in Equation 12 should be zero in practice. For instance, the average student growth within classroom
The parameters in Equation 12 have important substantive interpretations. First,
It is important to consider the effect that a relationship between
Another issue to examine is the impact of using test scores that satisfy strong invariance (Meredith, 1993; Millsap & Kwok, 2004), which requires
Similarly, violating weak invariance implies that
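To make the setup of this subsection concrete, the sketch below simulates pre- and posttest scores from a toy common factor model with a latent gain. It is an illustration under stated assumptions rather than the article's Equation 10: the parameter names, default values, normal distributions, and the choice to draw separate classroom-specific intercepts and loadings for each occasion are all hypothetical. The squared correlation between observed classroom scores and the simulated classroom average latent gain is used only as a rough proxy and is not the article's reliability expression.

```python
import numpy as np


def simulate_cfm(n_class, n_students, loading_sd=0.0, intercept_sd=0.0,
                 error_sd=1.0, growth_sd=0.5, seed=0):
    """Simulate pre/post scores from a toy common factor model with latent gain.

    All parameter names and defaults are hypothetical. Setting loading_sd and
    intercept_sd to zero makes loadings and intercepts equal over classrooms
    (strong invariance in this toy setup).
    """
    rng = np.random.default_rng(seed)
    # Classroom-specific measurement parameters for each occasion.
    load_pre = 1.0 + loading_sd * rng.standard_normal(n_class)
    load_post = 1.0 + loading_sd * rng.standard_normal(n_class)
    int_pre = intercept_sd * rng.standard_normal(n_class)
    int_post = intercept_sd * rng.standard_normal(n_class)
    # Classroom average latent growth.
    mean_growth = 1.0 + growth_sd * rng.standard_normal(n_class)

    classroom = np.repeat(np.arange(n_class), n_students)
    n = classroom.size
    eta_pre = rng.standard_normal(n)  # latent pretest score
    delta = mean_growth[classroom] + 0.5 * rng.standard_normal(n)  # latent gain
    pre = (int_pre[classroom] + load_pre[classroom] * eta_pre
           + error_sd * rng.standard_normal(n))
    post = (int_post[classroom] + load_post[classroom] * (eta_pre + delta)
            + error_sd * rng.standard_normal(n))
    # Average latent gain actually realized in each simulated classroom.
    true_gain = np.bincount(classroom, weights=delta) / n_students
    return pre, post, classroom, true_gain


def squared_corr(a, b):
    """Squared Pearson correlation, used here as a rough reliability proxy."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)


# Strong invariance in this toy setup: no classroom variation in
# intercepts or loadings.
pre, post, classroom, true_gain = simulate_cfm(n_class=200, n_students=25)
observed = np.bincount(classroom, weights=post - pre) / 25
proxy = squared_corr(observed, true_gain)
```

Raising `loading_sd` or `intercept_sd` above zero lets a reader explore how classroom differences in measurement parameters change the proxy in this toy setup.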
The Reliability of LGS Under the CFM
Equations 13 through 17 include variances of products and covariances involving products of random variables (Bohrnstedt & Goldberger, 1969). For instance,
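As a hedged numerical aside (not one of the article's numbered equations), the simplest identity of this kind concerns the product of two independent random variables X and Y: Var(XY) = E[X]^2 Var(Y) + E[Y]^2 Var(X) + Var(X) Var(Y). The sketch below checks that special case by simulation; the function name and the chosen means and variances are hypothetical, and the general results for dependent variables contain additional covariance terms that are not shown here.

```python
import numpy as np


def var_product_independent(mean_x, var_x, mean_y, var_y):
    """Exact variance of X * Y when X and Y are independent."""
    return mean_x ** 2 * var_y + mean_y ** 2 * var_x + var_x * var_y


rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=1_000_000)  # mean 2, variance 1
y = rng.normal(3.0, 2.0, size=1_000_000)  # mean 3, variance 4 (sd 2)

analytic = var_product_independent(2.0, 1.0, 3.0, 4.0)
simulated = float(np.var(x * y))  # should be close to the analytic value
```

In a model where a classroom-varying parameter multiplies a latent score, identities of this type are what allow variances of the observed scores to be written in terms of the moments of each factor.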
An expression for the pooled slope as a function of common factor parameters is available by noting that the expected within-classroom pretest variance and pre- and posttest covariance are generally defined as
where
The Reliability of Observed Teacher Effects
This section discusses the reliability of
The figures discussed in the following two scenarios examine reliability of
Scenario 1: Strong Invariance and Student Tracking
This subsection explores the relative reliability of
In the case where
The equation for the reliability of
Equation 21 demonstrates that student tracking affects
Equation 21 and Theorem 1 imply that
Proof. Technical details are included in the Online Appendix.
We can examine general conditions where residualized scores outperform gain scores when factors that systematically promote aggregate growth are nonrandomly assigned to classrooms. Clearly, if strong invariance is satisfied, simple difference scores will be optimal whenever
where
Figure 2 examines the reliability of

Figure 2. Contours of impact of
As noted in Equation 21,
Scenario 2: Measurement Bias and Student Tracking
This scenario examines the reliability of
It is immediately clear from Equation 23 that
Figure 3 shows the effect of

Figure 3. Contours of impact of
Discussion
This article examined the accuracy of commonly used measures of student growth, which are used to evaluate teachers. This article included new theoretical results for LGS and evaluated the reliability of traditional indicators of student learning with the aid of a latent variable framework. The remainder of the manuscript summarizes key findings, discusses the practical implications of the results, and offers ideas for future research.
It is important to describe the contributions of this study in the context of existing research. First, this article offers new theoretical results to existing research on the reliability of gain scores. Prior research examined the reliability of residual and simple difference scores. This article framed gain scores in terms of a general class of LGS and identified a pretest slope
Second, this article builds on prior research that used the CFM to clarify issues related to measurement bias and prediction bias (Culpepper, 2012; Millsap, 1997, 1998, 2007). In fact, the findings in this study show that classroom-level differences in test score properties have significant implications for the interpretation of gain scores and for measures of teacher performance that arise from those scores. Furthermore, the use of the CFM as a latent variable framework offered an examination of the reliability of
Third, this article contributes to the dialogue pertaining to teacher evaluation. Namely, school districts are increasingly adopting statistical modeling techniques that rely on gain scores or residualized gain scores as a way to judge teacher quality. The results in this article suggest that more research is needed to understand classroom measurement bias. If strong invariance is satisfied
Fourth, this article offers some guidance about the relative performance of residual and gain scores in a variety of situations. Researchers currently use
Another direction for future research is to extend the approach discussed by Vautier et al. (2008). For instance, some models incorporate test scores from several years and several content areas. A natural extension of this article would be to examine measures of classroom growth when student performance over time is a function of the CFM and one or more teachers or schools (Briggs & Weeks, 2011). Past teachers likely affect students’ performance on future exams, and the CFM could be used to understand the reliability of observed growth in instances where learning effects decay over time. Similarly, previous research examined student learning across more than one content domain as well as over time. Future research could expand the CFM by including additional latent and observed variables for other content areas and examine how changes in student characteristics impact multivariate measures of growth.
In conclusion, this study continues an important discussion about technical issues surrounding the reliability of gain scores, which are increasingly used to evaluate teachers. As noted earlier, the popular press has tried to make sense of highly technical measurement and psychometric issues, and interest in teacher evaluations will continue to grow with the use of test scores to rate teachers. Some school districts seem poised to use statistical models to evaluate teachers. The stakes are high for educators to ensure that statistical tools yield reliable teacher ratings, especially given that decision makers are using, or planning to use, model estimates to reward or punish teachers. The findings in this study shed light on the relative reliability of different types of gain scores for measuring student growth and ranking teachers’ performance. In short, measurement bias and student tracking are important factors to understand regarding LGS reliability, and the results in this article can serve as a guide as researchers, statisticians, and psychometricians develop and refine methods for teacher evaluation.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
