Abstract
Student Growth Percentiles (SGPs) are increasingly being used in the United States for inferences about student achievement growth and educator effectiveness. Emerging research has indicated that SGPs estimated from observed test scores have large measurement errors. As a result, little is known about “true” SGPs, which are defined in terms of nonlinear functions of latent achievement attributes for individual students and their distributions across students. We develop a novel framework using latent regression multidimensional item response theory models to study distributional properties of true SGPs. We apply these methods to several cohorts of longitudinal item response data from more than 330,000 students in a large urban metropolitan area to provide new empirical information about true SGPs. We find that true SGPs are correlated 0.3 to 0.5 across mathematics and English language arts, and that they have nontrivial relationships with individual student characteristics, particularly student race/ethnicity and absenteeism. We evaluate the potential of using these relationships to improve the accuracy of SGPs estimated from observed test scores, finding that accuracy gains even under optimal circumstances are modest. We also consider the properties of SGPs averaged to the teacher level, which are widely used for teacher evaluations. We find that average true SGPs for individual teachers vary substantially as a function of the characteristics of the students they teach. We discuss implications of our findings for the estimation and interpretation of SGPs at both the individual and aggregate levels.
A Student Growth Percentile (SGP) is the percentile rank of a student’s current achievement among students with similar prior achievement (Betebenner, 2009). For example, a student whose current achievement is at the 70th percentile among students matched to him/her with respect to prior achievement would have an SGP equal to 70. Two features of this definition make SGPs appealing. First, the percentile rank scale is familiar and interpretable, and remains well-defined even if test scores are not vertically or even intervally scaled (Betebenner, 2009; Briggs & Betebenner, 2009; Castellano & Ho, 2013). Second, ranking students against other students with similar prior achievement is perceived as more fair and relevant to evaluating both individual student progress and educator effectiveness than simply examining unadjusted achievement levels (Betebenner, 2009). These benefits have contributed to the increasing use of SGPs in the United States.
However, recent research has demonstrated that SGPs estimated from standardized test scores suffer from large estimation errors (Akram, Erickson, & Meyer, 2013; Lockwood & Castellano, 2015; McCaffrey, Castellano, & Lockwood, 2015; Monroe & Cai, 2015; Shang, Van Iwaarden, & Betebenner, 2015). Both the prior and current test scores used in SGP calculations are error-prone measures of their corresponding latent achievement traits due to the finite number of items on each test (Lord, 1980). These errors combine to make estimated SGPs noisy measures of the “true” (or latent) SGPs, defined for each student as the percentile rank of his/her current latent achievement among students with the same prior latent achievement (Lockwood & Castellano, 2015).
These errors jeopardize the validity of inferences made from estimated SGPs. For example, stakeholders and other consumers of SGP data are likely to be interested in students’ true SGPs as indicators of academic progress. However, McCaffrey et al. (2015) demonstrate that under typical testing conditions, 95% confidence intervals for true SGPs given estimated SGPs often cover much of the entire 0 to 100 percentile rank range. Thus, estimated SGPs typically are only weakly informative about actual academic progress for individual students. Moreover, the errors in estimated SGPs contain a component that is positively related to students’ true prior achievement levels (McCaffrey et al., 2015). This is problematic for interpreting SGPs aggregated to teacher or school levels as indicators of educator effectiveness, because it implies that aggregated SGPs for equally proficient educators who serve students of different prior achievement levels will tend to be different.
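The way test measurement error propagates into estimated SGPs can be illustrated with a small simulation. The sketch below is not the estimation approach used in the studies cited above (operational SGPs are typically estimated by quantile regression). It assumes standardized bivariate normal latent achievement, independent normal measurement errors, and equal reliability for both tests. The function names and the default values of `rho_latent` and `reliability` are hypothetical and chosen only for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_sgp(prior, current, rho):
    """Percentile rank (0-100) of `current` among students with the same
    `prior`, assuming standardized bivariate normal scores with correlation rho."""
    z = (current - rho * prior) / (1.0 - rho ** 2) ** 0.5
    return 100.0 * STD_NORMAL.cdf(z)


def simulate_sgp_rmse(n=20000, rho_latent=0.8, reliability=0.9, seed=1):
    """RMSE between SGPs computed from error-prone scores and true SGPs."""
    rng = random.Random(seed)
    # Error SD chosen so that var(latent) / var(observed) equals the reliability.
    err_sd = ((1.0 - reliability) / reliability) ** 0.5
    obs_sd = (1.0 / reliability) ** 0.5
    # With independent errors, the observed-score correlation is attenuated.
    rho_obs = rho_latent * reliability
    resid_sd = (1.0 - rho_latent ** 2) ** 0.5
    total = 0.0
    for _ in range(n):
        # Latent prior and current achievement (assumed bivariate normal).
        t1 = rng.gauss(0.0, 1.0)
        t2 = rho_latent * t1 + resid_sd * rng.gauss(0.0, 1.0)
        # Observed scores: latent trait plus error, rescaled to unit variance.
        x1 = (t1 + rng.gauss(0.0, err_sd)) / obs_sd
        x2 = (t2 + rng.gauss(0.0, err_sd)) / obs_sd
        diff = normal_sgp(x1, x2, rho_obs) - normal_sgp(t1, t2, rho_latent)
        total += diff ** 2
    return (total / n) ** 0.5
```

Comparing, for example, `simulate_sgp_rmse(reliability=0.8)` with `simulate_sgp_rmse(reliability=0.95)` shows the error shrinking as reliability rises, although under these assumptions it vanishes only when the tests are perfectly reliable.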
Understanding whether alternative ways of estimating SGPs from observed data could mitigate such problems depends on distributional properties of true SGPs that are currently unknown. For example, if true SGPs are correlated across tested subjects and/or with auxiliary data such as student background characteristics, shrinkage estimators that exploit these relationships can be used to improve accuracy of estimated SGPs (Efron & Morris, 1973). de la Torre (2009) makes analogous arguments for estimating latent abilities using testing data from multiple academic subjects, and Sinharay, Puhan, and Haberman (2011) review the extensive literature on the use of shrinkage to improve the accuracy of estimated subscores. However, the potential accuracy gains from shrinkage depend in part on the strength of the relationships among the latent quantities and auxiliary information, and we currently do not know how strongly correlated true SGPs are across academic subjects, nor do we know to what extent true SGP distributions vary as a function of student background characteristics. This information cannot be learned simply by studying estimated SGPs because of the random and systematic measurement errors they contain.
Distributional properties of true SGPs also have implications for the ability of alternative estimation methods to improve fairness of SGP measures. As noted, one of the purported advantages of SGPs is that by comparing conditional achievement status relative to students with similar prior achievement, they provide a fair assessment of student progress. Understanding to what extent true SGP distributions vary with respect to student background variables is important for understanding to what extent they actually level the playing field, and whether improvements in tests, SGP estimation methods, or both could ultimately remove any undesirable correlations of the measures with student background characteristics. For example, Shang et al. (2015) suggest the SIMEX method of measurement error correction to reduce the bias in teacher-level aggregated SGPs due to measurement error in the prior test scores. However, if true SGPs are correlated with student background characteristics such as ethnicity and economic status, corrections for test measurement error alone would be insufficient to remove potentially undesirable relationships between teacher performance indicators and student background.
To study these issues, we apply latent regression (e.g., Mislevy, Beaton, Kaplan, & Sheehan, 1992; von Davier & Sinharay, 2010) multidimensional item response theory (MIRT; e.g., Adams, Wilson, & Wang, 1997) models to longitudinal item response data from several cohorts of students from a large urban metropolitan area. In the Statistical Model section, we develop a model for latent achievement across grades and subjects that includes regressions on student covariates, and show how the model can be used to estimate population joint distributions of true SGPs. In the Data section, we describe our data sources and estimation cohorts, which include item response data from Math and English language arts (ELA) assessments in Grades 3 to 8 as well as a variety of student covariates. We then consider three research aims: (1) describe the properties of true SGPs, namely, cross-subject correlations and relationships with student covariates, by estimating latent regression models and true SGP distributions; (2) evaluate the potential gains in accuracy for estimating individual student true SGPs by using their relationships to other observable information; and (3) evaluate to what extent true SGPs aggregated to the teacher level would demonstrate relationships with student background characteristics. We address each of the aims in turn, describing their relevant methods and results in their own sections. Finally, we discuss further implications, limitations, and next steps in the Discussion section.
Statistical Model
This section specifies a model for latent achievement attributes, defines true SGPs under this model, and shows how their distributional properties can be assessed from data.
Latent Regression MIRT Model
For
The joint distribution
where
This model specification is typical for latent regression in a MIRT context. For example, it is used by de la Torre (2009). Also, although the National Assessment of Educational Progress involves a complex sampling and assessment design as well as a large number of covariates, it too uses the same latent regression MIRT model specification assumed here (Mislevy, Johnson, & Muraki, 1992; von Davier & Sinharay, 2010).
SGP Definition
The true SGPs are defined as functions of
Our interest is in more complicated properties of the distribution of
Conditioning on
Estimating Distributional Properties
Given data
This can be used to compute maximum likelihood estimates
and this estimated distribution can be used to evaluate distributional properties of any functions of
We use Monte Carlo methods to assess distributional properties because they are simple to implement. To describe these methods, let
Data
For all analyses, we use longitudinal item-level data from a large, diversely populated city in the northeastern United States. We focus on the two most recent years of available data over which the testing program was stable: 2008-2009 (2009) and 2009-2010 (2010). We model data from both ELA and math to study the relationship of SGPs for the same student across these academic subjects. To address our aim of understanding the extent to which true SGP distributions vary by student background variables, we consider key background variables that are supported by the available data.
We subset the data by students’ grade levels in 2010 for Grades 4 to 8, representing five 2010 grade-level cohorts. Each subset was a 2-year by 2-subject block with current and prior year data for both math and ELA. Table 1 summarizes the student distributions by our background covariates of interest for each cohort. None of the student background variables we consider vary by subject, and thus for each cohort, the frequencies and percentages refer to both ELA and math. We subset each cohort to students with item-level data for both subjects and years and with nonmissing records for each covariate, which resulted in attrition of between 9.5% and 11.4% of students by cohort. The final student sample sizes by cohort, at the bottom of Table 1, ranged from 65,093 to 67,343. Note that the table lists all covariates that we include in the latent regression model, and thus for each type of background variable, one of the categories is not shown but can easily be computed from the available frequencies in the table. For instance, for the Grade 4 cohort, 23% speak Spanish at home and 16% speak another non-English language at home, leaving 61% of students whose primary home language is English. For race/ethnicity, we combined Hispanic and Other because a relatively small proportion of students identified themselves as “Other,” and preliminary analyses revealed that their performance across grades and subjects was similar to that of Hispanic students.
Table 1. Frequencies and Percentages of Students by Each Student Background Covariate for Each Cohort.
We classify covariates as time-invariant or time-varying. The time-invariant covariates describe personal characteristics that remain static over time, including gender (female), race or ethnicity (Asian, Hispanic or Other, and Black), and home language spoken (Other Home Language and Spanish Home Language). The frequencies and percentages for these covariates, shown in the top half of the table, are thus the same for each year within each cohort. In contrast, the time-varying covariates describe student statuses or group memberships that can fluctuate over time. These covariates, shown in the bottom half of the table, include English language learner (ELL) status, Special Education status, Disability status, Free or Reduced Price Lunch (FRL) status, and excessive school absences (>10%).
Generally, the distributions of time-varying covariates do not fluctuate substantially across the 2 years within a cohort with the exception of FRL status and Special Education status. The percentage of students coded as participating in FRL increased by between 17 and 24 percentage points for each cohort from 2009 to 2010, while the percentage of students coded as participating in Special Education programs increased by between 7 and 9 percentage points for each cohort from 2009 to 2010. Such changes were not explained in the data documentation, but perhaps shifts in eligibility and identification criteria were implemented over these years, resulting in increased participation in these programs.
Estimating Latent Regression Models and True SGP Distributions
In this section, we present details of how we fit the latent regression MIRT models to our data, and how we used the resulting parameter estimates to estimate properties of true SGP distributions. We then summarize these results.
Method
We model the latent achievement traits for the two years (2009 and 2010) and two subjects (ELA and math) jointly as a function of the covariates of interest (see Table 1) with a latent regression four-dimensional MIRT model. Separate models were fit for each of the five cohorts. The achievement tests include both dichotomously scored multiple-choice items and polytomously scored constructed-response items, which we model with the two-parameter logistic (2PL) model and the generalized partial credit model (Muraki, 1992), respectively. No items are common across grade levels, so items for a particular grade level and subject area load only on their corresponding dimension.
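As a point of reference for the two item response functions named above, the following sketch computes response probabilities under the 2PL model and the generalized partial credit model. It assumes the common discrimination/difficulty parameterization with a single latent trait per item. The parameterization used by the estimation software is not stated here, so the function names and argument layout are illustrative assumptions.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model, given latent
    trait `theta`, discrimination `a`, and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def p_gpcm(theta, a, steps):
    """Category probabilities (scores 0..len(steps)) under the generalized
    partial credit model with discrimination `a` and step parameters `steps`."""
    # Cumulative sums of a * (theta - b_v); the score-0 category has sum 0.
    logits = [0.0]
    for b_v in steps:
        logits.append(logits[-1] + a * (theta - b_v))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With a single step parameter, the generalized partial credit model reduces to the 2PL model, which is a convenient check on any implementation.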
That is, we have “between-item” MIRT models (Adams et al., 1997) for each current Grade
We used the
As noted, the covariates for each student include both time-invariant and time-varying components. All time-invariant covariates (gender, race/ethnicity, home language) were included in each of the
For each cohort, we used the estimated model parameters and the Monte Carlo methods described previously to obtain samples
Latent Regression Results
Because we were using test score data that had already been calibrated for a state testing program, there were no apparent issues with the item parameter estimates or a need to drop misbehaving items. The latent regression coefficients, representing average group differences holding other covariates constant, generally followed patterns seen with test scores. For instance, we found negative relationships to latent achievement for Hispanic or Other, Black, ELL, Special Education, FRL, and excessive absences across all cohorts. For Asian, Other Home Language, and Spanish Home Language, we generally found small to moderate positive coefficients for both subjects. For females and Disability, relationships were not as consistent across subjects or time points: females generally had small negative coefficients for Math but small to moderate positive coefficients for ELA at both time points, and Disability tended to have positive coefficients for both subjects at Time 1 but negative coefficients at Time 2. Complete tables of the estimated latent regression coefficients by cohort and dimension are provided in the Supplemental Material (available online).
The covariates explain between 37% and 46% of the variance in the latent traits, depending on cohort and dimension. We compute this for each latent trait
where
The
Table 2. Estimated Residual Correlations for Each Cohort.
True SGP Distribution Results
We used samples
The other key attribute of
Table 3 summarizes the group mean differences in true SGPs by cohort and subject, with math in the top half of the table and ELA in the bottom half. The rows are ordered from most negative to most positive by the mean SGP differences in math averaged over cohort (last column), although these averaged differences tend to be similar across the two subjects. For the gender, race/ethnicity, and home language variables, the group mean differences contrast students who are in the given group to all other students. For the time-varying covariates, the group mean differences contrast students who are in the given group for both years to those who are not in the given group in either year. For example, the FRL row of Table 3 for math indicates that on average across cohorts, students who participate in the FRL program for two consecutive years have true math SGPs 5.1 percentile points lower than students who do not participate in the FRL program for either of the two years. The table shows that the most negative differences tend to be for students with excessive absences relative to students who attend regularly, ranging from −11.5 to −5.7 percentile points for math and −11.4 to −6.5 for ELA. In contrast, the Asian and Other Home Language groups have true SGPs that are on average 9 to 10 percentile points higher than other students. The mean differences for the Other Home Language group track the Asian group because there is a large overlap in these populations, with most students who indicated speaking a language other than English or Spanish at home also identified themselves as Asian.
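The contrast used for the time-varying covariates can be sketched as follows. This is a simplified illustration that operates on a plain list of SGP values with two membership indicators. In the analysis described above, the contrasts are computed from model-based samples of true SGPs, and the function name here is hypothetical.

```python
from statistics import mean


def two_year_group_difference(sgps, in_group_year1, in_group_year2):
    """Mean SGP of students in the group in both years minus the mean SGP of
    students in the group in neither year. Students whose status changes
    between years belong to neither comparison group and are excluded."""
    rows = list(zip(sgps, in_group_year1, in_group_year2))
    both = [s for s, y1, y2 in rows if y1 and y2]
    neither = [s for s, y1, y2 in rows if not y1 and not y2]
    if not both or not neither:
        raise ValueError("each comparison group needs at least one student")
    return mean(both) - mean(neither)
```

For the time-invariant covariates the contrast is simpler, since each student is either in the group or not, and no students are excluded.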
Table 3. Group Mean Differences in True SGP for Math and ELA for Each Covariate and Cohort.
Although the mean differences generally are similar in sign and magnitude across cohorts within a subject, there are notable exceptions, including Special Education and Disability in Grade 6 math, ELL in Grade 8 math, FRL in Grade 8 ELA, and Asian and Other Home Language in Grade 7 ELA. The exact sources of these deviations from the general patterns are unknown. They could result from idiosyncratic features of different cohorts of students, idiosyncratic features of the tests from particular grades and subjects, or unobserved interventions targeting specific subpopulations of students. We observed qualitatively similar patterns for the deviant cases with SGPs estimated from the observed scale scores. Thus, they are not a result of modeling the item-level data.
Finally, although some of the group differences are large, collectively the student background variables do not explain much of the variance in true SGPs for students. In a linear regression of the true SGPs on main effects for the covariates, the R² values range from only 0.04 to 0.07 across grades and subjects. These are markedly lower than the R² values from a regression of the true SGP for one subject on the true SGP for the other subject, which, from squaring the true SGP correlations reported previously, would range from 0.09 to 0.27. This suggests that cross-subject information may be more useful than student background variables for shrinkage estimation.
Implications for SGP Estimation Accuracy
The previous section established that true SGPs for students are correlated across math and ELA, and that they are related to student background characteristics. These are descriptive properties of the distribution
Method
We consider the use of conditional means for estimation. Conditional means would correspond, for example, to estimates obtained via a Bayesian analysis where true SGPs were estimated using their posterior means given the observed data (Lockwood & Castellano, 2015; McCaffrey et al., 2015). We thus refer to these estimators as expected a posteriori (EAP) estimators. To describe the methods, it is useful to partition the item response data
Two properties of EAPs make them convenient for calibrating the potential value of auxiliary data in this context. The first is that the function
Because the estimated distribution
Our analyses followed three general steps: (1) generating samples from the appropriate distribution; (2) using 80% of the samples to approximate
For Step (2), we used a random 80% of the samples
Finally, for Step (3), we used the estimated functions
We repeated this procedure for a sequence of different values of
Results
The main results are summarized in Figure 1(a). The reliability λ is on the horizontal axis, and the vertical axis is the square root of the MSE (denoted “RMSE”) for conditional mean estimators of

Figure 1. (a) Approximate RMSE of EAP estimators for math SGP conditioning on different amounts of information, as a function of test reliability. (b) Approximate percentage reduction in MSE for different math EAP estimators, relative to the EAP estimator that conditions only on math scores, as functions of test reliability. Values are averaged across Grades 4 to 8.
Figure 1(b) calibrates the improvements in terms of percentage reduction in MSE (Haberman, 2008; Sinharay et al., 2011) relative to the default EAP estimator that conditions only on math scores. Adding covariates alone (curve with triangles) provides an improvement of only a few percentage points, with benefits increasing as the test reliability decreases. Adding ELA scores alone (curve with +) provides somewhat more benefit, but the relationship is not monotonic in the test reliability. Conditioning on both sources of information (curve with ×) leads to percentage reductions in MSE that are at best 6%. The corresponding maximum for ELA (not shown) is 7%. Curiously, the percentage reduction is maximized at reliabilities between 0.85 and 0.90, which are typical values for actual standardized assessments. This means that at the reliabilities of tests currently in use, the relative benefit provided by conditioning on auxiliary information is about as large as it could be, even though in absolute terms the accuracy gains will not be large.
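The percentage reduction in MSE can be computed directly from the RMSE values of two estimators. The helper below is a minimal sketch with a hypothetical function name. It assumes both RMSEs are measured against the same true SGPs on the same evaluation sample.

```python
def percent_reduction_in_mse(rmse_baseline, rmse_augmented):
    """Percentage reduction in MSE of an estimator that conditions on
    auxiliary information, relative to a baseline estimator."""
    if rmse_baseline <= 0:
        raise ValueError("baseline RMSE must be positive")
    # MSE is the square of RMSE, so reductions are compared on the squared scale.
    mse_baseline = rmse_baseline ** 2
    mse_augmented = rmse_augmented ** 2
    return 100.0 * (1.0 - mse_augmented / mse_baseline)
```

Because MSE is on the squared scale, a 6% reduction in MSE corresponds to a reduction of only about 3% in RMSE, since the square root of 0.94 is approximately 0.97.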
It is important to clarify that our calculations here hold constant the definition of the true SGP as conditioning on only the matched-subject prior year achievement. For example, the true math SGP
Implications for Aggregating SGPs to the Teacher Level
Previous results established that distributions of true SGPs vary as a function of student covariates. Here we consider the implications of this fact for the behavior of aggregates of SGPs to the teacher level, currently used for teacher evaluations (e.g., Colorado Department of Education, 2013; Georgia Department of Education, 2014). Specifically, we use
Such variation indicates a correlation between the background characteristics of the students a teacher teaches and the average of these students’ true SGPs. The variation results from the combination of the unequal distribution of student covariates across classrooms and the relationships of true SGPs with these covariates. There are at least three distinct mechanisms for these relationships. First, they could result from student-level factors (e.g., motivation, skills, or family circumstances) that are related to both true SGPs and the observed student covariates. Second, they could result from contextual effects, where students have more or less growth as a result of contextual factors (e.g., neighborhoods or classroom dynamics) that are correlated with the observed student covariates. Either of these two mechanisms would pose a problem for interpreting aggregate SGPs as teacher performance indicators because they would cause teachers of equal effectiveness, but who teach different types of students, to receive systematically different aggregate SGPs. The third possible mechanism for relationships of true SGPs with student covariates is the sorting of more or less effective teachers to schools and classrooms that vary systematically with respect to our student background variables. That is, if more effective teachers are more likely to teach students with particular background characteristics, then such students on average may have higher true SGPs simply because they are taught by better teachers. In our investigations of true SGPs aggregated to the teacher level, we conduct some analyses that try to shed light on the contributions of these different mechanisms.
Method
Our analysis of aggregated expected SGPs consisted of four steps. First, for each cohort, we used samples
In the second step, we computed
In the third step, we restricted the analysis sample for each cohort to students for whom we observed either a math teacher link or an ELA teacher link for Grade
Finally, in the fourth step, we averaged all these expected SGPs to the teacher level for each teacher. If the average for a particular teacher is 60, for example, it indicates that given the background characteristics of the students he or she teaches, we would expect that the teacher would receive an average SGP of 60 (on the 0-100 scale) if there were no measurement error in either the prior or current year tests. We then restricted attention to teachers with 15 or more expected SGPs contributing to the teacher-level mean to mitigate the impact of small samples on the estimated distribution, resulting in 12,103 teachers with “class” sizes ranging from 15 to 221 with a median of 46. For middle school teachers who tend to teach either math or ELA but not both, the restriction generally means that teachers needed to be linked to at least 15 students. For elementary school teachers who teach both subjects, the number of actual students might be as small as 8 because each student contributes both a math and ELA expected SGP.
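The aggregation in the fourth step can be sketched as a simple group-and-filter operation. The data layout, a sequence of `(teacher_id, expected_sgp)` pairs, and the function name are assumptions made for illustration. The threshold of 15 follows the restriction described above.

```python
from collections import defaultdict


def teacher_level_means(records, min_count=15):
    """Average expected SGPs by teacher, keeping only teachers with at least
    `min_count` contributing values.

    `records` is an iterable of (teacher_id, expected_sgp) pairs. A student
    taught both subjects by the same teacher contributes two pairs.
    """
    by_teacher = defaultdict(list)
    for teacher_id, expected_sgp in records:
        by_teacher[teacher_id].append(expected_sgp)
    return {
        teacher_id: sum(values) / len(values)
        for teacher_id, values in by_teacher.items()
        if len(values) >= min_count
    }
```

Counting expected SGPs rather than students is what allows an elementary teacher with as few as 8 students to meet the threshold, because each such student can contribute one math value and one ELA value.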
Results
Figure 2 provides a histogram of the teacher-level averages of the expected true SGPs for the sample of teachers described above. The distribution has some low outliers below 40, and a heavier right tail that extends above 60. The 0.10 and 0.90 quantiles are 44.8 and 55.4, respectively. Other authors have noted that SGPs estimated from test scores and then aggregated to the teacher level can be correlated with aggregated student background characteristics solely as a result of measurement error in the prior test scores used to estimate SGPs (McCaffrey et al., 2015; Shang et al., 2015). Our results go further: they indicate that such relationships would exist even if true SGPs could be measured perfectly through tests with no measurement error.

Figure 2. Histogram of estimated expected true SGP at the teacher level based on student covariates for teachers with at least 15 expected SGPs.
The variability of the distribution is striking, but as noted in the previous section, there are multiple mechanisms that could be responsible for the correlation between true SGPs and student covariates that ultimately leads to the type of variability evident in Figure 2 when teachers vary with respect to the types of students they teach. We conducted several analyses that probed these mechanisms. First, we investigated whether we obtained distributions similar to Figure 2 if we considered expected SGPs given lag-1 prior achievement in the other subject or given additional years of prior achievement from the same subject. Conditioning on additional prior achievement attributes is a common strategy for matching students more closely with respect to prior achievement (Lockwood & McCaffrey, 2014), potentially reducing the magnitude of relationships between other student covariates and student progress. That is, if part of the relationship between true SGPs and observed student covariates is due to unobserved student-level factors correlated with both, then conditioning on additional information that may proxy for such factors can help to reduce the correlation between true SGPs and observed student covariates.
We can easily obtain true SGP distributions for, say, current math achievement given both prior math and ELA achievement using the models presented to this point. For true SGPs given additional lagged prior achievement, we ran additional latent regression models where the four dimensions were 4 years of achievement for a single subject (e.g., math achievement in Grades 3 [2007] to 6 [2010]), which allowed us to examine distributional properties of true SGPs conditional on up to 3 years of prior achievement. In summary, using these true SGP distributions and the methods described above for obtaining average expected SGPs for teachers given student background covariates, we found only a modest reduction in variance. Specifically, versions of the distribution in Figure 2 based on including additional prior achievement attributes would have a standard deviation ranging from 78% to 85% as large as that of the distribution in Figure 2. Thus, the large spread evident in Figure 2 is not removed simply by conditioning on more prior achievement traits.
We also conducted analyses that tried to isolate the part of the observed relationships between expected SGPs and student covariates that is due only to individual-level relationships. These analyses are described in the appendix. The results are summarized in Figure 3, which is analogous to Figure 2 but is based on only individual-level relationships between background characteristics and true SGPs, and does not reflect variation due to either contextual effects or sorting of teachers of different effectiveness to different types of students. The standard deviation of the distribution in Figure 3 is 63% as large as that of the distribution in Figure 2. In addition, the 0.10 and 0.90 quantiles in Figure 3 are 46.9 and 53.5, compared with 44.8 and 55.4 for Figure 2. Thus, our analyses suggest that even if part of the spread in Figure 2 is due to contextual effects or teacher sorting, variation across teachers in expected aggregated SGPs would remain due to individual student-level relationships and variation across teachers in the types of students they teach.

Figure 3. Histogram analogous to Figure 2, but based on within-group latent regression coefficients.
Discussion
Studying properties of true SGPs only through the lens of estimated SGPs is difficult due to excessive estimation error in estimated SGPs. Modeling longitudinal item-level data with latent regression MIRT models is an efficient and effective alternative. The latent regression specification leads directly to model-based true SGP functions, and the parameters required to specify these functions can be estimated straightforwardly in the MIRT framework. Monte Carlo methods can then be used to study features of the joint distribution of multiple true SGPs, student covariates, and test scores that, to date, have not been investigated.
Our results raise concerns about using and interpreting estimated SGPs at both the student and aggregate levels. A substantial research base already notes that SGP estimates for individual students have large errors (Lockwood & Castellano, 2015; McCaffrey et al., 2015; Monroe & Cai, 2015; Shang et al., 2015). Our findings indicate that joint models capitalizing on relationships of true SGPs both across subjects and with student covariates would provide only modest benefits for estimation accuracy. Although this accuracy problem manifests with SGPs, it is not unique to SGPs: it is an intrinsic problem with trying to use typical standardized assessments to measure growth accurately (Harris, 1963). Our findings underscore that using multiple features of the observed data to learn about true growth cannot overcome this fundamental limitation. Thus, estimated SGPs may not be accurate enough to support inferences or decision making for individual students.
The fact that SGPs apparently are related to student background characteristics even in the absence of test measurement error, with directions of the relationships generally echoing those observed with achievement status, creates further interpretation problems. On the one hand, our finding that excessive absence is a strong predictor of true SGPs provides some reassurance that tests can be sensitive to time-varying factors that we would hope to have causal impacts on student progress. On the other hand, relationships with persistent characteristics such as race/ethnicity suggest that the process of conditioning on prior achievement, even if it could be measured accurately, will result in achievement progress measures that carry with them some part of the gaps seen with achievement status. This creates a dissonance between some of the rhetoric surrounding the fairness of growth measures such as SGPs, and the reality of how such measures are likely to behave. Our finding that these relationships exist with latent achievement attributes, not just with observed test scores, makes clear that improving the reliability of standardized assessments would be insufficient to solve this problem.
The relationships of true SGPs to student characteristics also create a clear problem for interpreting estimated SGPs aggregated to the teacher or school levels. One of the putative benefits of aggregating estimated SGPs is that it overcomes the excessive measurement error problem at the individual level. However, the variability in the distribution in Figure 2 is troubling, and our evidence that a nontrivial part of that variability may be due to individual-level relationships between student characteristics and true SGPs is even more troubling. It suggests that SGPs aggregated to the teacher level may contain a source of variance that is due solely to the fact that teachers do not teach the same types of students. This source of variance represents bias if the goal is to interpret aggregated SGPs as an indicator of teacher effectiveness. This bias is easy to avoid in a value-added model that regresses student test scores on teacher fixed effects, prior test scores, and student background variables because such a model removes variance due to the individual-level relationships from the estimated teacher effects (see, e.g., Wooldridge, 2002). Our results thus suggest that the interpretation and transparency benefits provided by aggregated SGPs need to be weighed against the costs of allowing a source of bias in performance indicators that is removed by alternative modeling approaches.
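The contrast between aggregated SGPs and a fixed-effects value-added model can be illustrated with a small simulation. The sketch below is not our analysis: the data-generating process, the coefficient values, and the use of residual percentile ranks as a crude stand-in for SGPs are all assumptions chosen for illustration. Every simulated teacher is equally effective, so any association between a teacher-level indicator and classroom composition reflects only the student-level relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_per = 200, 50
teacher = np.repeat(np.arange(n_teachers), n_per)

# Hypothetical data-generating process: all true teacher effects are zero,
# but classrooms differ in the share of students with characteristic x = 1.
share = rng.uniform(0.1, 0.9, n_teachers)
x = (rng.random(teacher.size) < share[teacher]).astype(float)
prior = rng.normal(size=teacher.size)
current = 0.7 * prior - 0.5 * x + rng.normal(scale=0.5, size=teacher.size)

def teacher_means(v):
    """Average a student-level vector within each teacher."""
    return np.bincount(teacher, weights=v) / n_per

def demean(v):
    """Subtract each student's teacher-level mean (within transformation)."""
    return v - teacher_means(v)[teacher]

# (a) Crude SGP stand-in: percentile rank of the residual from a regression
#     of current score on prior score only, then averaged by teacher.
X0 = np.column_stack([np.ones_like(prior), prior])
resid = current - X0 @ np.linalg.lstsq(X0, current, rcond=None)[0]
sgp = 100 * (resid.argsort().argsort() + 0.5) / resid.size
mean_sgp = teacher_means(sgp)

# (b) Teacher fixed effects from a regression on prior score and x.
#     Slopes come from within-teacher variation; the fixed effect is the
#     teacher mean of the outcome net of the fitted covariate contribution.
W = np.column_stack([demean(prior), demean(x)])
beta = np.linalg.lstsq(W, demean(current), rcond=None)[0]
Z = np.column_stack([teacher_means(prior), teacher_means(x)])
fixed_effect = teacher_means(current) - Z @ beta

corr_sgp = np.corrcoef(mean_sgp, share)[0, 1]
corr_fe = np.corrcoef(fixed_effect, share)[0, 1]
print(f"corr(mean SGP stand-in, classroom share): {corr_sgp:.2f}")
print(f"corr(fixed effect, classroom share):      {corr_fe:.2f}")
```

Under these assumptions, the teacher-level mean of the SGP stand-in tracks classroom composition closely even though no teacher differs in effectiveness, whereas the fixed-effect estimates are approximately unrelated to it. Operational SGPs are estimated differently, so the sketch conveys only the direction of the problem, not its magnitude.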
Our results come with a number of caveats. The main limitation results from the specification of the latent regression model in Equation 1. Although such a specification is standard in MIRT modeling, and is useful for analyzing aspects of the statistical structure of achievement attributes and their relationships to student characteristics, it falls short of being a structural (causal) model for the evolution of student achievement. Such models specify student achievement as a cumulative function of the history of educational inputs (e.g., teacher and school effects), peer effects, and effects of both observed and unobserved individual and family attributes (see, e.g., Todd & Wolpin, 2003). Provided the many assumptions required to estimate such models with real data are appropriate, they have the advantage that they permit the sorts of decompositions needed to fully interpret, for example, the distribution in Figure 2 because they disentangle the causal effects of various inputs to student achievement. We have no reason to think that our second-stage analyses probing the decomposition provide misleading results. However, more refined inferences would be possible if the latent regression part of the MIRT model were specified as something closer to a structural model for longitudinal student achievement. Such modeling introduces a number of challenges, including data requirements, model specification decisions (e.g., how to deal with the cross-classification of students to teachers over time as well as missing student-teacher links), and software limitations that are beyond the scope of this article. Future work along these lines could build on our framework under more complex model specifications.
Other modeling decisions may have affected our results. For example, our latent regression included only main effects for the covariates. Preliminary analyses with the scale scores suggested some evidence for two-way interactions, but the increase in R² for such model terms was 0.01 or less, and including them in the MIRT models would have substantially increased computation time and complicated interpretation of the model parameters. It is unlikely that accounting for these interactions would change any of the substantive findings, but it still may be useful to consider the sensitivity of our findings to such model specification changes. Similarly, our choice of item response model could affect our results, warranting additional analyses to determine the extent to which a different model, such as the three-parameter-logistic model for dichotomously scored items, fits the data better and changes our findings, if at all. Finally, it is unlikely that the assumption that the latent trait residuals
Some of our findings also may be due to peculiarities of our data set. For instance, in preliminary analyses, we found that student background covariates still had relatively large coefficients when included in a regression of current year (2010) scores on several prior year scores. Thus, for our data, student background covariates seem to explain additional variation in current test scores over and above prior achievement. This may not be typical, which could contribute to the several large group differences in true SGPs we find.
Finally, the inferences from our EAP analyses may also be sensitive to several choices we made beyond those made in the latent regression MIRT model. For example, we approximated EAPs under the assumption of homoscedastic measurement error, which does not generally hold with IRT-based ability estimates based on linear test forms. We suspect that our substantive conclusions are not sensitive to allowing for heteroscedastic error, but future research could examine the extent to which it matters. Allowing for it may make shrinkage more important for students in the tails of the test score distributions because their test scores are typically noisier.
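The intuition about shrinkage in the tails can be sketched with a simple normal-normal model for a single latent achievement score. This is an illustrative simplification, not the EAP computation for true SGPs used in our analyses, and the prior and error variances below are hypothetical values. In this model the posterior mean pulls the observed score toward the prior mean by a weight that falls as the error variance rises.

```python
def shrinkage_weight(prior_var, error_var):
    """Weight placed on the observed score under a normal-normal model."""
    return prior_var / (prior_var + error_var)

def posterior_mean(score, prior_mean, prior_var, error_var):
    """Posterior mean of the latent score given one noisy observed score."""
    w = shrinkage_weight(prior_var, error_var)
    return prior_mean + w * (score - prior_mean)

# Hypothetical setup: latent achievement ~ N(0, 1); the homoscedastic
# analysis uses one error variance, while the heteroscedastic analysis
# lets the error variance grow for scores farther from the center.
PRIOR_MEAN, PRIOR_VAR = 0.0, 1.0
COMMON_ERROR_VAR = 0.10
cases = [(0.0, 0.10), (1.5, 0.20), (3.0, 0.50)]  # (observed score, assumed error variance)

for score, error_var in cases:
    homo = posterior_mean(score, PRIOR_MEAN, PRIOR_VAR, COMMON_ERROR_VAR)
    hetero = posterior_mean(score, PRIOR_MEAN, PRIOR_VAR, error_var)
    print(f"score={score:4.1f}  homoscedastic={homo:5.2f}  heteroscedastic={hetero:5.2f}")
```

With these assumed values, the two approaches agree at the center of the distribution and diverge in the tail, where the larger error variance produces stronger shrinkage toward the mean.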
Although such future research would be useful for investigating the robustness of our findings, this study serves as an important step in investigating the properties of the underlying quantities that SGPs computed from error-prone test scores attempt to measure, and the implications of those properties for the validity of SGPs as indicators of student achievement growth and educator effectiveness.
Acknowledgements
We thank Shelby Haberman, Daniel F. McCaffrey, Peter van Rijn, the editor, and two referees for providing constructive comments on earlier drafts.
Authors’ Note
The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences or the U.S. Department of Education.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D140032 to ETS.
Supplemental Material
Supplemental material for this article is available online.