Abstract
Aggregate-level conditional status metrics (ACSMs) describe the status of a group by referencing current performance to expectations given past scores. This article provides a framework for these metrics, classifying them by aggregation function (mean or median), regression approach (linear mean and nonlinear quantile), and the scale that supports interpretations (percentile rank and score scale), among other factors. This study addresses the question “how different are these ACSMs?” in three ways. First, using simulated data, it evaluates how well each model recovers its respective parameters. Second, using both simulated and empirical data, it illustrates practical differences among ACSMs in terms of pairwise rank differences incurred by switching between metrics. Third, it ranks ACSMs in terms of their robustness under scale transformations. The results consistently show that choices between mean- and median-based metrics lead to more substantial differences than choices between fixed- and random-effects or linear mean and nonlinear quantile regression. The findings set expectations for cross-metric comparability in realistic data scenarios.
Recent educational policies have expanded the focus of accountability measures from student proficiency at a single time point to student score histories over time. This article investigates metrics that use these longitudinal data to describe the status of groups of students in terms of empirical expectations given past scores. Referencing status to expectations is intuitively appealing. A group of students may have low achievement but higher achievement than expected given their past scores. Group performance may be better interpreted and improved in light of these expectations.
These methods use regression models that locate individuals and groups in empirically “comparable” reference groups based on their past scores. This may be described as a kind of norm-referencing or “difference from expectation” approach (Kolen, 2011). Following Castellano and Ho (2013a), at the individual level, we call these metrics “conditional status metrics” (CSMs), because they frame individual status in terms of conditional distributions given past scores. This class of metrics includes residuals from simple linear regression models (i.e., residual gain scores, Manning & DuBois, 1962). Another CSM is the quantile regression-based Student Growth Percentile (SGP) metric (Betebenner, 2008a, 2009) that is in active or preliminary use in 25 states (Betebenner, 2010a).
Individual-level CSMs can serve multiple purposes from student-level classification, accountability, and selection to enhancing student score reports for student, parent, and teacher audiences. However, many important questions of policy and practice are asked at the aggregate level. Current educational policies require longitudinal data summaries at the level of classrooms, teachers, subgroups, schools, and states (Lissitz, Doran, Schafer, & Willhoft, 2006; U.S. Department of Education, 2005, 2009).
The popularity and utility of aggregate-level CSMs (ACSMs), and particularly the widespread use of the median SGP (Betebenner, 2010a), motivate a critical review of ACSM properties. We introduce a framework for factors that differentiate among ACSMs, including the choice of aggregation function, nonlinear quantile versus linear mean regression, and fixed versus random effects. To illustrate these contrasts clearly, we restrict our focus to relatively simple ACSMs that condition only on past scores. We begin with aggregate-level SGP metrics and contrast them against alternatives, including aggregate-level percentile ranks of residuals (PRRs; Castellano & Ho, 2013a) and simple versions of so-called value-added models that attempt to isolate the causal effect of a teacher/school/principal on student achievement.
Within this latter class of models, our approach contrasts with those often taken in the large and growing body of literature on variability in value-added scores over time or across measures (e.g., McCaffrey, Sass, Lockwood, & Mihaly, 2009; Papay, 2011). Instead, we aim to distinguish between the growing number of metrics and describe the practical differences of choices among them. We explicate differences among metrics in terms of (1) each model’s recovery of its respective parameters, (2) practical differences among group percentile ranks for simulated and empirical data, and (3) robustness to scale transformations. Together, these analyses represent an effort to better understand the theoretical and practical differences among commonly used ACSMs, as well as the practical variability of any single ACSM under plausible alternative specifications.
Aggregate-Level Conditional Status Metrics
We begin by distinguishing models that support individual- and aggregate-level conditional status interpretations from two other classes of models that use longitudinal test scores. Individual growth curve models (Rogosa & Willett, 1985; Singer & Willett, 2003) require test score data that share a common scale over time. Multivariate models are multivariate in their response variables and include the cross-classified model of Raudenbush and Bryk (2002), the variable persistence model of McCaffrey, Lockwood, Koretz, Louis, and Hamilton (2004), and the Education Value-Added Assessment System model of Sanders, Saxton, and Horn (1997). In contrast, CSMs do not require test score data that share a common scale over time, and they are univariate in their response variables.
McCaffrey et al. (2004) usefully describe individual growth curve models and CSMs as special cases of multivariate models, and we recommend their article as a broader statistical framework. Castellano and Ho (2013b) contrast these models as they are used in state policies. A full discussion of “growth” and “value-added” inferences is beyond the scope of this article. We take the position that CSMs can inform discussions of growth and value-added inferences, but they should first be interpreted in terms of the conditional status interpretations that they support most directly.
Distinguishing Among ACSMs
As previously stated, this article presents three perspectives on differences among ACSMs: recovery of respective parameters, practical differences among ACSMs for simulated and real data, and robustness of each ACSM to scale transformations. We motivate each of these comparisons in this section.
Table 1 lists the nine ACSMs that we address in this article and identifies three distinguishing features along which these ACSMs differ. The nine ACSMs are median SGPs (medSGPs), mean SGPs (meanSGPs), median PRRs (medPRRs), mean PRRs (meanPRRs), mean residuals (meanResids), median residuals (medResids), residuals from aggregate regression (RARs), the fixed-effects metric (FEM), and the random-effects metric (REM). The first of the three distinguishing features is the regression approach that establishes expectations for the target vector of test scores. We refer to the target scores as “current” scores because they are typically from the most recent time point. The models supporting ACSMs use different regression approaches: linear mean regression, including models that incorporate or disregard group membership (multilevel and single-level models, respectively), and nonlinear quantile regression. The second distinguishing feature is the aggregation function, where metrics support aggregate-level inferences through mean- or median-based operations. The third distinguishing feature is the scale for interpretation, where some metrics are expressed as percentile ranks and others are expressed on the score scale of the current grade test, as residuals.
Table 1. Aggregate-Level Conditional Status Metrics
Note. In this article, for all regression approaches, current status or the current grade-level score is the response variable and prior grade-level scores are the predictors.
The theoretical and practical impact of these distinguishing features has yet to be well described. Although there are bodies of literature supporting ACSMs, theoretical and empirical comparisons are often focused on examining the validity of value-added inferences of the various ACSMs as opposed to describing the magnitude of practical differences and the theoretical basis for cross-metric discrepancies. In particular, studies contrasting medSGPs against alternative ACSMs (e.g., Ehlert, Koedel, Parsons, & Podgursky, 2012; Goldhaber, Walch, & Gabele, 2012; Guarino, Reckase, Stacy, & Wooldridge, 2014; Houng & Justman, 2013; Wright, 2010) are often critical of the use of median SGPs for value-added policies in comparison to other metrics. However, they have not identified the target parameter for median SGPs as we do here, and they place less emphasis on the basis for and magnitude of practical differences. Castellano and Ho (2013a) compare SGPs and PRRs at the individual level and find they are highly similar across a range of simulated and real-data scenarios. We extend their work to the aggregate level and broaden the scope of metrics under consideration.
Finally, we compare ACSMs based on their robustness to scale transformations. This robustness is one of the motivations for the SGP metric (Betebenner, 2009). If SGPs are calculated for a matrix of student test score data,
Briggs and Betebenner (2009) compared the scale invariance of medSGPs and aggregate-level effects from a “layered,” multivariate value-added model (Ballou, Sanders, & Wright, 2004). They used transformations that reflected linear or nonlinear growth, constant or increasing variance, and the more extreme exponential transformation. They found that conditioning on 1, 2, and 3 prior test scores, medSGPs were near perfectly correlated across the transformations, whereas layered model effects were less strongly correlated (r > .9), with smaller correlations arising from the exponential transformation (r ≈ .3 to .6). Our scale invariance study differs from this study most importantly in our choice of scale transformations and the broader range of ACSMs that we consider.
The proliferating uses of ACSMs motivate a broad review of contrasting metrics as well as criteria for evaluating these metrics. We review each ACSM in Table 1, in turn, and then compare them in terms of parameter recovery, practical differences, and scale invariance.
Mean and Median Student Growth Percentiles
The SGP metric uses quantile regression to describe the “current” status of students in the context of their “prior” test performance. In practice, the “current” status is either the most recent set of test scores available or a time point that is of particular interest, and “prior” refers to score variables from one or more time points that precede the current time point. Castellano and Ho (2013a) describe the estimation of individual SGPs in detail, and Betebenner (2010b) documents it in his “SGP” package for R. We briefly review the procedure here and transition to properties of two aggregate-level SGPs: median and mean SGPs.
The quantile regression approach can be clarified by contrasting it with linear mean regression of current scores on prior scores. In linear mean regression, the prediction takes the form of a conditional mean: an average current score for students given particular prior scores. In contrast, quantile regression allows for the prediction of conditional quantiles, such as the median, 25th percentile, and 90th percentile, for students with particular prior scores. Neither regression approach requires current and prior scores to be on the same scale.
The SGP estimation procedure involves estimation of 100 conditional quantile surfaces corresponding to quantiles from .005 to .995 in .01 increments (Betebenner, 2010b). These surfaces represent boundaries. Students with observed current scores that fall between two adjacent surfaces are assigned an SGP represented by the midpoint quantile between these boundaries. For example, a student whose observed current score is between the .495 and .505 predicted quantile surfaces has an SGP of 50.
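The boundary-and-midpoint rule described above can be sketched in a few lines. This is a minimal illustration, not the “SGP” package implementation: the function name `assign_sgp` is hypothetical, the boundaries are taken as already estimated, the conditional distribution is pretended to be standard normal purely for the demonstration, and the treatment of scores below the lowest or above the highest boundary (0 and 100 here) is an assumption.

```python
import numpy as np
from statistics import NormalDist

def assign_sgp(current_score, boundaries):
    """Return the SGP implied by 100 predicted quantile boundaries.

    boundaries: nondecreasing predicted current scores at quantiles
    .005, .015, ..., .995 for one student's prior scores.
    """
    boundaries = np.asarray(boundaries, dtype=float)
    # Count the boundaries at or below the observed score. A score between
    # the .495 and .505 boundaries has 50 boundaries at or below it.
    # Assumption: scores outside all boundaries map to 0 or 100.
    return int(np.searchsorted(boundaries, current_score, side="right"))

# Illustration only: pretend the conditional distribution is standard normal.
taus = (np.arange(100) + 0.5) / 100            # .005, .015, ..., .995
boundaries = np.array([NormalDist().inv_cdf(t) for t in taus])
sgp_at_conditional_median = assign_sgp(0.0, boundaries)
```

Counting boundaries with a sorted search reproduces the midpoint convention directly, because the k-th and (k+1)-th boundaries bracket the quantile k/100.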
Just as linear mean regression can generalize to nonlinear functions, the SGP quantile approach employs nonlinear B-spline functions in fitting conditional quantiles to accommodate nonlinearity and heteroscedasticity in the data (Betebenner, 2009). We follow this approach, implemented in the “SGP” package (Betebenner, 2010b), and thus classify SGPs as using “nonlinear quantile” regression. We also follow Castellano and Ho (2013a) in increasing the resolution of SGPs by estimating 1,000 instead of 100 lines, for quantiles from .0005 to .9995 by .001, allowing reporting of SGPs to one decimal point instead of as integers. This prevents cross-metric comparisons from being confounded by decisions about rounding.
This article contrasts two SGP-based ACSMs: medSGPs and meanSGPs. The medSGP and meanSGP metrics involve simple aggregation of SGPs using the median and mean functions, respectively. The typical SGP-based ACSM in operational use is the medSGP, following Betebenner’s (2008b) recommendation to use medians due to the ordinal nature of percentile ranks. Means, in contrast, are computed under the implicit assumption that an equal-interval scale underlies the averaged units. Such an assumption is violated by the percentile rank scale whenever the underlying, latent distribution for which percentile ranks are reported is nonuniform. However, we take the view that equal-interval scale properties should be evaluated with respect to uses and interpretations. The statistical features of means may support useful inferences and properties even when scales do not appear to have equal-interval properties (Lord, 1956; Scholten & Borsboom, 2009). Further, strict equal-interval properties are rare among all test score scales (Spencer, 1983; Zwick, 1992), and this has not stopped the common practice of averaging test scores. We consider threats to interpretations of ordinal-based averages as a matter of degree. We demonstrate that these threats may be offset by the advantages of means.
Mean and Median Percentile Ranks of Residuals
Castellano and Ho (2013a) contrast SGPs with an analogous approach that uses the percentile ranks of residuals from the linear mean regression of current scores on past scores. This approach has been used previously in a range of applications (e.g., Ellis, Abrams, & Wong, 1999; Fetler, 1991; Suen, 1997). A student’s PRR and SGP are both interpretable on a percentile rank scale as conditional status given past scores, but PRRs require less computation time and are anchored in an elementary statistical framework. Castellano and Ho (2013a) show that SGPs and PRRs have correlations of about .99 for real data and root mean square differences of approximately 3 on the percentile rank scale. They also demonstrate that PRRs recover benchmark percentile ranks better than SGPs when linear regression assumptions hold, and they identify skewness levels at which SGP recovery exceeds PRR recovery.
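As a rough illustration of the PRR idea, the sketch below regresses current scores on prior scores by ordinary least squares, converts the residuals to percentile ranks, and aggregates them by group with the mean and the median. The function names, the simulated data, and the midpoint percentile rank convention, (rank - 0.5) / n × 100, are assumptions made for this example rather than details taken from Castellano and Ho (2013a).

```python
import numpy as np

def percentile_ranks_of_residuals(y, X):
    """PRRs: percentile ranks (0 to 100) of OLS residuals of y on X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    # Rank residuals 1..n (no ties expected for continuous scores), then
    # apply the midpoint convention, which is an assumption of this sketch.
    ranks = resid.argsort().argsort() + 1
    return 100.0 * (ranks - 0.5) / len(y)

def aggregate_by_group(values, groups):
    """Return {group: (mean, median)} of values within each group."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    return {g: (values[groups == g].mean(),
                float(np.median(values[groups == g])))
            for g in np.unique(groups)}

# Hypothetical data: 4 groups of 50 students, one prior score.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), 50)
prior = rng.normal(size=200)
current = 0.7 * prior + rng.normal(scale=0.5, size=200)
prr = percentile_ranks_of_residuals(current, prior)
summary = aggregate_by_group(prr, groups)   # (meanPRR, medPRR) per group
```

The same aggregation step applies unchanged to SGPs, which is why the mean- and median-based versions of both metrics can be compared on equal footing.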
Like SGPs, PRRs are student-level statistics, but they are easily aggregated to meanPRRs and medPRRs. Individual-level PRRs are computed by first regressing students’ current scores ($Y$) on their scores from $J$ prior time points ($X_1, X_2, \ldots, X_J$) as follows, where $i$ indexes student and $g$ indexes group:

$$Y_{ig} = \alpha + \sum_{j=1}^{J} \beta_j X_{jig} + \varepsilon_{ig} \qquad (1)$$

Here, $\alpha$ is the intercept, the $\beta_j$ parameters are the (overall) regression coefficients for each prior test score included, and $\varepsilon_{ig}$ is the student-level error term.
The residuals for the regression shown in Equation 1 are the simple differences between the observed and expected values given past scores:

$$e_{ig} = Y_{ig} - \hat{Y}_{ig} = Y_{ig} - \Big(\hat{\alpha} + \sum_{j=1}^{J} \hat{\beta}_j X_{jig}\Big)$$
Mean and Median Residuals
Another mathematically simple approach involves the aggregation of raw residuals derived from Equation 1 without performing the percentile rank transformation. In contrast to $\mathrm{meanPRR}_g$,
Residuals from Aggregate Regression
The SGP and PRR metrics first condition and then aggregate in two distinct steps. The RAR metric, in contrast, results from fitting a regression model to aggregated values, reversing the order of operations to aggregating first and then conditioning. The RAR metric thus represents the conditional status of aggregates instead of an aggregate of conditional status. As the name suggests, RARs are the residuals from the regression of average current scores on their J average prior scores. This regression model is often referred to as a “between-groups” regression, as it ignores any within-group variability (e.g., Snijders & Bosker, 2011). The RAR metric can use the same linear mean regression specification as the PRR metric, but the individual scores in Equation 1 are replaced by their group averages, reducing the fitted data from n students to G groups (where G < n unless
Here,
Fixed-Effects Metric
The model supporting FEMs is sometimes described as a “covariate adjustment” model or, more simply, as an analysis of covariance (ANCOVA). In this approach, an aggregate-level “fixed effect” is added to the individual-level linear mean regression of current scores on $J$ prior scores. The fixed effects can be represented by the coefficients ($\gamma_g$) of dummy variables for groups, resulting in distinct intercepts for each group $g$:

$$Y_{ig} = \gamma_g + \sum_{j=1}^{J} \beta_{Wj} X_{jig} + \varepsilon_{ig} \qquad (3)$$
Each group’s FEM is operationalized as its estimate of $\gamma_g$ with respect to a reference group or subject to another constraint such as
The multilevel modeling literature provides many useful links between models. As one illustrative contrast, when group sizes are equal and there is one prior grade predictor ($X$), the overall regression coefficient for $X$, $\beta$ (from Equation 1), is the weighted sum of the between- ($\beta_B$) and within-group ($\beta_W$) regression coefficients (from Equations 2 and 3 respectively):
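For reference, a standard form of this decomposition from the multilevel modeling literature is sketched below. The symbol $\eta^2_X$, the between-group share of the total sum of squares of $X$, is notation assumed here for illustration and may differ from the article's own.

```latex
\beta \;=\; \eta^2_X \,\beta_B \;+\; \left(1 - \eta^2_X\right)\beta_W ,
\qquad
\eta^2_X \;=\;
\frac{\sum_{g} n_g \left(\bar{X}_g - \bar{X}\right)^2}
     {\sum_{g}\sum_{i} \left(X_{ig} - \bar{X}\right)^2}
```

When little of the variation in $X$ lies between groups, $\eta^2_X$ is small and the overall coefficient sits close to the within-group coefficient.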
The focus of this article is not on regression coefficients but on the contrasts between ACSMs. However, there is reason to expect that similarities in slopes will lead to similarities between ACSMs, particularly between FEMs and meanResids. The FEMs can in fact be expressed as a mean residual from a mean regression line with slope
Random-Effects Metric
The model supporting the REM is a two-level “random intercept” model or a random-effects ANCOVA (Raudenbush & Bryk, 2002). It is similar to the model supporting FEMs, but random-intercept models do not parameterize the intercepts directly; rather, they treat them as random variables. Given school-level random intercepts designated as $u_g$ and an overall average group intercept designated as $\gamma_0$:
This random-intercept model also constrains all groups to have the same slope estimates
Even when slopes are similar, FEMs and REMs may differ due to the necessary estimation of each REM ($u_g$). We follow common practice (Rabe-Hesketh & Skrondal, 2012) and shrink REMs toward the overall mean using empirical Bayes estimation. Within any given data set, there will be more shrinkage for smaller group sizes than larger group sizes. Across data sets, there will be more shrinkage for data with more within-group than between-group variation. In the practical scenarios that we will illustrate, we found that the shrunken and unshrunken estimates were almost perfectly correlated and thus do not consider unshrunken estimates further.
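The two shrinkage patterns just described follow from the usual empirical Bayes weight for a random-intercept model. The sketch below treats the variance components as known, which is a simplifying assumption; in practice they are estimated. The function names and the variance values are hypothetical and are not taken from the article.

```python
def shrinkage_factor(tau2, sigma2, n_g):
    """Empirical Bayes weight for a random intercept.

    tau2:   between-group (intercept) variance
    sigma2: within-group (residual) variance
    n_g:    number of students in the group
    """
    return tau2 / (tau2 + sigma2 / n_g)

def shrunken_effect(mean_resid_g, tau2, sigma2, n_g):
    """Shrink a group's mean residual from the fixed part toward zero."""
    return shrinkage_factor(tau2, sigma2, n_g) * mean_resid_g

# Hypothetical variance components, for illustration only.
small_group = shrinkage_factor(tau2=0.05, sigma2=0.95, n_g=25)
large_group = shrinkage_factor(tau2=0.05, sigma2=0.95, n_g=100)
more_between = shrinkage_factor(tau2=0.25, sigma2=0.75, n_g=25)
```

A weight near 1 means little shrinkage. The weight rises with group size and with the share of variance that lies between groups, matching the two statements above.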
Visual Comparison of ACSMs
Figure 1 contrasts the models that support ACSMs by illustrating their fit to a scatterplot of real data. Light gray conditional boxplots show the empirical bivariate distribution of a statewide cohort’s Grade 6 scores given each score point in their prior grade. The linear mean regression line in dark gray has a slope

Figure 1. Contrasting the models used in deriving aggregate-level conditional status metrics for a J = 1 prior-grade empirical test score data set. The student scores are expressed as conditional boxplots of current score on initial status, and the school mean scores are overlaid as solid gray squares. Note. PRRs = percentile ranks of residuals; SGPs = Student Growth Percentiles; FEMs = fixed-effect metrics; REMs = random-effect metrics; RARs = residuals from aggregate regression.
To illustrate SGP operationalization, Figure 1 also shows the median quantile spline in black. (It is actually the line for the .495 quantile, and scores falling between the .495 and .505 quantile splines receive an SGP of 50.) Like PRRs, points above the line will be assigned higher SGPs. The jagged shape of the spline at extreme score points arises from corrections that ensure that quantile regression lines do not cross each other (Betebenner, 2010b).
The dashed line denotes the regression line for the RARs with slope
The fixed-effects model supporting FEMs effectively estimates a common slope for within-group data,
The overall (resids, PRRs), within-group (FEMs), and between-group (RARs) regression lines are visually distinguishable. As described in the previous sections, the overall regression coefficient will be closer to the within-group coefficient when between-group variation on X is relatively low, that is, low ICCs for predictors. This holds in Figure 1, where the slope of the linear mean regression line is closer to the within-group line than the between-group line. The relative dominance of within-group variation in the predictors also explains our inability to distinguish between the fixed- and random-effects lines visually. These similarities frame the theoretical and empirical results that follow.
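The relationship among the three regression lines can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's code: it uses simulated data with one prior score, a hypothetical function name, group-size-weighted between-group sums (equivalent to an unweighted regression of group means when group sizes are equal), and group metrics centered at the grand means as one possible identifying constraint.

```python
import numpy as np

def group_regressions(x, y, groups):
    """Slopes and group metrics for one prior score x and current score y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    _, idx = np.unique(np.asarray(groups), return_inverse=True)
    n_g = np.bincount(idx)
    xbar_g = np.bincount(idx, weights=x) / n_g
    ybar_g = np.bincount(idx, weights=y) / n_g
    # Between-group sums of squares and cross-products (group-size weighted).
    dx_b, dy_b = xbar_g - x.mean(), ybar_g - y.mean()
    ss_b, sp_b = np.sum(n_g * dx_b ** 2), np.sum(n_g * dx_b * dy_b)
    # Within-group sums: deviations from each student's own group means.
    dx_w, dy_w = x - xbar_g[idx], y - ybar_g[idx]
    ss_w, sp_w = np.sum(dx_w ** 2), np.sum(dx_w * dy_w)
    overall = np.polyfit(x, y, 1)[0]          # single-level OLS slope
    within, between = sp_w / ss_w, sp_b / ss_b
    return {
        "overall": overall, "within": within, "between": between,
        "eta2": ss_b / (ss_b + ss_w),         # between-group share of X
        # Group metrics, each centered at the grand means (an assumption).
        "mean_resid": dy_b - overall * dx_b,  # mean residual per group
        "fem": dy_b - within * dx_b,          # fixed effect up to a constant
        "rar": dy_b - between * dx_b,         # residual of group means
    }

# Hypothetical data: 50 groups of 40 with modest between-group variation.
rng = np.random.default_rng(1)
G, n = 50, 40
groups = np.repeat(np.arange(G), n)
x = np.repeat(rng.normal(scale=0.4, size=G), n) + rng.normal(size=G * n)
y = (0.7 * x + np.repeat(rng.normal(scale=0.3, size=G), n)
     + rng.normal(scale=0.6, size=G * n))
out = group_regressions(x, y, groups)
weighted = out["eta2"] * out["between"] + (1 - out["eta2"]) * out["within"]
```

Here `weighted` reproduces the overall slope, and because the between-group share of variance in the predictor is small, the overall slope lands nearer the within-group slope than the between-group slope.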
ACSM Parameter Recovery
We first evaluate each ACSM according to how well it recovers its corresponding parameter or expected value. As described in the preceding sections, each ACSM is derived under a different model with different parameters or intended targets. In this section, we describe our data-generating process (DGP); define the parameters for each ACSM under this DGP; and then evaluate the recovery of each ACSM using bias, root mean square error (RMSE), and correlations with their respective true values. We use the statistical software program R (R Development Core Team, 2009) for all analyses.
Data Generating Process
The ideal DGP allows control over parameters that ensure realistic relationships among variables while also defining parameters for evaluating ACSMs. We considered generating data using the random-intercept model as given in Equation 4. However, such a DGP would privilege the REM over the other ACSMs. Additionally, the other ACSMs do not have parameters readily derivable from parameters specified in Equation 4. Such an approach also precludes specification of the between-group correlations between each respective predictor and the outcome. In the context of year-to-year test scores, in which both the predictors and outcome are test scores, it is not realistic to have an outcome variable that has correlations with predictors substantially different than the correlations among the predictors themselves.
We therefore use a “decomposed multivariate normal” DGP that generates data as the sum of between-group and within-group MVN distributions. The between-group data are the group means, and the within-group data are individual deviations from the group means or the group-mean-centered values. We use 4 years of test scores—one “current grade” and three “prior grades”—following the common practice of using ACSMs when there are at least three prior time points (e.g., Sanders, 2006). The generating model is thus:
Here, $Y_{ig}$ denotes the nominal current test score for student $i$ in group $g$, and
We define control parameters to align with the simulation specifications of Castellano and Ho (2013a) that reflect patterns observed in real data. The mean vector is
Following Castellano and Ho (2013a), for
To mimic a reasonable number of schools observed in practice, we assign students to 500 groups with equal group sizes. We investigate three group sizes, $n_g$ = 25, 50, and 100. Each of the nine crossed combinations of group sizes (25, 50, and 100) by ICCs (.05, .15, and .25) was replicated 100 times.
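One way to sketch a decomposed multivariate normal generator of this kind is shown below. It is an illustration only: the function name is hypothetical, scores are assumed to have zero means and unit total variance with a common ICC across the four grades, and the compound-symmetric correlation matrices are placeholder values rather than the article's control parameters.

```python
import numpy as np

def simulate_decomposed_mvn(n_groups, group_size, icc,
                            corr_between, corr_within, seed=0):
    """Scores = between-group part (group means) + within-group part."""
    rng = np.random.default_rng(seed)
    k = corr_between.shape[0]
    zeros = np.zeros(k)
    # Between-group part: one vector of group means per group, variance icc.
    means = rng.multivariate_normal(zeros, icc * corr_between, size=n_groups)
    # Within-group part: student deviations around the group means,
    # variance 1 - icc, so total variance is 1 for every grade.
    devs = rng.multivariate_normal(zeros, (1 - icc) * corr_within,
                                   size=n_groups * group_size)
    groups = np.repeat(np.arange(n_groups), group_size)
    return means[groups] + devs, groups

def compound_symmetric(k, rho):
    """k x k correlation matrix with a common off-diagonal correlation."""
    return np.full((k, k), rho) + (1 - rho) * np.eye(k)

# Placeholder correlations (assumptions); column 0 is the current grade,
# columns 1 to 3 are the three prior grades.
scores, groups = simulate_decomposed_mvn(
    n_groups=500, group_size=25, icc=0.15,
    corr_between=compound_symmetric(4, 0.9),
    corr_within=compound_symmetric(4, 0.7),
)
group_means_y = scores[:, 0].reshape(500, 25).mean(axis=1)
```

Generating the two parts separately gives direct control over both the ICC and the between-group correlations, which a random-intercept generator would not.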
Expected Values for Each ACSM
In this section, for each simulation condition, we derive the expected values for each ACSM, beginning by establishing properties of group average residuals (meanResids), proceeding to aggregates of transformed residuals (aggregated SGPs/PRRs), and concluding with FEMs. Under random allocation of students to teachers, SGPs and PRRs are uniformly distributed 0 to 100 for each group, resulting in expected values of 50 for mean and median SGPs and PRRs. Under nonrandom allocation of students to teachers, such as in our DGP controlled by nonzero ICCs, we can also derive the expected values of group residuals and thereby mean and median SGPs and PRRs.
We first consider the conditional distribution of the current scores, given the prior scores. Under our decomposed MVN DGP, this distribution has known variance
We denote the residuals for students within, or given, a particular group g using conditional notation,
The mean of the residuals within group g,
Due to the normality of overall residuals
Additionally, because the distribution of residuals is normal, the parameters of the conditional quantiles follow those of a normal distribution. The associated conditional percentile ranks are the definition of SGPs and are equal to
The expected value of medPRRs and medSGPs follows from the symmetry of the normally distributed residuals for each group, such that the median residual equals the mean:
As before, the normality of the conditional distribution results in
The expected values for FEMs, like those for meanResids in Equation 5, can be expressed in terms of population regression coefficients implied by the DGP. That is,
All of these terms are specified or generated as part of the DGP. The population means, $\mu_Y$ and
This DGP does not allow for straightforward expressions of
ACSM Recovery of Expected Values
We evaluate each estimated ACSM (
Here, r denotes replication with R = 100 and g denotes group with G = 500. Bias and RMSE are expressed on the scale of the ACSM of interest and are thus not comparable across all ACSMs. Specifically, the bias and RMSE for the residual-based metrics (FEMs and meanResids) should not be compared to the bias and RMSE for the percentile rank-based metrics (mean/medSGPs and mean/medPRRs). To allow for comparison across all metrics, we also report Pearson correlations between each estimated ACSM and its expected value, averaged over the 100 replications for each condition. Table 2 gives these RMSEs and Pearson correlations. All of the biases were essentially zero; thus, we do not report them in the interest of space. Note that RMSEs are not strictly standard errors for an estimate of a single parameter, as each group has its own expected value. Rather, the RMSEs are average standard errors over the group parameters generated by the DGP.
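These three recovery criteria can be sketched as follows. The sketch assumes that bias and RMSE pool errors over all R × G replication-by-group cells and that the Pearson correlation is computed within each replication before averaging; the function name and array layout are hypothetical.

```python
import numpy as np

def recovery_summary(estimates, expected):
    """Bias, RMSE, and mean Pearson correlation for one ACSM.

    estimates, expected: arrays of shape (R replications, G groups).
    """
    estimates = np.asarray(estimates, dtype=float)
    expected = np.asarray(expected, dtype=float)
    errors = estimates - expected
    bias = errors.mean()                       # pooled over R x G cells
    rmse = np.sqrt((errors ** 2).mean())       # pooled over R x G cells
    # Correlate estimates with expected values within each replication,
    # then average the R correlations.
    corrs = [np.corrcoef(est, exp)[0, 1]
             for est, exp in zip(estimates, expected)]
    return {"bias": float(bias), "rmse": float(rmse),
            "mean_corr": float(np.mean(corrs))}

# Tiny check: every estimate is exactly one unit too high.
expected = np.array([[0.0, 1.0, 2.0], [1.0, 2.0, 3.0]])
summary = recovery_summary(expected + 1.0, expected)
```

Because bias and RMSE stay on each metric's own scale, only the correlation is comparable across residual-scaled and percentile-rank-scaled metrics.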
Table 2. Summary of the Recovery of Each Aggregate-Level Conditional Status Metric’s Parameter as Measured by RMSE and Pearson Correlations
Note. ICC = intraclass correlation; RMSE = root mean square error; medSGP = median Student Growth Percentile; meanSGP = mean Student Growth Percentile; medPRR = median percentile rank of residual; meanPRR = mean percentile rank of residual; FEM = fixed-effect metric; meanResid = mean residual. All of the metrics were essentially unbiased; that is, their bias was not significantly different from 0. The RMSEs for mean and median SGP and PRRs are on the percentile rank scale. The RMSEs for meanResids and FEMs are expressed in terms of standard deviation units of individual scores, $\sigma_Y$, or
First, we review the performance of the percentile rank-scaled metrics. From Table 2, it is apparent that the median SGPs and PRRs underperform their mean-based counterparts in terms of parameter recovery. The RMSEs for the median-based percentile rank metrics are consistently about 1.6 times larger than those of the corresponding mean-based metrics across all simulation conditions. Both meanSGPs/PRRs and medSGPs/PRRs are unbiased, but meanSGPs/PRRs are substantially more efficient. Moreover, results for SGPs and PRRs are almost identical, indicating that under multivariate normality of cross-year test scores, the choice between conditional quantile and conditional mean regression is immaterial for aggregate-level conditional status interpretations.
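A small Monte Carlo experiment gives some intuition for why medians of percentile ranks are noisier than means. The sketch covers only the simplest case, random allocation, in which a group's percentile ranks are uniform on 0 to 100. It does not reproduce the simulation conditions above, and the function name, replication count, and seed are arbitrary choices.

```python
import numpy as np

def mean_median_se(n, reps=20000, seed=0):
    """Simulated standard errors of the mean and median of n uniform
    percentile ranks on the 0 to 100 scale."""
    rng = np.random.default_rng(seed)
    pr = rng.uniform(0.0, 100.0, size=(reps, n))
    se_mean = pr.mean(axis=1).std(ddof=1)
    se_median = np.median(pr, axis=1).std(ddof=1)
    return se_mean, se_median

se_mean, se_median = mean_median_se(n=25)
ratio = se_median / se_mean                 # how much noisier the median is
theory_se_mean = 100.0 / np.sqrt(12 * 25)   # SD of uniform(0, 100) over sqrt(n)
```

In large samples the ratio of the two standard errors for a uniform variable approaches the square root of 3, about 1.73, so a median needs roughly three times as many students as a mean to reach the same precision in this idealized case.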
For the two residual-scaled metrics, FEMs and meanResids, we find that their average efficiency is indistinguishable under these simulation conditions. Recall that FEMs are average residuals from a within-group regression (with
We express the RMSEs for FEMs and meanResids in Table 2 in terms of standard deviation units of Y
Across all metrics, we observe that the two varied simulation factors—group size and grade-level score ICC—affect recovery similarly. Each metric’s recovery of its expected value is better for higher ICCs and larger group sizes as observed by the lower RMSEs and higher Pearson correlations under these conditions.
To clearly compare the performance across all metrics, Figure 2 illustrates the mean correlations for the smallest group size, $n_g$ = 25, as this level shows the largest differences among the metrics. As Table 2 shows, the same general pattern holds for group sizes of 50 and 100 as well. Figure 2 clearly illustrates that the median-based percentile rank metrics (medPRR and medSGP) demonstrate the poorest recovery, the mean-based percentile rank metrics (meanPRR and meanSGP) demonstrate better recovery, and the mean-based residual metrics (meanResid and FEM) show the best recovery. The meanSGP/PRR correlations are more similar to those for the meanResids and FEMs than to those for the medSGPs/PRRs. In summary, the efficiency of the medSGPs/PRRs is noticeably poor compared to the performance of all the other ACSMs.

Figure 2. Pearson correlations between each aggregate-level conditional status metric and its respective parameter value, averaged over 100 replications of simulated multivariate normal data with 500 groups of size 25 and grade-level intraclass correlations (ICCs) of ω = .05, .15, and .25.
Theoretical Differences Between Mean and Median Percentile Rank ACSMs
In this section, we use theoretical statistical results to explain the substantial differences between mean-based and median-based percentile rank metrics observed in Table 2 and Figure 2. As percentile ranks, SGPs and PRRs follow a theoretical uniform distribution with a 0 to 100 range and thus have a standard deviation of $100/\sqrt{12} \approx 28.87$.
The expected value of medSGPs and medPRRs under these conditions is 50, and their standard error is
When there are average group differences ω ≠ 0, the sampling variability of mean and median percentile ranks (SGPs and PRRs), like the expected value, depends upon the variance of the unconditional residuals,

Figure 3. Standard error (a) and relative efficiency (b) of mean versus median percentile ranks of residuals when residual group means are relatively low or high in terms of standard deviation units of the distribution of
Empirical Data
To provide a real-data perspective on these findings, we use the same two statewide data files as those from the real-data analyses of Castellano and Ho (2013a), allowing for a common data reference point between these articles. There are a total of four distinct 4-year longitudinal data sets: two states with data in each of two subjects. The states, referred to as “State A” and “State B,” contrast usefully in size and scaling procedures. The State A data set contains records for a single cohort of about 25,000 students with reading and mathematics scores from Grade 3 to Grade 6 on a vertical scale with increasing variance over time. The State B data set has mathematics and reading scores for a cohort of about 75,000 students from Grade 3 to Grade 6 on within-grade-scaled tests.
The State A data allow for district-level analyses, whereas the State B data allow for school-level analyses. Both data sets include 4 years of data from a single cohort representing sixth graders in the “current” year. For the purposes of these illustrative analyses, students with missing data are excluded. For convenience, we describe district- and school-level grade cohorts as “groups” for State A and B, respectively. To adhere to standard reporting rules, we follow Colorado’s cutoff for reporting medSGPs (Colorado Department of Education, 2012) and exclude results for groups with fewer than 20 students. For aggregate-level SGPs and PRRs, we follow Colorado’s practice of including all students in individual-level computations but excluding students from small groups from aggregate-level analyses. In contrast, FEMs and REMs use aggregate-level information early in estimation; thus, for these metrics, we exclude students from small groups prior to any calculations.
This decision rule excludes about 18% of districts (2.8% of students) in State A and about 20% of schools (1% of students) in State B. In State A, there are 272 districts for reading and 273 for math. The median and mean group sizes for each subject are 46 and 90, respectively, with a maximum size of 1,700. In State B, there are 546 schools in the reading data set and 542 in the math data set. The median and mean group sizes are 133 and 139, respectively, with a maximum size of 380.
Cross-Metric Comparability
In the first analysis, we compared ACSMs in the recovery of their own respective parameters, whereas in this section, we compare metrics to each other directly for both simulated and empirical data. We use rank-based methods to compare the consequences of switching among metrics because of the nonlinear relationship between the residual scale and the percentile rank scale. Ranks are also a useful scale to communicate practical differences. Table 3 displays Spearman rank correlations for simulated and empirical data, and Figure 4 shows empirical distributions of percentile rank differences for groups.
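As an illustration of the rank-based comparison, a Spearman correlation between two metrics can be computed as the Pearson correlation of their ranks. The sketch below uses only the standard library; the function names are hypothetical, and tied values are assumed to receive the average of their ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only ranks enter the calculation, any monotone but nonlinear relationship between two metrics, such as that between a residual scale and a percentile rank scale, yields a correlation of 1.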
Table 3. Spearman Rank Correlations Between Each Pair of Aggregate-Level Conditional Status Metrics
Note. meanSGP = mean Student Growth Percentile; medSGP = median Student Growth Percentile; meanPRR = mean percentile rank of residual; medPRR = median percentile rank of residual; meanResid = mean residual; FEM = fixed-effect metric; REM = random-effect metric.

Figure 4. Comparing each metric with (a) the median Student Growth Percentile (medSGP) metric and (b) the fixed-effect metric (FEM) in terms of absolute differences in the percentile ranks of groups. Findings for simulated data are on the left (for a single replication of simulated multivariate normal data, ICC = .05, 500 groups of size 100), and findings for State B Reading data are on the right. Note. ICC = intraclass correlation; medPRR = median percentile rank of residual; meanSGP = mean Student Growth Percentile; meanPRR = mean percentile rank of residual; REM = random-effect metric; meanResid = mean residual.
Correlations Among ACSMs
The top third of Table 3 gives average correlations over replications for simulated data when ng = 25 and ICCs = .05 (above the diagonal) and ng = 25 and ICCs = .25 (below the diagonal). We chose the smallest group size condition, as it shows the starkest contrast among the metrics and reflects typical class sizes. Thus, these correlations represent possible correlations between ACSMs at the teacher (or classroom) level. The bottom two thirds of Table 3 show correlations for the empirical data in States A and B, in reading and mathematics. We divide the ACSMs into mean-based residual metrics, mean-based percentile rank metrics, and median-based percentile rank metrics with a 3 × 3 “tic-tac-toe” grid in each matrix.
We note three primary observations from the simulated and empirical results in Table 3. First, from the upper left of these 3 × 3 grids, we see that REMs, FEMs, and meanResids are nearly perfectly correlated with each other. This holds even for the simulated data with relatively small group sizes of 25. Second, meanSGPs and meanPRRs are also nearly perfectly correlated with each other and, to a slightly lesser degree, to the mean-based residual metrics. Third, medSGPs and medPRRs are highly correlated with each other but are much less correlated with the other metrics.
The high correlations between REMs, FEMs, and meanResids are predictable from Figure 1, where the three coefficients
The strong similarities between meanSGPs and meanPRRs are aggregate-level extensions of the findings of Castellano and Ho (2013a). Table 3 further shows that these two metrics are highly correlated with the mean-based residual metrics. In the empirical data, the correlations between mean-based residual and percentile rank metrics are highest for State B math. Comparing these correlations for the simulated data, above the diagonal (ICC of .05) and below the diagonal (ICC of .25), we find that these correlations will be higher when unconditional ICCs are higher. Indeed, the State B math data had the highest unconditional grade-level ICCs of the four empirical data sets. Additionally, State B has larger group sizes than State A. This is consistent with simulated results that show higher and more similar correlations among metrics when group sizes are larger.
The median-based metrics are the most different from the other metrics. Correlations of the medSGPs and medPRRs with the FEMs, REMs, and meanResids are much lower than the corresponding correlations for the meanSGPs and meanPRRs. If aggregate-level SGPs and PRRs are intended to be ad hoc approximations to the more comprehensive statistical framework of multilevel models, it is clear that mean-based metrics result in closer approximations than median-based metrics. Notably, Table 3 shows that correlations between SGP-based metrics and PRR-based metrics are much higher than correlations between mean-based and median-based metrics. The relative impact of the choice of aggregation function is far greater than the choice of regression model.
Absolute Differences in Group Percentile Rank
Correlations provide a limited perspective on differences among metrics, as correlations between metrics are generally high but difficult to interpret on a practical scale. Alternatively, we can compare two metrics using absolute differences in the percentile ranks of groups. If a group drops from the 99th percentile on one metric to the 1st percentile on another, the absolute difference in percentile ranks is 98. Large absolute differences indicate dissimilarity between metrics on an interpretable scale. These differences in percentile ranks should not be confused with differences in aggregated SGPs and PRRs that are themselves on the percentile rank scale. As an example, the medSGP of one group is 32, which is greater than or equal to the medSGPs of 36% of all the groups. This group is thus at the 36th percentile on the medSGP metric. The same group has a meanSGP of 40, which corresponds to the 26th percentile on the meanSGP metric. For this group, switching metrics leads to an absolute difference in percentile ranks of |36 − 26| = 10.
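The calculation described in this paragraph can be sketched in a few lines. The function names and the five-group data below are hypothetical and chosen only for illustration; the percentile rank of a group is taken, as in the text, to be the percentage of groups whose metric value is less than or equal to its own.

```python
def group_percentile_ranks(values):
    """Percent of groups whose metric value is <= each group's value."""
    n = len(values)
    return [100.0 * sum(1 for other in values if other <= v) / n for v in values]


def abs_rank_differences(metric_a, metric_b):
    """Absolute difference in group percentile ranks between two metrics.

    metric_a and metric_b hold one value per group, in the same group order.
    """
    pr_a = group_percentile_ranks(metric_a)
    pr_b = group_percentile_ranks(metric_b)
    return [abs(a - b) for a, b in zip(pr_a, pr_b)]


# Hypothetical medSGPs and meanSGPs for five groups.
med_sgp = [32, 50, 45, 60, 28]
mean_sgp = [40, 52, 38, 61, 30]
print(abs_rank_differences(med_sgp, mean_sgp))  # [20.0, 0.0, 20.0, 0.0, 0.0]
```

In this toy example, the first and third groups trade places when the metric changes, so each moves 20 percentile ranks, whereas the other three groups keep their ranks.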
Figure 4 uses boxplots to show the distributions of absolute percentile rank differences between ACSMs and two “reference metrics,” (a) medSGPs because they are widely used in practice and (b) FEMs because they have the best recovery of their expected values (see Figure 2). Simulated results are on the left, and empirical results are on the right. Boxplots in each panel are ordered in descending order of similarity to their reference metric, as measured by the magnitude of the correlations in Table 3. The simulated results on the left use a single, randomly selected replication with group sizes of 100 and grade-level ICCs of .05. These correspond most closely with the group sizes and ICCs of the State B reading data on the right. Any of the other empirical data sets could have been used to illustrate the same findings.
Like Table 3, Figure 4 shows that comparisons of the metrics for the simulated and empirical data are very similar, suggesting that our simulation modeled the real data well for the purpose of ACSM comparison. The rank differences between medSGPs and mean-based metrics are large in both simulated and empirical data (Figure 4a). The rank differences between FEMs and other mean-based metrics are lower for simulated than for empirical data (Figure 4b). Consistent with the results in Table 3, in Figure 4 the ACSMs fall into three distinct categories for both the simulated and empirical data: mean-based residual metrics (REM, FEM, meanResid), mean-based percentile rank metrics (meanSGP and meanPRR), and median-based percentile rank metrics (medSGP and medPRR).
Figure 4 supplements Table 3 as a finer grain picture of the comparability of ACSMs. The small number to the left of each boxplot refers to the median absolute difference in group percentile ranks. This can be loosely interpreted as the typical difference that switching between metrics would make in terms of percentile ranks. For example, the right panel of Figure 4a shows that switching between medSGPs and meanSGPs would change a typical group ranking by 4 percentile ranks using State B reading data. The maximum data point in this boxplot also shows that, at worst in this data set, switching between medSGPs and meanSGPs changes a group ranking by about 30 percentile ranks. In contrast, Figure 4b shows that for both simulated and empirical data, switching between FEMs and medSGPs (or medPRRs) results in a typical group ranking change of 6 percentile ranks with an interquartile range of about 2 to 11 percentile ranks and a maximum change of about 40 percentile ranks. For example, for the State B reading data, one group’s FEM is at or above about 82% of all the groups, making it appear as exceptional, whereas its medSGP places it at the 49th percentile of all the groups, a more mediocre ranking, and a change of 33 percentile ranks. Such groups could receive substantially different accolades or sanctions depending on which metric was used.
An alternative approach to describing metric dissimilarity is to report the proportion of groups that each pair of metrics places in the same quartile or decile, or the proportion of groups that change by one or more quartiles. These summaries are intuitive and practical but depend on the number and location of cut scores. For example, a group with a cross-metric change of 20 percentile ranks may or may not cross a quartile boundary. Figure 4 is a more robust representation that describes the magnitudes of typical and extreme changes in group percentile ranks.
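The dependence of such summaries on cut locations can be made concrete with a short sketch. The helper names are hypothetical, and bins are assumed to be equal-width intervals of the percentile rank scale that are closed on the right.

```python
import math


def quantile_bin(pr, n_bins=4):
    """Map a percentile rank in (0, 100] to a bin index 0..n_bins-1.

    With n_bins=4 the bins are (0, 25], (25, 50], (50, 75], (75, 100].
    """
    width = 100.0 / n_bins
    return max(0, math.ceil(pr / width) - 1)


def proportion_same_bin(pr_a, pr_b, n_bins=4):
    """Share of groups that two metrics place in the same bin."""
    same = sum(
        quantile_bin(a, n_bins) == quantile_bin(b, n_bins)
        for a, b in zip(pr_a, pr_b)
    )
    return same / len(pr_a)


# Two groups that each change by 20 percentile ranks:
# 30 -> 50 stays in the second quartile, 20 -> 40 crosses a boundary.
print(proportion_same_bin([30, 20], [50, 40]))  # 0.5
```

Both groups change by the same amount, yet only one is counted as changing quartiles, which is the sensitivity to cut locations noted above.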
Scale Invariance
To evaluate scale invariance, we use a family of four piecewise transformations as given in Castellano and Ho (2013a). We apply these transformations directly to each of the standardized test score variables (z) in a way that increases or decreases skewness and kurtosis through transformations S(z) and K(z), respectively:
The constant k in these equations controls the change in skewness under S(z) and the change in kurtosis under K(z). Following Castellano and Ho (2013a), we use k = 1.2 for the positive skewness and kurtosis transformations and k = 1/1.2 for the negative skewness and kurtosis transformations. These choices mimic realistic values of skewness and kurtosis observed in practice, and these piecewise transformations follow those sometimes used by state testing programs (e.g., Massachusetts Department of Elementary and Secondary Education, 1999). The variable z represents the standardized grade-level test scores, where the scores are standardized by first subtracting the respective grade-level mean and then dividing by the pooled SD across all the grade levels of interest. This preserves the relative grade-to-grade variability in the empirical data sets. We use J = 3 prior years for this analysis to reflect a common number of prior years used in practice.
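The standardization step can be sketched as follows. This is a minimal illustration under one stated assumption: the pooled SD is taken to be the square root of the average grade-level variance, which is one common definition when every grade has the same number of students. The function name and the input layout are hypothetical.

```python
from statistics import mean, pvariance


def standardize_by_pooled_sd(scores_by_grade):
    """Center each grade at its own mean and divide by a single pooled SD.

    scores_by_grade: {grade: [scores]} for the same cohort of students.
    Assumption: pooled SD = sqrt(mean of the grade-level variances).
    """
    pooled_sd = mean(pvariance(s) for s in scores_by_grade.values()) ** 0.5
    return {
        grade: [(x - mean(s)) / pooled_sd for x in s]
        for grade, s in scores_by_grade.items()
    }
```

Because every grade is divided by the same constant, a grade whose raw scores are twice as variable as another's remains twice as variable after standardization, which is the sense in which relative grade-to-grade variability is preserved.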
The eight ACSMs are estimated for all four transformed data sets. For a given data set, each group has five values for each ACSM—one each from the original (identity transformation), positive skew, negative skew, positive kurtosis, and negative kurtosis transformations. Each group is then rank ordered by each ACSM for each transformation, resulting in each group receiving 5 percentile ranks—one per transformation—for each of the eight ACSMs. The range of these values indicates the extent that a group’s percentile rank would change under this family of plausible transformations. We avoid correlations because they are limited to pairwise comparisons of transformations, and we are interested in a single metric describing the relevant variability of ranks under many plausible transformations. Thus, we quantify the amount of transformation-induced variability by using mean ranges on a common scale of percentile ranks.
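The mean-range summary can be sketched as follows for a single ACSM. The function names, the transformation labels, and the toy values are hypothetical; group percentile ranks are again computed as the percentage of groups at or below each group's value.

```python
def percentile_ranks(values):
    """Percent of groups whose metric value is <= each group's value."""
    n = len(values)
    return [100.0 * sum(1 for other in values if other <= v) / n for v in values]


def mean_rank_range(metric_by_transformation):
    """Mean over groups of the range of percentile ranks across transformations.

    metric_by_transformation: {transformation_name: [metric value per group]},
    with groups listed in the same order under every transformation.
    """
    ranks = [percentile_ranks(v) for v in metric_by_transformation.values()]
    # zip(*ranks) yields, for each group, its ranks under all transformations.
    ranges = [max(group) - min(group) for group in zip(*ranks)]
    return sum(ranges) / len(ranges)


# Toy example: four groups under two of the transformations.
toy = {"identity": [1, 2, 3, 4], "positive_skew": [1, 3, 2, 4]}
print(mean_rank_range(toy))  # 12.5
```

A value of 0 would indicate that no group changes rank under any transformation, so smaller values correspond to greater scale invariance.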
Figure 5 displays the average range of percentile ranks for each ACSM under the five transformations for each of the four empirical data sets. The most scale-invariant metric is the meanSGP metric. The medSGPs are comparable to the remaining metrics. This may seem counterintuitive, particularly because medians are generally recognized as being more stable under transformations. In this analysis of PRRs and SGPs, the transformations are applied to the test score scale, not to the percentile rank scale of SGPs and PRRs, which will and should remain roughly uniformly distributed between 0 and 100. As transformations change individual SGPs and PRRs, the mean leads to more stability across these changes than the median. This result is akin to that for sampling variability, except there is no resampling. This result reflects dependence upon transformations, where rescaling leads to differences that look similar to what we might expect from resampling. Although it may seem contradictory, at the aggregate level, the scale-dependent mean will maximize the scale-independent tendencies of SGPs.

Figure 5. The range of percentile ranks for an average group, under a family of plausible transformations that change skewness and kurtosis. From left, meanSGP = mean Student Growth Percentile; meanPRR = mean percentile rank of residual; medSGP = median Student Growth Percentile; meanResid = mean residual; FEM = fixed-effect metric; REM = random-effect metric; and medPRR = median percentile rank of residual.
Figure 5 also shows that meanPRRs are more robust to transformations than the mean-based residual metrics (FEM, REM, and meanResid). This is due to the percentile rank transformation. When increases in skewness and kurtosis lead to extreme residuals, the PRR transformation will constrain the transformed residual within .1 and 99.9 no matter how large the residual is. Thus, averages of PRRs will be more stable under transformations than averages of residuals themselves. The medPRR metric performs relatively poorly on this criterion for the same reason that medSGPs underperform meanSGPs. Note that the scale invariance of meanResids, FEMs, and REMs is as indistinguishable as the metrics themselves. With regard to comparisons across states, State A has larger mean ranges than State B, due in part to having fewer groups (and thereby larger changes in percentile ranks) and increasing variance across grades.
Conclusion
Although the literature on growth and value-added models is increasingly rich, focused comparisons on a clearly defined subset of models are rare. As we have argued, empirical differences among gain-based models, multivariate models, and models supporting ACSMs are expected, as each addresses different questions. We restricted our focus to ACSMs because their models can support inferences about the status of a group by referencing current performance to expectations given prior scores. Empirical differences can thus be explained in terms of the approaches that different ACSMs take to support conditional status interpretations.
Our analyses support three empirically distinguishable categories of ACSMs. These categories are (1) the mean-based residual metrics (FEMs, REMs, and meanResids), (2) the mean-based percentile rank metrics (meanSGPs and meanPRRs), and (3) the median-based percentile rank metrics (medSGPs and medPRRs). The categories are listed in order of decreasing recovery of expected values (Figure 2). If we had included the medResid metric, it would belong in the third category due to its systematic similarity with medPRRs. If we had included the RAR metric, it would be in its own, fourth category, due to its predictable deviation from all other metrics in a context where within-group variability is substantial. All analyses reveal considerable similarities among ACSMs within each of the categories that we have identified and considerable dissimilarities between categories, particularly the category of median-based percentile rank metrics.
This article is constrained in at least two important ways: It only considers prior test score variables as covariates and it only considers linear, homoscedastic mean regression models. First, we restricted our focus to prior scores as covariates to be in line with the operational estimation of SGPs. However, the inclusion of other covariates such as student demographic covariates would change expected current status and may change the extent that our ACSMs of interest produce comparable results. Moreover, we use error-contaminated prior scores with no adjustment for measurement error in either the prior or current scores. This may bias each ACSM differentially and is an active area of current research (e.g., Akram et al., 2013; Lockwood & McCaffrey, 2014).
Second, as Castellano and Ho (2013a) show, the relationship between SGPs and linear, mean-based ACSMs will decline as the assumptions of the latter are violated. If no accommodations to the model are made, SGPs will provide more accurate conditional status inferences. Although this article demonstrates robustness of findings across different states with different scaling procedures, a useful extension involves identification of pivot points on a continuum of non-normality, for example, levels of skewness or kurtosis beyond which aggregated SGP metrics become markedly and systematically different from other ACSMs. As Castellano and Ho (2013a) note, however, such studies do not universally promote one metric over others as much as emphasize the importance of selecting a model that fits the data. Similarly, extending these analyses to incorporate student- and group-level covariates would be theoretically useful, even if many current state and federal policies discourage the resulting dependence of expectations upon covariates.
A consistent finding is that the aggregation function is a more consequential decision than the regression function or the choice between fixed and random intercepts. There are marked differences between mean- and median-SGPs and PRRs. Although medians support interpretations of “typical values” and are theoretically more appropriate for a noninterval scale, this article presents three findings that suggest considerable advantages of means over medians. The first is the relative efficiency of the mean over the median (Figure 3). The second is the greater alignment between meanSGPs/meanPRRs and their respective expected values (Figure 2) as well as their greater comparability with the FEM, REM, and meanResid metrics (Table 3 and Figure 4). The third is that meanSGPs and meanPRRs are more robust to scale transformations than their median-based counterparts (Figure 5).
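The efficiency argument can be illustrated with a small simulation. This sketch is not the simulation design used in this article: it assumes, for simplicity, that there are no group differences, so that a group's SGPs behave like independent draws from a discrete uniform distribution on 1 to 99. The function name, seed, and number of replications are hypothetical choices.

```python
import random
from statistics import mean, median, pvariance


def simulate_relative_efficiency(group_size=25, n_reps=2000, seed=0):
    """Ratio of the sampling variance of median SGPs to that of mean SGPs.

    Assumption: SGPs within a group are independent and uniform on 1..99.
    Values above 1 indicate that the mean is the more efficient aggregate.
    """
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(n_reps):
        sgps = [rng.randint(1, 99) for _ in range(group_size)]
        means.append(mean(sgps))
        medians.append(median(sgps))
    return pvariance(medians) / pvariance(means)


print(simulate_relative_efficiency())
```

For a continuous uniform distribution, the large-sample value of this ratio is 3, so a simulated ratio well above 1 under these assumptions is in line with the pattern summarized above.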
In light of these findings, the widespread use of the medSGP statistic should be reconsidered. If concerns about the ordinal scale of percentile ranks remain, we recommend a simple alternative that assumes that a normal distribution underlies the percentile rank function:
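One way to operationalize an aggregate of this kind is sketched below. This is offered only as a plausible construction under stated assumptions, not necessarily the exact formula intended: each percentile rank is mapped to a standard normal deviate, the deviates are averaged, and the average is mapped back to the percentile rank scale. The function name is hypothetical, and percentile ranks are assumed to lie strictly between 0 and 100.

```python
from statistics import NormalDist, mean

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def normal_scale_mean_pr(percentile_ranks):
    """Average percentile ranks on a normal-deviate scale, then map back.

    Assumption: a normal distribution underlies the percentile ranks, so
    that inverse-normal deviates can be treated as an interval scale.
    """
    z = [_STD_NORMAL.inv_cdf(p / 100.0) for p in percentile_ranks]
    return 100.0 * _STD_NORMAL.cdf(mean(z))
```

Under this construction, a group whose percentile ranks are symmetric about 50, such as 10 and 90, receives an aggregate of 50, as it would under the arithmetic mean, whereas averages of values concentrated in one tail are pulled toward that tail relative to the arithmetic mean.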
Descriptions of ACSMs often incorporate the terms “growth” and “value added.” These terms are ambiguous, and we recommend reviews such as those by Castellano and Ho (2013b) for “growth” and Reardon and Raudenbush (2009) for “value added” that specify assumptions and articulate the necessary data and model features to address specific questions. We stress the importance of describing ACSMs in terms of what they literally do: summarize the performance of a group by referencing current status to expectations given past scores. Under this definition, dependencies are better anticipated. By understanding aggregate-level conditional status, it is less surprising that the aggregation function is consequential. As stakes rise for accountability and evaluation decisions for teachers, schools, subgroups, and districts, clear descriptions of ACSMs and the practical magnitudes of their dependencies become essential.
Footnotes
Authors’ Note
The authors benefited from the advice of Sophia Rabe-Hesketh of University of California, Berkeley. Any remaining errors are those of the authors alone.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through grant R305B110017 to University of California, Berkeley. The opinions expressed in this article are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
