Abstract
It is often assumed that a vertical scale is necessary when value-added models depend upon the gain scores of students across two or more points in time. This article examines the conditions under which the scale transformations associated with the vertical scaling process would be expected to have a significant impact on normative interpretations using gain scores. It is shown that this will depend upon the extent to which adopting a particular vertical scaling approach leads to a large degree of scale shrinkage (decreases in score variability over time). Empirical data are used to compare school-level gain scores computed as a function of different vertical scales transformed to represent increasing, decreasing, and constant trends in score variability across grades. A pragmatic approach is also presented to assess the departure of a given vertical scale from a scale with ideal equal-interval properties. Finally, longitudinal data are used to illustrate a case when the availability of a vertical scale will be most important: when questions are being posed about the magnitudes of student-level growth trajectories.
Introduction
The key input for any value-added model (VAM) is longitudinal data from standardized assessments that have been administered to students over two or more points in time. Because psychometricians often go through considerable effort to link test scores to facilitate score comparability across grades (i.e., vertical scaling) and because there are many different ways to go about this process (Briggs & Weeks, 2009; Tong & Kolen, 2007), it is intuitive to assume that vertical scaling can have an impact on inferences about value added. This would seem especially to be the case when a VAM specifies a gain score as the outcome of interest. For example, according to Ballou, Sanders, and Wright (2004):
Measuring student progress requires controlling in some fashion for initial level of achievement. This is done most transparently if the pre- and post-tests are on the same achievement scale (“vertically equated”), in which case the analysis can be based on simple differences or gain scores.…The TVAAS [Tennessee Value Added Assessment System] requires tests that are vertically linked—scores for fourth graders, for example, must be expressed on the same developmental scale as scores for third graders, fifth graders, etc. In order to compare the progress of students over time, test forms must be equated across years. (pp. 38, 43)
The purpose of this article is to examine the conditions that would need to be met before the vertical scaling process can be expected to have a significant impact on the ordering of schools or teachers with respect to estimates of value added. When the outcome variable of a VAM is a test score level, the presence or absence of a vertical scale is inconsequential. On the other hand, when the outcome variable consists of test score gains, the vertical scaling process can have an impact, but only when the variability in student achievement on a proposed vertical scale decreases substantially from grade to grade relative to the pattern that would have been observed in an alternative scaling (i.e., scale shrinkage; Camilli, Yamamoto, & Wang, 1993; Clemans, 1993; Hoover, 1984a, 1984b; Yen, 1986, 1988; Yen, Burket, & Fitzpatrick, 1995). We establish this through theoretical argument in the first section of the article and then demonstrate it empirically in the second section using longitudinal item response data with student achievement linked to schools for a medium-sized state from 2003 to 2006. In the third section, we present a heuristic approach to evaluate the possible impact of departures of a scale from the equal-interval ideal. When this approach is applied to the same data set, it appears that inferences about value added are relatively insensitive to the extent of a scale’s departure from the ideal interval scale. This is because VAMs focus attention primarily upon the ordering of schools and teachers, not upon the magnitudes that separate the schools or teachers being ordered. In the fourth section, we establish a growth modeling context where the presence or absence of a vertical scale plays an important role. This context occurs when direct questions are being posed about the magnitudes of student-level growth trajectories over a 5-year period. Yet even in this context, where a vertical scale is desirable to facilitate inferences about student growth, if the model is also being used to categorize school districts in terms of estimated value added, very similar conclusions may be reached about the effectiveness of school districts whether or not the test scores have been vertically scaled. The article concludes with a discussion section.
The Theoretical Framework
A Brief Overview of the Vertical Scaling Process
An underappreciated aspect of creating a vertical scale is the importance of design considerations. What is the construct to be measured and how is it believed to change over time? On what grounds should common items be selected that overlap adjacent grades? How well are the common items aligned with the curriculum and instruction received by students? Although these design issues are of critical importance (cf. Kolen & Brennan, 2004; Peterson, Kolen, & Hoover, 1989), they are outside the scope of the present article, which has more limited ambitions. In what follows, we will optimistically assume that a defensible theory about student growth and development underlies a collection of test items that have been written for the purpose of creating a vertical scale. Our focus will be on the subsequent steps that might be taken to calibrate and transform item responses after a field test has been administered, and the conditions under which this will have an impact on the use of test scores for value-added inferences.
When using IRT-based methods (the predominant approach in large-scale assessment contexts), this process involves at least two implicit stages. In a first stage, the raw scores for students taking grade-specific test forms are transformed through the application of an item response function. This places scores onto a logit scale with an arbitrary mean and standard deviation (SD). Given the common IRT identification constraint that sets the mean and SD of the logit scale for each grade-specific test to 0 and 1, the task in establishing vertical links between the grades is to designate a base grade scale and then link adjacent grades to this scale. In this article, we focus on doing so through a separate, rather than a concurrent, estimation approach (Hanson & Béguin, 2002; Kim & Cohen, 1998). Focusing on the separate estimation approach makes the theoretical argument easier to follow since there is no closed-form expression for the grade-specific transformations to a scale that take place under the concurrent approach. Empirically, however, the results from taking a concurrent approach to create a vertical scale have been shown to be very similar to those from taking a separate approach (Hanson & Béguin, 2002), so it seems likely that the same argument we build for the impact of the separate approach on gain scores would apply to the concurrent approach. A separate linking approach is implemented by embedding common items across grades in a given year of testing. Then, leveraging the IRT property of parameter invariance, these common items are used to estimate linear constants that link a focal grade scale to a base grade scale. The linking constants needed for the transformation are estimated iteratively using a characteristic curve method such as the Stocking–Lord algorithm (Stocking & Lord, 1983).
In a second stage, additional choices are typically made to further transform the vertically linked scale away from the logit metric. Sometimes the transformation is mostly cosmetic, such as when a linear transformation is used to avoid displaying test score results with negative values. But in other cases, the transformation employed may be more elaborate. For example, Kolen and Brennan (2004) describe transformations that could be made to ensure that a score scale takes on a particular distributional shape. Finally, as a last step in establishing the vertical scale, test developers will typically establish the smallest unit of change along the scale, round transformed scores to the nearest integer of this unit, and designate the lowest and highest obtainable scale scores (i.e., the “LOSS” and “HOSS”) for a particular grade.
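The two stages described above can be sketched in a few lines of Python. This is a minimal illustration rather than any test developer's actual procedure: the linking slope and intercept, the reporting-scale constants, and the LOSS and HOSS values are all hypothetical, and in practice the linking constants would come from a characteristic curve method such as Stocking–Lord rather than being supplied by hand.

```python
import numpy as np

def link_to_base(theta, slope, intercept):
    """Stage 1: place focal-grade logit scores on the base-grade scale
    using linear linking constants (hypothetical values supplied by the caller)."""
    return intercept + slope * np.asarray(theta, dtype=float)

def to_reporting_scale(theta_linked, location=500.0, scale=100.0, loss=100, hoss=900):
    """Stage 2: linear rescaling away from the logit metric, rounding to the
    nearest integer unit, and truncation at the lowest and highest obtainable
    scale scores (LOSS and HOSS). All constants here are illustrative."""
    scaled = location + scale * np.asarray(theta_linked, dtype=float)
    return np.clip(np.rint(scaled), loss, hoss).astype(int)

# Hypothetical Grade 4 ability estimates in logits.
theta_grade4 = np.array([-2.1, -0.4, 0.0, 0.7, 7.5])
linked = link_to_base(theta_grade4, slope=0.9, intercept=0.6)
reported = to_reporting_scale(linked)
```

Only the slope of the first-stage link changes the spread of scores in the focal grade relative to the base grade. The second-stage linear rescaling applies the same constants to every grade, so it leaves relative variability across grades unchanged apart from rounding and truncation.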
A Brief Overview of VAM
In the 2010 National Research Council and National Academy of Education report Getting Value out of Value-Added, VAMs are defined as “a variety of sophisticated statistical techniques that use one or more years of prior student test scores, as well as other data, to adjust for preexisting differences among students when calculating contributions to student test performance” (p. 1). According to Harris (2009), “the term is used to describe analyses using longitudinal student-level test score data to study the educational input-output relationship, including especially the effects of individual teachers (and schools) on student achievement” (p. 321). From these definitions, two key features of VAMs are implicit. First, all VAMs use, as inputs, longitudinal data for 2 or more years of student test performance. Second, VAMs are motivated by a desire to isolate the impact of specific teachers or schools from other factors that contribute to a student’s test performance. It follows from this that the output from a VAM is a numeric quantity that is intended to facilitate causal inferences about teachers or schools.
Two of the most commonly applied VAMs are based upon the use of fixed- and mixed-effect regression approaches, respectively. We briefly present each below, focusing attention on whether the model implies the need for longitudinal test scores that have been vertically scaled. Consider first the production function approach typically invoked by economists (Hanushek & Rivkin, 2010; Todd & Wolpin, 2003). Let $Y_{it}$ represent an end-of-year test score on a standardized assessment for student i at time t. Let j, k, and m index unique classrooms, teachers, and schools, respectively. A general expression for a VAM is
In the regression model above, the vector
The Educational Value-Added Assessment System (EVAAS; Sanders, Saxton, & Horn, 1997) has the longest history as a VAM used for the purpose of educational accountability. While a detailed presentation is outside the scope of this article, a key point of differentiation between it and the production function approach presented above can be seen by writing out the equation for a single test subject as
In Equation 2, the achievement of student i in year t is expressed as a linear function of a year-specific average
The EVAAS is often referred to as the layered model because a student’s current grade achievement is expressed as a cumulative function of the current and previous year teachers to which a student has been exposed. For example, applying the model above to the context of univariate longitudinal data that span three consecutive years (i.e., Grades 3 through 5) results in the following system of equations:
In the first of the three equations, μ1 represents the average achievement of students in the base year grade and θ1 represents the deviation from this average for students assigned to a given teacher. In this first equation, θ1 combines both the effectiveness of a student’s teacher and all other factors that could influence a student’s achievement (e.g., socioeconomic status, motivation, etc.). However, when certain assumptions hold about the use of past achievement to adjust for any systematic sorting of students to teachers (cf. Ballou et al., 2004), it becomes possible to interpret θ2 and θ3 as distinct estimates of teacher value added in Years 2 and 3. This can be seen by substituting the first equation into the second equation in the system such that
How Vertical Scaling Can Affect Comparisons Based on Gain Scores
Imagine two tests administered across two adjacent grades. The two tests have been separately placed onto the logit metric using IRT. Denote the two test score scales that result by y and x, where y comes from time t and x comes from t − 1, where the time units are defined by grade levels. The two logit scales are linked by imposing the following linear transformations
For each transformation, the intercept parameters α0 and β0 shift the entire score scale up or down by a constant amount while the slope parameters α1 and β1 expand the scale when
It is easy to show that these linking transformations are inconsequential when the production function VAM (Equation 1) is being used to estimate value added. For example, consider the simplest specification with just a single cohort of students that only conditions on prior grade achievement, $x_i$. Let i index students and j index either a teacher- or school-fixed effect so we can write
Now, consider the same model after the two scales have been vertically linked using Equation 3:
With a little algebra, Equation 5 can be rewritten as
where
In contrast, consider the special case where
Once again, consider the same model after the two scales have been vertically linked:
The values of the additive linking constants α0 and β0 will have a uniform impact that will leave any normative comparisons of value added based on
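The logic of this section can be sketched algebraically. The display below assumes that the two linking transformations take the linear form $y^{*} = \alpha_0 + \alpha_1 y$ and $x^{*} = \beta_0 + \beta_1 x$ with positive slopes. The starred notation is introduced here for illustration only, and the derivation is a sketch under that assumption rather than a reproduction of the article's numbered equations.

```latex
\begin{align*}
y^{*} &= \alpha_0 + \alpha_1 y, \qquad x^{*} = \beta_0 + \beta_1 x \\
y^{*} - x^{*} &= (\alpha_0 - \beta_0) + \alpha_1 y - \beta_1 x \\
              &= (\alpha_0 - \beta_0) + \alpha_1 (y - x) + (\alpha_1 - \beta_1)\, x
\end{align*}
```

Under this form, the additive constants shift every gain by the same amount. When $\alpha_1 = \beta_1$ the last term vanishes and the linked gain is a positive linear function of the unlinked gain, so any ordering of students, teachers, or schools by mean gain is unchanged. When $\alpha_1 \neq \beta_1$, the linked gain also depends on the prior score $x$, and the further apart the two slopes are, the more the ordering based on gains can change.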
Empirically Observed Shifts in Grade-to-Grade Variability From Existing Vertical Scales
To get a better sense for the shifts in variability that are plausible after tests have been vertically scaled, one can examine the empirical differences in the SDs of score distributions across Grades 3–8 in English Language Arts (ELA) and mathematics, respectively, for 16 states with existing vertical scales. This was accomplished using information gathered by Dadey and Briggs (2012) about grade-by-grade scale score means and SDs for 16 states from technical reports covering the years 2007 and 2008. For each state and test subject, all grade-specific SDs are divided by the Grade 3 SD and then differences are computed across adjacent grades. The summary statistics for grade-to-grade SD changes are shown in Table 1.
Table 1. Summary Statistics for Changes in Standard Deviations (SDs) of Vertical Scales Across Adjacent Grade Pairs
Note. ELA = English Language Arts. N = 16 states.
For the average state, the change in SD across grades is generally very small (between about .01 and .06 in absolute magnitude). The largest decrease in SDs across grades for any state was −0.29 in ELA (Grades 3–4) and −0.18 in math (Grades–7). The largest increase was 0.23 in ELA (Grades 6–7) and 0.22 in math (Grades 7–8).
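The computation behind these summary statistics is simple enough to sketch directly: each grade-specific SD is divided by the Grade 3 SD, and differences are then taken across adjacent grades. The scale score SDs in the example are hypothetical and do not correspond to any of the 16 states.

```python
import numpy as np

def adjacent_sd_changes(grade_sds):
    """Divide each grade-specific SD by the lowest-grade SD, then difference
    the resulting ratios across adjacent grades."""
    sds = np.asarray(grade_sds, dtype=float)
    relative = sds / sds[0]
    return np.diff(relative)

# Hypothetical scale score SDs for Grades 3 through 8 in one state and subject.
hypothetical_sds = [40.0, 38.0, 38.0, 42.0, 44.0, 40.0]
changes = adjacent_sd_changes(hypothetical_sds)
```

A negative entry indicates that score variability shrinks from one grade to the next, relative to Grade 3 variability, and a positive entry indicates that it expands.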
To connect this back to the theoretical argument established in the previous section, recall the gain score model represented by Equation 7:
Given Equation 7, when evaluating the impact of choosing a proposed vertical scale over some alternative scale (as will be done in the next section), the variable of interest for any two adjacent grades will be the difference in SD differences. For example, let the superscript “p” indicate a proposed vertical scale and the superscript “a” indicate an alternative scale. Now define the difference in SD differences as
The g subscript indexes the higher of two adjacent grades. Suppose that for the adjacent Grades 5 and 6 on a proposed vertical scale that
An Empirical Demonstration
Data
To examine the impact that differences in grade-to-grade variability can have on the computation of school-level gains, we begin by replicating the process of creating a vertical scale using the empirical data from an existing state’s criterion-referenced large-scale assessment in reading. The longitudinal item responses under consideration here were administered to students in Grades 3 through 7 between 2003 and 2006. The vertical scale for this reading assessment was originally established by the state’s test contractor in 2001 on the basis of a common item nonequivalent groups linking design (Kolen & Brennan, 2004). The vertical score scale created for use in the present study derives from data that were obtained directly from the state’s department of education. There are two student cohorts of interest. The first cohort included students who were in Grade 3 in 2003 and Grade 6 in 2006; the second cohort included students who were in Grade 4 in 2003 and Grade 7 in 2006. The data from these two cohorts of students are used to mimic the original approach taken to create this state’s vertical scale up through the first stage of the process. That is, using these two cohorts of students and common items between adjacent grades and years, we created a vertical score scale on the logit metric using the combination of a three-parameter logistic IRT model (3PLM; Birnbaum, 1968), maximum likelihood estimation, and separate linking. In what follows, we refer to this as the “observed” scale [O] because it is closest to the vertical scale that is used by the state to capture grade-to-grade growth. We subsequently summarize SD patterns by grade only with respect to the first cohort of students who were in Grade 3 in 2003 and Grade 6 in 2006. On the observed scale, the Grades 3 through 6 SDs were 1.00, 0.87, 0.85, and 0.94.
Next, four new scales were created through successive grade-specific scale transformations that were applied in order to change the patterns of grade-to-grade variability:
Constant SD [C]: Mean growth from grade to grade transformed to follow a linear trajectory; SD transformed to be constant [1.00, 1.00, 1.00, 1.00].
Constant increasing SD [CI]: Grades 4 through 6 SDs transformed to increase by 0.15 each year [1.00, 1.15, 1.30, 1.45].
Nonconstant increasing SD [NCI]: Grades 4 and 5 SDs transformed to increase by 0.10 each year, while the Grade 6 SD increases by 0.30 [1.00, 1.10, 1.20, 1.50].
Nonconstant decreasing SD [NCD]: Grades 4 and 5 SDs transformed to decrease by 0.10 each year, while the Grade 6 SD decreases by 0.30 [1.00, 0.90, 0.80, 0.50].
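To see how these SD patterns translate into the "difference in SD differences" discussed earlier, the sketch below compares each scale with the constant SD scale [C]. It assumes that the quantity is computed as the adjacent-grade SD change on the proposed scale minus the same change on the alternative scale. That form is inferred from the verbal description in the text, and the function name is hypothetical.

```python
import numpy as np

# SD patterns for Grades 3 through 6 as listed in the text.
SCALE_SDS = {
    "O": [1.00, 0.87, 0.85, 0.94],
    "C": [1.00, 1.00, 1.00, 1.00],
    "CI": [1.00, 1.15, 1.30, 1.45],
    "NCI": [1.00, 1.10, 1.20, 1.50],
    "NCD": [1.00, 0.90, 0.80, 0.50],
}

def sd_difference_gap(proposed_sds, alternative_sds):
    """Assumed form: (SD_g - SD_{g-1}) on the proposed scale minus
    (SD_g - SD_{g-1}) on the alternative scale, for each adjacent grade pair."""
    proposed = np.diff(np.asarray(proposed_sds, dtype=float))
    alternative = np.diff(np.asarray(alternative_sds, dtype=float))
    return proposed - alternative

# Compare every transformed or observed scale with the constant SD scale.
gaps = {
    name: np.round(sd_difference_gap(sds, SCALE_SDS["C"]), 2)
    for name, sds in SCALE_SDS.items()
    if name != "C"
}
```

Because the [C] scale has no grade-to-grade change in SD, each gap reduces to the SD change on the scale being compared. Under this assumed form, the largest shrinkage is the Grades 5–6 value of −0.30 for the NCD scale.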
The purpose of these transformations was to intentionally create empirical scenarios that varied the shifts in scale SDs from grade to grade. Note that these sorts of transformations, though seemingly difficult to rationalize, are not inconceivable as an approach that could be taken by psychometricians to ensure that a vertical scale has “desirable” properties. Indeed, Kolen and Brennan (2004) and Kolen (2006, p. 178) have argued that in the context of vertical scaling, “the IRT proficiency scale also can be nonlinearly transformed to provide growth patterns that are consistent with expected growth patterns.…Suppose a test developer believes that the variability of scale scores should increase over grades. If the variability of the IRT proficiency estimates does not increase over grades, a nonlinear transformation of the proficiency scale could be used that leads to increasing variability.” Hence, while it is unlikely that a vertical scale would be transformed from the O scale to the NCD scale above, it represents a useful extreme for the purpose of analytical comparisons of gain scores.
Grade-to-grade growth trends for the resulting five scales are shown numerically in Table 2 in terms of means and SDs, and graphically in Figure 1 in terms of effect size units. The horizontal axis in Figure 1 shows three adjacent grade pairings: Grades 3–4, Grades 4–5, and Grades 5–6. Growth for each grade pair is computed as an effect size by subtracting mean scale scores for each grade and dividing by the average SD.

Figure 1. Growth in effect size units for transformed vertical scales. Note. Effect sizes are computed for each scale and grade pair by subtracting the lower grade mean from the upper grade mean and then dividing by the average standard deviation of the two grades. The solid line represents a base scale created to show linear growth with a constant standard deviation across grades.
Table 2. Descriptive Statistics for Vertical Scale Transformations
Note. Means and SDs in logits. Growth is expressed in effect size units as upper grade mean less lower grade mean divided by average SD.
To create a common frame of reference, the observed vertical scale and the three vertical scales that result as a consequence of transformations that increase or decrease the SD of scores from grade to grade are most easily compared to a scale created to have linear growth and a constant SD [C]. We use this scale with constant variability as a frame of reference because this represents the pattern that would be observed if no attempt were made to create a vertical scale at all. This is represented in Figure 1 by a solid horizontal black line. The four dashed lines represent the effect size growth trajectories for the other four vertical scales [O, CI, NCI, and NCD]. The primary factor driving the varying trajectories of these lines is the differences in the magnitudes of grade-to-grade SD shifts.
Comparing School-Level Differences in Gain Scores by Scale
Our theoretical argument is that the transformations associated with the vertical scaling process should only be expected to have an impact on value-added inferences for models that rely upon gain scores when there are large relative differences in grade-to-grade SDs for two competing scales (
The results indicate that a large degree of scale shrinkage is needed in a proposed vertical scale to have a significant impact on the ordering of schools based on gain scores. When compared to the gain scores from a base vertical scale with constant variance across grades, there are 12 correlations of interest (four scales crossed by three grade pairs). Each of these correlations is shown in the cells of Table 3 along with the associated

Figure 2. Scatterplots of Grades 5–6 gain scores by school as a function of scale. Left panel compares scale with constant standard deviation (SD) (y-axis) to scale with decreasing SD (x-axis; r = .57). Right panel compares scale with constant SD (y-axis) to scale with increasing SD (x-axis; r = .86).
Table 3. The Correlations of School-Level Gain Scores as a Function of the Difference in Grade-to-Grade Standard Deviation (SD) Differences
Note. The school-level gains computed for each proposed vertical scale are being compared to school-level gains for a base scale with a constant SD across grades. For details on the variable δg, see Equation 8 and accompanying narrative in text. O = observed vertical scale; C = constant SD; CI = constant increasing SD; NCI = nonconstant increasing SD; NCD = nonconstant decreasing SD.
In practice, a decrease in variability as large as 0.30 SDs was only observed for one of the five adjacent grade pairings for a single state (of the 16) in one test subject. Decreases in variability—to the extent that they were observed at all—were much more likely to be somewhere between −0.05 and −0.15, and these would not have a significant impact on the ordering of schools as a function of average gain scores. In additional analyses not shown here, we used data from the same state’s reading assessment across Grades 3 through 8 and examined the correlation between school-level gains under different vertical scales created from different linking constants by fixing α1 at 1 and letting β1 vary. For values of β1 between 0.90 and 1.10, the correlation between school-level estimates was 0.97. Only for values of β1 below 0.80 did we observe correlations that dropped below 0.90.
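The kind of sensitivity check described in this paragraph can be sketched with synthetic data. The simulation below fixes the upper-grade slope at 1, varies the lower-grade slope β1, and correlates the resulting school mean gains with those obtained when β1 = 1. The data-generating values (number of schools, variance components, seed) are arbitrary assumptions, so the correlations it produces are not comparable to the values reported in the text.

```python
import numpy as np

def school_gain_correlations(beta1_values, n_schools=200, n_students=50, seed=0):
    """Correlate school mean gains under different lower-grade slopes (beta1)
    with school mean gains when beta1 = 1. Synthetic data only."""
    rng = np.random.default_rng(seed)
    # Schools differ in prior achievement and in average growth.
    school_prior = rng.normal(0.0, 0.5, size=n_schools)
    school_growth = rng.normal(0.5, 0.15, size=n_schools)
    x = school_prior[:, None] + rng.normal(0.0, 1.0, size=(n_schools, n_students))
    y = x + school_growth[:, None] + rng.normal(0.0, 0.5, size=(n_schools, n_students))
    mean_x, mean_y = x.mean(axis=1), y.mean(axis=1)
    baseline = mean_y - mean_x
    results = {}
    for beta1 in beta1_values:
        # Upper-grade slope fixed at 1; additive constants do not affect correlations.
        gains = mean_y - beta1 * mean_x
        results[beta1] = float(np.corrcoef(baseline, gains)[0, 1])
    return results

correlations = school_gain_correlations((0.7, 0.8, 0.9, 1.0, 1.1))
```

In a simulation of this form, how quickly the correlation falls as β1 moves away from 1 depends on how much schools differ in prior achievement relative to how much they differ in growth, which is why the synthetic values should not be read as estimates for any real assessment.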
Departures From the Ideal Equal-Interval Scale
Ballou (2009) has pointed out that VAMs assume that test scores have interval scale properties, irrespective of whether the VAM expresses the outcome variable as test score gains or test score levels. With this in mind, it would be hard to argue that any of the vertically linked scales presented in the previous section have equal-interval properties. The observed vertical scale that was the source for the additional transformations described above was created by applying the 3PLM to sets of grade-specific dichotomous item responses and then linking these sets using the Stocking–Lord algorithm. The theory of conjoint measurement (Krantz, Luce, Suppes, & Tversky, 1971; Luce & Tukey, 1964) provides the only analytical framework that could be invoked to evaluate whether the resulting scale could be said to have interval as opposed to ordinal or nominal properties. In practice, such a rationale has seldom been applied empirically, and generally hinges upon making an analogy between the Rasch model and a specific version of the theory of conjoint measurement known as additive conjoint measurement (Borsboom, 2005; Borsboom & Scholten, 2008; Briggs, 2013; Brogden, 1977; Kyngdon, 2011; Michell, 2008a, 2008b; Perline, Wright, & Wainer, 1979).
However, from a pragmatic perspective, one might ask how large the departures of each scale from an equal-interval ideal would need to be before they would have an impact on inferences about school-level value added. To quantify the degree to which a scale departs from an ideal equal-interval scale, one could extend an approach previously employed by Hoover (1984a, 1984b) and more recently by Ballou (2009). The idea is to assess, for each of the five scales that were considered above, the amount of growth that would be required for a student to maintain his or her position at the 10th, 25th, 50th, 75th, or 90th percentiles of the normative score distribution across adjacent grades. These magnitudes are not directly comparable across scales because of the different transformations that were imposed to create each scale. Thus, to allow for such comparisons, we follow Ballou in taking, for each pair of adjacent grades and each scale, the ratio of the gains needed to maintain a position at the 25th, 50th, 75th, or 90th percentiles relative to the gain needed to maintain a position at the 10th percentile.
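The ratio computation described above can be sketched as follows. The function name and the example score arrays are hypothetical. The sketch assumes that the same percentiles are read off the lower- and upper-grade score distributions and that the 10th percentile gain is nonzero.

```python
import numpy as np

PERCENTILES = (10, 25, 50, 75, 90)

def percentile_gain_ratios(lower_grade_scores, upper_grade_scores):
    """Gain needed to hold the 25th, 50th, 75th, and 90th percentile positions
    across adjacent grades, relative to the gain needed at the 10th percentile."""
    lower = np.percentile(lower_grade_scores, PERCENTILES)
    upper = np.percentile(upper_grade_scores, PERCENTILES)
    gains = upper - lower
    return gains[1:] / gains[0]

# Hypothetical lower-grade scores on a logit-like scale.
lower_scores = np.linspace(-2.0, 2.0, 101)

# Same spread in the upper grade: every percentile requires the same gain.
uniform_ratios = percentile_gain_ratios(lower_scores, lower_scores + 0.5)

# Smaller spread in the upper grade: higher percentiles require smaller gains.
shrunk_ratios = percentile_gain_ratios(lower_scores, 0.8 * lower_scores + 0.5)
```

When the upper-grade distribution is a pure shift of the lower-grade distribution, all four ratios equal 1. When variability shrinks in the upper grade, the ratios fall below 1 and decline as the starting percentile rises.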
In the case of a scale with interval properties, one might anticipate that these ratios will be close to 1, as Table 4 illustrates using the canonical example of length, an attribute that can be expressed on a scale with not only interval but ratio properties. According to data from the National Center for Health Statistics, the amount of growth in inches required for boys to maintain the same position in a normative height distribution is almost the same across the five percentiles shown in Table 4. Boys whose initial height is in a higher percentile at 12 months of age have to grow about the same amount to maintain the same relative position as boys whose initial height is at a lower percentile. This supports the notion that the more a given vertical scale has ratios departing from 1 across starting percentiles for any given grade pair, the stronger the circumstantial evidence that the scale has properties that depart from the interval ideal. The evidence is circumstantial in the sense that one cannot rule out the possibility that a scale has interval properties despite having ratios at different percentiles that are greater or less than 1. After all, if one were to discover that 12-month-old boys at the 75th percentile in height tend to grow 3 times as fast as boys at the 25th percentile, this would still not invalidate the units of a ruler as existing on an interval scale. Yet, when these sorts of values differ dramatically as a function of the starting percentile, it may suggest some scale-dependent growth patterns that merit closer examination.
Table 4. Canonical Example of a Scale With Interval Properties: Length
Source. Kuczmarski et al. (2002).
Table 5 reports the same ratios of gains at the 25th, 50th, 75th, and 90th percentiles relative to the 10th percentile gain for the five vertical scales created for this study. For the observed vertical scale [O], the four ratios associated with Grades 3–4 growth were 1.03, 1.04, 0.98, and 0.71, respectively. This implies that it is at the 90th percentile that we see the strongest evidence against an interval score interpretation: the gains required for students to maintain their position at the 90th percentile are just 71% of the gains required to maintain their position at the 10th percentile. In general, for the O scale we see the strongest evidence for departures from an interval interpretation with Grades 5–6 gains. The C scale (constant growth and variability) provides an interesting contrast to the O scale. On the whole, the ratios for this scale are smaller; yet, here the ratios are largest for the percentiles associated with Grades 3–4 gains and smallest for the percentiles associated with the Grades 4–5 and 5–6 gains. In general, all of the versions of the vertical scales have growth patterns that would seem to indicate significant departures from the interval ideal for at least one of the three grade pairs for which gain scores have been computed. This demonstrates that vertical scale transformations can have a notable impact on the way gain magnitudes can and should be interpreted at different points along the scale.
Table 5. Departures From the Interval Ideal for Transformed Vertical Scales
Note. O = observed vertical scale; C = constant SD; CI = constant increasing SD; NCI = nonconstant increasing SD; NCD = nonconstant decreasing SD.
What is less clear is whether departures from an ideal interval scale will have a significant impact on the relative rankings of schools as a function of average gain scores. To get a sense for this, we first compute the mean grade-to-grade score gain for all schools in our sample as a function of the five vertical scales previously introduced. For each school, there are a total of five mean gain scores for each of the three grade pairs. Next, we compute all pairwise correlations within each grade pair as a function of the underlying scale for which the gains were computed. This produces a total of 30 correlation coefficients (10 pairwise correlations within each of the three grade pairs). Higher correlations represent scale pairings where the transformation of one to the other will have less impact on school rankings.
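The bookkeeping for this comparison can be sketched as follows: five scales yield 10 unordered pairs, and three grade pairs yield 30 correlations in total. The school-level gains used below are synthetic placeholders generated for illustration, and the nested-dictionary layout is an assumption rather than the authors' data structure.

```python
from itertools import combinations

import numpy as np

SCALES = ("O", "C", "CI", "NCI", "NCD")
GRADE_PAIRS = ("3-4", "4-5", "5-6")

def pairwise_gain_correlations(gains):
    """gains[grade_pair][scale] holds school mean gains in a common school order.
    Returns one correlation per (grade pair, scale, scale) combination."""
    correlations = {}
    for grade_pair in GRADE_PAIRS:
        for scale_a, scale_b in combinations(SCALES, 2):
            r = np.corrcoef(gains[grade_pair][scale_a], gains[grade_pair][scale_b])[0, 1]
            correlations[(grade_pair, scale_a, scale_b)] = float(r)
    return correlations

# Synthetic gains: a shared school component plus scale-specific noise.
rng = np.random.default_rng(1)
synthetic_gains = {}
for grade_pair in GRADE_PAIRS:
    shared = rng.normal(0.5, 0.2, size=40)
    synthetic_gains[grade_pair] = {
        scale: shared + rng.normal(0.0, 0.05, size=40) for scale in SCALES
    }

correlations = pairwise_gain_correlations(synthetic_gains)
median_correlation = float(np.median(list(correlations.values())))
```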
We find little evidence that the rankings of schools are sensitive to departures from the ideal interval scale. That is, whether school gains are computed from a vertical scale associated with gain ratios close to 1 or with gain ratios far from 1 (see Table 5), the rankings of schools with respect to these gains remain about the same. In 20 of the 30 cases, the pairwise correlation is greater than .90, and the median correlation is .96. The pairwise correlations that most depart from this trend do not seem to be driven by apparent departures from the ideal interval scale, but by pairwise combinations of gain scores from source vertical scales with large
When Does a Vertical Scale Matter the Most?
Thus far, we have demonstrated that different vertical scales are most likely to lead to significantly different gain score rankings when the choice of one scale over the other has a large effect on scale variability from grade to grade. The crux of the issue is that the purpose of vertical scaling is to facilitate inferences about growth in absolute magnitudes, while the purpose of value-added modeling is to facilitate inferences about teacher or school effectiveness in a normative sense. Since most VAMs use test score levels rather than gain scores as the outcome variable of interest anyway, the act of establishing a vertical scale will likely be most relevant when questions are being posed about the average magnitudes of student-level growth trajectories. To help illustrate this, consider the following set of research questions that could be posed using the longitudinal Grades 5–9 reading achievement data from students who attended public school districts in a medium-sized state between 2003 and 2008: (1) What was the average annual growth rate of students in reading? (2) Do growth rates differ significantly as a function of gender, free and reduced lunch eligibility status, English Language Learner status, special education status, or gifted and talented (GT) status? (3) Do initially low-achieving students in Grade 5 grow faster in reading than initially high-achieving students? (4) How do school districts rank with respect to the average growth of their students?
One relatively sophisticated way to address these questions would be to specify a three-level hierarchical linear model (HLM; Raudenbush & Bryk, 2002), where a linear growth function for repeated measures (reading test scores from Grades 5 to 9) is nested within students who are nested within school districts. The three-level model is
where
In a value-added modeling context, the parameter
To simplify the illustration, only students who remain in the same school district from Grades 5 to 9 and who were tested in each grade are included in the analysis. In addition, student-level covariates are fixed to take on whatever value was observed for a given student as of Grade 5. This leaves us with a sample of 20,062 students from 174 distinct school districts. Of these students, 54% were female, 25% were eligible for free or reduced lunch services, 5% are classified with limited English proficiency, 1% with no English proficiency, 8% receive special education services, and 13% were identified as GT.
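For reference, a three-level linear growth model of the kind described above is conventionally written along the following lines. This is a sketch in generic HLM notation; the symbols are assumptions made for illustration and are not necessarily those of the original specification. Here $X_{kij}$ denotes the $k$th Grade 5 covariate for student $i$ in district $j$, and grade is centered at 5 so that the intercept represents Grade 5 achievement.

```latex
\begin{align}
% Level 1 (repeated measures within students)
Y_{tij} &= \pi_{0ij} + \pi_{1ij}\,(\text{grade}_{tij} - 5) + e_{tij} \\
% Level 2 (students within districts)
\pi_{0ij} &= \beta_{00j} + \sum_{k=1}^{K} \beta_{0k} X_{kij} + r_{0ij} \\
\pi_{1ij} &= \beta_{10j} + \sum_{k=1}^{K} \beta_{1k} X_{kij} + r_{1ij} \\
% Level 3 (districts)
\beta_{00j} &= \gamma_{000} + u_{00j} \\
\beta_{10j} &= \gamma_{100} + u_{10j}
\end{align}
```

Under this assumed notation, $u_{10j}$ is the district-specific deviation from the average annual growth rate, which is the kind of quantity a value-added interpretation would focus on.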
We estimate the parameters from the model above using the R package lme4 (Bates, Maechler, & Bolker, 2012) with three different versions of the longitudinal test score outcome. In Version 1 (z score), we sum together the number of multiple-choice items a student has answered correctly in a given grade and then standardize the resulting variable. As a result, the z score outcome variable has a mean of 0 and an SD of 1 across Grades 5 through 9. In Version 2 (theta), we transform the response pattern for each student in a given grade to an estimate of ability using the IRT 3PLM with maximum likelihood estimation. As a result, the θ outcome variable has a mean of about 0.25 logits and an SD of about 1 across Grades 5 through 9. Finally, in Version 3 (vertical scale), we take the ability estimates from Version 2 and link them together across grades to create a vertical scale (i.e., thereby recreating the “observed” scale from the previous section, but this time with 5 as the base grade). The Grades 5 through 9 means for this scale, in logits, are 0.21, 0.67, 1.17, 1.55, 1.88, and the SDs are 0.95, 0.98, 0.94, 0.91, 0.79.
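The Version 1 outcome can be illustrated with a minimal sketch, assuming the standardization is carried out separately within each grade. The function name and the toy records below are hypothetical and are not the study data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_grade(records):
    """Convert raw number-correct scores to z scores within each grade.

    records: list of (student_id, grade, raw_score) tuples.
    Returns a dict mapping (student_id, grade) to a z score.
    """
    by_grade = defaultdict(list)
    for _, grade, raw in records:
        by_grade[grade].append(raw)
    # Mean and (population) SD of the raw scores in each grade
    stats = {g: (mean(v), pstdev(v)) for g, v in by_grade.items()}
    return {
        (sid, g): (raw - stats[g][0]) / stats[g][1]
        for sid, g, raw in records
    }


# Hypothetical toy panel: three students observed in Grades 5 and 6
records = [(1, 5, 10), (2, 5, 14), (3, 5, 18),
           (1, 6, 20), (2, 6, 22), (3, 6, 30)]
z = standardize_within_grade(records)

# Every raw score rose between grades, yet the mean z score gain is zero
z_gains = [z[(sid, 6)] - z[(sid, 5)] for sid in (1, 2, 3)]
mean_z_gain = mean(z_gains)
```

Because each grade is rescaled to the same mean and SD, any change in a student's z score reflects only a change in relative standing.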
Clearly, the concept of growth is entirely different for the z score and θ scales relative to the vertically linked scale. For the first two scales, growth is purely normative: a student with higher scores from one grade to the next is a student whose achievement has improved over time relative to her peers. As such, for these scales it is difficult to make meaningful statements about the average “rate” of growth, because by definition this growth rate is 0. By contrast, according to the vertical scale, the growth rate of the average student is about 0.42 logits per grade, which represents about 44% of the Grade 5 SD and 53% of the Grade 9 SD.
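The vertical scale figures can be recovered from the reported grade means and SDs. The sketch below assumes the average rate is the simple difference between the Grade 9 and Grade 5 means divided by the four grade-to-grade steps; the rate may equally have come from the fitted growth model.

```python
# Reported vertical scale means and SDs (logits) for Grades 5 through 9
means = [0.21, 0.67, 1.17, 1.55, 1.88]
sds = [0.95, 0.98, 0.94, 0.91, 0.79]

# Average growth per grade across the four grade-to-grade steps (assumed method)
annual_growth = (means[-1] - means[0]) / (len(means) - 1)

# Express that rate as a share of the Grade 5 SD and of the Grade 9 SD
share_of_grade5_sd = annual_growth / sds[0]
share_of_grade9_sd = annual_growth / sds[-1]

print(f"{annual_growth:.2f} logits per grade")
print(f"{share_of_grade5_sd:.0%} of the Grade 5 SD")
print(f"{share_of_grade9_sd:.0%} of the Grade 9 SD")
```

The same absolute rate is a larger share of the Grade 9 SD than of the Grade 5 SD because the SD of the vertical scale shrinks from 0.95 to 0.79.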
The HLM parameter estimates for each scale are presented in Table 6. The fixed effects under the row heading “Grade 5 achievement” can be interpreted as the average Grade 5 achievement as a function of student-level covariates. The fixed effect for slope under the row heading “Annual growth rate from Grade 5 to 9” represents the average annual growth rate as a function of student-level covariates. The reference categories for the fixed effects are female students in the state who are not eligible for free and reduced lunch services, are native English speakers, and not classified as either GT or receiving special education services. The seven main fixed-effect coefficients associated with Grade 5 achievement levels are almost identical regardless of scale because they all reference student performance across districts in Grade 5. Where the interpretation of fixed-effect coefficients varies is when they are interacted with growth rates (under the row heading “Annual growth rate from Grades 5 to 9”). Here, we see that inferences about average differences in growth as a function of student characteristics can change significantly when the frame of reference of a scale shifts from normative to absolute. For example, consider students who were classified as GT in Grade 5. On the basis of the z score scale, the average GT student grows by an additional 0.07 SDs in reading achievement each grade, so cumulatively from Grades 5 to 9 she will have gained an additional 0.28 SDs (i.e., 4 × 0.07 = 0.28) relative to her peers. On the basis of the θ scale, the achievement of the average GT student stays about the same from grade to grade relative to her peers (the model predicts a cumulative marginal decrease from Grades 5 to 9 of 0.04 SDs). But compared to her non-GT peers on the basis of the vertical scale, the average GT student grows at a significantly slower rate in an absolute sense.
By Grade 9, a GT student is predicted to have grown about 0.12 logits less than a non-GT student, which is about 15% of the Grade 9 SD. As this example demonstrates, the creation of a vertical scale will have a substantive impact when comparisons of growth are desirable on the basis of absolute magnitudes. For a different example, consider students receiving special education services. According to the z score scale these students are showing dramatic growth relative to their peers—cumulatively the average student receiving special education services grows almost half of a Grade 9 SD more than students who are not receiving special education services. However, this marginal increase in growth appears much less impressive on the θ scale and on the vertical scale. According to the vertical scale, these students only grow about 0.20 logits more than students not receiving special education services from Grades 5 to 9, which represents about 25% of a Grade 9 SD. This is still notable, but only half as large in magnitude relative to the results implied by the z score scale.
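The conversions from logits to SD units in these two examples follow directly from the Grade 9 SD of 0.79 on the vertical scale, as the worked arithmetic below shows. Only figures quoted in the text are used.

```latex
\begin{align}
\text{GT, z score scale (cumulative):} \quad & 4 \times 0.07 = 0.28 \ \text{SDs} \\
\text{GT, vertical scale:} \quad & \frac{0.12}{0.79} \approx 0.15 \ \text{Grade 9 SDs} \\
\text{Special education, vertical scale:} \quad & \frac{0.20}{0.79} \approx 0.25 \ \text{Grade 9 SDs}
\end{align}
```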
Table 6. Hierarchical Linear Model (HLM) Parameter Estimates by Scale of Reading Outcome Measure
Note. Standard errors and p values are excluded because the data consist of the full population of students and, given the sample size, almost all p values are < .001.
Note that even in a normative sense, growth is consistently smaller on the θ scale than it is on the z score scale. One possible explanation for this is that many student subgroups who are significantly below average in achievement as of Grade 5 are much more likely to not only guess on the multiple-choice items given to them on their reading assessment but to become better at guessing the correct answers over time (one cause of this might be teachers coaching students on how to take standardized tests). The use of the 3PLM to scale response patterns may adjust for this spurious source of growth.
Do low-achieving students in Grade 5 grow faster than high-achieving students? For the two normative scales, the answer to this is “not really”: The correlation between the student-level intercept and slope is −0.27 for the z score scale and −0.19 for the θ scale. The answer is different for the vertical scale, where the respective correlation is −0.51, indicating that on average, lower achieving students in Grade 5 grow more through Grade 9 than higher achieving students.
Finally, what about value-added inferences? For each district, we can retrieve empirical Bayes estimates of the district-level random effect. The pairwise correlations between these estimates are 0.91 for the z score scale and θ scale, 0.85 for the z score scale and vertical scale, and 0.87 for the θ scale and vertical scale.
Whether the choice of scale would have a significant impact on value-added interpretations would depend upon how these district estimates would be used. If used to rank districts according to quintiles of the effectiveness distribution, then even a correlation as high as 0.91 could lead to significant shifts across quintiles. On the other hand, if the estimates were only to be used to categorize districts in the tails of the distribution that are significantly different from average, it is much less likely that districts would see different categorizations by choice of scale with correlations this high.
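The point about quintiles can be explored with a small simulation. The sketch below assumes the two sets of district estimates are bivariate normal with a given correlation, which is a simplification and not a property established for the actual estimates. The function names are hypothetical; only the 174 districts and the 0.91 correlation come from the text.

```python
import random


def quintile_labels(values):
    """Label each value 0..4 according to its rank-based quintile."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(4, rank * 5 // len(values))
    return labels


def share_changing_quintile(n_units, rho, seed=0):
    """Simulate two correlated sets of estimates and return the share
    of units whose quintile differs between the two sets."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_units):
        x = rng.gauss(0, 1)
        e = rng.gauss(0, 1)
        first.append(x)
        # Construct a second estimate with population correlation rho
        second.append(rho * x + (1 - rho ** 2) ** 0.5 * e)
    q1, q2 = quintile_labels(first), quintile_labels(second)
    return sum(a != b for a, b in zip(q1, q2)) / n_units


# 174 districts and a correlation of 0.91, as in the comparison above
share = share_changing_quintile(174, 0.91)
print(f"Share of simulated districts changing quintile: {share:.2f}")
```

Rerunning with different seeds, or with the 0.85 and 0.87 correlations, gives a rough sense of how much reshuffling across quintiles is compatible with correlations of this size.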
Discussion
The purpose of VAMs is to support inferences about the effects of teachers and/or schools on student achievement. But these effects have a fundamentally normative interpretation—a school is considered “effective” if the value it appears to have added to student achievement is significantly larger than the average for all other schools to which it is being compared. Because of this, additive changes to a test score scale from grade to grade will not have an impact on value-added inferences. This is true even when a VAM uses gain scores as a dependent variable; the ordering of teachers and schools as a function of average gain scores is only sensitive to scale transformations that lead to significant decreases in score variability across grades. It follows from this that the decision to create a vertically linked score scale will only have an impact on value-added inferences based on gain scores when the process leads to substantial scale shrinkage relative to what would have been observed if a different approach had been taken to create the vertical scale, or if the scores had not been linked at all. This was shown to be the case theoretically and then demonstrated empirically. When school-level gain scores from Grades 5 to 6 were computed for two vertical scales—one that had been transformed to have constant variability across grades and the other transformed to have a 0.30 SD decrease—there was a significant change in the ordering of schools from one scale to the other.
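A toy example makes the distinction concrete. The sketch below uses two hypothetical schools and models shrinkage crudely as multiplying every post-test score by 0.7. An actual linking procedure would transform scores differently, so this is an illustration of the mechanism only.

```python
def mean_gain_by_school(pre, post):
    """Average post-minus-pre gain for each school."""
    return {
        school: sum(b - a for a, b in zip(pre[school], post[school]))
        / len(pre[school])
        for school in pre
    }


def ranking(gains):
    """Schools ordered from largest to smallest mean gain."""
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical scores: School A starts high, School B starts low
pre = {"A": [1.8, 2.0, 2.2], "B": [-0.2, 0.0, 0.2]}
post = {"A": [2.8, 3.0, 3.2], "B": [0.7, 0.9, 1.1]}

base = ranking(mean_gain_by_school(pre, post))

# Additive change: every post-test score shifted by the same constant
shifted_post = {s: [y + 5.0 for y in ys] for s, ys in post.items()}
shifted = ranking(mean_gain_by_school(pre, shifted_post))

# Shrinkage: post-test variability reduced relative to the pre-test
shrunk_post = {s: [0.7 * y for y in ys] for s, ys in post.items()}
shrunk = ranking(mean_gain_by_school(pre, shrunk_post))

print(base, shifted, shrunk)
```

The additive shift raises both schools' mean gains by the same amount and leaves their order intact, whereas the shrinkage reduces the gain of the initially higher-scoring school by more and reverses the order.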
This comparison captures a worst-case scenario if scale shrinkage represented the empirical truth about student achievement over time. Suppose for a sequence of tests across grades that if a vertical scale were to be established, one would in fact observe substantial scale shrinkage. Suppose further that instead of creating vertical links, grade-specific test scores are (a) standardized within each grade or (b) calibrated using an IRT model but not linked. In case (a), the variability of scores across grades would stay constant by definition; in case (b), because of the typical IRT N(0,1) identification constraint on the population distribution of ability it would also stay roughly constant. In either case, true scale shrinkage would be obscured by not creating a vertical scale, and this would distort inferences about value added for models with gain scores as the outcome variable of interest.
If the process of establishing a vertical score scale could always be trusted to provide test users with insights about the empirical reality of scale score variability, then it would always be prudent to create a vertical scale to underlie the computation of gain scores and/or growth trajectories. A recent review of existing vertical scales examined 160 different grade-to-grade SD changes (16 states × 5 grade pairs × 2 test subjects) and found only two examples of scale shrinkage that would imply
To a large extent, the issue of whether a scale can be treated as though it has interval properties is prior to the issue of whether or not scales for adjacent grades can be linked together. Along these lines, Ballou (2009) has argued that departures from an idealized interval scale could create serious problems for any of the commonly used VAMs presented in the second section of this article, because most of them make the implicit assumption that the outcome variable is a continuous variable with equal-interval properties. There are, in fact, rigorous ways that such an assumption could be tested (Briggs, 2013; Kyngdon, 2011). In the present article, we presented a less rigorous but much more easily implemented heuristic that can be used to establish the extent to which a given scale departs from the interval ideal. The basic idea is to compare competing scales with respect to departures from a percentile gain ratio of 1. The empirical question is whether observing a scale with a greater departure from the ideal has an impact on the intended comparisons in any pragmatic sense. In the example considered here, differences in a scale’s departure from the interval ideal did not appear to have a significant impact on the ordering of schools as a function of gain scores.
Vertical scales are desirable when direct inferences are to be made about how much a student has learned over two or more points in time. In this article, we provided the example of specifying a linear growth curve model with three different reading outcome scales, two that were normative in nature and a third that had been vertically scaled. The choice of scale led to substantively different answers to questions such as “Do students receiving special education services grow faster in their reading achievement than students who do not receive special education services?” For the two normative scales, questions about how much the average student has grown must be reconceptualized in terms of how much the average student’s achievement has increased relative to her peers. Nonetheless, note that when the growth curve model was used to generate value-added estimates at the district level, the choice of scale had a relatively small impact on the ordering of districts.
It is possible that choices in vertical scaling would have a more significant impact when they are used as a basis for simple linear models that project student achievement into the future. For example, in some states, a vertical scale might be used as a means of setting vertically articulated cut points across grades through the process of standard setting. Since projections of student achievement are evaluated relative to these cut points, if two different vertical scales led to different cut point locations, this could change the cumulative distribution of students below a given cut point. But in general, vertical scales seem much more likely to facilitate meaningful interpretations about growth when the focus is on individual students rather than the teachers or schools in which they are situated.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research in this article was supported by a grant from the Carnegie Corporation.
