Abstract
This article describes an extension to the use of heteroskedastic ordered probit (HETOP) models to estimate latent distributional parameters from grouped, ordered-categorical data by pooling across multiple waves of data. We illustrate the method with aggregate proficiency data reporting the number of students in schools or districts scoring in each of a small number of ordered “proficiency” levels. HETOP models can be used to estimate means and standard deviations of the underlying (latent) test score distributions but may yield biased or very imprecise estimates when group sample sizes are small. A simulation study demonstrates that the pooled HETOP models described here can reduce the bias and sampling error of standard deviation estimates when group sample sizes are small. Analyses of real test score data demonstrate the use of the models and suggest the pooled models are likely to improve estimates in applied contexts.
States administer millions of standardized assessments to public school students annually as a part of their school accountability systems. The results of these assessments are often made publicly available only in highly coarsened form and so are much less useful than they might be. Many states, for example, report the number of students in a particular school or district scoring in each of a small number of ordered performance categories, such as “basic,” “proficient,” or “advanced,” rather than reporting the overall mean and standard deviation of students’ scores. These are referred to as “coarsened” test score data because they arise from coarsening continuous test scores according to a set of predetermined cut scores. Such data have many widely recognized shortcomings (Ho, 2008; Ho & Reardon, 2012; Holland, 2002; Jacob et al., 2014) but continue to be a primary, and sometimes the only, publicly available source of state or district achievement test data. Having access to estimates of the mean and standard deviation of test scores can support a wider range of interpretations and analyses, ultimately leading to more accurate and useful interpretations about student achievement.
Reardon et al. (2017) described how heteroskedastic ordered probit (HETOP) models can be used to estimate the underlying means and standard deviations of the test score distributions based on coarsened test score data via maximum likelihood (ML), thus overcoming some limitations of the coarsening. In addition, because HETOP models use only ordinal information in the data, they do not rely on common interval scale assumptions. This fact provides some interpretational benefits and allows the models to be connected to other widely used ordinal statistics, as we describe in more detail below. Use of the HETOP model in this context does require that the coarsened scores in each group be based on a common test (or other measure) across groups that is coarsened using a common set of cut scores. At the same time, HETOP models can readily be applied to other contexts in which grouped, ordered-categorical scores are available, and there is a need to summarize or compare the underlying distributions across groups. Examples include analyzing the aggregate responses to a Likert-style survey item across groups or across time, comparing aggregated Apgar (1953) scores across hospitals or regions, or analyzing continuous variables, such as income, that are reported in ordered categories in aggregate data sources such as the census.
The HETOP model described by Reardon et al. (2017) has some important limitations, however. When group sample sizes are small, the standard deviation estimates produced by the HETOP model are negatively biased and have large sampling variances (Reardon et al., 2017). Sparse data are the primary cause of this problem; when some groups have no observations in one or more categories, the coarse data provide limited information about the underlying distribution. In some cases, finite ML estimates may not exist (Agresti, 2013). These sparse data problems can occur frequently, particularly in the context of analyzing coarsened test score data, where group sample sizes are often small and the cut scores used to coarsen the original test scores may be asymmetrically located throughout the distribution.
Researchers have proposed several methods to improve small-sample HETOP estimates. To illustrate how these approaches work, consider a case in which a HETOP model is used to estimate, from coarsened proficiency data, the distribution of mathematics achievement of third graders in each school across an entire state. As described in prior work, the HETOP model requires that all students complete the same test and that scores be coarsened using a common set of cut scores across all schools. To overcome small-sample problems, Reardon et al. (2017) proposed using models that constrain standard deviations to be equal across some or all schools in the sample. These constrained models attempt to improve standard deviation estimates for schools with small sample sizes by borrowing information from other small schools and estimating a single, common third-grade mathematics standard deviation parameter for these small schools. In their most extreme form, the constrained models estimate only a single standard deviation parameter for all schools, regardless of size. Lockwood et al. (2018) describe Bayesian HETOP models that use a form of shrinkage estimators to improve small-sample estimates by borrowing information from other schools that are similar on observed covariates. Both of these approaches rely on borrowing information across groups (schools, in this case) to improve small-sample estimates, which can preclude the study of heterogeneity of within-group variances and rely on the potentially unrealistic assumption that the within-group variances are equal.
In this article, we propose a generalized version of the HETOP model, which we refer to as a pooled HETOP model, that can be used to estimate multiple latent distributions for each group simultaneously when coarsened data are available from multiple measures or time points. Returning to the case of achievement testing, analysts will often have access to additional sets of coarsened data for each school based on tests administered in other grades, years, or subjects. The pooled HETOP model allows these distributions to be estimated simultaneously, even when the tests and cut scores vary across grades, years, or subjects. Estimating these distributions simultaneously allows the model to use information from the same school in other grades, years, or subjects to improve estimates rather than borrowing information from different schools within the same grade, year, or subject. The intuition behind our approach is that when possible, it is preferable to pool information from the same group observed on different occasions rather than to pool information across different groups observed on the same occasion. This is partly an empirical question, and we analyze test score data from a national database to evaluate the trade-off between pooling across versus within groups in the context of aggregate coarsened test score data.
The remainder of the article is organized as follows. The Statistical Models section provides an explanation of the HETOP model in the context of analyzing coarsened test score data and describes an extension of the model to define what we refer to as the pooled HETOP model, which can be used to estimate distributions across multiple tests simultaneously. The Empirical Test of Pooled Model Assumptions section analyzes test scores in a national database to evaluate the plausibility of assumptions made in the pooled HETOP model and to provide empirical evidence that placing constraints within rather than between groups is preferable. The Simulation section uses a Monte Carlo simulation to evaluate how well the pooled HETOP model can recover parameters using small sample sizes under known conditions and compares performance to the standard HETOP model and a constrained homoskedastic ordered probit model. The Real Data Example section uses school-level coarsened proficiency data from a statewide mathematics assessment to illustrate the use of a pooled HETOP model in practice. The Discussion section concludes with a brief discussion.
Statistical Models
To formalize discussion of the HETOP model, let there be a set of G groups (e.g., schools or districts). Students within each group take the same test, and their scores are coarsened into one of K ordered proficiency categories using a common set of cut scores across all groups. We assume that there is an underlying, normally distributed latent variable
where
We do not observe the values of
The model-implied proportion of students in group g scoring in category k is
where
Following the notation of Reardon et al. (2017), let
where A is a constant based on the multinomial distribution. The scale of the
Assumptions and Interpretation of the HETOP Model
The primary assumption of the HETOP model is that test score distributions are respectively normal and thus a probit link can adequately summarize the data (Albert & Chib, 1993; Ho & Haertel, 2006; Ho & Reardon, 2012; Reardon & Ho, 2015). Let y denote the latent test scores in their original, continuous metric. The scores are said to be respectively normal if there is a single, monotonic function
It would be possible to use alternate within-group distributional forms such as logistic distributions. In that case, the assumption would be that the latent distributions were respectively logistic. We elect to use normal distributions (i.e., a probit link function) due to their familiarity for many researchers and because analyses of real test score data by Reardon et al. (2017) suggest the respective normality assumption is reasonable and likely to be satisfied in practice when analyzing coarsened test score data. Prior research using similar methods in the two-group case to estimate achievement gaps suggests these models are likely to be robust to violations of respective normality and that the probit transformation may yield more accurate estimates than the logit transformation (Ho & Reardon, 2012).
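To make the role of the probit link concrete, the following is a minimal sketch of how model-implied category proportions and a multinomial log-likelihood could be computed under within-group normality. The function names (`category_probs`, `hetop_loglik`) and the example values are hypothetical and are not taken from the authors' Stata program; the sketch only evaluates the likelihood and does not perform estimation or impose identification constraints.

```python
import math
from statistics import NormalDist

def category_probs(mu, sigma, cuts):
    """Proportions in len(cuts) + 1 ordered categories for a N(mu, sigma) latent score."""
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    # Each category probability is a difference of adjacent CDF values.
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def hetop_loglik(counts, mus, sigmas, cuts):
    """Multinomial log-likelihood (up to an additive constant) summed over groups."""
    total = 0.0
    for n_g, mu, sigma in zip(counts, mus, sigmas):
        probs = category_probs(mu, sigma, cuts)
        total += sum(n * math.log(p) for n, p in zip(n_g, probs) if n > 0)
    return total

# Hypothetical example: one group, three common cut scores, four categories.
cuts = [-1.0, 0.0, 1.0]
probs = category_probs(mu=0.0, sigma=1.0, cuts=cuts)
counts = [[16, 34, 34, 16]]
ll_good = hetop_loglik(counts, mus=[0.0], sigmas=[1.0], cuts=cuts)
ll_bad = hetop_loglik(counts, mus=[1.0], sigmas=[1.0], cuts=cuts)
print([round(p, 4) for p in probs], ll_good > ll_bad)
```

Replacing `NormalDist` with a logistic CDF would give the respectively logistic variant mentioned above.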
The HETOP model parameters can be viewed as ordinal statistics because they rely only on ordinal information in the data. That is, the
Problems With the HETOP Model
Although the HETOP model works well for recovering the means and standard deviations in the
Reardon et al. (2017) considered two possible solutions to these challenges. The first was to fit a homoskedastic ordered probit (HOMOP) model that constrains all groups to have a common standard deviation. The second was a partially heteroskedastic ordered probit (PHOP) model that estimates a single, pooled standard deviation for all groups with sample sizes below a set threshold. The HOMOP model makes the potentially unrealistic assumption that all groups have equal standard deviations, precluding the study of heterogeneity of within-group variances. The PHOP model allows for the study of heterogeneity among some groups but entails the arbitrary constraint that a subset of groups (here, those with sample sizes below some threshold) have a common standard deviation.
Lockwood et al. (2018) describe a Bayesian model that addresses these challenges by borrowing information from other groups and from covariates. As anticipated, the Bayesian model solves the identification and existence problems and reduces sampling error of standard deviation estimates, but at the cost of additional bias and the requirement that analysts define or estimate appropriate prior distributions for the latent group parameters.
In the context of recovering achievement test score distributions for schools, each of these approaches borrows information from students in other schools taking the same test in the same year because the models are defined assuming that the coarsened data are from a single test with a common set of cut scores. In the next section, we describe a generalized version of the HETOP model that can be used to estimate multiple latent distributions for each group simultaneously, even if the distributions are for different measures coarsened using different cut scores. Estimating the distributions simultaneously allows one to borrow information from students in the same school taking tests in these additional years, grades, and subjects. This approach will be preferable, in theory, if borrowing information from the same group provides better estimates than borrowing information from other groups. This could occur, for example, if there is more variability in the relative magnitude of parameters across schools (within time points) than within schools (across time points). This is an empirical question that we investigate in the Empirical Test of Pooled Model Assumptions section with a national database of real test score data, where we find evidence that there is greater variability in standard deviations across districts than within districts over time.
The Pooled HETOP Model
When analysts have test score proficiency counts from multiple test administrations across years or grades for the same G groups, it is possible to pool information across administrations, resulting in a more general model that may also improve the estimates of some parameters, such as the estimates of
and that the observed data,
The goal is to estimate the school and grade-specific parameters
If we model the mean and standard deviation parameters with parametric functions of grade and group, with
Let
For now, we assume there are the same number of cut scores in each grade level (though they do not need to be equal across grades), but it is possible to relax this assumption.
To connect with the models above, fitting the HETOP model separately within each grade is equivalent to having fully nonparametric functions
Because this model estimates a single standard deviation parameter per group that is constant across grades, we refer to it as a “fully pooled HETOP model.” Second, we define a model that estimates the scale parameter for each group with a group-specific linear function of grade using
We refer to this as the “linear trend pooled HETOP model.” In Equation (10),
Although there may be very little information with which to estimate a group’s mean and standard deviation in a single year or grade, these models leverage additional data by pooling across multiple grades of data. While the model described here assumes data from multiple grades are available, the extension to additional dimensions (e.g., years or subjects) is straightforward. Pooled HETOP models can be applied most flexibly when pooling across time (e.g., grades or years) rather than subjects, although this will depend on both statistical and substantive considerations as we discuss below. In addition, while we focus on using the pooled model to improve small-sample standard deviation estimates, the models could be extended to have a functional form for the means or to include additional covariates in the model that represent other group variables (such as school characteristics or aggregate student demographic information). We focus on the standard deviations because prior work suggests small-sample standard deviation estimates are more problematic than small-sample mean estimates.
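The pooling idea can be sketched as a single log-likelihood that sums over groups and grades, with grade-specific cut scores and a group-specific function for the scale parameter. The sketch below assumes the scale is modeled on the log standard deviation metric as a linear function of grade (an assumption of this illustration, since the exact parameterization is not reproduced here); setting every slope to 0 gives a fully pooled form. All names and values are hypothetical.

```python
import math
from statistics import NormalDist

def category_probs(mu, sigma, cuts):
    """Category proportions for a N(mu, sigma) latent score and given cut scores."""
    dist = NormalDist(mu, sigma)
    cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def pooled_loglik(counts, mus, log_sd_intercepts, log_sd_slopes, cuts_by_grade):
    """counts[g][r] is the count vector for group g in grade r.

    Assumed form: log(sigma_gr) = intercept_g + slope_g * r.
    """
    total = 0.0
    for g, group_counts in enumerate(counts):
        for r, n_gr in enumerate(group_counts):
            sigma = math.exp(log_sd_intercepts[g] + log_sd_slopes[g] * r)
            probs = category_probs(mus[g][r], sigma, cuts_by_grade[r])
            total += sum(n * math.log(p) for n, p in zip(n_gr, probs) if n > 0)
    return total

# Hypothetical data: one group, two grades, different cut scores in each grade.
counts = [[[5, 10, 10, 5], [3, 12, 9, 6]]]
mus = [[0.0, 0.1]]
cuts_by_grade = [[-1.0, 0.0, 1.0], [-0.5, 0.2, 1.2]]

fully_pooled = pooled_loglik(counts, mus, [0.0], [0.0], cuts_by_grade)
trend = pooled_loglik(counts, mus, [0.0], [0.1], cuts_by_grade)
print(fully_pooled, trend)
```

Because the cut scores enter only through `cuts_by_grade[r]`, nothing in this structure requires the tests or cut scores to be the same across grades.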
The pooled HETOP models defined here treat the group parameters as fixed effects to be estimated individually. The data structure described above can also be conceptualized as a multilevel data structure, with repeated observations nested within groups. One could potentially treat the group-level parameters as random effects, estimating the distributions of random effects and using a second step to predict values for specific observations. However, because the tests and cut scores can vary across the different levels (i.e., repeated grades or years), the model allows for heteroskedasticity, and individual-level data are not available, estimating these models would likely not be possible using standard ordered mixed effects regression models. The Bayesian HETOP model described in Lockwood et al. (2018), for example, treats the group parameters as random variables but was developed under the assumption that a single, common set of cut scores was used to coarsen all observed scores. Lockwood et al. discuss additional considerations when selecting between models that treat group parameters as fixed (i.e., directly estimated) or random effects.
Pooled HETOP Model Identification
Because the latent
There are different ways to select constraints that satisfy these requirements and that result in statistically equivalent models, where parameters will be linear transformations of one another, and the model log-likelihoods will be equal. One possibility, for example, would be to fix the first cut score in each grade level to a fixed value (e.g., to 0) and then constrain the second cut score for Ps
of the grade levels to another fixed value (e.g., to 1). In the linear trend pooled HETOP model, another option is to constrain the weighted sum of group means to be 0 within each grade and constrain the weighted sum of the
These constraints assume that ML estimates exist for each relevant parameter. Certain patterns of sampling zeroes can prevent finite ML estimates from existing for some samples, even when the model specifications and data structure (e.g., number of grades, number of categories, and number of constraints) should, in theory, support model estimation. For example, if all observations in a single group are in the highest or lowest category in a given grade, a finite ML estimate will not exist for this group mean and hence for the model overall, despite having a sufficient number of grades, categories, and constraints to identify the model as described above. This problem arises due to patterns in some samples of data rather than due to the specification of the model. In the Simulation section, we describe an adjustment that can be made to sampled frequency counts to ensure the existence of finite ML estimates for all samples. Placing additional structure on the model, for example, by modeling the group means with a linear trend in
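The non-existence of a finite ML estimate can be illustrated numerically. In the sketch below, all observations in a hypothetical group fall in the highest category; holding the standard deviation and top cut score fixed (hypothetical values), the log-likelihood keeps increasing as the group mean grows, so no finite maximizer exists.

```python
import math
from statistics import NormalDist

def loglik_top_only(mu, n=10, sigma=1.0, top_cut=1.0):
    """Log-likelihood when all n observations fall in the top category."""
    # P(score > top_cut) for a N(mu, sigma) score, written via symmetry of the normal.
    p_top = NormalDist().cdf((mu - top_cut) / sigma)
    return n * math.log(p_top)

# The log-likelihood rises toward 0 as mu increases without bound.
values = [loglik_top_only(mu) for mu in (0.0, 2.0, 4.0, 8.0)]
print(values)
```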
Pooled HETOP Model Assumptions and Standardization
The HETOP model assumes that the test score distributions are respectively normal and were coarsened with common cut scores within grades, years, and subjects. The fully pooled and linear trend pooled HETOP models place additional constraints on the relative magnitude of group standard deviations, which imply assumptions about the overall structure of group standard deviation parameters. To aid with the interpretation of results, once estimates of
where
Standardizing estimates within grades makes the assumptions of the pooled HETOP models slightly less restrictive. In the fully pooled HETOP model, for example,
will be constant across all schools (g) for a fixed pair of grades $r_1$ and $r_2$. The model also implies that the ratio of standard deviations for any pair of groups $g_1$ and $g_2$ will be constant across grades, meaning that
The linear trend pooled HETOP model instead implies that the ratio of any single group’s (standardized) standard deviations across a pair of grades will depend on the group’s slope, distance of the grades, and grade-specific standardization constants:
Likewise, the ratio of standard deviations for any pair of groups changes by a common factor across grades:
Thus, the linear trend pooled HETOP model does not require that the rank ordering of group standard deviations remains constant across grades.
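A small numerical sketch can illustrate why rank orderings may change under a linear trend. It assumes the trend is linear in the log of the standard deviation (an assumption of this illustration) and uses two hypothetical groups; between-group ratios within a grade are unaffected by grade-specific standardization constants because those constants cancel.

```python
import math

def sd(intercept, slope, grade):
    """Standard deviation under an assumed log-linear trend in grade."""
    return math.exp(intercept + slope * grade)

# Hypothetical (intercept, slope) pairs on the log-SD scale.
group_a = (math.log(1.10), -0.05)  # starts larger, shrinks across grades
group_b = (math.log(0.95), 0.03)   # starts smaller, grows across grades

ratios = [sd(*group_a, r) / sd(*group_b, r) for r in range(6)]
# The ratio changes by the same factor from each grade to the next.
step_factors = [ratios[r + 1] / ratios[r] for r in range(5)]
print([round(x, 3) for x in ratios])
print([round(x, 3) for x in step_factors])
```

Here the ratio starts above 1 and ends below 1, so the two groups switch rank order across grades even though the grade-to-grade change in the ratio is constant.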
We have described the assumptions of the pooled HETOP models when pooling across grades here. The same assumptions would apply to other dimensions as well. If the model were used to pool across years, for example, the assumptions would apply to the relative magnitudes of group standard deviations across years; if the model were used to pool across subjects, the assumptions would apply to the relative magnitudes of group standard deviations across subjects. In addition to the statistical assumptions described here, one must consider whether it makes sense substantively to pool across dimensions, something we discuss further below. In the next section, we evaluate the plausibility of these assumptions about the relative magnitudes of group standard deviations within subjects across grades and years in an empirical data set.
Empirical Test of Pooled Model Assumptions
This section analyzes district-level test score proficiency data from 40 states to evaluate whether there is evidence that the patterns among relative magnitudes of district-level standard deviations are consistent with the assumptions made by the pooled HETOP models introduced above. We use publicly available data from the Stanford Education Data Archive Version 2.1 (SEDA; Reardon et al., 2018). SEDA contains estimated mathematics and English/Language Arts (ELA) Grade 3 through 8 test score means and standard deviations for nearly every U.S. public school district in the 2008–2009 through 2014–2015 school years.
The means and standard deviations in SEDA are estimated by fitting partially-constrained HETOP models separately in each state, grade, year, and subject using aggregate district-level proficiency counts obtained from the EDFacts database (Fahle et al., 2018). Because our goal is to study variation among group standard deviations, we exclude standard deviation estimates that were constrained during estimation and focus only on freely estimated standard deviations. The exact sample restrictions are described in the Appendix available in the online version of this article. The final sample consists of 620,588 unique standard deviation estimates across 40 states and 9,266 unique districts. Each district has between 1 and 42 repeated observations (across six grades and 7 years) in each subject, with an average of approximately 34 observations per district–subject. On average, there are 231 districts per subject and state, ranging from 54 to 699.
Models
SEDA contains estimates of
This implies that if the fully pooled or trend HETOP model assumptions are valid, the
We fit two precision-weighted hierarchical linear models (Raudenbush & Bryk, 2002) for each state–subject data set, with estimates
Results
Across the 80 state–subject data sets, in Model 1 on average, 65% of the total variance in
We can also use the magnitude of the estimated variance components to quantify the anticipated gains in accuracy obtained by fitting one of the pooled HETOP models relative to a HOMOP model. Based on the magnitude of the estimated variance components across models, we would expect HOMOP standard deviation estimates to be within approximately ±14% of the true
Simulation
A Monte Carlo computer simulation was used to investigate the small (i.e., finite) sample performance of the fully pooled HETOP model and the linear trend pooled HETOP model (referred to in this section as the “trend HETOP” model) relative to the standard HETOP and HOMOP models when pooling data across repeated observations. Data were generated for a set of 25 groups observed across six occasions. This scenario could represent having data for 25 schools across six grades, and hence, we refer to the occasions as “grades.” The simulation varied the true group standard deviation structure (either constant values or following group-specific linear trends across grades), group sample size (sizes of 10, 25, 50, 100, or 200), and cut score locations. The cut scores used to coarsen the data were placed at either the 20th/50th/80th (mid), 5th/30th/55th (skewed), or 5th/50th/95th (wide) percentiles of the overall distribution within each grade or were mixed such that scores in the first three grades were coarsened using the mid, skewed, and wide cut scores, respectively, with the same pattern for grades four through six. Overall, there were (2 group structures) × (5 sample size conditions) × (4 cut score conditions) for a total of 40 simulation conditions. We generated and analyzed 1,000 replications (i.e., samples) in each condition. All simulations and analyses were carried out using Stata v14.2 (StataCorp, 2015), with estimation of the HETOP models conducted using a custom program written by the authors and based on the Stata -ml- functions. All simulation code is available upon request from the authors.
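The design described above can be enumerated programmatically. The following sketch is not the authors' Stata code; it lists the 40 conditions and converts the stated percentiles to cut scores while treating the overall within-grade distribution as standard normal, which is a simplifying assumption of this illustration. Grades are indexed 0 through 5.

```python
from itertools import product
from statistics import NormalDist

SD_STRUCTURES = ("constant", "trend")
SAMPLE_SIZES = (10, 25, 50, 100, 200)
CUT_CONDITIONS = ("mid", "skewed", "wide", "mixed")
CUT_PERCENTILES = {
    "mid": (0.20, 0.50, 0.80),
    "skewed": (0.05, 0.30, 0.55),
    "wide": (0.05, 0.50, 0.95),
}
# Mixed condition: mid, skewed, wide in the first three grades, then repeated.
MIXED_PATTERN = ("mid", "skewed", "wide", "mid", "skewed", "wide")

def cut_scores(condition, grade):
    """Cut scores for a grade (0-5), assuming a standard normal overall distribution."""
    kind = MIXED_PATTERN[grade] if condition == "mixed" else condition
    return tuple(NormalDist().inv_cdf(p) for p in CUT_PERCENTILES[kind])

conditions = list(product(SD_STRUCTURES, SAMPLE_SIZES, CUT_CONDITIONS))
print(len(conditions))
print([round(c, 3) for c in cut_scores("mixed", 1)])
```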
Data Generation
For each group standard deviation structure by sample size condition, we began by defining a population of 25 groups with fixed mean and standard deviation parameters at each grade level. Defining the true group mean and standard deviation parameters began by creating a 5 × 5 grid of
To determine the true values, we first assigned values of
The standardized
Parameter Estimation
For each of the four coarsened data sets in each condition, we fit the HETOP and HOMOP models separately within each grade and fit the fully pooled and trend HETOP models simultaneously to all grades. The fully pooled HETOP model was expected to perform best when the data generating model specified constant
We used the following procedure to ensure finite ML estimates exist for all samples. When a sampled count vector had only one nonzero count, had nonzero counts in only the top and bottom categories, or had nonzero counts in only two adjacent categories, we replaced the sampled counts for that group with
where
Outcomes
Evaluation of model performance is based on four outcomes. First, the convergence rate for each model was recorded, indicating whether the ML algorithm could reach a solution. We then evaluated the bias, root mean squared error (RMSE), and confidence interval (CI) coverage for the estimated group means and standard deviations (in the within-grade standardized metric). The bias, RMSE, and CI coverage were aggregated across all groups and grades for a particular condition (i.e., they are the average bias or pooled RMSE across groups and grades for a given condition). The CI coverage was evaluated by determining the proportion of individual estimates for which the estimated parameter value was within ±1.96 estimated standard errors of the true parameter value.
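As a minimal sketch, the three accuracy outcomes could be computed as follows once estimates, estimated standard errors, and true values have been flattened across groups, grades, and replications. The function names and the example numbers are hypothetical.

```python
import math

def bias(estimates, truths):
    """Average signed error across all estimates."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(estimates)

def rmse(estimates, truths):
    """Root mean squared error pooled across all estimates."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(estimates))

def ci_coverage(estimates, std_errors, truths, z=1.96):
    """Share of estimates within z estimated standard errors of the true value."""
    hits = sum(abs(e - t) <= z * se for e, t, se in zip(estimates, truths, std_errors))
    return hits / len(estimates)

# Hypothetical standard deviation estimates for a true value of 1.0.
est = [1.1, 0.9, 1.3, 1.0]
se = [0.1, 0.1, 0.1, 0.1]
truth = [1.0] * 4
print(bias(est, truth), rmse(est, truth), ci_coverage(est, se, truth))
```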
To compare the relative gain in efficiency when using a fully pooled HETOP model rather than separate HETOP models in each grade, we conducted one additional analysis. For each replication in the equal standard deviation condition, we fit a fully pooled model using only the first two, three, four, or five grades of data, in addition to the model using all six grades. We then compared the empirical sampling variance of the group standard deviation estimates in these pooled models relative to the separate HETOP models fit within each grade. The efficiency ratio between the fully pooled and separate HETOP models was defined as the ratio of the average observed sampling variance in the separate HETOP models relative to each of the fully pooled HETOP models, computed as
This ratio indicates how much smaller the sampling variance would be if the group standard deviations remain constant and we pool across two, three, four, five, or six grades rather than using only a single grade to estimate standard deviations. A ratio of 1 indicates that sampling error in the separate and pooled models is equal, ratios greater than 1 indicate the separate HETOP model estimates have larger sampling error, and ratios less than 1 indicate the pooled model has larger sampling error. A similar calculation was also made to compare the efficiency of the trend pooled HETOP model to the separate HETOP models.
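One plausible reading of this efficiency ratio can be sketched as below: compute the empirical variance of each parameter's estimates across replications, average those variances over parameters, and take the ratio of the separate-model average to the pooled-model average. This is an illustrative interpretation rather than the authors' exact formula, and the names and numbers are hypothetical.

```python
from statistics import mean, variance

def efficiency_ratio(separate, pooled):
    """separate[j] and pooled[j] hold replicate estimates of parameter j."""
    var_separate = mean(variance(reps) for reps in separate)
    var_pooled = mean(variance(reps) for reps in pooled)
    return var_separate / var_pooled

# Hypothetical replicate estimates for one group standard deviation.
separate = [[0.8, 1.2, 1.0, 1.0]]
pooled = [[0.9, 1.1, 1.0, 1.0]]
ratio = efficiency_ratio(separate, pooled)
print(ratio)
```

In this toy example the separate estimates deviate twice as far from their mean as the pooled estimates, so the variance ratio is about 4.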
Results
All models converged successfully. Table 1 summarizes the proportion of count vectors that were smoothed across simulation conditions. Across all conditions, approximately 6% of all sampled vectors were smoothed, and these were primarily concentrated in the wide and mixed cut score conditions with small sample sizes. In the wide cut score condition with n = 10, for example, approximately 43% of vectors were smoothed, while in the mixed cut score condition with n = 10, approximately 20% were smoothed. While smoothing the count vectors ensures existence of ML estimates, it may also lead to positive bias in standard deviation estimates by artificially adding variance to the observed count vectors, something we discuss below.
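The three count patterns that triggered smoothing (described in the Parameter Estimation section) can be expressed as a simple predicate. The sketch below only flags vectors that would be smoothed; it does not implement the replacement counts themselves, and the function name is hypothetical.

```python
def needs_smoothing(counts):
    """Flag count vectors matching the sparse patterns described in the text."""
    nonzero = [k for k, n in enumerate(counts) if n > 0]
    if len(nonzero) == 1:
        return True  # only one nonzero count
    if len(nonzero) == 2:
        low, high = nonzero
        top_and_bottom = low == 0 and high == len(counts) - 1
        adjacent = high - low == 1
        return top_and_bottom or adjacent
    return False

examples = [[0, 0, 10, 0], [5, 0, 0, 5], [0, 4, 6, 0], [3, 0, 7, 0], [1, 2, 3, 4]]
flags = [needs_smoothing(c) for c in examples]
print(flags)
```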
Table 1. Proportion of Smoothed Count Vectors Across Simulation Conditions
Group means
We do not show detailed results for the estimated means here because these were not the primary outcome of interest and because there was little variation in the results across models. Average bias in estimated means was indistinguishable from 0 for all conditions. There was very little difference in the RMSE of means across models, and sample size was the primary factor influencing this outcome. CI coverage was generally good and converged toward the expected rate (95%) as sample sizes increased, with the following exceptions: Coverage rates became too low for the HOMOP model as sample sizes increased, and with skewed cut scores, rates were as low as 90% for the separate HETOP models with n = 10 and for the pooled HETOP model with n = 200 in the trend SD condition.
Group standard deviations
Figures 1 and 2 display the bias and RMSE for estimated standard deviations. Each panel displays results for a single cut score by group standard deviation structure condition; the x-axis depicts group sample sizes, the y-axis depicts the outcome of interest, and each line represents a different model. With n = 10 and n = 25, there was a reduction in bias for the fully pooled and linear trend models relative to the separate HETOP models. An exception was the wide cut score condition, in which all models slightly overestimated group standard deviations, on average, with very small sample sizes; as noted above, this is likely due to the correction factor applied to ensure ML existence, which was applied most often in the wide cut score condition. The fully pooled and trend models tended to slightly overestimate standard deviations with samples of size n = 10, but this bias was smaller in magnitude than the negative bias in the separate HETOP model estimates and was reduced to near 0 with samples of size 25 or larger. The separate HOMOP models produced a small positive bias on average across nearly all conditions, which was larger when there were true trends in the standard deviations. This indicates that the single common standard deviation estimated in the HOMOP model was slightly larger than the true average within-group standard deviations and is likely due to the misspecification of the HOMOP model.

Figure 1. Bias in estimated standard deviations (SDs) by SD structure, cut score type, and sample size for each model. Constant SD and trend SD refer to different patterns of true group standard deviations described in text. The mid, skewed, wide, and mixed headings refer to different cut score locations; mid = symmetric cut scores at approximately the 20th/50th/80th percentiles; skewed = asymmetric cut scores at approximately the 5th/30th/55th percentiles; wide = symmetric cut scores at approximately the 5th/50th/95th percentiles; mixed = mix of mid/skewed/wide cut score locations across grades; HETOP = heteroskedastic ordered probit model; HOMOP = homoskedastic ordered probit model; pooled = fully pooled HETOP model; trend = linear trend pooled HETOP model.

Figure 2. Root mean squared error of estimated standard deviations (SDs) by SD structure, cut score type, and sample size for each model. Constant SD and trend SD refer to different patterns of true group standard deviations described in text. The mid, skewed, wide, and mixed headings refer to different cut score locations; mid = symmetric cut scores at approximately the 20th/50th/80th percentiles; skewed = asymmetric cut scores at approximately the 5th/30th/55th percentiles; wide = symmetric cut scores at approximately the 5th/50th/95th percentiles; mixed = mix of mid/skewed/wide cut score locations across grades; HETOP = heteroskedastic ordered probit model; HOMOP = homoskedastic ordered probit model; pooled = fully pooled HETOP model; trend = linear trend pooled HETOP model.
Figure 2, depicting the RMSEs of the estimated standard deviations, is simpler to summarize. The separate HETOP models had the largest RMSEs when
The CI coverage rates (not presented graphically) followed anticipated patterns. For the separate HETOP models, coverage rates were between 92.5% and 97.5% for all conditions when
Figure 3 displays the efficiency ratio of the separate HETOP models relative to the pooled models when pooling across varying numbers of grades. Each panel represents a different cut score condition, and each line represents the efficiency ratio when pooling across a different number of grades. When using only one grade, the fully pooled model is equivalent to the separate HETOP models, indicated by the efficiency ratio of 1. In general, the efficiency ratios approach a value of p, the number of data sets being pooled, indicating that the mean squared error (MSE) of estimates using the fully pooled model is approximately

Figure 3. Efficiency ratios between HETOP and pooled HETOP models by cut score type, sample size, and number of pooled grades in the constant SD condition. The “p” refers to the number of grades used to estimate the fully pooled HETOP model. The mid, skewed, wide, and mixed headings refer to different cut score locations; mid = symmetric cut scores at approximately the 20th/50th/80th percentiles; skewed = asymmetric cut scores at approximately the 5th/30th/55th percentiles; wide = symmetric cut scores at approximately the 5th/50th/95th percentiles; mixed = mix of mid/skewed/wide cut score locations across grades.
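The intuition for why the efficiency ratio approaches p can be illustrated with a small Monte Carlo sketch. It does not fit a HETOP model. As a simplifying stand-in, it assumes each grade yields an unbiased, independent estimate of a common log standard deviation with equal sampling variance, and it treats the average of the p estimates as a proxy for the fully pooled estimate. The function name, the noise level, and the number of replications are hypothetical choices.

```python
import random

def efficiency_ratio(p, reps=20000, noise_sd=0.3, seed=0):
    """MSE of a single-grade estimate divided by MSE of the p-grade average.

    Stand-in for pooling: averaging p independent, equally noisy estimates
    of the same parameter (an assumption, not the pooled HETOP ML estimator).
    """
    rng = random.Random(seed)
    true_log_sd = 0.0
    sq_err_separate = 0.0
    sq_err_pooled = 0.0
    for _ in range(reps):
        # One noisy estimate per grade, all targeting the same true value.
        ests = [true_log_sd + rng.gauss(0.0, noise_sd) for _ in range(p)]
        pooled = sum(ests) / p
        sq_err_separate += (ests[0] - true_log_sd) ** 2
        sq_err_pooled += (pooled - true_log_sd) ** 2
    return sq_err_separate / sq_err_pooled

for p in (1, 2, 4, 6):
    print(p, round(efficiency_ratio(p), 2))
```

Under these assumptions the ratio is exactly 1 when p = 1 and fluctuates around p otherwise, which mirrors the qualitative pattern described for the fully pooled model in the constant SD condition.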
Figure 4 plots the observed efficiency ratios of the trend model estimates relative to the separate HETOP model estimates for the trend SD condition. Each panel represents a different cut score condition, and each line plots the efficiency ratio at a single grade level. The trend model has the greatest gains in efficiency for the middle grades (2 and 3), and the smallest efficiency gains for the extreme grades (0 and 5), in all but the mixed cut score condition (which we discuss below). This result is expected because the standard deviations are effectively predictions from a linear regression model, and regression predictions near the center of the predictor distribution will have smaller variance than predictions at the extremes. The estimated (or predicted) scale parameter in the trend model is

Figure 4. Efficiency ratios between HETOP and trend pooled HETOP models by cut score type, sample size, and grade in the trend SD condition. The “g” represents each of the six possible grade levels. The mid, skewed, wide, and mixed headings refer to different cut score locations; mid = symmetric cut scores at approximately the 20th/50th/80th percentiles; skewed = asymmetric cut scores at approximately the 5th/30th/55th percentiles; wide = symmetric cut scores at approximately the 5th/50th/95th percentiles; mixed = mix of mid/skewed/wide cut score locations across grades. The mixed cut score condition used mid cut scores for Grades 0 and 3, skewed cut scores for Grades 1 and 4, and wide cut scores for Grades 2 and 5.
where
The approximations appear to work well for the mid, wide, and skewed cut score conditions but are less accurate for the mixed cut score conditions. In the mixed cut score conditions, the sampling variance of the separate HETOP estimates varies across grade levels depending upon the distribution of the cut scores, resulting in the equivalent of a heteroskedastic error term. These results suggest that the efficiency ratio of the trend model relative to separate HETOP models can be approximated using results from standard LS regression. When cut score locations vary substantially across grade levels, the approximations may be less accurate, but substantial gains in efficiency remain. Hence, although the trend estimates are more efficient than the separate HETOP estimates, the gain in efficiency depends on factors such as the number and coding of the grades and the cut score locations.
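One way to read the connection to standard LS regression is through the variance of a fitted value in simple linear regression, which is proportional to the leverage 1/p + (g − ḡ)²/Σ(g − ḡ)². The sketch below assumes the separate estimates at each grade are independent with equal sampling variance, so that the approximate efficiency ratio at grade g is the reciprocal of the leverage. This is an illustrative assumption about the form of the approximation, not a reproduction of the article's expression, and the function name is hypothetical.

```python
def trend_efficiency_ratios(grades):
    """Approximate efficiency ratio (separate vs. trend) at each grade.

    Assumes equal, independent sampling variance of the separate estimates,
    so Var(prediction at g) / Var(separate estimate) equals the LS leverage.
    """
    p = len(grades)
    g_bar = sum(grades) / p
    sxx = sum((g - g_bar) ** 2 for g in grades)
    ratios = {}
    for g in grades:
        leverage = 1.0 / p + (g - g_bar) ** 2 / sxx
        ratios[g] = 1.0 / leverage
    return ratios

ratios = trend_efficiency_ratios([0, 1, 2, 3, 4, 5])
for g, r in ratios.items():
    print(g, round(r, 2))
```

With six equally spaced grades coded 0 through 5, the reciprocal leverage is largest at Grades 2 and 3 and smallest at Grades 0 and 5, consistent with the pattern described for Figure 4. When sampling variances differ across grades, as in the mixed cut score condition, the equal-variance assumption fails and this approximation would be expected to be less accurate.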
Summary
These results suggest that when data for repeated test administrations are available, the fully pooled and trend HETOP models can substantially reduce bias and sampling error of standard deviation estimates relative to fitting separate HETOP models, particularly with very small sample sizes. The reduction in bias is smaller with larger samples or more equally spaced cut scores, but gains in efficiency remain across conditions. The fully pooled and trend models also had smaller sampling variance than the separate HOMOP models across nearly all conditions. Use of the smoothing correction did appear to induce some positive bias in standard deviation estimates, as anticipated. The results illustrate that the relative performance of the models depends on many factors including the number of waves (grades) of data available, group sample sizes, cut score locations, and the true values of the standard deviations. In the next section, we illustrate how analysts might go about selecting and estimating a pooled HETOP model with real data.
Real Data Example
Determining whether to use the fully pooled, linear trend, HOMOP, or full HETOP model depends on a number of factors including the type of data available, group sample sizes, location of the cut scores, average values of
The data contain coarsened proficiency counts for 124 schools that enrolled at least 16 students in each grade (data for schools with smaller sample sizes were not reported publicly), resulting in
The analyses of national district-level data above suggest that, a priori, when test score data are available across multiple grades, we would expect a pooled HETOP model with linear grade trends to be optimal. We also use statistical criteria to select a HETOP model for this particular data set. To do so, we fit a series of nested HETOP models that can be compared with likelihood ratio tests. Model 1 estimates a unique mean for each school in each grade while constraining the log standard deviation to be equal across schools within grades and is equivalent to estimating a separate HOMOP model in each grade. Model 1 is identified by constraining the weighted sum of the means to be 0 within grades and constraining the common scale parameter,
Table 2 summarizes the results across all three models. First, to determine whether between- or within-school constraints on the log standard deviations are preferable for these data, we compare the fit of Models 1 and 2. A likelihood ratio test at
Table 2 also summarizes the estimated log standard deviation intercepts and trends for Models 2 and 3. The summary statistics are not weighted by school sample size; had they been weighted, the average intercepts and trends would have been exactly 0 by construction. The average estimated intercepts were similar in Models 2 and 3, although there was slightly more variation in the Model 3 estimates. The linear trends in Model 3 ranged from −0.136 to 0.120 across schools (mean = 0.002, SD = 0.044), suggesting a level of heterogeneity in standard deviations that, based on the simulations, would lead the linear trend model to provide more accurate estimates than the fully pooled model. The table also summarizes the resulting means and standard deviations in the grade-standardized metric,
Table 2. Summary Statistics for Estimated HETOP Models
Note. The rows corresponding to
In addition to comparing the relative fit of models, different approaches might be used to assess overall goodness of model fit. Table 2 reports an overall χ2 goodness of fit statistic based on the observed and expected frequency counts in each category in each school–grade cell. These statistics may be of limited value because, with a large sample size, they can indicate statistically significant misfit that is not practically significant and because they do not indicate the nature of model misfit. As a descriptive measure of fit, Table 2 also reports the mean absolute difference between observed and expected proportions of students scoring in each category for each school–grade cell. Across the 744 school–grade cells in Model 3, for example, the average difference was 0.030 (range
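The fit measures described above can be sketched for a single school–grade cell as follows. The sketch assumes a normal latent distribution with a given mean and standard deviation and a set of cut scores on the same scale. The function names and all numeric values are hypothetical, and the statistics reported in Table 2 aggregate over all cells rather than a single one.

```python
from statistics import NormalDist

def expected_proportions(mean, sd, cuts):
    """Model-implied proportion in each ordered category for one cell."""
    dist = NormalDist(mean, sd)
    # Cumulative probabilities at the cut scores, padded with 0 and 1.
    cdf = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

def mean_abs_diff(counts, expected):
    """Mean absolute difference between observed and expected proportions."""
    n = sum(counts)
    return sum(abs(c / n - e) for c, e in zip(counts, expected)) / len(counts)

def pearson_chi2(counts, expected):
    """Pearson chi-square contribution of one cell (observed vs. expected counts)."""
    n = sum(counts)
    return sum((c - n * e) ** 2 / (n * e) for c, e in zip(counts, expected))

# Hypothetical cell: 25 students, four proficiency levels, three cut scores.
counts = [4, 9, 10, 2]
props = expected_proportions(mean=-0.1, sd=0.9, cuts=[-1.0, 0.0, 1.0])

print([round(x, 3) for x in props])
print("mean abs diff:", round(mean_abs_diff(counts, props), 3))
print("chi-square contribution:", round(pearson_chi2(counts, props), 3))
```

The mean absolute difference is on the proportion scale, so it can be read directly as the typical discrepancy in the share of students per category, whereas the chi-square contribution grows with cell size for a fixed discrepancy.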
Discussion
This article presented a generalization of the HETOP model described by Reardon et al. (2017) that can be used to analyze grouped, ordered-categorical data when there are multiple waves of data available for each group. The fully pooled HETOP model leverages the repeated observations by estimating a constant scale parameter for each group across data sets, while the linear trend pooled HETOP model is more flexible and allows each group’s scale parameter to vary linearly across the data sets. The simulations and empirical analyses above document four primary reasons the pooled HETOP models might be preferred to standard HETOP models in practice. First, the pooled HETOP models can be estimated in some cases where there are not sufficient data to support estimation of full HETOP models. Second, the pooled HETOP models may better represent observed patterns in group standard deviations than do models placing constraints across groups, which provides another method to address sparse data problems. Third, the pooled models reduce bias in standard deviation estimates relative to full HETOP models when sample sizes are very small, particularly when cut scores are widely or asymmetrically placed as is common in coarsened proficiency data. And, fourth, the pooled HETOP models improve the precision (i.e., reduce RMSE) of estimated standard deviations beyond gains made through reductions in bias.
Whether these gains are realized in practice will depend on the nature of the data. When multiple waves of data are available, the pooled or trend HETOP models will be preferable to models placing constraints across groups if there is more variability between groups than within. Our empirical analysis of national district-level data suggests that constraints within districts and subjects are likely to produce more accurate estimates than constraints across districts in the context of coarsened proficiency data and that linear trends are likely to produce slightly more accurate estimates than fully pooled models. Analysts must also consider whether it is reasonable to expect greater heterogeneity between or within groups, and whether a linear trend is conceptually appropriate based on the nature of the data. It may not be reasonable, for example, to fit a linear trend across data from different subjects, where the repeated observations cannot be placed in a logical order as is possible with grades or years. However, it may still be reasonable to fit fully pooled models across subjects if the assumptions about the relative magnitudes of group standard deviations across subjects are plausible. The example in the Real Data Example section demonstrated how analysts could select an appropriate model using theoretical and statistical criteria. The simulation results provide additional information about the conditions under which pooled HETOP models are expected to lead to the greatest reductions in bias or RMSE relative to full HETOP or HOMOP models. The anticipated reductions in sampling error, for example, can be approximated based on the number of pooled data sets and the coding of the linear predictors used in the trend models.
This article also leaves important directions for future work. As with any simulation study, many additional factors could have been varied. These factors include additional structures for the standard deviations (including structures that do not conform to the linear trends) as well as violations of respective normality. Another avenue for additional work revolves around the problems caused by nonexistence of ML estimates. Essentially, this is a problem of small samples containing limited information about the parameters of interest. In some simulation conditions, for example, when sample sizes were n = 10 for each group and cut scores were widely spaced, a substantial proportion of group count vectors needed to be adjusted to guarantee existence of the ML estimates when group means were freely estimated. A more complete proof of existence conditions for the ML estimates was not provided and would be a useful extension of the results here. It would also be worth testing models that place additional constraints (e.g., linear trends) on the estimated means as another method for overcoming sparse data problems and evaluating additional model fit statistics.
As mentioned above, Bayesian and random effects models provide an alternative approach to addressing existence and small-sample problems but were beyond the scope of the present investigation. These models rely on specifying or estimating prior distributions rather than attempting to estimate each term individually (e.g., Hedeker et al., 2009; Kapur et al., 2015). Recent work pursuing a Bayesian HETOP model (Lockwood et al., 2018) is similar to the framework described here with an additional random component. However, these Bayesian models have not yet been extended to simultaneously model data from multiple measures with potentially varying cut scores. While Bayesian approaches can overcome problems with the nonexistence of the ML estimates and potentially produce estimates with smaller RMSE, they can increase the bias in estimates for individual groups, require appropriate specifications or estimates of prior distributions, and as with the HOMOP and PHOP models, they have so far relied on constraints across rather than within groups. Under certain conditions, including when estimates might be used in secondary analyses, ML estimates may be preferable, and in those cases, the models described here are a useful alternative. Pursuing extensions to these models that incorporate multiple sets of data would be a useful area for further study.
Finally, we note that the models described in this article can be applied to a wide range of ordered-categorical data beyond coarsened test scores. The pooled HETOP models described here are applicable any time analysts have multiple sets of grouped, ordered-categorical data for a common set of groups and wish to estimate distributional parameters of an underlying continuous variable. These data could arise from test scores reported only on ordinal scales such as Advanced Placement scores, from responses to Likert survey items, or from continuous variables such as income that are often reported in a coarsened form.
Supplemental Material
Supplemental Material, pooled-hetop-jebs-rev-appendix-submit for Using Pooled Heteroskedastic Ordered Probit Models to Improve Small-Sample Estimates of Latent Test Score Distributions by Benjamin R. Shear and Sean F. Reardon in Journal of Educational and Behavioral Statistics
Footnotes
Authors’ Note
This article is based on work completed as part of the first author’s doctoral dissertation. An earlier version of this article was presented at the 2017 NCME Annual Meeting.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was supported in part through funding from an Institute for Education Sciences training grant (#R305B090016) and a grant from the Bill and Melinda Gates Foundation to Stanford University.
