Abstract
Modeling within-person change over time and between-person differences in change over time is a primary goal in prevention science. When modeling change in an observed score over time with multilevel or structural equation modeling approaches, each observed score counts equally toward the estimation of model parameters. However, observed scores can differ in their precision, both within and across participants. We propose an approach to weight observed scores by their level of precision, which is estimated as the inverse of their standard error of measurement in the context of item response modeling. Thus, scores with lower standard errors of measurement have greater weight, and scores with higher standard errors of measurement are down-weighted. We discuss the weighting approaches and illustrate how to apply this approach with commonly available software. We then compare this approach to modeling change without weighting based on standard errors of measurement.
Prevention and intervention studies have the goal of changing the behavior of individuals. The main analytic tool to assess individual and group-level change over time is the growth model (Grimm et al., 2017; Laird & Ware, 1982; McArdle & Epstein, 1987; Meredith & Tisak, 1984, 1990; Singer & Willett, 2003). When modeling change with growth models, or any other statistical model, every observed score contributes equally to the estimation of model parameters. If observed scores are equally reliable, then it makes sense that they should all contribute equally. A single reliability for a measurement instrument (i.e., test, scale, or survey) for a given sample is consistent with the notion of reliability from classical test theory (CTT); however, modern measurement theory dictates otherwise. That is, notions of reliability from item response theory (IRT) suggest that the reliability of scores can differ and that there is not simply a single level of reliability.
In this article, we propose an approach to account for the differential reliability of the scores when examining change over time with growth models. We continue with an overview of growth modeling from the mixed-effects modeling perspective and a discussion of reliability in the context of item response modeling, and then propose an approach to account for differential reliability when estimating growth models in the mixed-effects modeling framework. We then illustrate this estimation approach with longitudinal mathematics data collected from the Applied Problems test from the Woodcock–Johnson Psycho-educational Battery–III. We conclude with a discussion of the extensions as well as the limitations of this approach.
Mixed-Effects Models
Mixed-effects models are referred to by various names (often depending on the discipline) such as hierarchical linear models and multilevel models. These models are commonly used to analyze data with a nested (clustered) structure. Data with a nested structure have lower level units nested within one or more upper level units. This nested structure can occur in cross-sectional data as well as in longitudinal data. In the prevention sciences, a common example of cross-sectional nested data is children nested within schools. In these situations, children in the same school are likely to be more similar to each other than to children from other schools. Whether the intervention is at the level of the school (e.g., schools are randomized to the intervention and all children in a school are in the same arm of the intervention) or at the level of the student (e.g., students within different schools are randomized to the intervention), it is important to use a mixed-effects model to account for the dependency in the data due to the shared school environment. Longitudinal data are inherently nested because the repeated measures data from one individual are likely to be more similar to one another than observations from different individuals. Intervention studies often collect repeated measures data to assess individual change over time, and whether the intervention is implemented at the between-person level (e.g., different individuals are randomized to the intervention) or varies within person and over time (e.g., A-B-A design; all individuals experience the levels of the intervention across time), a mixed-effects model can account for the dependency in the scores due to the repeated measurement protocol.
When applied to longitudinal data, mixed-effects models allow researchers to capture sample-level changes and individual-level changes using the fixed- and random-effects parameters (Searle et al., 1992). Fixed-effect parameters capture aspects of sample-level (mean) change, whereas the random-effects parameters capture aspects of individual differences in the change process. Mixed-effects models can be broken down into two types: linear and nonlinear models (Laird & Ware, 1982; Lindstrom & Bates, 1990). The distinction between the two types of models involves how the fixed and random effects enter the model. More details on each type of model are discussed in the next sections, starting with linear mixed-effects models as nonlinear mixed-effects models are built off the same foundation.
Linear mixed-effects models
Linear mixed-effects models are one of the most commonly used models to analyze longitudinal prevention data. In linear mixed-effects models, both the fixed and random effects enter the model in a linear (additive) fashion. While these models are linear models, they can be used to model nonlinear change patterns. That is, nonlinear change can be modeled using transformations of the timing variable. For example, squaring the timing variable (e.g.,
Linear mixed-effects models use fixed and random effects to model the mean and individual change trajectories. Fixed-effects parameters are used to describe the change pattern for the entire sample (sample-level parameters). For example, a fixed-effect parameter for the timing variable describes the expected rate of change in the outcome for a one-unit change in the timing variable for the sample. Random-effects parameters are used to describe the magnitude of differences between individuals. That is, random coefficients allow each person to deviate from a given fixed effect, and the variances and covariances of these random coefficients are random-effects parameters. For example, allowing for the effect of the timing variable to be random (i.e., a random slope) indicates that each individual is allowed to have a different rate of change in the outcome for a one-unit change in the timing variable. The random slope would have an estimated fixed-effect parameter describing the average rate of change for the sample and an estimated random-effect parameter describing the magnitude of differences in the rate of change across individuals.
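To make the distinction between fixed-effect and random-effects parameters concrete, the following is a minimal simulation sketch in Python. All numeric values (sample size, time points, fixed effects, random-effect covariance matrix, residual standard deviation) are hypothetical and chosen only for illustration; the function name is our own. Each person receives an intercept and slope that deviate from the fixed effects, and per-person ordinary least squares slopes are then summarized.

```python
import numpy as np

def simulate_linear_growth(n_people=500, times=(0, 1, 2, 3), beta=(10.0, 2.0),
                           tau=((4.0, 0.5), (0.5, 1.0)), sigma=1.0, seed=0):
    """Simulate y_ti = (beta0 + b0i) + (beta1 + b1i) * t + e_ti.

    beta  : fixed effects (mean intercept, mean slope) -- hypothetical values
    tau   : covariance matrix of the random intercept and random slope
    sigma : residual (level-1) standard deviation
    """
    rng = np.random.default_rng(seed)
    t = np.asarray(times, dtype=float)
    # Random coefficients: person-specific deviations from the fixed effects.
    b = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.asarray(tau), size=n_people)
    intercepts = beta[0] + b[:, 0]
    slopes = beta[1] + b[:, 1]
    noise = rng.normal(0.0, sigma, size=(n_people, t.size))
    y = intercepts[:, None] + slopes[:, None] * t[None, :] + noise
    return t, y

t, y = simulate_linear_growth()

# Fit a separate straight line to each person (columns of y.T are people).
ols_slopes = np.polyfit(t, y.T, deg=1)[0]

# The mean of the person-level slopes approximates the fixed effect (2.0 here).
print("mean slope:", ols_slopes.mean())
# Their variance approximates the random-slope variance (1.0 here) plus
# sampling noise in each person's slope, so it overstates the true variance.
print("variance of slopes:", ols_slopes.var(ddof=1))
```

A mixed-effects model estimates the mean slope and the slope variance jointly and separates the slope variance from the residual noise, which the per-person summary above does not do.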
Linear mixed-effects models take on the general form
where
The level-2 and level-1 random effects are often assumed to follow multivariate normal distributions, such that
As an example of a linear mixed-effects model, a linear growth model with
where
Nonlinear Mixed-Effects Models
Nonlinear mixed-effects models are an extension of linear mixed-effects models and are often used to analyze repeated measures data (Davidian & Giltinan, 1995). Nonlinear mixed-effects models are best suited for modeling data whose individual trajectories follow complex nonlinear patterns. These models are more flexible than linear mixed-effects models and can capture the nonlinearity in the data through inherently nonlinear functions (e.g., Gompertz function). Just like linear mixed-effects models, nonlinear mixed-effects models use fixed- and random-effects parameters to model the mean trajectory for the sample and individual-level trajectories. The difference between the two is how the parameters enter the model. In nonlinear mixed-effects models, at least one parameter enters the model in a nonlinear (multiplicative) fashion. Parameters that enter the model in a nonlinear fashion are contained within the mathematical function of time (e.g., parameter in an exponent, parameter raised to an exponent; Burchinal & Appelbaum, 1991).
Grimm et al. (2017) discuss two types of nonlinear mixed-effects models. The first type of nonlinear mixed-effects models are models that are nonlinear with respect to the fixed effects. These models have at least one fixed-effect parameter that enters the model in a nonlinear fashion. In this model, the fixed-effects parameter that enters the model nonlinearly does not have an associated random effect. These models have also been referred to as conditionally linear models (Blozis & Cudeck, 1999) because the random coefficients are additive. An example of this kind of model would be an exponential growth model where the rate parameter is a fixed-effect parameter without an associated random effect (e.g., all participants have the same rate parameter). The second type of nonlinear mixed-effects models are models with at least one random effect that enters the model in a nonlinear fashion. These models are considered to be nonlinear with respect to the random effects and have been referred to as fully nonlinear models. An example of this kind of model would be an exponential growth model where the rate parameter has an associated random effect (e.g., participants are allowed to differ in the rate).
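The two types of models can be sketched with an exponential trajectory. The parameterization below (intercept, total change to the asymptote, and rate of approach) is an assumed common form and is not taken from a displayed equation in this article; the function name and all numeric values are hypothetical. When the rate is shared across people, each trajectory is a linear combination of a constant and one fixed basis curve, which is what makes the model conditionally linear. When the rate varies across people, no single basis curve exists.

```python
import numpy as np

def exponential_growth(t, intercept, change, rate):
    """Assumed form: y = intercept + change * (1 - exp(-rate * t))."""
    t = np.asarray(t, dtype=float)
    return intercept + change * (1.0 - np.exp(-rate * t))

t = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

# Conditionally linear: people differ in intercept and change, share the rate.
shared_rate = 0.5
basis = np.column_stack([np.ones_like(t), 1.0 - np.exp(-shared_rate * t)])
person_a = exponential_growth(t, intercept=-1.0, change=3.0, rate=shared_rate)
person_b = exponential_growth(t, intercept=0.0, change=2.0, rate=shared_rate)

# Given the shared rate, each trajectory is basis @ (intercept, change).
same_as_linear = np.allclose(person_a, basis @ np.array([-1.0, 3.0]))

# Fully nonlinear: the rate itself differs across people, so person_c
# cannot be written with the same basis as person_a and person_b.
person_c = exponential_growth(t, intercept=-1.0, change=3.0, rate=0.2)
coef, residual, _, _ = np.linalg.lstsq(basis, person_c, rcond=None)
misfit = float(residual[0])  # > 0: the shared-rate basis does not reproduce person_c
```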
A general expression for a nonlinear mixed-effects model is
where
The fixed-effect parameters,
An example of a nonlinear mixed-effects model is the power model, which can be written as
with fixed effects
In the mixed-effects model, as with regression models, each observation of the outcome variable,
Standard Errors of Measurement in IRT
IRT is a latent variable measurement framework whereby the items on a scale are assumed to be related to an underlying latent variable. Thus, an individual’s observed item responses are thought to be determined by the individual’s score on the latent variable, item parameters representing the strength of the association between the latent variable and the item and the difficulty of endorsing or correctly responding to the item, and random variability. The two-parameter logistic model (2PLM) is a commonly used item response model for binary data (items coded 0/1) and the model that we use in our empirical example. We discuss this model to demonstrate the features of item response models. The 2PLM can be written as
where
To visualize the nature of the association between the probability of a correct response and the latent variable score, we have plotted this association given specific values of the discrimination and location parameters in Figure 1. In this plot, the probability of a correct response is on the y-axis and the latent variable score is on the x-axis. The item parameters are

Example of an ICC with
The trace line in Figure 1 highlights an important aspect of item response modeling regarding when an item provides information to help estimate an individual’s latent variable score. For example, the item whose trace line is contained in Figure 1 will provide little to no information regarding an individual’s latent variable score estimate if the individual’s true latent variable score is less than −1 or if the individual’s true latent variable score is greater than 2 (when
The amount of information that an item provides is defined as
where

Example of an Information Curve for the Item with
The amount of information to estimate an individual’s latent variable score is simply the sum of item information functions for the items in which the individual responded. If all participants respond to every item on a measurement instrument, then the test information,
where

Example of a Test Information Function for Five Items with
The variance of a latent variable score estimate is equal to the reciprocal of the amount of test information at the value of the latent variable score estimate, such that
and this variance represents noise variance in the estimation of the latent variable score, which is related to error variance from CTT. Taking the square root of this variance yields the standard error of measurement of the latent variable score estimate and indicates how much the latent variable score estimate is likely to vary by chance. The magnitude of this uncertainty depends on the number and type (item parameters) of items to which the individual responded and the estimate of the individual’s latent variable score. And it is this uncertainty that we want to account for during the estimation of growth models.
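The chain from item parameters to the standard error of measurement can be sketched in a few lines of Python. The sketch assumes the standard logistic form of the 2PLM with discrimination a and location b, and the usual 2PLM item information a^2 P(1 - P); the five item parameter pairs are hypothetical and are not the items shown in the figures. The optional `administered` mask is our own device for representing items that a participant did not receive.

```python
import numpy as np

def prob_correct(theta, a, b):
    """2PLM probability of a correct response (logistic metric, assumed)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information for the 2PLM: a^2 * P * (1 - P)."""
    p = prob_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def standard_error(theta, a, b, administered=None):
    """SEM = 1 / sqrt(sum of item information over administered items)."""
    info = item_information(theta, np.asarray(a, float), np.asarray(b, float))
    if administered is not None:
        info = info[np.asarray(administered, dtype=bool)]
    return 1.0 / np.sqrt(info.sum())

# Hypothetical discrimination and location parameters for five items.
a = [1.0, 1.5, 2.0, 1.2, 0.8]
b = [-1.5, -0.5, 0.0, 0.5, 1.5]

sem_center = standard_error(0.0, a, b)   # near the item locations
sem_extreme = standard_error(3.0, a, b)  # far above every item location
sem_short = standard_error(0.0, a, b, administered=[1, 1, 0, 0, 0])

print(sem_center, sem_extreme, sem_short)
```

In this sketch the standard error is smallest where the items are located, grows toward the extremes of the latent variable, and grows when fewer items are administered, which are the sources of differential precision discussed above.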
Accounting for Standard Errors of Measurement in Estimation
When estimating a growth model using the mixed-effects modeling framework with maximum likelihood estimation, we attempt to maximize the log-likelihood (LL) function. For a continuous outcome, the LL function for each observation is defined as
where
To account for the standard error of measurement for each observation, we propose to weight the
where
where
and summing these values over time and individuals will yield the final weighted LL. That is,
This type of weighting allows each observation to contribute to the LL depending on the relative size of the standard error of measurement for each observation, which is our preference when accounting for differential reliability in the outcome.
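A minimal sketch of a weighted log-likelihood of this kind is given below. It is not the exact estimator defined in the equations above: it assumes that each weight is the inverse of the observation's standard error of measurement, rescaled so that the weights average 1, and it conditions on a given predicted value for each observation rather than integrating over random effects. The function names and the example scores are hypothetical; the three standard errors reuse the minimum, mean, and maximum reported later in the Results purely as illustrative inputs.

```python
import numpy as np

def gaussian_loglik(y, mu, sigma2):
    """Log-likelihood of each observation under a normal residual model."""
    return -0.5 * (np.log(2.0 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

def precision_weights(se):
    """Assumed weights: inverse standard errors rescaled to a mean of 1."""
    precision = 1.0 / np.asarray(se, dtype=float)
    return precision / precision.mean()

def weighted_loglik(y, mu, sigma2, se):
    """Sum of observation-level log-likelihoods, each scaled by its weight."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(precision_weights(se) * gaussian_loglik(y, mu, sigma2)))

# Hypothetical scores and predicted values for three observations.
y = [0.10, 0.55, 1.90]
mu = [0.00, 0.60, 1.20]
se = [0.138, 0.162, 0.405]

w = precision_weights(se)
ll_weighted = weighted_loglik(y, mu, sigma2=0.25, se=se)
ll_unweighted = weighted_loglik(y, mu, sigma2=0.25, se=[1.0, 1.0, 1.0])
```

With equal standard errors every weight equals 1 and the weighted sum reduces to the ordinary log-likelihood. Rescaling the weights to average 1 keeps the weighted log-likelihood on roughly the same scale as the unweighted one; other normalizations are possible and would change that scale.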
Implementation
This weighted approach to estimation can be implemented in
Illustration
Data
Longitudinal data from the Applied Problems subtest of the Woodcock–Johnson Tests of Achievement III, collected as part of the National Institute of Child Health and Human Development’s Study of Early Child Care and Youth Development (SECCYD; NICHD Early Child Care Research Network, 1997), are analyzed for illustration. The SECCYD is a longitudinal study of
Analytic Methods
Item response modeling
The 2PLM (equation (6)) was fit to the longitudinal item-level data to estimate each child’s ability (latent variable score) to analyze and solve word problems at each measurement occasion. Analyzing the longitudinal item-level data with an item response model is an appropriate approach to estimate latent variable scores and ensures that the measurement properties of the scale (e.g., item parameters) are the same over time (Davoudzadeh, 2017). This allows the latent variable estimates to capture change over time. As noted above, incomplete item-level data occurred because of the starting and stopping rules of the Applied Problems subtest and due to the participant-level missingness (e.g., participant was not assessed at a given grade). The latent variable estimates from the 2PLM were the expected a posteriori estimates and are based on the items asked and the child’s pattern of correct and incorrect responses. Standard errors of the latent variable scores were also obtained from fitting the 2PLM.
The estimated latent variable scores are plotted against age at testing in Figure 4. In this plot, dots connected by a line are scores belonging to the same individual, and the slope of this line represents the observed rate of change between the assessments. Based on these observed trajectories, children’s mathematical problem-solving appears to increase rapidly during the early elementary school years before leveling off. There are individual differences in mathematical problem-solving at all ages, and there appear to be individual differences in the rate of growth in mathematical problem-solving.

Longitudinal Plot of Mathematical Problem-Solving Estimates from the 2PLM Against Age at Testing. 2PLM = two-parameter logistic model.
Growth modeling
Given the observed trajectories of mathematical problem-solving, a nonlinear change model is needed to capture the individual change process and the individual differences in the trajectories for mathematical problem-solving. Given the observed changes, a three-parameter exponential growth model was specified. This exponential growth model can be written as
where
Estimation
The 2PLM was estimated using Marginal Maximum Likelihood with the
Results
Standard errors of measurement
The mean standard error of measurement was 0.162, and the standard errors of measurement ranged from 0.138 to 0.405. The standard errors of measurement are plotted against the latent variable estimates in Figure 5. The figure shows a flat U-shaped association between the latent variable estimates and their standard errors of measurement. This type of association is common because there is often less variability in the pattern of correct and incorrect responses at the upper and lower ends of the ability range (e.g., a participant got almost all items correct or almost all items incorrect). It is also common for participants with low latent variable estimates to have responded to fewer questions on the Applied Problems subtest because of stopping rules.

Plot of the Standard Error of Measurement Against Latent Variable Estimate Based on the 2PLM. 2PLM = two-parameter logistic model.
Growth modeling
The exponential growth model in equation (15) was specified and fit to the latent variable estimates with and without the standard-error-informed weights (equations (11) and (12)). Parameter estimates for the two models are contained in Table 1. Overall, the estimates are fairly similar, which is expected; however, two estimates differ noticeably with and without the standard error weights: the rate of approach to the asymptotic level and the correlation between the intercept and the total amount of change. The estimate for the rate of approach to the asymptote was slightly smaller when the standard error weights were used (0.188 weighted vs. 0.193 unweighted), and the estimated correlation between the intercept and the total amount of change to the asymptotic level was slightly closer to zero (−0.207 weighted vs. −0.294 unweighted). These changes were caused by down-weighting the ability scores at the lower and upper ends of the distribution, which had higher standard errors of measurement due to the low variability in the pattern of correct and incorrect responses.
Parameter Estimates from the Exponential Growth Model (a) With and (b) Without Standard Error of Measurement Weights.
Discussion
The reliability of individual change can be particularly low. Reliability of change worsens under a variety of circumstances. For example, change reliability is diminished when there are fewer measurement occasions (see Bryk & Raudenbush, 1987). Change reliability becomes poorer when observed scores are close to the lower or upper boundary of the scale. This is a particular issue with psychological scales that were designed to measure relatively large between-person differences in behavior at a particular point in time, as well as scales that were designed to classify people with relatively extreme behaviors (e.g., depression). In these cases, scores at the upper and lower end of the distribution can represent a wide variety of behaviors, which hampers the measurement of change. Given that scores at the extremes of the distribution are more unreliable, it can be beneficial to down weight those observations in order to get a better representation of the overall change process.
Mixed-effects modeling is one framework used to assess individual change and between-person differences in change and allows for the inclusion of observation-level weights, as opposed to person-level weights. Observation-level weighting in mixed-effects modeling has been discussed in the literature (see Zhou, 2009) and has been implemented in a variety of ways. For example, Zhou (2009) down-weighted observations with higher residuals to reduce the impact of outliers. Grilli and Pratesi (2004) discussed probability-weighted estimation in mixed-effects models when there are differential inclusion probabilities at each sampling stage. The approach of down weighting less reliable scores here follows the ideas of Zhou (2009), where observed scores are weighted based on characteristics of the scores themselves.
The weighted LL approach is straightforward to implement in
Growth Mixture Modeling
Growth mixture modeling (Muthén & Shedden, 1999; Ram & Grimm, 2009) is an extension of growth models that can be used to explore whether there are groups of participants with different change trajectories, and this approach is common in the behavioral sciences. The issue of differential reliability, especially for extreme scores, may be a greater issue for growth mixture models because these models can account for non-Gaussian data through the incorporation of latent classes (Bauer & Curran, 2003). Extreme scores tend to be less reliable, and down weighting these scores may decrease the likelihood of growth mixture modeling identifying spurious classes (Bauer & Curran, 2003; K. Masyn, personal communication, September 27, 2018). Growth mixture models can be fit using
Second-Order Growth Models
Growth models can be and have been specified with lower-order measurement models, such as confirmatory factor models (McArdle, 1988) and item response models (McArdle et al., 2009; Wang et al., 2016). When analyzing longitudinal item response data, the second-order growth model with a lower-order item response model is considered optimal for at least two reasons. First, latent variable estimation is unnecessary. Second, the approach inherently weights observations given the number of item responses. That is, an observation from a participant who responded to 45 items counts more than an observation from a participant who only responded to 6 items. This second reason aligns with our goal of differentially weighting observations according to the standard error of measurement, which is partially related to the number of administered items.
The approach outlined here is a two-step approach with latent variable estimates obtained in the first step and longitudinal modeling comprising the second step. There are known limitations of latent variable estimation; however, the two-step approach is common because the combined estimation of longitudinal models and item response models is challenging, particularly when there is a large number of time points, a small sample, a large number of items, or a complex longitudinal model (e.g., nonlinear mixed-effects model). Moreover, the two-step approach is common when conducting integrative data analyses (Curran et al., 2008) because of the challenges with simultaneous estimation with a lack of consistent measurement protocols across studies. While simultaneous estimation of measurement and growth is optimal, two-step approaches remain common (e.g., every time a sum score is analyzed) and analyzing latent variable estimates from item response models is becoming more common. Furthermore, there are times where latent variable estimates and standard errors of measurement are available and the item-level data are not (e.g., Early Childhood Longitudinal Study–Kindergarten Cohort). Thus, we see many potential uses of this estimation approach.
Concluding Remarks
Psychological measurement is not often a priority when conducting longitudinal research even though it should be a top priority. In large-scale longitudinal studies, short forms are often implemented to minimize participant burden. While this is an important consideration, researchers should take advantage of item response models to quantify differences in the size of the standard error of measurement. Ideally, adaptive tests can be implemented with the goal of obtaining a specific standard error of measurement in an attempt to have equally reliable latent variable estimates. When this is not possible, we encourage researchers to give priority to scores that are known to have greater reliability whether in a change analysis or any statistical analysis (e.g., regression). This approach will help ensure that conclusions are not due to the least reliable scores.
Acknowledgments
The authors would like to thank Jack McArdle, Aki Hamagami, and Keith Widaman for their thoughtful comments on this work. This work was presented at the Developmental Methods Conference in Whitefish, MT, in September 2018.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Foundation Grant REAL-1252463 awarded to the University of Virginia, David Grissmer (Principal Investigator), and Christopher Hulleman (Co-Principal Investigator).
