Abstract
Recent work on measuring growth with categorical outcome variables has combined the item response theory (IRT) measurement model with the latent growth curve model and extended the assessment of growth to multidimensional IRT models and higher order IRT models. However, there is a lack of synthetic studies that clearly evaluate the strengths and limitations of different multilevel IRT models for measuring growth. This study aims to introduce the various longitudinal IRT models, including the longitudinal unidimensional IRT model, longitudinal multidimensional IRT model, and longitudinal higher order IRT model, which cover a broad range of applications in education and social science. Following a comparison of the parameterizations, identification constraints, strengths, and weaknesses of the different models, a real data example is provided to illustrate the application of different longitudinal IRT models to model students’ growth trajectories on multiple latent abilities.
1. Introduction
In education, one is often interested in determining student growth. These changes can sometimes be captured by latent variable models. The latent variables, such as students’ abilities, are typically measured by binary (or polytomous) responses to items. Item response theory (IRT) models are useful tools to model the relationship between the categorical outcome variables and the latent continuous traits. Recent work has extended IRT models to model changes in latent traits, leading to the family of longitudinal IRT (L-IRT) models (e.g., Andersen, 1985; Cai, 2010; Hsieh, von Eye, & Maier, 2010; Huang, 2013; McArdle, Grimm, Hamagami, Bowles, & Meredith, 2009; Paek, Li, & Park, 2016; von Davier, Xu, & Carstensen, 2011; Wang, Kohli, & Henn, 2016; Wilson, Zheng, & McGuire, 2012). Within this family, models differ mainly in the following aspects: (1) the measurement model that implies the factor structure of the primary latent traits measured repeatedly, which could either be unidimensional, multidimensional (Hsieh et al., 2010), or hierarchical (Huang, 2013); (2) the relationship of the latent traits over time, which could either be captured by a completely unstructured covariance matrix (Andrade & Tavares, 2005; Cai, 2010; Paek et al., 2016) or by linear/nonlinear change patterns via the latent growth curve (LGC) models (Bollen & Curran, 2006; Duncan, Duncan, & Strycker, 2006); and (3) whether nuisance factors are in place to account for the dependency of the same items administered over time (e.g., two-tier model; Cai, 2010; Paek et al., 2016; Wang et al., 2016).
Due to the well-known connection between IRT and categorical factor analysis (e.g., Takane & de Leeuw, 1987), L-IRT models can also be discussed in structural equation modeling (SEM) terms. However, IRT offers two conceptual advantages: (1) assuming item (or anchor item) parameters are the same over time to ensure longitudinal invariance of the lowest order traits and (2) incorporating guessing parameters into the functional form of the model.
Different forms of L-IRT models were proposed by different groups of researchers, and they have all been individually demonstrated to work well; however, few studies have explored the connections among the models or the strengths and limitations of each of them. Our goal here is to capitalize on the shared features and distinctions among various L-IRT models to provide practitioners with coherent guidelines about the conditions under which each model could be applied and/or should be preferred.
Three specific types of models will be the focus of discussion. In order of complexity, these models include the longitudinal unidimensional IRT (L-UIRT) model (Wang et al., 2016; Wilson et al., 2012), longitudinal multidimensional IRT (L-MIRT) model (Hsieh et al., 2010), and longitudinal higher-order IRT (L-HO-IRT) model (Huang, 2013). All of these models are variations of the general LGC model and the respective measurement model: The UIRT model assumes that a single latent trait is measured by all the items; MIRT models posit that item responses are probabilistically determined by multiple, usually correlated, latent traits; the HO-IRT models (de la Torre & Song, 2009; Sheng & Wikle, 2008) capture the hierarchical nature of factor structure (e.g., Huang & Wang, 2014; Sawaki, Stricker, & Oranje, 2009), whereby a general factor (such as math aptitude) informs domain-specific factors (such as algebra, geometry, calculus, or subsets thereof). These three models were selected to cover a majority of practical applications. Moreover, LGC models were chosen over an unstructured covariance matrix because LGC results in both group-level and individual-level growth trajectories, which are often useful for interpreting data patterns. On the other hand, LGC introduces additional latent variables (i.e., individual intercepts and slopes) that complicate model identification constraints and require additional guidelines for model estimation. Note that the L-MIRT model with an unstructured covariance matrix of θ over time is discussed in detail in Paek, Li, and Park (2016).
In the remaining sections, we introduce the three models and explain when each model could be applied. For each model, we describe identification constraints, which can be different depending on whether some items have precalibrated parameters. After determining the identification requirements, we are then ready to estimate the models. Estimation presents various challenges, and we describe the available estimation methods, complications due to high dimensionality, and possible solutions. We finally illustrate the models with a real data example.
2. L-IRT Models
2.1. L-UIRT Model
If only one primary latent trait is measured over time, then the simplest model, the L-UIRT model, can be applied. Let
In Equation 1,
For a simple linear growth model with a single person-specific intercept and slope, we can rewrite Equation 1 as
where
The latent variable described by Equation 2,
where

Figure 1. A path diagram for the longitudinal unidimensional item response theory model with three items per time point and three time points.
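The structure described in this subsection (a person-specific intercept and slope driving a time-specific trait, which in turn drives binary item responses) can be sketched as a small simulation. This is a minimal illustration rather than the article's own specification: the two-parameter logistic response function, the time coding 0, 1, 2, and every default value and name (`simulate_luirt`, `mean_slope`, `sd_resid`, and so on) are assumptions chosen for the example.

```python
import math
import random


def simulate_luirt(n_persons=200, n_times=3, a=(1.0, 1.2, 0.8), b=(-0.5, 0.0, 0.5),
                   mean_slope=0.5, sd_intercept=1.0, sd_slope=0.3, sd_resid=0.3, seed=1):
    """Simulate binary responses under a linear-growth 2PL model (illustrative only)."""
    rng = random.Random(seed)
    data = []    # data[i][t][j] = 0/1 response of person i at time t to item j
    thetas = []  # thetas[i][t] = latent trait of person i at time t
    for _ in range(n_persons):
        intercept = rng.gauss(0.0, sd_intercept)  # person-specific intercept, mean set to 0
        slope = rng.gauss(mean_slope, sd_slope)   # person-specific slope
        person_resp, person_theta = [], []
        for t in range(n_times):
            # Time-specific trait: linear trajectory plus a mean-zero residual.
            theta = intercept + slope * t + rng.gauss(0.0, sd_resid)
            # Same item parameters at every occasion (assumed, hypothetical values).
            probs = [1.0 / (1.0 + math.exp(-aj * (theta - bj))) for aj, bj in zip(a, b)]
            person_resp.append([1 if rng.random() < p else 0 for p in probs])
            person_theta.append(theta)
        data.append(person_resp)
        thetas.append(person_theta)
    return data, thetas
```

Holding the item parameters `a` and `b` constant across occasions is what places the simulated traits on a common scale in this sketch; it mirrors the anchor-item requirement only loosely.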
Many large-scale educational surveys have primary measurements that differ from one occasion to another (Edwards & Wirth, 2009; McArdle et al., 2009). Yet, to establish a common scale, one must have either a common set of anchor items that is shared across time or sets of anchor items whose parameters have already been precalibrated and put on a common scale (e.g., Wang et al., 2016). Kolen and Brennan (2004) recommended that at least 20% of the items on an assessment serve as anchor items that link the parameters to the common scale. If enough items are linked across time, and assuming no item parameter drift, then assessments with unknown item parameters still require some model identifiability constraints. Constraints are required to fix the mean and variance of the latent variable (ξ) at one time point (commonly the first). The minimum constraints are: (1) all of the residuals having mean 0, (2) the mean of the person-specific intercept parameter being set to 0, and (3) the residual variance at the first time point being fixed to a constant.
Note that after imposing a growth curve structure on θ, θ becomes an endogenous variable in Equations 1 and 2. Hence, instead of directly fixing the mean and variance of θ (as is often desired), most SEM software packages (such as Mplus) only allow fixing its intercept and the residual variance. The value of c
1 is arbitrary and results in the variance of θ at
2.2. L-MIRT Model
As a multivariate extension of the L-UIRT model, the L-MIRT model combines the MIRT model with the associative LGC model. The earliest version of the L-MIRT model was proposed by McArdle (1988) and called the “curve of factors” (CUFFS) model. The CUFFS model was developed for multiple, correlated latent traits being tracked over time. For instance, the National Education Longitudinal Study of 1988 (NELS:88) tracked students’ academic performance across three measurement occasions on four correlated cognitive scales: mathematics, reading, science, and social studies. In this case, the L-MIRT model, rather than the L-UIRT model, can better recover the group-level and individual-level growth trajectories by considering all related information. Note that the name “L-MIRT,” rather than “CUFFS,” is used throughout this didactic for consistency with the other models’ names.
Let
Similar to the notations in Equation 1,
Similarly,
where
To be consistent with the description of the L-UIRT model, assume that each domain-level latent trait follows a simple linear trajectory without any additional covariates, which is analogous to the assumption made in the preceding section. Then
We can also rewrite the model by expanding Equation 4 as follows:
where
The L-MIRT IRF takes the form of
where

Figure 2. A path diagram for the longitudinal multidimensional item response theory model with three items per domain-level trait, two domain-level traits per time point, and three time points.
As in the L-UIRT model, items can differ across time, as reflected by the superscript t on item parameters in Equation 8, but anchor items must still be embedded in the item parameter sets to link the scale. Because each domain has a potentially unique scale, anchor items must load on every domain so that the scale of every domain-level trait is linked across time. The minimum identifiability constraints are: (1) all of the residuals having mean 0, (2) the means of the person-specific intercept parameters being set to 0, and (3) the residual variances at the first time point being set to a constant.
As before, when anchor items are precalibrated with known parameters, only the first constraint must be specified to identify the model.
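The structural part of the multidimensional model can be sketched in the same spirit: each domain-level trait has its own intercept and slope, and the intercepts are allowed to correlate across domains. The sketch below is illustrative only. The shared-component construction used to induce the correlation `rho`, the uncorrelated slopes, the time coding 0, 1, 2, and all names and default values are assumptions rather than features of the article's model.

```python
import math
import random


def simulate_domain_traits(n_persons=500, n_domains=2, n_times=3, rho=0.6,
                           mean_slopes=(0.4, 0.6), sd_intercept=1.0, sd_slope=0.2,
                           sd_resid=0.3, seed=3):
    """Simulate domain-level traits with domain-specific linear growth (illustrative only)."""
    rng = random.Random(seed)
    traits = []  # traits[i][k][t] = trait of person i on domain k at time t
    for _ in range(n_persons):
        shared = rng.gauss(0.0, 1.0)  # shared component: induces correlation rho among intercepts
        person = []
        for k in range(n_domains):
            unique = rng.gauss(0.0, 1.0)
            intercept = sd_intercept * (math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * unique)
            slope = rng.gauss(mean_slopes[k], sd_slope)  # each domain grows at its own rate
            person.append([intercept + slope * t + rng.gauss(0.0, sd_resid)
                           for t in range(n_times)])
        traits.append(person)
    return traits
```

Because every domain carries its own intercept and slope, different domains are free to follow different trajectories, which is the flexibility that distinguishes this structure from a single-trait model.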
2.3. L-HO-IRT Model
Hierarchical factor structures often emerge in the social sciences to represent a latent construct of interest such as intelligence (Golay & Lecerf, 2011), cognitive ability (Murray & Johnson, 2013), or personality (DeYoung, 2006). General factors are often comprised of several highly related specific factors (a.k.a. first-order factors), each of which is measured by multiple indicators (usually referred to as items). For example, in many educational assessments, one is often required to report both overall proficiency for accountability purposes as well as domain-specific proficiency for diagnostic purposes. To this end, the HO-IRT model was developed by introducing a higher order ability (de la Torre & Hong, 2010; de la Torre & Song, 2009) that relates to each of the first-order abilities. The HO-IRT model contains two levels: (1) a link between a single overall latent trait and one of several domain latent traits and (2) a probabilistic relationship between each domain latent trait and items designed to measure that domain. Specifically, let θ represent the domain latent trait underlying responses to test items and denote ξ as the higher order trait. Then, one can hypothesize that
where
To extend the HO-IRT model across T time points, assume the second-order factor (i.e., overall ability) follows the LGC model, as in Equation 1. Then, the domain-specific ability for person i at time t would also be predicted to systematically change over time (Huang, 2013, 2015) as follows:
Equation 10 can be further understood by expanding it using a scalar equation. That is, given Equations 6 and 7, a domain-specific ability for person i at time point t,
Notably, Equation 11 implies that the loading of the domain-specific factors on the overall factor remains the same over time, as indicated by the lack of a superscript t on

Figure 3. A path diagram for the longitudinal, higher order item response theory model with three items per domain-level trait, two domain-level traits, and three time points.
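The consequence of keeping the loadings constant over time can be made concrete with a short derivation. The notation here is assumed for illustration and may differ from the article's own equations: λ_k is the loading of domain k on the overall trait, π_{0i} and π_{1i} are the person-specific intercept and slope of the overall trait, t is the time code, and e and ε are mean-zero residuals at the overall and domain levels.

```latex
\begin{align}
\xi_{it}     &= \pi_{0i} + \pi_{1i}\,t + e_{it}, \\
\theta_{ikt} &= \lambda_k\,\xi_{it} + \varepsilon_{ikt}
              = \lambda_k\pi_{0i} + \lambda_k\pi_{1i}\,t
                + \lambda_k e_{it} + \varepsilon_{ikt}.
\end{align}
```

Under this sketch, the implied intercept and slope of every domain are proportional to those of the overall trait (λ_k π_{0i} and λ_k π_{1i}). A model that lets each domain have its own freely varying intercept and slope does not impose that proportionality.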
As shown in Equations 5 and 11, the L-HO-IRT model is nested within the L-MIRT model. This is because the L-MIRT model allows for separate, potentially unrelated, individual intercept and slope parameters across each dimension (i.e.,
Assuming either that the same sets of items are repeatedly administered or that the test includes shared items between adjacent time points for all domains, the minimum model identifiability constraints include: (1) all of the residuals having mean 0, (2) the mean of the person-specific intercept parameters being set to 0, (3) all of the residuals in the measurement model having mean 0, (4) the residual variances at the first time point being set to a constant, and (5) one of the loading parameters being fixed.
The first two constraints are essentially the same as the first two constraints for both the L-UIRT model and the L-MIRT model described earlier. The remaining constraints are unique to the L-HO-IRT model. The last constraint is similar to the “reference indicator” constraint in factor analysis. That is, the variance of a factor can be determined by fixing the loading of one marker indicator. Here, the “marker indicator” is one of the first-order factors,
They argued that the variance of
2.4. Applications of the Models
Applying one of the above models versus another depends mostly on the hypothesized factor structure of the latent traits. Higher-order models are often applicable in contexts where a measurement instrument assesses several related constructs that can be accounted for by one or more underlying second-order factors (Chen et al., 2006). For instance, a common scale to measure “quality of life” is composed of four subscales that each presume to measure a distinct first-order factor: mental health, cognition, vitality, and health worry (Chen et al., 2006). The covariance between each pair of first-order factors can be explained by a higher order factor, which is usually called “global quality of life.” Similarly, educational measures are often constructed to assess several, separate but correlated, content domains that can be partially explained by a more general ability. For instance, a mathematics test may have items measuring numerical computation skills and data analysis skills (Reckase, 2009, p. 232). Both of these are examples of content-based multidimensionality rather than strict construct-based multidimensionality.
In practice, one cannot typically distinguish between content multidimensionality and construct multidimensionality because content-based subscales often measure distinct constructs. Yet certain content-based domains sometimes have exceedingly high correlations, implying that these domains essentially measure the same skill or construct (Reckase, 2009). In cases like these, one should always provide evidence that combining domains makes substantive sense or yields a better fit than keeping those domains separate.
Although a correlated-factor MIRT model will always fit data generated from the HO-IRT model, the higher order model has at least four advantages for being preferred in practice: As compared with the correlated-factor MIRT model, the HO-IRT model (1) parsimoniously explains the covariance between lower order factors (Gustafsson & Balke, 1993; Rindskopf & Rose, 1988), (2) separates the variance in the lower order factors shared by the common higher order factor from the unique variance of the lower order factors, (3) simplifies model estimation due to the exploitation of the dimension reduction technique (as described in the next section), and (4) allows for potential construct shifts over time.
To elaborate on the last point, assume teachers want to track students’ ability in a general subject area such as math knowledge. If math knowledge is a unidimensional trait, it can be measured directly by a set of items, and if the teacher is not interested in measuring any specific subareas of mathematics, then the L-UIRT model is sufficient. However, math knowledge might relate to a number of specific content areas that teachers might also wish to track. For example, Table 1 presents the content coverage of the mathematics common core domains across five grades. Some domains (such as Domain 5 “Geometry” and Domain 4 “Measurement and Data”) are expected to be taught and developed in every grade from Kindergarten through Grade 4. Student growth in these domains can be tracked across all five grades. However, the required content coverage shifts from grade to grade, and many domains only appear in limited grades. For instance, Domain 1 (“Counting and Cardinality”) is expected to be assessed only in Kindergarten, whereas Domain 6 (“Numbers and Operations-Fractions”) does not emerge until Grade 3. In these cases, the L-MIRT model and L-UIRT model overlook crucial details. In particular, the L-MIRT model (Hsieh et al., 2010) essentially assumes a constant set of traits measured over time. Yet even in this relatively straightforward example, the domains are designed to change over time.
Table 1. Mathematics Common Core Domains by Grade (K–4)
However, when the same sets of domains are indeed measured over time, the L-MIRT model is preferred because the L-HO-IRT model is parametrically more restricted than the L-MIRT model. That is, any growth patterns in the lower level traits that can be captured with the L-HO-IRT model can ultimately be captured with the L-MIRT model. Yet, if the multidimensional (lower level) constructs each change differently over time, then the L-HO-IRT model would no longer fit the data, and one should use the L-MIRT model. For instance, if certain domain-level traits grow linearly, whereas others grow in a piecewise fashion, then one should no longer use the L-HO-IRT model due to the restrictions implicit in Equation 10. On the other hand, the L-MIRT model can handle different growth patterns if needed.
When assessing change over time, one must consider whether the measures retain measurement invariance. Often, practitioners use the exact same scale on multiple occasions. This practice can ensure that identical constructs are continuously assessed and that the metric of measurement remains the same over time. However, out of necessity, scales often differ across repeated measurements due to the need for “developmentally appropriate measures” (Widaman, Ferrer, & Conger, 2010). Adjusting the scale to consider the typical range of traits over repeated measurements can help avoid ceiling and floor effects.
Determining whether the same construct, measured by multiple indicators, has the same meaning and metric over time falls under the rubric of measurement invariance (Widaman et al., 2010), and is often referred to, especially in a longitudinal setting, as longitudinal invariance. The factorial invariance of longitudinal measures is paramount in evaluating the change in behavior over time (McArdle, 2001; McArdle & Hamagami, 2001; Meredith & Tisak, 1990; Widaman & Reise, 1997). Using the same set of items or a set of anchor items (Grimm, Kuhl, & Zhang, 2013) partially satisfies longitudinal invariance. A thorough examination of longitudinal invariance is beyond the scope of this article. Interested readers can refer to Teresi (2006), Isiordia and Ferrer (2018), and Liu et al. (2017) for details regarding invariance assumptions of L-UIRT, L-MIRT (i.e., CUFFS), and L-HO-IRT, respectively.
3. Model Estimation
Within the general framework of SEM, the L-IRT models can be viewed as a multilevel LGC model with the lowest level represented by categorical indicators. Unsurprisingly, the L-IRT models can also be motivated from the framework of generalized linear models (McCullagh & Nelder, 1989), a conceptualization favored within biostatistics. The most common methods for estimating multilevel models are based on integrating the likelihood over the distribution of random effects, which is often referred to as marginal likelihood estimation. For instance, in the L-HO-IRT model, the overall- and domain-specific latent abilities as well as the latent intercepts and slopes represent the random effects over which to integrate. Because analytical integrals often do not exist for these types of models, researchers frequently adopt one of two classes of methods. One could either approximate the integrand analytically or evaluate the integral via numerical approximation. The first approach includes Laplace’s method of linearizing the integrand via a sixth-order Taylor series approximation (called “Laplace 6”) as well as quasi-likelihood methods such as marginal quasi likelihood (MQL; Goldstein, 1991; Goldstein & Rasbash, 1996) and penalized quasi likelihood (PQL; Breslow & Clayton, 1993; Laird, 1978). Because the performance of PQL and MQL depends on the validity of a normal approximation, these methods tend to perform poorly when the observed data are markedly nonnormal (Rodriguez & Goldman, 1995; Tuerlinckx, Rijmen, Verbeke, & De Boeck, 2006) and are thus typically not recommended for use in IRT models with binary responses. The second approach includes ML using Gauss–Hermite quadrature, adaptive quadrature, and simulation methods such as the Monte Carlo expectation-maximization (EM) algorithm (Wang & Xu, 2015).
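To make the idea of numerically integrating over a random effect concrete, the sketch below approximates the marginal probability of one response pattern under a single standard normal latent trait. It is a simplified stand-in, not the method used by any particular program: it uses an equally spaced grid with normalized normal-density weights instead of true Gauss–Hermite nodes, assumes a two-parameter logistic response function, and all function names and default values are hypothetical.

```python
import math


def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def marginal_likelihood(responses, a, b, n_points=61, bound=6.0):
    """Approximate the integral of L(responses | theta) * N(theta; 0, 1) on a fixed grid."""
    step = 2.0 * bound / (n_points - 1)
    nodes = [-bound + q * step for q in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(dens)
    weights = [d / total for d in dens]  # normalized so the weights sum to 1
    marginal = 0.0
    for x, w in zip(nodes, weights):
        like = 1.0
        for u, aj, bj in zip(responses, a, b):
            p = irf_2pl(x, aj, bj)
            like *= p if u == 1 else (1.0 - p)  # conditional independence given theta
        marginal += w * like
    return marginal
```

With one latent dimension the sum runs over `n_points` nodes. Each additional latent dimension multiplies the number of nodes, which is why the dimension of integration matters so much for the models discussed here.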
However, ML estimation via the EM algorithm is known to converge slowly in many applications (e.g., Meng & van Dyk, 1997) and is computationally intensive when the number of latent variables is large. Bayesian estimation using Markov chain Monte Carlo (MCMC) with diffuse (or noninformative) priors (Patz & Junker, 1999) is an alternative to EM (Huang, 2013; Wang & Nydick, 2015) and is usually preferred for complex models.
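As a rough illustration of how MCMC sidesteps explicit integration, the sketch below draws from the posterior of a single person's latent trait with a random-walk Metropolis sampler. It is deliberately minimal and does not reproduce the samplers used by Mplus or the cited studies: item parameters are treated as known, the prior is standard normal, the response function is two-parameter logistic, and every name and tuning value is an assumption.

```python
import math
import random


def log_posterior(theta, responses, a, b):
    """Log of N(0, 1) prior times 2PL likelihood, up to an additive constant."""
    lp = -0.5 * theta * theta
    for u, aj, bj in zip(responses, a, b):
        z = aj * (theta - bj)
        # Numerically stable log P(u = 1) and log P(u = 0) under the logistic function.
        lp += -math.log1p(math.exp(-z)) if u == 1 else -math.log1p(math.exp(z))
    return lp


def metropolis_theta(responses, a, b, n_iter=4000, burn_in=1000, step_sd=1.0, seed=7):
    """Random-walk Metropolis draws from the posterior of theta (illustrative only)."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, responses, a, b)
    draws = []
    for it in range(n_iter):
        proposal = theta + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, responses, a, b)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if it >= burn_in:
            draws.append(theta)  # keep post-burn-in draws only
    return draws
```

The average of the retained draws estimates the posterior mean of the trait. In a full L-IRT analysis, the same logic would be applied jointly to all latent traits, growth parameters, and structural parameters.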
All of the above estimation methods are based on full information, in that the likelihood is constructed directly from the raw response pattern. Alternatively, one could adopt limited information estimation methods, such as modified weighted least squares (WLS) estimation. Rather than basing the likelihood on the complete response pattern, modified WLS estimates model parameters via the first four moments of the response contingency table. By avoiding the time-consuming numerical integration or sampling steps of the full information methods, WLS leads to much faster convergence. However, WLS is known to yield inaccurate estimation with small sample sizes or large amounts of missing data (e.g., Forero & Maydeu-Olivares, 2009). Moreover, the parameter estimates from WLS are not as efficient as a full information method (Muthén & Asparouhov, 2015). Given these limitations, WLS is not discussed further in this article.
In the following subsections, we describe estimating the L-IRT models in Mplus with ML or MCMC methods. Mplus was chosen because it is widely used in social science research. Other IRT estimation software packages, such as flexMIRT (see Paek et al., 2016, for details on how to estimate similar models to those described in this article), or general-purpose estimation packages, such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), should also be able to recover L-IRT-based model parameters. Interested readers could refer to Curtis (2010) or Isiordia and Ferrer (2018), which present BUGS code and R code (using the “lavaan” package, see Rosseel, 2012), respectively, for estimating a subset of L-IRT models. Details of estimating L-IRT models using WLS are explained in Wang, Kohli, and Henn (2016).
3.1. MLE
When using Mplus, one must specify the model estimation method in the
As indicated in the last line of the previous statement, we recommend using Mplus’s
Table 2 illustrates the dimensions of numeric integration for each of the three models with values in parentheses assuming that
Table 2. Number of Continuous Dimensions and Dimensions of Numerical Integration for Different Models and Methods (T denotes the number of time points, K denotes the number of lower-order latent traits, q denotes the number of random effects)
Note. IRT = item response theory; L-UIRT = longitudinal unidimensional IRT; L-MIRT = longitudinal multidimensional IRT; L-HO-IRT = longitudinal higher order IRT.
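The practical weight of the integration dimension can be seen with simple arithmetic: a product-rule quadrature grid with Q points per dimension needs Q raised to the power D evaluations for D dimensions. The snippet below is only an illustration; the choice of 15 points per dimension and the dimensions 2, 4, and 8 are hypothetical and are not taken from Table 2.

```python
def n_quadrature_points(points_per_dim, n_dims):
    """Size of a full product-rule quadrature grid."""
    return points_per_dim ** n_dims


for d in (2, 4, 8):
    print(d, n_quadrature_points(15, d))
# 2 225
# 4 50625
# 8 2562890625
```

Halving the number of integration dimensions therefore cuts the work by far more than half, which is the motivation for the dimension reduction strategies discussed in this section.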
The right-most column in Table 2 indicates the dimensions of integration if using an analytic dimension reduction technique. Analytic dimension reduction is often used to rearrange terms in the marginal likelihood integral to yield a series of integrals, each of much lower dimension than the original integral (Cai, Yang, & Hansen, 2011; Gibbons & Hedeker, 1992; Rijmen, Vansteelandt, & de Boeck, 2008). Applying a dimension reduction technique to the L-UIRT model, rewrite Equation (3) as
If assuming that
The L-HO-IRT model has a different dimension reduction solution given the addition of the higher level trait. First, write the HO-IRT IRF as
where
Advantages of estimating parameters using the EM algorithm, as compared with Bayesian methods, in Mplus include: (1) being able to estimate the three-parameter logistic (3PL) model rather than only being able to estimate one or two parameter normal ogive models, (2) providing comparative model fit indices such as Akaike information criterion (AIC) and Bayesian information criterion (BIC), and (3) being able to impose equality constraints on model parameters. Note that these limitations of Bayesian methods are not necessarily inherent to the methods themselves, only to the application of those methods in Mplus. Due to the high-dimensional integration, we have had more success estimating the L-IRT models with the MCMC option in Mplus. Researchers and practitioners should always keep in mind complexity and feasibility when choosing a model and corresponding estimation algorithm.
3.2. MCMC
If estimating IRT-based item parameters with MCMC, include the following
In the above statement, the
The next section provides a real data example that uses Mplus (Version 8) to estimate the parameters of the L-IRT models. A corresponding simulation study, demonstrating parameter recovery of the three L-IRT models, is included as an appendix to this article in the online version of the journal.
4. A Real Data Example
The current section applies the three L-IRT models to a real data example. The purpose of this demonstration is to illustrate the potential application of each model as well as the information each model provides to researchers and practitioners. For this purpose, we adopted and analyzed a series of math assessments that students in one Midwest state took between 2009 and 2012. These students were assessed in each of Grades 3 through 6 using a five-dimensional, simple-structure test with precalibrated item parameters. The five dimensions had been termed “number and operation,” “geometry and spatial sense,” “data analysis, statistics, probability,” “measurement,” and “algebra, functions, and patterns,” respectively. Students took 57 items in 2009 (with 23, 9, 7, 11, and 7 items, respectively, measuring each dimension) and 52 items in each of the three subsequent years (with 23, 9, 7, 11, and 7 items, respectively, measuring each dimension). After initial data cleaning, only
Due to different sets of items being administered in each year, common-item linking is not possible. However, precalibrated anchor items were embedded within each of the five dimensions across all 4 years and are all on the same scale. Because of fixing known anchor items, many of the identifiability constraints need not be explicitly specified (see the model description section for additional details). Only
To evaluate global model fit in Bayesian models with categorical outcome variables, Mplus provides the Bayesian posterior predictive p value (Kaplan & Depaoli, 2012; Muthén, 2010). In our case, the Bayesian p values for the L-UIRT, L-MIRT, and L-HO-IRT models were estimated to be .103, .081, and .106, respectively, implying that all three models yielded acceptable global fit. Note that other Bayesian software packages such as JAGS (Plummer, 2003) provide the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) for model comparison. Mplus does not yet include DIC for models with categorical indicators.
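The logic of a posterior predictive p value can be sketched generically: at each retained MCMC draw, a discrepancy statistic is computed for the observed data and for data replicated from the model, and the p value is the share of draws in which the replicated discrepancy is at least as large as the observed one. The function below illustrates that definition only. It does not reproduce the specific discrepancy function used by Mplus, and the function name and inputs are hypothetical.

```python
def posterior_predictive_p(observed_disc, replicated_disc):
    """Share of posterior draws whose replicated discrepancy is at least the observed one."""
    if len(observed_disc) != len(replicated_disc) or not observed_disc:
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(1 for o, r in zip(observed_disc, replicated_disc) if r >= o)
    return exceed / len(observed_disc)
```

Values near .5 suggest that the observed data look like data the model would generate, whereas values close to 0 suggest misfit.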
Table 3 presents the parameter estimates from the three L-IRT models. Because of fixing
Table 3. Structural Model Parameter Estimates for Three Different Models
Note. NP denotes the number of free parameters in each model. The covariances between random intercepts and random slopes from the L-MIRT model are omitted to save space because they are between −.01 and .01. “*” denotes a fixed constant. IRT = item response theory; L-UIRT = longitudinal unidimensional IRT; L-MIRT = longitudinal multidimensional IRT; L-HO-IRT = longitudinal higher order IRT.
In contrast to the L-HO-IRT model, the L-UIRT and L-MIRT models can be directly compared in this case due to anchor items setting the scale for the lower order traits. From Table 3, one can see that averaging the intercepts and slopes from the L-MIRT model leads to estimates similar to those from the L-UIRT model. Yet the variance of the intercept and slopes from the L-MIRT model is much larger, implying that evaluating individual performance at the domain level leads to higher variability than assuming that responses are all generated from a single, common trait. That said, if a test is constructed across several domains, considering domain-level growth patterns may reveal subgroup differences otherwise diminished if assuming responses came entirely from a unidimensional trait.
Figure 4 presents a spaghetti plot of the overall ability across time for

A spaghetti plot, illustrating the linear trend of ξ (overall-level ability) on math between Grades 3 and 6 for

A spaghetti plot, illustrating the linear trend of θ (domain-level ability) on math between Grades 3 and 6 for
5. Conclusion
Many teachers, administrators, and policymakers require the measurement of student growth. Teachers can use estimated growth to modify lesson plans based on strategies of improvements. Administrators can use estimated growth to examine school performance and help make budgetary decisions. In either case, one must ensure estimates are accurate across several, possibly correlated, ability dimensions. Several L-IRT models have been proposed for different purposes. These L-IRT models all share the same form and contain two components: (1) an IRT measurement model for each measurement occasion and (2) an LGC model imposed on the latent trait, quantifying the intraindividual developmental trajectories. In this article, we reviewed three specific types of L-IRT models with the goal of demonstrating appropriate applications of these models for longitudinal assessment. We also illustrated fitting different models with a commonly used software package.
Among the three models, the L-UIRT model is the simplest and has been the most extensively studied in the literature (e.g., Andersen, 1985; Embretson, 1991; Grimm et al., 2013; McArdle et al., 2009; von Davier et al., 2011; Wang et al., 2016; Wilson et al., 2012). In contrast to the L-UIRT model, which tracks change in a unidimensional latent trait, the L-MIRT model describes change in multiple, correlated latent traits (see Paek et al., 2016). Compared to models that directly model change in the lower level abilities, the L-HO-IRT model includes two unique features. First, because the HO-IRT model captures the hierarchical nature of learning, the L-HO-IRT model simultaneously models the growth trajectories of both overall- and domain-specific abilities. Second, as described earlier in this article, the L-HO-IRT model allows for a shift in domain coverage over time, as long as one carefully verifies the second-order longitudinal invariance requirement (e.g., Chen et al., 2006; Liu et al., 2017). Allowing for a shift in the domain coverage over time is extremely important in educational measures, as one typically finds more advanced domains added and basic domains eliminated as students complete more schooling. Furthermore, a higher order model allows one to find trends at the individual, domain level. Domain-level information can hint at particular academic subjects that improve the most over particular grades. For instance, in our real data example,
In terms of model estimation, we provided a thorough discussion of the analytical dimension reduction techniques that are available to alleviate high-dimensional integration challenges of marginal MLE (MMLE). Even after dimension reduction, the number of integration dimensions can still be high. In this case, the Metropolis–Hastings Robbins Monro algorithm (Cai, 2010) or the MCMC algorithm can be used in lieu of MMLE via EM. Given that the L-MIRT and L-HO-IRT models are less studied in the literature, a simulation study was conducted to provide a thorough quality control check on the precision in estimating model parameters (refer to the Supplementary File in the online version of the journal for details of the simulation, which evaluated the recovery of both structural parameters and individual latent traits/growth parameters). In the simulation results, all model parameters were adequately recovered, and the generating model evidenced adequate model fit. Even with the supporting evidence from the simulation study, interested users of the L-HO-IRT and L-MIRT models should keep in mind that both of these models should only be applied when there are sufficient items per domain; otherwise, the domain-level θs and the resulting higher order factors (i.e., ξ and growth parameters) would not be reliably estimated.
This article serves two purposes. First, no prior paper has explicitly documented and reviewed the three popular L-IRT models as well as their identifiability constraints with and without known item parameters. Including this information has profound didactic value for practitioners who wish to apply the models to their own data. Sample Mplus code is provided in the Appendix in the online version of the journal for each model for readers’ reference. Second, this article is the first attempt to thoroughly compare and demonstrate the applicability of each of the discussed models. Even though these models can adequately capture changes in typical longitudinal measures, they are by no means exhaustive. A handful of other longitudinal models exist, such as the two-tier model (Cai, 2010), in which nuisance factors are introduced to account for residual dependencies between common items over time, or the item-level growth curve model (Paek et al., 2016), in which growth rates for different items can differ and therefore be described and examined.
Regardless of the chosen model, constructing and estimating growth using L-IRT can improve the measurement of educational outcomes and thus provide educators with the tools they need to better help students learn. Currently available software packages can estimate growth across a wide variety of measurement models (e.g., 1PL, 2PL, 3PL, unidimensional, multidimensional, and higher order) and LGC models (i.e., Equations 1 and 4). Interested practitioners should be cognizant of the different estimation methods offered in each of the programs and choose the method appropriate for the problem at hand, especially given complex models with many estimable parameters. For instance, the discussed analytic dimension reduction technique is relevant only to MML estimation approaches and not to the Bayesian MCMC estimation approach commonly used to estimate parameters of complex models. Software packages such as Mplus may not automatically use a given dimension reduction unless the command file (or source script) is written with dimension reduction in mind. Hence, understanding the logic of dimension reduction can help with constructing the command file or script processed by the algorithm and greatly reduce computation time.
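For readers who want to check a chosen program and estimation method against known truth, one option is to generate data from a simple longitudinal model first. The sketch below simulates binary responses from a 2PL measurement model combined with a linear growth curve. It is a minimal illustration under stated assumptions: the linear growth form, the growth means and covariance, the residual standard deviation, and the item parameter distributions are all hypothetical choices, and the function name `simulate_luirt` is not part of any package.

```python
import numpy as np


def simulate_luirt(n_persons, n_items, time_scores, seed=0):
    """Simulate 2PL responses whose latent trait follows a linear growth curve."""
    rng = np.random.default_rng(seed)
    time_scores = np.asarray(time_scores, dtype=float)

    # Person-level growth factors: column 0 = intercept, column 1 = slope.
    # Means and covariance are assumed values chosen for illustration.
    growth_mean = np.array([0.0, 0.5])
    growth_cov = np.array([[1.0, 0.1],
                           [0.1, 0.2]])
    growth = rng.multivariate_normal(growth_mean, growth_cov, size=n_persons)

    # theta[i, t] = intercept_i + slope_i * time_t + occasion-specific residual.
    residual = rng.normal(0.0, 0.3, size=(n_persons, time_scores.size))
    theta = growth[:, [0]] + growth[:, [1]] * time_scores + residual

    # Item parameters are held constant over time (invariance is assumed).
    a = rng.uniform(0.8, 2.0, size=n_items)   # discriminations
    b = rng.normal(0.0, 1.0, size=n_items)    # difficulties

    # 2PL response probability for every person, occasion, and item.
    logits = a[None, None, :] * (theta[:, :, None] - b[None, None, :])
    prob = 1.0 / (1.0 + np.exp(-logits))
    responses = (rng.uniform(size=prob.shape) < prob).astype(int)
    return responses, theta, growth


responses, theta, growth = simulate_luirt(500, 10, [0, 1, 2, 3])
print(responses.shape)  # persons x occasions x items
```

Fitting the simulated responses with the intended software and estimator, and comparing the estimates with the generating values, gives a quick check that the command file specifies the model one has in mind.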
Although this didactic offers sufficient technical details for three popular L-IRT models for researchers and practitioners to use those models in their own research, two relevant topics were outside the scope of the current discussion. First, LGC models with intrinsically nonlinear growth patterns were not discussed because this family of models is not currently included in a majority of software packages for LGC model estimation. An example of this kind of model is a “piece-wise growth curve model with unknown knots” (e.g., Kohli, Hughes, Wang, Zopluoglu, & Davison, 2015). Second, we have not discussed how to evaluate global model fit. Although most SEM software packages will output one or multiple absolute fit indices, few studies have examined appropriate cutoffs for these indices in determining adequate fit. Moreover, the DIC that is often used with MCMC can take different forms. The first-level conditional DIC provided by WinBUGS may not always provide the best estimates of model fit, whereas a second-level joint DIC might be more appropriate for multilevel IRT models (Zhang, Tao, & Wang, 2019). A thorough examination of model fit for L-IRT models is needed to ensure that credible conclusions are drawn from any model-based results.
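As a reminder of what a DIC computation involves, the sketch below applies the standard definition, DIC = D̄ + p_D with p_D = D̄ − D(θ̄), where D̄ is the posterior mean deviance and D(θ̄) is the deviance at the posterior mean of the parameters. The function and argument names are hypothetical. The function is agnostic about which likelihood produced the deviance values, so, on the assumption that the conditional and joint forms differ in that likelihood, the same arithmetic would apply to either once the corresponding deviances are supplied.

```python
from statistics import fmean


def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws.

    deviance_draws: deviance (-2 * log-likelihood) at each retained MCMC draw.
    deviance_at_posterior_mean: deviance evaluated at the posterior mean
        of the parameters.
    """
    mean_deviance = fmean(deviance_draws)
    # Effective number of parameters.
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d


# Toy numbers for illustration only; they do not come from any fitted model.
dic_value, p_d = dic([100.0, 102.0, 104.0], 99.0)
print(dic_value, p_d)
```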
Supplemental Material
Supplemental Material, Longitudinal_IRT_Comparison_-_MPlus_Code - On Longitudinal Item Response Theory Models: A Didactic
Supplemental Material, Longitudinal_IRT_Comparison_-_MPlus_Code for On Longitudinal Item Response Theory Models: A Didactic by Chun Wang and Steven W. Nydick in Journal of Educational and Behavioral Statistics
Supplemental Material
Supplemental Material, Longitudinal_IRT_Comparison_Simulation_Appendix - On Longitudinal Item Response Theory Models: A Didactic
Supplemental Material, Longitudinal_IRT_Comparison_Simulation_Appendix for On Longitudinal Item Response Theory Models: A Didactic by Chun Wang and Steven W. Nydick in Journal of Educational and Behavioral Statistics
Footnotes
Authors’ Note
The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D170042 (or R305D160010) awarded to the University of Washington and the Spencer/National Academy of Education Post-doc Fellowship (2014).
