Abstract
Many latent traits in the human sciences have a hierarchical structure. This study aimed to develop a new class of higher order item response theory models for hierarchical latent traits that are flexible in accommodating both dichotomous and polytomous items, to estimate both item and person parameters jointly, to allow users to specify customized item response functions, and to go beyond two orders of latent traits and the linear relationship between latent traits. Parameters of the new class of models can be estimated using the Bayesian approach with Markov chain Monte Carlo methods. Through a series of simulations, the authors demonstrated that the parameters in the new class of models can be well recovered with the computer software WinBUGS, and the joint estimation approach was more efficient than multistaged or consecutive approaches. Two empirical examples of achievement and personality assessments were given to demonstrate applications and implications of the new models.
Educational tests and psychological inventories have been developed to measure latent traits. It is common for latent traits in the human sciences to have a hierarchy. For example, a hierarchical structure has been posited in mental abilities, with g (general ability) at the top of the hierarchy and other broad abilities (e.g., mathematical, spatial-mechanical, and verbal reasoning) underneath. Personality, measured by the Revised NEO Personality Inventory (Costa & McCrae, 1992), is another example of hierarchical latent traits. It is composed of five broad factors: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, with each broad factor containing six domains. The domains can be viewed as the first-order latent traits, and the broad factors can be viewed as the second-order latent traits. In the framework of linear factor analysis, models for such hierarchical latent traits are referred to as second- or higher-order factor analysis.
Take the simplest case of two-order hierarchy as an example, where each of the first-order latent traits is measured by a test; these first-order latent traits are governed by the same second-order latent trait. When item response theory (IRT) models are fit to such data, at least five approaches can be taken, as depicted in Figures 1a through 1e, respectively. First, one can fit a unidimensional IRT model to each test consecutively (Figure 1a), which is referred to as the consecutive unidimensional approach. In doing so, a person measure for each of the first-order latent traits is available, whereas that for the second-order is not directly attainable. In addition, this approach, ignoring the association among the first-order latent traits, will not yield precise person measures if the tests are not long. Direct estimation for the association among the first-order latent traits is not feasible. Second, one can fit a between-item multidimensional IRT model (Adams, Wilson, & Wang, 1997) to all tests (Figure 1b), which is referred to as the multidimensional approach. This approach takes into account the association among the first-order latent traits so that the precision of person measures in each of the first-order latent traits can be improved, and the association among the first-order latent traits can be estimated directly (Wang, Chen, & Cheng, 2004). Unfortunately, person measures for the second-order latent trait are still not attainable using the multidimensional approach.

Five approaches to data with hierarchal latent traits: (a) consecutive unidimensional, (b) multidimensional, (c) composite unidimensional, (d) bi-factor, (e) higher order.
Third, one can fit a unidimensional IRT model to all tests by treating all items across tests as measuring the same latent trait (Figure 1c); this is referred to as the composite unidimensional approach. In doing so, person measures for the “overall” latent trait appear to be available, whereas those for the first-order latent traits are not attainable. In fact, treating all tests as measuring the same latent trait violates the theory of hierarchical latent traits. Fourth, one can fit a bi-factor IRT model (Gibbons & Hedeker, 1992) to all tests (Figure 1d), in which each item is assumed to measure a common latent trait and a latent trait specific to each test. In most cases, the common latent trait is the major concern, and the specific latent traits are regarded as nuisance. The bi-factor has a very important application in testlets, in which items are connected with a common stimulus. Several testlet response models were developed, in which item responses within a testlet are assumed to be affected by a common latent trait throughout the test and a latent trait specific to each testlet (Wainer, Bradlow, & Wang, 2007).
Fifth, one can fit a higher-order IRT model to all tests (Figure 1e) in which items in the same test are assumed to measure a first-order latent trait and in which these domain latent traits are governed by a second-order latent trait. Among these five approaches, only the high-order approach posits an explicit hierarchical structure in latent traits. Although the bi-factor approach is mathematically equivalent to the higher-order approach when there are two orders of latent traits (Yung, Thissen, & McLeod, 1999), the higher-order approach is more straightforward in specifying the relationship among latent traits and more general in accommodating three or more orders of latent traits.
Past studies have shed light on the development of the higher-order IRT (denoted as HIRT). de la Torre and Douglas (2004) proposed a higher-order latent trait model in the context of cognitive diagnosis, where the bottom of the hierarchy belonged to latent binary variables (i.e., mastery or not mastery). Sheng and Wikle (2008) proposed a multidimensional normal ogive IRT model for dichotomous items. de la Torre and Song (2009) proposed a high-order IRT model in which the item response function follows the three-parameter logistic model (3PLM). In their simulations of parameter recovery, de la Torre and Song assumed that item parameters were known and then evaluated recovery of person measures and the factor loadings. This “known item parameters” assumption may not hold in practice. de la Torre and Hong (2010) demonstrated the superiority of parameter estimations in higher-order IRT models over convenient nonhierarchical IRT models in a small sample size, and estimated both item and person parameters simultaneously. de la Torre, Song, and Hong (2011) compared four methods of IRT scoring approaches (including the higher-order approach) and observed that the correlation among the first-order latent traits was an important factor for the estimation of person measures in the first-order latent traits: The higher the correlation, the more precise the estimation. Their simulation study focused on the estimation of the first-order latent traits only and paid little attention to the joint estimation of the second-order latent trait.
Inspired by these research works, in this study, the authors aimed to develop a general class of HIRT models where both item and person parameters are estimated jointly, item response functions can be flexibly defined by users, more than two orders of hierarchical latent traits can be posited, and a nonlinear relationship between latent traits is allowed. This general class of HIRT models not only integrates all the existing HIRT models but also provides new directions for future research.
In this article, the new class of HIRT models is introduced, and the model identification issue is described. The Markov chain Monte Carlo (MCMC) method for parameter estimation is addressed. Three simulation studies were conducted to assess the parameter recovery and the advantage of using the new class of HIRT models to estimate person measures over using traditional methods, and their results were summarized. Two empirical examples for achievement assessment and personality assessment are presented. Finally, conclusions about the new models are drawn, and suggestions for future study are provided.
The HIRT Approach to Hierarchical Latent Traits
Assume the latent traits of interest have a higher-order hierarchy. Each of the first-order latent traits is measured by a set of unidimensional items. Let
where
which is commonly used in the HIRT literature.
For illustrative purposes, in the following paragraph, Equation 2, rather than Equation 1, is used to describe the new class of HIRT models. Among dichotomous items measuring the first-order latent traits, one may assume that their item response functions follow the three-parameter logistical model (3PLM):
where Pni1v is the probability of scoring 1 in item i of test v for person n, α
iv
is the slope (discrimination) parameter, δ
iv
is the location (difficulty) parameter, and
The first-order latent traits (
When HIRT models are to be applied to polytomous items, one may assume that item response functions follow the generalized partial credit model (GPCM):
where Pnijv and Pni(j-1)v are the probabilities of scoring j and j− 1 in item i of test v for person n, respectively; δ ijv is the jth step parameter of item i in test v, and after reparameterization, δ iv is the overall difficulty of item i in test v, τ ijv is the jth threshold parameter in item i of test v, and the others are defined as above. Combining Equations 2 and 5 leads to
If item response functions follow the partial credit model (PCM), then α iv in Equation 6 becomes one. If item response functions follow the rating scale model (RSM), then α iv in Equation 6 becomes one and τ ijv = τ jv .
If item response functions follow the graded response model (GRM), then one has
where
In addition to the preceding common IRT models, any item response function is possible, such as the linear logistic test model (Fischer, 1983), faceted model (Linacre, 1989), testlet response models (Wainer et al., 2007), models that were developed in the edited book by De Boeck and Wilson (2004), IRT models for ability-based guessing (San Martin, del Pino, & De Boeck, 2006), and unfolding IRT models (Roberts, Donoghue, & Laughlin, 2000). The formulation of other HIRT models is straightforward. All one needs to do is to specify an item response function of interest at the first order and then combine it with Equation 1 or 2 to form a HIRT model. The great flexibility in accommodating different IRT models becomes ever more noticeable when users wish to (a) fit different numbers of orders for different latent traits, for example, some latent traits have two orders and some have three orders; (b) fit different IRT models to different tests, for example, the 3PLM is fit to Test 1, the ability-based guessing model (San Martin et al., 2006) is fit to Test 2, and the testlet response model is fit to Test 3; and (c) fit different IRT models to different items within a test, for example, Items 1 to 10 follow the 1PLM, Items 11 to 15 follow the PCM, and Items 15 to 20 follow the RSM. When necessary, users can specify their own customized item response functions. Due to space constraints, the authors were not able to exhaust all possible IRT models; rather, they chose to focus on common ones, like the 1PLM, 2PLM, and 3PLM for dichotomous items, and the PCM, RSM, GPCM, and GRM for polytomous items in this study.
For model identification, Sheng and Wikle (2008) suggested
In Equation 2, if β v is 1 for all tests (thus the error variance is zero for all tests), then HIRT models become single-order and unidimensional because all items across tests measure the same latent trait. If β v is very close (but not equal) to 1 for all tests, indicating a very high correlation between the second-order latent trait and the first-order latent traits and thus a very high correlation between the first-order latent traits, then one may argue that a single latent trait underlies all items across tests. That is, the model becomes essentially unidimensional. In such a case, the HIRT approach (Figure 1d) and the composite unidimensional approach (Figure 1c) will yield very similar person measures for the second-order latent trait. In contrast, if some of β v are not close to 1, indicating that the correlations between the second-order latent trait and the first-order latent traits are not all very high, then the HIRT approach is appropriate, but the composite unidimensional approach will overestimate the measurement precision of the second-order latent trait (de la Torre & Song, 2009; Sheng & Wikle, 2008).
Note that a high correlation between latent traits does not necessarily mean that they are the same latent trait. For example, although weight and height are highly correlated, they are very different attributes. HIRT models allow one to test the theory of hierarchal latent traits whereas nonhierarchical IRT models do not. Besides, information about individual first-order latent traits is not attainable when the composite unidimensional approach is used.
However, if β v is zero for all tests (thus the error variance is 1 for all tests), then the HIRT model becomes single-order multidimensional where the latent traits are uncorrelated. When HIRT models are fit to data generated from IRT models with uncorrelated latent traits, one can expect that the estimates for β v will be very close to zero and that those for the error variance will be very close to 1.
MCMC Estimation With WinBUGS
HIRT models involve high dimensionality. Traditional IRT estimation methods, like marginal maximum likelihood estimation, become inefficient. Although computer programs like Mplus and LISREL can be used to estimate parameters in some HIRT models, they have limitations. First, LISREL uses multistaged methods to estimate parameters, which is not statistically optimal because it does not consider full information about individuals. Second, Mplus, although it can unitize full information, is not flexible enough to accommodate many IRT models. For ordered categorical data, Mplus adopts the cumulated logit models so that the GRM and its special cases of the 1PLM and 2PLM are feasible. The GPCM, PCM, and RSM, however, are not feasible because they belong to the adjacent-categories logit models. More complicated IRT models, such as the 3PLM and the ability-based guessing model, are not feasible with Mplus either. In contrast, the free WinBUGS (Spiegelhalter, Thomas, & Best, 2003) utilizes full information and is very general and flexible to accommodate many (if not all) existing and customized IRT models.
Bayesian estimation with MCMC methods, implemented in WinBUGS, has been applied to IRT in recent years. In Bayesian estimation, specifications of a statistical model, prior distributions of model parameters, and observed data produce a joint posterior distribution for the model parameters. Because it is difficult to directly work on the joint posterior distribution, MCMC methods are used to simulate the joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior parameters of interest.
There are two major procedures to construct a Markov chain through transition density to target density: the Metropolis-Hastings and the Gibbs sampling algorithms. In the Metropolis-Hastings algorithm, whether the drawing parameters (a candidate state) will be updated at the next chain depends on the probability of move, which is the ratio of the posterior density under the candidate state over that under the current state. If the ratio is higher, the candidate state will have a higher probability of being the next stage of the chain; however, the previous state will be retained with a lower ratio (Bolt & Lall, 2003). The Gibbs sampling algorithms involve cycling through the model parameters one at a time using the current estimate of the univariate conditional posterior probability distribution as the proposal distribution, which reduces multidimensional problems to a series of univariate calculations.
Three simulation studies were conducted to evaluate the parameter recovery of the second-order and third-order IRT models, and to reveal the advantages of using the HIRT over multistaged approaches to the estimation of latent traits. Specifically, Study 1 focused on the parameter recovery for the second-order IRT models for dichotomous and polytomous items. Study 2 focused on estimation methods for the first-order and second-order latent traits. Study 3 focused on the parameter recovery for the third-order IRT models for dichotomous and polytomous items with a simple or complex structure of latent traits.
Simulation Study 1: Parameter Recovery of Second-Order IRT Models
Simulation Design
There were three first-order latent traits and one second-order latent trait. Two kinds of generating models were used, including high factor loadings of .9, .8, and .7 and diverse factor loadings of .9, .6, and .3. The high factor loadings represented a typical case where the second-order latent trait had large loading on all first-order latent traits, whereas the diverse factor loadings represented an extreme case where the second-order latent trait had a large loading on a single first-order latent trait. In the literature of higher-order linear factor analysis, many practical cases fall between the high factor loadings and diverse factor loadings (e.g., Blackburn, Renwick, Donnelly, & Logan, 2004; Golay & Lecerf, 2011).
The underlying item response functions were the 1PLM, 2PLM, and 3PLM for dichotomous items and the GRM, GPCM, PCM, and RSM for polytomous items. All generated item responses were analyzed with the HIRT approach where the item response functions matched the generating item response functions. For example, if data were generated with the 1PLM, then they were analyzed with the 1PLM, and if data were generated with the GPCM, then they were analyzed with the GPCM. Each data set consisted of 1,000 persons and three tests, each test having 20 dichotomous items or 20 four-point polytomous items. When the underlying item response function was the 3PLM, the sample size was increased from 1,000 persons to 5,000 persons, because the 3PLM requires a larger sample size to reach the acceptable precision of estimation than do the other simpler IRT models. For the 60 dichotomous items, their difficulty parameters were generated from a uniform distribution U(−2, 2), their discrimination parameters were generated from N(1, 0.25) and were truncated above 0, and their pseudo-guessing parameters were generated from a uniform distribution U(0, 0.3). For the 60 four-point polytomous items, the step parameters were generated from a uniform distribution U(−2.5, 2.5) for the GPCM and the PCM, and the overall difficulty parameters were generated from a uniform distribution U(−1.5, 1.5) with −1, 0, and 1 as the category threshold parameters, respectively, for both the GRM and the RSM. The specifications of item parameters were in line with those commonly found in practice. The item parameters were generated independently because they were in line with the IRT literature. Where appropriate, item parameters can be correlated (Frederickx, Tuerlinckx, De Boeck, & Magis, 2010; van der Linden, Klein Entink, & Fox, 2010).
A Matlab computer program was written by the authors to generate item responses. A total of 20 replications were conducted in each condition. The number of replications was limited to 20 because there were a large number of conditions, and each replication took dozens of hours of computer time; in addition, the empirical sampling variations across replications were rather small, indicating that the results were very stable across replications. Furthermore, many IRT studies that used WinBUGS for simulation conducted 10 or fewer replications (Bolt & Lall, 2003; Cao & Stokes, 2008).
Analysis
All Markov chains were simulated using the computer program WinBUGS 1.4 through the Metropolis-within-Gibbs algorithm. A normal prior with mean 0 and variance 4 was set for all the location and threshold parameters, a lognormal prior with mean 0 and variance 1 was set for the discrimination parameters, a beta prior with both hyperparameters equal to 1 was set for the pseudo-guessing parameters, and a normal prior with mean 0.5 and variance 10 with a truncated distribution between 0 and 1 was set for the factor loadings. After monitoring the convergence diagnostic according to the multivariate potential scale reduction factor (Brooks & Gelman, 1998) with three parallel chains for the first simulated data set across all analysis models, the authors observed that the chain lengths were sufficient to reach stationarity for all structural parameters, where 10,000 iterations were obtained with the first 1,000 iterations as burn in, because all the multivariate potential scale reduction factors were less than 1.1. For each estimator, the authors computed the bias and the root mean square error (RMSE). It was hypothesized that the parameters could be well recovered.
Results
Let 1P-HIRT, 2P-HIRT, and 3P-HIRT denote the HIRT model with item response functions of the 1PLM, 2PLM, and 3PLM, respectively. As shown in Table 1, the mean RMSE for the difficulty parameters was 0.074 for the 1P-HIRT with high factor loadings (i.e., .9, .8, and .7), 0.072 for the 1P-HIRT with diverse factor loadings (i.e., .9, .6 and .3), 0.079 for the 2P-HIRT with high factor loadings, 0.080 for the 2P-HIRT with diverse factor loadings, 0.282 for the 3P-HIRT with high factor loadings, and 0.285 for the 3P-HIRT with diverse factor loadings. The mean RMSE for the slope parameters was 0.014 for the 1P-HIRT with high factor loadings, 0.015 for the 1P-HIRT with diverse factor loadings, 0.093 for the 2P-HIRT with high factor loadings, 0.097 for the 2P-HIRT with diverse factor loadings, 0.143 for the 3P-HIRT with high factor loadings, and 0.147 for the 3P-HIRT with diverse factor loadings. The mean RMSE for the pseudo-guessing parameters was 0.089 for the 3P-HIRT with high factor loadings, and 0.091 for the 3P-HIRT with diverse factor loadings. Considering the RMSE for the difficulty parameters, one found that the parameter recovery under the 1P-HIRT and 2P-HIRT was much better than that for the 3P-HIRT. The difficulty in recovering the parameters of the 3P-HIRT was also observed by de la Torre and Hong (2010). The recovery for the factor loadings was very good for the 1P-HIRT, 2P-HIRT, and 3P-HIRT, with a mean RMSE of 0.035.
Summary of Parameter Recovery of Dichotomous Items in Simulation Study 1.
Note: 1P = one-parameter logistic; 2P = two-parameter logistic; 3P = three-parameter logistic; betas are factor loadings; HIRT = higher-order item response theory; RMSE = root mean square error; — = not applicable.
Let PC-HIRT, GPC-HIRT, RS-HIRT, and GR-HIRT denote the HIRT model with item response functions of the GPCM, PCM, RSM, and GRM, respectively. The detailed results are not shown but are available on request. The mean RMSE for the location parameters was 0.112 for the PC-HIRT with high factor loadings, 0.113 for the PC-HIRT with diverse factor loadings, 0.130 for the GPC-HIRT with high factor loadings, 0.127 for the GPC-HIRT with diverse factor loadings, 0.057 for the RS-HIRT with high factor loadings, 0.054 for the RS-HIRT with diverse factor loadings, 0.056 for the GR-HIRT with high factor loadings, and 0.056 for the GR-HIRT with diverse factor loadings. The mean RMSE for the slope parameters was 0.034 for the PC-HIRT with high factor loadings, 0.026 for the PC-HIRT with diverse factor loadings, 0.074 for the GPC-HIRT with high factor loadings, 0.076 for the GPC-HIRT with diverse factor loadings, 0.039 for the RS-HIRT with high factor loadings, 0.022 for the RS-HIRT with diverse factor loadings, 0.060 for the GR-HIRT with high factor loadings, and 0.060 for the GR-HIRT with diverse factor loadings. Given that the RMSE values were rather small, the parameter recovery for the polytomous items under the PC-HIRT, GPC-HIRT, RS-HIRT, and GR-HIRT was satisfactory. With respect to recovery for the factor loadings, it was found that the mean RMSE was as small as 0.038. Hence, the parameter recovery was also very satisfactory.
To reveal the consequences of ignoring the hierarchy in the latent traits on item parameter estimation, the authors deliberately adopted the consecutive approach (Figure 1a) by fitting the 3PLM to the 3P-HIRT data, one test at a time. The results showed that the RMSE was between 0.059 and 0.991 (M = 0.317) for the difficulty parameters with high factor loadings and between 0.090 and 1.137 (M = 0.312) with diverse factor loadings, between 0.050 and 0.349 (M = 0.159) for the slope parameters with high factor loadings and between 0.076 and 0.252 (M = 0.155) with diverse factor loadings, and between 0.018 and 0.360 (M = 0.097) for the pseudo-guessing parameters with high factor loadings and between 0.015 and 0.374 (M = 0.100) with diverse factor loadings. It appeared that the RMSE values were slightly larger than those when the data-generating 3P-HIRT was fit, suggesting that the ignorance of the hierarchy led to worse item parameter estimation. However, the consequence was not very serious because the sample size was large and the item parameter estimation relied majorly on the item responses rather than on the hierarchy in the latent traits. If the sample size was smaller, the consequence would be more serious, as found by de la Torre and Hong (2010). The same findings applied to the consecutive 1PLM and 2PLM.
The GPCM was also deliberately fit to the GPC-HIRT data, one test at a time, to reveal the consequences of ignoring the hierarchy in the latent traits on item parameter estimation. The results showed that the RMSE was between 0.036 and 0.382 (M = 0.125) for the location parameters with high factor loadings and between 0.044 and 0.356 (M = 0.125) with diverse factor loadings, and between 0.028 and 0.132 (M = 0.074) for the slope parameters with high factor loadings and between 0.038 and 0.152 (M = 0.073) with diverse factor loadings. The RMSE values were slightly worse than those when the data-generating GPC-HIRT was fit. The same findings applied to the consecutive PCM, RSM, and GRM.
Simulation Study 2: Estimation of the Latent Traits
Design and Analysis
If HIRT models are not available, how can one obtain person measures for the second-order latent traits? Taking the setting of Simulation Study 1 as an example, there were three first-order and one second-order latent traits. If only nonhierarchical IRT models are available, then one might take the consecutive approach to obtain person measures for each of the first-order latent traits, and then apply a one-factor confirmatory analysis (CFA) to these person measures to obtain an estimate for the second-order latent trait (denoted as IRT-CFA). Specifically, LISREL was used to conduct the one-factor CFA on person measures that were obtained from the consecutive approach. The command SIMPLIS was used to obtain factor scores for individuals, which were the estimates for the second-order latent trait (Du Toit & Du Toit, 2001, pp. 250-254). Alternatively, one might simply average these person measures across tests to serve as an estimate for the second-order latent trait (denoted as AVERAGE).
The first part of Simulation Study 2 was conducted to reveal the advantage in measurement precision of the single-staged HIRT approach over the multistaged IRT-CFA and AVERAGE approaches. As demonstrated in the literature, multistaged approaches produce biased results and are less efficient than the single-staged approach (Little & Rubin, 1983; Mislevy, 1984). The simulation design of Simulation Study 2 was identical to that in Simulation Study 1. To compare the relative efficiency of the three approaches (HIRT, IRT-CFA, and AVERAGE) in parameter estimation for the second-order latent trait, the authors computed the relative efficiency of the HIRT approach over another approach A (i.e., IRT-CFA or AVERAGE) as
where MSR was the mean square residuals in person measures of the second-order latent trait across 20 replications:
N was the sample size, and R was the number of replications (20). A relative efficiency greater than 1 reveals the superiority of the HIRT approach in estimating the second-order latent trait. It was expected that the HIRT approach would be the most efficient and the AVERAGE would be the least efficient, especially when the factor loadings were diverse.
In addition to the relative efficiency on the second-order latent trait, in the second part of Simulation Study 2, the authors computed the relative efficiency of the HIRT approach over the consecutive approach in the estimation for the three first-order latent traits. After the burn-in period, they took 1,000 MCMC samples from the remaining 9,000 iterations with a thin factor of 9. In each of the MCMC samples, the estimate for the second-order latent trait, the residuals of the first-order latent traits, and the regression weights were readily available, so that the estimates for the first-order latent traits can be computed with Equation 2. The means and the standard deviations of these estimates across the 1,000 MCMC samples were then treated as the expected a posteriori (EAP) estimates and the standard errors for the first-order latent traits.
Results
When the factor loadings were high, the mean root mean square residuals (RMSR) for the person measures of the second-order latent trait were between 0.407 and 0.527 for the HIRT approach, between 0.415 and 0.548 for the IRT-CFA approach, and between 0.442 and 0.568 for the AVERAGE approach. The relative efficiency was between 1.032 and 1.082 for the HIRT approach over the IRT-CFA approach, and between 1.179 and 1.141 for the HIRT approach over the AVERAGE approach. When the factor loadings were diverse, the mean RMSE was between 0.487 and 0.621 for the HIRT approach, between 0.503 and 0.655 for the IRT-CFA approach, and between 0.624 and 0.707 for the AVERAGE approach. The relative efficiency was between 1.037 and 1.113 for the HIRT approach over the IRT-CFA approach and between 1.299 and 1.641 for the HIRT approach over the AVERAGE approach. In summary, the HIRT approach yielded the smallest mean RMSR (the most efficient), whereas the AVERAGE approach yielded the largest mean RMSR (the least efficient), and high factor loadings had a smaller mean RMSR than diverse factor loadings.
Next, consider the RMSR for the person measures of the three first-order latent traits. When the factor loadings were high, the relative efficiency was between 1.06 and 1.31 in the 3PLM condition and between 1.05 and 1.14 in the GPCM condition. When the factor loadings were diverse, the relative efficiency was between 0.96 and 1.02 in the 3PLM condition, and between 0.96 and 1.03 in the GPCM condition. As consistent with those found by de la Torre et al. (2011), the higher the correlation among the first-order latent traits, the more advantageous the HIRT approach over the consecutive approach in the estimation of the first-order latent traits. It could be inferred, as well as supported by the simulation study of de la Torre et al., that the advantage of the HIRT approach over the consecutive approach would be more significant when there are more tests and the test lengths are short.
Simulation Study 3: Parameter Recovery of Third-Order IRT Models
Design and Analysis
For simplicity, there was one third-order latent trait, which governed three second-order latent traits, and each second-order latent trait governed two first-order latent traits, resulting in a total of six first-order latent traits. Each first-order latent trait was measured by either 20 dichotomous items following the 2PLM or 10 five-point items following the GPCM. The sample size was 1,000. These latent traits had a simple or complex structure. In the simple structure, every lower-order latent trait was governed by one higher-order latent trait, as shown in Figure 2a. In the complex structure, a lower-order latent trait could be governed by more than one higher-order latent trait, as shown in Figure 2b, where in this study the first, second, and fifth first-order latent traits were governed by two second-order latent traits. The regression coefficients were set between .4 and 1. When a high-order latent trait governed only two latent traits, one of the two factor loadings was constrained at unity for model identification. The item difficulty and threshold parameters for the 2PLM and GPCM were generated from N(0, 1), truncated between −4 and 4, and the item slope parameters were generated from N(1, 0.2). After the data was generated, the true model was fit to the data by using WinBUGS. A total of 10 replications were made. The bias and RMSE were computed to evaluate the parameter recovery.

Third-order item response models used in Simulation Study 3: (a) simple structure, (b) complex structure.
Results
Table 2 summarizes the parameter recovery of the four models: 2P-HIRT with a simple structure, 2P-HIRT with a complex structure, GPC-HIRT with a simple structure, and GPC-HIRT with a complex structure. Generally speaking, the magnitudes of the bias and RMSE were small and comparable with those in the second-order models in Simulation Study 1. In addition, the parameter recovery for the complex structure models was very similar to that for the simple structure models. In short, the parameter recovery was satisfactory.
Summary of Parameter Recovery for the Third-Order Models in Simulation Study 3.
Note: 2P = two-parameter logistic; HIRT = higher-order item response theory; GPC = generalized partial credit; S = simple structure; C = complex structure; RMSE = root mean square error.
Two Empirical Examples
Example 1: Achievement Assessment
In Taiwan, junior high school graduates who wish to enter senior high schools should take the Basic Competence Tests for Junior High School Students (BCT). These tests consist of five subjects: language, mathematics, English, social sciences, and nature sciences. In the 2006 BCT, there were 48 items in language, 33 items in mathematics, 45 items in English, 63 items in social sciences, and 58 items in nature sciences, and all items were in the multiple-choice format. Each of the five tests was treated as measuring a first-order latent trait, and the five latent traits were governed by a second-order latent trait of “overall academic ability.” A total of 5,000 examinees (2,639 boys and 2361 girls) were randomly selected from a population of more than 300,000 examinees.
The authors were particularly interested in two questions. First, were the discrimination and pseudo-guessing parameters necessary? The 1P-HIRT, the 2P-HIRT, and the 3P-HIRT were thus fit to the data. Second, was the relationship between the first- and second-order latent traits linear or nonlinear? A linear HIRT model and a quadratic HIRT model were thus fit to the data. According to the Bayesian deviance information criterion (DIC; Spiegelhalter et al., 2003), the 3P-HIRT yielded a better fit to the data than the 1P-HIRT and the 2P-HIRT. Thus, the pseudo-guessing parameters and discrimination parameters were useful. According to the DIC, the linear 3P-HIRT was preferred to the quadratic 3P-HIRT, suggesting that the relationship between the first- and second-order latent traits was linear. The posterior predictive model checking on the Bayesian chi-square statistic yielded a p value of .048, suggesting that the model-data fit of the linear 3P-HIRT was not bad. Under this model, the item parameter estimates were between 0.61 and 7.21 (M = 2.83) for the discrimination parameters, between −3.97 and 4.35 (M = −0.35) for the difficulty parameters, and between 0.02 and 0.38 (M = 0.21) for the pseudo-guessing parameters. Note that a discrimination parameter estimate as high as 7.21 was unusual and deserved further investigation. The factor loadings were .87, .83, .86, .90, and .92 for language, English, mathematics, social sciences, and nature sciences, respectively.
Treating the EAP estimates of the second-order latent trait derived from the HIRT approach as a gold standard, one found that those estimates derived from the IRT-CFA and AVERAGE approaches were shrunken, although highly correlated with those derived from the HIRT approach (rank correlation coefficient = .99). Such a high rank order correlation in person measures across methods might lead one to overlook the consequences of applying the IRT-CFA and AVERAGE approaches to obtain person measures for the second-order latent trait. For higher-stakes examinations like the BCT, rank ordering in person measures is all that matters, because examinees are admitted according to their rank orders. Actually, the rank order changes in person measures between these approaches were very substantial. The rank order changes between the HIRT and the IRT-CFA approaches (in absolute value) had a mean of 47.70, a standard deviation of 40.94, and a maximum of 405. Those between the HIRT and the AVERAGE approaches had a mean of 53.42, a standard deviation of 46.61, and a maximum of 466. For this high-stakes entrance examination, a maximum rank order change of 405 or 466 in 5,000 examinees, which corresponded to a maximum rank order change of 24,300 or 27,960 in 300,000 examinees, was certainly influential. It appeared that the AVERAGE approach was only slightly worse than the IRT-CFA approach, which was mainly because the five factor loadings were very similar (.83 ~ .92).
For the person measures of the first-order latent traits, the EAP estimates directly obtained from the HIRT approach were treated as the gold standard, with which those EAP estimates from the consecutive approach were compared. It was found that the rank order changes between the HIRT approach and the consecutive approach (in absolute value) had a mean of 170.95, a standard deviation of 145.08, and a maximum of 1,008 for language; a mean of 161.93, a standard deviation of 133.72, and a maximum of 1,174 for mathematics; a mean of 134.65, a standard deviation of 163.56, and a maximum of 1,389 for English; a mean of 129.75, a standard deviation of 107.10, and a maximum of 1,002 for social sciences; and a mean of 138.97, a standard deviation of 120.18, and a maximum of 1,006 for nature sciences. Apparently, the consecutive approach was not a good method to obtain the person measures for the first-order latent traits because it ignored the hierarchical structure of latent traits, as also demonstrated in Simulation Study 2 and by de la Torre et al. (2011).
Because the quadratic 3P-HIRT was fit to the data, a 10-replication simulation on the model was done to evaluate its parameter recovery. The generating item parameters were the same as those in Simulation Study 1, except that all the pseudo-guessing parameters were set at 0.25. There were five tests and 2,000 persons. The factor loadings were adopted from the nonlinear factor analysis example of Mplus (ex5.7), between 0.97 and 1.12 for the linear factor loadings, and between −0.25 and −0.19 for the quadratic factor loadings. The simulation results showed that the RMSE was between 0.053 and 0.414 (M = 0.184) for the difficulty parameters, between 0.033 and 0.255 (M = 0.125) for the slope parameters, 0.010 for the pseudo-guessing parameters, between 0.103 and 0.142 (M = 0.126) for the linear factor loadings, and between 0.151 and 0.218 (M = 0.170) for the quadratic factor loadings. These RMSE values were comparable with those in Simulation Study 1, suggesting that the parameter recovery of the quadratic 3P-HIRT was satisfactory.
Example 2: Personality Assessment
Internet addiction and Internet hostility are two typical behaviors of pathological Internet use. Depression and cognitive distortions are two potential factors that facilitate pathological Internet use (Davis, 2001). Lin and Liu (2003) used the Internet Hostility Questionnaire (25 four-point items), Chinese Internet Addiction Scale (26 four-point items), Internet-Related Depression Inventory (21 four-point items), and Questionnaire of Cognitive Distortions (four domains including social benefit, realistic distortion, time/space distortion, and social control, each having 4 four-point items) to assess these first-order latent traits, respectively. There were a total of seven first-order latent traits: Internet addiction, Internet hostility, depression, and the four domains of cognitive distortions. A total of 987 Taiwan Internet users (625 male and 362 female college students) were recruited. Two kinds of higher-order structures for pathological Internet use were proposed: (a) all the seven first-order latent traits constituted one second-order latent trait and (b) the seven first-order latent traits constituted two corrected second-order latent traits. In (a), the second-order latent trait can be referred to as “pathological Internet use.” In (b), Internet addiction, Internet hostility, and depression constituted a second-order latent trait called “pathological Internet use,” and the four domains in the Questionnaire of Cognitive Distortions constituted another second-order latent trait called “cognitive distortion.”
The authors were particularly interested in two questions. First, were the discrimination parameters necessary? To answer this question, the PC-HIRT, GPC-HIRT, RS-HIRT, and GR-HIRT were fit to the data. Second, did the seven first-order latent traits constitute one or two second-order latent traits? To answer this question, two models (one second-order latent trait and two second-order latent traits) were fit to the data. According to the DIC values, the GPC-HIRT with two second-order latent traits was the best-fitting model. The Bayesian chi-square statistic yielded a posterior predictive p value of .36, suggesting that the model-data fit of the GPC-HIRT was satisfactory. Under this model, the item parameter estimates were between 0.05 and 4.47 (M = 1.08) for the discrimination parameters, and between −5.16 and 5.93 (M = 0.98) for the location parameters. The correlations between the second-order latent trait “pathological Internet use” and the three first-order latent traits (Internet addiction, depression, and Internet hostility) were .54, .67, and .89, respectively. The correlations between the select-order latent trait “cognitive distortion” and the four first-order latent traits (social benefit, realistic distortion, time/space distortion, and social control) were .96, .97, .49, and .66, respectively. The correlation between the two second-order latent traits was .68.
Conclusion and Discussion
The combination of linear higher-order factor analysis and IRT exhibits the advantages of the higher-order structure for hierarchical latent traits and the nonlinear relationship between categorical item responses and latent traits. Recent research into this area has provided a solid foundation. To further enhance the development, this study proposes a new class of HIRT models, which is general enough to accommodate many (if not all) item response functions, to allow users to specify their own customized models for both dichotomous and polytomous items, and to go beyond two orders of latent traits and the linear relationship between latent traits. Parameters in the new class of models can be estimated fairly well by using WinBUGS, as demonstrated in Simulation Studies 1 and 3. The relation between a lower- and a higher-order latent trait can be nonlinear, and a lower-order latent trait can be governed by more than one higher-order latent trait (i.e., complex structure). Although the second-order latent trait can be estimated indirectly with the multistaged IRT-CFA or AVERAGE approach, these two approaches are less efficient than the direct estimation in the single-staged HIRT approach, and the AVERAGE approach performs even worse in diverse factor loadings, as demonstrated in Simulation Study 2. The new models can be applied to both achievement assessment and personality assessment, as demonstrated in the two empirical examples. Rank order changes reveal practical significance of the HIRT approach over the IRT-CFA and AVERAGE approaches, and consecutive approaches in the estimation for the first- and second-order latent traits.
In HIRT models, each person can obtain a parameter estimate together with a corresponding standard error for each latent trait at any order. Direct estimation of the standard error is another advantage of HIRT models. Such a standard error for the second-order latent trait is not directly attainable in the IRT-CFA or AVERAGE approach because the person measures for the first-order latent traits are treated as true values without any measurement error. Hence, the precision of person measures for the second-order latent trait derived from the IRT-CFA and AVERAGE approaches will be generally overestimated, which is a drawback of using these two methods. In the estimation of the first-order latent traits, the HIRT approach is also more advantageous than the consecutive approach, especially when the first-order latent traits are highly correlated. In addition, as demonstrated by de la Torre et al. (2011), the more the tests and the shorter the test lengths, the more advantageous the HIRT approach over the consecutive approach.
The simulations were conducted on personal computers with 3.6-GHz Intel Pentium IV. It took approximately 40 to 80 hr per replication, depending on model complexity, sample size, and test length. The computation time is feasible for most real data analyses but is not efficient for comprehensive simulations, which was the main reason why only 10 or 20 replications were conducted. To speed up parameter estimation of HIRT models using WinBUGS, one may adopt standard IRT computer programs, such as BILOG-MG, MULTILOG, or PARSCALE, to fit each test at the first order, and then use the parameter estimates as starting values for HIRT models using WinBUGS. Preferably, a stand-alone computer program should be developed for HIRT models.
The simulation design of this study was not comprehensive mainly because of computation burden. Further studies can explore the performances of HIRT models in more complicated conditions, such as multiple third-order latent traits, and different settings of factor loadings, item response functions, test lengths, and sample sizes. The development of model selection indices for HIRT models under the Bayesian framework (e.g., the deviance information coefficient, pseudo-Bayes factor, and posterior predictive model checks) is also an important issue for future studies. A possible further extension of HIRT models is on the person side. In this study, persons are assumed to be randomly selected from the same population. In reality, persons may have a multilevel structure: For example, students are nested in schools. It is desirable to embed the HIRT within a multilevel framework (Fox, 2005) to develop IRT models that can take into account both multilevel structures in persons and latent traits. Finally, a combination of HIRT models (measurement part) and structural models to form structural equation models is straightforward and deserves further investigation.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was sponsored by the National Science Council Taiwan (Grant NSC 98-2410-H-364-014), and the second author was sponsored by General Research Fund, Research Grants Council, Hong Kong (Grant 84511).
