The Accuracy of Bayesian Model Fit Indices in Selecting Among Multidimensional Item Response Theory Models

Abstract

Item response theory (IRT) models are often compared with respect to predictive performance to determine the dimensionality of rating scale data. However, such model comparisons could be biased toward nested-dimensionality IRT models (e.g., the bifactor model) when comparing those models with non-nested-dimensionality IRT models (e.g., a unidimensional or a between-item-dimensionality model). The reason is that, compared with non-nested-dimensionality models, nested-dimensionality models could have a greater propensity to fit data that do not represent a specific dimensional structure. However, it is unclear as to what degree model comparison results are biased toward nested-dimensionality IRT models when the data represent specific dimensional structures and when Bayesian estimation and model comparison indices are used. We conducted a simulation study to add clarity to this issue. We examined the accuracy of four Bayesian predictive performance indices at differentiating among non-nested- and nested-dimensionality IRT models. The deviance information criterion (DIC), a commonly used index to compare Bayesian models, was extremely biased toward nested-dimensionality IRT models, favoring them even when non-nested-dimensionality models were the correct models. The Pareto-smoothed importance sampling approximation of the leave-one-out cross-validation was the least biased, with the Watanabe information criterion and the log-predicted marginal likelihood closely following. The findings demonstrate that nested-dimensionality IRT models are not automatically favored when the data represent specific dimensional structures as long as an appropriate predictive performance index is used.

Keywords

model comparisons, predictive performance, Bayesian, multidimensional IRT, model fit propensity

Different dimensional structures can be represented in item response data gathered from the administration of educational and psychological instruments. The dimensions could have a non-nested form, such as a unidimensional or a between-item-dimensionality structure (Adams et al., 1997), or the dimensions could have a nested structure, such as when secondary (or specific) dimensions are nested within a single primary dimension (i.e., a bifactor structure; Gibbons & Hedeker, 1992; Holzinger & Swineford, 1937) or are nested within multiple primary dimensions (i.e., a two-tier structure; Cai, 2010). Therefore, when examining the dimensional structure of the data during a confirmatory phase, researchers often assess whether the structure has a non-nested or a nested form by comparing various item response theory (IRT) models with respect to the predictive performance of the data.

Recent work suggests that confirming nested-dimensionality structures in data may be challenging, however. Bonifay and Cai (2017) demonstrated that bifactor IRT models and two-dimensional exploratory IRT models have a greater propensity to fit random data when compared with alternative IRT models having equal—or even higher—number of parameters (e.g., a three-parameter logistic IRT model, or 3PL). For example, they showed that, relative to a unidimensional 3PL model, a bifactor model with one fewer parameter was better at explaining simulated data that did not represent any dimensional structures. An implication of their findings is that nested-dimensionality models such as the bifactor model have a greater fit propensity that may bias model comparison results toward those models, thereby misleading researchers to believe that more complex dimensional structures are represented in the data. Recent work in a frequentist structural equation modeling (SEM) context further supports the notion that model selection indices tend to favor more complex models, such as a bifactor model (Greene et al., 2019).

However, it is unclear to what extent model comparison results in practice can be biased toward a bifactor model or other complex models when Bayesian model selection indices are used. Investigations into fit propensity, such as Bonifay and Cai (2017), often equate competing models on the number of parameters, randomly generate data from some data space, and observe whether unadjusted model fit differs across models. Such investigations are critical for understanding that model complexity is more than just the number of estimated parameters per model. Nevertheless, it is difficult to extrapolate from the results of these investigations to real-world model selection for several reasons.

In practice, researchers often examine real data that should represent dimensional structures corresponding to the theoretical frameworks of the latent traits that the instruments were intended to measure, provided that the theory and the design of the instruments are not too far off. In addition, researchers are not always interested in selecting among models with an equal number of parameters. Along these lines, Preacher (2006) compared the fit propensity of several competing two-dimensional models inspired by empirical research but did not include a bifactor model or other nested-dimensionality models. Typical alternative candidate models to the bifactor model were also not examined by Bonifay and Cai (2017). It may be rare, for example, that a researcher would choose among a bifactor model, a unidimensional 3PL model, and a diagnostic classification model. It is arguably more common that competing models will have similar dimensional structures and differ in the number of estimated parameters. Although Falk and Muthukrishna (2021) compared a few such alternative models to a bifactor model, the pattern of the fit propensity results was difficult to interpret.

A first step in model selection is often to compute some index that could be used to differentiate the models. For example, in a frequentist context, the Akaike information criterion (AIC; Akaike, 1998) and the Bayesian information criterion (BIC; Schwarz, 1978) penalize complex models based on at least the number of estimated parameters, and some work in the SEM literature suggests that asymptotically the BIC is able to select the true model (if a parametric model exists), whereas the AIC will tend to choose a model that minimizes the mean squared error of prediction (Vrieze, 2012). Other model fit indices in SEM do not necessarily have these properties and may be ill-suited for model selection. Yet, some evidence also suggests that the AIC and BIC do not perform well when selecting among a bifactor model and various alternatives (Greene et al., 2019).

A valid argument from the fit propensity literature is that adjustments for parsimony based solely on the number of estimated parameters may not be adequate. This issue is further complicated when using Bayesian estimation because the prior distributions assigned to the parameters can affect the penalty terms of predictive performance indices, such as the effective number of parameters. Although the DIC is often the most popular predictive performance index due to its wide availability, there are other recently developed indices that could be used for model selection (Gelman et al., 2014; Vehtari et al., 2017). As we will review later, the indices examined in our study have typically not been simultaneously compared for their ability to perform model selection with multidimensional IRT models corresponding to the type of structures outlined earlier. These indices quantify model fit and model complexity in different ways and, therefore, may perform differently when conducting model selection in a multidimensional IRT context.

Understanding how much the bifactor model’s fit propensity could bias model comparison results (e.g., Bonifay & Cai, 2017; Preacher, 2006) is crucial because of the way those results could be used. For instance, the dimensional structure confirmed in data could inform the theoretical framework of the measured trait, inform the necessary measurement model to scale the respondents, and be used as evidence for the internal structural aspect of validity as outlined in Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014). We note that there is much further debate surrounding the bifactor model in the literature, including criticisms (Bonifay et al., 2017; Sellbom & Tellegen, 2019), overviews (Markon, 2019), and illustrations of the utility of the model (Hansen et al., 2014; Stucky & Edelen, 2015). We emphasize upfront that clues other than from pure evaluation of model fit—such as the size and direction of the general and specific item discrimination estimates—should be inspected, and a strong theoretical justification for conceptualizing the measured trait with a bifactor structure should exist. However, model fit is often the first step of a dimensionality analysis, and ideally, it is desirable if the model fit conclusions align with other steps in determining the most appropriate dimensional structure of the data.

Because of the implications incorrect conclusions from a dimensionality analysis could have in practice, we conducted a simulation study to investigate whether nested-dimensionality IRT models were favored over non-nested-dimensionality IRT models in situations in which the latter were the true models. For this study, the nested-dimensionality models included the bifactor model and the two-tier model (Cai, 2010). We included the two-tier model in our study because of the relationship between that model and the bifactor model. The former can be viewed as an extension of the latter, or conversely, the bifactor model can be viewed as a special case of the two-tier model (Cai, 2010), as the two differ with respect to the number of primary dimensions within which the secondary dimensions are nested. Therefore, if the bifactor model could be automatically favored over simpler models, then the two-tier model could be as well.

In our simulation study, we generated data to represent specific dimensional structures, with some being nested-dimensionality structures and others being non-nested-dimensionality structures. By generating data in this way, we were able to determine how much a nested-dimensionality model’s fit propensity could bias model comparison results. We also used Bayesian estimation methods for the IRT models rather than marginal maximum likelihood estimation because Bayesian estimation is suitable for smaller samples (Fujimoto & Neugebauer, 2020) and high-dimensional spaces (Fox, 2010), such as spaces often explored with nested-dimensionality models. To determine which model was the most appropriate for the data, we compared them with respect to predictive performance of the data. Our study, then, was designed to demonstrate which Bayesian predictive performance indices are more likely to accurately identify whether nested- or non-nested-dimensionality structures are represented in the data, making our findings useful to practitioners.

The remainder of the article is organized as follows. We first review the Bayesian predictive performance indices that we examined. We then report on the simulation study that we conducted to test the accuracy of these indices. We end the article with a discussion and concluding remarks.

Predictive Performance Indices

The predictive performance indices we investigated were the deviance information criterion (DIC; Spiegelhalter et al., 2002), Watanabe–Akaike information criterion (WAIC; Watanabe, 2013), and two approximations of the leave-one-out cross-validation (or LOO for brevity). The two approximations of LOO are based on importance sampling—one based on raw ratios, leading to the log-predicted marginal likelihood (LPML; Gelfand, 1996), and the other based on Pareto-smoothed weights, leading to the Pareto-smoothed importance sampling (PSIS) version of the LOO, or PSIS-LOO (Vehtari et al., 2017). We focused on these indices for the following reasons.

The DIC is available in various Bayesian estimation software programs (e.g., OpenBUGS) and Bayesian estimation packages within statistical software (e.g., SAS), is simple to compute, and has been reported in many studies, including in multidimensional IRT settings. The LPML is also easy to compute and has been used in simulation studies involving multidimensional IRT models. Relative to the DIC and LPML, the WAIC and PSIS-LOO are newer predictive performance indices; thus, to our knowledge, they have not been as extensively studied as the DIC and LPML, especially in the multidimensional IRT situations that we explored. However, the WAIC and PSIS-LOO were included in our study because of their relationship to the DIC and LPML, respectively, as we review below, and because general MATLAB and R code is available that can be adapted to work with many Bayesian estimation software programs (see Vehtari et al., 2017).

Next, we present the technical details of these indices and review the literature on how well these indices perform in selecting IRT models. We use the following notation for the presentation and the remainder of this article. Let $i$ represent a person (where $i = 1, 2, \dots, N$, with $N$ representing the sample size). Let $j$ represent an item (where $j = 1, 2, \dots, J$, with $J$ representing the number of items). Let $q$ represent a dimension (where $q = 1, 2, \dots, Q$, with $Q$ representing the number of dimensions). Let $s$ represent the $s$th saved sample during the Markov chain Monte Carlo (MCMC) sampling process (where $s = 1, 2, \dots, S$, with $S$ representing the number of saved samples). Let $c$ represent a response category (where $c = 0, 1, \dots, m_{j}$, with $m_{j}$ representing the highest score category for item $j$). Let $ω_{ij}$ represent the parameters of an IRT model associated with the response of person $i$ to item $j$ (i.e., the person ability and item parameters). Then $P (y_{ij} | ω_{ij})$ is the probability of a response of $y$ by person $i$ to item $j$ given $ω_{ij}$, with this probability based on an IRT model.

Deviance Information Criterion

The first index we review is the DIC, which can be represented as follows:

DIC = \bar{D} + p_{DIC},

(1)

where $\bar{D}$ is the posterior mean of the deviance, that is,

\bar{D} = - 2 \times \frac{1}{S} \sum_{s = 1}^{S} \sum_{i = 1}^{N} \sum_{j = 1}^{J} \log [P (y_{i j} | ω_{i j}^{s})],

(2)

with $ω_{ij}^{s}$ being the parameter values from the $s$th saved sample of the MCMC sampling process, and $p_{DIC}$ is the effective number of parameters (or penalty term), that is,

p_{DIC} = \bar{D} - D (\bar{ω}),

(3)

with $D (\bar{ω})$ being the deviance based on the means of the parameters’ posterior distributions.
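As a minimal sketch of Equations 1 to 3, the following Python function computes the DIC from saved MCMC output. It assumes (these names are illustrative, not from the study's C++ code) that `loglik_draws[s][n]` holds $\log P (y_{ij} | ω_{ij}^{s})$ for saved sample $s$ and the $n$th person-item response, and that the log-likelihood evaluated at the posterior means of the parameters has already been computed separately, because that step requires the fitted IRT model itself.

```python
def dic(loglik_draws, loglik_at_posterior_mean):
    """Return (DIC, p_DIC) on the deviance scale.

    loglik_draws: list of S lists; entry [s][n] is the log-likelihood of
        response n under saved sample s (hypothetical layout).
    loglik_at_posterior_mean: total log-likelihood evaluated at the
        posterior means of the parameters.
    """
    n_saved = len(loglik_draws)
    # Equation 2: posterior mean of the deviance.
    d_bar = -2.0 * sum(sum(row) for row in loglik_draws) / n_saved
    # Deviance at the posterior means, D(omega-bar).
    d_at_mean = -2.0 * loglik_at_posterior_mean
    # Equation 3: effective number of parameters.
    p_dic = d_bar - d_at_mean
    # Equation 1.
    return d_bar + p_dic, p_dic
```

Because the penalty is driven by a single point summary (the posterior means), this computation never looks at the spread of the pointwise log-likelihoods across saved samples.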

Although the DIC is widely used, a shortcoming is that it depends only on summaries of the posterior distribution instead of using all the information contained in the posterior, thereby possibly under-penalizing more complex models (Plummer, 2008). Such under-penalization could lead the DIC to favor complex models over simpler ones even when the latter are more appropriate for the data. The DIC’s tendency to favor more complex models has been observed within multidimensional IRT. In a simulation study, a full bifactor IRT model was compared with constrained variations of the model (e.g., a testlet model), and the DIC consistently favored the full bifactor IRT model regardless of whether it was the true model (Li et al., 2006). In another simulation study, the DIC selected the correct multidimensional IRT model, but in each of the dimensional conditions, only two models were compared (the true model and a unidimensional IRT model), thereby always making the true model the more complex one (Zhu & Stone, 2012). For example, in one condition, a testlet graded response model (GRM) was compared with a unidimensional GRM. There was never a condition in which the true model was the simpler one. Thus, the study conditions make it unclear whether the DIC was being accurate or was displaying a tendency to select the more complex model regardless of whether that model was the true model.

Watanabe–Akaike Information Criterion

We also investigated the performance of the WAIC, which has a similar form to the DIC (i.e., a model fit component and an adjustment for model complexity), but this index is fully Bayesian in that it uses all the information in the posterior distribution. The WAIC, when placed on the deviance scale, can be obtained through

\begin{matrix} WAIC = \sum_{i = 1}^{N} \sum_{j = 1}^{J} - 2 \times {\hat{elppd}}_{WAIC_{ij}} \\ = \sum_{i = 1}^{N} \sum_{j = 1}^{J} - 2 [\log (\frac{1}{S} \sum_{s = 1}^{S} P (y_{ij} | ω_{ij}^{s})) - p_{WAIC_{ij}}], \end{matrix}

(4)

where $\hat{elppd}$ denotes the computed expected log-pointwise predictive density. The $p_{WAIC}$ is the effective number of parameters and can be obtained in two ways, although we focus on the version recommended by Vehtari et al. (2017), which is based on the variance across the MCMC saved samples for the observation from person $i$ on item $j$:

p_{WAIC} = \sum_{i = 1}^{N} \sum_{j = 1}^{J} p_{WAIC_{ij}} = \sum_{i = 1}^{N} \sum_{j = 1}^{J} V_{s = 1}^{S} [\log P (y_{ij} | ω_{ij}^{s})] .

(5)
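Equations 4 and 5 can be sketched in Python as follows. The layout `loglik_draws[s][n]` (saved sample by person-item response) is a hypothetical convention, and the variance is computed with an $S - 1$ denominator, which is an assumption about how $V_{s = 1}^{S}$ is evaluated. The log of the average probability is computed with a max-shift so that very small probabilities do not underflow.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values) / len(values))


def waic(loglik_draws):
    """Return (WAIC, p_WAIC) on the deviance scale.

    loglik_draws[s][n] is log P(y_n | omega_n^s) for saved sample s and
    response n (hypothetical layout; needs at least two saved samples).
    """
    n_saved = len(loglik_draws)
    n_obs = len(loglik_draws[0])
    total, p_waic = 0.0, 0.0
    for n in range(n_obs):
        column = [loglik_draws[s][n] for s in range(n_saved)]
        # Log of the average probability across saved samples (Equation 4).
        lppd_n = log_mean_exp(column)
        # Pointwise penalty: variance of the log-likelihood (Equation 5).
        mean_n = sum(column) / n_saved
        p_n = sum((v - mean_n) ** 2 for v in column) / (n_saved - 1)
        total += -2.0 * (lppd_n - p_n)
        p_waic += p_n
    return total, p_waic
```

Unlike the DIC penalty, the pointwise penalty here grows whenever a response's log-likelihood varies across saved samples, which is how the full posterior enters the index.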

The existing literature on the performance of the WAIC in selecting the correct IRT model is limited to unidimensional situations. Within a unidimensional IRT setting, the WAIC was able to accurately differentiate among the one-, two-, and three-parameter logistic IRT models (Luo & Al-Harbi, 2017) and between unidimensional versions of the GRM and the generalized partial credit model (da Silva et al., 2019). To date, however, no studies have examined how accurately the WAIC can differentiate between non-nested- and nested-dimensionality IRT models, so it is unclear how this index behaves when used to select among multidimensional IRT models.

Log-Predicted Marginal Likelihood

As previously noted, the LPML is an approximation to the LOO based on raw importance sampling. We refer to this index as the LPML rather than the raw importance sampling version of the LOO to differentiate it from the next version of the LOO that we describe shortly. The LPML is based on raw importance ratios, or formally,

r_{ij}^{s} = \frac{1}{P (y_{ij} | ω_{ij}^{s})},

(6)

leading to

\begin{matrix} LPML = \sum_{i = 1}^{N} \sum_{j = 1}^{J} \log (\frac{\sum_{s = 1}^{S} r_{ij}^{s} P (y_{ij} | ω_{ij}^{s})}{\sum_{s = 1}^{S} r_{ij}^{s}}) \\ = \sum_{i = 1}^{N} \sum_{j = 1}^{J} \log (\frac{1}{\frac{1}{S} \sum_{s = 1}^{S} \frac{1}{P (y_{ij} | ω_{ij}^{s})}}) . \end{matrix}

(7)

This index is simple to compute, but it could be unstable because the importance ratios could have a high or infinite variance (Vehtari et al., 2017). Nevertheless, the performance of the LPML has been investigated in multidimensional situations. In simulation studies in which nested-dimensionality IRT models (e.g., a bifactor IRT model) and constrained (or more parsimonious) variations of those nested-dimensionality models (e.g., a testlet IRT model) were being compared, with these studies having conditions in which each type of model was the true model, the LPML consistently identified the correct model (Fujimoto, 2018, 2019, 2020; Li et al., 2006). In other words, in these studies, the LPML did not consistently favor the more complex models—potentially making this index less biased toward nested-dimensionality models than the DIC.

In another study, however, the LPML displayed a slight tendency to favor the more complex model when the true model was the bifactor IRT model and the comparison model was a two-tier IRT model, within which the bifactor model is nested—the two-tier model was favored over the bifactor model up to 24% of the time depending on the sample size (Fujimoto & Neugebauer, 2020). These two models, though, were nested-dimensionality models, and the researchers noted that even though the two-tier model was incorrectly favored over the bifactor model in these instances, the estimates from the two-tier model indicated that the model was reduced to a bifactor model, thereby leading to the same substantive conclusion about the dimensional structure of the data as that based on a bifactor model. As of now, it is unclear as to whether the LPML can correctly identify the more parsimonious model in situations in which a unidimensional model is being compared with a bifactor model.
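For concreteness, the second line of Equation 7 can be sketched in Python as below: each response contributes the log of the harmonic mean of its probabilities across saved samples. The `loglik_draws[s][n]` layout is a hypothetical convention, and the computation is carried out on the log scale to avoid overflow when some $P (y_{ij} | ω_{ij}^{s})$ is very small.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values) / len(values))


def lpml(loglik_draws):
    """Log-predicted marginal likelihood from pointwise log-likelihoods.

    loglik_draws[s][n] is log P(y_n | omega_n^s) for saved sample s and
    response n (hypothetical layout).
    """
    n_saved = len(loglik_draws)
    n_obs = len(loglik_draws[0])
    total = 0.0
    for n in range(n_obs):
        # Raw log importance ratios: log r = -log P (Equation 6).
        log_ratios = [-loglik_draws[s][n] for s in range(n_saved)]
        # log(1 / mean(1 / P)) = -log_mean_exp(log r).
        total += -log_mean_exp(log_ratios)
    return total
```

A single saved sample with a very small probability dominates the mean of the ratios, which illustrates the instability noted above.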

Pareto-Smoothed Importance Sampling–Leave-One-Out Cross-Validation

Vehtari et al. (2022) replaced the raw importance ratios with weights $(u_{ij}^{s})$ to address the instability of the LPML’s variance, leading to the PSIS-LOO:

\begin{matrix} PSIS\text{-}LOO = \sum_{i = 1}^{N} \sum_{j = 1}^{J} {\hat{elppd}}_{PSIS\text{-}LOO_{ij}} \\ = \sum_{i = 1}^{N} \sum_{j = 1}^{J} \log (\frac{\sum_{s = 1}^{S} u_{ij}^{s} P (y_{ij} | ω_{ij}^{s})}{\sum_{s = 1}^{S} u_{ij}^{s}}) . \end{matrix}

(8)

The weights are based on the raw importance ratios in Equation 6, but a smoothing procedure involving the generalized Pareto distribution is applied to the $M$ largest raw ratios (for the technical details about the smoothing procedure, see Vehtari et al., 2022). Any difference in the performance between the PSIS-LOO and the LPML, then, can be attributed to the smoothing procedure performed on the raw ratios. Although this procedure makes the PSIS-LOO more stable than the LPML, the former index is more computationally intensive, as the raw ratios have to be calculated and a smoothing procedure has to be subsequently performed.
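The weighted form in Equation 8 can be sketched in Python as follows. This sketch deliberately does not implement the Pareto smoothing step: it takes log weights as an input (the `log_weights[s][n]` layout is hypothetical), so smoothed weights from another routine could be passed in. Passing the raw log ratios from Equation 6 recovers the LPML in Equation 7, which makes explicit that the two indices differ only in the weights.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of the sum of exp(values)."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))


def weighted_loo(loglik_draws, log_weights):
    """Self-normalized importance sampling estimate in Equation 8.

    loglik_draws[s][n]: log P(y_n | omega_n^s) (hypothetical layout).
    log_weights[s][n]: log of the weight u for saved sample s, response n.
        Smoothed weights must be supplied by the caller; no Pareto
        smoothing is performed here.
    """
    n_obs = len(loglik_draws[0])
    total = 0.0
    for n in range(n_obs):
        log_num = log_sum_exp(
            [lw[n] + ll[n] for lw, ll in zip(log_weights, loglik_draws)]
        )
        log_den = log_sum_exp([lw[n] for lw in log_weights])
        total += log_num - log_den
    return total


def raw_log_ratios(loglik_draws):
    """Log of the raw importance ratios in Equation 6."""
    return [[-v for v in row] for row in loglik_draws]
```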

Similar to the WAIC, the studies that focused on the performance of the PSIS-LOO within IRT are limited to unidimensional settings, and the studies are the same as those we reviewed for the WAIC (da Silva et al., 2019; Luo & Al-Harbi, 2017). In unidimensional situations, overall, the PSIS-LOO identified the correct model, but again, these studies involved differentiating among unidimensional models (e.g., the one-, two-, and three-parameter logistic IRT models). Thus, how capable the PSIS-LOO is at differentiating between non-nested- and nested-dimensionality IRT models has not been investigated.

These Bayesian predictive performance indices have not been studied together to determine whether they can differentiate between nested- and non-nested-dimensionality IRT models, especially when the data represent non-nested-dimensionality structures. In other words, whether the greater fit propensity of nested-dimensionality models leads one or more of these indices to be biased toward those models has not been explored, a void our study fills.

Simulation Study

We conducted a simulation study with a $3 \times 4$ design (sample size by dimensional structure) to examine how well the DIC, WAIC, LPML, and PSIS-LOO could identify the correct IRT model for the data. We generated 100 data sets for each condition, with each data set resembling 4-point ratings on 30 items. The sample sizes were 200, 500, and 1,000, which allowed us to see whether sample size affected the performance of these indices.

As for the other variable aspect of the study, the dimensional structures included two non-nested-dimensionality and two nested-dimensionality structures (see Figure 1 for visualizations of these structures). The non-nested-dimensionality structures were a unidimensional and a two-dimensional structure (Figure 1A and 1B, respectively). For the latter structure, the dimensions were correlated at .50, and each item discriminated on only one dimension (i.e., a simple structure), with the items evenly distributed across the dimensions for 15 items per dimension.

Figure 1.

The Following Are Visualizations of the Dimensional Structures to Which the Data Were Generated to Represent (Figures 1A, 1B, 1D, and 1E for the Unidimensional, Two-Dimensional, Bifactor, and Two-Tier Conditions, Respectively) and of the Dimensional Structures Specified for the Models. (A) Unidimensional Structure (Model 1), (B) Two-Dimensional Structure (Model 2), (C) Six-Dimensional Structure (Model 3), (D) Bifactor Structure (Model 4), (E) Two-Tier Structure (Model 5), (F) Alternative Two-Tier Structure (Model 6).

The nested-dimensionality structures were a bifactor and a two-tier structure (Figure 1D and 1E, respectively). The bifactor structure included one primary (or general) dimension and six secondary (or specific) dimensions, with all dimensions being orthogonal to each other. Each item discriminated on the primary dimension and one secondary dimension, with the items evenly distributed across the secondary dimensions. The two-tier structure consisted of two primary dimensions correlated at .50 and six secondary dimensions orthogonal to each other and the primary dimensions. Each item discriminated on one primary and one secondary dimension, with 15 items per primary dimension and five items per secondary dimension.
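The crossed design described above can be enumerated with a short Python sketch. The condition labels are illustrative names for the four data-generating structures, and the totals follow directly from 100 data sets per condition.

```python
from itertools import product

# Factors of the 3 x 4 design (labels are illustrative).
sample_sizes = [200, 500, 1000]
structures = ["unidimensional", "two-dimensional", "bifactor", "two-tier"]
n_replications = 100  # data sets generated per condition

# Every combination of sample size and data-generating structure.
conditions = list(product(sample_sizes, structures))
n_datasets = len(conditions) * n_replications

for n, structure in conditions:
    print(f"N = {n:>5}, structure = {structure}")
print(f"{len(conditions)} conditions, {n_datasets} data sets in total")
```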

Analytic Strategy

Models Used

We used six models to analyze each data set. The models included a unidimensional (Model 1), two-dimensional (Model 2), six-dimensional (Model 3), bifactor (Model 4), and two variations of the two-tier IRT model (Models 5 and 6). Models 5 and 6 differed in that Model 5’s dimensional structure matched that of the two-tier condition, and Model 6’s structure was identical to Model 5’s structure in all ways except the first 20 items discriminated on one of the primary dimensions and the last 10 items discriminated on the other primary dimension (rather than 15 items discriminating on each primary dimension, as with Model 5). Visualizations of the dimensional structures specified for the models are also in Figure 1. These models were selected because many of them differed noticeably in complexity but could be competing models, such as the two-tier model having six more dimensions and 30 more item discriminations than the between-item two-dimensional model, or were similar in complexity but had slightly different structures, such as both two-tier models (Models 5 and 6).

These models, then, together with the dimensional conditions, allowed us to determine whether one or more of the predictive performance indices were biased toward nested-dimensionality models and to provide insight into some of the factors that could contribute to the bias. The unidimensional and two-dimensional data generation conditions were where such bias could be revealed. If a predictive performance index is biased toward nested-dimensionality models because of their greater fit propensity, it should favor those models in all dimensional conditions, including the unidimensional and two-dimensional data generation conditions. If the index is not influenced by a model’s fit propensity, then it should favor the nested-dimensionality models corresponding to the data generation models in only the bifactor and the two-tier conditions, and the index should favor the corresponding non-nested-dimensionality models in the unidimensional and two-dimensional conditions. The six-dimensional model (Model 3) was included to be an additional competing non-nested-dimensionality model that was more complex than the unidimensional and two-dimensional models but less complex than the nested-dimensionality models. In addition, the six-dimensional model was included because the nested-dimensionality conditions had six secondary dimensions, and often in model comparisons, nested-dimensionality models are compared with non-nested-dimensionality models that represent the secondary dimensions of the nested structure, with the dimensions allowed to be correlated (e.g., Canivez, 2016).

Technical Details of the Models

All of the models we used were special cases of the multidimensional version of the generalized partial credit model (Muraki, 1992), where the conditional probability of a response of $y$ by person $i$ to item $j$ is obtained through

P (y_{ij} | ω_{ij}) = \frac{\exp \sum_{h = 0}^{y_{ij}} [α_{j} θ_{i}^{⊤} - (β_{j} + τ_{jh})]}{\sum_{k = 0}^{m_{j}} \exp \sum_{h = 0}^{k} [α_{j} θ_{i}^{⊤} - (β_{j} + τ_{jh})]},

(9)

with the constraint

\sum_{h = 0}^{0} [α_{j} θ_{i}^{⊤} - (β_{j} + τ_{jh})] \equiv 0 .

(10)

In Equation 9, $θ_{i}$ is person $i$'s $1 \times Q$ vector of latent trait positions; $α_{j}$ and $β_{j}$ are item $j$'s $1 \times Q$ vector of discriminations and overall intercept, respectively; and $τ_{jc}$ is item $j$'s relative intercept for category $c$. For the unidimensional model, $θ_{i}$ and $α_{j}$ are scalars (i.e., $Q = 1$).
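To make Equations 9 and 10 concrete, the following Python sketch returns the category probabilities for one person and one item. The function name and argument layout are illustrative; `tau` is assumed to hold the relative intercepts for categories $1$ through $m_{j}$, and the term for category 0 is fixed to 0 as required by Equation 10.

```python
import math


def gpcm_probs(alpha, theta, beta, tau):
    """Category probabilities under the multidimensional GPCM (Equation 9).

    alpha: list of Q discriminations for the item.
    theta: list of Q latent trait positions for the person.
    beta: overall item intercept.
    tau: relative intercepts for categories 1..m_j (hypothetical layout).
    Returns a list of m_j + 1 probabilities for categories 0..m_j.
    """
    linear = sum(a * t for a, t in zip(alpha, theta))
    # Cumulative sums of [alpha theta' - (beta + tau_h)]; the h = 0 term
    # is defined to be 0 (Equation 10).
    cumulative = [0.0]
    for tau_h in tau:
        cumulative.append(cumulative[-1] + linear - (beta + tau_h))
    # Softmax over the cumulative sums, shifted for numerical stability.
    shift = max(cumulative)
    numerators = [math.exp(z - shift) for z in cumulative]
    denominator = sum(numerators)
    return [num / denominator for num in numerators]
```

For a bifactor item, `alpha` would contain seven entries with all but two fixed to 0, so only the primary dimension and one secondary dimension contribute to `linear`.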

We assigned prior distributions to the parameters that have been shown to lead to models suitable for data representing sample sizes of at least 100 (Fujimoto & Neugebauer, 2020). For these distributions, we use $N_{Q} (\cdot, \cdot)$ to represent a $Q$-variate normal distribution parameterized by a mean vector and a variance–covariance matrix; $N (\cdot, \cdot)$, a univariate normal distribution parameterized by a mean and standard deviation (SD); $N_{(a, b)} (\cdot, \cdot)$, a normal distribution truncated from $a$ to $b$; and $δ_{0} (\cdot)$, a degenerate distribution that fixes a parameter to 0.

The latent trait dimensional positions were assumed to be distributed as

θ_{i} ~ N_{Q} (0, Σ_{θ}),

(11)

where $0$ is a mean vector of 0s to set the location of the latent trait scale, and $Σ_{θ}$ is a variance–covariance matrix with the main diagonal elements set to 1 to establish the metric of the scale, and formally,

Σ_{θ} = (\begin{matrix} 1 & ρ_{12} & \dots & ρ_{1 q^{'}} & ρ_{1 Q} \\ ρ_{21} & 1 & \dots & ρ_{2 q^{'}} & ρ_{2 Q} \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ ρ_{q 1} & ρ_{q 2} & \dots & 1 & ρ_{qQ} \\ ρ_{Q 1} & ρ_{Q 2} & \dots & ρ_{Q q^{'}} & 1 \end{matrix}),

(12)

making it a correlation matrix as well. The unique correlations were assigned the prior of

ρ_{qq'} ~ \begin{cases} N_{(- 1, 1)} (0, 2), & \text{if dimensions } q \text{ and } q' \text{ were allowed to be correlated,} \\ δ_{0} (\cdot), & \text{if the correlation was fixed to 0,} \end{cases}

(13)

with $q$ and $q'$ representing different dimensions (where $q > q'$) and the sampled values required to yield a positive-definite matrix.
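One common way to verify the positive-definiteness requirement for a candidate correlation matrix is to attempt a Cholesky factorization, sketched below in plain Python. This is an illustration only; the text does not state how the requirement was enforced in the authors' sampler.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization.

    matrix: list of rows (list of lists of floats), assumed symmetric.
    """
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                # A non-positive pivot means the matrix is not positive definite.
                if pivot <= 0.0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True
```

For example, the two-dimensional structure with a correlation of .50 passes this check, whereas a set of three pairwise correlations of .9, .9, and -.9 does not.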

Regarding the item parameters, the item discriminations were assigned a prior distribution of

α_{jq} ~ \begin{cases} \log N (μ_{α_{q}}, 0.50), & \text{if item } j \text{ discriminated on dimension } q, \\ δ_{0} (\cdot), & \text{if the discrimination was fixed to 0,} \end{cases}

(14)

with the mean of the lognormal distribution assigned a hyperprior of

μ_{α_{q}} ~ \begin{cases} N (0, 1), & \text{if } q \text{ was a general (or primary) dimension,} \\ N (- 0.41, 1.00), & \text{if } q \text{ was a specific (or secondary) dimension.} \end{cases}

(15)

The overall intercepts were assigned a prior of

β_{j} ~ N (μ_{β}, 5),

(16)

with $μ_{β}$ assigned a hyperprior distribution of

μ_{β} ~ N (0, 2) .

(17)

Finally, the relative category intercepts were assigned a prior of

(τ_{j 1}, τ_{j 2}, \dots, τ_{j m_{j} - 1}) ~ N_{m_{j} - 1} (0, 100 I),

(18)

where $I$ is an identity matrix, and $τ_{j m_{j}} = - \sum_{c = 1}^{m_{j} - 1} τ_{jc}$ (i.e., item $j$'s last relative category intercept was constrained to be the negative of the sum of all preceding category intercepts).
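The sum-to-zero constraint on the relative category intercepts amounts to the small Python sketch below. The function name is illustrative; for the 4-point items in this study, two intercepts would be free and the third would be determined.

```python
def complete_tau(free_tau):
    """Append the constrained last relative category intercept.

    free_tau: the m_j - 1 freely estimated relative intercepts.
    Returns all m_j relative intercepts, which sum to zero.
    """
    return list(free_tau) + [-sum(free_tau)]


# Two free intercepts for a 4-point item (values are made up).
tau_j = complete_tau([0.4, -0.1])
```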

The models used in this study were obtained by modifying the sizes of $α_{j}$ and $Σ_{θ}$ and specifying which of their elements were estimated. To give an example, for the bifactor model, $α_{j}$ was a $1 \times 7$ vector (one primary and six secondary dimensions), with the first and one of the remaining elements (depending on which secondary dimension the item discriminated on) being estimated and the other elements being fixed to 0, and $Σ_{θ}$ was a $7 \times 7$ identity matrix.
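The way each model is obtained by freeing or fixing elements of $α_{j}$ can be illustrated with indicator patterns, where 1 marks an estimated discrimination and 0 marks one fixed to 0. In this Python sketch, the assignment of items to secondary dimensions in consecutive blocks of five is an assumption made for illustration (the text states only that the items were evenly distributed), and the function names are hypothetical.

```python
def bifactor_pattern(n_items=30, n_specific=6):
    """Rows are items; column 0 is the primary dimension."""
    per_specific = n_items // n_specific
    rows = []
    for j in range(n_items):
        row = [0] * (1 + n_specific)
        row[0] = 1                      # every item loads on the primary
        row[1 + j // per_specific] = 1  # and on exactly one secondary
        rows.append(row)
    return rows


def two_tier_pattern(n_items=30, n_first_primary=15, n_specific=6):
    """Columns 0 and 1 are the two primary dimensions."""
    per_specific = n_items // n_specific
    rows = []
    for j in range(n_items):
        row = [0] * (2 + n_specific)
        row[0 if j < n_first_primary else 1] = 1
        row[2 + j // per_specific] = 1
        rows.append(row)
    return rows


model_4 = bifactor_pattern()                    # 30 x 7
model_5 = two_tier_pattern(n_first_primary=15)  # 30 x 8
model_6 = two_tier_pattern(n_first_primary=20)  # first 20 items on one primary
```

The matching $Σ_{θ}$ would then free only the correlation between the two primary dimensions in the two-tier patterns and no correlations at all in the bifactor pattern.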

The MCMC algorithm used to estimate the posterior distributions for all models was written in C++. The algorithm was run for 120,000 iterations for the six-dimensional, bifactor, and two-tier models; the first 20,000 samples were discarded and thereafter every 10th sample was saved, leaving a total of 10,000 saved samples on which inferences were made. For the unidimensional and two-dimensional models, the algorithm was run for 50,000 iterations, with the first 20,000 samples being discarded and thereafter every 3rd sample being saved, leaving a total of 10,000 saved samples. To determine whether the Markov chains stabilized, the batch means and the MCMC 95% half-width intervals were inspected (Geyer, 2011).
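The burn-in and thinning arithmetic can be checked with a short Python sketch. Which post-burn-in iteration is saved first is an assumption here (the first one), and it does not change the counts.

```python
def n_saved_samples(n_iterations, burn_in, thin):
    """Number of samples kept after discarding burn-in and thinning."""
    return len(range(burn_in, n_iterations, thin))


# Six-dimensional, bifactor, and two-tier models.
complex_models = n_saved_samples(120_000, 20_000, 10)
# Unidimensional and two-dimensional models.
simple_models = n_saved_samples(50_000, 20_000, 3)
```

Both schedules therefore yield the same number of saved samples, so the indices are computed from posterior summaries of equal size across models.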

Model Comparisons

In each dimensional condition, we report the percentage of times each model was favored over all other models. We used the following criteria to determine when one model was favored over another. These criteria addressed model selection uncertainty (Preacher & Merkle, 2012) and identified differences in predictive performance that were more than trivial. When using the DIC to compare two models (say, Model A and Model B), we favored Model A, regardless of whether it was the more parsimonious model (i.e., the model with fewer estimated parameters), when the difference between their indices, ${diff}_{DIC} = {DIC}_{B} - {DIC}_{A}$, was greater than 10. This threshold was chosen because it indicates that Model A’s predictive performance is substantially better than Model B’s (MRC Biostatistics Unit, n.d.) and because it makes it harder to favor more complex models. When Model A was the more parsimonious model, it was also favored when the absolute value of ${diff}_{DIC}$ was less than 10: a difference between $-10$ and $10$ indicates that neither model was strongly favored over the other, so Model A was favored for reasons of parsimony.

When using the LPML, we calculated the pseudo Bayes factor (PsBF) on twice the log scale, that is, $PsBF = 2 \times ({LPML}_{A} - {LPML}_{B})$. Regardless of whether Model A was the simpler model, it was favored when the PsBF was greater than 10, as this threshold indicates that the data very strongly support Model A over Model B (Kass & Raftery, 1995). When Model A was the simpler model, we also favored it when the absolute value of the PsBF was less than 10 for reasons of parsimony.
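The DIC and PsBF rules share the same structure: a difference beyond the threshold of 10 decides the comparison, and otherwise parsimony decides. A minimal sketch of that logic follows. The function names and the numeric inputs are hypothetical, and the text does not say how a difference of exactly 10 was handled, so the sketch treats it as not exceeding the threshold.

```python
def dic_evidence(dic_a, dic_b):
    """diff_DIC = DIC_B - DIC_A; positive values support Model A."""
    return dic_b - dic_a


def psbf(lpml_a, lpml_b):
    """Pseudo Bayes factor on twice the log scale; positive values support Model A."""
    return 2.0 * (lpml_a - lpml_b)


def favored(evidence_for_a, a_is_simpler=False, b_is_simpler=False, threshold=10.0):
    """Return 'A', 'B', or None when neither model is favored."""
    if evidence_for_a > threshold:
        return "A"
    if evidence_for_a < -threshold:
        return "B"
    # The difference is not large enough, so fall back on parsimony.
    # (Assumption: a difference of exactly +/- threshold also lands here.)
    if a_is_simpler:
        return "A"
    if b_is_simpler:
        return "B"
    return None


# Hypothetical index values, chosen only to exercise each branch.
print(favored(dic_evidence(11_800.0, 11_795.0), a_is_simpler=True))  # A, by parsimony
print(favored(psbf(-5_915.0, -5_905.0), a_is_simpler=True))          # B, PsBF = -20
print(favored(3.0))                                                  # None
```

The last call shows the case in which two models of equal complexity differ by less than the threshold, so neither is favored.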

When using the PSIS-LOO to compare Models A and B, a standard error (SE) of the difference in the PSIS-LOOs can be calculated. Thus, to determine which model was favored, we divided the difference by its corresponding SE, that is,

z_{\text{PSIS-LOO}} = \frac{\text{PSIS-LOO}_{A} - \text{PSIS-LOO}_{B}}{SE},

(19)

where

SE = \sqrt{NJ \left[ V_{i = 1, j = 1}^{N, J} \left( \widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}^{A} - \widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}^{B} \right) \right]},

(20)

with $V_{i = 1, j = 1}^{N, J} (\widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}^{A} - \widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}^{B})$ representing the variance of the difference between the models’ elppds across the observations indexed by $i$ and $j$ (where $\widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}$ is from Equation 8). Vehtari et al. (2017) did not provide guidelines for interpreting $z_{\text{PSIS-LOO}}$, as they noted that this approach needs further evaluation to determine its accuracy. For this study, regardless of whether Model A was the simpler model, we favored it over Model B when $z_{\text{PSIS-LOO}} > 1.96$, and when Model A was the simpler model, we also favored it when $| z_{\text{PSIS-LOO}} | < 1.96$ for reasons of parsimony. We used the criterion of 1.96 because 2.5% of the values fall above that point in the standard normal distribution.
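Equations 19 and 20 can be sketched from two lists of pointwise elppd values, one per model and one entry per observation. The sketch assumes that each model's PSIS-LOO is the sum of its pointwise values and that $V$ is the sample variance (with an $n - 1$ denominator); the function name and the four pointwise values are hypothetical.

```python
import math
import statistics


def z_statistic(pointwise_a, pointwise_b):
    """z = (sum of pointwise differences) / sqrt(NJ * variance of the differences)."""
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    n_obs = len(diffs)  # plays the role of N * J in Equation 20
    se = math.sqrt(n_obs * statistics.variance(diffs))  # assumption: sample variance
    if se == 0.0:
        raise ValueError("pointwise differences have zero variance")
    return sum(diffs) / se


# Hypothetical pointwise elppd values for four observations.
model_a = [-1.0, -1.2, -0.8, -1.1]
model_b = [-1.5, -1.4, -1.3, -1.2]
z = z_statistic(model_a, model_b)
print(round(z, 3), z > 1.96)
```

With real data, the lists would hold one value per person-by-item response, so their length would equal $NJ$.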

For the WAIC, the corresponding $z_{WAIC}$ can be obtained by taking the difference between the two models’ WAIC values and dividing it by its SE, with the SE obtained by replacing the $\widehat{\text{elppd}}_{\text{PSIS-LOO}_{ij}}$ in Equation 20 with the elppd for the WAIC on the deviance scale (i.e., $-2 \times \widehat{\text{elppd}}_{\text{WAIC}_{ij}}$ in Equation 4). The same criteria used to interpret the $z_{\text{PSIS-LOO}}$ were used to interpret the $z_{WAIC}$.

The criteria could make it so that one model is not outright favored over all other models for a data replicate, regardless of the index used. This could occur when both two-tier IRT models (Models 5 and 6) outperform the other models but are equivalent to each other. Because both of these models have similar complexity and the same number of parameters, neither model would be favored, leading to no one model being favored over all other models for that data replicate.
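The way pairwise decisions combine into an overall decision, including the case in which no model is favored, can be sketched as follows. The function `overall_winner`, the DIC values, and the parameter counts are all hypothetical; only the decision logic is taken from the criteria described above.

```python
def overall_winner(n_params, evidence, threshold):
    """Return the model favored over every other model, or None.

    n_params: dict mapping model name to its number of estimated parameters
    evidence: function (a, b) -> float, positive when the index supports a over b
    """
    def beats(a, b):
        e = evidence(a, b)
        if e > threshold:
            return True
        if e < -threshold:
            return False
        # Difference too small to decide: only a strictly simpler model wins.
        return n_params[a] < n_params[b]

    for a in n_params:
        if all(beats(a, b) for b in n_params if b != a):
            return a
    return None


# Hypothetical values for one data replicate.
dic = {"bifactor": 29_060.0, "two_tier": 29_025.0, "two_tier_alt": 29_024.0}
n_params = {"bifactor": 180, "two_tier": 181, "two_tier_alt": 181}


def dic_evidence(a, b):
    return dic[b] - dic[a]


print(overall_winner(n_params, dic_evidence, threshold=10.0))
```

Here both two-tier models beat the bifactor model, but they differ from each other by only 1 and have the same number of parameters, so the function returns `None` for this replicate.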

Data Generation

For the unidimensional condition, the latent trait dimensional positions were randomly drawn from $N (0, 1)$ , and for the other dimensional conditions, the values were drawn from $N_{Q} (0, Σ_{θ})$ , where $0$ and $Σ_{θ}$ were, respectively, a vector of 0s and a variance–covariance matrix with all of the main diagonal elements set to 1. The off-diagonal elements of $Σ_{θ}$ depended on the condition. For the two-dimensional condition, the off-diagonal element (i.e., the correlation) was .50; for the bifactor condition, the off-diagonal elements were 0, making $Σ_{θ}$ an identity matrix; and for the two-tier condition, the element representing the correlation between the two primary dimensions was .5 and the other off-diagonal elements were 0.
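The structure of $Σ_{θ}$ in each condition, and a draw from $N_{Q} (0, Σ_{θ})$, can be sketched with the standard library alone. The helper names are hypothetical, and the sketch assumes six secondary dimensions in the two-tier condition, as in the bifactor model described earlier; it is an illustration rather than the authors' data generation code.

```python
import math
import random


def build_sigma(n_primary, n_secondary, primary_corr):
    """Unit diagonal; primary dimensions correlate at primary_corr; all else 0."""
    q = n_primary + n_secondary
    sigma = [[1.0 if r == c else 0.0 for c in range(q)] for r in range(q)]
    for r in range(n_primary):
        for c in range(n_primary):
            if r != c:
                sigma[r][c] = primary_corr
    return sigma


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(lower[r][k] * lower[c][k] for k in range(c))
            if r == c:
                lower[r][c] = math.sqrt(matrix[r][r] - s)
            else:
                lower[r][c] = (matrix[r][c] - s) / lower[c][c]
    return lower


def draw_theta(sigma, rng):
    """One draw from N(0, sigma) by transforming independent standard normals."""
    lower = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in sigma]
    return [sum(lower[r][k] * z[k] for k in range(r + 1)) for r in range(len(sigma))]


two_dimensional = build_sigma(2, 0, 0.5)
bifactor = build_sigma(1, 6, 0.0)  # 7 x 7 identity matrix
two_tier = build_sigma(2, 6, 0.5)  # assumption: six secondary dimensions
theta = draw_theta(two_tier, random.Random(0))
print(len(theta))
```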

Regarding the item parameters, the nonzero item discriminations were randomly drawn from a uniform distribution ranging from 1 to 3, or $U (1, 3)$, and then rescaled as follows. For the unidimensional and two-dimensional conditions, the drawn values for each dimension were rescaled to a target scaling factor of 1.4. For the nested-dimensionality structures, the values related to the primary dimensions were rescaled to a target scaling factor of 1.4, and the values related to the secondary dimensions were rescaled to target scaling factors of 1.0, 1.0, 0.7, 1.0, 0.7, and 1.0 for the first to sixth secondary dimensions, respectively. On the whole, the target factors for the nested-dimensionality conditions led the items to discriminate more strongly on the primary dimensions than on the secondary dimensions, resembling a situation for which nested-dimensionality IRT models are most appropriate (DeMars, 2013; Reise, 2012; Toland et al., 2017). In addition, the target factors caused the secondary dimensions’ influence on the responses to vary.

To obtain the data generation values for the items’ overall $(β_{j})$ and relative category $(τ_{j})$ intercepts, we first obtained values for the direct category intercepts (i.e., $δ_{jc}$, where $δ_{jc} = β_{j} + τ_{jc}$, for $c > 0$) and then transformed those values into $β_{j}$ and $τ_{j}$. This approach ensured that the relative category intercepts increased monotonically by a sufficient amount so that all of the items’ categories were represented in the data (i.e., there were no null categories). The first direct intercept for item $j$ $(δ_{j 1})$ was randomly drawn from $U (- 2.5, 0)$. The second and third direct intercepts ($δ_{j 2}$ and $δ_{j 3}$, respectively) were $δ_{j 2} = δ_{j 1} + U (0.75, 1.50)$ and $δ_{j 3} = δ_{j 2} + U (0.75, 1.50)$. Item $j$’s overall intercept was obtained through $β_{j} = (\sum_{z = 1}^{3} δ_{jz}) / 3$, and the relative category intercepts were obtained through $τ_{jc} = δ_{jc} - β_{j}$, for $c > 0$, which follows from $δ_{jc} = β_{j} + τ_{jc}$. The data generation values used for the item parameters are in the Online Supplementary Material.
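The intercept generation steps can be sketched as below. The sketch takes $τ_{jc} = δ_{jc} - β_{j}$, the form implied by the definition $δ_{jc} = β_{j} + τ_{jc}$; the function name and seed are hypothetical.

```python
import random


def generate_intercepts(rng):
    """Return (beta_j, [tau_j1, tau_j2, tau_j3], [delta_j1, delta_j2, delta_j3])."""
    delta1 = rng.uniform(-2.5, 0.0)
    delta2 = delta1 + rng.uniform(0.75, 1.50)
    delta3 = delta2 + rng.uniform(0.75, 1.50)
    deltas = [delta1, delta2, delta3]
    beta = sum(deltas) / 3
    # Centering the direct intercepts on beta yields relative intercepts
    # that increase with the category and sum to zero.
    taus = [d - beta for d in deltas]
    return beta, taus, deltas


beta, taus, deltas = generate_intercepts(random.Random(1))  # arbitrary seed
print(beta, taus)
```

Because each step adds at least 0.75, adjacent relative category intercepts are always separated by at least that amount, which is what keeps every category represented in the generated data.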

Results

The model comparison results are summarized in Tables 1 to 4. In each table, the values in the predictive performance index columns are the averages across the 100 data replicates. Each value in the “%” column indicates the percentage of times the corresponding model was favored over all other models.

Table 1.

Model Comparison Results From the Unidimensional Condition.

Model	N = 200		N = 500		N = 1,000
Model	DIC	%	DIC	%	DIC	%
1. Unidimensional^a	11,813	0	29,343	0	58,627	0
2. Two-dimensional	11,814	0	29,343	0	58,627	0
3. Six-dimensional	11,827	0	29,355	0	58,632	0
4. Bifactor	11,749	100	29,238	99	58,477	88
5. Two-tier	11,752	0	29,243	1	58,485	5
6. Two-tier (alternative)	11,751	0	29,244	0	58,484	6
Model	LPML	%	LPML	%	LPML	%
1. Unidimensional^a	−5,915	97	−14,681	100	−29,327	97
2. Two-dimensional	−5,916	2	−14,682	0	−29,327	1
3. Six-dimensional	−5,932	0	−14,698	0	−29,341	1
4. Bifactor	−5,920	1	−14,687	0	−29,332	1
5. Two-tier	−5,922	0	−14,687	0	−29,333	0
6. Two-tier (alternative)	−5,922	0	−14,687	0	−29,333	0
Model	WAIC	%	WAIC	%	WAIC	%
1. Unidimensional^a	11,825	98	29,356	100	58,645	96
2. Two-dimensional	11,828	0	29,358	0	58,646	1
3. Six-dimensional	11,850	0	29,381	0	58,664	0
4. Bifactor	11,828	1	29,358	0	58,647	1
5. Two-tier	11,830	0	29,360	0	58,648	0
6. Two-tier (alternative)	11,830	0	29,360	0	58,648	0
Model	PSIS-LOO	%	PSIS-LOO	%	PSIS-LOO	%
1. Unidimensional^a	−5,915	100	−14,681	100	−29,326	100
2. Two-dimensional	−5,916	0	−14,682	0	−29,327	0
3. Six-dimensional	−5,932	0	−14,698	0	−29,341	0
4. Bifactor	−5,920	0	−14,686	0	−29,332	0
5. Two-tier	−5,921	0	−14,687	0	−29,333	0
6. Two-tier (alternative)	−5,921	0	−14,687	0	−29,333	0

^a The unidimensional IRT model (Model 1) was the data generation model.

Table 2.

Model Comparison Results From the Two-Dimensional Condition.

Model	N = 200		N = 500		N = 1,000
Model	DIC	%	DIC	%	DIC	%
1. Unidimensional	13,397	0	33,319	0	66,564	0
2. Two-dimensional^a	12,129	0	30,209	0	60,338	0
3. Six-dimensional	12,136	0	30,214	0	60,333	0
4. Bifactor	12,377	0	30,884	0	61,717	0
5. Two-tier	12,055	100	30,094	100	60,170	100
6. Two-tier (alternative)	12,214	0	30,490	0	60,953	0
Model	LPML	%	LPML	%	LPML	%
1. Unidimensional	−6,708	0	−16,669	0	−33,294	0
2. Two-dimensional^a	−6,082	98	−15,129	100	−30,207	100
3. Six-dimensional	−6,096	0	−15,143	0	−30,219	0
4. Bifactor	−6,260	0	−15,563	0	−31,055	0
5. Two-tier	−6,087	2	−15,135	0	−30,213	0
6. Two-tier (alternative)	−6,177	0	−15,355	0	−30,648	0
Model	WAIC	%	WAIC	%	WAIC	%
1. Unidimensional	13,412	0	33,334	0	66,582	0
2. Two-dimensional^a	12,149	95	30,234	94	60,375	90
3. Six-dimensional	12,164	0	30,249	0	60,382	0
4. Bifactor	12,447	0	30,989	0	61,871	0
5. Two-tier	12,148	2	30,234	3	60,371	4
6. Two-tier (alternative)	12,303	0	30,622	0	61,142	0
Model	PSIS-LOO	%	PSIS-LOO	%	PSIS-LOO	%
1. Unidimensional	−6,708	0	−16,669	0	−33,294	0
2. Two-dimensional^a	−6,081	100	−15,129	100	−30,207	100
3. Six-dimensional	−6,095	0	−15,142	0	−30,218	0
4. Bifactor	−6,257	0	−15,558	0	−31,049	0
5. Two-tier	−6,086	0	−15,135	0	−30,213	0
6. Two-tier (alternative)	−6,175	0	−15,352	0	−30,644	0

^a The two-dimensional IRT model (Model 2) was the data generation model.

Table 3.

Model Comparison Results From the Bifactor Condition.

Model	N = 200		N = 500		N = 1,000
Model	DIC	%	DIC	%	DIC	%
1. Unidimensional	12,851	0	31,913	0	63,745	0
2. Two-dimensional	12,715	0	31,584	0	63,095	0
3. Six-dimensional	11,897	0	29,576	0	59,080	0
4. Bifactor^a	11,676	59	29,032	49	58,028	33
5. Two-tier	11,671	9	29,025	6	58,020	10
6. Two-tier (alternative)	11,669	13	29,024	12	58,016	23
Model	LPML	%	LPML	%	LPML	%
1. Unidimensional	−6,434	0	−15,966	0	−31,886	0
2. Two-dimensional	−6,371	0	−15,811	0	−31,577	0
3. Six-dimensional	−6,008	0	−14,895	0	−29,726	0
4. Bifactor^a	−5,928	90	−14,671	80	−29,272	54
5. Two-tier	−5,928	2	−14,671	3	−29,270	9
6. Two-tier (alternative)	−5,928	3	−14,671	5	−29,269	12
Model	WAIC	%	WAIC	%	WAIC	%
1. Unidimensional	12,863	0	31,927	0	63,764	0
2. Two-dimensional	12,733	0	31,607	0	63,130	0
3. Six-dimensional	11,930	0	29,624	0	59,155	0
4. Bifactor^a	11,740	82	29,114	66	58,143	61
5. Two-tier	11,739	2	29,109	4	58,136	2
6. Two-tier (alternative)	11,738	3	29,109	7	58,134	4
Model	PSIS-LOO	%	PSIS-LOO	%	PSIS-LOO	%
1. Unidimensional	−6,434	0	−15,966	0	−31,886	0
2. Two-dimensional	−6,371	0	−15,811	0	−31,577	0
3. Six-dimensional	−6,005	0	−14,890	0	−29,719	0
4. Bifactor^a	−5,922	93	−14,663	92	−29,260	88
5. Two-tier	−5,922	2	−14,662	0	−29,258	2
6. Two-tier (alternative)	−5,922	1	−14,662	1	−29,257	0

Note. The two-tier IRT model with the alternative specification (Model 6) was the version in which 20 and 10 items discriminated on the primary dimensions. Within a predictive performance index section, each value in the “%” column indicates the percentage of times (across the 100 data replicates) the corresponding model was favored over all other models. For example, in the $N = 1{,}000$ condition, when using the PSIS-LOO to compare models, the bifactor IRT model (Model 4) was favored over all other models 88% of the time, and the two-tier IRT model (Model 5) was favored over all other models 2% of the time. No one model was favored over all other models 10% of the time; this occurred when both two-tier IRT models (Models 5 and 6) outperformed the other models but demonstrated equivalent performance to each other. Because both of these models had similar complexity, neither model was favored over the other, leading to no one model being favored over all other models for that data replicate. DIC = deviance information criterion; LPML = log-predicted marginal likelihood; WAIC = Watanabe-Akaike information criterion; PSIS-LOO = leave-one-out cross-validation approximation based on Pareto-smoothed importance sampling.

^a The bifactor IRT model (Model 4) was the data generation model.

Table 4.

Model Comparison Results From the Two-Tier Condition.

Model	N = 200		N = 500		N = 1,000
Model	DIC	%	DIC	%	DIC	%
1. Unidimensional	14,147	0	35,182	0	70,283	0
2. Two-dimensional	13,061	0	32,500	0	64,901	0
3. Six-dimensional	12,248	0	30,477	0	60,863	0
4. Bifactor	12,183	0	30,371	0	60,690	0
5. Two-tier^a	12,016	100	29,966	100	59,907	100
6. Two-tier (alternative)	12,087	0	30,143	0	60,263	0
Model	LPML	%	LPML	%	LPML	%
1. Unidimensional	−7,082	0	−17,601	0	−35,153	0
2. Two-dimensional	−6,547	0	−16,273	0	−32,485	0
3. Six-dimensional	−6,188	0	−15,357	0	−30,640	0
4. Bifactor	−6,192	0	−15,351	0	−30,615	0
5. Two-tier^a	−6,119	100	−15,168	100	−30,252	100
6. Two-tier (alternative)	−6,158	0	−15,264	0	−30,443	0
Model	WAIC	%	WAIC	%	WAIC	%
1. Unidimensional	14,161	0	35,197	0	70,300	0
2. Two-dimensional	13,078	0	32,521	0	64,930	0
3. Six-dimensional	12,277	0	30,516	0	60,920	0
4. Bifactor	12,256	0	30,458	0	60,798	0
5. Two-tier^a	12,101	93	30,076	100	60,049	100
6. Two-tier (alternative)	12,170	0	30,245	0	60,390	0
Model	PSIS-LOO	%	PSIS-LOO	%	PSIS-LOO	%
1. Unidimensional	−7,082	0	−17,601	0	−35,153	0
2. Two-dimensional	−6,546	0	−16,273	0	−32,485	0
3. Six-dimensional	−6,184	0	−15,351	0	−30,631	0
4. Bifactor	−6,186	0	−15,343	0	−30,603	0
5. Two-tier^a	−6,112	99	−15,159	100	−30,238	100
6. Two-tier (alternative)	−6,151	0	−15,253	0	−30,428	0

^a The two-tier IRT model (Model 5) was the data generation model.

The Unidimensional Condition

The model comparison results from the unidimensional condition are in Table 1. Overall, the LPML, PSIS-LOO, and WAIC outperformed the DIC. Across the sample sizes, the PSIS-LOO favored the data generation model 100% of the time, and the WAIC and LPML favored the correct model at least 96% of the time.

In contrast, when using the DIC, the unidimensional model was never favored over all other models. Instead, the DIC outright favored the bifactor model at least 99% of the time for the sample sizes of 200 and 500. For the sample size of 1,000, the DIC was still biased toward nested-dimensionality models: it favored the bifactor model less frequently than it did in the other sample size conditions (i.e., 88% of the time), but it favored one of the two-tier models more often (i.e., 11% of the time). Favoring the nested-dimensionality models would suggest that the data reflected secondary dimensions nested within at least one primary dimension. That the two-tier models were favored over Model 1 in some instances was not surprising because of the relationship between the two-tier and the bifactor models. As noted earlier, a bifactor model could be viewed as a two-tier model with the correlations among the primary dimensions (i.e., the dimensions in the first tier of the two-tier structure) fixed to 1. Thus, even though the primary dimensional portion of the nested-dimensionality structure consisted of two dimensions for the two-tier models, the correlation between the primary dimensions for each of these models was at least .993 across the 100 data replicates. In other words, both versions of the two-tier models were nearly reduced to a bifactor model, indicating that the data represented a bifactor structure in those instances in which a two-tier model was favored.

Although the DIC was biased toward nested-dimensionality models, it fared better when focusing on only the non-nested-dimensionality models and looking at specific pairwise model comparisons rather than the comparisons reported in Table 1, which are across all the models. The DIC correctly favored the unidimensional model (Model 1) over the two-dimensional IRT model (Model 2) at least 97% of the time. However, the DIC’s performance was less consistent across the sample sizes when selecting between Model 1 and the six-dimensional IRT model (Model 3); Model 1 was correctly favored over Model 3 at least 96% and 99% of the time for the sample sizes of 200 and 500, respectively, but only 89% of the time for the sample size of 1,000.

The Two-Dimensional Condition

The results from the two-dimensional condition are summarized in Table 2. The results show that the LPML, PSIS-LOO, and WAIC outperformed the DIC. However, among the first three indices, the LPML and PSIS-LOO performed slightly better than the WAIC relative to what was observed in the unidimensional condition. In the two-dimensional condition, the data generation model (Model 2) was correctly favored over the other models at least 98% of the time based on the LPML and 100% of the time based on the PSIS-LOO. The WAIC correctly favored Model 2 over the other models at least 90% of the time but no greater than 95% of the time, as observed across the sample sizes.

Regarding the DIC, it again displayed a bias toward nested-dimensionality models. The DIC favored one of the two-tier models over the non-two-tier models 100% of the time, as observed for all sample sizes, with the favored two-tier IRT model (Model 5) being the version in which the primary dimensional structure matched that of the data generation structure. The DIC never favored the bifactor model (Model 4) or the alternative two-tier model (Model 6) in which the primary dimensional portion did not match that of the data generation two-dimensional structure. In fact, when the data generation two-dimensional model (Model 2) was directly compared with the bifactor and the alternative two-tier models, the two-dimensional model was favored 100% of the time, as suggested by the average of the DIC for Model 2 being less than those for the bifactor and the alternative two-tier models within each sample size condition (e.g., for $N = 1{,}000$, the DICs were 60,338, 61,717, and 60,953 for Models 2, 4, and 6, respectively). This pattern among the two-dimensional and the nested-dimensionality models suggests that nested-dimensionality models are not automatically favored over non-nested-dimensionality models based on the DIC. Within the study conditions, the DIC incorrectly favored the nested-dimensionality model only when the dimensional structures of the models shared similar features, such as the primary dimensional portion of the two-tier structure (Model 5) matching that of the data generation condition (Model 2).

The DIC also demonstrated some bias toward more complex non-nested-dimensionality models. When focusing on just the unidimensional (Model 1), two-dimensional (Model 2), and six-dimensional (Model 3) models, the DIC correctly favored Model 2 over Model 1 100% of the time. However, the DIC’s ability to correctly favor Model 2 over Model 3 varied depending on the sample size; it favored Model 2 at least 92% of the time when $N = 200$ but only 63% of the time when $N = 1{,}000$, suggesting that the DIC had a slight tendency to favor the more complex model when it was used to compare among only the non-nested-dimensionality models, as observed in the largest sample size condition.

The Bifactor Condition

The results from the bifactor condition are summarized in Table 3. The four indices never favored one of the non-nested-dimensionality models. However, they varied in how much they favored the data generation bifactor model (Model 4).

Among the indices, the PSIS-LOO was the most accurate, favoring the bifactor model over all other models 88% of the time for $N = 1{,}000$ and approximately 92% of the time for the other sample sizes. The DIC was the worst-performing index, favoring the bifactor model over all other models 33% to 59% of the time, as observed across the sample sizes. The performances of the LPML and the WAIC fell between those of the other two indices, with these indices favoring the bifactor model 54% to 90% of the time.

In the instances where the bifactor model was not the most favored, one or both of the two-tier models were favored. Table 3 indicates when only one of the two-tier models was favored over all other models. For example, when $N = 200$, the PSIS-LOO favored Model 5 over all other models, including the other two-tier model (Model 6), 2% of the time and favored Model 6 over all other models 1% of the time. The percentage of times both of the two-tier models were favored over the remaining models is not directly noted in the table. However, recall that the two-tier models have the same number of parameters. Thus, when the difference in predictive performance between these two models did not exceed the criteria for favoring a model, no single model was considered favored. The percentage remaining after the reported values are subtracted from 100, then, represents the percentage of times both two-tier models were favored over the other models. For example, when $N = 200$, under the PSIS-LOO, both two-tier models demonstrated equivalent performance to each other while still outperforming the other models 4% of the time.
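The remaining percentages can be recovered directly from the PSIS-LOO rows of Table 3, in which Models 1 to 3 were never favored. The short sketch below does this arithmetic for each sample size; the variable and function names are hypothetical.

```python
# PSIS-LOO "%" values from Table 3: (bifactor, two-tier, two-tier alternative).
table3_psis_loo = {200: (93, 2, 1), 500: (92, 0, 1), 1000: (88, 2, 0)}


def percent_both_two_tier(single_model_percents):
    """Percentage of replicates in which no single model was favored."""
    return 100 - sum(single_model_percents)


remaining = {n: percent_both_two_tier(p) for n, p in table3_psis_loo.items()}
print(remaining)  # {200: 4, 500: 7, 1000: 10}
```

The values for $N = 200$ and $N = 1{,}000$ match the 4% and 10% stated in the text and in the note to Table 3.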

Regardless of whether one or both versions of the two-tier model were favored over the other models, the averages of the posterior means of the correlation between the two primary dimensions (averaged across the 100 data replicates) were at least .96 under Models 5 and 6, suggesting that the two-tier models were nearly reduced to a bifactor model. In this situation, then, the indices incorrectly favoring the two-tier models over the bifactor model are not that misleading because the two-tier models became a bifactor model, thereby leading to the same conclusion about the dimensional structure represented in the data as that based on the true model.

The Two-Tier Condition

The results from the two-tier condition are summarized in Table 4. In this condition, the data were generated to represent a two-tier structure that matched that of Model 5, and all the predictive performance indices favored this model over the other models—including the alternative two-tier IRT model—nearly 100% of the time, if not 100% of the time. The one exception to this was when Model 5 was favored only 93% of the time by the WAIC for $N = 200$ . In the instances where Model 5 was not outright favored, it was considered to be displaying an equivalent performance to the alternative two-tier model (Model 6), although these instances were only for $N = 200$ . Nevertheless, a model with secondary dimensions nested within multiple primary dimensions would be selected.

Discussion

Performing model comparisons to evaluate the dimensionality of item response data can be challenging because such comparisons can be biased toward nested-dimensionality IRT models. One reason for the possible bias is that the fit propensity of nested-dimensionality models may be greater than that of non-nested-dimensionality models (e.g., Bonifay & Cai, 2017). However, it is unclear just how much these models’ greater fit propensity can bias model comparison results when working with data representing certain dimensional structures—a situation closer to those seen in practice. The lack of clarity is in part because previous work has not thoroughly evaluated the ability of Bayesian model selection indices to appropriately make adjustments for model complexity and choose the appropriate IRT model from a set of competing models—especially when the set includes non-nested- and nested-dimensionality IRT models.

Our study provides some insight into this issue: whether model comparisons are biased could depend on the predictive performance index one uses. We demonstrated that the DIC was severely biased toward nested-dimensionality models (at least within our study conditions). Conversely, the PSIS-LOO was the most accurate at identifying the correct model. It correctly favored the non-nested-dimensionality models in the appropriate conditions 100% of the time. The PSIS-LOO was also mostly accurate at identifying the correct model among a set of nested-dimensionality models; it had a slight tendency to favor the two-tier model in the bifactor condition, although the two-tier model was reduced to a bifactor model, which in turn led to the same conclusion about the dimensional structure being represented in the data as that based on a bifactor model.

Although the PSIS-LOO’s ability to identify the correct model was the greatest among the indices we tested, it was also the most computationally intensive to calculate, which could be a drawback in practice. Thus, the LPML, which was the next best-performing index, could be an alternative to the PSIS-LOO, with the LPML balancing model selection accuracy and ease of calculation. The LPML was nearly perfect at identifying the correct model in every condition except the bifactor condition, where it incorrectly favored the two-tier models more frequently than the PSIS-LOO did. The WAIC’s performance, in general, was similar to the LPML’s. Thus, the WAIC could also be an alternative to the PSIS-LOO, although the LPML is simpler to calculate.

In contrast, the DIC consistently favored the nested-dimensionality models, even when the non-nested-dimensionality models were the true models. In other words, among the indices we examined, the DIC is the most likely to be biased toward nested-dimensionality models. Our findings related to the DIC are consistent with other studies that demonstrated the DIC’s tendency to favor more complex models (e.g., Celeux et al., 2006; da Silva et al., 2019; Li et al., 2006; Plummer, 2008). However, we showed that a certain condition may be required for this tendency to appear in a multidimensional IRT setting: for a nested-dimensionality model to be incorrectly favored over a non-nested-dimensionality model, the dimensional structures the models represent may have to share some features. For example, the primary dimensional portion of the structure specified for the nested-dimensionality model may need to match the structure specified for the non-nested-dimensionality model. This was observed in the two-dimensional data generation condition, where only one of the nested-dimensionality models was favored over the data generation model. The favored model was the two-tier model (Model 5) in which the primary dimensional portion of the full dimensional structure matched that of the two-dimensional data generation model. The DIC did not favor the alternative two-tier model (Model 6) or the bifactor model (Model 4) over the two-dimensional model.

Our simulation study, then, demonstrates that the greater fit propensity of nested-dimensionality models does not necessarily lead them to be favored over non-nested-dimensionality models when the latter models are the true models, although which predictive performance index is used matters. Our conclusions, however, should be considered within the context of the study’s limitations. First, we investigated the performance of these indices in four different dimensional contexts, which were selected to determine whether nested-dimensionality IRT models were always favored over non-nested-dimensionality models when the latter models were the true models. There are many types of dimensional structures, but only a limited number can be considered in a single study. Future studies should investigate the accuracy of these indices with other types of nested-dimensionality structures. Second, we fit models that could be applied to all conditions, so we did not fit a simpler bifactor model within which the two-dimensional model was nested (e.g., a version with two secondary dimensions nested within a single primary dimension; Asparouhov & Muthén, 2019). Based on a brief follow-up simulation we conducted, such a model would not add more insight into how these indices are biased toward nested-dimensionality models, given our goal—to determine whether Bayesian predictive performance indices tend to be biased toward nested-dimensionality models. However, future studies should investigate other instances where simpler, nested-dimensionality models are included in more applicable dimensional conditions to understand further how these models’ complexity and their relationship to non-nested-dimensionality models affect the performance of these indices.

The third limitation is that we compared the $z_{\text{PSIS-LOO}}$ and the $z_{WAIC}$ to 1.96 to determine which model to favor. This criterion was effective for our study. However, the appropriateness of this threshold should be investigated further in conditions beyond those explored in this study. Fourth, we only applied the models to data representing 30 items. We selected this number so a sufficient number of items discriminated on the secondary dimensions for the nested-dimensionality models, ensuring that all the parameters related to the secondary dimensions could be estimated. Whether the pattern of the findings observed in this study holds with a different number of items should be investigated. Fifth, we only used prior distributions that have been demonstrated to be suitable for highly multidimensional IRT models and data representing smaller samples, conditions similar to our study. There are many other prior distributions that could be used, but using less informative priors with nested-dimensionality models has been shown to produce inaccurate parameter estimates with sample sizes of 500 or fewer (Fujimoto & Neugebauer, 2020). Thus, we did not explore other priors, as educational and psychological researchers frequently conduct studies on a smaller scale, and we wanted to demonstrate that it is possible to differentiate between non-nested- and nested-dimensionality models in similar situations. Whether these trends hold with other priors should be investigated, although using less informative priors may require larger sample sizes. Sixth, we only investigated model selection indices based on conditional likelihood, as estimating the marginal likelihood is still extremely challenging for high-dimensional models. As more advancements are made in obtaining the marginal likelihood for high-dimensional models (e.g., Merkle et al., 2019), making it more commonly implemented in practice, whether marginal likelihood versions of these indices are biased toward nested-dimensionality models should be investigated.

Last but not least, we relied on simulated data generated to clearly represent certain dimensional structures. In practice, the data may not clearly reflect one of the dimensional structures represented by the models being compared, such as in Greene et al. (2019). However, if these indices do not perform optimally in ideal situations in which the correct answer is known, then these indices cannot be trusted in less-than-optimal conditions. Thus, we concentrated on establishing how well these indices functioned in ideal situations, with the models being distinctly different for the most part. However, future research should investigate the performance of these indices in less optimal situations.

Even with these limitations, our study challenges the notion in the broader literature that model selection results are automatically biased toward models with greater fit propensity, although the predictive performance index one uses matters. Given our study conditions, the DIC is most likely to lead to the pitfall of favoring nested-dimensionality models over non-nested-dimensionality models when the latter are more appropriate for the data, thereby misguiding researchers to conclude that more complex dimensional structures are represented in their data. However, such a misstep may be avoidable with the PSIS-LOO.

Supplemental Material

Supplemental material, sj-pdf-1-epm-10.1177_00131644231165520 for The Accuracy of Bayesian Model Fit Indices in Selecting Among Multidimensional Item Response Theory Models by Ken A. Fujimoto and Carl F. Falk in Educational and Psychological Measurement

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

References

Adams, R. J., Wilson, & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.

Akaike (1998). Information theory and an extension of the maximum likelihood principle. In Parzen, Tanabe, & Kitagawa (Eds.), Selected papers of Hirotugu Akaike (pp. 199–213). New York: Springer.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Asparouhov & Muthén (2019). Nesting and equivalence testing for structural equation models. Structural Equation Modeling: A Multidisciplinary Journal, 26(2), 302–309.

Bonifay & Cai (2017). On the complexity of item response theory models. Multivariate Behavioral Research, 52(4), 465–484.

Bonifay, Lane, S. P., & Reise, S. P. (2017). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186.

Cai (2010). A two-tier full-information item factor analysis model with applications. Psychometrika, 75(4), 581–612.

Canivez, G. L. (2016). Bifactor modeling in construct validation of multifactored tests: Implications for understanding multidimensional constructs and test interpretation. In Schweizer & DiStefano (Eds.), Principles and methods of test construction: Standards and recent advancements (pp. 247–271). Hogrefe.

Celeux, Forbes, Robert, C. P., & Titterington, D. M. (2006). Deviance information criteria for missing data models. Bayesian Analysis, 1(4), 651–673.

da Silva, M. A., Bazán, J. L., & Huggins-Manley, A. C. (2019). Sensitivity analysis and choosing between alternative polytomous IRT models using Bayesian model comparison criteria. Communications in Statistics-Simulation and Computation, 48(2), 601–620.

DeMars, C. E. (2013). A tutorial on interpreting bifactor model scores. International Journal of Testing, 13(4), 354–378.

Falk, C. F., & Muthukrishna (2021). Parsimony in model selection: Tools for assessing fit propensity. Psychological Methods, 28(1), 123–136.

Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. Springer.

Fujimoto, K. A. (2018). A general Bayesian multilevel multidimensional item response theory model for locally dependent data. British Journal of Mathematical and Statistical Psychology, 71(3), 536–560.

Fujimoto, K. A. (2019). The Bayesian multilevel trifactor item response theory model. Educational and Psychological Measurement, 79(3), 462–494.

Fujimoto, K. A. (2020). A more flexible multilevel bifactor item response theory model. Journal of Educational Measurement, 57(2), 255–285.

Fujimoto, K. A., & Neugebauer, S. R. (2020). A general Bayesian multidimensional item response theory model for small and large samples. Educational and Psychological Measurement, 80(4), 665–669.

Gelfand, A. E. (1996). Model determination using sampling-based methods. In Gilks, W. R., Richardson, & Spiegelhalter, D. J. (Eds.), Markov chain Monte Carlo in practice (pp. 145–161). Chapman and Hall/CRC Press.

Gelman, A. E., Hwang, & Vehtari (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.

Geyer (2011). Introduction to MCMC. In Brooks, Gelman, Jones, & Meng (Eds.), Handbook of Markov Chain Monte Carlo (pp. 3–48). Chapman and Hall/CRC Press.

Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57(3), 423–436.

Greene, A. L., Eaton, N. R., Forbes, M. K., Krueger, R. F., Markon, K. E., . . . Kotov (2019). Are fit indices used to test psychopathology structure biased? A simulation study. Journal of Abnormal Psychology, 128(7), 740–764.

Hansen, Cai, Stucky, B. D., Tucker, J. S., Shadel, W. G., & Edelen, M. O. (2014). Methodology for developing and evaluating the PROMIS® smoking item banks. Nicotine & Tobacco Research, 16(Suppl. 3), S175–S189.

Holzinger, K. J., & Swineford (1937). The bi-factor method. Psychometrika, 2(1), 41–54.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795.

Bolt, D. M. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3–21.

Luo & Al-Harbi (2017). Performances of LOO and WAIC as IRT model selection methods. Psychological Test and Assessment Modeling, 59(2), 183–205.

Markon, K. E. (2019). Bifactor and hierarchical models: Specification, inference, and interpretation. Annual Review of Clinical Psychology, 15(1), 51–69.

Merkle, E. C., Furr, & Rabe-Hesketh (2019). Bayesian comparison of latent variable models: Conditional versus marginal likelihoods. Psychometrika, 84, 802–829.

MRC Biostatistics Unit. (n.d.). DIC: Deviance information criteria. https://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-dic/

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159–176.

Plummer (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3), 523–539.

Preacher, K. J. (2006). Quantifying parsimony in structural equation modeling. Multivariate Behavioral Research, 41(3), 227–259.

Preacher, K. J., & Merkle, E. C. (2012). The problem of model selection uncertainty in structural equation modeling. Psychological Methods, 17(1), 1–14.

Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696.

Schwarz (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.

Sellbom & Tellegen (2019). Factor analysis in psychological assessment research: Common pitfalls and recommendations. Psychological Assessment, 31(12), 1428–1441.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.

Stucky, B. D., & Edelen, M. O. (2015). Handbook of item response theory modeling: Applications to typical performance assessment. In Reise, S. P., & Revicki, D. A. (Eds.), Markov chain Monte Carlo in practice (pp. 183–206). Routledge/Taylor & Francis Group.

Toland, M. D., Sulis, Giambona, Porcu, & Campbell, J. M. (2017). Introduction to bifactor polytomous item response theory analysis. Journal of School Psychology, 60, 41–63.

Vehtari, Gelman, & Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.

Vehtari, Simpson, Gelman, Yao, & Gabry (2022). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646.

Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17(2), 228–243.

Watanabe (2013). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14, 867–897.

Zhu & Stone, C. A. (2012). Bayesian comparison of alternative graded response models for performance assessment applications. Educational and Psychological Measurement, 72(5), 774–799.
