A Note on Standard Errors for Multidimensional Two-Parameter Logistic Models Using Gaussian Variational Estimation

Abstract

Accurate item parameters and standard errors (SEs) are crucial for many multidimensional item response theory (MIRT) applications. A recent study proposed the Gaussian Variational Expectation Maximization (GVEM) algorithm to improve computational efficiency and estimation accuracy (Cho et al., 2021). However, the SE estimation procedure has yet to be fully addressed. To tackle this issue, the present study proposed an updated supplemented expectation maximization (USEM) method and a bootstrap method for SE estimation. These two methods were compared in terms of SE recovery accuracy. The simulation results demonstrated that the GVEM algorithm with bootstrap and item priors (GVEM-BSP) outperformed the other methods, exhibiting less bias and relative bias for SE estimates under most conditions. Although the GVEM with USEM (GVEM-USEM) was the most computationally efficient method, it yielded an upward bias for SE estimates.

Keywords

multidimensional item response theory, standard error, Gaussian variational EM, supplemented EM, bootstrap sampling

Introduction

Recently, multidimensional item response theory (MIRT) models have gained popularity for handling complex educational data, mostly because they can simultaneously estimate multiple, possibly correlated, latent constructs. For example, several studies have shown that, compared to unidimensional IRT, MIRT can substantially improve the accuracy of parameter estimates when the test measures several highly correlated latent constructs with a small number of items (de la Torre & Patz, 2005; Wang et al., 2004).

For MIRT, accurate estimates of item parameters, as well as their standard errors (SEs), are important prerequisites for many applications (Chen & Wang, 2021; Wang & Zhang, 2019), including but not limited to multidimensional computerized adaptive testing (Yao, 2012), item parameter calibration (Chen, 2017), limited-information goodness-of-fit testing (Cai et al., 2006), as well as differential item functioning (Woods et al., 2013). However, estimating item parameters for high-dimensional MIRT models can be computationally challenging due to the intractable integrals in the likelihood function (Andersson & Xin, 2021). Several solutions have been proposed in the literature. For instance, the adaptive Gaussian quadrature method (Cagnone & Monari, 2013) and the Laplace approximation (Lindstrom & Bates, 1988) conduct direct numerical approximations to the integrals. Unfortunately, the former is known to be computationally demanding in high dimensions, whereas the latter, though being computationally efficient, becomes less accurate especially when there are only a few dichotomous items measuring each latent trait (Joe, 2008). Other approaches are based on stochastic approximation, such as the Metropolis–Hastings Robbins–Monro (MH-RM) method (Cai, 2010a, 2010b). This method has been implemented in several software programs (Bashkov & DeMars, 2017), but again, it could be computationally intensive since the procedure requires multiple sampling from a posterior distribution.

Recently, Cho et al. (2021) proposed an alternative algorithm, namely, the Gaussian variational expectation maximization (GVEM) algorithm, to further improve computational efficiency and estimation accuracy. To greatly reduce the computational complexity, the GVEM bypasses calculating intractable integrals by approximating the marginal likelihood with a more tractable variational lower bound, such that both the integration in E-step and solution to the score function in the M-step involve analytic closed forms. While Cho et al. (2021) show the GVEM outperforms the MH-RM regarding item parameter recovery under different conditions, their algorithm does not produce standard error (SE) estimates. Since SE is a key element for statistical inference, our study aims to fill this gap by proposing two SE estimation procedures under the GVEM framework: the updated version of supplemented EM (USEM; Tian et al., 2013) method and the bootstrap method.

So far, studies on SE estimation are scarce (Paek & Cai, 2014), especially for MIRT models. The gold standard approach to estimating SE is to take the square root of the diagonal entries of expected Fisher’s information matrix (FIS), computed by the second-order derivatives of the log-likelihood function. However, a key drawback of this method is that the computation cost increases exponentially as the test length increases (Paek & Cai, 2014; Wang & Zhang, 2019). Alternative approaches include the empirical cross-product approach (XPD) and supplemented EM (SEM; Cai, 2008; Meng & Rubin, 1991). XPD defines the information matrix as the expectation of the cross-product of the first derivative of the log-likelihood function. Although XPD is computationally most efficient, it could exhibit some upward bias when the test length is long and the sample size is relatively small (Paek & Cai, 2014). Alternatively, prior research suggests a robust SE estimation using a sandwich-type variance-covariance matrix, which is a weighted combination of XPD and FIS (Yuan et al., 2014). However, it is challenging to compute the observed information matrix, which is required in the sandwich covariance matrix. To address this issue, the SEM algorithm estimates SE based on the complete-data information matrix, which is obtained as a by-product from the EM algorithm. Compared with other approaches, SEM is easy to implement and produces accurate SE under different conditions (Tian et al., 2013). Recently, Tian et al. (2013) proposed an updated version of SEM (USEM) which is computationally more efficient than SEM and still maintains its accuracy. Due to these advantages, we aim to integrate the USEM into the GVEM framework to generate SE of MIRT item parameters in a fast and precise manner.

The aforementioned methods require computing the first or second derivatives of the likelihood function; however, in some cases, the likelihood is analytically intractable or challenging to obtain. In such cases, the bootstrap method is an alternative for estimating SE, and it has been applied in a wide range of statistical models (e.g., Efron & Tibshirani, 1986; Gonçalves & White, 2005). In the educational measurement field, the bootstrap approach is commonly used for estimating SEs of IRT equating methods (e.g., Liu et al., 2008; Tsai et al., 2001; Zhang & Zhao, 2019) and for providing accurate SEs for person parameter estimates (e.g., Fitzpatrick & Yen, 2001; Liou & Yu, 1991; Patton et al., 2014). Its application to SE estimation of item parameters in MIRT, however, is rarely discussed. Although the bootstrap is in general computationally demanding due to resampling, by integrating it into the GVEM framework, we aim to keep the computational cost manageable.

The current study complements the work by Cho et al. (2021) by focusing on the SE estimation procedure under the GVEM framework. Two estimation methods are proposed: USEM and bootstrap. Their performance is evaluated via simulation studies. The rest of the article is organized as follows. We begin with an overview of current SE methods. Then, the general framework of the GVEM algorithm for MIRT models is introduced, followed by a modified version of the GVEM algorithm. Then the USEM as well as the bootstrap procedure for SE estimates within the GVEM framework are described. Next, simulation studies are presented to illustrate the performance of two proposed procedures, followed by a discussion of the results summary and directions for future studies.

An Overview of SE Methods

In this section, we will provide a brief review of three SE methods: FIS, XPD, and sandwich covariance matrix. The USEM and bootstrap methods will be introduced in subsequent sections. The exposition will be based on one of the most widely used MIRT models, the multidimensional two-parameter logistic model (M2PL). Suppose N examinees respond to J items, resulting in a binary response matrix Y = {Y_i, i = 1, …, N}, where Y_i = {Y_ij, j = 1, …, J} refers to the ith examinee’s response vector. The item response function of the ith respondent to the jth item is

P (Y_{i j} = 1 ∣ θ_{i}) = \frac{\exp (α_{j}^{T} θ_{i} - b_{j})}{1 + \exp (α_{j}^{T} θ_{i} - b_{j})},

(1)

where θ_i denotes the K-dimensional vector of latent abilities for examinee i, α_j denotes the K-dimensional vector of item discrimination parameters for the jth item (α_j = {α_jk, k = 1, …, K}), and b_j denotes the corresponding item difficulty parameter. For conciseness, let M denote all model parameters, where M = {A, B}, A = {α_j, j = 1, …, J}, and B = {b_j, j = 1, …, J}. Under the local independence assumption in IRT, the marginal log-likelihood function can be expressed as

l (M; Y) = \sum_{i = 1}^{N} \log P (Y_{i} ∣ M) = \sum_{i = 1}^{N} \log \int \prod_{j = 1}^{J} P (Y_{i j} ∣ θ_{i}, M) ϕ (θ_{i}) d θ_{i},

(2)

where ϕ denotes the multivariate normal density function of θ _i, with mean 0 and covariance Σ_θ.
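As a concrete illustration of Equation (1) and the data-generating process behind Equation (2), the following minimal NumPy sketch computes M2PL response probabilities and simulates a binary response matrix. The function names (`m2pl_prob`, `simulate_m2pl`) and all numerical values are illustrative assumptions, not part of any published implementation.

```python
import numpy as np

def m2pl_prob(theta, A, b):
    """P(Y_ij = 1 | theta_i) under Equation (1).

    theta: (N, K) abilities, A: (J, K) discriminations, b: (J,) difficulties.
    """
    logits = theta @ A.T - b              # alpha_j^T theta_i - b_j, shape (N, J)
    return 1.0 / (1.0 + np.exp(-logits))

def simulate_m2pl(A, b, Sigma_theta, N, rng):
    """Draw theta_i ~ N(0, Sigma_theta), then binary responses Y of shape (N, J)."""
    K = A.shape[1]
    theta = rng.multivariate_normal(np.zeros(K), Sigma_theta, size=N)
    P = m2pl_prob(theta, A, b)
    Y = (rng.uniform(size=P.shape) < P).astype(int)
    return Y, theta

# Hypothetical item parameters for J = 3 items and K = 2 factors.
rng = np.random.default_rng(0)
A = np.array([[1.5, 0.0], [0.0, 1.2], [1.0, 1.0]])
b = np.array([0.0, -0.5, 0.5])
Sigma_theta = np.array([[1.0, 0.3], [0.3, 1.0]])
Y, theta = simulate_m2pl(A, b, Sigma_theta, N=1000, rng=rng)
```

The marginal log-likelihood in Equation (2) integrates `theta` out, which is the step that becomes expensive as K grows.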

The FIS approach computes the expected information matrix as the negative expectation of the Hessian matrix, that is, of the second-order derivatives of the log-likelihood function

I_{FIS} = - E (H (M)) = - E (\frac{\partial^{2}}{\partial M \partial M^{T}} l (M; Y))

(3)

The square roots of the diagonal elements of $I_{FIS}^{- 1}$ are the SE estimates for the model parameters. While the FIS is considered the gold standard, computing the Hessian matrix can be computationally demanding, especially as the number of items increases, because the expected Fisher information matrix involves a summation over all possible response patterns (Paek & Cai, 2014). Alternatively, XPD offers a more computationally efficient way to express the information matrix, using the cross-product of gradients

I_{XPD} = E [\frac{\partial l (M; Y)}{\partial M} \cdot \frac{\partial l (M; Y)}{\partial M^{T}}],

(4)

where $\frac{\partial l (M; Y)}{\partial M}$ denotes the gradient of the log-likelihood function. When employing the EM algorithm for estimation, Equation (4) is considerably more tractable than Equation (3) (Chalmers et al., 2017), because XPD involves only the first derivatives of the log-likelihood function, whereas FIS requires the Hessian matrix. In practice, XPD has also shown superior computational efficiency compared to other methods (Paek & Cai, 2014). Note that, under correct model specification, Equations (3) and (4) are equivalent and their sample-average-based (i.e., empirical) versions are asymptotically equivalent (Yuan et al., 2014). However, under misspecification, neither of the two methods provides consistent SE estimates (Falk & Monroe, 2018; White, 1982). Instead, the sandwich-type covariance matrix, which can be viewed as a weighted combination of the observed information and the cross-product, offers a robust estimate of the variance-covariance matrix (Yuan et al., 2014). The variance-covariance matrix V_SW can be expressed as follows

V_{SW} = I_{FIS}^{- 1} I_{XPD} I_{FIS}^{- 1}

(5)
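To make the relationship among Equations (3) to (5) concrete, the sketch below computes empirical (summed over observations) versions of the three quantities. It deliberately uses a plain logistic regression as a stand-in, because its per-observation gradients and Hessian are available in closed form; it is not the M2PL marginal likelihood, and the names `information_matrices`, `X`, `y`, and `beta_hat` are assumptions made for this example.

```python
import numpy as np

def information_matrices(X, y, beta):
    """Empirical FIS-, XPD-, and sandwich-type matrices for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    scores = (y - p)[:, None] * X                  # per-observation gradients
    I_xpd = scores.T @ scores                      # cross-product, cf. Equation (4)
    # Negative Hessian; for this model it does not depend on y,
    # so it coincides with its expectation, cf. Equation (3).
    I_fis = (X * (p * (1.0 - p))[:, None]).T @ X
    I_fis_inv = np.linalg.inv(I_fis)
    V_sw = I_fis_inv @ I_xpd @ I_fis_inv           # sandwich, cf. Equation (5)
    return I_fis, I_xpd, V_sw

# Simulate correctly specified data (illustrative values).
rng = np.random.default_rng(0)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ beta_true)))).astype(float)

# A few Newton-Raphson steps to reach the maximum likelihood estimate.
beta_hat = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    beta_hat = beta_hat + np.linalg.solve(hess, X.T @ (y - p))

I_fis, I_xpd, V_sw = information_matrices(X, y, beta_hat)
se_fis = np.sqrt(np.diag(np.linalg.inv(I_fis)))
se_xpd = np.sqrt(np.diag(np.linalg.inv(I_xpd)))
se_sw = np.sqrt(np.diag(V_sw))
```

Because the simulated data follow the fitted model, the three sets of SEs should be close to one another; they would be expected to diverge under misspecification.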

GVEM for MIRT

In this section, we will briefly introduce the key idea of the GVEM algorithm discussed in Cho et al. (2021). Maximizing this marginal likelihood function (i.e., Equation (2)) is often intractable under MIRT since it involves K-dimensional integrals. To solve this challenge, the variational approximation of Equation (2) could be employed. The derivation is shown as follows. First, the marginal log-likelihood in Equation (2) can be rewritten as

\begin{aligned} l (M; Y) & = \sum_{i = 1}^{N} \int_{θ_{i}} \log P (Y_{i} ∣ M) \times q_{i} (θ_{i}) d θ_{i} \\ & = \sum_{i = 1}^{N} \int_{θ_{i}} \log \frac{P (Y_{i}, θ_{i} ∣ M)}{P (θ_{i} ∣ Y_{i}, M)} \times q_{i} (θ_{i}) d θ_{i} \\ & = \sum_{i = 1}^{N} \left[ \int_{θ_{i}} \log \frac{P (Y_{i}, θ_{i} ∣ M)}{q_{i} (θ_{i})} \times q_{i} (θ_{i}) d θ_{i} + K L \{q_{i} (θ_{i}) ‖ P (θ_{i} ∣ Y_{i}, M)\} \right], \end{aligned}

where

K L \{q_{i} (θ_{i}) ‖ P (θ_{i} ∣ Y_{i}, M)\} = \int_{θ_{i}} \log \frac{q_{i} (θ_{i})}{P (θ_{i} ∣ Y_{i}, M)} \times q_{i} (θ_{i}) d θ_{i}

denotes the Kullback–Leibler (KL) divergence between an arbitrary probability density function q_i(θ_i) and the posterior distribution P(θ_i∣Y_i, M). Because KL{q_i(θ_i)‖P(θ_i∣Y_i, M)} ≥ 0, a lower bound of the marginal log-likelihood can be obtained as

l (M; Y) \geq \sum_{i = 1}^{N} \int_{θ_{i}} \log P (Y_{i}, θ_{i} ∣ M) \times q_{i} (θ_{i}) d θ_{i} - \sum_{i = 1}^{N} \int_{θ_{i}} \log q_{i} (θ_{i}) \times q_{i} (θ_{i}) d θ_{i} .

(6)

Since the equality in (6) holds if and only if q_i ( θ _i) = P( θ _i∣Y_i, M) for i = 1, …, N, the best choice of q_i( θ _i) is the posterior distribution P( θ _i∣Y_i, M). However, it is not practically applicable considering that the posterior distribution P( θ _i∣Y_i, M) is unknown. Alternatively, we could choose q_i( θ _i) from a normal distribution and employ a local variational method (Bishop, 2006; Jordan et al., 1999) to obtain a closed-form lower bound expression of the expected log-likelihood with respect to q_i( θ _i). Following the derivations in Cho et al. (2021), the optimal choice of q_i( θ _i) is q_i( θ _i) ∼ N( θ _i∣ μ _i, Σ_i), and its mean and covariance are

μ_{i} = Σ_{i} \times \sum_{j = 1}^{J} \left\{ 2 η (ξ_{i, j}) b_{j} + Y_{i j} - \frac{1}{2} \right\} α_{j}^{⊤},

(7)

Σ_{i}^{- 1} = Σ_{θ}^{- 1} + 2 \sum_{j = 1}^{J} η (ξ_{i, j}) α_{j} α_{j}^{⊤},

(8)

where ξ_i,j denotes a variational parameter indexed by i and j, and

η (ξ_{i, j}) = {(2 ξ_{i, j})}^{- 1} [e^{ξ_{i, j}} / (1 + e^{ξ_{i, j}}) - 1 / 2]
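The updates in Equations (7) and (8) can be sketched in a few lines of NumPy. This is a minimal illustration under assumed array shapes, not the reference GVEM code: the names `eta` and `e_step` are hypothetical, and the small-ξ branch uses the limit η(ξ) → 1/8 as ξ → 0 to avoid a 0/0 division.

```python
import numpy as np

def eta(xi):
    """eta(xi) = (2 xi)^{-1} [e^xi / (1 + e^xi) - 1/2], with limit 1/8 at xi = 0."""
    xi = np.asarray(xi, dtype=float)
    small = np.abs(xi) < 1e-8
    safe = np.where(small, 1.0, xi)                    # placeholder avoids 0/0
    val = (1.0 / (1.0 + np.exp(-safe)) - 0.5) / (2.0 * safe)
    return np.where(small, 0.125, val)

def e_step(Y, A, b, Sigma_theta, xi):
    """Variational means and covariances from Equations (7) and (8).

    Y: (N, J) responses, A: (J, K), b: (J,), xi: (N, J).
    Returns mu with shape (N, K) and Sigma with shape (N, K, K).
    """
    e = eta(xi)                                        # (N, J)
    # Sigma_i^{-1} = Sigma_theta^{-1} + 2 sum_j eta_ij alpha_j alpha_j^T
    prec = np.linalg.inv(Sigma_theta) + 2.0 * np.einsum("ij,jk,jl->ikl", e, A, A)
    Sigma = np.linalg.inv(prec)                        # batched inverse over i
    w = 2.0 * e * b + Y - 0.5                          # weights in Equation (7)
    mu = np.einsum("ikl,il->ik", Sigma, w @ A)         # Sigma_i (sum_j w_ij alpha_j)
    return mu, Sigma

# Tiny illustrative call with made-up inputs.
rng = np.random.default_rng(0)
N, J, K = 5, 4, 2
Y = rng.integers(0, 2, size=(N, J))
A = rng.uniform(1.0, 2.0, size=(J, K))
b = rng.normal(size=J)
xi = np.ones((N, J))
mu, Sigma = e_step(Y, A, b, np.eye(K), xi)
```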

Let E^(t)(M, ξ ) denote the tth iteration’s lower bound expression of the expected log-likelihood, and now it has a closed-form expression

\begin{aligned} E^{(t)} (M, ξ) = & \sum_{i = 1}^{N} \sum_{j = 1}^{J} \Big( \log \frac{e^{ξ_{i, j}^{(t)}}}{1 + e^{ξ_{i, j}^{(t)}}} + (\frac{1}{2} - Y_{i j}) b_{j}^{(t)} + (Y_{i j} - \frac{1}{2}) α_{j}^{(t) ⊤} μ_{i}^{(t)} - \frac{1}{2} ξ_{i, j}^{(t)} \\ & - η (ξ_{i, j}^{(t)}) \big\{ b_{j}^{(t) 2} - 2 b_{j}^{(t)} α_{j}^{(t) ⊤} μ_{i}^{(t)} + α_{j}^{(t) ⊤} [Σ_{i}^{(t)} + (μ_{i}^{(t)}) {(μ_{i}^{(t)})}^{⊤}] α_{j}^{(t)} - ξ_{i, j}^{(t) 2} \big\} \Big) \\ & + \frac{N}{2} \log | Σ_{θ}^{(t) - 1} | - \sum_{i = 1}^{N} \frac{1}{2} T r (Σ_{θ}^{(t) - 1} [Σ_{i}^{(t)} + (μ_{i}^{(t)}) {(μ_{i}^{(t)})}^{⊤}]) . \end{aligned}

(9)

In every E step, the expectation function is updated iteratively with all recently updated model parameters. In every M step, we maximize the E^(t)(M, ξ ) to estimate the parameters (M, ξ ). This is achieved by setting the derivative of E^(t)(M, ξ ) with respect to (M, ξ ) to be zero.

The current study modifies (9) by adding prior beliefs on item parameters. That is, we can assume some prior distributions of $α_{j} \sim N (μ_{α}^{(0)}, Σ_{α}^{(0)})$ and $b_{j} \sim N (μ_{b}^{(0)}, σ_{b}^{(0) 2})$ , and the expectation could be rewritten as

\begin{aligned} E^{(t)} (M, ξ) = & \sum_{i = 1}^{N} \sum_{j = 1}^{J} \Big( \log \frac{e^{ξ_{i, j}^{(t)}}}{1 + e^{ξ_{i, j}^{(t)}}} + (\frac{1}{2} - Y_{i j}) b_{j}^{(t)} + (Y_{i j} - \frac{1}{2}) α_{j}^{(t) ⊤} μ_{i}^{(t)} - \frac{1}{2} ξ_{i, j}^{(t)} \\ & - η (ξ_{i, j}^{(t)}) \big\{ b_{j}^{(t) 2} - 2 b_{j}^{(t)} α_{j}^{(t) ⊤} μ_{i}^{(t)} + α_{j}^{(t) ⊤} [Σ_{i}^{(t)} + (μ_{i}^{(t)}) {(μ_{i}^{(t)})}^{⊤}] α_{j}^{(t)} - ξ_{i, j}^{(t) 2} \big\} \Big) \\ & + \frac{N}{2} \log | Σ_{θ}^{(t) - 1} | - \sum_{i = 1}^{N} \frac{1}{2} T r (Σ_{θ}^{(t) - 1} [Σ_{i}^{(t)} + (μ_{i}^{(t)}) {(μ_{i}^{(t)})}^{⊤}]) \\ & - N \sum_{j = 1}^{J} \Big( \frac{{(α_{j}^{(t)} - μ_{α}^{(0)})}^{⊤} Σ_{α}^{(0) - 1} (α_{j}^{(t)} - μ_{α}^{(0)})}{2} + \frac{{(b_{j}^{(t)} - μ_{b}^{(0)})}^{2}}{2 σ_{b}^{(0) 2}} \Big) . \end{aligned}

(10)

This is comparable to the Bayes modal estimation approach described by Tierney and Kadane (1986). Incorporating these priors could prevent abnormal parameter estimates and improve the accuracy and stability of the parameter estimates (Cho et al., 2022). The current study applied the bootstrap approach to the SE estimation procedure for this version with prior information and denoted it as Gaussian Variational EM with Bootstrap Sampling and Prior (GVEM-BSP). For the original GVEM, the USEM and the bootstrap approaches were implemented and denoted as GVEM-USEM and GVEM-BS, respectively. Details of the SE estimation procedures are presented in the next subsections.

USEM Approach Under the GVEM Framework

The USEM algorithm (Tian et al., 2013) is implemented to estimate SEs of item parameters. It is a more computationally efficient version of SEM (Cai, 2008), which calculates the observed-data information matrix according to the missing information principle (Orchard & Woodbury, 1972). This principle states that the observed-data information I_o is the difference between the complete-data information I_c and the missing-data information I_m. Therefore, the p × p dimensional variance-covariance matrix for model parameters M could be derived as

V_{M} = I_{o}^{- 1} = {(I_{c} - I_{m})}^{- 1} = I_{c}^{- 1} {(I_{p} - I_{m} I_{c}^{- 1})}^{- 1} = I_{c}^{- 1} {(I_{p} - Δ)}^{- 1},

(11)

where $I_{o} = - \frac{\partial^{2}}{\partial M \partial M^{T}} l (M; Y)$, $I_{c} = E \{- \frac{\partial^{2}}{\partial M \partial M^{T}} l (M ∣ Y, θ) ∣ Y\}$, $I_{m} = E \{- \frac{\partial^{2}}{\partial M \partial M^{T}} \log P (θ ∣ Y, M) ∣ Y, M\}$, I_p is the p × p identity matrix, and $Δ = I_{m} I_{c}^{- 1}$ is the fraction of missing information. Typically, the EM algorithm uses the Newton–Raphson algorithm to optimize the M-step, so that $I_{c}^{- 1}$ is obtained as a by-product. However, this does not apply to the GVEM algorithm since the item parameters are updated via closed-form solutions.
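The algebra in Equation (11) can be checked numerically. The sketch below builds a hypothetical complete-data information matrix `I_c` and a smaller missing-data information matrix `I_m` (both made up purely for illustration) and confirms that inverting `I_c - I_m` directly agrees with the factorized form based on Δ.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4                                       # number of model parameters

# Hypothetical positive-definite complete-data information.
L = rng.normal(size=(p, p))
I_c = L @ L.T + p * np.eye(p)
# Hypothetical missing-data information, chosen so that I_c - I_m stays positive definite.
I_m = 0.3 * I_c + 0.1 * np.eye(p)

I_c_inv = np.linalg.inv(I_c)
Delta = I_m @ I_c_inv                       # fraction of missing information

V_direct = np.linalg.inv(I_c - I_m)                      # (I_c - I_m)^{-1}
V_sem = I_c_inv @ np.linalg.inv(np.eye(p) - Delta)       # I_c^{-1} (I_p - Delta)^{-1}

se = np.sqrt(np.diag(V_sem))                # SEs are square roots of the diagonal
```

The factorized form matters in practice because SEM-type methods estimate Δ numerically from the EM mapping rather than forming `I_m` explicitly.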

Alternatively, GVEM computes I_c based on the variational lower bound E(M, ξ ): $I_{c} = - \frac{\partial^{2}}{\partial M \partial M^{T}} E (M, ξ)$ . The second derivatives with respect to α _j, b_j have closed-form solutions

\begin{aligned} \frac{\partial^{2}}{\partial α_{j} \partial α_{j}^{T}} E (M, ξ) & = \sum_{i = 1}^{N} - 2 η (ξ_{i, j}) [Σ_{i} + μ_{i} μ_{i}^{⊤}], \\ \frac{\partial^{2}}{\partial α_{j} \partial b_{j}} E (M, ξ) & = \sum_{i = 1}^{N} 2 η (ξ_{i, j}) μ_{i}, \\ \frac{\partial^{2}}{\partial b_{j}^{2}} E (M, ξ) & = \sum_{i = 1}^{N} - 2 η (ξ_{i, j}) . \end{aligned}

(12)

Other elements in I_c are 0. In general, I_c can be expressed as

I_{c} = \begin{bmatrix} - \frac{\partial^{2}}{\partial α_{1} \partial α_{1}^{T}} E (M, ξ) & - \frac{\partial^{2}}{\partial α_{1} \partial b_{1}} E (M, ξ) & \dots & 0 & 0 \\ - \frac{\partial^{2}}{\partial α_{1} \partial b_{1}} E (M, ξ) & - \frac{\partial^{2}}{\partial b_{1}^{2}} E (M, ξ) & \dots & 0 & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\ 0 & 0 & \dots & - \frac{\partial^{2}}{\partial α_{J} \partial α_{J}^{T}} E (M, ξ) & - \frac{\partial^{2}}{\partial α_{J} \partial b_{J}} E (M, ξ) \\ 0 & 0 & \dots & - \frac{\partial^{2}}{\partial α_{J} \partial b_{J}} E (M, ξ) & - \frac{\partial^{2}}{\partial b_{J}^{2}} E (M, ξ) \end{bmatrix}
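The block-diagonal structure of I_c can be assembled directly from the closed forms in Equation (12). The following is a minimal sketch under assumed inputs: `eta_all` holds η(ξ_{i,j}), while `mu` and `Sigma` hold the variational means and covariances. For simplicity it treats all K loadings of every item as free, whereas a confirmatory model would keep only the rows and columns of the non-zero loadings; the function names are hypothetical.

```python
import numpy as np

def item_block(eta_j, mu, Sigma):
    """(K+1) x (K+1) block of I_c for one item, i.e., minus the Hessian in Equation (12).

    eta_j: (N,) values of eta(xi_ij); mu: (N, K); Sigma: (N, K, K).
    """
    K = mu.shape[1]
    second_moment = Sigma + np.einsum("ik,il->ikl", mu, mu)    # Sigma_i + mu_i mu_i^T
    aa = 2.0 * np.einsum("i,ikl->kl", eta_j, second_moment)    # -d2E / d alpha d alpha^T
    ab = -2.0 * (eta_j @ mu)                                   # -d2E / d alpha d b
    bb = 2.0 * eta_j.sum()                                     # -d2E / d b^2
    block = np.empty((K + 1, K + 1))
    block[:K, :K] = aa
    block[:K, K] = ab
    block[K, :K] = ab
    block[K, K] = bb
    return block

def complete_information(eta_all, mu, Sigma):
    """Block-diagonal I_c with one (alpha_j, b_j) block per item; eta_all: (N, J)."""
    J = eta_all.shape[1]
    size = mu.shape[1] + 1
    I_c = np.zeros((J * size, J * size))
    for j in range(J):
        s = slice(j * size, (j + 1) * size)
        I_c[s, s] = item_block(eta_all[:, j], mu, Sigma)
    return I_c

# Made-up variational quantities for N = 50 examinees, J = 3 items, K = 2 factors.
rng = np.random.default_rng(0)
N, J, K = 50, 3, 2
eta_all = rng.uniform(0.05, 0.125, size=(N, J))
mu = rng.normal(size=(N, K))
Sigma = np.tile(0.5 * np.eye(K), (N, 1, 1))
I_c = complete_information(eta_all, mu, Sigma)
```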

The key remaining step is to compute Δ. Cai (2008) showed that Δ governs the rate of convergence of the EM process and that it is the Jacobian matrix of the EM mapping evaluated at the converged estimates. Each element Δ_ij can be computed as

\begin{aligned} Δ_{i j} & = \frac{\partial F_{j} (\hat{M})}{\partial M_{i}} = \lim_{M_{i} \to {\hat{M}}_{i}} \frac{F_{j} ({\hat{M}}_{1}, \dots, {\hat{M}}_{i - 1}, M_{i}, {\hat{M}}_{i + 1}, \dots, {\hat{M}}_{p}) - F_{j} (\hat{M})}{M_{i} - {\hat{M}}_{i}} \\ & = \lim_{t \to \infty} \frac{F_{j} (M_{(i)}^{(t)}) - {\hat{M}}_{j}}{M_{i}^{(t)} - {\hat{M}}_{i}} = \lim_{t \to \infty} Δ_{i j}^{(t)}, \end{aligned}

(13)

where $\hat{M}$ refers to the GVEM estimates of the item parameters M; $M_{(i)}^{(t)} = ({\hat{M}}_{1}, \dots, {\hat{M}}_{i - 1}, M_{i}^{(t)}, {\hat{M}}_{i + 1}, \dots, {\hat{M}}_{p})$ equals $\hat{M}$ except that the ith element is replaced by $M_{i}^{(t)}$; and F is a vector-valued mapping function such that $M^{(t + 1)} = F (M^{(t)})$.

We conduct the “forced-EM” process (Meng & Rubin, 1991) to obtain Δ_ij. Specifically, we run one iteration of GVEM from $M_{(i)}^{(t)}$ to get $F (M_{(i)}^{(t)})$ for some t, and use it to calculate $Δ_{i j}^{(t)}$ for all i, j = 1, 2, …, p. The step is repeated for t = 1, 2, …, T* until the entire matrix converges (i.e., $| Δ_{i j}^{(T^{*})} - Δ_{i j}^{(T^{*} - 1)} | \leq ϵ$ for all of the elements, where T* ≤ T and T is the number of iterations for GVEM). In our study, we adopt the USEM algorithm, which uses a row-wise convergence criterion (Tian et al., 2013). That is, for a given i, if $| Δ_{i j}^{(T^{*})} - Δ_{i j}^{(T^{*} - 1)} | \leq ϵ$ holds for all j = 1, 2, …, p, then the ith row is considered converged and is no longer involved in the “forced-EM” process. As a result, USEM converges faster than SEM.
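The row-wise forced-EM procedure can be sketched as follows. Since a full GVEM iteration is beyond the scope of a short example, the mapping `F` here is a toy contraction with a known Jacobian at its fixed point, standing in for one GVEM update; `usem_delta`, `F`, `A_true`, and all numerical values are assumptions made for this illustration.

```python
import numpy as np

def usem_delta(F, M_hat, M_path, eps=1e-6):
    """Estimate Delta[i, j] = dF_j / dM_i at M_hat with row-wise stopping.

    F: one-step mapping; M_hat: fixed point (p,); M_path: iterates M^(t), t = 0, 1, ...
    """
    p = M_hat.size
    delta = np.full((p, p), np.nan)
    active = np.ones(p, dtype=bool)              # rows that have not converged yet
    for M_t in M_path:
        for i in np.flatnonzero(active):
            step = M_t[i] - M_hat[i]
            if step == 0.0:
                continue                         # difference quotient undefined
            M_i = M_hat.copy()
            M_i[i] = M_t[i]                      # M_(i)^(t): only element i differs
            row = (F(M_i) - M_hat) / step        # Delta_ij^(t) for all j
            seen_before = not np.isnan(delta[i]).any()
            if seen_before and np.all(np.abs(row - delta[i]) <= eps):
                active[i] = False                # row i drops out of forced EM
            delta[i] = row
        if not active.any():
            break
    return delta, active

# Toy mapping with fixed point M_hat and Jacobian A_true at M_hat.
A_true = np.array([[0.5, 0.1, 0.0], [0.0, 0.4, 0.1], [0.1, 0.0, 0.3]])
M_hat = np.array([1.2, 0.8, -0.3])

def F(M):
    d = M - M_hat
    return M_hat + A_true @ d + 0.1 * (d @ d)

# Iterate history, as would be stored while running the estimation algorithm.
M_path = [M_hat + 0.5]
for _ in range(60):
    M_path.append(F(M_path[-1]))

delta, active = usem_delta(F, M_hat, M_path)
```

In this toy example `delta` should approach the transpose of `A_true`, because Δ_ij is defined as the derivative of the jth output with respect to the ith input.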

Bootstrap Approach Under the GVEM Framework

The bootstrap is an alternative approach to SE estimates of item parameters when the SE computation is mathematically intractable (Efron & Tibshirani, 1986). It is a resampling procedure that generates a large number of bootstrap samples in either a parametric fashion or a nonparametric fashion (Zhang & Zhao, 2019). From each bootstrap sample, the statistic of interest is calculated. The sampling distribution of the replications could be obtained to make some inferences related to the accuracy of the statistic (Patton et al., 2014). The current study implements the parametric bootstrap sampling and the steps are outlined below.

1. B bootstrap datasets are simulated based on GVEM estimates $\hat{M}$ .

2. The GVEM is conducted to estimate item parameters for each bootstrap dataset, and these item parameter estimates are denoted as ${\hat{M}}^{1}, {\hat{M}}^{2}, \dots {\hat{M}}^{B}$ .

3. The SE estimates are simply the sample standard deviations of the estimated item parameters. The SE estimate for any item parameter M_j (such as α_jk or b_j) could be expressed as

σ_{{\hat{M}}_{j}} = \sqrt{\frac{1}{B - 1} \sum_{b = 1}^{B} {({\hat{M}}_{j}^{b} - {\hat{M}}_{j})}^{2}} .

(14)

While the number of bootstrap samples B is usually taken to be relatively large, the current study finds that B = 50 provides good and stable numerical results, and our pilot study indicated that even as few as 5 bootstrap samples could produce similar estimates. Note that if $\hat{M}$ is obtained by GVEM with prior information (i.e., using Equation (10)), then the GVEM applied to each bootstrap sample uses the same prior distributions. In the current simulation study, the prior distributions for α_j and b_j were set as N(1.5, I_3) and N(0, 10), respectively.
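The three bootstrap steps above can be sketched generically. In the code below, `simulate` and `fit` are hypothetical callables standing in for "generate a dataset from the fitted model" and "run the estimation algorithm"; the demonstration uses a deliberately simple normal-mean model rather than GVEM so that the expected SE (1/√n) is known. The deviations are taken around the original estimate, following Equation (14) as written.

```python
import numpy as np

def bootstrap_se(M_hat, simulate, fit, B=50, rng=None):
    """Parametric bootstrap SEs following Equation (14).

    M_hat: (p,) fitted parameters; simulate(M_hat, rng) -> dataset;
    fit(dataset) -> (p,) parameter estimates.
    """
    rng = np.random.default_rng() if rng is None else rng
    M_hat = np.asarray(M_hat, dtype=float)
    # Steps 1 and 2: simulate B datasets from M_hat and refit each one.
    boot = np.array([fit(simulate(M_hat, rng)) for _ in range(B)])   # (B, p)
    # Step 3: spread of the bootstrap estimates around M_hat.
    return np.sqrt(((boot - M_hat) ** 2).sum(axis=0) / (B - 1))

# Stand-in model: n = 400 draws from N(mean, 1), estimated by the sample mean.
n = 400

def simulate(M, rng):
    return rng.normal(loc=M[0], scale=1.0, size=n)

def fit(data):
    return np.array([data.mean()])

se = bootstrap_se(np.array([0.3]), simulate, fit, B=200, rng=np.random.default_rng(0))
```

For this stand-in model the result should be near 1/√400 = 0.05; with a GVEM-type estimator, `fit` would also carry any prior settings so that every bootstrap refit uses the same priors as the original fit.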

Simulation Study

A simulation study was conducted to evaluate the performance of the three proposed SE estimation procedures (GVEM-USEM, GVEM-BS, and GVEM-BSP). The manipulated factors included: (1) the test length, which was either 45 or 30; (2) the factor correlations, which were simulated from Unif(0.1, 0.3) or Unif(0.5, 0.7) to represent the low and high correlation conditions, respectively; (3) the multidimensional structure, which was either between-item M2PL or within-item M2PL; and (4) the number of factors, which was either 3 or 5 (i.e., K = 3, 5), representing low or high dimensionality. In the between-item M2PL, with a test length of 45, 15 items were loaded onto each factor for K = 3, and 9 items were loaded onto each factor for K = 5. Similarly, at a test length of 30, 10 items were loaded per factor for K = 3 and 6 items for K = 5. For the within-item M2PL, when the test length was 45 and K = 3, about 60%, 24%, and 16% of the items were loaded onto one, two, and three factors, respectively. For K = 5, the proportions of items loaded onto one, two, and three factors were about 56%, 22%, and 22%, respectively. When the test length was 30 and K = 3, about 60%, 20%, and 20% of the items were loaded onto one, two, and three factors, respectively. For K = 5, about 33% of the items were loaded onto each of one, two, and three factors. Under all conditions, true item discrimination parameters were drawn from Unif(1, 2) and true item difficulty parameters were drawn from N(0, 1), the same settings as in Cho et al. (2021). The sample size was fixed at 1000. The ability parameters θ_i were drawn from a multivariate normal distribution, N(0, Σ_θ), where Σ_θ is a variance-covariance matrix whose diagonal elements were 1 and whose off-diagonal elements depended on the factor correlation condition.
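As a rough illustration of how one between-item condition of this design could be generated, the sketch below draws true item parameters and a factor correlation matrix. The function name `between_item_design` and its arguments are hypothetical, and the within-item conditions (which require assigning items to two or three factors) are not covered.

```python
import numpy as np

def between_item_design(J, K, corr_low, corr_high, rng):
    """True parameters for a between-item M2PL condition.

    Each item loads on exactly one factor, with J // K items per factor.
    """
    per_factor = J // K
    A = np.zeros((J, K))
    for k in range(K):
        rows = slice(k * per_factor, (k + 1) * per_factor)
        A[rows, k] = rng.uniform(1.0, 2.0, size=per_factor)   # Unif(1, 2) discriminations
    b = rng.normal(0.0, 1.0, size=J)                          # N(0, 1) difficulties
    # Unit variances, with correlations drawn from Unif(corr_low, corr_high).
    Sigma = np.eye(K)
    iu = np.triu_indices(K, k=1)
    Sigma[iu] = rng.uniform(corr_low, corr_high, size=len(iu[0]))
    Sigma = Sigma + Sigma.T - np.eye(K)                       # symmetrize, keep diagonal 1
    return A, b, Sigma

rng = np.random.default_rng(0)
A, b, Sigma_theta = between_item_design(J=45, K=3, corr_low=0.5, corr_high=0.7, rng=rng)
```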

As noted above, both GVEM-BS and GVEM-BSP generated 50 bootstrap samples per replication. For GVEM-BSP, the prior distribution for α_j was set as N(1.5, I_3), which was informative since the mean of the true distribution was 1.5. On the other hand, the prior distribution for b_j was N(0, 10), following the setting from Sinharay (2005). Fifty replications were conducted for each condition. The true item parameters were held constant across all replications within each condition. The empirical standard deviations of the estimated item parameters across replications per condition served as the “true” SEs for each method. Note that the current design was based on a confirmatory model; hence, many elements of α_j were fixed at 0. Therefore, the evaluation criteria below excluded these parameters and focused on non-zero model parameters. Given a non-zero item parameter β_j (which is used as a general notation to represent α_jk or b_j), the empirical standard deviation $σ_{β_{j}}$ can be computed as

σ_{β_{j}} = \sqrt{\frac{1}{R - 1} \sum_{r = 1}^{R} {({\hat{β}}_{j}^{r} - β_{j})}^{2}},

(15)

where ${\hat{β}}_{j}^{r}$ is the item parameter estimate for β_j in the rth replication, and R is the total number of replications. For the overall comparisons of different SE estimation procedures, bias, relative bias, and estimated SEs were computed for each non-zero model parameter across all items within a condition.¹ For each replication, they are defined by

Bias = \frac{1}{J^{*}} \sum_{j = 1}^{J^{*}} ({\hat{σ}}_{β_{j}}^{r} - σ_{β_{j}}),

(16)

Relative Bias = \frac{1}{J^{*}} \sum_{j = 1}^{J^{*}} \frac{{\hat{σ}}_{β_{j}}^{r} - σ_{β_{j}}}{σ_{β_{j}}},

(17)

Estimated SEs = \frac{1}{J^{*}} \sum_{j = 1}^{J^{*}} {\hat{σ}}_{β_{j}}^{r},

(18)

where J* denotes the number of non-zero model parameters, and ${\hat{σ}}_{β_{j}}^{r}$ denotes the estimated SE for β_j in the rth replication.
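Equations (16) to (18) reduce to simple row-wise averages once the estimated and "true" SEs are arranged as arrays. The sketch below assumes `est_se` has one row per replication and one column per non-zero parameter; the function name and the small numerical example are illustrative only.

```python
import numpy as np

def se_recovery(est_se, true_se):
    """Per-replication bias, relative bias, and mean estimated SE (Equations 16 to 18).

    est_se: (R, J*) estimated SEs; true_se: (J*,) empirical "true" SEs.
    """
    est_se = np.asarray(est_se, dtype=float)
    true_se = np.asarray(true_se, dtype=float)
    diff = est_se - true_se                         # broadcast over replications
    return {
        "bias": diff.mean(axis=1),
        "relative_bias": (diff / true_se).mean(axis=1),
        "estimated_se": est_se.mean(axis=1),
    }

# Two replications, two parameters (made-up numbers).
out = se_recovery(est_se=[[0.1, 0.2], [0.3, 0.4]], true_se=[0.2, 0.2])
```

Each returned array has one entry per replication, which is the unit summarized by the boxplots in the Results section.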

Results

It should be noted that the GVEM-USEM algorithm failed to estimate some SEs of item parameters in some replications, and these values were excluded from further analysis. Table 1 in the appendix summarizes the number of unsuccessful replications with missing values for the GVEM-USEM method. The simulation results under various conditions are presented as boxplots to show the distribution of bias, relative bias, and estimated SEs. All three methods (GVEM-BSP, GVEM-BS, and GVEM-USEM) are consistently presented in the same order under each manipulated condition. Figures 1 and 2 illustrate bias and relative bias results for K = 3. In low-dimensional settings, the GVEM-USEM yielded nearly unbiased SE estimates for item discrimination parameters under the simplest condition (K = 3, J = 45, low factor correlations, and between-item structure). However, it tended to overestimate SEs for item difficulty parameters. Both bootstrap methods, GVEM-BSP and GVEM-BS, slightly underestimated the SEs for item parameters. As conditions became more complex, the GVEM-BSP and GVEM-BS methods performed better, producing bias and relative bias close to 0 for SE estimates. In particular, GVEM-BSP provided approximately unbiased SE estimates under most conditions. In contrast, the GVEM-USEM method consistently overestimated SEs of item parameters. Under the most complex condition (i.e., short test length, high latent correlations, and within-item structure), both the GVEM-BS and GVEM-USEM methods overestimated SEs for item discrimination parameters and yielded some outliers. In contrast, the GVEM-BSP method yielded unbiased SE estimates for item discrimination parameters in terms of bias and relative bias, but slightly underestimated SEs for item difficulty parameters. The bias and relative bias varied more for the within-item conditions than for the between-item conditions. This was expected since the within-item model had a more complex loading structure.

Figure 1.

Bias comparison for item parameters’ standard errors estimates when K = 3.

Figure 2.

Relative bias comparison for item parameters’ standard errors estimates when K = 3.

Figures 3 and 4 illustrate bias and relative bias results for K = 5. In high-dimensional conditions, both the GVEM-BSP and GVEM-BS methods consistently outperformed the GVEM-USEM method, yielding approximately zero bias and relative bias for SE estimates under all conditions, except when factor correlations were high and the item structure was within-item M2PL, for both test lengths of 45 and 30. Under these two conditions, all methods exhibited increased variability in SE estimates. The GVEM-BS method consistently yielded positive bias and relative bias. When J = 45, the GVEM-BSP overestimated SEs, whereas it underestimated them when J = 30. The GVEM-USEM method yielded more outliers in scenarios where J = 30, factor correlations were high, and the item structure was within-item M2PL. Under other conditions, it tended to overestimate SEs, consistent with our findings for K = 3. The general trend of bias and relative bias results remained the same as in low-dimensional conditions. That is, SE estimation became more challenging with high factor correlations and a within-item structure, although the GVEM-BSP consistently demonstrated superior performance across all conditions.

Figure 3.

Bias comparison for item parameters’ standard errors estimates when K = 5.

Figure 4.

Relative bias comparison for item parameters’ standard errors estimates when K = 5.

Figures 5 and 6 present estimated SEs for each method. It is important to note that different scales are employed for the left and right columns to enhance the visibility of differences between methods. In general, the GVEM-BS and GVEM-BSP methods produced comparable results, with estimated SEs smaller than 0.15. Three exceptions were observed: (1) when K = 3, J = 30, with high factor correlations and a within-item M2PL structure; (2) when K = 5, J = 45, with high factor correlations and a within-item M2PL structure; (3) when K = 5, J = 30, with high factor correlations and a within-item M2PL structure. Under these three conditions, the GVEM-BS estimated larger SEs for item discrimination parameters, possibly contributing to increased variability in bias and relative bias results under high latent correlation and within-item M2PL conditions. The GVEM-USEM method estimated smaller SEs than the GVEM-BS under these three conditions, while generally estimating relatively larger SEs than the two bootstrap methods and presenting some outliers in other scenarios.

Figure 5.

Estimated standard errors comparison when K = 3.

Figure 6.

Estimated standard errors comparison when K = 5.

Discussions

The recent work of Cho et al. (2021) has shown the superiority of the GVEM algorithm in terms of computational efficiency and item parameter estimation accuracy, but its SE estimation was not fully discussed. Since obtaining accurate SEs is also an important prerequisite for many applications, the current study applied the USEM and bootstrap methods within the GVEM framework and compared the performance of three GVEM methods (GVEM-BSP, GVEM-BS, and GVEM-USEM) with respect to SE recovery. The simulation results showed that the GVEM-BSP performed the best under most conditions, likely because adding a prior makes parameter estimation more stable and robust. Although computationally more efficient, GVEM-USEM tended to exhibit an upward bias. The GVEM-BS method demonstrated comparable performance to GVEM-BSP under conditions with low factor correlations and between-item structure, yet displayed increased variability under conditions involving high factor correlations and within-item structure.

The GVEM-BSP is a promising method for estimating SEs. Moreover, our pilot study found that it can produce accurate item parameter estimates. In addition, the GVEM-BSP can be extended to variational Bayesian (VB) estimation (Bishop, 2006), an alternative approximation technique that handles intractable integrals by specifying variational distributions for the item parameters. The main advantage of VB is that the SEs of item parameters can be derived in closed form. However, our pilot study showed that VB consistently underestimates the SEs, so we defer a detailed examination of this method to a future study.

Although the literature documents several benefits of the USEM method for SE estimation, incorporating it into the GVEM algorithm has several drawbacks. First, unlike traditional EM algorithms, GVEM requires additional derivations (e.g., the inverse of the complete-data information matrix $I_{c}^{-1}$) to compute SEs. Second, the USEM method relies on an information matrix based on the variational lower bound E(M, ξ) of the marginal log-likelihood rather than on the marginal log-likelihood itself, which could introduce bias into the SE estimates. Third, the information matrix in the current study is only a sub-block of the entire information matrix induced by (M, ξ), and hence could be ill-conditioned. This also explains the occurrence of unsuccessful replications (see Table 1).
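For orientation, the identity that underlies supplemented EM in the standard (non-variational) setting (Meng & Rubin, 1991) can be sketched as follows. The symbols $I_{o}$ (observed-data information), $I_{m} = I_{c} - I_{o}$ (missing information), $\Delta = I_{m} I_{c}^{-1}$ (fraction of missing information, which supplemented EM approximates numerically from the EM iterates), and $\mathrm{Id}$ (identity matrix) are notation assumed here and may differ from the notation used elsewhere in this paper.

$$
I_{o} = I_{c} - I_{m} = (\mathrm{Id} - \Delta)\, I_{c},
\qquad
I_{o}^{-1} = I_{c}^{-1}\, (\mathrm{Id} - \Delta)^{-1}.
$$

Under this reading, SEs are the square roots of the diagonal of $I_{o}^{-1}$, so a nearly singular $\mathrm{Id} - \Delta$ or $I_{c}$ makes the inversion numerically unstable, which is one way an ill-conditioned information matrix could lead to failed replications.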

The current study can be extended in the following directions. First, all conditions were confirmatory, meaning that the factor loading structure was assumed to be known. When the structure is unknown, as in exploratory factor analysis, the proposed SE estimation procedures can be combined with, for example, the GVEM with an adaptive lasso penalty, which has been shown to accurately recover both the model parameters and the loading structure (Cho et al., 2021, 2022). Second, the current study was based on a two-parameter MIRT model. It is desirable to investigate the performance of different SE estimation procedures within the GVEM framework under other types of MIRT models, such as the M3PL model, which accounts for guessing, or even the M4PL model, which additionally accounts for inattention. Lastly, the current simulation design was an idealized scenario in which the response data had no missing values. However, missing data are ubiquitous in practice and can result in biased parameter estimates and inflated SEs (Kalkan et al., 2018). Future studies are needed to explore the performance of GVEM with different SE estimation procedures under various missing-data scenarios.
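To make the relation among the M2PL, M3PL, and M4PL models concrete, the following is a minimal sketch of the item response function. It assumes the linear predictor is parameterized as a'θ − b, with a lower asymptote `c` for guessing and an upper asymptote `d` for inattention; the function name `mirt_prob` and this parameterization are illustrative assumptions, not necessarily the paper's own.

```python
import math
from typing import Sequence


def mirt_prob(
    theta: Sequence[float],
    a: Sequence[float],
    b: float,
    c: float = 0.0,
    d: float = 1.0,
) -> float:
    """Probability of a correct response under an M4PL-type model.

    With c = 0 and d = 1 this reduces to the M2PL; with d = 1 only, to the M3PL.
    Assumed parameterization: logit = a'theta - b (an illustrative convention).
    """
    z = sum(ak * tk for ak, tk in zip(a, theta)) - b
    logistic = 1.0 / (1.0 + math.exp(-z))
    # Rescale the logistic curve to lie between the lower (c) and upper (d) asymptotes.
    return c + (d - c) * logistic


theta = [0.0, 0.0]          # examinee at the origin of a two-dimensional trait space
a = [1.2, 0.8]              # discrimination parameters on the two dimensions
p_m2pl = mirt_prob(theta, a, b=0.0)
p_m3pl = mirt_prob(theta, a, b=0.0, c=0.2)
p_m4pl = mirt_prob(theta, a, b=0.0, c=0.2, d=0.9)
```

Each added asymptote introduces one more parameter per item, so SE estimation under these models involves a larger information matrix than in the M2PL case.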

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The study is funded by IES R305D200015 and NSF SES-1846747, SES-2150601, ECR-2300382.

ORCID iDs

Jiaying Xiao

Chun Wang

Appendix

Table 1.

The Number of Unsuccessful Replications for the GVEM-USEM.

Test length	Factor correlation	Model structure	Number of factors	Unsuccessful replications
45	0.1–0.3	Within-item	3	1
45	0.5–0.7	Between-item	3	3
45	0.5–0.7	Within-item	3	4
30	0.5–0.7	Within-item	3	1

Table 2.

The Elapsed Time for Three GVEM Methods per Replication Under Two Conditions.

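Condition	Method	Elapsed time (min)
Between-item M2PL, low correlations, K = 5, J = 45	GVEM-BS	0.87
Between-item M2PL, low correlations, K = 5, J = 45	GVEM-BSP	1.20
Between-item M2PL, low correlations, K = 5, J = 45	GVEM-USEM	0.08
Between-item M2PL, low correlations, K = 5, J = 30	GVEM-BS	1.17
Between-item M2PL, low correlations, K = 5, J = 30	GVEM-BSP	1.32
Between-item M2PL, low correlations, K = 5, J = 30	GVEM-USEM	0.04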

CPU: 2.40 GHz 20-Core Intel Xeon; RAM: 1.00 TB 2133 MHz DDR4.

References

Andersson, Xin (2021). Estimation of latent regression item response theory models using a second-order Laplace approximation. Journal of Educational and Behavioral Statistics, 46(2), 244–265. https://doi.org/10.3102/1076998620945199

Bashkov B. M., DeMars C. E. (2017). Examining the performance of the Metropolis–Hastings Robbins–Monro algorithm in the estimation of multilevel multidimensional IRT models. Applied Psychological Measurement, 41(5), 323–337. https://doi.org/10.1177/0146621616688923

Bishop C. M. (2006). Pattern recognition and machine learning (Vol. 4). Springer.

Cagnone, Monari (2013). Latent variable models for ordinal data by using the adaptive quadrature approximation. Computational Statistics, 28(2), 597–619. https://doi.org/10.1007/s00180-012-0319-z

Cai (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61(Pt 2), 309–329. https://doi.org/10.1348/000711007X249603

Cai (2010a). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75(1), 33–57. https://doi.org/10.1007/s11336-009-9136-x

Cai (2010b). Metropolis–Hastings Robbins–Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35(3), 307–335. https://doi.org/10.3102/1076998609353115

Cai, Maydeu-Olivares, Coffman D. L., Thissen (2006). Limited-information goodness-of-fit testing of item response theory models for sparse $2^{P}$ tables. British Journal of Mathematical and Statistical Psychology, 59(Pt 1), 173–194. https://doi.org/10.1348/000711005X66419

Chalmers R. P., Pek, Liu (2017). Profile-likelihood confidence intervals in item response theory models. Multivariate Behavioral Research, 52(5), 533–550. https://doi.org/10.1080/00273171.2017.1329082

Chen (2017). A comparative study of online item calibration methods in multidimensional computerized adaptive testing. Journal of Educational and Behavioral Statistics, 42(5), 559–590. https://doi.org/10.1007/s10709-017-9982-x

Chen, Wang (2021). Using EM algorithm for finite mixtures and reformed supplemented EM for MIRT calibration. Psychometrika, 86(1), 299–326. https://doi.org/10.1007/s11336-021-09745-6

Cho A. E., Wang, Zhang (2021). Gaussian variational estimation for multidimensional item response theory. British Journal of Mathematical and Statistical Psychology, 74(Suppl 1), 52–85. https://doi.org/10.1111/bmsp.12219

Cho A. E., Xiao, Wang (2022). Regularized variational estimation for exploratory item factor analysis. Psychometrika, 89(1), 347–375. https://doi.org/10.1007/s11336-022-09874-6

de la Torre, Patz R. J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30(3), 295–311. https://doi.org/10.3102/10769986030003295

Efron, Tibshirani (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1), 54–75. https://doi.org/10.1214/ss/1177013815

Falk C. F., Monroe (2018). On Lagrange multiplier tests in multidimensional item response theory: Information matrices and model misspecification. Educational and Psychological Measurement, 78(4), 653–678. https://doi.org/10.1177/0013164417714506

Fitzpatrick A. R., Yen W. M. (2001). The effects of test length and sample size on the reliability and equating of tests composed of constructed-response items. Applied Measurement in Education, 14(1), 31–57. https://doi.org/10.1207/s15324818ame1401_04

Gonçalves, White (2005). Bootstrap standard error estimates for linear regression. Journal of the American Statistical Association, 100(471), 970–979. https://doi.org/10.1198/016214504000002087

Joe (2008). Accuracy of Laplace approximation for discrete response mixed models. Computational Statistics & Data Analysis, 52(12), 5066–5074. https://doi.org/10.1016/j.csda.2008.05.002

Jordan M. I., Ghahramani, Jaakkola T. S., Saul L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233. https://doi.org/10.1023/A:1007665907178

Kalkan Ö. K., Kara, Kelecioğlu (2018). Evaluating performance of missing data imputation methods in IRT analyses. International Journal of Assessment Tools in Education, 5(3), 403–416. https://doi.org/10.21449/ijate.430720

Lindstrom M. J., Bates D. M. (1988). Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association, 83(404), 1014–1022. https://doi.org/10.1080/01621459.1988.10478693

Liou L.-C. (1991). Assessing statistical accuracy in ability estimation: A bootstrap approach. Psychometrika, 56(1), 55–67. https://doi.org/10.1007/BF02294585

Liu, Schulz E. M. (2008). Standard error estimation of 3PL IRT true score equating with an MCMC method. Journal of Educational and Behavioral Statistics, 33(3), 257–278. https://doi.org/10.3102/1076998607306076

Meng X.-L., Rubin D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86(416), 899–909. https://doi.org/10.1080/01621459.1991.10475130

Orchard, Woodbury M. A. (1972). A missing information principle: Theory and applications. In Volume 1: Theory of statistics (pp. 697–716). University of California Press.

Paek, Cai (2014). A comparison of item parameter standard error estimation procedures for unidimensional and multidimensional item response theory modeling. Educational and Psychological Measurement, 74(1), 58–76. https://doi.org/10.1177/0013164413500277

Patton J. M., Cheng, Yuan K.-H., Diao (2014). Bootstrap standard errors for maximum likelihood ability estimates when item parameters are unknown. Educational and Psychological Measurement, 74(4), 697–712. https://doi.org/10.1177/0013164413511083

Sinharay (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375–394. https://doi.org/10.1111/j.1745-3984.2005.00021.x

Tian, Cai, Thissen, Xin (2013). Numerical differentiation methods for computing error covariance matrices in item response theory modeling: An evaluation and a new proposal. Educational and Psychological Measurement, 73(3), 412–439. https://doi.org/10.1177/0013164412465875

Tierney, Kadane J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393), 82–86. https://doi.org/10.1080/01621459.1986.10478240

Tsai T.-H., Hanson B. A., Kolen M. J., Forsyth (2001). A comparison of bootstrap standard errors of IRT equating methods for the common-item nonequivalent groups design. Applied Measurement in Education, 14(1), 17–30. https://doi.org/10.1207/s15324818ame1401_03

Wang, Zhang (2019). A note on the conversion of item parameters standard errors. Multivariate Behavioral Research, 54(2), 307–321. https://doi.org/10.1080/00273171.2018.1513829

Wang W.-C., Chen P.-H., Cheng Y.-Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116–136. https://doi.org/10.1037/1082-989X.9.1.116

White (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50(1), 1–25. https://doi.org/10.2307/1912526

Woods C. M., Cai, Wang (2013). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3), 532–547. https://doi.org/10.1177/0013164412464875

Yao (2012). Multidimensional CAT item selection methods for domain scores and composite scores: Theory and applications. Psychometrika, 77(3), 495–523. https://doi.org/10.1007/s11336-012-9265-5

Yuan K.-H., Cheng, Patton (2014). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79(2), 232–254. https://doi.org/10.1007/s11336-013-9334-4

Zhang, Zhao (2019). Standard errors of IRT parameter scale transformation coefficients: Comparison of bootstrap method, delta method, and multiple imputation method. Journal of Educational Measurement, 56(2), 302–330. https://doi.org/10.1111/jedm.12210