Abstract
The present study was motivated by the recognition that standard errors (SEs) of item response theory (IRT) model parameters are often of immediate interest to practitioners and that there is currently a lack of comparative research on different SE (or error variance–covariance matrix) estimation procedures. The present study investigated item parameter SEs based on three error variance–covariance matrix estimation procedures for unidimensional and multidimensional IRT models: Fisher information, empirical cross-product, and supplemented expectation maximization. This study centers on the direct comparisons of SEs from different procedures and complements a recent study by Tian, Cai, Thissen, and Xin by providing insights and suggestions on the nature of the differences and similarities as well as on practical matters such as the computational cost. The simulation results show that all three procedures produced similar results with respect to bias in the SE estimates for most conditions. When the number of items is large and sample size is small, empirical cross-product, which was the most computationally efficient procedure, appeared to be affected most, producing slight upward bias.
Keywords
Item response theory (IRT) has been widely used in educational and psychological testing. Bock and Aitkin’s (1981) application of the expectation–maximization (EM) algorithm made estimation of item parameters in IRT models practical. For popular unidimensional IRT models, Bock and Aitkin’s EM algorithm or its variations (e.g., Mislevy, 1986) have become the predominant method for estimation of structural parameters in IRT models (i.e., item parameters and parameters of the latent trait distribution), and parameter recovery studies abound. On the other hand, studies related to the performance of standard error estimation procedures are scarce.
Recently, Cai (2008) introduced the supplemented EM algorithm (SEM; see also Meng & Rubin, 1991) for computing structural parameter standard errors (SEs) in IRT. Cai (2008) compared SEM SE estimates under maximum marginal likelihood parameter estimation with expected SEs calculated from the Fisher information matrix and SE estimates from popular IRT software packages such as BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003) and MULTILOG (Thissen, 2003). While real data examples were used in Cai (2008), there was no systematic investigation of the performance of SEM against existing SE estimation procedures that include empirical cross-product (XPD) information matrix and Fisher’s expected information matrix (FIS).
The study by Tian, Cai, Thissen, and Xin (2013) is among the very few studies examining IRT SE procedures. Tian et al. compared the XPD procedure, a few direct numerical differentiation procedures, SEM, and an accelerated version of SEM against the FIS procedure. Although their comparisons were comprehensive, because of their use of overall summary statistics for entire error variance–covariance matrices, it is not possible to make direct observations regarding SE estimates themselves using the summary results from simulation. This study supplements Tian et al.’s research by focusing directly on SEs.
The current study investigated the performance of SEM, XPD information matrix, and FIS. The investigations of SEs for the structural parameters were conducted for popular unidimensional and multidimensional IRT models for dichotomous item responses. Prior to this research, there were no systematic investigations of SE estimation procedures in the context of multidimensional IRT. In addition to filling a critical void in the literature, knowledge about the performance of the SE procedures can inform the selection of Wald test statistics for differential item functioning testing (see Woods, Cai, & Wang, 2013), goodness-of-fit testing (Cai, Maydeu-Olivares, Coffman, & Thissen, 2006), anchor item parameter invariance testing in the nonequivalent groups equating design, and test scoring when item parameter calibration uncertainty must be taken into account (Yang, Hansen, & Cai, 2012).
Overview of Standard Error Estimation Procedures
The chosen unidimensional model in the current study was the two-parameter logistic model (U2PL; Birnbaum, 1968). Two multidimensional models were used: a between-item, two-parameter logistic multidimensional model (BtM2PL) in which each item loads onto a single dimension (but with correlated dimensions), and the item bifactor model (BIFACT; Gibbons & Hedeker, 1992), which exhibits within-item multidimensionality where an item loads onto more than a single dimension.
For consistency, item parameters for all models in this study were estimated by maximum marginal likelihood with the Bock–Aitkin EM (BAEM) algorithm. Note that BAEM is one of the most popular algorithms in the unidimensional IRT modeling. In all cases, a latent trait vector
The SE estimation procedures evaluated in this study are outlined below. The XPD information matrix approach and the FIS reflect standard likelihood theory. SEM reuses the EM code to address one of the fundamental deficiencies of EM: the lack of an error covariance matrix as a by-product of the iterations. For more technical details of SEM, readers are referred to Meng and Rubin (1991), Cai (2008), and Cai and Lee (2009).
Empirical Cross-Product Approach
Let l(
where
Fisher Information Approach
An alternative expression for the information matrix is minus one times the expectation of the Hessian matrix H(ξ), which is the second derivative matrix of l(ξ):
Again, the square roots of the diagonal elements of
Supplemented EM Approach
BAEM estimation in IRT is fundamentally based on the missing data formulation for latent variable models (see Dempster, Laird, & Rubin, 1977). The latent traits are treated as the missing data, and in each E-step the conditional expectation of the missing data given observed data (the item responses) and provisional structural parameter estimates is used to set up M-step complete data estimation, iterating to convergence. The iteration converges at a linear rate in the EM algorithm, but its first-order iterative scheme does not produce an estimate of an error covariance matrix on convergence.
As the name suggests, SEM “supplements” the main EM by providing a procedure to compute the asymptotic covariance matrix for the maximum likelihood estimates. Compared with alternative SE estimation techniques (e.g., Louis, 1982), SEM is particularly easy to implement for BAEM and reasonably efficient for a wide class of models and problem sizes.
SEM calculates the observed data information matrix based on the missing information principle (Orchard & Woodbury, 1972), which states that the observed data information is the difference between the complete data information and the missing data information:
where Ic is the complete data information matrix, Im is the missing data information matrix, I is an identity matrix, and
In BAEM, the M-step is iterative. Typically the Newton–Raphson algorithm is used to optimize the M-step logit analysis log-likelihood. As such, I
c is obtained as a by-product. Therefore, to implement SEM, only the rate matrix Δ needs to be estimated. Each component of Δ is estimated with a “forced EM” process, which uses the code for running one cycle of EM and structural parameter estimates from the EM estimation history (both provisional and final converged estimates). Once Δ and I
c are obtained, the observed information matrix I
o is computed from Equation (3). The square roots of the diagonal elements of
Standard Error of the Item Difficulty Parameter in the Unidimensional Model
Because multidimensional IRT models subsume unidimensional IRT models, the general form of the linear predictor for the logistic IRT models used in the present study can be written as
Study Design
The manipulated design factors for data generation were test length (TL), sample size (N), and IRT models. TL was 16 (short test) or 40 (long test). N was 500 (small) or 1,500 (large). The IRT models were unidimensional two-parameter logistic model (U2PL), between-item two-parameter logistic multidimensional model (BtM2PL), and bifactor model (BIFACT). There were a total of 2 × 2 × 3 = 12 data simulation conditions. In each of the 12 conditions, three SE procedures were applied (FIS, XPD, and SEM) and the number of replications was 400 within each condition. Table 1 shows the summary of TL, N, IRT models, SE procedures used, and settings for numerical quadrature used.
Simulation Design.
Note. U2PL = two-parameter logistic model; BtM2PL = between-item, two-parameter logistic multidimensional model; BIFACT = bifactor model; XPD = empirical cross-product; FIS = Fisher’s expected information matrix; SEM = supplemented expectation–maximization; SE = standard error.
For reasons described earlier, the FIS was nearly or literally impossible to implement when the number of items was close to or greater than 20 with the personal computer available to the authors and so was only used for the short test condition (TL = 16). For the multidimensional conditions, the number of quadrature nodes and the range were reduced compared with the unidimensional case specifications to make estimation possible and to make estimation time within a reasonable boundary across replications. Note that the number of quadrature nodes and the ranges in the multidimensional cases are specified per dimension. Thus, for example, 19 nodes in a two-dimensional case require 192 = 361 total function evaluations. The quadrature settings are consistent with published research (see, e.g., Cai, Yang, & Hansen, 2011). For the multidimensional conditions with a long test (TL = 40), the number of dimensions was increased to better approximate realistic structures (e.g., four subdomains in an educational/psychological testing). In estimating all models here, a lognormal prior was used for the item slopes:
Generating Parameters
For both unidimensional and multidimensional models, the slope parameters and the difficulty parameters were randomly selected from truncated normal distributions:
When BtM2PL was used, the same number of items was assigned to each dimension. The number of items loading on each specific dimension in the bifactor model was held constant across all specific dimensions. For example, the three-dimensional bifactor model with a 16-item test has 8 items for each of the two specific dimensions, and the five-dimensional bifactor model with a 40-item test has 10 items for each of the four specific dimensions in addition to a general dimension on which all items load.
Convergence Criteria
Across all conditions, the maximum number of E-steps of BAEM was 1,000, and 500 was set as the maximum number of M-steps. The E-step and the M-step convergence tolerances for maximum absolute intercycle parameter difference were 10−6 and 10−9, respectively. The SEM convergence tolerance was 10−3.
Evaluation Criteria
Bias, absolute bias, and root mean squared difference (RMSD) were calculated for SEs of a generic parameter ξ across all replications in a given SE procedure:
where
where
For the overall comparisons of different SE procedures, averages of bias, absolute bias, and RMSD were taken across all items and are reported in this article. In addition to these overall summary measures, boxplots based on the mean SEs (across all replications) of model parameters were constructed for different types of parameters (e.g., slope parameter, intercept parameter, and difficulty parameter). For example, for a unidimensional model, boxplots for the mean SEs across all items were displayed for the slope, the intercept, and the difficulty parameters, respectively.
In many applied settings, computing time for a chosen SE procedure may be important. FIS is considered to be the gold standard. However, as the number of items increases, the FIS procedure becomes computationally infeasible. Other methods, such as XPD or SEM, may show much shorter run times whereas their SEs are not practically significantly different from the gold standard SEs. Thus, the elapsed run times to estimate SEs for different SE procedures were recorded in addition to the time required for estimating the parameters. Average run times for SEs and parameter estimation were also reported. Computer speed differences notwithstanding, a key takeaway point in the run-time comparison is the relative amount of time spent for the SE procedure to the total time required, which is the time spent on the main EM algorithm plus the SE procedure run time.
The R language (R Development Core Team, 2012) was used for item response data generation and calculation of summary measures. For the implementation of the EM algorithm and the SE procedures, the C++ numeric engine of flexMIRT (Cai, 2012) was used. The computer used for the parameter and SE estimation had an Intel Core i7 CPU processor with 2.0 GHz (8 GB RAM and Windows 7 Ultimate 64-bit operating system).
Results
All 400 replications in each condition converged. For the parameter recovery, no particularly odd cases or patterns emerged. Thus, only a brief summary of model parameter recovery, using bias and RMSD (where the criterion is the true parameter), is provided in the appendix. Note that the performance of BAEM for IRT model parameter estimation is well established in the IRT literature; the focus of the present study is on the SEs.
Boxplots that summarize the performance of different SE procedures were constructed for each condition. Figures 1 through 4 display the boxplots for all 12 design conditions. Three plots in a given row in Figures 1 through 4 represent the results from each of the 12 conditions. ESD represents the empirical standard deviation of parameter estimates from all replications. For the case of the bifactor model, the slope parameters for the primary (general) dimension (denoted by “aP”) and the slope parameters for specific dimensions (denoted by “aS”) were summarized separately.

Summary of SEs when TL = 16 and N = 500.

Summary of SEs when TL = 16 and N = 1,500.

Summary of SEs when TL = 40 and N = 500.

Summary of SEs when TL = 40 and N = 1,500.
Figures 1 and 2 present results for TL = 16 with N = 500 and 1,500, respectively. Figures 3 and 4 are for TL = 40 with N = 500 and 1,500, respectively. Boxplots show the means of the SEs for each parameter, with separate plots for each type of parameter. For boxplots showing the mean SEs of item parameters, the number of means represented within each plot is equal to the number of items (i.e., TL).
One exception is the two-dimensional between-item dimension correlation. The side-by-side boxplots for the two-dimensional between-item dimension correlation are titled “TL16 N500 BtM: SE(cor)” in Figure 1 which is located in the panel on row 2 column 3, and “TL16 N11500 BtM: SE(cor)” in Figure 2. In this two-dimensional between-item multidimensional model, there is only one correlation parameter. Therefore, ESD does not have any variation in the plot because it is a constant (equal to the standard deviation of estimated correlations across all replications). The boxplots for FIS, SEM, and XPD in the two-dimensional between-item case is not based on the mean SEs, but on the correlation estimates of the 400 replications for one unknown true correlation parameter.
Visual inspection of Figures 1 through 4 confirms the effect of sample size on all SE procedures. When sample size increases, SEs across all IRT models, test length, and all SE procedures tend to decrease. Test length and IRT models did not show any noticeable systematic patterns of impact on different SE procedures. Overall, all SE procedures produced results similar to ESD. Especially when sample size was large (N = 1,500), all SE procedure became close to one another and the ESD. However, when a long test (TL = 40) was combined with a relatively small sample size (N = 500), XPD exhibited some upward bias.
Averaged over items, bias and absolute bias estimates are shown in Table 2. Average RMSD where the criterion was ESD is provided in Table 3. Average bias shows that overall, SEs from all procedures tended to be slightly larger than the ESD (i.e., positively biased). Consistent with Figures 1 through 4, the impact of sample size on the reduction of bias, absolute bias, and average RMSD was salient. XPD’s slight upward bias, compared with FIS and SEM, was noticeable (see the highlighted absolute bias and bias values in Table 2). The largest average RMSD is also highlighted in bold; and it is revealed that XPD had the largest RMSD of all SE procedures on average (see Table 3). Compared with XPD, SEM showed smaller bias, absolute bias, and RMSD except for the bifactor model with TL = 16. In the conditions where FIS was used (TL = 16 and N = 500 and 1,500), SEM performed as well as FIS except in the case of bifactor model.
Average Bias and Average Absolute Bias.
Note. U2PL = two-parameter logistic model; BtM2PL = between-item, two-parameter logistic multidimensional model; BIFACTOR = bifactor model; XPD = empirical cross-product; SEM = supplemented expectation–maximization; SE = standard error. For a given SE procedure, the first row represents average bias and the second row represents average absolute bias. The highest average absolute bias values for a given parameter type (within each simulation condition) are in boldface. “cor” represents correlations between dimensions.
Average Root Mean Squared Difference (RMSD).
Note. U2PL = two-parameter logistic model; BtM2PL = between-item, two-parameter logistic multidimensional model; BIFACTOR = bifactor model; XPD = empirical cross-product; SEM = supplemented expectation–maximization. The largest average RMSD values for a given parameter type (within each simulation condition) are in bold type. “cor” represents correlations between dimensions.
The run-time summaries are provided in Table 4. The run-time values in Table 4 are time spent (in seconds) for the EM algorithm (time to obtain all structural parameter estimates), time spent by an SE procedure (time to obtain all SEs), and total time required (EM time plus SE procedure time). The last column in Table 4 is the SE time divided by the total time, providing relative computing cost associated with an SE procedure.
Run-Time Summary.
Note. U2PL = two-parameter logistic model; BtM2PL = between-item, two-parameter logistic multidimensional model; BIFACTOR = bifactor model; XPD = empirical cross-product; SEM = supplemented expectation–maximizations; SE = standard error.
Whenever FIS was used, the total run-time was heavily governed by the FIS procedure. The total amount of run time increased by a large amount with FIS (e.g., 308.82 seconds per run, compared with 3.15 and 7.11 seconds for XPD and SEM, respectively, for the two-dim-BtM2PL with T = 16 and N = 1,500). The relative cost of an SE procedure (SE/Total: the last column in Table 4) showed that FIS took 65% to virtually 100% of the total run-time. FIS took the longest total run-time with the highest relative run-time cost by its implementation for SE. The fastest total run-time was achieved by XPD, ranging from 0.16 of a second to 174.51 seconds. Also, XPD had the smallest relative run-time cost of the three SE procedures. XPD took 19% of the total run-time at the minimum and 44% at the maximum. SEM run-time was shorter than FIS but longer than XPD in terms of both total run-time and relative run-time cost. SEM took between 26% and 86% of the total run-time, with this maximum occurring when the number of dimensions was equal to 5.
Summary and Discussion
The present study was motivated by two observations. First, there is currently not enough information in the IRT literature on the degree to which SEs of structural parameters may differ as a result of different error covariance estimation procedures. Second, SEs of structural parameters are often of immediate interest to practitioners whenever IRT calibration is conducted. This study complements a recent study by Tian et al. (2013) by directly evaluating SEs from two well-known procedures (FIS and XPD) and a recent procedure (SEM).
Across all simulation conditions, sample size showed the largest visible impact on the accuracy of SEs from all three SE procedures, which is not surprising. Overall, FIS, XPD, and SEM produced similar results, and no bias of practical significance was found. The results from all SE procedures converge around the ESD when a large sample size (N = 1,500) was used. XPD’s slight upward bias was more pronounced than other conditions when a long test (TL = 40) was combined with a relatively small sample size (N = 500). In general, FIS showed the least amount of bias and smallest RMSD values across different sample sizes and IRT models for the test length condition where it was used. SEM showed the same level of performance as FIS for unidimensional and between-item multidimensional models. XPD in general showed larger bias and RMSD, although the differences between different SE procedures were small or negligible. Regarding the run-time comparison, XPD was the fastest, SEM was the second fastest, and FIS was the slowest. For the between-item multidimensional models, the computation time for SEM increased rapidly with the number of dimensions (e.g., for TL = 40 and N = 500, 1.2 seconds for U2PL and 235.56 seconds for two-dim-BtM2PL—which is about 80% of the total estimation time in both cases). Compared with SEM, the computation time for XPD never exceeded half of the total estimation time across all conditions.
FIS is not feasible when the number of items is large. In this study, FIS could only be used for the condition with TL = 16. Some readers might find that, based on findings in this study, XPD is the most appealing choice since it requires the least amount of computing time, with little difference in accuracy compared with the other SE procedures. Indeed, when the sample size exceeds the test length by many times (e.g., sample size of a few thousands with 40-60 or fewer items), XPD should perform well. However, XPD can become ill-suited to the case of long tests with small N, which may be more common in psychological testing or patient-reported outcome measurement situations than in educational measurement. In general, when the test length is large compared with the sample size and the number of dimensions is small (e.g., ≤3), SEM may be a better choice than XPD. When a very long test and a relatively small sample size (e.g., 300 test takers and a few hundred item, as in some patient-reported outcome measurement studies) are coupled with high dimensionality (e.g., ≥4), a recently proposed parameter estimation method, Metropolis–Hastings Robbins–Monro algorithm (Cai, 2010b) could be a computationally efficient option for estimating the structural parameters and their SEs.
Although attempts were made to accommodate important design factors relevant to practice, there are many limitations to this study. First, the criterion SE was the ESD based on 400 replications in each condition. A better criterion would be either FIS standard errors evaluated at the true parameter values or an ESD with a very large number of replications (e.g., ≥10,000). Second, due to limited computing resources, when multidimensional models were estimated, especially the four-dimensional between-item multidimensional model, the number of quadrature nodes and the range for the quadrature were reduced from the settings for unidimensional models. This might have introduced negative impact on XPD and FIS, because both procedures rely on the converged parameter estimates alone to calculate SEs. FIS did not seem to be affected much, but XPD’s slightly less optimal performance compared with the other SE procedures might be partly because of the lack of numerical integration precision, especially when the number of items is large and the sample size is small (TL = 40 and N = 500). Third, the performance of SE procedures in this study was evaluated across all items in a test. In a future study one might wish to investigate the SEs from different procedures in more detail, for example, separated by different levels of item difficulty and/or item slope. Fourth, the IRT models used here did not consider the effect of examinee guessing. Further investigations of different SE procedures for the three-parameter logistic model and its multidimensional versions may be desirable given their popularity in educational testing. Last, more extreme item parameter values and different population parameters may be used in future studies to investigate their impact on the SE procedures.
In practice, a researcher’s choice of the SE estimation procedure should be based on a complete set of features characterizing the IRT calibration in question, for example, sample size, test length, model complexity, and computing cost given the operational resource allocation. The current study, while still preliminary, shows that for a reasonably large sample size, any of the three procedures (XPD, FIS, or SEM) can be chosen with negligible differences in results. When the time-sensitive nature of operational test calibration becomes a key issue, XPD may be considered as the default choice. If one is faced with a very long test with a small sample size, SEM can be a practically viable option.
Footnotes
Appendix
Summary of Parameter Recovery
| U2PL | BtM2PL | BIFACTOR | |||||||
|---|---|---|---|---|---|---|---|---|---|
| d | a | b | d | a | cor | d | aP | aS | |
| TL = 16 and N = 500 | |||||||||
| Average bias | 0.004 | 0.030 | 0.002 | −0.019 | 0.010 | −0.004 | 0.079 | 0.107 | 0.040 |
| Average RMSD | 0.125 | 0.153 | 0.156 | 0.122 | 0.173 | 0.046 | 0.172 | 0.218 | 0.231 |
| TL = 16 and N = 1,500 | |||||||||
| Average bias | 0.001 | 0.012 | 0.001 | 0.036 | 0.004 | 0.012 | 0.006 | 0.046 | 0.027 |
| Average RMSD | 0.075 | 0.092 | 0.087 | 0.079 | 0.101 | 0.031 | 0.082 | 0.149 | 0.153 |
| TL = 40 and N = 500 | |||||||||
| Average bias | 0.004 | 0.064 | 0.001 | −0.040 | −0.065 | −0.021 | 0.014 | 0.060 | 0.009 |
| Average RMSD | 0.130 | 0.152 | 0.161 | 0.135 | 0.168 | 0.151 | 0.146 | 0.165 | 0.185 |
| TL = 40 and N = 1,500 | |||||||||
| Average bias | 0.004 | 0.025 | 0.003 | 0.016 | −0.009 | −0.005 | 0.018 | 0.025 | 0.027 |
| Average RMSD | 0.076 | 0.088 | 0.086 | 0.075 | 0.097 | 0.086 | 0.085 | 0.100 | 0.121 |
Note. U2PL = two-parameter logistic model; BtM2PL = between-item, two-parameter logistic multidimensional model; BIFACTOR = bifactor model; RMSD = root mean squared difference.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
