Abstract
Traditional estimators of item response theory scale scores ignore uncertainty carried over from the item calibration process, which can lead to incorrect estimates of the standard errors of measurement (SEMs). Here, the authors review a variety of approaches that have been applied to this problem and compare them on the basis of their statistical methods and goals. They then elaborate on the particular flexibility and usefulness of a multiple imputation–based approach, which can be easily applied to tests with mixed item types and multiple underlying dimensions. This proposed method obtains corrected estimates of individual scale scores, as well as their SEMs. Furthermore, this approach enables a more complete characterization of the impact of parameter uncertainty by generating confidence envelopes (intervals) for item trace lines, test information functions, conditional SEM curves, and the marginal reliability coefficient. The multiple imputation–based approach is illustrated through the analysis of an artificial data set, then applied to data from a large educational assessment. A simulation study was also conducted to examine the relative contribution of item parameter uncertainty to the variability in score estimates under various conditions. The authors found that the impact of item parameter uncertainty is generally quite small, though there are some conditions under which the uncertainty carried over from item calibration contributes substantially to variability in the scores. This may be the case when the calibration sample is small relative to the number of item parameters to be estimated or when the item response theory model fit to the data is multidimensional.
Keywords
Introduction
In applications of item response theory (IRT), it is common to score individuals using maximum marginal likelihood (MML) estimates of item parameters, which are ideally obtained using a large and independent calibration sample. In this process, the standard errors of measurement (SEMs) are estimated as either the reciprocal of the square root of the Fisher information function evaluated at the posterior mode for maximum a posteriori (MAP) scoring (or maximum likelihood scoring if there is no prior) or the standard deviation of the posterior under expected a posteriori (EAP) scoring. The use of MML estimates in conjunction with MAP or EAP scoring is a form of empirical Bayes (EB; Carlin & Louis, 2000).
However, both the scale score and its standard error—the latter in particular—are affected by the uncertainty in the item parameter estimates, which is carried over from the calibration process (Tsutakawa & Johnson, 1990). For frequentists, the item parameter estimates differ from true parameter values because of sampling error, which is represented in the standard errors of the parameter estimates. From a likelihood perspective, the uncertainty is reflected by the curvature of the log-likelihood, often represented as the information matrix of the item parameters. For Bayesians, point estimates of item parameters alone do not convey information about the width of the posterior distributions, which is nonnegligible unless the calibration sample size tends to infinity so that the posteriors become peaked. From any statistical point of view, then, uncertainty in the item parameter estimates is concerning.
Unfortunately, traditional scoring approaches (e.g., MAP or EAP) fail to acknowledge this uncertainty. Chief among the problems is an incorrect statement of measurement error. The scale score’s SEM is often underestimated and the scale’s marginal reliability overestimated, which means that the scores may be taken as having greater precision than is justified. As Cheng and Yuan (2010) pointed out, an understated SEM can also lead to premature termination of a computerized adaptive test. The problem is most pronounced in situations where a small calibration sample is used or when item parameters are estimated using the sample to be scored (Tsutakawa & Johnson, 1990).
Researchers have proposed a number of approaches to address this problem and related issues (e.g., Cheng & Yuan, 2010; Embretson, 1999; Hoshino & Shigemasu, 2008; Lewis, 1985, 2001; Mislevy, Wingersky, & Sheehan, 1993; Mislevy & Yan, 1991; Patz & Junker, 1999; Thissen & Wainer, 1990; Tsutakawa & Johnson, 1990; Tsutakawa & Soltys, 1988; Zhang, Xie, Song, & Lu, 2011). In the research reported here, we first provide a brief review of existing approaches, comparing them with respect to both their underlying methods and intended objectives. In broad terms, three types of methods have been applied to this problem: analytic approximations, fully Bayesian sampling based approaches, and multiple imputation (MI)–based approaches. These approaches have been used to accomplish two related but distinct goals: (a) to obtain corrected SEMs that take uncertainty in the item parameters into account and (b) to characterize the nature and impact of item parameter uncertainty on subsequent estimation and inference.
After reviewing these approaches, we provide a formal justification for the MI-based strategy and illustrate its use with a three-item artificial data set. Extending Mislevy et al.’s (1993) results, we use MI to obtain corrected estimates of individual scale scores and their standard errors of measurement. Using MI, we also characterize the specific ways in which uncertainty in the item parameters can affect scoring. Thissen and Wainer (1990) proposed that it may be helpful to visualize the uncertainty by generating confidence envelopes for item characteristic curves. We carry this idea further by constructing confidence envelopes for test information functions and conditional SEM curves. Such depictions make it possible to observe the ways in which the effects of parameter uncertainty vary across the values of the latent trait. The MI-based approach may also be used to obtain confidence intervals for the marginal reliability coefficient. After illustrating the approach with the artificial data set, we analyze data from a large educational assessment. Finally, through simulation studies, we examine the ways in which model complexity, calibration sample size, and test length contribute to these effects.
Our approach improves on the existing alternatives in several important ways:
First, in contrast to analytical approximation methods (e.g., Cheng & Yuan, 2010; Zhang et al., 2011) that explicitly require the calculation of a number of non–standard derivative matrices that are model specific, the MI-based approach is easily applied to any IRT model (e.g., unidimensional or multidimensional, dichotomous or polytomous) and scoring method (e.g., MAP, EAP, or even summed score EAP), provided that the asymptotic covariance matrix of the item parameter estimates is available. This flexibility is an important feature, as we note that previous studies have not considered item parameter uncertainty in the context of polytomous or multidimensional IRT models. Here, we illustrate the proposed approach for tests of mixed-item types (including two-parameter logistic [2PL] and three-parameter logistic [3PL] models for dichotomous data and a logistic graded-response model for ordered responses) analyzed using either a unidimensional or a multidimensional model, specifically the item bifactor model (e.g., Cai, Yang, & Hansen, in press; Gibbons & Hedeker, 1992).
Second, in contrast to fully Bayesian methods (e.g., Patz & Junker, 1999) that must rely on Markov chain Monte Carlo (MCMC) to sample the intractable posterior distributions, the MI approach is computationally simple and efficient. It relies on information that is routinely printed in the output of most standard IRT software programs, in conjunction with the easily accomplished task of random sampling from the multivariate normal distribution.
Third, in contrast to previous MI-based methods, in which analyses were either limited to single items or the between-item parameter error covariances were not estimated (and, thus, treated as zero), our approach makes use of a modern estimation algorithm (supplemented EM; Cai, 2008) for computing the asymptotic covariance matrix of the item parameters that is applicable to any IRT model. This covariance matrix is not an automatic by-product of the current gold standard Bock and Aitkin (1981) EM algorithm of item parameter estimation.
Finally, this approach systematizes a number of seemingly disparate methods under a single framework, including Thissen and Wainer’s (1990) confidence envelopes and Lewis’s (1985) expected response functions. Even approximation methods based on pseudo maximum likelihood (e.g., Hoshino & Shigemasu, 2008) can be reinterpreted in light of the MI framework.
A Brief Review of Existing Approaches
Two-Stage Versus Single-Stage
With the exception of fully Bayesian sampling–based methods, nearly all existing proposals assume a two-stage process for estimating the IRT scale scores. The items are first calibrated, preferably using MML or Bayesian methods. In this stage, the item parameters are estimated, as well as their error covariance matrix. In the second stage, the item parameters are used to produce the IRT scale scores. Corrections are generally made in the second stage, using (a) the point estimates, (b) the error covariance matrix, and (c) either additional derivatives and linearization arguments for analytic approximations or, in the case of MI methods, random sampling from an approximation to the posterior distributions of the item parameters. Two-stage methods are popular not only because they generally lead to sound estimates but also because they are consistent with the standard operating procedures in applied educational and psychological testing situations.
Fully Bayesian sampling–based methods (e.g., Patz & Junker, 1999), on the other hand, involve a single stage. They rely on MCMC to produce random draws from a Markov chain having the full joint posterior of the item parameters and the individual latent variables as its invariant distribution. Under the ergodicity of the Markov chain, dependent samples from the chain can be used to approximate the full posterior. When inferences regarding individual latent traits are desired, one simply “marginalizes” over the other dimensions of the posterior—that is, by integrating out the item parameters—and the IRT scale scores thus obtained (e.g., as posterior means) would already have taken the uncertainty in item parameters into account. Given the MCMC output, marginalization amounts to ignoring that part of the MCMC output related to the item parameters. The single-stage approach is appealing conceptually. However, a significant barrier to its widespread adoption is its complexity, computational intensiveness, and, in some cases, inflexibility. Even as MCMC gains popularity, its proper usage still requires considerably more effort and statistical expertise when compared with more deterministic and better understood methods such as the EM algorithm. As Edwards (2005) noted, a number of competing MCMC samplers have been proposed for IRT, and the question of their relative algorithmic efficiency has not been entirely settled in the methodological literature. Furthermore, we point out that in contrast to two-stage IRT scoring methods, it is difficult to conceive how the single-stage approach can easily accommodate certain nonstandard but nevertheless popular and useful IRT scoring algorithms such as summed-score to EAP translations (Thissen & Wainer, 2001).
Two Related Goals: Correction Versus Characterization
Most existing approaches have sought solely to obtain corrected estimates of SEM for IRT scale scores that take into account the uncertainty in the item parameters. These include analytic approximations based on Bayesian calculations (Tsutakawa & Johnson, 1990; Tsutakawa & Soltys, 1988), analytic approximations based on pseudo maximum likelihood (Cheng & Yuan, 2010; Hoshino & Shigemasu, 2008), as well as MI-based expected response functions (Lewis, 1985, 2001; Mislevy et al., 1993). MCMC methods (e.g., Patz & Junker, 1999) may also be regarded as belonging to this category.
For analytic approximations (Bayesian or likelihood), the asymptotic argument is typically based on Taylor series expansions of the nonlinear estimating equations of the IRT scale scores. From the first-order (sometimes second-order) approximation, corrected standard error formulae are obtained. In contrast, the expected response functions are motivated by an explicit analogy to MI for treating missing data (Rubin, 1987). Using the point estimates of item parameters to represent the item response functions amounts to a single round of imputation, replacing the unknown (hence “missing”) item parameter values by the modes of their marginal log-likelihood. In contrast, an MI approach uses more than one random imputation to recover the uncertainty caused by not knowing the item parameters exactly. Averaged response functions under multiple random imputations are expected response functions.
Thissen and Wainer (1990) demonstrated an approach with an entirely different goal. Rather than seeking corrected estimates, they used confidence envelopes around the item response functions to reflect the uncertainty in the item parameters. This approach can be viewed as complimentary to the expected response function approach of Mislevy et al. (1993). Unlike Mislevy et al., however, the focus of Thissen and Wainer (1990) is not on obtaining corrected SEM for the scale scores. The resulting graphical displays (referred to as the M-line plots) are simply used to characterized how errors in the item parameter estimates affect the plausible shapes of item characteristic curves. The confluence of Thissen and Wainer’s (1990) confidence envelopes and Mislevy et al.’s (1993) expected response functions leads to the dominating insight of this research. That is, MI provides a natural framework to integrate the two goals, simultaneously providing corrected SEMs and visual representations of the uncertainty.
The Proposed Approach
Some Notation
Without loss of generality, let
If the prior is taken to be uniform, then the posterior is proportional to the marginal log-likelihood logL(
Because of the Bernstein–von Mises phenomenon (see, e.g., van der Vaart, 1998), it is well known that a multivariate normal with mean
where
Illustration
To illustrate the proposed MI-based procedure, we created a simple data set consisting of 500 simulated responses to three items. Each of the items was of a different type: 2PL, 3PL, and three-category graded response (GRM3). Although this illustration is somewhat unrealistic, the very short test length offers some advantages. Most important, it allows us to easily present the generating and estimated values of all eight item parameters, their error covariance matrix, and examples of the randomly imputed parameter sets. In addition, the numbers of possible response patterns (12) and summed scores (5) for this three-item test are small, allowing us to demonstrate the MI-based scoring for each possibility. Finally, despite the simplicity of this example, it nonetheless features two kinds of complexity not previously considered in studies addressing parameter uncertainty: polytomous items (as represented by Item 3) and tests of mixed-item types. These features will also be present in the real data set considered later.
The generating item slopes and intercepts for this illustration were randomly drawn from distributions chosen to resemble values commonly observed in educational or psychological testing. Both data generation and parameter estimation were conducted using IRTPRO (Cai, du Toit, & Thissen, in press).
Table 1 shows the generating values, along with the MML estimates and their error covariance matrix. The standard errors of the item parameter estimates are the square roots of the diagonal elements of this matrix. The off-diagonal elements of the matrix represent covariances between the parameters. The fact that some of these elements are nonzero is notable, as some prior methods seeking to account for parameter uncertainty have ignored these covariances. In the case of the 3PL model (used for Item 2), the guessing parameter is expressed as the logit of the guessing probability g; the generating value of − 1.39 corresponds to a probability of about .2, as might be expected for a multiple choice item with five options. The MML parameter estimates and the error covariance matrix will be used in the MI-based approach, as we describe in the following sections.
Item Parameter Estimates and the Asymptotic Covariance Matrix for the Three-Item Artificial Data Set
Note. 2PL = two-parameter logistic model; 3PL = three-parameter logistic model; GRM3 = three-category graded response model.
Multiple Imputation Inference for EAP Scores
Preliminaries
We now consider EAP scoring under item parameter uncertainty. The logic developed in this section, however, is quite general. Though we will assume a unidimensional θ for simplicity of notation, we note that the methods apply directly to multidimensional IRT models, where θ is a vector. In the ideal case where item parameters are known, inference for θ for an individual with J × 1 response pattern
The EAP estimator with given
with SEM given by the square root of posterior variance
If the item parameters are unknown, standard practice is to use the “plug-in” estimator, in which the MML estimates
ignores the inherent variability of
A formal justification for the proposed multiple imputation procedure
Tsutakawa and Johnson (1990) demonstrated that to properly account for the uncertainty in
Note that a critical feature of Equation (5) is that the posterior of the item parameters f(
This estimator does not depend on any particular values of
also automatically takes the uncertainty in
It turns out that f(
Now, we have an MI algorithm to approximate
Draw M > 1 sets of values from a multivariate normal distribution with mean
Plug each
The MI EAP approximation to
which approximates the right-hand side expression of Equation (8).
4. The MI variance approximation is
where
The ratio
is known as the relative increase in variance due to missing data (Schafer, 1997). It can be taken as a crude measure of the impact of item parameter uncertainty. We will use this measure in the simulation studies we describe subsequently.
As previously mentioned, the MI-based approach may be applied to any number of other scoring methods. Thus, in our analyses, we present results not only for pattern EAP scoring but also for summed score to EAP translations (Lord & Wingersky, 1984). As will be demonstrated, the impact of item parameter uncertainty may differ according to scoring method.
Illustration for the proposed procedure
The MI-based procedure was applied to the three-item data set, following the steps described above.
We drew M = 20 sets of plausible item parameter values from a multivariate normal distribution with mean
We now plug each
The MI-based approximation to
The MI-based variance approximation
MML Parameter Estimates and Random Imputations for the Three-Item Artificial Data Set
Note. MML = maximum marginal likelihood; 2PL = two-parameter logistic model; 3PL = three-parameter logistic model; GRM3 = three-category graded response model.

Item trace lines and posterior distributions for three response patterns
EAP Scores Based on Imputed Item Parameters
Note. EAP = expected a posteriori; MI = multiple imputation.
The scores obtained using the MML parameter estimates are compared with MI-based scores in Figure 2. These two sets of scores are almost perfectly correlated, for both the response pattern EAPs and summed-score EAPs. This result is consistent with previous studies; although item parameter uncertainty may result in overestimates of score precision, it does not necessarily result in biased scores.

Traditional and MI-based EAP estimates
In Figure 3, the SEM are plotted against the score estimates. There are differences in the SEM. As expected, the SEMs are generally larger for the MI scores. The magnitude of the correction, however, is substantially larger for the full-pattern EAP scores than for the summed-score EAPs (consistent with the values of r reported in Table 4).

MI-based scale score and SEM corrections
Response Pattern and Summed-Score EAPs
Note. EAP = expected a posteriori.
Multiple Imputation Confidence Envelopes
Confidence envelopes for item characteristic curves
Thissen and Wainer (1990) considered confidence envelopes for the item characteristic curves. Let T
k
(θ|
where c is the item intercept, a is the item slope, and both are components of
We present here a slight variation of Thissen and Wainer’s (1990) basic idea:
Draw M > 1 sets of values from a multivariate normal distribution with mean
Choose a reasonably fine grid to numerically represent θ, for example, from − 3 to + 3 in step sizes of .01.
Plug each
Repeat the last step for all θ levels to find a (1 − α) × 100% confidence envelope.
To ensure that the boundaries of the confidence envelopes are well characterized, a large M is necessary. We generally use M = 1,000 random draws.
Figure 4 presents confidence envelopes for the three items in our illustration. The upper, middle, and lower panels correspond to Items 1 (2PL), 2 (3PL), and 3 (GRM) in order. The solid curve is the usual item characteristic curve based on the MML item parameter estimates. The dotted curves represent the upper and lower 95% confidence limits, based on M = 1,000 random samples drawn from the multivariate normal approximation to the posterior distribution of the item parameters. Incidentally, we also computed the expected response functions (dashed curves), obtained by averaging the response functions across all the imputed parameter sets. As Mislevy et al. (1993) noted, the expected response functions are not logistic and tend to have slightly lower slopes than the item response function based on the MML item parameter estimates.

Confidence envelopes for item trace lines
Confidence envelopes for information and SEM curves
For general multiple-categorical IRT models, item i’s Fisher information function is given by the following expression:
where K is the number of categories (see Baker & Kim, 2004). The Fisher information functions are additive. For N items, the test information function is the sum of item information functions:
When a prior for the ability distribution is used in scoring (e.g., MAP scoring), test information must also include the contribution from the prior. The standard error of measurement curve is found as a one-to-one transformation of the test information curve:
Recognizing that F(θ |
Draw M > 1 sets of values from a multivariate normal distribution with mean
Choose a reasonably fine grid to numerically represent θ, for example, from − 3 to + 3 in step sizes of .01.
Plug each
Repeat the last step for all θ levels to find a (1 − α) × 100% confidence envelope.
Because F(θ |
As an illustration, consider the confidence envelopes for the test information function and the conditional SEM for the three-item data set, presented in Figure 5.

Confidence envelopes for test information and the conditional SEM
These curves highlight the extent to which our characterizations of test information or the conditional SEM can be affected by item parameter uncertainty. This may have implications for test assembly, in which particular combinations of items may be selected to produce a particular test information or conditional SEM curve. The width of the confidence envelopes we observe for these functions suggests that assembling tests to closely match target information of SEM curves may be misguided. The confidence envelopes generated through the MI-based approach allow the imprecision because of uncertainty in item parameters to be visualized.
Confidence intervals for marginal reliability
As a final application, we consider confidence intervals for the marginal reliability coefficient, which is an important IRT-based measure of overall scale reliability. Let the prior f(θ) for the ability distribution have variance σ2. Marginal reliability is defined as
(see, e.g., Thissen & Wainer, 2001), where the integral in the numerator returns the average error variance, and sem(θ |
Draw M > 1 sets of values from a multivariate normal distribution with mean
Plug each
For the three-item data set, the MML item parameter estimates yield an estimated marginal reliability of .29 (which is low, as expected, given the very short test length). However, the MI-based approach offers a more complete characterization—that the marginal reliability has a 95% confidence interval between .17 and .43 based on M = 1,000 imputations.
Summation
The proposed MI approach provides not only corrected scale scores and standard errors of measurement that take parameter uncertainty into account but also confidence envelopes or intervals for other important quantities such as the item characteristic curve, the test information function, the conditional SEM curve, and the marginal reliability coefficient. As demonstrated with the three-item data set, the approach is flexible enough to be applied to not only full-pattern EAP scores but also other scoring methods (e.g., summed-score EAP) and various IRT models (e.g., 2PL, 3PL, or GRM).
Application to Empirical Data
Data and Method
As an empirical demonstration of the proposed MI approach, we analyzed data from the 2000 Program for International Student Assessment (PISA; Adams & Wu, 2002). We extracted a random sample of 1,500 students from the United States with complete responses for Math Booklet 8. For our analyses, we used 500 students as a calibration sample and 1,000 students as a scoring sample. Math Booklet 8 consists of 15 items—8 free response (FR), 5 multiple choice (MC), and 2 complex multiple choice (CMC). Logistic graded-response models were fit to the FR and CMC items with the number of ordered categories determined by the number of different scores assigned (either 2 or 3). A 3PL model was used for the MC items. In estimating item parameters, a log-normal prior was placed on the guessing parameter. The mean of this prior was based on the expected probability of a correct response, assuming blind guessing and given the number of response options. Item parameter estimates and the item parameter asymptotic covariance matrix were obtained using IRTPRO (Cai, du Toit, et al., in press).
Because PISA items are nested within testlets or passages, we obtained estimates for both a unidimensional (ignoring the testlet structure but consistent with how the test data are scored in practice) and a bifactor item response model with specific factors modeling testlet effects. Parameter estimation and scoring for the bifactor model has been described elsewhere (Cai, Yang, et al., 2011). Here, we focus on the effects of item parameter uncertainty on the precision of scores for the general dimension and show the application of the MI-based method in both the unidimensional and multidimensional IRT contexts.
As outlined in previous sections, the MML item parameter estimates and the item parameter error covariance matrix were used to generate multiple sets of plausible parameter values. For the unidimensional model, 39 parameters were estimated. The bifactor model required estimation of additional 10 parameters (specific factor slopes for the 10 items loading on three specific factors).
As before, 20 sets of parameters were imputed to obtain the MI-based score estimates based on the full response patterns and summed scores for individuals in the scoring sample. Confidence envelopes for test information and the conditional SEM and a confidence interval of the marginal reliability coefficient were based on M = 1,000 imputations.
Results
As shown in Figure 6, the scale score estimates are almost perfectly correlated for both unidimensional model and bifactor model, regardless of the scoring method (full response pattern or summed-score EAP).

Estimated IRT scale scores for PISA Math Booklet 8
Figure 7 shows the scale score and SEM estimates and their corrections based on the MI approach. For most individuals in the scoring sample, the SEM increases under the MI-based approach. Across the scoring sample, the average relative increase in variance (r) based on Equation (12) was 3.5% for the unidimensional model and 7.6% for the bifactor model under full–response pattern scoring. In contrast, the average relative increase in variance for the summed-score EAP estimates was less than 1% for either scoring method. In addition to having larger average increases, the impact of item parameter uncertainty on the full-pattern scores appears to be more variable than for the summed score translations. Specifically, there are some response patterns with vary large increases in the SEM (up to 32.9% for the unidimensional model and 67.1% for the bifactor model). The increases for the summed score are much more uniform across all response patterns.

Traditional and MI-based response pattern and summed-score SEM estimates plotted against EAPs for PISA Math Booklet 8
The 95% confidence envelopes for test information and the conditional SEM for the unidimensional model are shown in Figure 8. Graphical representations and proper interpretation of test information in the multidimensional case, though potentially quite interesting, are beyond the scope of this study.

Confidence envelopes for test information and the conditional SEM for PISA Math Booklet 8
For this 15-item test, the MML parameter estimates produce a marginal reliability of .82. From the MI-based approach, its 95% confidence interval was found to be .78 to .84.
A Simulation Study
Previous studies have shown that the magnitude of bias in the SEMs can vary depending on conditions such as item response model complexity, calibration sample size, and test length, all of which influence the test information. For example, Tsutakawa and Johnson (1990) observed that relatively simple Rasch models tend to provide decent estimation of the latent ability levels and their SEMs, even with small calibration samples. In contrast, a 3PL model with a calibration sample size of 400 individuals produced biases of the posterior means of the latent trait and underestimation of the posterior standard deviations by more than 40%, on average.
As a complement to the simple illustration and the empirical analysis of PISA data, a simulation study was conducted. The goal was to further characterize the conditions in which the uncertainty carried over from item calibration leads to uncertainty in the scoring process. Based on the findings of previous studies, we manipulated the following: model complexity, calibration sample size, and test length. For this study, the number of imputations was fixed to M = 20, as previous research has already demonstrated that 20 imputations should provide reasonably good correction of the SEMs (Mislevy et al., 1993). The combination of the number of items (J = 5, 10, 20, 40), the size of calibration sample (N = 500, 1,000, 2,000, 5,000), and item type (2PL, 3PL, and logistic GRM with five categories) yielded a total of 48 conditions.
For each condition, 500 calibration samples were generated; in addition, one independent scoring sample of 10,000 cases was produced. The item parameters used for data generation were chosen to resemble estimates obtained from real educational and psychological data sets. Following data generation, we calibrated the items using MML estimation and obtained the parameter error covariance matrix. There were calibrations for which either the EM algorithm (used to obtain the parameter estimated) or supplemented EM (needed for the covariance matrix) failed to converge. These replications (about 9% of the total) were omitted from the subsequent analyses. Twenty item parameter sets were randomly drawn from the multivariate normal approximation to the posterior distribution of the item parameters, as described in previous sections. These parameter sets were then used to obtain the MI-based scores and SEM for the scoring sample. For each of the accepted calibrations (500, for most conditions), r was calculated for each of case in the corresponding scoring sample. The median value of r (across calibrations) was then obtained for each case. These medians were then averaged across the entire scoring sample and within deciles (1,000 individuals) of the “true” level of the latent trait for each condition. The results are reported in Table 5 and Figure 9.
Relative Increase in Variance, r(%)

Mean relative increase in variance (by deciles) under various simulated test lengths and calibration sample sizes for the logistic graded-response model with five categories
For most conditions, the average relative increase in variance is rather small, ranging from 1.8% to 14.7%. However, some trends are evident. For a given test (i.e, for a fixed number of items and item type), the average relative increase in variance decreases as the size of the calibration sample increases. This is to be expected, as larger calibration samples yield more precise estimates of the item parameters. The patterns of relative increase across tests of different length and across item types are more complex. Test information increases with the number of items. Thus, for longer tests, we expect smaller “within”-imputation error variance. At the same time, longer tests require estimation of more item parameters, resulting in greater “between” variance. In cases where the calibration sample is small, those parameter estimates may not be very precise. Consequently, the highest relative increase in variance was observed for the 40-item tests with the smallest calibration sample examined. Even for these conditions, however, the increases appear rather modest.
Figure 9 presents a more detailed view of the relative increase in variance for the five-category graded items across the simulation conditions of varied test length and calibration sample size. Here, the average relative increase in variance is obtained within each decile of the true scores on the latent trait (which are known for these simulated data but, of course, are never known in practice). As in Table 5, it is clear that the relative increase in variance diminishes with increasing sample size. It is also apparent that the percentages are not uniform across the range of the latent trait. For those conditions where there is substantial relative increase in variance (e.g, J = 40, N = 500), the percentages are greatest at the highest and lowest deciles of the true θs. This is again consistent with existing results obtained by analytic approximation (e.g., Cheng & Yuan, 2010) for simpler IRT models.
Discussion
In this research, we have proposed an MI-based approach for characterizing the ways in which uncertainty about item parameters affects item response functions, test information functions, SEMs, and marginal reliability. We argue that the approach presents several advantages over the existing alternatives. First, it can be applied to a variety of IRT models and scoring methods. Second, it is computationally simple and uses information that is readily available in the output of standard IRT software programs. Third, the approach makes use of the asymptotic covariance matrix of the item parameters, obtained through an application of the supplemented EM algorithm (Cai, 2008). This allows us to conduct analyses of entire tests, whereas past efforts have mostly focused on parameter uncertainty with respect to individual items. Finally, our proposed approach connects a variety of seemingly disparate methods that have been used to handle item parameter uncertainty. In so doing, our approach integrates the related goals of providing corrected standard errors of measurement and the (highly visual) characterization of the effects of uncertainty.
To demonstrate the relevance of the proposed approach to the problem of uncertainty in item parameters, we derived approximations for EAP scores and their SEM (Equations 8 and 9, respectively) that use the MML estimates of the item parameters and their covariance matrix. The MI-based EAP approximation is an average of the EAP scores obtained with M > 1 sets of plausible values for the item parameters. The error variance for this estimate is a combination of the within- and between-imputation variances. The square root of the total error variance provides a corrected SEM.
Crucial to the MI-based method is the random sampling of plausible item parameter values from a multivariate normal approximation to the item parameter posteriors. The MI-based score estimates and SEMs are obtained by scoring with these imputed parameter sets and combining the results. In addition to these corrections, we constructed confidence envelopes for item characteristic curves, building on the work of Thissen and Wainer (1990). A depiction such as this helps to convey the extent to which error in the item parameters leads to uncertainty about the shape of the response functions. Importantly, the amount of uncertainty may vary across levels of the latent trait. A similar approach was taken to generate confidence envelopes for the test information function and the conditional SEM curve. Finally, we used MI to calculate confidence intervals for marginal reliability. This allows us to convey the uncertainty in the marginal reliability that is because of error in the estimation of the item parameters.
After illustrating the proposed approach with an artificial, three-item data set, we examined data from the 2000 PISA math test. We also conducted simulations, which allowed us to investigate how various factors can influence the uncertainty in scores. In these simulations, we manipulated the size of the calibration sample, the length of the test, and the complexity of the response model. The effect of parameter uncertainty was quantified as the relative increase in error variance. This analysis demonstrated that, on the whole, parameter uncertainty contributes little to total error variance. However, in situations where the calibration sample is small and the number of items is large (and especially in the case of a more complex response model), the error carried over from item calibration may occasionally be nonnegligible.
It is important to know the extent to which the latent trait estimates are uncertain because a number of critical decisions are based on the SEM. In variable-length computerized adaptive testing algorithms, for example, the SEM is often used as a termination criterion. In such cases, underestimation of SEM can result in premature termination of the test. In addition, such underestimation could result in flawed inferences concerning individuals’ standing relative to a certain performance standard or to one another. Specifically, scores might be assumed to have greater precision than is warranted, given the known uncertainty in the item parameters. Standard errors have also been incorporated into statistical models such as hierarchical linear models with latent variables (Raudenbush & Bryk, 2002). For such applications, improved SEM estimates will enhance estimation of regression parameters and their associated standard errors. The MI-based approach presented in this article provides a simple and flexible means of obtaining these improved estimates.
Footnotes
Appendix
The EAP estimator of θ is a posterior expectation, that is,
From Equation (5), the equation above can be rewritten as
Interchanging the order of integration, we see that
where the last line requires the conditional independence of
Acknowledgements
We thank Dr. David Thissen for comments on an earlier draft.
The views expressed in this article belong to the authors and do not reflect the views/policies of the funding agencies or grantees.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:
The authors acknowledge financial support from the following sources: Institute of Education Sciences (R305B080016 and R305D100039) and the National Institute on Drug Abuse (R01DA026943 and R01DA030466).
