The Impact of Variability of Item Parameter Estimators on Test Information Function

Abstract

This article investigates the impact of uncertainty about item parameters on test information functions. The information function of a test is one of the most important tools in item response theory (IRT), and inaccuracy in its estimation can have substantial consequences for data analyses based on IRT. The major part (called the adjusted term) of the deviation of an estimated test information function from the true test information function due to the uncertainty of item parameters is approximated asymptotically. A simulation study shows that this approximation captures the difference between the estimated and the true information functions rather well, and a real data example shows that the magnitude of an estimated adjusted term can be substantial when the calibration sample size is relatively small.

Keywords

item response theory, IRT, estimation error, measurement-error approach, item information, test information

1. Introduction

The information function of a test is one of the most important quantities in item response theory (IRT). The reciprocal of the test information function is the (asymptotic) variance of the maximum likelihood estimator of ability (Lord, 1980). The test information function is the sum of item information functions, which can be obtained from item response functions or models and their first derivatives. These IRT models contain both item and examinee ability parameters. When applied to analyze test item response data, a typical IRT procedure first estimates item parameters, then regards these estimates as the true values of item parameters in the subsequent statistical analyses (e.g., ability estimation, equating). However, in general, parameter estimators are subject to estimation errors (bias, variance, and covariance). The presence of estimation errors can be damaging to results from IRT-based analyses. Researchers in educational statistics and psychometrics are becoming more and more interested in this issue. Mislevy (1992) suggested an approximation for the variance of ability estimates under the Rasch model by Cohen’s closed-form approximation. Lewis (1985, 2001) incorporated the uncertainty regarding item parameters into expected response functions, which are the expectations of the original item response functions with respect to the posterior distributions of item parameters. This methodology was applied in various contexts (Mislevy, Sheehan, & Wingersky, 1993; Mislevy, Wingersky, & Sheehan, 1994). Tsutakawa and Soltys (1988) used a Bayesian method to approximate the magnitude of the statistical inferential errors based on the standard IRT procedure. Tsutakawa and Johnson (1990) studied the effect of uncertainty of item parameters on ability estimation and showed that the standard IRT practice of using maximum likelihood or empirical Bayesian techniques may underestimate the variance of estimated ability when a calibration sample is only moderately large. 
Oosterloo (1984) derived an asymptotic distribution and confidence intervals for the test information function under Rasch models using conditional maximum likelihood estimation (MLE). However, this method cannot be applied directly to the two- and three-parameter logistic (2PL and 3PL) models. Zhang, Xie, Song, and Lu (2011) used a measurement-error approach and demonstrated how the uncertainty of item parameters can cause a large bias in ability estimation in some cases (see also Zhang & Lu, 2007). Along this line, this article investigates the impact of estimation errors of item parameter estimators in 2PL and 3PL models on test information functions.

2. Asymptotic Results

Suppose that a test consists of n dichotomous items. Two IRT models, depending on item types, are widely used in the analysis of dichotomously scored response data: A three-parameter logistic (3PL) model is typically used for the multiple-choice items (which are scored correct or incorrect), and a two-parameter logistic (2PL) model is used for the short constructed-response items, also scored as correct or incorrect. The 3PL model (Birnbaum, 1968) is

P_{i} (θ) = P (θ; a_{i}, b_{i}, c_{i}) = c_{i} + (1 - c_{i}) \frac{1}{1 + \exp\{- D a_{i} (θ - b_{i})\}}, \tag{1}

where D is a constant, usually 1.7, and $a_{i}$ , $b_{i}$ , and $c_{i}$ are the item discrimination, difficulty, and lower-asymptote parameters, respectively. Denote

F_{i} (θ) = F (θ; a_{i}, b_{i}) = \frac{1}{1 + \exp\{- D a_{i} (θ - b_{i})\}}, \tag{2}

and $G_{i} (θ) = 1 - F_{i} (θ)$ . Note that $F_{i} (θ)$ is the IRF of a 2PL model and $P_{i} (θ) = c_{i} + (1 - c_{i}) F_{i} (θ)$ . A one-parameter logistic (1PL) model is

F_{i}^{*} (θ) = F (θ; 1, b_{i}) = \frac{1}{1 + \exp\{- D (θ - b_{i})\}}, \tag{3}

and $G_{i}^{*} (θ) = 1 - F_{i}^{*} (θ)$ .

Let $I_{n} (θ) = I_{n} (θ; a, b, c)$ be the test information function (see Lord, 1980) of items with 3PL models, where $a = \{a_{1}, a_{2}, \dots, a_{n}\}$ , $b = \{b_{1}, b_{2}, \dots, b_{n}\}$ , and $c = \{c_{1}, c_{2}, \dots, c_{n}\}$ . In this article, item parameters are not always written out explicitly in a function of item parameters (e.g., $P_{i} (θ)$ and $I_{n} (θ)$ ) whenever the function is evaluated at the true item parameters and the omission causes no confusion. When item parameters are known, the asymptotic variance of the maximum likelihood estimator of θ is $\mathrm{Var} (\hat{θ}) = 1 / I_{n} (θ)$ . For 3PL models, the test information can be expressed as

I_{n} (θ) = D^{2} \sum_{i = 1}^{n} a_{i}^{2} (1 - c_{i}) F_{i} (θ) G_{i} (θ) K_{i} (θ),

where

K_{i} (θ) = K (θ; a_{i}, b_{i}, c_{i}) = \frac{F_{i} (θ)}{P_{i} (θ)} = \frac{1}{1 + c_{i} \exp\{- D a_{i} (θ - b_{i})\}} .

Note that $0 < K_{i} (θ) \leq 1$ , and $K_{i} (θ) = 1$ if item $i$ is modeled by a 2PL model. Thus, when all items are modeled by 2PL models,

I_{n} (θ) = I_{n} (θ; a, b) = D^{2} \sum_{i = 1}^{n} a_{i}^{2} F_{i} (θ) G_{i} (θ) .
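As a minimal numerical sketch of the formulas above, the following Python code evaluates the 3PL test information function, treating a 2PL item as a 3PL item with $c_{i} = 0$ . The function name `information_3pl` and the choice of three items are illustrative assumptions, not part of the original analysis; the parameter values are those of Items 1, 27, and 41 in Table 1.

```python
import numpy as np

D = 1.7  # scaling constant from the text

def information_3pl(theta, a, b, c):
    """I_n(theta) = D^2 * sum_i a_i^2 (1 - c_i) F_i G_i K_i, with K_i = F_i / P_i.

    A 2PL item is handled by setting its c_i to 0, so that K_i = 1.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]  # shape (m, 1)
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    F = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))  # 2PL response function F_i
    G = 1.0 - F
    P = c + (1.0 - c) * F                            # 3PL response function P_i
    K = F / P
    return (D**2 * a**2 * (1.0 - c) * F * G * K).sum(axis=1)

# Illustrative three-item test: Items 1, 27, and 41 of Table 1
a = [0.623, 1.506, 2.303]
b = [-0.872, -0.495, 0.609]
c = [0.000, 0.215, 0.418]
info = information_3pl([-1.0, 0.0, 1.0], a, b, c)
print(info)
```

For a single 2PL item evaluated at $θ = b_{i}$ , $F_{i} = G_{i} = 1/2$ , so the function returns $D^{2} a_{i}^{2} / 4$ , which gives a quick check of the implementation.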

Models 1 through 3 contain both item and examinee parameters. In practice, as mentioned in Section 1, item parameters are first estimated and then assumed to be the true values when subsequent statistical analyses are performed. Suppose that item parameters are estimated using a calibration sample with J examinees. The item parameter estimators, ${\hat{a}}_{i}$ , ${\hat{b}}_{i}$ , and ${\hat{c}}_{i}$ , depend on $J$ . For convenience, the label J is suppressed in these and other related quantities unless it is needed. In applications, the estimated test information function, ${\hat{I}}_{n} (θ) = I_{n} (θ; \hat{a}, \hat{b}, \hat{c})$ , is used instead of $I_{n} (θ)$ .

Every estimator has chance error, and possibly bias. The basic equation of an estimator is

\text{Estimator} = \text{True value} + \text{Bias} + \text{Chance error} . \tag{6}

The chance error, measured by the variance (or standard error [SE], the square root of the variance), affects the estimates (the values of the estimator) randomly, causing the estimates to differ from the true value in different directions, while the bias affects all estimates in the same direction. Bias may not exist (equal zero) if an estimator is unbiased, but chance error is inevitable. For multiple unknown parameters, chance errors of parameter estimators are measured by the variances (or SEs) and covariances of these estimators.

Two research questions can be raised here: (1) Do estimation errors of item parameter estimators have an impact on statistical inferences based on estimated item parameters or estimated IRT models? (2) If yes, when do they have a substantial impact and what are the consequences?

In statistics, a model typically has unknown parameters. One needs to estimate them first and then use the estimated model to make predictions or to explain some phenomena or relationships among variables. In the latter stage, the parameter estimators are treated as covariates and their uncertainty due to estimation errors is taken into account. If there are simple closed formulae for the variances and covariances of those parameter estimators, it is usually not too difficult to show the impact of the uncertainty about model parameters. Below, we take simple linear regression as an example to show how chance errors of estimated parameters affect the prediction of a response variable.

The simple linear model is

Y = β_{0} + β_{1} X + ε,

where $E (ε) = 0$ and $\mathrm{Var} (ε) = σ^{2}$ . Suppose that $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})$ is a sample of observations. Denote $s_{xx} = \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$ and $s_{xy} = \sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})$ . Using the least-squares method, the fitted model is

\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x,

where ${\hat{β}}_{1} = s_{xy} / s_{xx}$ and ${\hat{β}}_{0} = \bar{y} - {\hat{β}}_{1} \bar{x}$ . Under certain conditions, ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ are unbiased estimators of $β_{0}$ and $β_{1}$ , respectively. The chance errors associated with ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ are

\mathrm{Var} ({\hat{β}}_{0}) = σ^{2} \left(\frac{1}{n} + \frac{\bar{x}^{2}}{s_{xx}}\right), \quad \mathrm{Var} ({\hat{β}}_{1}) = σ^{2} \frac{1}{s_{xx}}, \quad \text{and} \quad \mathrm{Cov} ({\hat{β}}_{0}, {\hat{β}}_{1}) = - σ^{2} \frac{\bar{x}}{s_{xx}} . \tag{7}

One can use ${\hat{y}}_{0} = {\hat{β}}_{0} + {\hat{β}}_{1} x_{0}$ to predict an individual $Y$ value at $x_{0}$ . The (estimated) variance of the predictor is

s_{\hat{y}}^{2} = {\hat{σ}}^{2} \left(1 + \frac{1}{n} + \frac{(x_{0} - \bar{x})^{2}}{s_{xx}}\right),

where ${\hat{σ}}^{2}$ is the mean squared residuals. For more details, see Weisberg (2005). The variability of the predictor has two sources: variation due to the randomness of $Y$ and variation in the parameter estimators ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ . Let

h (x) = \frac{1}{n} + \frac{(x - \bar{x})^{2}}{s_{xx}} .

Note that $h (x_{i})$ is the leverage of $x_{i}$ , which indicates the extent of the influence of $x_{i}$ on the regression line. If ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ were treated as fixed, the variance of the predictor would be $s_{\hat{y}}^{2} = {\hat{σ}}^{2}$ . Their difference is

{\hat{σ}}^{2} h (x_{0}) = {\hat{σ}}^{2} \left(\frac{1}{n} + \frac{(x_{0} - \bar{x})^{2}}{s_{xx}}\right), \tag{9}

which is the consequence of chance errors (Equation 7) of the two estimated parameters in the simple linear model. In fact, Equation 9 is the (estimated) variance of ${\hat{β}}_{0} + {\hat{β}}_{1} x_{0}$ . When the sample size is large enough and $x_{0}$ is relatively close to $\bar{x}$ , the difference (9) can be very small and negligible. Otherwise, it is not appropriate to use ${\hat{σ}}^{2}$ as the variance of the predictor. In other words, when the chance errors of regression parameter estimators are small in the sense that Equation 9 is relatively small compared to ${\hat{σ}}^{2}$ , the regression parameter estimators ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ may be treated as fixed in predictions. In general, however, one cannot treat ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ as fixed. As a matter of fact, the uncertainty about model parameters will increase the SE of a predictor. From this example, we learn that the uncertainty about model parameters has an impact, which sometimes can be substantial, on subsequent statistical analyses.
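The claim that the difference term equals the variance of ${\hat{β}}_{0} + {\hat{β}}_{1} x_{0}$ can be checked numerically. The sketch below assembles that variance from the variances and covariance of the two estimators and compares it with $σ^{2} h (x_{0})$ . The design points and the error variance are hypothetical values chosen only for illustration, and the function names are ours.

```python
import numpy as np

def leverage(x0, x):
    """h(x0) = 1/n + (x0 - xbar)^2 / sxx."""
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    return 1.0 / x.size + (x0 - x.mean()) ** 2 / sxx

def var_fitted_from_chance_errors(x0, x, sigma2):
    """Var(b0_hat + b1_hat * x0) built from Var(b0_hat), Var(b1_hat), Cov(b0_hat, b1_hat)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    sxx = np.sum((x - xbar) ** 2)
    var_b0 = sigma2 * (1.0 / n + xbar**2 / sxx)
    var_b1 = sigma2 / sxx
    cov_b0_b1 = -sigma2 * xbar / sxx
    return var_b0 + x0**2 * var_b1 + 2.0 * x0 * cov_b0_b1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical design points
sigma2 = 2.0                              # hypothetical error variance
for x0 in (3.0, 10.0):
    print(x0, sigma2 * leverage(x0, x), var_fitted_from_chance_errors(x0, x, sigma2))
```

With these values, the extra variance is $σ^{2} / n = 0.4$ at $x_{0} = \bar{x} = 3$ but grows to 10.2 at $x_{0} = 10$ , about five times the error variance itself, which illustrates why the estimated coefficients cannot be treated as fixed far from $\bar{x}$ .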

In IRT, because no closed formulae exist for the variances and covariances of estimated parameters for commonly used item response models, it becomes extremely difficult to investigate the impact of chance errors of estimated parameters.

According to the basic estimation error model (6), we follow the approach of Zhang et al. (2011) and express item parameter estimators as

{\hat{a}}_{i} = a_{i} + δ_{a i} + ε_{a i},

{\hat{b}}_{i} = b_{i} + δ_{b i} + ε_{b i},

{\hat{c}}_{i} = c_{i} + δ_{c i} + ε_{c i},

where $δ_{a i}$ , $δ_{b i}$ , and $δ_{c i}$ are the biases of the corresponding item parameter estimators, $\{(ε_{a i}, ε_{b i}, ε_{c i})\}$ is an independent sequence of random vectors with mean zero,

E ({\hat{a}}_{i}) = a_{i} + δ_{a i}, \quad E ({\hat{b}}_{i}) = b_{i} + δ_{b i}, \quad E ({\hat{c}}_{i}) = c_{i} + δ_{c i},

\mathrm{Var} ({\hat{a}}_{i}) = σ_{a i}^{2}, \quad \mathrm{Var} ({\hat{b}}_{i}) = σ_{b i}^{2}, \quad \mathrm{Var} ({\hat{c}}_{i}) = σ_{c i}^{2},

\mathrm{Cov} ({\hat{a}}_{i}, {\hat{b}}_{i}) = σ_{a b i}, \quad \mathrm{Cov} ({\hat{b}}_{i}, {\hat{c}}_{i}) = σ_{b c i}, \quad \mathrm{Cov} ({\hat{a}}_{i}, {\hat{c}}_{i}) = σ_{a c i} .

The theorem below requires the following regularity conditions referred to as (C0)–(C2) specified in Zhang et al. (2011):

(C0) Item parameters $a_{i}$ and $b_{i}$ are uniformly bounded. θ is a bounded variable.

(C1) There exists $n_{0}$ such that for any $n > n_{0}$ ,

\lim_{J \to \infty} σ_{n}^{2} = 0,

where

σ_{n}^{2} = \max_{1 \leq i \leq n} \{σ_{a i}^{2}, σ_{b i}^{2}, σ_{c i}^{2}, δ_{a i}^{2}, δ_{b i}^{2}, δ_{c i}^{2}\} .

(C2)

\lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{a}}_{i} - a_{i})^{2}] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{b}}_{i} - b_{i})^{2}] = 0,

\lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{a}}_{i} - a_{i}) ({\hat{b}}_{i} - b_{i})] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{c}}_{i} - c_{i})^{2}] = 0,

\lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{a}}_{i} - a_{i}) ({\hat{c}}_{i} - c_{i})] = 0, \quad \lim_{J \to \infty} \frac{1}{n} \sum_{i = 1}^{n} \mathrm{Var} [({\hat{b}}_{i} - b_{i}) ({\hat{c}}_{i} - c_{i})] = 0.

These regularity conditions are usually assumed when investigating asymptotic properties related to 3PL models (see Zhang et al., 2011). The theorem uses the notation $o_{p}$ ; for details, see Serfling (1980). Denote

\begin{aligned} L_{i} (θ) &= \frac{G_{i} (θ)}{P_{i} (θ)} = \frac{1}{c_{i} + \exp\{D a_{i} (θ - b_{i})\}}, \\
M_{i} (θ) &= 1 - 2 F_{i} (θ) + c_{i} L_{i} (θ), \\
N_{i} (θ) &= M_{i}^{2} (θ) - 2 F_{i} (θ) G_{i} (θ) - c_{i} K_{i} (θ) L_{i} (θ), \\
S_{i} (θ) &= 1 + (1 - c_{i}) L_{i} (θ), \\
T_{i} (θ) &= M_{i} (θ) - 2 (1 - c_{i}) L_{i} (θ) [F_{i} (θ) - c_{i} L_{i} (θ)] . \end{aligned}

Theorem. Assume that regularity conditions (C0)–(C2) hold. Then, for any fixed θ,

\frac{1}{n} [{\hat{I}}_{n} (θ) - I_{n} (θ)] = \frac{1}{n} \sum_{i = 1}^{n} R_{i} (θ) + o_{p} \left(\max \left(σ_{n}^{2}, \frac{1}{\sqrt{n}}\right)\right), \tag{15}

where

\begin{aligned} R_{i} (θ) = {} & D^{2} F_{i} (θ) G_{i} (θ) K_{i} (θ) \Big\{ a_{i} (1 - c_{i}) [2 + D a_{i} (θ - b_{i}) M_{i} (θ)] δ_{a i} \\
& - D a_{i}^{3} (1 - c_{i}) M_{i} (θ) δ_{b i} - a_{i}^{2} S_{i} (θ) δ_{c i} \\
& + (1 - c_{i}) [1 + 2 D a_{i} (θ - b_{i}) M_{i} (θ) + \frac{1}{2} D^{2} a_{i}^{2} (θ - b_{i})^{2} N_{i} (θ)] (σ_{a i}^{2} + δ_{a i}^{2}) \\
& + \frac{1}{2} D^{2} a_{i}^{4} (1 - c_{i}) N_{i} (θ) (σ_{b i}^{2} + δ_{b i}^{2}) \\
& - D a_{i}^{2} (1 - c_{i}) [3 M_{i} (θ) + D a_{i} (θ - b_{i}) N_{i} (θ)] (σ_{a b i} + δ_{a i} δ_{b i}) \\
& + a_{i}^{2} L_{i} (θ) S_{i} (θ) (σ_{c i}^{2} + δ_{c i}^{2}) \\
& - a_{i} [D a_{i} (θ - b_{i}) T_{i} (θ) + 2 S_{i} (θ)] (σ_{a c i} + δ_{a i} δ_{c i}) + D a_{i}^{3} T_{i} (θ) (σ_{b c i} + δ_{b i} δ_{c i}) \Big\} . \end{aligned} \tag{16}

When item i is modeled by a 2PL model, $R_{i} (θ)$ can be simplified as

\begin{aligned} R_{i} (θ) = {} & D^{2} F_{i} (θ) G_{i} (θ) \Big\{ a_{i} [2 + D a_{i} (θ - b_{i}) M_{i} (θ)] δ_{a i} - D a_{i}^{3} M_{i} (θ) δ_{b i} \\
& + [1 + 2 D a_{i} (θ - b_{i}) M_{i} (θ) + \frac{1}{2} D^{2} a_{i}^{2} (θ - b_{i})^{2} N_{i} (θ)] (σ_{a i}^{2} + δ_{a i}^{2}) \\
& + \frac{1}{2} D^{2} a_{i}^{4} N_{i} (θ) (σ_{b i}^{2} + δ_{b i}^{2}) \\
& - D a_{i}^{2} [3 M_{i} (θ) + D a_{i} (θ - b_{i}) N_{i} (θ)] (σ_{a b i} + δ_{a i} δ_{b i}) \Big\} . \end{aligned} \tag{17}

When item i is modeled by a 1PL model,

R_{i} (θ) = D^{2} F_{i}^{*} (θ) G_{i}^{*} (θ) \left\{ - D M_{i}^{*} (θ) δ_{b i} + \frac{1}{2} D^{2} N_{i}^{*} (θ) (σ_{b i}^{2} + δ_{b i}^{2}) \right\}, \tag{18}

where $M_{i}^{*} (θ) = 1 - 2 F_{i}^{*} (θ)$ and $N_{i}^{*} (θ) = [M_{i}^{*} (θ)]^{2} - 2 F_{i}^{*} (θ) G_{i}^{*} (θ)$ .

A proof of this theorem is given in Appendix A. The theorem can be applied to cases where different models (1PL, 2PL, and 3PL) are used to characterize different types of items in a test. It presents the major part of the difference between the estimated and the true test information functions. The $R_{i} (θ)$ in Equations 16, 17, or 18 specifies the contribution from each item, which is a function of the biases, variances, and covariances of item parameter estimators, as well as item parameters. There are two terms for a 1PL model corresponding to the contributions from the bias $δ_{b i}$ and from the mean squared error $σ_{b i}^{2} + δ_{b i}^{2}$ (see Equation 18), five terms for a 2PL model, and nine terms for a 3PL model. The term, $\sum_{i = 1}^{n} R_{i} (θ)$ , can be applied to evaluate the impact of estimation errors of item parameter estimators on the test information function and how accurate the estimated test information function is. In this article, $\sum_{i = 1}^{n} R_{i} (θ)$ is called the adjusted term and ${\hat{I}}_{n} (θ) - \sum_{i = 1}^{n} R_{i} (θ)$ is the adjusted test information function.
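To make the five-term 2PL case concrete, the following sketch codes the 2PL expression for $R_{i} (θ)$ and compares it with the actual change in item information when the parameters are shifted by a small, purely deterministic bias (all variances and the covariance set to zero). The item is Item 15 of Table 1; the bias values and the function names `info_2pl` and `r_2pl` are hypothetical and serve only as an illustration.

```python
import numpy as np

D = 1.7  # scaling constant from the text

def info_2pl(theta, a, b):
    """Item information of a 2PL item: D^2 a^2 F G."""
    F = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return D**2 * a**2 * F * (1.0 - F)

def r_2pl(theta, a, b, bias_a, bias_b, var_a, var_b, cov_ab):
    """R_i(theta) for a 2PL item (five terms: two bias terms and three second-order terms)."""
    x = D * a * (theta - b)
    F = 1.0 / (1.0 + np.exp(-x))
    G = 1.0 - F
    M = 1.0 - 2.0 * F            # M_i with c_i = 0
    N = M**2 - 2.0 * F * G       # N_i with c_i = 0
    inner = (
        a * (2.0 + x * M) * bias_a
        - D * a**3 * M * bias_b
        + (1.0 + 2.0 * x * M + 0.5 * x**2 * N) * (var_a + bias_a**2)
        + 0.5 * D**2 * a**4 * N * (var_b + bias_b**2)
        - D * a**2 * (3.0 * M + x * N) * (cov_ab + bias_a * bias_b)
    )
    return D**2 * F * G * inner

# Item 15 of Table 1; the bias values below are hypothetical
a, b, theta = 1.172, 0.645, 0.0
bias_a, bias_b = 0.01, -0.01
shift = info_2pl(theta, a + bias_a, b + bias_b) - info_2pl(theta, a, b)
approx = r_2pl(theta, a, b, bias_a, bias_b, 0.0, 0.0, 0.0)
print(shift, approx)
```

With zero variances, the expression reduces to a second-order Taylor expansion of the item information in the two biases, so `shift` and `approx` should agree up to third-order terms in the bias.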

In practice, one needs estimates of item parameters, biases, variances, and covariances to calculate an estimate of the adjusted term. Any estimation method (e.g., an MLE or Bayesian estimation method) can be used to estimate item parameters and related quantities as long as it provides decent estimates as required by regularity conditions (C1) and (C2). An IRT calibration program, such as BILOG (Mislevy & Bock, 1982) and PARSCALE (Muraki & Bock, 1997), typically provides the estimates of the variances and covariances along with item parameter estimates. However, bias estimates of item parameter estimators are not available yet. In general, the maximum likelihood estimator is asymptotically unbiased under certain conditions. Thus, it may not be unrealistic to assume that item parameter estimators are unbiased. That is, one may only consider chance errors of item parameter estimators. Under this assumption, the theorem can be simplified.

Corollary. Assume that regularity conditions (C0)–(C2) hold. If the item parameter estimators are unbiased, then, for any fixed θ,

\frac{1}{n} [{\hat{I}}_{n} (θ) - I_{n} (θ)] = \frac{1}{n} \sum_{i = 1}^{n} R_{i}^{*} (θ) + o_{p} \left(\max \left(σ_{n}^{2}, \frac{1}{\sqrt{n}}\right)\right), \tag{19}

where ${\hat{I}}_{n} (θ) = I_{n} (θ; \hat{a}, \hat{b}, \hat{c})$ , and

\begin{aligned} R_{i}^{*} (θ) = {} & D^{2} F_{i} (θ) G_{i} (θ) K_{i} (θ) \Big\{ (1 - c_{i}) [1 + 2 D a_{i} (θ - b_{i}) M_{i} (θ) + \frac{1}{2} D^{2} a_{i}^{2} (θ - b_{i})^{2} N_{i} (θ)] σ_{a i}^{2} \\
& + \frac{1}{2} D^{2} a_{i}^{4} (1 - c_{i}) N_{i} (θ) σ_{b i}^{2} \\
& - D a_{i}^{2} (1 - c_{i}) [3 M_{i} (θ) + D a_{i} (θ - b_{i}) N_{i} (θ)] σ_{a b i} \\
& + a_{i}^{2} L_{i} (θ) S_{i} (θ) σ_{c i}^{2} \\
& - a_{i} [D a_{i} (θ - b_{i}) T_{i} (θ) + 2 S_{i} (θ)] σ_{a c i} + D a_{i}^{3} T_{i} (θ) σ_{b c i} \Big\} . \end{aligned} \tag{20}

When item i is modeled by a 2PL model,

\begin{aligned} R_{i}^{*} (θ) = {} & D^{2} F_{i} (θ) G_{i} (θ) \Big\{ [1 + 2 D a_{i} (θ - b_{i}) M_{i} (θ) + \frac{1}{2} D^{2} a_{i}^{2} (θ - b_{i})^{2} N_{i} (θ)] σ_{a i}^{2} \\
& + \frac{1}{2} D^{2} a_{i}^{4} N_{i} (θ) σ_{b i}^{2} - D a_{i}^{2} [3 M_{i} (θ) + D a_{i} (θ - b_{i}) N_{i} (θ)] σ_{a b i} \Big\} . \end{aligned}

When item $i$ is modeled by a 1PL model,

R_{i}^{*} (θ) = D^{2} F_{i}^{*} (θ) G_{i}^{*} (θ) \left\{ \frac{1}{2} D^{2} N_{i}^{*} (θ) σ_{b i}^{2} \right\},

where $N_{i}^{*} (θ) = [1 - 2 F_{i}^{*} (θ)]^{2} - 2 F_{i}^{*} (θ) G_{i}^{*} (θ)$ .

There are six terms in Equation 20, corresponding to the contributions from the variances and covariances of item parameter estimators for a 3PL model. Upon obtaining the estimates of item parameters, the variances, and the covariances of item parameter estimators, one can construct the estimate of the simplified adjusted term, $\sum_{i = 1}^{n} R_{i}^{*} (θ)$ , by simply replacing the unknown parameters with the corresponding estimated ones, and obtain the adjusted estimated test information function.
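The six-term expression for a 3PL item can be sketched in code as follows. Each coefficient in the expression coincides with a second-order partial derivative of the item information with respect to $(a_{i}, b_{i}, c_{i})$ (halved for the three variance terms), so the closed form can be cross-checked against one half of the sum of the elementwise product of a numerical Hessian and the covariance matrix. The item is Item 27 of Table 1; the covariance matrix is hypothetical, and the function names and the finite-difference step are our assumptions.

```python
import numpy as np

D = 1.7  # scaling constant from the text

def item_info(theta, a, b, c):
    """3PL item information: D^2 a^2 (1 - c) F G K with K = F / P."""
    F = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    P = c + (1.0 - c) * F
    return D**2 * a**2 * (1.0 - c) * F * (1.0 - F) * F / P

def r_star(theta, a, b, c, cov):
    """R_i^*(theta); cov is the 3x3 covariance matrix of (a_hat, b_hat, c_hat)."""
    x = D * a * (theta - b)
    F = 1.0 / (1.0 + np.exp(-x))
    G = 1.0 - F
    P = c + (1.0 - c) * F
    K, L = F / P, G / P
    M = 1.0 - 2.0 * F + c * L
    N = M**2 - 2.0 * F * G - c * K * L
    S = 1.0 + (1.0 - c) * L
    T = M - 2.0 * (1.0 - c) * L * (F - c * L)
    inner = (
        (1.0 - c) * (1.0 + 2.0 * x * M + 0.5 * x**2 * N) * cov[0, 0]
        + 0.5 * D**2 * a**4 * (1.0 - c) * N * cov[1, 1]
        - D * a**2 * (1.0 - c) * (3.0 * M + x * N) * cov[0, 1]
        + a**2 * L * S * cov[2, 2]
        - a * (x * T + 2.0 * S) * cov[0, 2]
        + D * a**3 * T * cov[1, 2]
    )
    return D**2 * F * G * K * inner

def numeric_hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at the point p."""
    p = np.asarray(p, dtype=float)
    k = p.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei, ej = np.zeros(k), np.zeros(k)
            ei[i], ej[j] = h, h
            H[i, j] = (f(p + ei + ej) - f(p + ei - ej)
                       - f(p - ei + ej) + f(p - ei - ej)) / (4.0 * h * h)
    return H

# Item 27 of Table 1 with a hypothetical covariance matrix of (a_hat, b_hat, c_hat)
a, b, c, theta = 1.506, -0.495, 0.215, 0.5
cov = np.array([[0.040, 0.005, 0.004],
                [0.005, 0.020, 0.006],
                [0.004, 0.006, 0.003]])
closed_form = r_star(theta, a, b, c, cov)
H = numeric_hessian(lambda p: item_info(theta, p[0], p[1], p[2]), [a, b, c])
hessian_form = 0.5 * np.sum(H * cov)
print(closed_form, hessian_form)
```

In practice, `cov` would be filled with the variance and covariance estimates reported by the calibration program, and the true parameters would be replaced by their estimates, as described above.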

3. A Simulation Study

A simulation study was conducted to evaluate the impact of uncertainty about item parameters on test information functions and to verify the accuracy of the approximation formulas developed in the previous section using simulated data. Item parameters obtained from the 1998 National Assessment of Educational Progress (NAEP) Grade 4 reading assessment (Allen, Donoghue, & Schoeps, 2001) were used as true item parameters to generate response data. We chose 60 items: 26 2PL items corresponding to short constructed-response items and 34 3PL items corresponding to multiple-choice items. These item parameters are presented in Table 1.

Table 1.

Item Parameters Used in the Simulation Study (From the 1998 NAEP Grade 4 Reading Assessment)

Item	a	b	c	Item	a	b	c
1	0.623	−0.872	0.000	31	1.342	−0.457	0.175
2	0.920	1.008	0.000	32	1.110	0.148	0.244
3	1.052	1.009	0.000	33	1.228	0.259	0.247
4	0.754	0.015	0.000	34	0.951	−0.864	0.319
5	0.763	−0.284	0.000	35	1.472	1.204	0.167
6	1.025	0.107	0.000	36	1.859	0.213	0.265
7	0.647	−1.008	0.000	37	1.133	0.916	0.297
8	0.520	−1.425	0.000	38	1.374	0.307	0.269
9	0.757	−0.630	0.000	39	0.504	−0.932	0.247
10	0.832	1.118	0.000	40	1.415	0.891	0.271
11	1.123	1.057	0.000	41	2.303	0.609	0.418
12	0.814	0.306	0.000	42	0.966	−1.318	0.244
13	0.506	−1.272	0.000	43	1.029	0.327	0.300
14	0.269	−0.904	0.000	44	0.721	−1.193	0.247
15	1.172	0.645	0.000	45	0.941	0.401	0.264
16	0.877	−0.523	0.000	46	0.793	0.642	0.247
17	0.761	−1.242	0.000	47	1.032	0.507	0.248
18	0.619	−1.113	0.000	48	0.533	−0.835	0.218
19	1.154	0.645	0.000	49	1.203	0.257	0.165
20	1.536	1.192	0.000	50	1.104	−0.155	0.247
21	0.597	1.341	0.000	51	1.464	0.774	0.138
22	0.970	0.906	0.000	52	2.300	0.416	0.264
23	1.086	−0.060	0.000	53	0.562	−0.073	0.237
24	0.795	−0.238	0.000	54	0.883	−1.015	0.310
25	0.838	−0.076	0.000	55	1.261	1.084	0.206
26	1.031	−0.310	0.000	56	0.597	−0.206	0.156
27	1.506	−0.495	0.215	57	0.938	−1.691	0.294
28	0.607	0.712	0.251	58	1.414	−0.608	0.275
29	1.288	0.554	0.190	59	1.185	−0.590	0.312
30	1.798	−0.899	0.248	60	0.579	−0.688	0.276

The numbers of examinees in the simulated calibration samples were 250, 500, and 1,000. Examinees’ ability parameters were independently generated from a standard normal distribution. Based on these ability parameters and the item parameters, 100 sets (for 100 replications) of calibration response data were generated from the IRT models for each of the three sample sizes. Each simulated data set was used to estimate item parameters separately. In this study, a NAEP version of PARSCALE (Allen et al., 2001) was used to estimate item parameters. The NAEP PARSCALE is an item parameter estimation program that combines Mislevy and Bock’s (1982) BILOG and Muraki and Bock’s (1997) PARSCALE computer programs. For convenience, it is simply called PARSCALE in this article. Tables 2 through 4 present the bias of estimated item parameters based on 100 replications for sample sizes 250, 500, and 1,000, respectively. The variances and covariances of estimated item parameters are not reported here because the resulting tables would be too large.
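The data-generation step described above can be sketched as follows: abilities are drawn from a standard normal distribution, and each response is a Bernoulli draw with success probability $P_{i} (θ)$ . This is a minimal sketch, not the program used in the study; the function name, the random seed, and the restriction to three items (Items 1, 2, and 27 of Table 1) are our assumptions.

```python
import numpy as np

D = 1.7  # scaling constant from the text

def simulate_responses(a, b, c, n_examinees, rng):
    """Draw abilities from N(0, 1) and dichotomous responses from 3PL/2PL models."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    theta = rng.standard_normal(n_examinees)
    F = 1.0 / (1.0 + np.exp(-D * a * (theta[:, None] - b)))
    P = c + (1.0 - c) * F                               # c = 0 gives a 2PL item
    responses = (rng.random(P.shape) < P).astype(int)   # 1 = correct, 0 = incorrect
    return theta, responses

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only
a = [0.623, 0.920, 1.506]          # Items 1, 2, and 27 of Table 1
b = [-0.872, 1.008, -0.495]
c = [0.000, 0.000, 0.215]

# 100 replications for one calibration sample size (here 250 examinees)
data_sets = [simulate_responses(a, b, c, 250, rng)[1] for _ in range(100)]
print(len(data_sets), data_sets[0].shape)
```

Each element of `data_sets` would then be passed to a calibration program to obtain item parameter estimates together with their variances and covariances.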

Table 2.

Bias of Estimated Item Parameters With Calibration Sample Size 250, Based on 100 Replications

Item	a	b	c	Item	a	b	c
1	0.0225	−0.0468	0.0000	31	0.0458	−0.0051	0.0330
2	−0.0051	−0.0341	0.0000	32	−0.0429	−0.0958	−0.0162
3	−0.0269	−0.0122	0.0000	33	−0.0729	−0.0894	−0.0238
4	0.0309	−0.0633	0.0000	34	−0.0097	−0.2190	−0.0920
5	0.0159	−0.0375	0.0000	35	−0.1097	0.0286	0.0132
6	0.0029	−0.0437	0.0000	36	−0.2607	−0.1298	−0.0367
7	0.0440	−0.0343	0.0000	37	−0.1691	−0.1687	−0.0528
8	0.0531	0.0427	0.0000	38	−0.1416	−0.1409	−0.0372
9	0.0272	−0.0594	0.0000	39	0.0865	0.0689	−0.0149
10	0.0173	−0.0293	0.0000	40	−0.2276	−0.0913	−0.0334
11	−0.0060	−0.0287	0.0000	41	−0.9640	−0.3142	−0.1201
12	0.0299	−0.0488	0.0000	42	0.0271	−0.0589	−0.0189
13	0.0532	0.0286	0.0000	43	−0.0962	−0.2028	−0.0646
14	0.0857	0.1224	0.0000	44	0.0444	−0.0651	−0.0201
15	−0.0321	−0.0431	0.0000	45	−0.0303	−0.1490	−0.0337
16	0.0148	−0.0428	0.0000	46	0.0021	−0.0784	−0.0159
17	0.0633	−0.0024	0.0000	47	−0.0092	−0.0763	−0.0172
18	0.0407	−0.0471	0.0000	48	0.0661	0.0435	0.0144
19	0.0059	−0.0556	0.0000	49	0.0641	−0.0092	0.0331
20	−0.0643	−0.0098	0.0000	50	−0.0173	−0.1002	−0.0192
21	0.0333	−0.0598	0.0000	51	−0.0161	−0.0055	0.0301
22	0.0075	−0.0250	0.0000	52	−0.4895	−0.1168	−0.0377
23	−0.0034	−0.0534	0.0000	53	0.0671	−0.0647	−0.0027
24	0.0105	−0.0681	0.0000	54	−0.0231	−0.2349	−0.0822
25	0.0179	−0.0542	0.0000	55	−0.0725	−0.0435	−0.0052
26	0.0175	−0.0554	0.0000	56	0.0997	0.1442	0.0724
27	−0.0040	−0.0713	−0.0003	57	−0.0140	−0.2004	−0.0681
28	0.0866	−0.0958	−0.0127	58	−0.0977	−0.1343	−0.0505
29	−0.0025	−0.0218	0.0104	59	−0.0890	−0.2341	−0.0837
30	−0.1218	−0.1043	−0.0291	60	0.0521	−0.1053	−0.0440

Table 3.

Bias of Estimated Item Parameters With Calibration Sample Size 500, Based on 100 Replications

Item	a	b	c	Item	a	b	c
1	0.0198	−0.0388	0.0000	31	0.0851	0.0180	0.0394
2	−0.0074	−0.0307	0.0000	32	0.0084	−0.0452	−0.0007
3	−0.0065	−0.0281	0.0000	33	−0.0328	−0.0624	−0.0096
4	0.0224	−0.0562	0.0000	34	−0.0346	−0.1923	−0.0752
5	0.0100	−0.0322	0.0000	35	−0.0627	0.0180	0.0115
6	−0.0051	−0.0356	0.0000	36	−0.1362	−0.0887	−0.0237
7	0.0265	−0.0335	0.0000	37	−0.0959	−0.1177	−0.0288
8	0.0232	0.0178	0.0000	38	−0.0955	−0.0986	−0.0239
9	0.0194	−0.0507	0.0000	39	0.0568	0.0543	0.0009
10	0.0079	−0.0297	0.0000	40	−0.1474	−0.0616	−0.0186
11	−0.0004	−0.0285	0.0000	41	−0.6196	−0.1671	−0.0617
12	0.0214	−0.0359	0.0000	42	0.0125	−0.0379	−0.0056
13	0.0367	0.0105	0.0000	43	−0.0631	−0.1243	−0.0441
14	0.0499	0.0597	0.0000	44	0.0206	−0.0255	−0.0056
15	−0.0217	−0.0368	0.0000	45	−0.0167	−0.1042	−0.0211
16	0.0081	−0.0361	0.0000	46	0.0100	−0.0484	−0.0051
17	0.0228	−0.0344	0.0000	47	−0.0030	−0.0491	−0.0078
18	0.0238	−0.0269	0.0000	48	0.0525	0.0701	0.0285
19	0.0177	−0.0385	0.0000	49	0.0816	0.0053	0.0298
20	−0.0234	−0.0191	0.0000	50	−0.0322	−0.0672	−0.0092
21	0.0191	−0.0592	0.0000	51	0.0255	−0.0025	0.0257
22	0.0079	−0.0231	0.0000	52	−0.3427	−0.0794	−0.0231
23	0.0013	−0.0431	0.0000	53	0.0427	−0.0052	0.0115
24	0.0114	−0.0467	0.0000	54	−0.0358	−0.1956	−0.0681
25	0.0078	−0.0440	0.0000	55	−0.0121	−0.0165	0.0016
26	0.0130	−0.0412	0.0000	56	0.0878	0.1686	0.0809
27	0.0334	−0.0426	0.0077	57	−0.0111	−0.1504	−0.0548
28	0.0553	−0.0319	0.0019	58	−0.0568	−0.0839	−0.0306
29	0.0378	−0.0182	0.0131	59	−0.0811	−0.1833	−0.0612
30	−0.0566	−0.0742	−0.0149	60	0.0334	−0.0872	−0.0283

One hundred traditional estimated test information functions for each of the three sample sizes, as well as the true test information function, were calculated. Specifically, we computed these functions at 81 ability levels: $- 4.0$ , $- 3.9$ , $- 3.8$ , . . . , $3.8$ , $3.9$ , and $4.0$ . For ease of presentation, the results at 17 selected ability levels ( $- 4.0$ , $- 3.5$ , . . . , $3.5$ , and $4.0$ ) are reported in Table 5. The second column of Table 5 presents the values of the true test information function at the 17 selected ability levels. Columns 3, 5, and 7 are the averages of traditional estimated test information functions based on 100 replications for sample sizes 250, 500, and 1,000, respectively. When the sample size increases, the difference between the traditional estimated and true test information functions decreases. The largest differences (in magnitude) between the averages of traditional estimated test information functions and the true test information function are $- 2.773$ , $- 1.694$ , and $- 1.013$ for these three sample sizes, respectively. These differences are not unreasonably large, because the simulated response data are ideal in the sense that they were generated according to the IRT models and were calibrated using the true models.
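The evaluation on the ability grid can be sketched as below: the information functions are computed at the 81 ability levels, every fifth level is kept for reporting, and the signed difference of largest magnitude is extracted. For illustration only, the "estimated" parameters are the true parameters of Items 1, 27, and 41 (Table 1) shifted by their biases in Table 2; in the study itself, the estimated functions come from each replication's calibration and are averaged over replications. The function and variable names are our assumptions.

```python
import numpy as np

D = 1.7  # scaling constant from the text

def information_3pl(theta, a, b, c):
    """Test information summed over items, evaluated at each ability level in theta."""
    theta = np.asarray(theta, dtype=float)[:, None]
    F = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    P = c + (1.0 - c) * F
    return (D**2 * a**2 * (1.0 - c) * F * (1.0 - F) * F / P).sum(axis=1)

theta_grid = np.linspace(-4.0, 4.0, 81)  # -4.0, -3.9, ..., 4.0
reported_levels = theta_grid[::5]        # -4.0, -3.5, ..., 4.0 (17 levels)

# True parameters of Items 1, 27, and 41 (Table 1)
a_true = np.array([0.623, 1.506, 2.303])
b_true = np.array([-0.872, -0.495, 0.609])
c_true = np.array([0.000, 0.215, 0.418])
# Stand-in estimates: true values plus the Table 2 biases (illustration only)
a_est = a_true + np.array([0.0225, -0.0040, -0.9640])
b_est = b_true + np.array([-0.0468, -0.0713, -0.3142])
c_est = c_true + np.array([0.0000, -0.0003, -0.1201])

diff = (information_3pl(theta_grid, a_est, b_est, c_est)
        - information_3pl(theta_grid, a_true, b_true, c_true))
worst = np.argmax(np.abs(diff))          # grid index of the largest discrepancy
max_signed_diff = diff[worst]
print(theta_grid[worst], max_signed_diff)
```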

Table 4.

Bias of Estimated Item Parameters With Calibration Sample Size 1,000, Based on 100 Replications

Item	a	b	c	Item	a	b	c
1	0.0132	−0.0342	0.0000	31	0.0750	0.0020	0.0357
2	−0.0007	−0.0413	0.0000	32	0.0078	−0.0477	−0.0005
3	−0.0055	−0.0388	0.0000	33	−0.0213	−0.0454	−0.0027
4	0.0072	−0.0498	0.0000	34	−0.0302	−0.1765	−0.0667
5	0.0050	−0.0317	0.0000	35	−0.0065	−0.0145	0.0060
6	−0.0059	−0.0437	0.0000	36	−0.0674	−0.0719	−0.0125
7	0.0172	−0.0290	0.0000	37	−0.0587	−0.0871	−0.0156
8	0.0151	0.0006	0.0000	38	−0.0454	−0.0801	−0.0182
9	0.0176	−0.0424	0.0000	39	0.0367	0.0403	0.0067
10	0.0053	−0.0347	0.0000	40	−0.0768	−0.0557	−0.0110
11	−0.0068	−0.0350	0.0000	41	−0.3395	−0.0972	−0.0293
12	0.0074	−0.0367	0.0000	42	0.0096	−0.0320	−0.0004
13	0.0162	−0.0181	0.0000	43	−0.0546	−0.1030	−0.0337
14	0.0317	0.0147	0.0000	44	0.0117	−0.0314	−0.0002
15	−0.0081	−0.0421	0.0000	45	−0.0055	−0.0833	−0.0127
16	0.0042	−0.0373	0.0000	46	0.0113	−0.0500	−0.0011
17	0.0165	−0.0322	0.0000	47	−0.0071	−0.0436	−0.0045
18	0.0116	−0.0351	0.0000	48	0.0423	0.0622	0.0337
19	0.0101	−0.0451	0.0000	49	0.0589	−0.0073	0.0206
20	−0.0029	−0.0363	0.0000	50	−0.0044	−0.0575	−0.0064
21	0.0122	−0.0452	0.0000	51	0.0277	−0.0318	0.0153
22	0.0052	−0.0338	0.0000	52	−0.1896	−0.0680	−0.0138
23	0.0112	−0.0381	0.0000	53	0.0321	−0.0089	0.0147
24	0.0062	−0.0397	0.0000	54	−0.0434	−0.1845	−0.0631
25	0.0027	−0.0415	0.0000	55	0.0022	−0.0332	0.0007
26	0.0147	−0.0412	0.0000	56	0.0835	0.1716	0.0789
27	0.0634	−0.0285	0.0087	57	−0.0101	−0.1329	−0.0498
28	0.0389	−0.0122	0.0084	58	−0.0317	−0.0802	−0.0202
29	0.0602	−0.0269	0.0098	59	−0.0632	−0.1485	−0.0477
30	−0.0646	−0.0673	−0.0114	60	0.0229	−0.0792	−0.0196

Table 5.

Average Traditional Versus Average Adjusted Estimated Test Information Functions

		250		500		1000
Ability	True	Traditional	Adjusted	Traditional	Adjusted	Traditional	Adjusted
−4.0	0.460	0.483	0.471	0.478	0.464	0.477	0.461
−3.5	0.778	0.850	0.799	0.828	0.786	0.822	0.780
−3.0	1.347	1.531	1.389	1.473	1.366	1.453	1.354
−2.5	2.360	2.773	2.458	2.636	2.409	2.587	2.381
−2.0	4.075	4.915	4.231	4.609	4.151	4.499	4.107
−1.5	7.008	8.499	7.101	7.914	7.058	7.714	7.016
−1.0	12.133	13.788	12.250	13.087	12.198	12.858	12.151
−0.5	17.273	18.773	17.026	18.115	17.106	17.930	17.192
0.0	21.819	23.210	20.943	22.911	21.560	22.811	21.800
0.5	26.684	24.755	27.548	25.720	27.050	26.418	26.920
1.0	22.963	20.577	22.587	21.509	22.830	21.997	22.905
1.5	14.766	13.554	15.027	13.990	14.896	14.056	14.810
2.0	7.628	7.176	7.820	7.266	7.720	7.196	7.666
2.5	3.572	3.478	3.607	3.454	3.593	3.386	3.584
3.0	1.676	1.687	1.680	1.648	1.681	1.604	1.680
3.5	0.817	0.843	0.822	0.813	0.820	0.787	0.819
4.0	0.417	0.438	0.423	0.417	0.420	0.403	0.418

Using the biases, variances, and covariances of estimated item parameters, the values of the adjusted estimated test information functions were also calculated and are presented in Columns 4, 6, and 8 of Table 5 for the three sample sizes. For each of the sample sizes, the adjusted estimated test information functions are, in general, much closer to the true test information function than the average of traditional estimated test information functions. Similar to the case of traditional estimated test information functions, the difference between the adjusted and true test information functions decreases as the sample size increases. The maximum differences are $0.994$ , $0.366$ , and $0.238$ for the three sample sizes of 250, 500, and 1,000, respectively, which are substantially smaller in magnitude than the corresponding differences between the averages of traditional estimated and the true test information functions. Figures 1 through 3 show the average of the differences between the traditional/adjusted estimated test information functions and the true test information function. From Figures 1 through 3, it is clear that the adjusted term in Equation 15 captures the difference between estimated and the true test information functions quite well.

Figure 1.

Differences between traditional/adjusted estimated and true test information functions with sample size 250.

Figure 2.

Differences between traditional/adjusted estimated and true test information functions with sample size 500.

Figure 3.

Differences between traditional/adjusted estimated and true test information functions with sample size 1,000.

4. A Real Data Example

Response data from one administration of the SAT I: Reasoning Test with 350,400 examinees were used to evaluate the extent of the impact of chance errors of item parameter estimators on test information functions with different sample sizes. We randomly drew samples of examinees from the full data set, each sample being drawn separately and without replacement. The sample sizes considered here are 250, 500, 1,000, and 2,000, and 100 samples were drawn for each sample size. We used 60 math items: 50 multiple-choice items and 10 constructed-response items. We used PARSCALE to calibrate the response data with 2PL models because test takers are discouraged from guessing in the SAT (incorrect answers are penalized). Estimated item parameters obtained from PARSCALE are considered unbiased in this study. After obtaining the estimated item parameters, along with their variances and covariances, we calculated the test information functions and the estimated adjusted terms for each of the 100 samples. Note that one should examine whether the results from PARSCALE or other estimation programs have properly converged. If not, the SEs of item parameter estimators may be inflated. In the case of 250 examinees, we found that PARSCALE could not provide properly converged results in some replications. When this happened, samples were redrawn until properly converged PARSCALE results were obtained.
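The resampling design can be sketched as follows: for each sample size, 100 sets of examinee indices are drawn, each without replacement, from the 350,400 available records. The function name and the seed are hypothetical, and the sketch covers only the index selection; the calibration of each sample, the convergence check, and the redrawing of non-converged samples are not shown.

```python
import numpy as np

def draw_calibration_samples(n_total, sample_size, n_replications, rng):
    """Return a list of index arrays, each drawn without replacement from range(n_total)."""
    return [rng.choice(n_total, size=sample_size, replace=False)
            for _ in range(n_replications)]

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility only
n_examinees = 350_400           # size of the full data set reported in the text
samples = {size: draw_calibration_samples(n_examinees, size, 100, rng)
           for size in (250, 500, 1000, 2000)}
print({size: len(reps) for size, reps in samples.items()})
```

Each index array would select the rows of the response matrix to be calibrated; within a sample no examinee appears twice, while different samples may overlap because they are drawn separately.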

Table 6 presents the average estimated test information functions and their adjusted terms based on 100 replications for various sample sizes at 13 selected ability levels. When the sample is relatively large, the adjusted terms are rather small. However, the adjusted terms may not be negligible when the sample is relatively small, especially in the case of 250 examinees. The average adjusted terms are also illustrated in Figure 4, which clearly shows that the magnitude of the adjusted term decreases as the sample size increases. The maximum absolute values of these average adjusted terms for samples of 250, 500, 1,000, and 2,000 are 3.922, 1.375, 1.161, and 0.126, respectively. The main reason for a relatively large adjusted value is that some item parameter estimates, especially those of the difficulty parameters, have relatively large SEs. For example, the largest SE of a difficulty parameter in the case of 250 examinees is 1.154.
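For readers who wish to reproduce the plug-in part of such a calculation, the 2PL test information function can be evaluated as the sum of item information functions, each equal to the squared scaled discrimination times P(1 − P). The sketch below is illustrative only: the function names are hypothetical, and the scaling constant `D` defaults to 1.0 as an assumption, since this section does not state the value used.

```python
import math

def p_2pl(theta, a, b, D=1.0):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, D=1.0):
    """Item information for a 2PL item: (D*a)^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

def total_information(theta, items, D=1.0):
    """Test information: sum of item information over (a, b) pairs."""
    return sum(item_information(theta, a, b, D) for a, b in items)
```

Evaluating `total_information` at a grid of ability values with estimated `(a, b)` pairs gives the traditional estimated test information function; the adjusted term is a separate quantity and is not computed here.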

Figure 4.

Adjusted values of test information functions for various sample sizes.

Table 6.

Average Estimated Test Information Function and Its Adjusted Term

Sample size	250		500		1,000		2,000	
Ability	$\bar{I}(\theta)$	$\sum \bar{R}_{i}(\theta)$	$\bar{I}(\theta)$	$\sum \bar{R}_{i}(\theta)$	$\bar{I}(\theta)$	$\sum \bar{R}_{i}(\theta)$	$\bar{I}(\theta)$	$\sum \bar{R}_{i}(\theta)$
−4.0	1.893	−0.058	1.861	−0.365	1.915	−0.057	1.970	−0.081
−3.5	3.087	−0.134	3.058	−0.540	3.147	−0.353	3.182	−0.122
−3.0	5.068	−0.237	4.945	−0.587	5.056	−0.930	5.018	−0.100
−2.5	8.289	−1.261	7.813	−0.749	7.803	−0.968	7.704	0.004
−2.0	12.298	−1.367	11.738	−0.924	11.535	0.445	11.466	0.058
−1.5	16.498	1.059	16.248	0.977	16.110	0.586	16.019	0.057
−1.0	20.230	0.862	19.977	0.756	19.765	0.413	19.606	0.091
−0.5	20.872	0.911	20.691	0.629	20.465	0.301	20.329	0.099
0.0	19.192	0.669	19.122	0.456	18.907	0.188	18.822	0.074
0.5	16.504	0.377	16.477	0.292	16.251	0.096	16.209	0.042
1.0	13.082	0.043	13.087	0.117	12.908	0.020	12.892	0.007
1.5	9.197	−0.183	9.234	−0.030	9.173	−0.033	9.159	−0.019
2.0	5.858	−0.201	5.911	−0.057	5.900	−0.038	5.885	−0.021
2.5	3.595	−0.142	3.638	−0.036	3.634	−0.027	3.624	−0.016
3.0	2.197	−0.079	2.220	−0.013	2.219	−0.016	2.211	−0.011
3.5	1.351	−0.033	1.355	0.004	1.354	−0.006	1.348	−0.006
4.0	0.838	−0.003	0.830	0.016	0.828	0.001	0.822	−0.002
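To gauge how large the adjusted terms in Table 6 are relative to the estimated information itself, one can divide the absolute adjusted term by the average estimated information at the same ability level. This ratio is an illustrative summary introduced here, not a quantity defined in the article; the sketch below evaluates it for the Table 6 row at ability −2.0.

```python
# (ability, sample size) -> (average estimated information, average adjusted term),
# copied from the Table 6 row at ability -2.0
table6 = {
    (-2.0, 250): (12.298, -1.367),
    (-2.0, 500): (11.738, -0.924),
    (-2.0, 1000): (11.535, 0.445),
    (-2.0, 2000): (11.466, 0.058),
}

def relative_adjustment(info, adjusted):
    """Absolute adjusted term as a fraction of the estimated information."""
    return abs(adjusted) / info

for (theta, n), (info, adj) in sorted(table6.items(), key=lambda kv: kv[0][1]):
    ratio = relative_adjustment(info, adj)
    print(f"theta={theta:+.1f} n={n:>4}: |adjusted|/information = {ratio:.3f}")
```

At this ability level the ratio is roughly 11% for 250 examinees and about 0.5% for 2,000, in line with the observation that the adjusted term matters mainly for small samples.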

5. Discussion

To date, most IRT-based analysis procedures ignore the estimation errors of estimated item parameters. In this article, we argued that these estimation errors, especially the variances and covariances, should be considered even though their impact on subsequent analyses is not always large. This study is part of the research and development of a new IRT-based analysis procedure that takes uncertainty about item parameters into account when statistical inferences are made from IRT models with estimated item parameters.

In this article, the measurement-error approach (Fuller, 1987; Stefanski & Carroll, 1985; Zhang et al., 2011) is applied to formulate asymptotically the major part of the difference between estimated and true test information functions. This difference is caused by uncertainty about item parameters in IRT models, namely, the biases, variances, and covariances of item parameter estimators. The resulting formula, the adjusted term, can be used as a tool to evaluate the impact of estimation errors of estimated item parameters on the test information function. At the same time, the values of the adjusted term can also provide some evidence about whether item parameter estimation is sufficiently accurate. A simulation study showed that the asymptotic formula approximates the difference between estimated and true test information functions rather well.
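A small Monte Carlo sketch can make the source of this difference concrete. It does not compute the adjusted term of this article; it only shows, under assumed independent, unbiased, normally distributed estimation errors with hypothetical SEs (`se_a`, `se_b`) and a scaling constant of 1, that the average plug-in information differs from the true information. For a single 2PL item evaluated at θ = b with b known, the information is a²/4, so the expected plug-in value exceeds the truth by Var(â)/4.

```python
import math
import random

def info_2pl(theta, a, b):
    """2PL item information with scaling constant 1 (an assumption)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def mean_plugin_gap(theta, items, se_a, se_b, n_draws=2000, seed=0):
    """Average of (plug-in information - true information) over simulated estimates."""
    rng = random.Random(seed)
    true_info = sum(info_2pl(theta, a, b) for a, b in items)
    total = 0.0
    for _ in range(n_draws):
        # Simulated estimates: true value plus independent normal error.
        est = [(rng.gauss(a, se_a), rng.gauss(b, se_b)) for a, b in items]
        total += sum(info_2pl(theta, a, b) for a, b in est) - true_info
    return total / n_draws
```

For example, `mean_plugin_gap(0.0, [(1.0, 0.0)], se_a=0.3, se_b=0.0, n_draws=20000)` should land near 0.3² / 4 = 0.0225, up to simulation noise.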

The variances (or SEs) and covariances of item parameter estimators are the major inputs of these asymptotic formulas. Thissen and Wainer (1982) investigated the SEs of parameter estimates for a 2PL or 3PL model under the assumption that the response data are well behaved (that is, the model fits the data reasonably well) and the ability scores are known fixed values. In general, the variances and covariances can be calculated directly from the inverse of the appropriate Fisher information matrix (see Lehmann, 1991). However, these estimates are actually the minimum values obtainable for the variances of the parameter estimates (Thissen & Wainer, 1982). This is because such variance formulas assume that the IRT models are exactly true, while an operational sample typically cannot satisfy, and sometimes may severely violate, this assumption. Thus, SEs are typically underestimated in practice. To provide more accurate estimates of the chance errors of item parameter estimators for an operational sample, we recommend using the bootstrap method (Efron, 1982; Efron & Tibshirani, 1991) to estimate the variances and covariances. It is also of interest to investigate the coverage of confidence intervals or limits for test information functions when uncertainty about item parameters is or is not taken into account. This is a topic for future research.
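The bootstrap recommendation can be sketched as follows: resample examinees with replacement, re-estimate the parameters on each resample, and take the sample covariance matrix of the resulting parameter vectors. This is a generic nonparametric bootstrap under stated assumptions, not a procedure taken from the article. The `estimator` argument is a hypothetical placeholder for a real calibration run, and `proportion_correct` is only a stand-in statistic (not an IRT calibration) so that the sketch runs on its own.

```python
import random

def bootstrap_covariance(rows, estimator, n_boot=200, seed=0):
    """Bootstrap covariance matrix of the vector returned by estimator(rows)."""
    rng = random.Random(seed)
    n = len(rows)
    draws = []
    for _ in range(n_boot):
        # Resample examinees (rows) with replacement, then re-estimate.
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        draws.append(estimator(resample))
    k = len(draws[0])
    means = [sum(d[j] for d in draws) / n_boot for j in range(k)]
    return [
        [
            sum((d[i] - means[i]) * (d[j] - means[j]) for d in draws) / (n_boot - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]

def proportion_correct(rows):
    """Stand-in estimator: per-item proportion correct over 0/1 response rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
```

In practice, `estimator` would wrap the calibration program and return the stacked item parameter estimates, so that the diagonal of the returned matrix gives bootstrap variances and the off-diagonal entries give covariances.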

Acknowledgments

The author would like to thank Ting Lu and Sarah Zhang for their comments and suggestions.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported in this article was supported in part by a grant from University of Illinois Research Board.

References

Allen, N. L., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 technical report (NCES 2001-509). Washington, DC: Office of Educational Research and Improvement, U.S. Department of Education.

Billingsley, P. (1995). Probability and measure (3rd ed.). New York, NY: Wiley.

Birnbaum, A. (1968). Some latent ability models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 392–479). Reading, MA: Addison-Wesley.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Efron, B., & Tibshirani, R. J. (1991). An introduction to the bootstrap. New York, NY: Chapman & Hall.

Fuller, W. A. (1987). Measurement error models. New York, NY: John Wiley.

Lehmann, E. L. (1991). Theory of point estimation. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Lewis, C. (1985, June). Estimating individual abilities with imperfectly known item response functions. Paper presented at the annual meeting of the Psychometric Society, Nashville, TN.

Lewis, C. (2001). Expected response functions. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 163–171). New York, NY: Springer-Verlag.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Mislevy, R. J. (1992). The variance of Rasch ability estimates from partially-known item parameters (ETS Research Report 92-9-ONR). Princeton, NJ: Educational Testing Service.

Mislevy, R. J., & Bock, R. D. (1982). BILOG: Item analysis and test scoring with binary logistic models [Computer software]. Mooresville, IN: Scientific Software.

Mislevy, R. J., Sheehan, K. M., & Wingersky, M. S. (1993). How to equate tests with little or no data. Journal of Educational Measurement, 30, 55–78.

Mislevy, R. J., Wingersky, M. S., & Sheehan, K. M. (1994). Dealing with uncertainty about item parameters: Expected response functions (ETS Research Report 94-28-ONR). Princeton, NJ: Educational Testing Service.

Muraki, E., & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating scale data [Computer software]. Chicago, IL: Scientific Software International.

Oosterloo, S. (1984). Confidence intervals for test information and relative efficiency. Statistica Neerlandica, 38, 91–107.

Serfling, R. J. (1980). Approximation theorems of mathematical statistics. New York, NY: John Wiley.

Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression. Annals of Statistics, 13, 1335–1351.

Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397–412.

Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.

Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117–130.

Weisberg, S. (2005). Applied linear regression (3rd ed.). Hoboken, NJ: John Wiley.

Zhang, J. (2007). Refinements of bias-correction procedure for the weighted likelihood estimator of ability (ETS Research Report 07-23). Princeton, NJ: Educational Testing Service.

Zhang, J., Xie, M., & Song, X. (2011). Investigating the impact of uncertainty about item parameters on ability estimation. Psychometrika, 76, 97–118.