Abstract
In item response theory (IRT), item response probabilities are a function of item characteristics and latent trait scores. Within an IRT framework, trait score misestimation results from (a) random error, (b) the trait score estimation method, (c) errors in item parameter estimation, and (d) model misspecification. This study investigated the relative effects of these error sources on the bias and confidence interval coverage rates for trait scores. Our results showed that overall, bias values were close to 0, and coverage rates were fairly accurate for central trait scores and trait estimation methods that did not use a strong Bayesian prior. However, certain types of model misspecifications were found to produce severely biased trait estimates with poor coverage rates, especially at extremes of the latent trait continuum. It is demonstrated that biased trait estimates result from estimated item response functions (IRFs) that exhibit systematic conditional bias, and that these conditionally biased IRFs may not be detected by model or item fit indices. One consequence of these results is that certain types of model misspecifications can lead to estimated trait scores that are nonlinearly related to the data-generating latent trait. Implications for item and trait score estimation and interpretation are discussed.
In psychometric modeling, nontrivial errors in latent trait estimates are unavoidable for a variety of reasons, including a limited number of available items, time restrictions, and test-taker fatigue. Nevertheless, those who design, administer, and score tests aim to achieve the most reliable trait estimates possible. In other words, a primary psychometric concern is minimizing the errors associated with trait estimates. Within an item response theory (IRT) framework, estimated latent trait scores are computed from a series of examinee item responses and the characteristics of the items. Errors in these estimated trait scores arise from random error, the trait score estimation method, errors in item calibration, and model misspecification. The purpose of this study was to identify the relative consequences of each error source for trait score estimates. This study focuses in particular on misspecification of the item response function (IRF) functional form.
Error Sources
It is common practice in IRT analyses to estimate latent trait scores using a two-stage approach. In the first stage, item calibration, item parameters are estimated using methods such as marginal maximum likelihood (MML; Bock & Aitkin, 1981). In the second stage, trait score estimates are typically computed using methods such as maximum-likelihood estimation or expected a posteriori (EAP) prediction and treating the item parameter estimates as fixed. Because MML produces consistent and computationally efficient estimates of IRT item parameters, this approach is considered the gold standard technique for item parameter estimation.
Under a two-stage approach, errors in IRT trait estimation can be attributed to four nested sources. These error sources (Jones, Wainer, & Kaplan, 1984, pp. 2-3) are (a) random error due to the probabilistic nature of the model, (b) errors in trait estimation due to the estimation method, (c) errors in item parameter estimation, and (d) model misspecification. These error sources are ordered hierarchically; the existence of a later-listed error source implies the existence of all earlier-listed error sources. Strictly speaking, error sources (a) and (b) are inseparable because trait estimates can be obtained only by using a particular trait estimation method. Furthermore, when using estimated item parameters, errors due to the trait estimation method persist. Finally, when the model is misspecified, errors in item parameter estimates and errors due to the trait estimation method are all present. Because models are imperfect representations of reality, it is reasonable to expect that all four error sources are present in any IRT analysis.
With regard to the first error source, errors may be attributed to random variation resulting from the fact that IRT models are probabilistic rather than deterministic. The effect of random error decreases as the number of administered test items increases, but because test length is limited in practice, there are practical limitations on the extent to which random error can be reduced. With regard to the second error source, different methods for estimating latent trait scores introduce different types of errors. In this article, three trait estimation methods are considered: maximum likelihood (ML), Bayesian EAP (Bock & Mislevy, 1982), and weighted likelihood (WL; Warm, 1989). Details of the properties of these methods and their standard errors (SEs) are provided in Online Appendix A. In brief, ML estimates are outwardly biased (Lord, 1983), EAP estimates are biased toward the mean of the prior distribution, and WL estimates are unbiased (Warm, 1989).
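To make the contrast between estimators concrete, the following is a minimal sketch (not the estimation code used in this study) that scores one response vector by grid-search ML and by EAP under two normal priors. The item parameters, the response pattern, the grid, and the two prior standard deviations are hypothetical, and the logistic metric without a scaling constant is an assumption.

```python
import math

def irf_4pl(theta, a, b, c, d):
    """Response probability with lower asymptote c and upper asymptote d."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector, item parameters treated as known."""
    total = 0.0
    for u, (a, b, c, d) in zip(responses, items):
        p = irf_4pl(theta, a, b, c, d)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

# Evenly spaced grid of candidate trait values (hypothetical range and step).
GRID = [-6.0 + 0.05 * k for k in range(241)]

def ml_estimate(responses, items):
    """Grid-search ML estimate: the grid point with the largest log-likelihood."""
    return max(GRID, key=lambda t: log_likelihood(t, responses, items))

def eap_estimate(responses, items, prior_sd):
    """Posterior mean under a N(0, prior_sd^2) prior, by summation over GRID."""
    logs = [log_likelihood(t, responses, items) - 0.5 * (t / prior_sd) ** 2
            for t in GRID]
    top = max(logs)  # subtract the maximum before exponentiating for stability
    weights = [math.exp(v - top) for v in logs]
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

# Hypothetical 20-item test with difficulties from -2.0 to 1.8.
items = [(1.2, -2.0 + 0.2 * j, 0.15, 0.95) for j in range(20)]
responses = [1] * 16 + [0] * 4  # the 16 easiest items answered correctly

ml = ml_estimate(responses, items)
eap_strong = eap_estimate(responses, items, prior_sd=0.5)  # strong prior
eap_weak = eap_estimate(responses, items, prior_sd=2.0)    # weak prior
```

For this above-average response pattern, the estimate under the stronger prior lies closer to the prior mean of 0 than the estimate under the weaker prior, which is the inward pull described above.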
A third source of error results from imperfect estimation of item parameters. Because item parameter estimates need not equal their true values, errors in item calibration may affect the accuracy of trait estimates. To complicate the issue, accurate item parameter estimates are often difficult to obtain, particularly for highly parameterized models such as the three-parameter (3PL; see Thissen & Wainer, 1982) and four-parameter (4PL; Culpepper, 2016; Waller & Feuerstahler, 2017) IRT models. Accurate item parameter estimates are particularly difficult to obtain when the calibration sample is small. In two-stage estimation, the uncertainty associated with the item parameter estimates is typically ignored; as a result, the asymptotic properties of the ML, EAP, and WL trait estimators do not necessarily hold when estimated item parameters are treated as known. For instance, WL is no longer unbiased when item parameters are estimated instead of known (Zhang, 2005). Moreover, errors in item parameter estimation can lead to increased bias of trait estimates (Zhang, Xie, Song, & Lu, 2011) and underestimated standard errors (Cheng & Yuan, 2010), particularly when using small item calibration samples (Tsutakawa & Johnson, 1990). Various corrections have been proposed to adjust
In the majority of applications, latent trait estimates are computed assuming correct model specification. Model misspecification in IRT, broadly speaking, occurs when an item response model cannot accurately characterize item response probabilities. Under this definition, model misspecification necessarily leads to incorrect item parameter estimates. In other words, correct item parameters do not exist when the model is misspecified because the estimated IRFs can never perfectly trace the true IRFs. One type of model misspecification not considered in this article occurs when unidimensional models are fit to multidimensional data (e.g., Drasgow & Parsons, 1983). In contrast, this article focuses on one specific type of functional form misspecification: ignoring the need for an IRF upper asymptote parameter less than 1, that is, fitting the 1PL, 2PL, or 3PL when the 4PL is the data-generating model.
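The nesting of these models can be sketched in a few lines. The function below is an illustration under an assumed parameterization (logistic metric, no scaling constant, with the hypothetical argument names a, b, c, and d for discrimination, difficulty, lower asymptote, and upper asymptote); it is not taken from the study's materials.

```python
import math

def irf_4pl(theta, a, b, c=0.0, d=1.0):
    """Item response function with lower asymptote c and upper asymptote d.

    Setting d = 1 gives the 3PL; setting c = 0 and d = 1 gives the 2PL.
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: the 3PL curve approaches 1 at high trait values,
# whereas the 4PL curve levels off at d < 1.
p_3pl_high = irf_4pl(4.0, a=1.5, b=0.0, c=0.2, d=1.0)
p_4pl_high = irf_4pl(4.0, a=1.5, b=0.0, c=0.2, d=0.9)
```

A model that fixes the upper asymptote at 1 cannot reproduce the second curve at high trait values, which is the functional form misspecification studied here.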
Few studies have explicitly investigated the effects of functional form misspecification on trait estimation error. Jones et al. (1984) generated item responses from a complex and nonstandard IRT model, and fit the resulting data sets to the 1PL, 2PL, and 3PL. They then generated item response vectors using the nonstandard model, and estimated ML trait scores using the estimated 1PL, 2PL, and 3PL item parameters. The authors found that trait estimates computed from the estimated 3PL parameters led to the least bias but highest mean squared error at low trait values, and that ML trait estimates were severely biased for the 1PL and 2PL. Wainer and Thissen (1987) simulated data in a similar manner to Jones et al. (1984) but focused on several latent trait estimators. They found that in sufficiently long tests (≥20 items), the 3PL led to more accurate trait estimates than the 1PL and 2PL at all trait levels. More recently, Markon and Chmielewski (2013) studied the effects of model misspecification on the bias, variance, mean squared error, and confidence interval coverage rates of IRT trait estimates. The authors generated data according to the 3PL and estimated ML trait scores using the true 3PL item parameters and the estimated 1PL, 2PL, and 3PL item parameters. At all trait values, they found that trait estimates computed from the 1PL and 2PL were higher than trait scores estimated from the true and estimated 3PL item parameters. They also found that at moderate to low trait values, trait estimates computed from the 1PL and 2PL had lower variance but higher mean squared error than either set of 3PL trait estimates. However, at high trait values, trait estimates computed from the true and estimated 3PL item parameters had small variance relative to the misspecified trait estimates. The authors also found that confidence interval coverage rates were higher for the correctly specified model than for the misspecified models. 
Moreover, for both correct and incorrect model specifications, they found substantial differences in coverage rates across trait values. Specifically, under model misspecification, coverage rates tended to be lower than their nominal rate at low
Confidence Intervals for Trait Estimates
Many IRT trait recovery studies consider bias, variability, or correlation statistics that are functions only of the true and estimated trait scores. Fewer studies incorporate the estimated standard errors. As suggested by Markon and Chmielewski (2013), “The applied literature on misspecification has focused much more extensively on bias effects than variance effects, which can be misleading given that overall error is a function of both” (p. 108). One way to assess the combined accuracy of the estimate and its standard error is by comparing the nominal and observed confidence interval coverage rates.
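As an illustration of how these two criteria can be computed at a single true trait value, the sketch below assumes symmetric intervals of the form estimate ± z × standard error with z = 1.96 for a nominal .95 rate. The function names and the toy numbers are hypothetical.

```python
def conditional_bias(theta_true, estimates):
    """Mean estimate minus the true trait value."""
    return sum(estimates) / len(estimates) - theta_true

def coverage_rate(theta_true, estimates, std_errors, z=1.96):
    """Proportion of intervals estimate +/- z * se that contain theta_true."""
    hits = 0
    for est, se in zip(estimates, std_errors):
        if est - z * se <= theta_true <= est + z * se:
            hits += 1
    return hits / len(estimates)

# Toy example: three replications at a true trait value of 0.
estimates = [0.1, -0.3, 2.5]
std_errors = [0.2, 0.2, 0.2]
bias = conditional_bias(0.0, estimates)
coverage = coverage_rate(0.0, estimates, std_errors)
```

In this toy example the first two intervals contain the true value and the third does not, so the observed coverage is 2/3, well below the nominal rate.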
A primary use of trait standard errors, denoted
Notably, confidence intervals are traditionally constructed using the formula in Equation 1, even for trait estimates computed with methods other than ML (see De Ayala, Schafer, & Sava-Bolesta, 1995). Standard errors and confidence intervals play a central role in CAT (Weiss, 1982). If obtaining point estimates of
Although it has been shown that trait scores estimated from misspecified models are often inaccurate (Jones et al., 1984; Markon & Chmielewski, 2013; Wainer & Thissen, 1987), the relative contribution of model misspecification is unclear. For instance, it is not clear whether misestimated trait scores result from a lack of test information, the trait estimation method, inaccurate item parameter estimates, or model misspecification. In the next section, a Monte Carlo simulation study is described that elucidates the relative contributions of each of these error sources.
As noted earlier, our simulation study is limited to one type of functional form misspecification, namely fitting the 3PL, 2PL, and 1PL to data generated from the 4PL. This type of misspecification was selected first for its simplicity. Because the 4PL extends the 3PL by adding a single upper asymptote parameter, any differences in trait estimation accuracy between the fitted 3PL and the fitted 4PL are attributable to the omission of the upper asymptote parameter. In addition, recent interest in the 4PL suggests that an upper asymptote is needed to avoid biased trait scores at the upper end of the trait continuum in the context of computerized adaptive testing (Rulison & Loken, 2009). Finally, although the results of this study will be limited to one type of functional form misspecification, the author hopes to provide a general framework in which the magnitude of functional form misspecification, the form of which is unknown in real data, can be understood and assessed.
Method
A series of Monte Carlo simulations was conducted to gauge the effects of trait estimation error, item parameter estimation error, and model misspecification on the bias and confidence interval coverage rates for latent trait estimates. In all cases, item responses were generated according to the four-parameter model (4PL; Barton & Lord, 1981). For item
where
To simulate data, 4PL parameters for 100 items were generated by drawing parameter values from the following distributions:
Conditional
Note. Quantities not in parentheses give the bias values averaging (where applicable) across 100 sets of estimated item parameters. Quantities in parentheses give the bias values that are largest in absolute value among 100 sets of estimated item parameters. Rows 1 to 4 correspond to trait estimates computed from the data-generating item parameters. Rows 5 to 16 correspond to trait estimates computed with the EAP2 estimator. ML = maximum likelihood; WL = weighted likelihood; EAP = expected a posteriori. 1PL, 2PL, 3PL, 4PL = one-parameter, two-parameter, three-parameter, four-parameter model.
A sequence of 13
The simulation proceeded in three stages: In the first stage, trait estimates were computed from the 26,000 response vectors and the data-generating 4PL item parameters. By using a long 100-item test, this simulation design allowed us to explore the accuracy of each trait estimation method while minimizing the effects of random error. The results for this first set of simulations should reflect the properties of the ML, EAP, and WL estimators that were described earlier. Specifically, ML trait estimates should demonstrate a slight outward bias, WL trait estimates should be unbiased, EAP1 estimates should demonstrate a strong inward bias, and EAP2 estimates should demonstrate a mild inward bias.
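The logic of this first stage can be sketched as follows for a single trait value and a single estimator. This is an illustrative reduction, not the study's code: the item parameter ranges, the number of replications, the prior standard deviation, the quadrature grid, and the use of the posterior standard deviation as the standard error are all assumptions, and the asymptotes are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math
import random

def irf_4pl(theta, a, b, c, d):
    """4PL response probability (logistic metric, no scaling constant)."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def eap(responses, log_p, log_q, grid, prior_sd):
    """EAP estimate and posterior SD from precomputed log-probability tables."""
    logs = []
    for k, theta in enumerate(grid):
        lp = -0.5 * (theta / prior_sd) ** 2  # log N(0, prior_sd^2) kernel
        for j, u in enumerate(responses):
            lp += log_p[k][j] if u == 1 else log_q[k][j]
        logs.append(lp)
    top = max(logs)
    w = [math.exp(v - top) for v in logs]
    total = sum(w)
    mean = sum(t * x for t, x in zip(grid, w)) / total
    var = sum((t - mean) ** 2 * x for t, x in zip(grid, w)) / total
    return mean, math.sqrt(var)

def simulate_condition(theta_true, items, n_reps=200, prior_sd=2.0, seed=1):
    """Return (bias, coverage) of EAP estimates at one fixed trait value."""
    rng = random.Random(seed)
    grid = [-6.0 + 0.1 * k for k in range(121)]
    log_p = [[math.log(irf_4pl(t, *it)) for it in items] for t in grid]
    log_q = [[math.log(1.0 - irf_4pl(t, *it)) for it in items] for t in grid]
    p_true = [irf_4pl(theta_true, *it) for it in items]
    total_est, hits = 0.0, 0
    for _ in range(n_reps):
        # Generate one response vector from the data-generating parameters.
        responses = [1 if rng.random() < p else 0 for p in p_true]
        est, se = eap(responses, log_p, log_q, grid, prior_sd)
        total_est += est
        hits += int(est - 1.96 * se <= theta_true <= est + 1.96 * se)
    return total_est / n_reps - theta_true, hits / n_reps

# Hypothetical 100-item pool; these ranges are placeholders, not the study's.
pool_rng = random.Random(0)
items = [(pool_rng.uniform(0.8, 2.0), pool_rng.uniform(-2.5, 2.5),
          pool_rng.uniform(0.05, 0.25), pool_rng.uniform(0.80, 0.98))
         for _ in range(100)]
bias, coverage = simulate_condition(0.0, items)
```

Repeating the call over a sequence of trait values and over each estimator would yield conditional bias and coverage profiles of the kind examined in this stage; replacing `items` with estimated parameters when scoring (while still generating responses from the true parameters) would correspond to the later stages.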
In the second stage of simulation, trait estimates were computed from the 26,000 response vectors and 100 sets of estimated 4PL item parameters. This second stage allowed us to explore the error introduced by uncertainty in item parameter estimation. To obtain 4PL item parameter estimates, 100 vectors of
In the third stage of simulation, trait estimates were computed from the 26,000 response vectors and estimated 1PL, 2PL, and 3PL item parameters. To obtain estimated item parameters from these misspecified models, the same 100 data sets were used with
Results
Effects of Trait Estimation
The combined effects of random error and trait estimation error are first evaluated. For the four trait estimation methods, the observed conditional bias values are reported in rows 1 to 4 of Table 1, and the observed conditional confidence interval coverage rates are reported in rows 1 to 4 of Online Table A. At almost all
Effects of Item Parameter Estimation
The effect of item parameter estimation on latent trait estimates was evaluated using the same sets of 13 (
At each of the three sample sizes, average conditional bias and confidence interval coverage rates were computed both across and within the 100 replications (i.e., the 100 sets of estimated item parameters). Bias values are presented in rows 5 to 10 of Table 1. Here, the numbers outside parentheses are the bias values averaged across replications, and the numbers within parentheses are the maximum absolute bias values (sign reintroduced) within replications. Notice that the three sample sizes lead to similar patterns and magnitudes of bias. Note also that the conditional biases do not strictly increase as
Confidence interval coverage rates are presented in rows 5 to 10 of Online Table A. Here, the numbers outside parentheses are the coverage rates averaged across replications, and the numbers within parentheses are the minimum coverage rates within replications. Notice that across replications, coverage is high for all three sample sizes and closest to the nominal rate of .95 for central
Effects of Model Misspecification
The effects of model misspecification on bias are shown in rows 11 to 16 of Table 1. As in the previous set of analyses, the quantities outside parentheses indicate the average bias across the 100 replications, and the quantities within parentheses give the (signed) maximum absolute bias observed within replications. The effects of model misspecification on confidence interval coverage are shown in rows 11 to 16 of Online Table A. Here, the quantities outside parentheses give the coverage rates across replications, and the quantities within parentheses give the minimum coverage rate within replications. When interpreting these results, bear in mind that each set of 1PL, 2PL, and 3PL item parameter estimates was computed with the very large sample size of
Table 1 and Online Table A reveal that both the 1PL and 2PL lead to relatively unbiased trait estimates (
A Closer Look at Biased Trait Estimates
In the previous section, it was found that, for responses generated under the 4PL, trait estimates computed from the misspecified 3PL were heavily biased at extreme trait levels. Unexpectedly, trait estimates computed from misspecified 1PL or 2PL item parameter estimates were nearly unbiased. The fact that this result occurred for the 3PL but not for the 1PL or 2PL—which are submodels of the 3PL—suggests that some aspect of 3PL item parameter estimation causes biased trait estimates. As a first step, then, it is necessary to identify the ways in which the 3PL item parameter estimates differ from the 1PL, 2PL, and 4PL item parameter estimates. In previous studies of item parameter recovery, various indices have been used to evaluate item parameter recovery, including parameter bias and root mean square error (e.g., Hulin, Lissak, & Drasgow, 1982) and overall IRF recovery (e.g., Ramsay, 1991). Parameter recovery is explored by comparing the data-generating 4PL item parameters with the 100 sets each of 1PL, 2PL, 3PL, and 4PL item parameter estimates (i.e., the same sets of item parameters and estimates used in the simulation study described earlier). First, the recovery of item difficulties

Scatter plot of conditional EAP2
Before exploring why the 3PL estimated IRFs and trait scores are conditionally biased, it should be noted that sets of estimated item parameters that lead to biased trait estimates do not necessarily exhibit poor model–data fit. To illustrate this point, absolute model fit, relative model fit, and item fit statistics are computed for the sets of 1PL, 2PL, 3PL, and 4PL estimated item parameters described earlier. All fit results are reported in Table 2. Recall that each model was fit to the same 100 data sets, where each data set was generated under the 4PL with
Absolute and Relative Model and Item Fit Statistics.
Note. This table represents model and item fit statistics for 100 data sets (replications), each with 100 items and
Proportions of estimated models that fit according to the M2 statistic with α = .05.
Average, minimum, and maximum RMSEA2 index across the 100 replications.
Proportion of data sets for which each model provides the lowest RMSEA2 value among the four competing models.
Proportion of data sets for which each model provides the lowest DIC value among the four competing models.
Proportion of items that fit according to the S−X2 statistic with α = .05. This proportion is reported across test items and replications, and the minimum and maximum proportions within replications are reported in parentheses.
Above, it was established that biased trait estimates are the result of systematically biased IRF estimates, and that conditionally biased IRFs need not be detected by fit indices. The effects of item elimination based on the
Implications
The finding that trait estimates are biased under the misspecified 3PL suggests that the metric of the estimated trait differs from the metric of the data-generating trait. The fact that the 3PL is negatively biased at both low and high trait scores implies a nonlinear relationship between
It is worth noting that this is not the first time that systematic biases in trait estimates—implying a nonlinear distortion of the latent trait metric—have been found with regard to the 3PL. A similar result was found by Yen (1981) in a slightly different context. She found that, when analyzing 3PL-generated data with the 2PL and 3PL, the two models fit equally well, but that the two sets of trait estimates were curvilinearly related, with “the [2PL] trait estimates more stretched out at the high end than the [3PL] traits” (p. 259). Yen explained her results in terms of model misspecification and the ability of the 2PL parameters to compensate for the lack of lower asymptote. Our results add nuance to this understanding of trait estimation under misspecified models. Rather than attempting to find and fit the true data-generating model, simpler IRT models may perform well so long as they make relatively unbiased test-level predictions conditional on all
Perhaps the most important implications of these findings concern how the latent trait metric is interpreted. Many of the statistics and procedures applied to IRT trait estimates imply an interval-level metric. For instance, linear item linking, evaluating change or growth over time, and even the use of confidence intervals based on normally distributed standard errors all depend on the idea that intervals on the latent trait are fixed and meaningful. Furthermore, distributional statements about the latent trait and parametric statistics applied to latent trait scores imply an interval-level latent trait. One may argue that interval-level interpretations are justified so long as the same procedures are applied to all individuals (i.e., so long as all individuals are scored using the same set of estimated item parameters). The trouble with this idea is that the units and spacing of the trait scores are then inextricably bound to the estimated item parameters of the test. If the test’s item parameters were to be validated with independent data, small differences in methodology, such as a change in Bayesian prior, could lead to different substantive conclusions.
Solutions to the problem of biased trait estimates are not obvious, partly because it is not easy to detect bias when the data-generating model is unknown. As demonstrated above, item elimination based on item fit statistics might reduce trait estimation bias in extremely large item calibration samples, but these methods are less effective in moderate to small item calibration samples. Trait estimation methods that adjust for both the bias and variability associated with item parameter estimation (e.g., Lewis, 1985, 2001; Tsutakawa & Johnson, 1990) might result in less biased trait estimates, although these methods assume correct model specification and have not yet been tested on misspecified IRT models. At minimum, researchers can perform additional checks when calibrating item parameters to detect whether the estimated latent trait is unstable. For instance, when calibrating item parameters, researchers could fit additional models with other methodological choices, and check whether methodological differences lead to test response functions that predict very different scores for some ranges of the latent trait continuum. An example of such a check was demonstrated in the previous section, in which the 3PL model was refit using several choices of Bayesian priors. In this example, it was found that trait estimates at extremes of the latent trait had very different average values depending on the choice of Bayesian priors. If these robustness checks suggest that item calibration is indeed sensitive to methodological choices, either a less malleable (but still well-fitting) model should be chosen, or caution should be exercised to avoid interval-level score interpretations.
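One way such a check might be carried out is sketched below: the test response function (the sum of the item response functions, i.e., the expected summed score) is evaluated over a grid of trait values for two calibrations of the same items, and the largest discrepancy is located. The function names, the grid, and the two parameter sets are hypothetical, and no threshold for an unacceptable discrepancy is implied.

```python
import math

def irf(theta, a, b, c=0.0, d=1.0):
    """Logistic item response function with lower asymptote c and upper asymptote d."""
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))

def trf(theta, items):
    """Test response function: expected summed score at theta."""
    return sum(irf(theta, *item) for item in items)

def largest_trf_gap(items_a, items_b, grid):
    """Return (largest absolute TRF difference, trait value where it occurs)."""
    return max((abs(trf(t, items_a) - trf(t, items_b)), t) for t in grid)

# Two hypothetical calibrations of the same 10 items, differing only in
# whether the upper asymptote is fixed at 1 or set to 0.9.
calibration_1 = [(1.0, 0.0, 0.2, 1.0)] * 10
calibration_2 = [(1.0, 0.0, 0.2, 0.9)] * 10
grid = [-6.0 + 0.5 * k for k in range(25)]  # -6.0, -5.5, ..., 6.0

gap, theta_at_gap = largest_trf_gap(calibration_1, calibration_2, grid)
```

Here the two calibrations agree closely at low trait values and diverge by roughly one summed-score point at the upper end of the grid, the kind of localized disagreement that would call for caution in interpreting scores in that range.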
Conclusion
The goal of this study was to determine the relative impact of trait estimation error, item parameter estimation error, and model misspecification on the bias and confidence interval coverage rates of IRT latent trait estimates. Our results clearly showed that, assuming sufficiently large data sets, the greatest errors in latent trait estimates result from model misspecification. However, it was found that not all misspecified models introduce estimation errors, and the effect of model misspecification is not directly related to the complexity of the fitted model. Specifically, it was found that when 4PL data were fit to the 3PL, trait estimates were heavily biased and had poor coverage at extreme trait levels. Surprisingly, when 4PL data were fit to the 2PL or 1PL, trait estimate bias statistics and confidence interval coverage rates were similar to those from the correctly specified model. Follow-up analyses demonstrated that bias in latent trait estimates can be predicted from systematic conditional bias in the estimated item response functions.
Our simulation study was limited to only one type of model misspecification, and so the specific results reported in this article may not directly apply to other types of functional form misspecifications or misspecified dimensionality. However, the results of this study clearly demonstrate that seemingly minor model misspecification can have large effects on the scaling of the latent trait metric. Put another way, it was demonstrated that there is not a direct relationship between model/item fit and parameter estimation bias, a result which ought to be explored for other types of IRT model misspecification. Our results show that estimated models that demonstrate acceptable fit are not necessarily immune from the effects of model misspecification, especially when item calibration samples are not extremely large.
In this article, it was found that item parameter estimation methods that systematically underestimate or overestimate response probabilities can lead to trait estimates that are nonlinearly related to the true latent trait. Systematic bias could also occur if, for example, guessing behavior is not taken into account (low-ability examinees will have systematically underestimated trait scores, even if a guessing parameter is fixed to a certain value but is inappropriate for the data). It is recommended that researchers routinely evaluate the sensitivity of estimated item parameters and response functions to small methodological choices. Researchers are also encouraged to draw interval-level score interpretations with great caution and only after evaluating the sensitivity of their fitted models.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary material is available for this article online.
