The Problem of Bias in Person Parameter Estimation in Adaptive Testing

Abstract

It is shown that deviations of estimated from true values of item difficulty parameters, caused for example by item calibration errors, the neglect of randomness of item difficulty parameters, testlet effects, or rule-based item generation, can lead to systematic bias in point estimation of person parameters in the context of adaptive testing. This effect occurs even when the errors of the item difficulty parameters are themselves unbiased. Analytical calculations as well as simulation studies are discussed.

Keywords

computerized adaptive testing, biased point estimation, item response theory, calibration errors, differential item functioning, testlets, computerized item generation

As many computer-based methods are readily available nowadays, adaptive testing has become a popular and efficient testing method in psychometrics. The basic principle of an adaptive test can be briefly described as follows: An examinee is first given a few items to obtain a crude initial estimate of the person parameter. Then the next item is chosen so that it contributes maximally to the precision of the updated estimate obtained after the examinee has completed the item. To this end, several item selection criteria have been developed, for example, the criterion of maximal Fisher information, which is probably the most commonly used in practice (van der Linden, 2010). Furthermore, a number of content- and test-specific constraints often have to be taken into account. The item selection process is repeated iteratively and terminates after a certain predefined number of items or when the estimate of the person parameter no longer changes significantly.

Item response theory (IRT) models are an essential element of adaptive testing, as they allow for an easy evaluation of the Fisher information for single items. A central feature of IRT models is that they have separate parameters to describe item and person characteristics. However, in this article it is shown that when the true item difficulty parameters of IRT models differ from the estimated ones, a systematic bias in the estimation of person parameters in adaptive testing arises, even when the errors of the difficulty parameters are unbiased.

Various sources of errors in item difficulty estimation are conceivable, four of which are discussed in the following section. Their effects will be demonstrated on person parameter estimation in adaptive testing analytically as well as by simulation studies with the Rasch, the two-parameter logistic (2PL), and the three-parameter logistic (3PL) models.

An already known phenomenon caused by imprecise item parameters in adaptive testing is the so-called capitalization on chance, which is described in more detail by van der Linden and Glas (2000). This effect is mainly due to the fact that an overestimation of the discrimination parameter leads to a disproportionate overestimation of the corresponding information function, which is a quadratic function of the discrimination. Hence, in many cases, the test algorithm prefers items with large calibration errors. As a result, the accuracy of person parameter estimation is likely to be overestimated; however, no systematic bias is expected for the estimates themselves.
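
As a brief worked illustration of the quadratic dependence (an added sketch that uses the standard 2PL information function; the discrimination a, the response probability P, and the relative calibration error δ are notation introduced only for this example): at $θ = β$ the response probability is 1/2 regardless of a, so an overestimated discrimination $\hat{a} = a (1 + δ)$ inflates the information at that point by the factor $(1 + δ)^{2}$.

$$ I(θ) = a^{2} P(θ) (1 - P(θ)), \qquad I(β) = \frac{a^{2}}{4}, \qquad \frac{\hat{I}(β)}{I(β)} = \frac{(a (1 + δ))^{2}}{a^{2}} = (1 + δ)^{2} . $$

For example, a discrimination that is overestimated by 20% ($δ = 0.2$) would overstate the information at the item's difficulty by a factor of 1.44.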

Sources of Variation in Item Difficulties

Deviations of the true item difficulty parameters from the estimated ones might be due to several reasons. Here four possible sources are briefly discussed, which are, depending on the specific testing situation, likely to contribute in varying degrees.

First, errors in the calibration of item parameters are generally not avoidable. In adaptive testing, a whole item pool rather than just a single test has to be calibrated. Furthermore, the items of the pool usually need to be replaced regularly to maintain item pool integrity. As calibrating new items is often expensive, the current trend is to minimize the size of the calibration sample (van der Linden, 2010), which in most cases leads to increased calibration errors. Errors in item calibration might therefore often be more predominant in adaptive testing than in linear testing.

Second, the neglect of person- or subgroup-specific differences in item difficulties might equally result in a misfit between true and estimated difficulty parameters. Item parameters are usually modeled as fixed effects. However, especially the item difficulty parameters might be slightly different for different individual persons. It is not unusual that when given two items, A and B, some people find Item A more difficult than Item B, whereas other people have more problems dealing with Item B than with Item A. Furthermore, when there are distinguishable subgroups of respondents, (mean) difficulty parameters might also vary across different groups (differential item functioning [DIF]; see Camilli & Shepard, 1994; Holland & Wainer, 1993; Osterlind & Everson, 2009; Zumbo, 2007). There are methods for detecting DIF, yet they are not always routinely applied. Also, they might only work when the differences are sufficiently large.

Although in classical IRT it is generally assumed that the difficulty of an item is not person or group dependent, it might therefore in many cases be more realistic to think of item difficulties as random parameters which differ across persons and/or groups (De Boeck, 2008; Rijmen & De Boeck, 2002). Assuming that in the calibration of (fixed) item parameters the means of these random effects are estimated, deviations from these means to the true person- or group-dependent item difficulty parameters are likely to occur.

Third, another source of variation in item difficulty parameters can arise within tests that comprise so-called testlets (Bradlow, Wainer, & Wang, 1999; Ip, 2010; Scott & Ip, 2002; Sireci, Thissen, & Wainer, 1991; Wainer, Bradlow, & Wang, 2007; Wainer & Kiely, 1987; Wang & Wilson, 2005). Testlets are subsets of items for which the assumption of local item independence might be violated. Often, items of the same testlet share a common stimulus (e.g., a reading passage or a table of numbers). In these situations, it can be assumed that the performance on the items not only depends on absolute values of item difficulties but also on how well the examinee is able to process the stimulus. In other words, true item difficulties are no longer fixed, but vary across persons, where the direction of the person-dependent shift of item difficulty is the same for all items that belong to the same testlet. Wainer et al. (2007) point out that testlets are especially useful in adaptive testing, as they can be conveniently used to meet additional constraints.

Fourth, deviations of true from estimated values of item difficulty in adaptive testing can also arise from rule-based item generation (Geerlings, Glas, & van der Linden, 2011; Holling, Bertling, & Zeuch, 2009). This technique can significantly improve test security, as the traditional item pool is replaced by an (ideally) infinite pool of items that can be generated by a computer algorithm. The main idea for rule-based item generation is that the items in the pool are nested in families, where items that belong to the same family share the same values of item parameters and can be generated according to certain rules (item cloning). However, as the items within a family actually differ from one another, it can be assumed that in practical applications, the assumption of equal item parameters of items within the same family holds only approximately, so that, at least to a certain extent, deviations of the true from the estimated (mean) item difficulty parameters can be expected (van der Linden, 2010).

Systematic Bias in Estimating Persons

In adaptive testing, deviations of estimated from true values of item difficulty parameters, caused, for example, by the sources described in the previous section, can lead to systematic bias in point estimation of person parameters. When the errors of the item parameter estimates are unbiased, a systematic overestimation of the absolute values of person parameter estimates is expected, which generally increases with the absolute values of the true person parameters.

For the sake of simplicity, the following explanation of this phenomenon is based on the family of Rasch models and the criterion of maximal Fisher information because here the next item of an adaptive test is chosen so that its difficulty parameter coincides with the current estimate of the person parameter. This is the situation in which the systematic bias of person parameter estimation occurs in “pure form.” However, because a difficulty parameter close to the current estimate is in most cases also advantageous under other models and other selection criteria, the effect will also be observable in much more general settings. This is demonstrated in the following section by simulation studies, in which the 2PL and 3PL models are evaluated in addition to the Rasch model.
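
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names, the representation of the pool as (difficulty, discrimination) pairs, and the example values are assumptions made here. It uses the standard 2PL information function, which reduces to the Rasch case when the discrimination equals 1.

```python
import math


def prob_correct(theta, b, a=1.0):
    """2PL probability of a correct response (Rasch model when a == 1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, b, a=1.0):
    """Item information a^2 * P * (1 - P) at ability theta."""
    p = prob_correct(theta, b, a)
    return a * a * p * (1.0 - p)


def select_item(theta_hat, items):
    """Index of the item with maximal information at the current estimate.

    items is a list of (difficulty, discrimination) pairs; both values are
    the *estimated* item parameters available to the test algorithm.
    """
    return max(
        range(len(items)),
        key=lambda i: fisher_information(theta_hat, items[i][0], items[i][1]),
    )


# Hypothetical Rasch pool: the item whose difficulty is closest to the
# current estimate is selected.
rasch_pool = [(-1.0, 1.0), (0.3, 1.0), (0.6, 1.0), (2.0, 1.0)]
chosen = select_item(0.5, rasch_pool)
```

Under the Rasch model the rule amounts to picking the item whose estimated difficulty is closest to the current estimate. Under the 2PL model a more discriminating item can be preferred even if its difficulty is somewhat further away, which is why the effect is expected to appear in a less pure form there.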

To explain the effect of biased person parameter estimation in adaptive testing with imprecise difficulty parameters, assume that the distribution of the true difficulty parameters in the item pool has its maximum around 0 and steadily fewer items toward the left and right tails, which is a reasonable assumption in most settings of adaptive testing. In an arbitrary step of the iterative test algorithm, let y be the current estimate of the person parameter, and assume that y is nonzero. Let U be a small interval containing y, so that an item is to be selected with estimated difficulty in U. Having assumed error variance in the (unbiased) estimation of basic parameters, U is likely to contain estimated difficulty parameters of items for which the true difficulty parameter is not in U. Let I₁ and I₂ denote two intervals of the same length d bordering on U to the left and right, where I₁ is the one more remote from 0. Let A₁ be the number of true difficulty parameters β in I₁ for which the error ξ of β is of such a magnitude that $β + ξ$ or $β - ξ$ is in U. Let A₂ denote the same number for I₂. For an easier illustration, assume that A₁, A₂, and the distribution of true item difficulties are continuous. Then for an infinitesimal d, A₁ and A₂ are proportional to the density of true difficulty parameters in I₁ and I₂, respectively, so $A_{1} < A_{2}$. As the density of difficulties is smaller at every point of I₁ than at any point of I₂, the inequality $A_{1} < A_{2}$ holds for arbitrarily large d. Assuming that the errors of the basic parameters are unbiased, the number of true difficulty parameters β in I₁, such that the corresponding estimate $β + ξ$ is in U, is approximately $A_{1} / 2$ and is hence less than $A_{2} / 2$, the approximate number of true β in I₂ such that the corresponding estimate $β + ξ$ is in U. Figure 1 illustrates this situation.

Figure 1.

Effect of deviations of true from estimated values of item difficulty parameters on the estimation of person parameters in adaptive testing

For a person with a positive true person parameter, the algorithm will hence select items with overestimated difficulty parameters more often than items with underestimated difficulty. These items are easier to solve for the person than their difficulty estimates suggest, so the person parameter is systematically overestimated. The same reasoning holds for persons with negative true person parameters, which are therefore systematically underestimated.
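
The selection effect can be checked with a small Monte Carlo sketch. The code below is an added illustration under stated assumptions, not the simulation reported later: the function name, the default pool size, the number of replications, and the seed are chosen here for convenience, true difficulties are drawn from a standard normal distribution, and errors are normal with mean zero. In each replication the item whose estimated difficulty is closest to a target value y is selected, and its error ξ (estimated minus true difficulty) is recorded.

```python
import random


def mean_selected_error(y, error_var, pool_size=500, reps=600, seed=1):
    """Average error xi of the item whose estimated difficulty is closest to y."""
    rng = random.Random(seed)
    sd = error_var ** 0.5
    total = 0.0
    for _ in range(reps):
        best_gap, best_err = float("inf"), 0.0
        for _ in range(pool_size):
            beta = rng.gauss(0.0, 1.0)   # true difficulty
            xi = rng.gauss(0.0, sd)      # unbiased estimation error
            gap = abs(beta + xi - y)     # distance of the estimate from y
            if gap < best_gap:
                best_gap, best_err = gap, xi
        total += best_err
    return total / reps


err_pos = mean_selected_error(1.0, 0.25)    # target above the pool mean
err_neg = mean_selected_error(-1.0, 0.25)   # target below the pool mean
err_zero = mean_selected_error(0.0, 0.25)   # target at the pool mean
```

Although every individual error has mean zero, the selected items should on average have overestimated difficulties for positive y and underestimated difficulties for negative y, with no systematic error at y = 0.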

An Asymptotic Formula of the Expected Bias

A formula is derived that describes, as a function of the current estimate, the approximate expected value of the error term ξ when choosing an item whose estimated difficulty is supposed to match this current estimate. To this end, the following assumptions are made:

Every item of the pool has a true difficulty parameter β that is estimated with an error term ξ; that is, the estimated difficulty of the item is $β + ξ$ .

The density of the distribution of the true item difficulty parameters in the pool is given by $f_{pool}$ .

The density of the distribution of the error terms is given by $f_{err}$ and the corresponding cumulative distribution function by $Φ_{err}$ .

β and ξ are independent.

Let y be the current estimate of the person parameter. For $ε > 0$ , the neighborhood $(y - (ε / 2), y + (ε / 2))$ of y is denoted as $U_{ε} (y)$ . The event of obtaining an item with estimated difficulty parameter in $U_{ε} (y)$ is given by

$$ B_{ε} := \{(β, ξ) : β + ξ \in U_{ε}(y)\}, \tag{1} $$

and the probability of this event is

$$ P(B_{ε}) = \int \left( Φ_{err}\left(y - β + \frac{ε}{2}\right) - Φ_{err}\left(y - β - \frac{ε}{2}\right) \right) f_{pool}(β) \, dβ . \tag{2} $$

Let

$$ B := \lim_{ε \to 0} B_{ε} = \{(β, ξ) : β + ξ = y\} . \tag{3} $$

For a given y, interest lies in $E (ξ | B)$ , the conditional expectation of the error term ξ given that the estimated value of the item equals y.

A density function of the conditional distribution of β given $B_{ε}$ is

$$ g_{ε}(β) := \frac{\left( Φ_{err}\left(y - β + \frac{ε}{2}\right) - Φ_{err}\left(y - β - \frac{ε}{2}\right) \right) f_{pool}(β)}{\int \left( Φ_{err}\left(y - β + \frac{ε}{2}\right) - Φ_{err}\left(y - β - \frac{ε}{2}\right) \right) f_{pool}(β) \, dβ}, \tag{4} $$

as for an arbitrary measurable $ζ \subseteq \mathbb{R}$

$$ P(ζ | B_{ε}) = \frac{P(ζ \land B_{ε})}{P(B_{ε})} = \frac{\int_{ζ} \left( Φ_{err}\left(y - β + \frac{ε}{2}\right) - Φ_{err}\left(y - β - \frac{ε}{2}\right) \right) f_{pool}(β) \, dβ}{\int \left( Φ_{err}\left(y - β + \frac{ε}{2}\right) - Φ_{err}\left(y - β - \frac{ε}{2}\right) \right) f_{pool}(β) \, dβ}, \tag{5} $$

holds. Multiplying the numerator and the denominator of the expression for the density $g_{ε} (β)$ by $1 / ε$ and taking the limit $ε \to 0$ yields a density function $g (β)$ of the conditional distribution of β given B:

$$ \lim_{ε \to 0} g_{ε}(β) = \frac{f_{err}(y - β) f_{pool}(β)}{\int f_{err}(y - β) f_{pool}(β) \, dβ} =: g(β) . \tag{6} $$

As $ξ = y - β$ for all $(β, ξ) \in B$ , the following is obtained:

$$ E(ξ | B) = \int (y - β) \, g(β) \, dβ . \tag{7} $$

Equation 7 describes the expected value of the error term ξ of an item difficulty parameter as a function of y (note that g is also dependent on y), which is the value of the difficulty parameter that the next item is intended to have and which is commonly the value of the current estimate of the person parameter.
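
As a worked special case (an added illustration, not part of the original derivation), suppose that both densities are normal: $f_{pool}$ is the density of $N(0, σ_{pool}^{2})$ and $f_{err}$ that of $N(0, σ_{err}^{2})$, where $σ_{pool}^{2}$ and $σ_{err}^{2}$ are notation introduced only for this example. The product of the two densities is, as a function of β, again proportional to a normal density, so that β given B is normally distributed with mean $y σ_{pool}^{2} / (σ_{pool}^{2} + σ_{err}^{2})$, and Equation 7 reduces to

$$ E(ξ | B) = y - E(β | B) = y \, \frac{σ_{err}^{2}}{σ_{pool}^{2} + σ_{err}^{2}} . $$

Under these assumptions the expected error is linear in y and has the same sign as y. With a standard normal pool and error variances of 0.1, 0.25, and 0.5, the slopes would be 1/11 ≈ 0.09, 1/5 = 0.20, and 1/3 ≈ 0.33, respectively.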

As the person parameter scale and the item difficulty scale coincide, the expected error term of the difficulty parameter for a given y can also be interpreted as the expected bias of the person parameter estimate when a person is only tested with items that are chosen so that their estimated difficulties equal y. In a real adaptive testing situation, y is supposed to change in each step of the iterative testing algorithm. However, it can be assumed that y will approximately fluctuate around the true value of the person parameter. Therefore, Equation 7 might nevertheless, for sufficiently large test lengths, give a good estimate of the expected bias of person parameter estimation when the true person parameter is y, as the expected bias is larger for values of y whose absolute values exceed the absolute value of the true person parameter and smaller for values of y whose absolute values are less than the absolute value of the true person parameter.

Figure 2 shows Equation 7 evaluated as a function of y under different assumptions of the true difficulty parameter distribution and the error term distribution. Increasing bias for growing error variances is clearly observable. Also, as expected, in the case of a skew normal distribution for the true item difficulties (compare Figure 3), the bias is larger when the gradient of this distribution is higher and lower when the gradient becomes smaller.
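
Equation 7 can also be evaluated numerically for arbitrary densities. The following Python sketch is an added illustration: the function names, the integration range, and the number of grid points are assumptions made here, and the midpoint rule is used only because it is simple. The error density is evaluated at ξ = y − β, and the normalizing integral of g is computed on the same grid.

```python
import math


def normal_pdf(x, mean=0.0, var=1.0):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def expected_error(y, f_pool, f_err, lo=-10.0, hi=10.0, steps=4000):
    """Evaluate Equation 7, E(xi | B), by midpoint-rule integration over beta."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        beta = lo + (k + 0.5) * h
        w = f_err(y - beta) * f_pool(beta)   # unnormalized g(beta)
        num += (y - beta) * w
        den += w
    return num / den


# Standard normal pool with three normal error variances, evaluated at y = 1.
results = {
    var: expected_error(1.0, normal_pdf, lambda x, v=var: normal_pdf(x, 0.0, v))
    for var in (0.1, 0.25, 0.5)
}
```

For a normal pool and normal errors the values can be checked against the standard bivariate normal result, in which the expected error equals y times the error variance divided by the sum of pool and error variance. Other pool densities, such as a skew normal, can be passed as `f_pool` in the same way.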

Figure 2.

Expected bias as a function of y, as calculated in Equation 7

Figure 3.

Standard normal distribution and skew normal distribution with shape = 8, scale ≈ 1.64, and location ≈ 1.30

Simulation Studies

Several simulation studies were conducted to study the effect of biased person parameter estimates caused by imprecise item parameters in different testing situations. These simulation studies investigated three models (Rasch, 2PL, and 3PL), four estimators (maximum likelihood [ML], weighted maximum likelihood [WML; Warm, 1989], expected a posteriori [EAP], and maximum a posteriori [MAP], the latter two with normal priors), three variances of normally distributed error terms for the estimated item difficulty parameters (0.1, 0.25, 0.5) in addition to a baseline condition without error (variance 0; see Table 1), four test lengths (15, 30, 50, 100), and two sizes of item pools (500, 1,000). Item selection was based on the criterion of maximal Fisher information.

For each testing situation, 1,000 replications of testing five person parameters (−2, −1, 0, 1, 2) were made. For each run of these replications, true item difficulty parameters were randomly sampled from a standard normal distribution. For the 2PL and 3PL models, discrimination parameters were sampled from a lognormal distribution with location 0 and scale 0.2. Guessing parameters for the 3PL model were sampled from a uniform distribution on $[0.01, 0.25]$ . Error terms for the difficulty parameters were, also in each replication, generated from a normal distribution with the corresponding variance and mean zero. All analyses were conducted in R (R Development Core Team, 2009) using the package catR (Magis & Raîche, 2011).¹
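
The structure of one simulation cell can be sketched as follows. This is a simplified Python re-implementation for illustration only and not the catR code used in the study: it covers only the Rasch model with ML estimation, starts every test at an estimate of 0, bounds the ML estimate to [−4, 4] so that all-correct and all-incorrect patterns stay finite, and uses far fewer replications by default. All function names, defaults, and the seed are assumptions made here.

```python
import math
import random


def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ml_estimate(responses, difficulties, lo=-4.0, hi=4.0):
    """ML estimate of theta, found by bisection on the score equation.

    The expected score is increasing in theta; patterns without a finite
    root converge to the lower or upper bound.
    """
    score = sum(responses)
    for _ in range(30):
        mid = 0.5 * (lo + hi)
        if sum(p_correct(mid, b) for b in difficulties) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_bias(theta, error_var, pool_size=500, length=30, reps=100, seed=1):
    """Mean of (estimate - theta) over reps adaptive tests with noisy difficulties."""
    rng = random.Random(seed)
    sd = math.sqrt(error_var)
    total = 0.0
    for _ in range(reps):
        true_b = [rng.gauss(0.0, 1.0) for _ in range(pool_size)]
        est_b = [b + rng.gauss(0.0, sd) for b in true_b]
        available = set(range(pool_size))
        used, responses, estimate = [], [], 0.0
        for _ in range(length):
            # Rasch maximum information: estimated difficulty closest to estimate.
            j = min(available, key=lambda i: abs(est_b[i] - estimate))
            available.remove(j)
            # The response depends on the TRUE difficulty ...
            responses.append(1 if rng.random() < p_correct(theta, true_b[j]) else 0)
            # ... but the estimator only sees the ESTIMATED difficulty.
            used.append(est_b[j])
            estimate = ml_estimate(responses, used)
        total += estimate - theta
    return total / reps
```

A call such as `simulate_bias(2.0, 0.5)` corresponds to one cell of the design (true parameter 2, error variance 0.5, pool size 500, test length 30). Because of the simplifications listed above, the resulting values should only be expected to agree with the reported results in sign and rough magnitude.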

Results

For each testing situation, the deviations of the estimated from the true person parameters were calculated. To obtain single numeric values representing the bias in person parameter estimation, an ordinary least squares (OLS) regression of the bias on the true $θ$ was calculated for each testing situation. Corresponding slopes are displayed in Table 1. Slope parameters that differ significantly from zero at the 5% level (95% confidence) are indicated in bold.
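
The slope used as a summary measure can be computed as in the following sketch. The bias values in the example are hypothetical numbers made up for illustration and are not results from the study; the function name is likewise an assumption.

```python
def ols_slope(x, y):
    """Slope of the ordinary least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


true_theta = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Hypothetical mean biases (estimate minus true value), for illustration only.
mean_bias = [-0.41, -0.19, 0.01, 0.22, 0.40]
slope = ols_slope(true_theta, mean_bias)
```

A positive slope indicates that estimates are pushed away from the mean (overestimation for positive and underestimation for negative person parameters), whereas a negative slope indicates shrinkage toward the mean.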

Table 1.

Slopes of Regressions of Mean Bias of $\hat{θ}$ on True θ

Rasch model							2PL model							3PL model
			N							N							N
$σ_{β}^{2}$	Pool	$\hat{θ}$	15	30	50	100	$σ_{β}^{2}$	Pool	$\hat{θ}$	15	30	50	100	$σ_{β}^{2}$	Pool	$\hat{θ}$	15	30	50	100
0.000	500	WML	−0.017	−0.003	0.008	0.002	0.000	500	WML	0.001	0.005	−0.003	0.001	0.000	500	WML	−0.009	−0.003	0.001	0.003
0.000	500	EAP	−0.225	−0.126	−0.075	−0.038	0.000	500	EAP	−0.131	−0.073	−0.053	−0.029	0.000	500	EAP	−0.159	−0.096	−0.065	−0.038
0.000	500	MAP	−0.239	−0.139	−0.078	−0.044	0.000	500	MAP	−0.142	−0.086	−0.057	−0.029	0.000	500	MAP	−0.182	−0.106	−0.069	−0.041
0.000	500	ML	−0.004	−0.002	0.005	0.005	0.000	500	ML	0.013	0.009	0.008	0.003	0.000	500	ML	0.024	0.014	0.007	0.005
0.000	1,000	WML	−0.008	−0.005	−0.004	−0.001	0.000	1,000	WML	−0.007	0.005	−0.001	0.001	0.000	1,000	WML	−0.013	0.005	0.002	−0.003
0.000	1,000	EAP	−0.234	−0.121	−0.077	−0.039	0.000	1,000	EAP	−0.123	−0.067	−0.040	−0.027	0.000	1,000	EAP	−0.148	−0.089	−0.054	−0.030
0.000	1,000	MAP	−0.239	−0.132	−0.077	−0.040	0.000	1,000	MAP	−0.135	−0.069	−0.048	−0.025	0.000	1,000	MAP	−0.158	−0.086	−0.062	−0.037
0.000	1,000	ML	0.001	−0.008	0.005	0.001	0.000	1,000	ML	0.009	0.009	0.006	0.003	0.000	1,000	ML	0.020	0.008	0.001	0.004
0.100	500	WML	0.070	0.079	0.075	0.067	0.100	500	WML	0.063	0.065	0.060	0.053	0.100	500	WML	0.063	0.060	0.067	0.056
0.100	500	EAP	−0.199	−0.067	−0.013	0.021	0.100	500	EAP	−0.090	−0.020	0.004	0.021	0.100	500	EAP	−0.131	−0.049	−0.008	0.014
0.100	500	MAP	−0.212	−0.084	−0.016	0.020	0.100	500	MAP	−0.102	−0.032	0.000	0.016	0.100	500	MAP	−0.136	−0.049	−0.022	0.006
0.100	500	ML	0.076	0.091	0.087	0.074	0.100	500	ML	0.086	0.083	0.066	0.057	0.100	500	ML	0.095	0.083	0.075	0.066
0.100	1,000	WML	0.075	0.087	0.084	0.079	0.100	1,000	WML	0.065	0.072	0.069	0.064	0.100	1,000	WML	0.056	0.069	0.071	0.067
0.100	1,000	EAP	−0.196	−0.070	−0.016	0.031	0.100	1,000	EAP	−0.078	−0.010	0.018	0.034	0.100	1,000	EAP	−0.104	−0.029	0.010	0.027
0.100	1,000	MAP	−0.205	−0.082	−0.021	0.036	0.100	1,000	MAP	−0.089	−0.023	0.008	0.033	0.100	1,000	MAP	−0.120	−0.036	−0.004	0.023
0.100	1,000	ML	0.089	0.085	0.088	0.085	0.100	1,000	ML	0.087	0.082	0.071	0.070	0.100	1,000	ML	0.103	0.087	0.077	0.071
0.250	500	WML	0.164	0.189	0.185	0.159	0.250	500	WML	0.159	0.160	0.157	0.138	0.250	500	WML	0.143	0.154	0.155	0.133
0.250	500	EAP	−0.165	−0.011	0.066	0.105	0.250	500	EAP	−0.026	0.048	0.080	0.093	0.250	500	EAP	−0.072	0.029	0.067	0.090
0.250	500	MAP	−0.171	−0.017	0.058	0.100	0.250	500	MAP	−0.045	0.044	0.075	0.091	0.250	500	MAP	−0.081	0.007	0.055	0.077
0.250	500	ML	0.207	0.204	0.191	0.166	0.250	500	ML	0.192	0.177	0.163	0.141	0.250	500	ML	0.192	0.189	0.167	0.145
0.250	1,000	WML	0.177	0.202	0.205	0.195	0.250	1,000	WML	0.168	0.178	0.174	0.154	0.250	1,000	WML	0.159	0.179	0.179	0.156
0.250	1,000	EAP	−0.162	−0.005	0.072	0.129	0.250	1,000	EAP	−0.014	0.077	0.098	0.120	0.250	1,000	EAP	−0.044	0.044	0.081	0.111
0.250	1,000	MAP	−0.168	−0.012	0.069	0.127	0.250	1,000	MAP	−0.021	0.058	0.096	0.115	0.250	1,000	MAP	−0.074	0.033	0.079	0.105
0.250	1,000	ML	0.202	0.213	0.211	0.196	0.250	1,000	ML	0.207	0.202	0.180	0.162	0.250	1,000	ML	0.203	0.199	0.180	0.166
0.500	500	WML	0.307	0.351	0.344	0.295	0.500	500	WML	0.289	0.311	0.294	0.261	0.500	500	WML	0.294	0.315	0.300	0.260
0.500	500	EAP	−0.109	0.085	0.180	0.226	0.500	500	EAP	0.052	0.161	0.201	0.210	0.500	500	EAP	−0.000	0.128	0.180	0.196
0.500	500	MAP	−0.128	0.078	0.179	0.222	0.500	500	MAP	0.036	0.150	0.194	0.204	0.500	500	MAP	−0.027	0.105	0.169	0.191
0.500	500	ML	0.357	0.373	0.356	0.299	0.500	500	ML	0.364	0.340	0.316	0.267	0.500	500	ML	0.370	0.345	0.316	0.261
0.500	1,000	WML	0.343	0.373	0.382	0.354	0.500	1,000	WML	0.320	0.333	0.330	0.299	0.500	1,000	WML	0.296	0.331	0.331	0.298
0.500	1,000	EAP	−0.110	0.090	0.195	0.271	0.500	1,000	EAP	0.070	0.184	0.231	0.253	0.500	1,000	EAP	0.016	0.150	0.214	0.244
0.500	1,000	MAP	−0.122	0.079	0.185	0.267	0.500	1,000	MAP	0.053	0.174	0.226	0.246	0.500	1,000	MAP	0.009	0.143	0.197	0.236
0.500	1,000	ML	0.408	0.408	0.397	0.362	0.500	1,000	ML	0.383	0.370	0.346	0.313	0.500	1,000	ML	0.396	0.372	0.347	0.311

Note: 2PL model = two-parameter logistic model; 3PL model = three-parameter logistic model; N = test length; $σ_{β}^{2}$ = error variance of the difficulty parameters; Pool = size of item pool; $\hat{θ}$ = estimator for the person parameter; WML = weighted maximum likelihood; EAP = expected a posteriori; MAP = maximum a posteriori; ML = maximum likelihood. Slopes are boldface if significantly different from zero at the 95% level.

To illustrate some of the results in more detail, Figures 4 to 7 show the deviations of the estimated from the true person parameters for some selected testing situations.

Figure 4.

First three panels: Deviations of the estimated from the true person parameters for the four different estimators and the three different models. Lower right panel: Bias resulting from Bayesian MAP estimators for all three models.

Figure 5.

Deviations of the estimated from the true person parameters for different error variances of difficulty parameters.

Figure 6.

Deviations of the estimated from the true person parameters for test length varying from 15 to 100.

Figure 7.

Deviations of the estimated from the true person parameters for item pools of size 500 and 1,000, respectively

Dependence on the Estimator

Table 1 shows that, if the test is sufficiently long, the effect occurs for all estimators. However, it is less distinct for the Bayesian estimators. Depending on the prior distribution and the test length, estimates obtained from MAP and EAP estimators are shifted toward the mean, which partially counteracts the bias induced by imprecise difficulty parameters. In Table 1, systematic bias toward the mean is indicated by negative slope parameters, which occur for the Bayesian estimators mainly in testing situations where the error variance of the difficulty parameters is zero or the test length is small.

Comparing the two likelihood and the two Bayesian estimators, respectively, the effect of biased person parameter estimation caused by imprecise item parameter estimates is marginally stronger for the ML than for the WML, and slightly more pronounced for the EAP than for the MAP estimator.

As an example, Figure 4 shows the dependence of the effect of biased person parameter estimation on the estimator for all three models for specific testing situations. As the WML and the ML, as well as the MAP and the EAP estimator, are very similar for all models, Figures 5 to 7 only show results for WML and MAP estimators.

Dependence on the Model

The dependence of the observed effect on the model is indicated in Table 1 and Figure 4. As the information functions of the 2PL and 3PL models depend on the difficulty and the discrimination parameters, and, in the case of the 3PL model, also on the guessing parameter, it is expected that the effect of biased person parameter estimation will be most noticeable for the Rasch model. This can indeed be observed in the case of ML estimation, although the differences between the three models are not very distinct. For Bayesian estimators, however, the effect is, for test lengths of 30 and 50 items, in most cases even more pronounced for the 2PL and 3PL models than for the Rasch model. The explanation for this apparently contradictory result is that the bias toward the mean, which is induced by the Bayesian estimators, is (at least in the setting described here) stronger for the Rasch model than for the 2PL or 3PL model, as indicated in the last panel of Figure 4. Highly discriminating items, which are commonly chosen by the adaptive testing algorithm in the case of 2PL and 3PL models, provide more information than do Rasch-scaled items; the prior distribution, which causes the shift toward the mean, is therefore less decisive here. However, as the influence of the prior decreases as test length increases, the effect of biased person parameter estimation is again strongest for the Rasch model for long tests ($\approx 100$ items), regardless of whether a Bayesian or an ML estimator is used.

Dependence on the Error Variance

Table 1 and Figure 5 indicate the dependence on the error variance of item difficulty parameters. As expected, the effect becomes more pronounced with increasing error variance. When there is no error variance, estimates seem to be unbiased in the case of likelihood estimation and slightly biased in the opposite direction due to a shift to the mean in the case of Bayesian estimation.

Dependence on the Test Length

At least for tests with fewer than 100 items, test length has a stronger effect on biased person parameter estimation for Bayesian than for likelihood estimators, as indicated in Table 1 and Figure 6. For a short test length of only 15 items, the bias toward the mean induced by the Bayesian estimators is in most cases even stronger than the reverse bias induced by imprecise item difficulty parameters. The dependence on the test length in the case of the Bayesian estimators is caused by the fact that for short tests, the prior assumption is quite decisive for the posterior distribution, whereas for long tests, the likelihood is the dominating factor. Therefore, the shift toward the mean, which is induced by using a Bayesian estimator, becomes smaller with increasing test length.
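
The test-length dependence of the shrinkage can be illustrated with a small sketch that compares ML and MAP estimates for the same proportion of correct responses at two test lengths. This is an added, non-adaptive simplification: all items are assumed to have the same difficulty of 1, the prior is assumed to be standard normal, and the function names and response patterns are hypothetical.

```python
import math


def p_correct(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def point_estimate(responses, difficulties, prior_var=None, lo=-6.0, hi=6.0):
    """ML estimate (prior_var=None) or MAP estimate with a N(0, prior_var) prior.

    Bisection on the derivative of the log-likelihood (or log-posterior),
    which is strictly decreasing in theta.
    """
    def gradient(theta):
        g = sum(x - p_correct(theta, b) for x, b in zip(responses, difficulties))
        if prior_var is not None:
            g -= theta / prior_var
        return g

    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gradient(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def pattern(n_correct, n_items, b=1.0):
    """Responses and difficulties for n_items items of equal difficulty b."""
    return [1] * n_correct + [0] * (n_items - n_correct), [b] * n_items


short_x, short_b = pattern(12, 15)    # 80% correct on 15 items
long_x, long_b = pattern(80, 100)     # 80% correct on 100 items

ml_short = point_estimate(short_x, short_b)
map_short = point_estimate(short_x, short_b, prior_var=1.0)
map_long = point_estimate(long_x, long_b, prior_var=1.0)
```

In this setting the ML estimate is the same for both test lengths, whereas the MAP estimate lies closer to the prior mean for the short test than for the long one, which mirrors the decreasing influence of the prior described above.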

Dependence on the Size of the Item Pool

Table 1 and Figure 7 show that the effect of biased person parameter estimation caused by imprecise item difficulty parameters is slightly stronger for an item pool of size 1,000 than of size 500, at least for the more extreme values of theta. This is not surprising, as tests become more “adaptive” the larger the item pool is. This result is independent of the estimator and the model.

Discussion

In adaptive testing, deviations of the true from the estimated difficulty parameters can lead to systematic overestimation for positive person parameters and to systematic underestimation for negative person parameters, even when the errors of the item parameters are unbiased. The given theoretical explanation for this phenomenon is very general and (as long as the difficulty parameter is decisive in the item selection process) can therefore be assumed to hold for many situations in adaptive testing.

Decisive Factors

The actual influence of the described effect on person parameter estimation depends mainly on the following factors: (a) the exactness of item calibration, (b) the amount of person- and/or group-specific differences in item difficulty parameters, (c) the presence of local item dependence (testlets), (d) the possible use of rule-based item generation, (e) other possible sources of variation in item difficulty parameters, (f) the distribution of the true item difficulty parameters in the pool, (g) the distribution of the errors of the item difficulty parameters, (h) the estimator, (i) the model, (j) the test length, (k) the size of the item pool, (l) the item selection criterion, and (m) additional constraints in the item selection procedure.

At least to a certain degree, calibration errors and person- or group-specific differences in item difficulty parameters can be assumed to be present in most real testing situations. The impact of the latter seems especially important for models with comparatively few item parameters. An example of such a model is the linear logistic test model (LLTM; Fischer, 1973), which comprises only a certain number K of basic parameters that determine the difficulty of $2^{K}$ different items. This model is rather interesting for applications in adaptive testing, as it enables the generation of a large item pool with comparatively little calibration effort and is also well suited for an easy implementation of rule-based item generation (Holling et al., 2009). Table A1 in the Appendix shows results similar to Table 1 for the LLTM.
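
How K basic parameters can determine $2^{K}$ item difficulties is sketched below. The sketch assumes a binary design in which each basic parameter is either involved in an item or not, so that the difficulty is the sum of the involved basic parameters; the usual normalization constant is omitted, and the function name and the parameter values are hypothetical.

```python
from itertools import product


def lltm_difficulties(basic_params):
    """Map every binary design vector q to the difficulty sum_k q_k * eta_k."""
    return {
        q: sum(qk * eta for qk, eta in zip(q, basic_params))
        for q in product((0, 1), repeat=len(basic_params))
    }


# Hypothetical basic parameters for K = 3 cognitive operations.
eta = [0.5, -0.3, 1.2]
difficulties = lltm_difficulties(eta)
```

With K = 3 basic parameters this yields 8 items, so only three parameters have to be calibrated. Any person- or item-specific deviation from the additive rule then appears as an error in the implied item difficulty.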

However, it is a known problem of the LLTM that it often does not fit real data sufficiently, even when most of the variation in the item difficulties can be explained (Rijmen & De Boeck, 2002). For this reason, some authors have already suggested modeling some or all of the basic parameters as person-specific random effects (Rijmen & De Boeck, 2002) or introducing an additive item-specific random parameter that contributes to the difficulty parameter (De Boeck, 2008). Rijmen and De Boeck (2002), analyzing data from a test of deductive reasoning, report that some of the variances associated with random basic parameters were estimated to be rather large. Therefore, it seems reasonable that, in the case of the LLTM, most of the error variance of estimated item difficulty parameters is due to individual differences in item difficulties rather than to calibration errors.

Generally, when the model is comparatively simple with only a few item parameters, the calibration is expected to be more exact, but the effects of the neglect of individual differences are often larger. However, when the model is more complex and comprises more item parameters, calibration errors might be more predominant than the deviations due to the approximation by the model.

Testlets and rule-based item generation are especially useful in adaptive testing. However, the possible side effect of additional error variation of the difficulty parameters can be particularly severe in precisely this testing format.

When modeling testlet effects with additional random effects that are added to item difficulty parameters (cf. Wainer et al., 2007), the error variances associated with deviations of true from estimated (mean) difficulty parameters due to these testlet effects can be directly assessed. However, error variances can only be specified for particular testing situations, so that general statements about the magnitudes of these variances are limited. Yet, in this context, it might be interesting to note that Wainer et al. (2007) report various examples of testlet applications where these variances are often quite large.

Concerning rule-based item generation, there are no studies so far that investigate the magnitudes of possible error variances of item difficulty parameters within families of cloned items. Like the variances arising from person-specific differences or testlets, these variances are assumed to be strongly dependent on content-related aspects and can probably be best assessed by random effects modeling.

The distribution of the difficulty parameters in the item pool plays an important role in explaining the observed effect. Generally, as indicated in Figure 2, the bias is expected to be larger for person parameters for which the gradient of this distribution is high (note that the person and difficulty parameter scales coincide). To prevent systematic bias in person parameter estimation caused by imprecise item parameters, a uniform distribution over the whole range of person parameters would be necessary.
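The role of the gradient can be illustrated with a small simulation. The following Python sketch is not taken from the study. It assumes that estimated difficulties equal true difficulties plus independent normal errors with variance 0.5; the pool sizes, the target value 1.5, and all function names are hypothetical. It compares, for a standard normal and a uniform pool, the mean true difficulty of those items whose estimated difficulty lies close to the target value.

```python
import numpy as np

def mean_true_given_estimated(true_b, error_var, target, half_width=0.1, seed=0):
    """Mean true difficulty of items whose estimated difficulty is near target.

    Assumption (not from the article): estimated = true + N(0, error_var).
    """
    rng = np.random.default_rng(seed)
    est_b = true_b + rng.normal(0.0, np.sqrt(error_var), size=true_b.shape)
    selected = np.abs(est_b - target) < half_width
    return true_b[selected].mean()

rng = np.random.default_rng(1)
n_items = 400_000  # deliberately huge hypothetical pool to keep sampling noise small

normal_pool = rng.normal(0.0, 1.0, n_items)     # density has a nonzero gradient at 1.5
uniform_pool = rng.uniform(-4.0, 4.0, n_items)  # density is flat at 1.5

target, error_var = 1.5, 0.5
m_normal = mean_true_given_estimated(normal_pool, error_var, target)
m_uniform = mean_true_given_estimated(uniform_pool, error_var, target)

print(f"normal pool:  mean true difficulty = {m_normal:.3f}")
print(f"uniform pool: mean true difficulty = {m_uniform:.3f}")
```

Under these assumptions, the normal pool should yield a mean true difficulty near 1.5/(1 + 0.5) = 1.0, so that items selected for an examinee located at 1.5 are on average easier than their estimates suggest, whereas the uniform pool should yield a value near 1.5.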

The choice of the estimator can be quite decisive for the bias in person parameter estimation. As Bayesian estimators bias the estimates toward the mean, the effect observed here is generally stronger for likelihood estimators. Especially in the case of short tests, the bias caused by imprecise item difficulty parameters is often balanced by the bias induced through the prior distribution, and in some cases the latter is still predominant. However, the influence of the prior distribution on the posterior, and hence the bias toward the mean, decreases with increasing test length. Therefore, the choice of the estimator becomes less important for longer tests.
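The decreasing influence of the prior can be made concrete with a rough normal approximation. The sketch below is an illustration only and not the computation used in the study: it assumes a standard normal prior, a Rasch model, and items that match the person parameter perfectly, so that each item contributes the maximal information of 0.25. The function name and the chosen test lengths are hypothetical.

```python
def prior_weight(n_items, prior_var=1.0, item_info=0.25):
    """Approximate share of the posterior precision that stems from the prior.

    Normal approximation: posterior precision = prior precision + test information.
    """
    prior_precision = 1.0 / prior_var
    test_information = n_items * item_info
    return prior_precision / (prior_precision + test_information)

# Prior weight for several hypothetical test lengths
weights = {n: prior_weight(n) for n in (5, 10, 25, 50)}
for n, w in weights.items():
    print(f"{n:2d} items: prior weight = {w:.3f}")
```

Within this approximation, the posterior mode is roughly a weighted average of the prior mean and the likelihood-based estimate, so the shrinkage toward the mean falls from about 44% for 5 items to about 7% for 50 items.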

In the case of Rasch models, the item selection algorithm based on maximal Fisher information chooses the next item in such a way that an optimal match is obtained between the difficulty parameter and the current estimate of the person parameter. For models with more item parameters, the criterion of maximal Fisher information is generally also influenced by other parameters. However, in this study, the simulation for the 2PL and 3PL models shows that the bias is very similar to the one obtained under the Rasch model. In the case of Bayesian estimators, the bias is often even slightly stronger for the 2PL and 3PL models, as the bias toward the mean is marginally more distinct for the Rasch model.
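The selection rule described above can be sketched with the standard item information function of the 3PL model, which reduces to the 2PL for a guessing parameter of zero and to the Rasch model if, in addition, the discrimination equals one. The code is a minimal illustration, not the implementation used in the simulations; the pool, the discrimination values, and the function names are hypothetical.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Fisher information of 3PL items at theta (Rasch: a = 1, c = 0)."""
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta, a, b, c, administered=()):
    """Index of the not yet administered item with maximal information."""
    info = np.array(item_information(theta, a, b, c), dtype=float, ndmin=1)
    for i in administered:
        info[i] = -np.inf  # exclude items that were already presented
    return int(np.argmax(info))

b = np.linspace(-3.0, 3.0, 13)  # hypothetical difficulties, spacing 0.5
theta_hat = 0.8                 # hypothetical current ability estimate

# Rasch model: the item whose difficulty is closest to theta_hat is chosen
rasch_choice = select_item(theta_hat, 1.0, b, 0.0)

# 2PL model: a highly discriminating item can win although it matches less well
a = np.ones_like(b)
a[7] = 2.0                      # item with b = 0.5
twopl_choice = select_item(theta_hat, a, b, 0.0)

print(rasch_choice, b[rasch_choice])
print(twopl_choice, b[twopl_choice])
```

In this example the Rasch criterion picks the item with difficulty 1.0, whereas under the 2PL the item with difficulty 0.5 and discrimination 2 is more informative, which shows how parameters other than the difficulty enter the selection.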

The dependence of the bias in person parameter estimation in adaptive testing on the test length is more distinct for Bayesian estimators, as the induced bias toward the mean, which counteracts the bias caused by imprecise item parameters, is test-length dependent. Larger item pools slightly intensify the bias, as corresponding tests are “more adaptive.”

In this study, the author examined only the item selection criterion of maximal Fisher information. However, other criteria also depend on the difficulty parameter, though sometimes more indirectly. For example, in a Bayesian context, it is common to weight a measure of information (which commonly depends on the difficulty parameter) with the posterior distribution and to maximize the corresponding integral over a certain interval (van der Linden, 2010). With respect to the discrimination parameter, using a Bayesian criterion can indeed lead to a substantially different choice of items, as the information function becomes steeper and more focused when the discrimination parameter increases. However, in the case of the difficulty parameter, there is no comparable phenomenon that might suggest the use of Bayesian criteria, and therefore, the choice of the actual criterion is not expected to make much difference.
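The argument for the difficulty parameter can be checked numerically for the Rasch model. The following sketch approximates a posterior-weighted information criterion on a grid and compares its choice with that of maximal Fisher information at the posterior center. It is an illustration under stated assumptions, not part of the reported simulations: the posterior is taken to be normal with hypothetical mean 0.7 and standard deviation 0.4, and the pool and function names are likewise hypothetical.

```python
import numpy as np

def rasch_information(theta, b):
    """Fisher information of Rasch items with difficulty b at theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def posterior_weighted_information(b_pool, grid, posterior):
    """Grid approximation of the integral of information times posterior."""
    weights = posterior / posterior.sum()
    info = rasch_information(grid[:, None], b_pool[None, :])  # shape (grid, items)
    return weights @ info

grid = np.linspace(-6.0, 6.0, 601)
post_mean, post_sd = 0.7, 0.4   # hypothetical posterior of the person parameter
posterior = np.exp(-0.5 * ((grid - post_mean) / post_sd) ** 2)

b_pool = np.linspace(-3.0, 3.0, 25)  # hypothetical difficulties, spacing 0.25

bayes_choice = int(np.argmax(posterior_weighted_information(b_pool, grid, posterior)))
fisher_choice = int(np.argmax(rasch_information(post_mean, b_pool)))

print(bayes_choice, fisher_choice, b_pool[bayes_choice])
```

With a symmetric, unimodal posterior, both criteria select the item whose difficulty is closest to the posterior center (here 0.75), which is in line with the expectation that the criterion matters little as far as the difficulty parameter is concerned.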

Dealing With the Resulting Bias

From a practical point of view, an important question is how to deal with the problem of possibly biased person parameter estimation in adaptive testing. The left panel of Figure 2 as well as the results of the simulation studies indicate that in many cases the person parameter estimates might be rescaled sufficiently well by a multiplicative factor. This can be assumed to hold when the distribution of true item difficulty parameters in the pool is approximately “well behaved,” that is, symmetric, centered around the origin, and with a moderate gradient. The item pools used in the preceding simulation studies correspond to this situation, and the correlations of estimated and true values of person parameters, even with error variances of 0.5, are indeed not much different here from those obtained under precise item parameters (for more details, compare Table A2 in the Appendix). Therefore, if the only purpose of testing is to decide on the relative standing of examinees, the observed bias might be negligible.
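Why a multiplicative bias leaves the relative standing untouched can be shown with a toy example. The sketch below does not reproduce the simulations of this study: the outward stretch of 1.15 and the noise level are purely hypothetical, and the rescaling factor is obtained from a regression through the origin, which presupposes a simulation setting in which the true person parameters are known.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = rng.normal(0.0, 1.0, 2000)

# Hypothetical outward bias: estimates stretched by 1.15 plus estimation noise
theta_hat = 1.15 * theta_true + rng.normal(0.0, 0.3, 2000)

def rescaling_factor(theta_hat, theta_true):
    """Inverse of the slope (through the origin) of estimates on true values."""
    slope = float(theta_true @ theta_hat / (theta_true @ theta_true))
    return 1.0 / slope

k = rescaling_factor(theta_hat, theta_true)
theta_rescaled = k * theta_hat

# The correlation is invariant under multiplication by a positive constant
r_before = np.corrcoef(theta_hat, theta_true)[0, 1]
r_after = np.corrcoef(theta_rescaled, theta_true)[0, 1]

# Mean error for able examinees before and after rescaling
high = theta_true > 1.0
bias_before = float(np.mean(theta_hat[high] - theta_true[high]))
bias_after = float(np.mean(theta_rescaled[high] - theta_true[high]))

print(f"k = {k:.3f}, r before = {r_before:.3f}, r after = {r_after:.3f}")
print(f"bias before = {bias_before:.3f}, bias after = {bias_after:.3f}")
```

The correlation with the true values is identical before and after rescaling, while the systematic overestimation of able examinees largely disappears. Such a correction presupposes that the bias is close to proportional, which need not hold for less regular item pools.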

However, in most cases the aim of adaptive testing is the assessment of individual persons based on a predefined scale that is obtained during the calibration process. In this case, a sufficiently exact rescaling of the person parameter scale might be necessary, as otherwise person parameter estimation can be systematically biased.

Furthermore, as the right panel of Figure 2 shows, the situation can be much more difficult for more complicated distributions of item difficulties. In such cases, the assumption of a simple correction factor might not be realistic, and therefore, relative decision making as well as more precise individual assessment might both be affected.

Another approach to deal with possible bias in adaptive testing is to try to reduce its underlying causes. In the context of the problem of capitalization on chance, van der Linden and Glas (2000) point out some ways of reducing calibration errors in adaptive testing, most of which are probably also useful here. However, a further strategy these authors suggest is to use Rasch models rather than models with more parameters, as capitalization on chance in adaptive testing is caused by imprecisely calibrated discrimination parameters. Ironically, to avoid the phenomenon observed in the current study, one might, at least when likelihood estimators are used or when the test is sufficiently long, recommend just the opposite strategy: In Rasch models, the dependence of the Fisher information on the difficulty parameter is more pronounced than in models with further item parameters; hence, the bias in person parameter estimation for Rasch models is in many cases expected to be larger.

Errors in item calibration can be controlled at least to a certain extent; however, it is much more difficult to deal with deviations arising from person-specific differences in difficulty parameters, testlets, or rule-based item generation. Using random effects models would provide a straightforward way of directly modeling the error variance arising from these sources. However, random effects models are most useful when the objective is to estimate population characteristics. In the case of adaptive testing, where the focus is on the measurement of individual persons, they do not make a big difference, as for an individual examinee and a given item the values of corresponding random effects would still be unknown, so that the means would continue to be the best approximations. Hence, with respect to the difficulty parameter, the item selection algorithm would still pick the same items as under a standard fixed effects model.

Further research could aim to quantify the magnitudes of error variation in item difficulty parameters in diverse practical applications by using random effects models. Based on the obtained estimates of error variances, the effect of biased person parameter estimation could then be evaluated for various IRT models, estimators, and item pools, to investigate how serious its consequences are for the actual practice of adaptive testing.

Appendix

Table A2.

Correlations of Estimated and True Person Parameters

$σ_{β}^{2}$	$\hat{θ}$	Model	R
0.0	MAP	Rasch	.957
0.0	MAP	2PL	.972
0.0	MAP	3PL	.970
0.0	WML	Rasch	.956
0.0	WML	2PL	.971
0.0	WML	3PL	.971
0.1	MAP	Rasch	.959
0.1	MAP	2PL	.971
0.1	MAP	3PL	.970
0.1	WML	Rasch	.957
0.1	WML	2PL	.970
0.1	WML	3PL	.970
0.5	MAP	Rasch	.954
0.5	MAP	2PL	.972
0.5	MAP	3PL	.965
0.5	WML	Rasch	.944
0.5	WML	2PL	.970
0.5	WML	3PL	.964

Note: $σ_{β}^{2}$ = error variance of the difficulty parameters; $\hat{θ}$ = estimator of the person parameter; MAP = maximum a posteriori; 2PL model = two-parameter logistic model; 3PL model = three-parameter logistic model; WML = weighted maximum likelihood. Each correlation was calculated from a simulation (1,000 items in the pool; 50 items in each adaptive test; and 2,000 normally distributed person parameters).

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

The author received no financial support for the research, authorship, and/or publication of this article.

References

Bradlow, E. T., Wainer, & Wang (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Camilli & Shepard (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

De Boeck (2008). Random item IRT models. Psychometrika, 73, 533-559.

Fischer (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Geerlings, Glas, C. A. W., & van der Linden, W. J. (2011). Modeling rule-based item generation. Psychometrika, 76, 337-359.

Holland, P. W., & Wainer (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

Holling, Bertling, J. P., & Zeuch (2009). Automatic item generation of probability word problems. Studies in Educational Evaluation, 35, 71-76.

E. H. (2010). Interpretation of the three-parameter testlet response model and information function. Applied Psychological Measurement, 34, 467-482.

Magis & Raîche (2011). catR: An R package for computerized adaptive testing. Applied Psychological Measurement, 35, 576-577.

Osterlind & Everson (2009). Differential item functioning. Newbury Park, CA: Sage.

R Development Core Team. (2009). R: A language and environment for statistical computing. ISBN 3-900051-07-0. Vienna, Austria: R Foundation for Statistical Computing. Available from http://www.R-project.org

Rijmen & De Boeck (2002). The random weights linear logistic test model. Applied Psychological Measurement, 26, 271-285.

Scott (2002). Empirical Bayes and item-clustering effects in a latent variable hierarchical model. Journal of the American Statistical Association, 97, 409-419.

Sireci, Thissen, & Wainer (1991). On the reliability of testlet-based tests. Journal of Educational Measurement, 28, 237-247.

van der Linden, W. J. (2010). Elements of adaptive testing. New York, NY: Springer.

van der Linden, W. J., & Glas, C. A. W. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13, 35-53.

Wainer, Bradlow, E. T., & Wang (2007). Testlet response theory and its applications. Cambridge, England: Cambridge University Press.

Wainer & Kiely (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.

Wang & Wilson (2005). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29, 296-318.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response models. Psychometrika, 54, 427-450.

Zumbo (2007). Three generations of DIF analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, 4, 223-233.