Online Calibration of a Joint Model of Item Responses and Response Times in Computerized Adaptive Testing

Abstract

With the widespread use of computers in modern assessment, online calibration has become increasingly popular as a way of replenishing an item pool. The present study discusses online calibration strategies for a joint model of responses and response times. The study proposes likelihood inference methods for item parameter estimation and evaluates their performance along with optimal sampling procedures. An extensive simulation study indicates that the proposed online calibration strategies perform well with relatively small samples (e.g., 500∼800 examinees). The analysis of estimated parameters suggests that response time information can be used to improve the recovery of the response model parameters. Among a number of sampling methods investigated, A-optimal sampling was found most advantageous when the item parameters were weakly correlated. When the parameters were strongly correlated, D-optimal sampling tended to achieve the most accurate parameter recovery. The study provides guidelines for deciding sampling design under a specific goal of online calibration given the characteristics of field-testing items.

Keywords

online calibration, item response theory, response time, optimal sampling, computerized adaptive testing

1. Introduction

Many testing programs nowadays make use of large item pools to create multiple parallel test forms or to continuously administer tests over a targeted population. For example, computerized adaptive testing (CAT) requires an extensive item pool to accommodate a variety of content areas and cognitive attributes. When an item pool is used continuously over a span of time, however, some items become obsolete, overexposed, or dysfunctional. These items need to be replaced by new items in a timely manner in order to preserve the quality of the item pool and of the prospective tests.

There have been broadly two approaches for replenishing an item pool. The conventional method is to embed new items in operational assessment and calibrate them together after field-testing is completed. This process is followed by a linking step that places the estimated parameter values on the scale of operational item parameters. The alternative method is online calibration, which calibrates items in real time during operational assessment. Online calibration is particularly favored when an assessment is performed using computers. With the aid of computers, it can achieve higher efficiency in replenishing an item pool as well as greater flexibility in designing a field-testing plan.

The current literature on online calibration is mostly centered on estimation methods (e.g., Ban, Hanson, Wang, Yi, & Harris, 2001; Ban, Hanson, Yi, & Harris, 2002; Segall, 2003) or sampling procedures (e.g., Berger, 1991, 1992, 1994; Buyske, 2005; Chang & Lu, 2010; Jones & Jin, 1994; Ren, van der Linden, & Diao, 2017; Stocking, 1988; van der Linden, 1988; van der Linden & Ren, 2014). More recently presented studies have considered contemporary measurement models such as multidimensional item response models and cognitive diagnostic models (e.g., P. Chen, 2017; P. Chen & Wang, 2016; P. Chen, Wang, Xin, & Chang, 2017; P. Chen, Xin, Wang, & Chang, 2012; Zheng, 2016).

Notice that the existing studies drew mainly on response data and paid no attention to the time data that are readily available in any computerized test. Prior research suggests that response times can contain useful information about examinees’ cognitive processes and item characteristics (e.g., Klein Entink, Kuhn, Hornke, & Fox, 2009). Furthermore, they can help improve the precision of estimation of the response model parameters (van der Linden, Klein Entink, & Fox, 2010). Beyond providing collateral information for analyzing response data, response times have become increasingly popular in today’s testing and are used in a wide range of psychometric applications. For instance, response time information has been used in assembling tests (van der Linden, 2011), selecting items in CAT (Fan, Wang, Chang, & Douglas, 2012; van der Linden, 2008), detecting aberrant response behaviors (Fox & Marianti, 2017; Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014; van der Linden & Guo, 2008; van der Linden & van Krimpen-Stoop, 2003), and controlling test administration time (van der Linden, 2009; van der Linden, Scrams, & Schnipke, 1999), to name a few. Clearly, it is of critical importance to procure accurate parameter estimates of the response-time models in these applications in order to make informed decisions and appropriate inferences.

The purpose of this article is to present and evaluate online calibration strategies for a joint model of responses and response times. The procedures are developed within the hierarchical framework (van der Linden, 2007) that has been widely used in the literature and in practice. In this study, we propose efficient estimators for the hierarchical framework that can be applied to online calibration settings. Note that there exist estimation routines developed for the hierarchical framework (e.g., Fox, Klein Entink, & van der Linden, 2007; Klein Entink, Fox, & van der Linden, 2009; van der Linden, 2006, 2007). These procedures, however, are based on Markov chain Monte Carlo algorithms and are hardly viable in online calibration because of their computational intensity. The procedures developed in this study, by contrast, are grounded in likelihood inference and can be applied to computationally demanding situations such as online calibration and large-scale data analysis.

The present approach is in a similar vein to Glas and van der Linden’s (2010) approach, which applied marginal likelihood inference to evaluate abnormality in items (e.g., differential item functioning, local dependence). While the preceding study calibrated items under restrained conditions, the current study estimates all item parameters freely and examines the performance of the estimators under varying design factors. Specifically, the study considers two estimators to make likelihood inference within the hierarchical framework. The first draws inference from the marginal likelihood of the item parameters. The second makes inference from a posterior probability distribution, incorporating prior information about the item parameters. The study implements the two procedures using the expectation–maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) based on explicit mathematical expressions. In addition to proposing the estimators, the study suggests and evaluates several optimal sampling strategies in the interest of calibration efficiency.

Throughout the study, we adopt CAT as a primary testing mode to implement online calibration. CAT tailors administration of items to an individual examinee and thus accomplishes greater efficiency in estimating the examinee’s proficiency level. In addition, since CAT assigns items one after another, response time data tend to be less perturbed by examinees’ deliberate test-taking strategies such as glimpses, item review, and response revision or detention. Note that the employment of CAT as a main testing platform merely adds complexity in estimation of the person parameters and does not impair generalizability of the online calibration methods. The procedures presented here can be generalized to other types of tests (e.g., linear test, multistage testing) so long as an item pool is continuously replenished with new sets of items.

In what follows, we present a brief description of the hierarchical framework and introduce assumptions needed for implementing the online calibration. The inference methods are then discussed along with optimal sampling strategies. The subsequent sections present a simulation study demonstrating the performance of the suggested online calibration procedures. Finally, the article concludes with a discussion of the findings and limitations of the current study, and some future directions.

2. Hierarchical Framework and Assumptions

The present section outlines the hierarchical framework and introduces assumptions to set the stage for online calibration. The hierarchical framework consists of two levels. The first level defines measurement models for each source of observable data. The second level relates the first-level parameters by modeling their joint relations. In this study, we consider the three-parameter logistic (3PL) model (Birnbaum, 1968) and the lognormal model (van der Linden, 2006) for modeling the response and response time data, respectively.

2.1. Hierarchical Framework

The 3PL model defines an item response function as

P_{j} (θ_{i}) = P (U_{i j} = 1 | θ_{i}; a_{j}, b_{j}, c_{j}) = c_{j} + \frac{1 - c_{j}}{1 + exp (- D a_{j} (θ_{i} - b_{j}))},

where $P_{j} (θ_{i})$ denotes the probability that an examinee i with ability $θ_{i}$ responds to an item j with a correct answer $U_{i j} = 1$ ; $a_{j}$ $(\in ℝ^{+})$ , $b_{j}$ $(\in ℝ)$ , and $c_{j}$ $(\in [0, 1))$ are item j’s discrimination, difficulty, and lower-asymptote parameters, respectively; and $D = 1.702$ is a scaling constant that brings the logistic function close to the normal ogive model. For notational convenience, we denote $P_{j} (θ_{i})$ and $Q_{j} (θ_{i}) = 1 - P_{j} (θ_{i})$ as $P_{i j}$ and $Q_{i j}$ , respectively, hereafter.
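As a minimal illustration, the item response function above can be evaluated in a few lines of Python. The function name `p_3pl` and the example parameter values are illustrative assumptions and are not taken from the study.

```python
import math

D = 1.702  # scaling constant given in the text


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: a = 1.0, b = 0.0, c = 0.2
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, 1.0, 0.0, 0.2), 4))
```

At $θ = b$ the probability equals $c + (1 - c) / 2$, which is 0.6 for this hypothetical item, and it approaches the lower asymptote c as θ decreases.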

The lognormal model assumes that the response time for examinee i on item j, $T_{i j}$ , follows a lognormal distribution,

f (T_{i j} = t_{i j} | τ_{i}; α_{j}, β_{j}) = \frac{α_{j}}{t_{i j} \sqrt{2 π}} exp (- \frac{α_{j}^{2}}{2} {(log t_{i j} - (β_{j} - τ_{i}))}^{2}),

where $t_{i j}$ is a realized value for $T_{i j}$ $(\in [0, \infty))$ ; $τ_{i}$ is the latent speed at which examinee i performs on the test; and $α_{j}$ $(\in ℝ^{+})$ and $β_{j}$ $(\in ℝ)$ are item j’s time-discriminating and time-intensity parameters, respectively. Note that, under the lognormal model, the log-transformed response time, $log T_{i j}$ , becomes a normal variate with the mean and variance,

E [log T_{i j}] = β_{j} - τ_{i} and Var [log T_{i j}] = α_{j}^{- 2} .

These properties are used recurrently when deriving the estimators.
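The two moments of the log response time can be checked numerically. The sketch below evaluates the lognormal density and simulates log response times; the function names, the seed, and the parameter values are hypothetical choices made only for illustration.

```python
import math
import random


def lognormal_rt_pdf(t, tau, alpha, beta):
    """Density of a response time t under the lognormal model."""
    z = alpha * (math.log(t) - (beta - tau))
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)


def simulate_log_rt(tau, alpha, beta, n, seed=0):
    """Draw n log response times: log T ~ N(beta - tau, alpha**-2)."""
    rng = random.Random(seed)
    return [rng.gauss(beta - tau, 1.0 / alpha) for _ in range(n)]


# Hypothetical values: tau = 0.5, alpha = 2.0, beta = 1.0
log_t = simulate_log_rt(0.5, 2.0, 1.0, n=100_000)
mean = sum(log_t) / len(log_t)
var = sum((x - mean) ** 2 for x in log_t) / len(log_t)
print(round(mean, 3), round(var, 3))  # should be near 0.5 and 0.25
```

With these values the theoretical mean is $β_{j} - τ_{i} = 0.5$ and the theoretical variance is $α_{j}^{- 2} = 0.25$, so a larger $α_{j}$ implies less dispersed log response times.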

The second level of the hierarchical framework presents the population and item domains to model the joint relations among the first-level parameters. The population domain assumes that examinees’ trait parameters are independent samples from a bivariate normal distribution,

{(θ_{i}, τ_{i})}^{Τ} \sim N_{2} (μ_{P}, Σ_{P}),

with a mean vector $μ_{P} = (μ_{θ}, μ_{τ})^{Τ}$ and a variance–covariance matrix $Σ_{P} = (\begin{matrix} σ_{θ}^{2} & σ_{θ τ} \\ σ_{θ τ} & σ_{τ}^{2} \end{matrix})$ . The item domain models the association among the item parameters using a multivariate normal distribution,

ξ_{j} = (a_{j}, b_{j}, c_{j}, α_{j}, β_{j})^{Τ} \sim N_{5} (μ_{I}, Σ_{I}),

with a mean vector $μ_{I} = (μ_{a}, μ_{b}, μ_{c}, μ_{α}, μ_{β})^{Τ}$ and a variance–covariance matrix containing all variances and pairwise covariances between the item parameters.

The item domain cannot be used in its current form because some of the item parameters are bounded (i.e., $a_{j} \in ℝ^{+}$ , $c_{j} \in [0, 1)$ , and $α_{j} \in ℝ^{+}$ ). To place the item parameters on the proper domains, the current study applies the log, logit, and log transformations to $a_{j}$ , $c_{j}$ , and $α_{j}$ , respectively. The item domain for the transformed parameters is then written as

ξ_{j}^{*} = (a_{j}^{*}, b_{j}, c_{j}^{*}, α_{j}^{*}, β_{j})^{Τ} = (log a_{j}, b_{j}, logit c_{j}, log α_{j}, β_{j})^{Τ} \sim N_{5} (μ_{I}^{*}, Σ_{I}^{*}),

where $μ_{I}^{*}$ and $Σ_{I}^{*}$ are the mean vector and variance–covariance matrix of the transformed item parameters. The hyperparameters on the original scale are obtained by taking the inverse transformations. For example, let $μ_{a *}$ and $σ_{a *}^{2}$ denote the mean and variance of $a^{*} = log a$ . The mean and variance of a (i.e., $μ_{a}$ and $σ_{a}^{2}$ ) are calculated as

μ_{a} = exp (μ_{a *} + \frac{σ_{a *}^{2}}{2}) and σ_{a}^{2} = exp (2 μ_{a *} + σ_{a *}^{2}) (exp (σ_{a *}^{2}) - 1) .

The hyperparameters of α (i.e., $μ_{α}$ and $σ_{α}^{2}$ ) are similarly obtained given $μ_{α *}$ and $σ_{α *}^{2}$ . The logit transformation of c does not have closed-form expressions for the mean and variance on the original scale; instead, they can be obtained empirically from the data from which prior information is derived. Once the means and variances are calculated on the original scale, covariances between the item parameters can be obtained using Stein’s (1981) lemma. For example, the covariance between a and b is obtained as $Cov (a, b) = μ_{a} Cov (a^{*}, b)$ ; likewise, the covariance between a and α is obtained as $Cov (a, α) = μ_{a} μ_{α} Cov (a^{*}, α^{*})$ .
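The back-transformations can be sketched as follows, using the hyperparameter values that appear later in the simulation design ($μ_{a *} = - 0.043$, $σ_{a *}^{2} = 0.086$, $μ_{c *} = - 1.386$, $σ_{c *}^{2} = 0.040$). Two points are assumptions of this sketch rather than the study's procedure: the moments of c are approximated by Monte Carlo draws instead of being taken from empirical data, and the correlation of .4 between $a^{*}$ and b is purely illustrative.

```python
import math
import random


def lognormal_moments(mu_star, var_star):
    """Mean and variance of exp(X) when X ~ N(mu_star, var_star)."""
    mean = math.exp(mu_star + var_star / 2.0)
    var = math.exp(2.0 * mu_star + var_star) * (math.exp(var_star) - 1.0)
    return mean, var


def logit_normal_moments(mu_star, var_star, n=200_000, seed=1):
    """Monte Carlo mean and variance of c when logit(c) ~ N(mu_star, var_star)."""
    rng = random.Random(seed)
    sd = math.sqrt(var_star)
    draws = [1.0 / (1.0 + math.exp(-rng.gauss(mu_star, sd))) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / n
    return mean, var


mu_a, var_a = lognormal_moments(-0.043, 0.086)
mu_c, var_c = logit_normal_moments(-1.386, 0.040)

# Stein's lemma for a = exp(a*): Cov(a, b) = mu_a * Cov(a*, b).
# Cov(a*, b) assumes a correlation of .4 and Var(b) = 1 (illustrative only).
cov_astar_b = 0.4 * math.sqrt(0.086 * 1.0)
cov_a_b = mu_a * cov_astar_b
print(round(mu_a, 3), round(var_a, 3), round(mu_c, 3), round(var_c, 4))
```

Under these inputs the mean and variance of a come out close to 1 and 0.09, and those of c close to 0.2 and 0.001, which agrees with the original-scale values reported for the simulation.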

2.2. Assumptions

The item parameters of the hierarchical framework are estimated under several assumptions. The first set of assumptions concerns the identification of the model parameters. Specifically, two constraints are introduced to the population domain for identifying the model parameters: $μ_{P} = 0$ and $σ_{θ}^{2} = 1$ . The constraints, $μ_{θ} = 0$ and $σ_{θ}^{2} = 1$ , are placed to fix the origin and unit of the proficiency dimension. The restriction, $μ_{τ} = 0$ , is imposed to remove a trade-off between $β_{j}$ and $τ_{i}$ in Equation 2.

In addition to the assumptions made for identifiability, the study makes other assumptions to perform item calibration. First, the study assumes that items are independent of each other and examinees are independent of one another. The independence of items in particular allows one to calibrate items one at a time and simplifies the calculation of standard errors. Second, the study assumes three forms of conditional independence to derive item parameter estimators: (1) independence of responses given θ, (2) independence of response times given τ, and (3) independence between responses and response times given θ and τ. For evaluating the tenability of these assumptions, see, for example, Bolsinova, De Boeck, and Tijmstra (2018), W.-H. Chen and Thissen (1997), Glas and Suárez-Falcón (2003), Houts and Edwards (2013), Liu and Maydeu-Olivares (2012), van der Linden and Glas (2010), and Yen (1984). Lastly, it is assumed that the hyperparameters of the population and item domains are known a priori or at least estimated with enough precision in advance. Prior knowledge about the model parameters has been commonly assumed in psychometric applications and is essential for appropriately implementing the EM algorithm. For a potential negative impact of improper priors on the joint estimation of the response and response time models, see Kang (2016).

3. Marginal Likelihood Inference on Item Parameters

The study considers the marginal likelihood approach (e.g., Bock & Aitkin, 1981; Mislevy & Stocking, 1989) to make inference about the item parameters. Within the hierarchical framework, the marginal inference about the item parameters can be achieved in two ways. The first is to marginalize complete-data likelihood with respect to the latent trait parameters by utilizing prior information from the population domain. This approach is in line with the standard marginal likelihood inference and thus is called marginal maximum likelihood (MML) estimation. The other approach makes use of information from both the population and item domains and thus draws inference from a marginalized posterior probability distribution of the item parameters, giving the name of marginal maximum a posteriori (MMAP) estimation. In brief, the difference between the two estimation methods is in the extent to which item calibration utilizes information from the second level of the hierarchical framework.

3.1. MML Estimation

Let $u_{i j}$ and $t_{i j}$ each denote the item response and response time observed for an examinee i on a field-tested item j. The likelihood of the item parameters is obtained by marginalizing the complete-data probability distribution with respect to the latent trait variables:

L (ξ_{j} | u_{i j}, t_{i j}) \equiv f (u_{i j}, t_{i j} | ξ_{j}) = \iint f (u_{i j}, t_{i j} | θ, τ, ξ_{j}) f (θ, τ | Ω) d θ d τ,

where $Ω = (μ_{P}, Σ_{P})$ gives the hyperparameters for the latent trait parameters and $f (θ, τ | Ω)$ is the well-defined prior density for the person parameters. The subscript i for the person parameters is deliberately omitted to indicate that they are latent variables being integrated out. Under the assumption of conditional independence between $u_{i j}$ and $t_{i j}$ given $(θ, τ)$ , the conditional joint probability distribution, $f (u_{i j}, t_{i j} | θ, τ, ξ_{j})$ , is simplified to the product of $f (u_{i j} | θ, ξ_{j})$ and $f (t_{i j} | τ, ξ_{j})$ .

Now let $u_{j} = (u_{i j} : i = 1, . . ., N)$ and $t_{j} = (t_{i j} : i = 1, . . ., N)$ denote the vectors of responses and response times observed for item j from a sample of N examinees. The logarithm of the marginal likelihood given $(u_{j}, t_{j})$ is then obtained as

l = log L (ξ_{j} | u_{j}, t_{j}) = \sum_{i = 1}^{N} log \iint f (u_{i j}, t_{i j} | θ, τ, ξ_{j}) f (θ, τ | Ω) d θ d τ .

The MML estimator of $ξ_{j}$ is attained as a solution to the equation

\frac{\partial l}{\partial ξ_{j}} = {(\frac{\partial l}{\partial a_{j}}, \frac{\partial l}{\partial b_{j}}, \frac{\partial l}{\partial c_{j}}, \frac{\partial l}{\partial α_{j}}, \frac{\partial l}{\partial β_{j}})}^{Τ} = 0,

where

\begin{array}{l} \frac{\partial l}{\partial a_{j}} = D \sum_{i = 1}^{N} \iint (θ- b_{j}) \frac{(u_{i j} - P_{i j}) (P_{i j} - c_{j})}{P_{i j} (1 - c_{j})} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ, \\ \frac{\partial l}{\partial b_{j}} = - D a_{j} \sum_{i = 1}^{N} \iint \frac{(u_{i j} - P_{i j}) (P_{i j} - c_{j})}{P_{i j} (1 - c_{j})} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ, \\ \frac{\partial l}{\partial c_{j}} = \sum_{i = 1}^{N} \iint \frac{(u_{i j} - P_{i j})}{P_{i j} (1 - c_{j})} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ, \\ \frac{\partial l}{\partial α_{j}} = \sum_{i = 1}^{N} \iint (α_{j}^{- 1} - α_{j} {(log t_{i j} - (β_{j} - τ))}^{2}) f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ, and \\ \frac{\partial l}{\partial β_{j}} = α_{j}^{2} \sum_{i = 1}^{N} \iint (log t_{i j} - (β_{j} - τ)) f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ . \end{array}

The above derivatives cannot be calculated yet because of dependence on the unknown quantities resulting from the latent variables (e.g., $P_{i j}$ , τ). The current study employs the EM algorithm to tackle the incomplete-data problem. The algorithm finds maximum likelihood estimates of the item parameters by iteratively alternating between two procedures: expectation (E) and maximization (M) steps. The E-step evaluates the conditional expectation of the complete-data log-likelihood given the provisional estimates of the item parameters. In the M-step, the item parameter values are updated by those that maximize the expectation of the complete-data log-likelihood. The optimization in the M-step is usually accomplished via a numerical technique such as Newton–Raphson or Fisher scoring. The two steps are repeated until the EM algorithm satisfies a suitable convergence criterion. Appendix A provides a detailed description of the estimation algorithm.
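To make the quantities in this subsection concrete, the sketch below approximates the double integral with a Gauss–Hermite product grid, evaluates the marginal log-likelihood, computes the posterior weights of $(θ, τ)$ that an E-step requires, and forms the score components for $α_{j}$ and $β_{j}$. The quadrature scheme, the number of nodes, the function names, and the data values are all assumptions of this sketch; the algorithm actually used in the study is the one described in Appendix A and may differ.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

D = 1.702  # scaling constant from the 3PL model


def make_grid(sigma_p, n_nodes=21):
    """Quadrature nodes and weights for (theta, tau) ~ N(0, sigma_p)."""
    x, w = hermegauss(n_nodes)            # nodes/weights for exp(-x^2 / 2)
    w = w / w.sum()                       # normalize to probability weights
    z1, z2 = np.meshgrid(x, x, indexing="ij")
    z = np.stack([z1.ravel(), z2.ravel()])
    theta, tau = np.linalg.cholesky(sigma_p) @ z
    return theta, tau, np.outer(w, w).ravel()


def log_joint(xi, u, t, theta, tau):
    """log f(u_ij | theta) + log f(t_ij | tau) at every node; shape N x K."""
    a, b, c, alpha, beta = xi
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    u = np.asarray(u, dtype=float)[:, None]
    log_t = np.log(np.asarray(t, dtype=float))[:, None]
    resid = log_t - (beta - tau)
    log_f_u = u * np.log(p) + (1.0 - u) * np.log(1.0 - p)
    log_f_t = (np.log(alpha) - log_t - 0.5 * np.log(2.0 * np.pi)
               - 0.5 * alpha**2 * resid**2)
    return log_f_u + log_f_t


def marginal_loglik(xi, u, t, grid):
    """Quadrature approximation of the marginal log-likelihood."""
    theta, tau, wts = grid
    return float(np.sum(np.log(np.exp(log_joint(xi, u, t, theta, tau)) @ wts)))


def posterior_weights(xi, u, t, grid):
    """E-step: posterior weight of each grid node for each examinee."""
    theta, tau, wts = grid
    lj = log_joint(xi, u, t, theta, tau) + np.log(wts)
    lj -= lj.max(axis=1, keepdims=True)   # guard against underflow
    post = np.exp(lj)
    return post / post.sum(axis=1, keepdims=True)


def rt_scores(xi, u, t, grid):
    """Score components for alpha_j and beta_j under the quadrature."""
    theta, tau, wts = grid
    alpha, beta = xi[3], xi[4]
    post = posterior_weights(xi, u, t, grid)
    resid = np.log(np.asarray(t, dtype=float))[:, None] - (beta - tau)
    d_alpha = float(np.sum(post * (1.0 / alpha - alpha * resid**2)))
    d_beta = float(alpha**2 * np.sum(post * resid))
    return d_alpha, d_beta


# Hypothetical data for one field-testing item
grid = make_grid(np.array([[1.0, 0.3], [0.3, 1.0]]))
xi = np.array([1.0, 0.0, 0.2, 1.5, 0.3])   # (a, b, c, alpha, beta)
u = [1, 0, 1, 1]
t = [1.2, 0.8, 2.5, 1.6]
print(marginal_loglik(xi, u, t, grid), rt_scores(xi, u, t, grid))
```

Because the same grid is used for the likelihood and for the posterior weights, the two score components returned by `rt_scores` are the exact derivatives of the quadrature-approximated log-likelihood, which gives a convenient finite-difference check on an implementation.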

3.2. MMAP Estimation

The MMAP estimator draws inference about the item parameters from a posterior probability distribution. Let $f (ξ_{j} | Ψ)$ denote the prior density of the item parameters defined under the hyperparameters $Ψ = (μ_{I}, Σ_{I})$ . The posterior distribution of $ξ_{j}$ is then obtained as

f (ξ_{j} | u_{j}, t_{j}, Ψ) = \frac{L (ξ_{j} | u_{j}, t_{j}) f (ξ_{j} | Ψ)}{f (u_{j}, t_{j})},

where $L (ξ_{j} | u_{j}, t_{j})$ is the likelihood of $ξ_{j}$ given the observed data, and $f (u_{j}, t_{j})$ is the marginal probability of observing $(u_{j}, t_{j})$ . The denominator is a constant that does not depend on $ξ_{j}$ , and thus, the MMAP estimator can be equivalently obtained by maximizing

p = \sum_{i = 1}^{N} log \iint f (u_{i j}, t_{i j} | θ, τ, ξ_{j}) f (θ, τ | Ω) d θ d τ + log f (ξ_{j} | Ψ) .

The first term on the right-hand side of Equation 8 is the log of the marginal likelihood, which is given in Equation 6. Hence, one only needs to address the additional term corresponding to the prior density.

Assuming the multivariate normal prior for the transformed item parameters in Equation 5, the MMAP estimator of $ξ_{j}^{*}$ is obtained by simultaneously solving the equation:

\frac{\partial p}{\partial ξ_{j}^{*}} = {(\frac{\partial p}{\partial a_{j}^{*}}, \frac{\partial p}{\partial b_{j}}, \frac{\partial p}{\partial c_{j}^{*}}, \frac{\partial p}{\partial α_{j}^{*}}, \frac{\partial p}{\partial β_{j}})}^{Τ} = 0,

where

\begin{array}{l} \frac{\partial p}{\partial a_{j}^{*}} = D a_{j} \sum_{i = 1}^{N} \iint (u_{i j} - P_{i j}) (θ - b_{j}) \frac{(P_{i j} - c_{j})}{P_{i j} (1 - c_{j})} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ - υ_{1}^{Τ} (ξ_{j}^{*} - μ_{I}^{*}), \\ \frac{\partial p}{\partial b_{j}} = - D a_{j} \sum_{i = 1}^{N} \iint (u_{i j} - P_{i j}) \frac{(P_{i j} - c_{j})}{P_{i j} (1 - c_{j})} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ - υ_{2}^{Τ} (ξ_{j}^{*} - μ_{I}^{*}), \\ \frac{\partial p}{\partial c_{j}^{*}} = c_{j} \sum_{i = 1}^{N} \iint \frac{(u_{i j} - P_{i j})}{P_{i j}} f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ - υ_{3}^{Τ} (ξ_{j}^{*} - μ_{I}^{*}), \\ \frac{\partial p}{\partial α_{j}^{*}} = \sum_{i = 1}^{N} \iint (1 - α_{j}^{2} {(log t_{i j} - (β_{j} - τ))}^{2}) f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ - υ_{4}^{Τ} (ξ_{j}^{*} - μ_{I}^{*}), and \\ \frac{\partial p}{\partial β_{j}} = α_{j}^{2} \sum_{i = 1}^{N} \iint (log t_{i j} - (β_{j} - τ)) f (θ, τ | u_{i j}, t_{i j}, Ω) d θ d τ - υ_{5}^{Τ} (ξ_{j}^{*} - μ_{I}^{*}), \end{array}

with $υ_{m} = (υ_{m m'} : m' = 1, \ldots, 5)$ being the mth column vector of ${(Σ_{I}^{*})}^{- 1}$ . As in the MML estimation, the study employs the EM algorithm to solve the above equation. A detailed account of the algorithm can be found in Appendix B. Note that the parameter estimates obtained from the above procedure are on the transformed scale. To place the estimates on the original parameter scale, one needs to apply the inverse transformations to the estimated parameter values.
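The prior contribution to each MMAP score equation and the transformations between the two parameter scales can be sketched as follows. The helper names, the compound-symmetric correlation of .4, and the parameter values for the hypothetical item are assumptions made for illustration; the hyperparameter means and variances are borrowed from the simulation design described later.

```python
import numpy as np


def to_transformed(xi):
    """(a, b, c, alpha, beta) -> (log a, b, logit c, log alpha, beta)."""
    a, b, c, alpha, beta = xi
    return np.array([np.log(a), b, np.log(c / (1.0 - c)), np.log(alpha), beta])


def to_original(xi_star):
    """Inverse of to_transformed."""
    a_s, b, c_s, alpha_s, beta = xi_star
    return np.array([np.exp(a_s), b, 1.0 / (1.0 + np.exp(-c_s)),
                     np.exp(alpha_s), beta])


def log_prior(xi_star, mu_star, sigma_star):
    """Log multivariate normal prior density, up to an additive constant."""
    diff = xi_star - mu_star
    return -0.5 * diff @ np.linalg.solve(sigma_star, diff)


def log_prior_grad(xi_star, mu_star, sigma_star):
    """Gradient of log_prior; element m equals -v_m' (xi* - mu*)."""
    return -np.linalg.solve(sigma_star, xi_star - mu_star)


# Illustrative hyperparameters on the transformed scale
mu_star = np.array([-0.043, 0.0, -1.386, -0.043, 0.0])
sd_star = np.sqrt(np.array([0.086, 1.0, 0.040, 0.086, 1.0]))
corr = np.full((5, 5), 0.4)
np.fill_diagonal(corr, 1.0)
sigma_star = corr * np.outer(sd_star, sd_star)

xi_hat = np.array([1.2, 0.5, 0.18, 0.9, -0.2])   # hypothetical item
xi_star = to_transformed(xi_hat)
grad = log_prior_grad(xi_star, mu_star, sigma_star)
print(np.round(grad, 3), np.round(to_original(xi_star), 3))
```

Each element of `grad` is the term subtracted at the end of the corresponding score equation, so a parameter value far from its prior mean is pulled back toward it in proportion to the prior precision.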

4. Optimal Sampling Design

The other critical element to consider in online calibration is sampling design, which prescribes how to collect a calibration sample for an item being field tested. For maximizing information about the parameters of an item, it is ideal to select an optimal sample from a complete pool of test takers. In practice, however, it is difficult to obtain such a static examinee pool because many operational tests are administered continuously or at time intervals. A more viable approach is to select an item from a preconstructed item pool and administer the selected item when an examinee reaches a seeding location for field-testing. This strategy leads to a procedure akin to the operational item selection. That is, as an examinee enters a field-testing stage, all items in the field-testing item pool are evaluated based on their current parameter values, and the item for which the current examinee would contribute the most under the chosen optimality criterion is selected for administration.

Prior studies have considered three approaches to assign a field-testing item: (1) random selection (e.g., Ban et al., 2001, 2002; P. Chen, 2017; P. Chen & Wang, 2016; P. Chen et al., 2012), (2) examinee-centered selection (e.g., P. Chen et al., 2012, 2017; Zheng, 2016), and (3) item-centered selection (e.g., Chang & Lu, 2010; Ren et al., 2017; van der Linden & Ren, 2014). Random sampling selects an item randomly from a field-testing item pool when an examinee joins the field-testing. The examinee-centered design selects an item following the operational item selection rule, which, in most cases, is designed to maximize information about the examinee’s trait level. The third approach, item-centered design, selects the item about which the given examinee can provide the most information among the prospective items. Note that, in the random and examinee-centered schemes, the calibration samples can contain little information about the field-tested items as they are not optimized for item calibration. In view of this, the current study employs the item-centered design in the interest of maximum item information and sampling efficiency.

In the joint model of a response and response time, item information retained by an examinee is calculated as

I (ξ_{j}) = (\begin{matrix} I_{a a j} & I_{a b j} & I_{a c j} & 0 & 0 \\ I_{a b j} & I_{b b j} & I_{b c j} & 0 & 0 \\ I_{a c j} & I_{b c j} & I_{c c j} & 0 & 0 \\ 0 & 0 & 0 & I_{α α j} & 0 \\ 0 & 0 & 0 & 0 & I_{β β j} \end{matrix}),

where

\begin{array}{l} I_{a a j} = D^{2} \sum_{i = 1}^{N} {(θ_{i} - b_{j})}^{2} {(\frac{P_{i j} - c_{j}}{1 - c_{j}})}^{2} \frac{Q_{i j}}{P_{i j}}, & I_{b b j} = D^{2} a_{j}^{2} \sum_{i = 1}^{N} {(\frac{P_{i j} - c_{j}}{1 - c_{j}})}^{2} \frac{Q_{i j}}{P_{i j}}, \\ I_{c c j} = \sum_{i = 1}^{N} \frac{Q_{i j}}{P_{i j} {(1 - c_{j})}^{2}}, & I_{α α j} = 2 N α_{j}^{- 2}, \\ I_{β β j} = N α_{j}^{2}, & I_{a b j} = - D^{2} a_{j} \sum_{i = 1}^{N} (θ_{i} - b_{j}) {(\frac{P_{i j} - c_{j}}{1 - c_{j}})}^{2} \frac{Q_{i j}}{P_{i j}}, \\ I_{a c j} = D \sum_{i = 1}^{N} (θ_{i} - b_{j}) \frac{(P_{i j} - c_{j})}{{(1 - c_{j})}^{2}} \frac{Q_{i j}}{P_{i j}}, and & I_{b c j} = - D a_{j} \sum_{i = 1}^{N} \frac{(P_{i j} - c_{j})}{{(1 - c_{j})}^{2}} \frac{Q_{i j}}{P_{i j}} . \end{array}

Some key features of the item information matrix can be summarized as follows. First, the null submatrices in the off-diagonal blocks indicate that the item parameters from the response and response time models are conditionally independent given the person parameters. That is, the response time model parameters provide no information about the response model parameters once the examinee’s trait parameters have been taken into account. Second, the bottom-right submatrix indicates that information about the response time parameters depends only on $α_{j}$ and N. This implies that, once the value of $α_{j}$ is known, neither $β_{j}$ nor $τ_{i}$ provides any further information about the response time model parameters. This being the case, we denote the item information matrix as a function of $θ_{i}$ (i.e., $I (ξ_{j}; θ_{i})$ ) in the following notation to highlight the fact that an individual’s contribution to item information is manifested only through the proficiency level and does not vary with the examinee’s speed level.

Given the above item information matrix, a number of optimality criteria are applied to choose the most desirable item for a given examinee. The first criterion is D-optimality, which is designed to maximize the determinant of the item information matrix. For estimating the item parameters within the joint modeling framework, the D-optimal sample is obtained as:

D - optimality : \underset{j}{arg max} det [I ({\hat{ξ}}_{j}; θ_{i})] .

The calibration sample obtained according to the D-optimal criterion will then minimize the volume of a confidence ellipsoid (i.e., generalized variance) of the parameter estimates of the field-tested item. The other criterion considered is A-optimality, which targets the trace of the inverse of the information matrix rather than its determinant. In the present context of the joint model, the A-optimal criterion selects the item that minimizes this trace:

A - optimality : \underset{j}{arg min} trace [I^{- 1} ({\hat{ξ}}_{j}; θ_{i})] .

The calibration sample selected according to the above criterion will minimize average variance of the parameter estimates. For both the optimality criteria, the inverse of the prior covariance matrix ( $Σ_{I}^{- 1}$ ) can be affixed to invoke Bayesian versions of D- and A-optimality. Each criterion will then lead to samples that minimize the determinant and trace of the posterior covariance matrix, respectively.
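The two selection rules can be sketched as follows for a set of candidate items whose information matrices have already been evaluated. The function names and the two diagonal information matrices are hypothetical, and the use of the log-determinant in place of the determinant is a numerical convenience of this sketch that leaves the selected item unchanged.

```python
import numpy as np


def d_value(info):
    """Log-determinant of an information matrix (larger is better)."""
    sign, logdet = np.linalg.slogdet(info)
    return logdet if sign > 0 else -np.inf


def a_value(info):
    """Trace of the inverse information matrix (smaller is better)."""
    return float(np.trace(np.linalg.inv(info)))


def select_item(infos, rule="D"):
    """Index of the field-testing item chosen under the D or A rule."""
    if rule == "D":
        return int(np.argmax([d_value(m) for m in infos]))
    if rule == "A":
        return int(np.argmin([a_value(m) for m in infos]))
    raise ValueError("rule must be 'D' or 'A'")


# Hypothetical 5 x 5 information matrices for two candidate items
infos = [
    np.diag([4.0, 4.0, 4.0, 4.0, 4.0]),
    np.diag([100.0, 100.0, 100.0, 1.0, 0.5]),
]
print(select_item(infos, "D"), select_item(infos, "A"))
```

In this constructed example the two rules disagree: the D rule selects the second item because its determinant is larger, whereas the A rule selects the first item because the second has two poorly determined parameters that inflate the trace of the inverse.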

Observe that the above sampling methods attempt to maximize information about all five item parameters in the joint model. Alternatively, one can optimize the sampling for a subset of the item parameters. For example, if the primary objective of online calibration is to obtain precise estimates of the response model parameters, it would be more desirable to select samples that can provide the most information about the targeted parameters, while using response times as collateral information. In such a case, one can adjust the selection criteria to obtain samples that are optimized for the intended parameters. For the D- and A-optimal sampling designs, centering on a subset of parameters leads to $D_{S}$ - and $A_{S}$ -optimality (Mulder & van der Linden, 2009; Silvey, 1980). For example, if the response model parameters are of primary interest for estimation, $D_{S}$ - and $A_{S}$ -optimality can be implemented as

D_{S} -optimality : \underset{j}{arg max} det {[W^{Τ} I^{- 1} ({\hat{ξ}}_{j}; θ_{i}) W]}^{- 1},

and

A_{S} -optimality : \underset{j}{arg min} trace [W^{Τ} I^{- 1} ({\hat{ξ}}_{j}; θ_{i}) W],

respectively, where $W^{Τ}$ is a $3 \times 5$ matrix consisting of $[I_{3} 0]$ with $I_{3}$ being a $3 \times 3$ identity matrix and $0$ being a $3 \times 2$ null matrix.
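A short sketch of the subset criteria follows. It builds the response-model block of the information matrix from the 3PL expressions, places it in a block-diagonal $5 \times 5$ matrix, and evaluates the $D_{S}$ and $A_{S}$ statistics. The ability values, item parameters, and the response-time block are placeholders chosen only so that the matrix is invertible; they are not values from the study.

```python
import numpy as np

D = 1.702


def info_3pl(theta, a, b, c):
    """3 x 3 information block for (a, b, c), summed over abilities in theta."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    q = 1.0 - p
    r = (p - c) / (1.0 - c)
    # One row per examinee; columns hold dP/da, dP/db, dP/dc
    grad = np.stack([D * (theta - b) * r * q, -D * a * r * q, q / (1.0 - c)],
                    axis=1)
    return (grad / (p * q)[:, None]).T @ grad


def subset_criteria(info, n_keep=3):
    """D_S and A_S statistics for the first n_keep parameters."""
    k = info.shape[0]
    w = np.vstack([np.eye(n_keep), np.zeros((k - n_keep, n_keep))])  # W: 5 x 3
    cov_sub = w.T @ np.linalg.inv(info) @ w
    return 1.0 / np.linalg.det(cov_sub), float(np.trace(cov_sub))


theta_sample = np.linspace(-2.0, 2.0, 50)      # hypothetical abilities
resp_block = info_3pl(theta_sample, a=1.0, b=0.0, c=0.2)
rt_block = np.diag([40.0, 60.0])               # placeholder values only
info = np.zeros((5, 5))
info[:3, :3] = resp_block
info[3:, 3:] = rt_block
d_s, a_s = subset_criteria(info)
print(d_s, a_s)
```

Because the information matrix is block diagonal, $W^{Τ} I^{- 1} W$ equals the inverse of the response-model block, so the $D_{S}$ statistic reduces to the determinant of that block and the $A_{S}$ statistic to the trace of its inverse, whatever values the response-time block takes.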

Note that, for all the criteria discussed above, the optimality statistics are calculated approximately by replacing the true proficiency parameters with the estimated values. Doing so can induce measurement error when selecting the calibration samples. A practical way of minimizing measurement error is to assign field-testing items at the end of a test, so that the optimality criteria can be evaluated based on the estimates that are sufficiently close to the true proficiency parameter values. The simulation study presented below corroborates that the optimality statistics calculated this way result in only a slight loss of sampling efficiency and can still attain nearly optimal samples.

5. Simulation Study

The study performed extensive simulations to evaluate the effectiveness of the proposed online calibration strategies. The simulations were in particular designed to address three questions: (1) overall performance of the suggested online calibration methods, (2) relative performance of joint calibration to regular response model calibration, and (3) an optimal sampling strategy given a particular focus of online calibration.

5.1. Simulation Design

The specific conditions and design variables considered in the simulation study are presented below.

Examinee generation: The study assumed a population of 40,000 test takers for administering CAT. The examinees’ trait parameters were randomly sampled from a bivariate normal distribution with zero means and unit variances. The correlation between the proficiency and speed dimensions was set as $ρ_{P} = .3$ to simulate a moderately strong association.

Item generation: The operational and field-testing item pools were created with the fixed sizes of 300 and 15, respectively. The parameter values of both item types were drawn from a multivariate normal distribution with mean $μ_{I}^{*} = (- 0.043, 0, - 1.386, - 0.043, 0)^{Τ}$ and variance $diag (Σ_{I}^{*}) = (0.086, 1, 0.040, 0.086, 1)^{Τ}$ . These hyperparameter values correspond to a mean of $μ_{I} = (1, 0, 0.2, 1, 0)^{Τ}$ and a variance of $diag (Σ_{I}) = (0.09, 1, 0.001, 0.09, 1)^{Τ}$ on the original scale. The correlation among the item parameters was systematically varied across $ρ_{I} = .0$ , $.4$ , and $.8$ to investigate the impact of different levels of dependency on item calibration. Note that an item correlation of $ρ_{I} = .8$ is unlikely in practice; this condition was included as a point of reference to examine the implication of strong correlation. Throughout the simulation, all pairwise correlations were fixed at the level specified for the condition to get a clear picture of the impact of $ρ_{I}$ .
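The item generation step can be sketched as below. The function name, the seed, and the use of NumPy's random generator are assumptions of this sketch; the hyperparameter values and the equal pairwise correlation follow the description above.

```python
import numpy as np


def simulate_items(n_items, rho, rng):
    """Draw (a, b, c, alpha, beta) for n_items items with pairwise correlation rho."""
    mu_star = np.array([-0.043, 0.0, -1.386, -0.043, 0.0])
    sd_star = np.sqrt(np.array([0.086, 1.0, 0.040, 0.086, 1.0]))
    corr = np.full((5, 5), rho)
    np.fill_diagonal(corr, 1.0)
    sigma_star = corr * np.outer(sd_star, sd_star)
    xi_star = rng.multivariate_normal(mu_star, sigma_star, size=n_items)
    # Inverse transformations back to the original parameter scale
    a = np.exp(xi_star[:, 0])
    b = xi_star[:, 1]
    c = 1.0 / (1.0 + np.exp(-xi_star[:, 2]))
    alpha = np.exp(xi_star[:, 3])
    beta = xi_star[:, 4]
    return np.column_stack([a, b, c, alpha, beta])


rng = np.random.default_rng(2024)                 # arbitrary seed
operational = simulate_items(300, rho=0.4, rng=rng)
field_test = simulate_items(15, rho=0.4, rng=rng)
print(operational.shape, field_test.shape)
```

Drawing on the transformed scale and then inverting the transformations guarantees that every simulated a and α is positive and every c lies strictly between 0 and 1.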

Operational CAT administration: The operational CAT was simulated with a total of 30 items. During CAT, items were selected adaptively according to the maximum information criterion. Item selection based on the maximum information rule tends to result in a skewed distribution of item usage. To prevent overexposure of items, the study imposed a constraint on the maximum number of item assignments such that each item could be administered to no more than 800 examinees. The examinees’ trait parameters were estimated via expected a posteriori estimation throughout.

Online calibration: Online calibration of field-testing items was carried out as follows. Each test taker received three field-testing items during operational CAT. The seeding locations of the pilot items were randomly decided in the later stage of testing (i.e., between the 24th and 33rd items) to limit measurement error in the interim trait estimates. For each prospective item, starting parameter values, which were obtained from a random sample of 300 examinees, were assigned before the outset of online calibration. These parameter values were used to initiate the sampling at the beginning of online calibration. In practice, one can obtain initial parameter values from content experts’ crude approximations or by conducting a preparatory online calibration with small samples. When an item entered the field-testing process, it was first administered to 100 examinees according to the specified sampling strategy. The item parameter values were not updated during this period because the sample was too small to attain stable parameter estimates. Once the item had been assigned to the minimum sample of 100 examinees, the parameter values were successively estimated and updated after every batch of 10 additional observations. The adaptive online calibration process continued until the item had been assigned to the maximum sample of 800 examinees.
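The seeding and updating schedule described above can be laid out explicitly. The sketch below is one reading of the procedure, under the assumption that the first re-estimation occurs after the first batch beyond the minimum sample (i.e., at a sample size of 110); the function names and the random seed are hypothetical.

```python
import random

def seeding_positions(n_field=3, first=24, last=33, rng=random):
    """Pick distinct test positions for the field-testing items."""
    return sorted(rng.sample(range(first, last + 1), n_field))

def update_points(n_min=100, n_max=800, batch=10):
    """Sample sizes at which the item parameters are re-estimated."""
    return list(range(n_min + batch, n_max + 1, batch))

positions = seeding_positions(rng=random.Random(1))
points = update_points()

print("seeding positions:", positions)
print("number of updates:", len(points))
print("first and last update at N =", points[0], "and", points[-1])
```

Under this reading, each item goes through 70 successive updates between the minimum and maximum sample sizes.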

Item parameter estimation: The study employed MMAP as the primary estimation method for calibrating the items. Our preliminary study suggested that the MML estimator tends to have convergence problems when the calibration samples are selected adaptively. It was surmised that this result arose because adaptive sampling leaves little room for guessing. When the lower-asymptote parameters were fixed to a known constant (e.g., the reciprocal of the number of response options), the MML estimator showed an improved convergence rate, but its overall rate of successful convergence was still lower than that of the MMAP estimator.¹ Given the findings from the preliminary study, the subsequent simulations were conducted with MMAP as the primary estimator. Note that, although the MML estimator was not used for adaptive online calibration, it can still be of use in other situations, such as linear calibration settings or when prior information is not available.

Sampling design: For every field-testing item, calibration samples were obtained sequentially according to the optimality criterion described in Section 4. The first two criteria select samples that are optimized for all five item parameters; the last two criteria select samples optimized for the response model parameters. To compare the performance of the sampling methods under different scenarios, the study additionally performed online calibration of the response model. The sampling methods evaluated within each calibration model are: random, D-optimal, and A-optimal sampling for the response model calibration; and random, D- and $D_{S}$ -optimal, and A- and $A_{S}$ -optimal sampling for the joint model calibration.

Replication: The design factors were crossed to contrast the performance of the different online calibration scenarios. In total, 24 simulation conditions were examined: eight combinations of calibration model and sampling design (random, D-, $D_{S}$-, A-, and $A_{S}$-optimality for joint model calibration; random, D-, and A-optimality for response model calibration) crossed with three levels of $ρ_{I}$. Each condition was replicated 100 times with uniquely generated model parameters.
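The count of 24 conditions follows from crossing the eight model-by-design combinations with the three correlation levels, as the short enumeration below illustrates. The labels are shorthand introduced here for illustration.

```python
from itertools import product

# Sampling designs evaluated within each calibration model.
designs = {
    "response model": ["random", "D", "A"],
    "joint model": ["random", "D", "D_S", "A", "A_S"],
}
rho_levels = [0.0, 0.4, 0.8]

conditions = [
    (model, design, rho)
    for model, model_designs in designs.items()
    for design, rho in product(model_designs, rho_levels)
]

print(len(conditions))
```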

5.2. Evaluation

The study considered a number of criteria to evaluate the performance of the online calibration procedures. First, the relative efficiency statistic (Berger, 1991, 1994) was employed to examine the impact of measurement error on sampling efficiency:

$$\mathrm{RelEff}_{j} = \frac{\log \left( \det \left[ I(\hat{ξ}_{j}; \hat{θ}_{i}, i = 1, \ldots, N) \right] \right)}{\log \left( \det \left[ I(\hat{ξ}_{j}; θ_{i}, i = 1, \ldots, N) \right] \right)},$$

where $\hat{ξ}_{j}$ denotes the vector of the parameter estimates of item j, $\hat{θ}_{i}$ and $θ_{i}$ are the estimated and true proficiency values of examinee i in the calibration sample for item j, and N is the size of the calibration sample. The relative efficiency statistic measures the extent to which sampling loses efficiency as a result of using the estimated proficiency values in place of the true parameters. A value close to one indicates that measurement error has limited impact on sampling efficiency. A value substantially smaller than one indicates that measurement error has severely degraded sampling efficiency.
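The computation of the statistic can be illustrated with a deliberately simplified example. The sketch below replaces the five-parameter joint model with a two-parameter logistic (2PL) item so that the information matrix is $2 \times 2$ and its determinant can be written out by hand; the sample size, the measurement-error standard deviation of 0.3, and all function names are assumptions made for illustration only.

```python
import math
import random

def info_2pl(a, b, thetas):
    """Fisher information for (a, b) of a 2PL item, summed over examinees."""
    s_aa = s_ab = s_bb = 0.0
    for t in thetas:
        p = 1.0 / (1.0 + math.exp(-a * (t - b)))
        w = p * (1.0 - p)
        s_aa += w * (t - b) ** 2
        s_ab += -w * a * (t - b)
        s_bb += w * a ** 2
    return s_aa, s_ab, s_bb

def log_det(a, b, thetas):
    """Log-determinant of the 2 x 2 information matrix."""
    s_aa, s_ab, s_bb = info_2pl(a, b, thetas)
    return math.log(s_aa * s_bb - s_ab ** 2)

def relative_efficiency(a_hat, b_hat, theta_hat, theta_true):
    """Ratio of log-determinants at the estimated and true trait values."""
    return log_det(a_hat, b_hat, theta_hat) / log_det(a_hat, b_hat, theta_true)

rng = random.Random(0)
theta_true = [rng.gauss(0.0, 1.0) for _ in range(500)]
theta_hat = [t + rng.gauss(0.0, 0.3) for t in theta_true]  # hypothetical error

rel_eff = relative_efficiency(1.0, 0.0, theta_hat, theta_true)
print(f"relative efficiency: {rel_eff:.3f}")
```

When the estimated and true trait values coincide, the statistic equals one by construction, which is the reference point used in interpreting the results.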

The precision of the estimated parameter values was evaluated via bias, root mean squared error (RMSE), correlation, and standard error. The bias and RMSE were calculated respectively as

$$\mathrm{Bias}_{m} = \frac{1}{J} \sum_{j = 1}^{J} (ξ_{j m} - \hat{ξ}_{j m}) \quad \text{and} \quad \mathrm{RMSE}_{m} = \sqrt{\frac{1}{J} \sum_{j = 1}^{J} (ξ_{j m} - \hat{ξ}_{j m})^{2}},$$

where J is the number of field-tested items, $ξ_{j m}$ is the true value of the mth parameter of item j, and $\hat{ξ}_{j m}$ is the corresponding estimated value. The correlation between the estimated and generating parameter values was assessed using the Pearson product–moment correlation coefficient. The standard errors of the item parameter estimates were obtained as the square roots of the diagonal elements of the inverse of the observed Fisher information matrix under the assumption that the items are independent.
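A minimal sketch of the two summary statistics follows. It keeps the sign convention of the definitions above (true value minus estimate); the parameter values in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def bias(true_vals, est_vals):
    """Average of (true - estimate) across items, as defined in the text."""
    return sum(t - e for t, e in zip(true_vals, est_vals)) / len(true_vals)

def rmse(true_vals, est_vals):
    """Root mean squared difference between true and estimated values."""
    squared = [(t - e) ** 2 for t, e in zip(true_vals, est_vals)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true and estimated difficulty values for three items.
true_b = [-1.0, 0.0, 1.0]
est_b = [-0.9, 0.1, 0.8]

print(f"bias = {bias(true_b, est_b):.3f}, RMSE = {rmse(true_b, est_b):.3f}")
```

The example shows why both statistics are reported: the individual errors cancel in the bias, whereas the RMSE still reflects their magnitude.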

For every evaluation criterion, the study summarizes the results within the specified online calibration scenario. Because each online calibration strategy produced its own set of calibrated field-testing items, the summary statistics are not necessarily based on the same set of field-tested items. As such, the results below should be read as the outcomes that would be observed under each online calibration scenario when the same item pools are submitted for online calibration, rather than as outcomes based on an identical set of field-tested items.

6. Results

6.1. Computation

The computational efficiency of the suggested online calibration methods is briefly reported here. On the whole, the procedures demonstrated high computational efficiency throughout the simulation. On a desktop computer with a 3.4 GHz processor and 8 GB of memory, calibrating an item under the joint model took .851 seconds on average until it reached the maximum sample size. Calibration under the response model took .017 seconds under the same conditions. The difference in computing times across the different sampling methods was minimal because of the small number of field-testing items considered.

6.2. Relative Efficiency

Figure 1 presents boxplots of the relative efficiency statistics observed for each sampling design. The plots suggest that the sampling procedures generally maintained high efficiency in selecting the calibration samples despite the measurement error. The relative efficiency statistics were close to one and remained stable across the different calibration scenarios. Although the sampling methods lost some efficiency due to measurement error, the overall magnitude of the loss was negligible. Figure 1 also reveals a consistent pattern with regard to the calibration models. The joint model consistently demonstrated higher sampling efficiency and showed smaller variation compared to the response model. This tendency seems to indicate that the use of response times in online calibration can alleviate the loss of sampling efficiency caused by measurement error and lead to more efficient and stable sampling. In particular, the improvement in sampling efficiency was even greater when the joint model was calibrated based on the optimal samples. Among the different sampling methods considered, the A-optimal procedures generally showed the highest sampling efficiency, followed by the D-optimal and random sampling methods.

Figure 1.

Relative efficiency of sampling designs. $ρ_{I}$ = correlation among the item parameters. IRM = item response model calibration. JM = joint model calibration. JM_S = joint model calibration using samples optimized for $(a, b, c)$.

6.3. Bias

Table 1 presents biases of the response model parameter estimates observed at the fixed calibration points, $N = 300$, 500, and 800. Although the actual online calibration was carried out continuously between $N = 100$ and 800, the results are reported for the fixed points to examine the performance of online calibration at the early, midway, and final stages. Table 1 shows that the estimated parameters maintained small biases across the different stages. The maximal bias observed under the evaluated conditions was less than .02 in absolute value, and there were no unusual biases. Table 1 also suggests that the size of the calibration sample was the most influential factor affecting the bias results. Despite some fluctuations, the biases tended to decrease as N increased. The calibration model and sampling designs appeared to have only a marginal effect on the bias performance.

Table 1.

Bias of the Item Parameter Estimates for the Response Model

$ρ_{I}$	Par	N	Random		D-Optimality			A-Optimality
$ρ_{I}$	Par	N	IRM	JM	IRM	JM	JM_S	IRM	JM	JM_S
.0	$\hat{a}$	300	.013	.007	.013	.012	.011	.005	.005	.013
		500	.015	.009	.005	.013	.013	.012	.009	.017
		800	.011	.011	.010	.012	.015	.012	.010	.014
	$\hat{b}$	300	−.005	.004	−.011	.005	−.001	−.008	−.005	−.005
		500	−.006	.010	−.009	.006	.001	−.003	−.002	−.003
		800	−.004	.011	−.004	.009	.002	.001	.000	−.001
	$\hat{c}$	300	−.001	−.001	−.001	.000	−.001	.001	.000	.000
		500	−.001	.000	−.001	.001	.000	.001	.000	.001
		800	.000	.000	.000	.001	.000	.001	.001	.001
.4	$\hat{a}$	300	.014	.012	.014	.008	.010	.005	.002	.012
		500	.016	.014	.015	.013	.012	.010	.010	.010
		800	.015	.014	.013	.014	.012	.013	.013	.007
	$\hat{b}$	300	.002	.007	.010	.000	.004	−.007	−.012	−.006
		500	.005	.006	.010	.002	.007	−.002	−.004	−.001
		800	.006	.008	.012	.004	.009	.000	−.001	.003
	$\hat{c}$	300	.000	.000	.001	.000	.001	−.001	−.001	.000
		500	.000	.000	.001	.000	.001	.000	.000	.000
		800	.001	.000	.001	.001	.001	.000	.000	.000
.8	$\hat{a}$	300	−.001	.002	−.004	.006	−.001	−.001	−.002	.003
		500	.000	.006	.000	.005	.002	.001	.001	.005
		800	.002	.004	.005	.005	.002	.001	.002	.002
	$\hat{b}$	300	.013	.003	−.001	.005	.001	−.003	−.002	−.003
		500	.008	.009	.002	.003	.002	.002	−.001	−.002
		800	.007	.006	−.001	.001	.002	.001	.000	−.001
	$\hat{c}$	300	.000	−.001	−.001	−.001	−.001	.000	−.001	.000
		500	−.001	.000	−.001	−.001	−.001	.000	−.001	.000
		800	.000	.000	.000	−.001	−.001	.000	−.001	−.001

Note. $ρ_{I}$ = correlation among the item parameters; Par = parameter estimate; N = calibration sample size; IRM = item response model calibration; JM = joint model calibration; JM_S = joint model calibration using samples optimized for $(a, b, c)$.

Table 2 presents biases of the response time model parameter estimates. The observed statistics were again very close to zero, implying that the parameter estimates were essentially unbiased. The largest bias observed was .005, and the values generally decreased as N increased. Among the sampling methods, random sampling led to the smallest bias, though the difference from the other designs appeared minimal.

Table 2.

Bias of the Item Parameter Estimates for the Response Time Model

$ρ_{I}$	N	$\hat{α}$					$\hat{β}$
$ρ_{I}$	N	Ran	D	$D_{S}$	A	$A_{S}$	Ran	D	$D_{S}$	A	$A_{S}$
.0	300	.000	.005	.004	.002	.002	.001	−.003	−.001	.005	.003
	500	.000	.001	.000	.001	.001	.001	−.001	−.001	.002	.000
	800	.000	−.001	−.001	−.001	−.001	.000	.000	.000	.002	.000
.4	300	.000	.002	.002	.000	.004	.002	.001	.001	.003	.000
	500	.000	.001	−.001	.000	.001	.001	.003	.002	.002	−.001
	800	.000	−.001	−.002	−.002	−.001	.002	.002	.002	.003	.001
.8	300	.001	.004	.002	.002	.001	.001	.001	−.002	.003	.004
	500	.001	.001	.001	.001	.001	.000	.002	.000	.001	.003
	800	.000	−.001	−.001	−.001	−.002	.000	.003	.002	.001	.003

Note. $ρ_{I}$ = correlation among the item parameters; N = calibration sample size; Ran = random sampling.

6.4. RMSE

Tables 3 and 4 provide average RMSEs of the item parameter estimates for the response and response time models, respectively. As with the bias results, the RMSEs are summarized for the fixed calibration points. More detailed reports are presented as figures in Appendix C in the online version of the journal. The tables suggest that the online calibration performed reasonably well in recovering the true item parameter values. The RMSEs were consistently small across the varying simulation conditions and decreased as the calibration progressed. In particular, the item parameter estimates showed fast convergence to the generating parameters in the very early stage of online calibration. For example, when $N = 300$ and $ρ_{I} = .0$, the estimated parameter values showed average RMSEs of .170 ($\hat{a}$), .145 ($\hat{b}$), .029 ($\hat{c}$), .044 ($\hat{α}$), and .062 ($\hat{β}$), and the RMSEs decreased further as N or $ρ_{I}$ increased. The present result is particularly notable considering that the 3PL model typically requires large samples (e.g., $N \geq 800$) to attain reliable estimates. The result from Table 3 suggests that the present online calibration methods can achieve fairly accurate parameter recovery using relatively small samples.
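The averages quoted for $N = 300$ and $ρ_{I} = .0$ appear to correspond to the mean of the tabled RMSEs across the calibration model and sampling design columns. This reading is an inference from the tables rather than a stated procedure; the short check below averages the relevant rows of Tables 3 and 4 under that assumption, and the variable names are hypothetical.

```python
# Rows of Table 3 (response model) and Table 4 (response time model)
# at N = 300 and rho_I = .0, in the column order of the tables.
table3_n300_rho0 = {
    "a": [.169, .166, .168, .167, .177, .161, .177, .171],
    "b": [.152, .153, .154, .149, .150, .141, .132, .133],
    "c": [.029, .030, .029, .029, .028, .028, .028, .028],
}
table4_n300_rho0 = {
    "alpha": [.044, .045, .043, .043, .044],
    "beta": [.061, .063, .063, .061, .063],
}

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

averages = {
    name: mean(row)
    for name, row in {**table3_n300_rho0, **table4_n300_rho0}.items()
}

for name, value in averages.items():
    print(f"{name}: {value:.4f}")
```

The column means agree with the quoted values to within rounding of the tabled entries.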

One of the inquiries made in this study was the relative performance of joint calibration to standard response model calibration. A comparison of the two models in Table 3 suggests that they tend to perform comparably when the item parameters are uncorrelated. When the parameters were correlated, joint calibration showed distinct improvement in estimation of the response model parameters. In particular, when $ρ_{I} = .8$, joint calibration yielded 12.62% ($\hat{a}$), 7.01% ($\hat{b}$), and 12.38% ($\hat{c}$) smaller RMSEs compared to the response model calibration.² When $ρ_{I} = .4$, the average reductions in RMSEs were 2.36% ($\hat{a}$), 4.53% ($\hat{b}$), and 5.32% ($\hat{c}$).

Table 3.

RMSEs of the Item Parameter Estimates for the Response Model

$ρ_{I}$	Par	N	Random		D-Optimality			A-Optimality
$ρ_{I}$	Par	N	IRM	JM	IRM	JM	JM_S	IRM	JM	JM_S
.0	$\hat{a}$	300	.169	.166	.168	.167	.177	.161	.177	.171
		500	.145	.149	.141	.143	.152	.143	.146	.144
		800	.122	.127	.120	.124	.128	.125	.127	.119
	$\hat{b}$	300	.152	.153	.154	.149	.150	.141	.132	.133
		500	.129	.130	.125	.121	.122	.118	.110	.111
		800	.110	.109	.109	.103	.102	.097	.095	.095
	$\hat{c}$	300	.029	.030	.029	.029	.028	.028	.028	.028
		500	.028	.029	.028	.028	.027	.028	.027	.027
		800	.027	.028	.027	.027	.026	.027	.026	.026
.4	$\hat{a}$	300	.168	.162	.170	.163	.161	.169	.162	.163
		500	.144	.142	.144	.142	.142	.145	.138	.143
		800	.122	.123	.126	.126	.119	.124	.124	.123
	$\hat{b}$	300	.144	.141	.159	.144	.145	.137	.133	.135
		500	.119	.118	.127	.118	.118	.111	.105	.108
		800	.096	.096	.107	.095	.099	.090	.086	.091
	$\hat{c}$	300	.026	.025	.026	.024	.024	.025	.024	.024
		500	.025	.024	.025	.023	.023	.024	.023	.023
		800	.024	.023	.024	.023	.022	.023	.022	.023
.8	$\hat{a}$	300	.142	.117	.141	.120	.121	.142	.126	.123
		500	.130	.107	.129	.106	.111	.124	.113	.111
		800	.114	.097	.110	.097	.099	.108	.102	.101
	$\hat{b}$	300	.146	.137	.153	.142	.138	.130	.121	.124
		500	.119	.112	.123	.109	.114	.103	.094	.102
		800	.099	.092	.099	.086	.091	.084	.080	.084
	$\hat{c}$	300	.019	.015	.019	.015	.016	.018	.016	.016
		500	.018	.015	.018	.015	.015	.017	.016	.016
		800	.017	.015	.018	.015	.015	.016	.015	.016

Note. $ρ_{I}$ = correlation among the item parameters; Par = parameter estimate; N = calibration sample size; IRM = item response model calibration; JM = joint model calibration; JM_S = joint model calibration using samples optimized for $(a, b, c)$; RMSE = root mean squared error.

The other research question posed in this study was the performance of sampling design given a specific focus of online calibration. For evaluation, the study considered two situations: when the primary interest of online calibration is (1) in estimating all five item parameters accurately and (2) in estimating the response model parameters only. For the former, three sampling methods (random, D-, and A-optimality) were evaluated within the joint model. For the latter, all eight sampling methods were evaluated under both the response and joint models to gauge the benefit of utilizing response times in online calibration. In each scenario, multiple factors (e.g., item parameter type, $ρ_{I}$ , N) were scrutinized to determine the optimal strategy.

The results from Tables 3 and 4 suggest that, when all the item parameters are targeted for estimation and the item parameters are weakly or moderately correlated, joint calibration with A-optimal sampling leads to the most stable parameter recovery. A-optimality was found especially effective in minimizing the RMSE of $\hat{c}$ when $ρ_{I}$ equaled .0 or .4. When the item parameters were strongly correlated (i.e., $ρ_{I} = .8$), D-optimality was in general most preferred, achieving the smallest RMSEs for $\hat{a}$, $\hat{c}$, and $\hat{α}$. It should be added that, under $ρ_{I} = .8$, although A-optimality yielded the smallest RMSEs in $\hat{b}$ and $\hat{β}$, its overall performance was comparable to that of D-optimality. Because the D-optimal designs clearly outperformed the alternatives in the recovery of a, c, and α, D-optimality was selected as the most preferred sampling approach under $ρ_{I} = .8$.

Table 4.

RMSEs of the Item Parameter Estimates for the Response Time Model

$ρ_{I}$	N	$\hat{α}$					$\hat{β}$
$ρ_{I}$	N	Ran	D	$D_{S}$	A	$A_{S}$	Ran	D	$D_{S}$	A	$A_{S}$
.0	300	.044	.045	.043	.043	.044	.061	.063	.063	.061	.063
	500	.034	.034	.034	.033	.033	.049	.048	.047	.046	.046
	800	.027	.026	.026	.026	.025	.037	.038	.038	.037	.037
.4	300	.044	.043	.043	.042	.044	.064	.064	.063	.062	.064
	500	.033	.033	.033	.034	.033	.050	.046	.047	.048	.048
	800	.027	.025	.026	.027	.027	.038	.038	.037	.036	.038
.8	300	.043	.040	.041	.041	.040	.062	.064	.061	.057	.060
	500	.033	.031	.032	.033	.034	.049	.047	.045	.043	.045
	800	.025	.024	.025	.026	.026	.039	.038	.036	.035	.035

Note. $ρ_{I}$ = correlation among the item parameters; N = calibration sample size; Ran = random sampling; RMSE = root mean squared error.

To decide on an optimal sampling scheme in the second scenario, the same factors were weighed again. The results from Table 3 suggest that when online calibration was aimed at the response model parameters and the parameters had zero correlation, joint calibration with $A_{S}$-optimal sampling led to the most stable parameter recovery, minimizing the RMSEs for the targeted parameters. When the item parameters had nonzero correlation, joint calibration with $D_{S}$- or $A_{S}$-optimal sampling appeared to yield the smallest RMSEs. Specifically, $D_{S}$-optimality was preferred for minimizing the RMSEs of $\hat{a}$ and $\hat{c}$; $A_{S}$-optimality was favored for its smallest RMSEs in $\hat{b}$. All in all, however, joint calibration with $D_{S}$-optimal sampling appeared most advantageous when the item parameters were moderately correlated, and joint calibration with D-optimal sampling appeared most favorable when the parameters were strongly correlated.

Notice that the above evaluation consistently favored the joint calibration over the response model calibration even though the online calibration was aimed at the response model parameters. The gain in estimation precision achieved by the joint calibration was greater when the items were calibrated using the optimal samples or when the item parameters were more strongly correlated. When the item parameters were weakly related, centering the sampling design on the targeted parameter set seemed to minimize the calibration error for the intended item parameters. When the item parameters were strongly correlated, it was more advantageous to optimize the sampling for all the item parameters rather than for the subset of the parameters.

6.5. Correlation

Tables 5 and 6 present average correlations between the estimated and generating item parameters for the response and response time models, respectively. We note that, although the results were summarized using the average of the statistics, the correlations tended to display left-skewed distributions.³ We thus advise prudence in interpreting the results, as the averages may not represent typical outcomes due to the presence of extreme values. For a more detailed presentation of the results, see the online Appendix C.

Table 5.

Correlation Between the Estimated and True Parameters of the Response Model

$ρ_{I}$	Par	N	Random		D-Optimality			A-Optimality
$ρ_{I}$	Par	N	IRM	JM	IRM	JM	JM_S	IRM	JM	JM_S
.0	$\hat{a}$	300	.785	.795	.774	.776	.774	.804	.772	.789
		500	.844	.836	.848	.844	.838	.862	.851	.853
		800	.894	.882	.895	.882	.889	.898	.884	.893
	$\hat{b}$	300	.988	.987	.987	.986	.985	.984	.985	.983
		500	.991	.991	.992	.991	.990	.989	.989	.988
		800	.993	.994	.994	.993	.993	.993	.992	.991
	$\hat{c}$	300	.306	.302	.370	.373	.374	.297	.386	.372
		500	.374	.391	.409	.417	.469	.365	.421	.461
		800	.445	.458	.458	.490	.513	.426	.489	.534
.4	$\hat{a}$	300	.752	.784	.778	.782	.778	.733	.775	.771
		500	.818	.836	.841	.845	.842	.804	.832	.823
		800	.872	.882	.886	.885	.887	.866	.868	.873
	$\hat{b}$	300	.988	.989	.987	.986	.987	.983	.984	.982
		500	.991	.992	.991	.991	.992	.989	.990	.988
		800	.994	.995	.994	.994	.993	.993	.993	.992
	$\hat{c}$	300	.420	.467	.447	.507	.532	.381	.502	.435
		500	.480	.505	.491	.549	.576	.445	.566	.487
		800	.541	.538	.559	.586	.606	.515	.592	.539
.8	$\hat{a}$	300	.850	.901	.836	.896	.889	.809	.840	.851
		500	.875	.921	.866	.920	.911	.858	.877	.883
		800	.900	.935	.905	.936	.930	.897	.893	.907
	$\hat{b}$	300	.987	.990	.987	.989	.987	.982	.985	.984
		500	.992	.993	.992	.993	.991	.989	.990	.990
		800	.994	.996	.994	.996	.995	.993	.993	.993
	$\hat{c}$	300	.769	.851	.763	.870	.831	.714	.775	.758
		500	.784	.855	.777	.874	.839	.748	.779	.774
		800	.796	.861	.797	.882	.847	.770	.788	.787

Note. $ρ_{I}$ = correlation among the item parameters; Par = parameter estimate; N = calibration sample size; IRM = item response model calibration; JM = joint model calibration; JM_S = joint model calibration using samples optimized for $(a, b, c)$.

Table 6.

Correlation Between the Estimated and True Parameters of the Response Time Model

$ρ_{I}$	N	$\hat{α}$					$\hat{β}$
$ρ_{I}$	N	Ran	D	$D_{S}$	A	$A_{S}$	Ran	D	$D_{S}$	A	$A_{S}$
.0	300	.987	.986	.988	.988	.985	.998	.998	.997	.998	.997
	500	.992	.991	.992	.992	.991	.998	.999	.998	.999	.999
	800	.995	.995	.995	.995	.995	.999	.999	.999	.999	.999
.4	300	.986	.986	.985	.986	.985	.997	.998	.997	.998	.997
	500	.992	.991	.991	.991	.991	.998	.999	.999	.999	.998
	800	.995	.995	.994	.994	.994	.999	.999	.999	.999	.999
.8	300	.987	.989	.988	.984	.984	.998	.998	.997	.997	.997
	500	.992	.994	.993	.989	.990	.998	.999	.999	.998	.998
	800	.995	.996	.995	.993	.994	.999	.999	.999	.999	.999

Note. $ρ_{I}$ = correlation among the item parameters; N = calibration sample size; Ran = random sampling.

The tables suggest that the estimated item parameter values overall had stable linear relationships with the generating parameters. The correlations remained reasonably large across the different situations and increased as N or $ρ_{I}$ increased. The results for the different calibration models showed a pattern similar to that of the RMSEs. In Table 5, calibrating items jointly within the response and response time models yielded 12.05% ($ρ_{I} = .0$), 12.81% ($ρ_{I} = .4$), and 7.70% ($ρ_{I} = .8$) higher correlations in $\hat{c}$ compared to the response model calibration. The impact of joint calibration on the correlations of $\hat{a}$ and $\hat{b}$ was marginal when the item parameters were uncorrelated and became more conspicuous as $ρ_{I}$ increased to .4 and .8.

The preference for the sampling method varied depending on the objective of online calibration and the correlation among the item parameters. The results from the tables suggest that, when all the item parameters are targeted for estimation and the parameters are uncorrelated, joint calibration with A-optimal sampling leads to the most stable performance. When the item parameters had nonzero correlation, calibrating items jointly based on the D-optimal samples was most preferred. It is pertinent to note that, in the above evaluations, random sampling tended to yield the highest correlations in $\hat{b}$. The optimal sampling designs were preferred because they clearly outperformed random sampling in the recovery of the other parameters while performing comparably to it in the recovery of b.

When the focus of online calibration was on the response model parameters and the parameters were uncorrelated, joint calibration based on optimal samples showed little advantage in improving the correlation between the true and estimated parameter values. When the item parameters were correlated to some extent, however, joint calibration tended to yield higher correlation outcomes. When $ρ_{I} = .4$ , joint calibration with $D_{S}$ -optimal sampling overall led to the highest correlations. When $ρ_{I} = .8$ , joint calibration with D-optimal sampling was consistently preferred for maximizing the correlation between the estimated and true parameters.

6.6. Standard Error

Tables 7 and 8 report standard errors of the estimated item parameters for the response and response time models, respectively. Again, it is to be noted that, although the results were summarized using the average of the statistics, the observed standard errors tended to show right-skewed distributions.⁴ We therefore advise caution when interpreting and comparing the results. See the online Appendix C for a more detailed presentation of the results. The results from the tables suggest that online calibration performed reasonably well, resulting in small standard errors. The two modeling schemes showed no systematic differences in the standard errors. On the whole, the most influential factor seemed to be the calibration sample size. As N increased, the standard errors became smaller, indicating more stable estimation of the parameters.

Table 7.

Standard Errors of the Item Parameter Estimates for the Response Model

$ρ_{I}$	Par	N	Random		D-Optimality			A-Optimality
$ρ_{I}$	Par	N	IRM	JM	IRM	JM	JM_S	IRM	JM	JM_S
.0	$\hat{a}$	300	.278	.281	.279	.277	.278	.275	.272	.267
		500	.216	.219	.218	.216	.216	.214	.211	.208
		800	.172	.174	.172	.171	.171	.170	.168	.165
	$\hat{b}$	300	.320	.318	.313	.303	.291	.266	.264	.262
		500	.250	.251	.249	.240	.228	.207	.206	.204
		800	.200	.201	.198	.191	.181	.166	.165	.162
	$\hat{c}$	300	.931	.899	.879	.844	.784	.689	.686	.691
		500	.722	.704	.691	.660	.613	.532	.532	.531
		800	.569	.560	.547	.523	.484	.421	.423	.421
.4	$\hat{a}$	300	.276	.277	.281	.275	.278	.269	.269	.270
		500	.215	.216	.219	.214	.216	.209	.208	.210
		800	.171	.171	.173	.169	.171	.166	.165	.167
	$\hat{b}$	300	.334	.334	.327	.325	.310	.268	.270	.267
		500	.263	.263	.257	.256	.241	.209	.208	.209
		800	.212	.210	.204	.204	.193	.165	.165	.169
	$\hat{c}$	300	.962	.968	.929	.925	.863	.720	.723	.712
		500	.759	.759	.727	.727	.668	.556	.556	.554
		800	.610	.606	.570	.581	.535	.436	.441	.443
.8	$\hat{a}$	300	.281	.283	.283	.282	.280	.270	.266	.266
		500	.217	.221	.219	.219	.217	.208	.206	.206
		800	.171	.175	.174	.173	.171	.164	.163	.163
	$\hat{b}$	300	.366	.394	.368	.389	.357	.283	.279	.289
		500	.287	.305	.286	.304	.274	.217	.216	.222
		800	.226	.249	.229	.241	.218	.171	.171	.178
	$\hat{c}$	300	1.084	1.185	1.090	1.157	1.032	.772	.763	.798
		500	.849	.919	.843	.900	.791	.588	.589	.612
		800	.672	.754	.674	.713	.626	.464	.466	.487

Note. The item parameter estimates were on the transformed scale. $ρ_{I}$ = correlation among the item parameters; Par = parameter estimate; N = calibration sample size; IRM = item response model calibration; JM = joint model calibration; JM_S = joint model calibration using samples optimized for $(a, b, c)$ .

Table 8.

Standard Errors of the Item Parameter Estimates for the Response Time Model

$ρ_{I}$	N	$\hat{α}$					$\hat{β}$
$ρ_{I}$	N	Ran	D	$D_{S}$	A	$A_{S}$	Ran	D	$D_{S}$	A	$A_{S}$
.0	300	.041	.040	.041	.040	.041	.062	.063	.062	.063	.062
	500	.031	.031	.031	.031	.031	.048	.049	.048	.049	.048
	800	.025	.025	.025	.025	.025	.038	.039	.038	.039	.038
.4	300	.040	.040	.041	.041	.041	.063	.063	.061	.061	.061
	500	.031	.031	.032	.032	.031	.049	.049	.047	.048	.047
	800	.025	.025	.025	.025	.025	.039	.039	.038	.038	.038
.8	300	.040	.040	.041	.041	.041	.063	.063	.062	.059	.060
	500	.031	.031	.031	.032	.032	.049	.049	.048	.046	.046
	800	.025	.025	.025	.025	.025	.039	.039	.038	.036	.037

Note. $ρ_{I}$ = correlation among the item parameters; N = calibration sample size; Ran = random sampling.

Table 7 reveals some distinct patterns relating to the sampling design. Among the three sampling methods, the A-optimal procedures consistently produced the smallest standard errors. For example, when $ρ_{I}$ was set to .0 within the joint calibration, A-optimal sampling yielded 3.52% ($\hat{a}$), 17.81% ($\hat{b}$), and 24.12% ($\hat{c}$) smaller standard errors compared to random sampling and 2.06% ($\hat{a}$), 13.59% ($\hat{b}$), and 19.04% ($\hat{c}$) smaller errors compared to D-optimal sampling. The proportional reduction in the standard errors increased further as $ρ_{I}$ increased. This tendency is in line with expectation because A-optimality is designed to minimize the average variance of the parameter estimates. The present results have useful implications for operational online calibration, especially when the primary goal is to minimize the calibration sample given a threshold on the standard error. That is, when the standard error is used as a criterion for terminating adaptive online calibration, A-optimality can be an effective sampling strategy in that it would require relatively small calibration samples to reach the desired level of standard error.
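The link between the optimality criteria and the standard errors can be made concrete with a small numerical example. The sketch below uses the common textbook forms of the criteria (the determinant of the information matrix for D-optimality and the trace of its inverse for A-optimality), which may differ in detail from the definitions given in Section 4. The $2 \times 2$ information matrix and the function names are hypothetical.

```python
import math

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def d_criterion(info):
    """Determinant of the information matrix (to be maximized)."""
    return info[0][0] * info[1][1] - info[0][1] * info[1][0]

def a_criterion(info):
    """Trace of the inverse information matrix (to be minimized)."""
    inv = inverse_2x2(info)
    return inv[0][0] + inv[1][1]

def standard_errors(info):
    """Square roots of the diagonal of the inverse information matrix."""
    inv = inverse_2x2(info)
    return [math.sqrt(inv[k][k]) for k in range(2)]

info = [[4.0, 1.0], [1.0, 2.0]]  # hypothetical information matrix
print(d_criterion(info), a_criterion(info), standard_errors(info))
```

Because the A-criterion is the sum of the squared standard errors, a design that minimizes it targets the average variance directly, which is consistent with the pattern in Table 7.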

7. Discussion

The nature of speededness in operational tests and free access to response time data in computerized tests have inspired much research on response times in educational and psychological assessments. In this study, we investigated online calibration strategies for the joint model of responses and response times. The study presented likelihood-based estimators for the parameters of the joint model and evaluated their performance in conjunction with the optimal sampling procedures. Extensive simulation experiments suggested that the studied online calibration methods can adequately recover the true item parameters using small samples (N = 500 ∼ 800). Across the evaluated simulation conditions, the estimated parameter values showed small biases and RMSEs, and high correlations with the generating parameters. As expected, increasing N or $ρ_{I}$ consistently reduced estimation errors.

One of the questions addressed in this study was whether, and if so by how much, calibrating items within the joint framework can help improve the estimation of the response model parameters. The results from the current study suggest that joint calibration can lead to more accurate parameter recovery if the item parameters are correlated between the two measurement models. The gain in estimation accuracy was particularly conspicuous in the estimation of the c- and a-parameters and tended to grow as $ρ_{I}$ increased or N decreased. The findings from this study can be linked to prior studies that examined the impact of using response times under different values of $ρ_{P}$ (e.g., Ranger, 2013; van der Linden et al., 2010). In particular, Ranger (2013) noted that the information about θ that can be gained from response times is appreciable when θ and τ are highly correlated (e.g., above .5). The results from the current study support a similar conclusion for item calibration: the more strongly the item parameters are correlated, the greater the gain in calibration precision.

The other aspect investigated in this study was the performance of the sampling methods given a particular objective of online calibration. A number of sampling procedures were examined to investigate this question: (1) random, D-optimal, and A-optimal sampling under the response model calibration and (2) random, D- and $D_{S}$-optimal sampling, and A- and $A_{S}$-optimal sampling under the joint model calibration. The results from the simulation study suggest that the choice of sampling design should take into account the characteristics of the field-testing items. Overall, when all the item parameters were of interest and were weakly correlated, joint calibration with A-optimal sampling led to the most stable and accurate parameter recovery. When the item parameters were strongly correlated, calibrating items jointly based on D-optimal samples led to the most accurate parameter recovery.

When the primary focus of online calibration was on a subset of item parameters, centering the sampling design on the targeted parameters within the joint framework was found to be the most effective approach. The current study in particular considered a situation where online calibration was aimed at the response model parameters. The results from the simulation study suggest that when the item parameters are unrelated (i.e., $ρ_{I} = .0$), joint calibration with $A_{S}$-optimal sampling would be most desirable for accurately estimating the targeted parameters. When the item parameters were moderately correlated (i.e., $ρ_{I} = .4$), joint calibration with $D_{S}$-optimal sampling demonstrated the best recovery of the parameters in general. It may be surprising that joint calibration demonstrated better parameter recovery even when the item parameters were unrelated. Our simulation study suggests that when the item parameters are not correlated, joint calibration has minimal impact on the estimation of the a- and b-parameters, and yet it can help improve the estimation of the c-parameters.

It is important to stress that, if the item parameters are strongly correlated, the effect of centering tends to wither, and it is preferable to use samples that are optimized for all item parameters even when online calibration is aimed at a subset of item parameters. Throughout, we found that when the item parameters are strongly correlated, calibrating items jointly with D-optimal samples tends to produce the most accurate parameter recovery regardless of whether the online calibration is aimed at the entire set or a subset of the item parameters. This tendency may be partly explained by the fact that D-optimality utilizes information from both the respective and joint relations of the item parameters by maximizing the determinant of the item information matrix. A-optimality, on the other hand, uses information only on the individual item parameters by taking the trace of the inverse information matrix.
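The contrast between the two criteria can be illustrated with a small numerical sketch. The code below is not the study's implementation; it uses two hypothetical 2 × 2 covariance matrices of the estimates (the inverses of the information matrices) that share the same variances but differ in their covariance. The function names and the matrices are illustrative assumptions, and the subset versions follow the standard textbook forms of $D_{S}$- and $A_{S}$-optimality, which are assumed here to match the ones used in the study.

```python
import numpy as np

def d_criterion(info):
    """D-criterion: determinant of the information matrix (larger is better)."""
    return np.linalg.det(info)

def a_criterion(info):
    """A-criterion: trace of the inverse information matrix (smaller is better)."""
    return np.trace(np.linalg.inv(info))

def subset_criteria(info, idx):
    """Textbook subset versions, computed on the block of the inverse
    information matrix that belongs to the targeted parameters `idx`."""
    cov_s = np.linalg.inv(info)[np.ix_(idx, idx)]
    d_s = 1.0 / np.linalg.det(cov_s)   # D_S: larger is better
    a_s = np.trace(cov_s)              # A_S: smaller is better
    return d_s, a_s

# Hypothetical covariance matrices of the estimates: same variances,
# different covariance between the two parameters.
cov_uncorr = np.array([[1.0, 0.0], [0.0, 1.0]])
cov_corr = np.array([[1.0, 0.8], [0.8, 1.0]])
info_uncorr = np.linalg.inv(cov_uncorr)
info_corr = np.linalg.inv(cov_corr)

# A sums the variances only, so it cannot tell the two designs apart;
# D also reflects the joint relation through the off-diagonal term.
a_values = (a_criterion(info_uncorr), a_criterion(info_corr))
d_values = (d_criterion(info_uncorr), d_criterion(info_corr))
```

In this toy case the A-criterion is identical for both matrices, whereas the D-criterion separates them, which is consistent with the explanation offered above for the advantage of D-optimal sampling under strongly correlated parameters.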

As final remarks, we note that the current study assumed known hyperparameters when estimating the item parameters. A future study may extend this setting by treating the hyperparameters as free parameters. It is also worth mentioning that the present study implemented online calibration under the assumption that the pilot items adequately fit the joint model. As indicated by several studies (e.g., Kang, 2017; Klein Entink, van der Linden, & Fox, 2009; Ranger & Kuhn, 2012; Wang, Fan, Chang, & Douglas, 2013), the shape of response time distributions can vary to a large extent across items even within a single test. With this in view, we recommend taking a tryout step prior to online calibration to ensure that probationary items adequately fit the presumed measurement models. This can be achieved in practice by fitting the model with small random samples before the online calibration. Relatedly, it appears promising to study sequential monitoring procedures that evaluate the quality of items over the course of operational assessment. In practice, some items may become compromised when they are used repeatedly. A statistical quality control procedure based on online calibration techniques can be useful in this event, as it can identify items as soon as they become compromised or begin to function differently.

Supplemental Material

Supplemental Material, JEBS879080_Appendix_C for Online Calibration of a Joint Model of Item Responses and Response Times in Computerized Adaptive Testing by Hyeon-Ah Kang, Yi Zheng and Hua-Hua Chang in Journal of Educational and Behavioral Statistics

Appendix A. MML Estimation with EM Algorithm

The EM algorithm adopts an iterative procedure to find maximum likelihood estimates of parameters in the presence of unobserved latent variables. The E-step of the algorithm evaluates the expectation of the complete-data log-likelihood using provisional estimates of the structural parameters. In the context of the joint model, the parameters of an item are considered structural parameters, and individuals’ trait parameters are considered incidental parameters. Suppose we are interested in estimating the parameters of a field-tested item j. For notational simplicity, we suppress the subscript j in the following equations and assume that inference is being made on the parameters of item j, $ξ_{j} = ξ$. Let $ξ^{(t)}$ denote the vector of the parameter values for item j estimated at the tth cycle of the EM algorithm. The expectation of the complete-data log-likelihood is evaluated as

where i ($i = 1, \dots, N$) denotes the index of examinees to whom the field-testing item was assigned; $U = \{\mathbf{u}_{i} = (u_{i}, \mathbf{u}_{i}^{op}) : i = 1, \dots, N\}$ is the response matrix of the examinees for the field-testing item (u) and the operational items ($\mathbf{u}^{op}$); and $T = \{\mathbf{t}_{i} = (t_{i}, \mathbf{t}_{i}^{op}) : i = 1, \dots, N\}$ is the corresponding response time matrix. Note that the posterior probability distribution of the trait parameters in (A.1) takes account of the responses to both the field-testing item and the operational items. This places the estimated item parameter values on the same scale as the operational item parameters (Ban et al., 2001).

The study evaluates the integrals in (A.1) using Gauss-Hermite quadrature. Let $X_{k}$ ($k = 1, \dots, Q$) and $Y_{l}$ ($l = 1, \dots, Q$) denote the midpoints of the $Q^{2}$ cuboids on the θ- and τ-scales, and let $A (X_{k}, Y_{l})$ denote the weight function representing the height of the density. The posterior probability that the ith examinee’s trait parameters equal $(X_{k}, Y_{l})$ is obtained as

where $Ξ_{o p}$ is the matrix of the operational item parameters, and $L (X_{k}, Y_{l})$ is the likelihood of observing $(u_{i}, t_{i})$ at the quadrature node $(X_{k}, Y_{l})$ :

For evaluating (A.1), some expected values are needed to remove dependence on the unobservable variables. In this study, we obtain expected values associated with the incidental parameters as follows:

These quantities are called artificial data because they are created artificially during the estimation. The values of the artificial data still depend on the unknown item parameters (e.g., $P (X_{k})$ and $f (t_{i} | Y_{l})$ ), and hence, an iterative procedure must be followed to obtain better estimates after every iteration. This process is done in the M-step with the Newton-Raphson method or Fisher scoring based on a Taylor series. The current study employs Fisher scoring to ensure convergence in situations where the initial estimate is not in the close vicinity of the true maximizer.
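The E-step described above can be sketched in code as follows. This is a minimal illustration under several assumptions that are not taken from the appendix: a 3PL response model, a lognormal response time model, independent standard normal priors for θ and τ, and a posterior that conditions on the field-testing item only (the study also conditions on the operational items and allows the two traits to be correlated). The function names and the quantities `n_k` and `r_k`, which mimic Bock-Aitkin style artificial data, are hypothetical.

```python
import numpy as np

def gh_standard_normal(q):
    """Gauss-Hermite nodes and weights rescaled to integrate against N(0, 1)."""
    x, w = np.polynomial.hermite.hermgauss(q)  # rule for the weight exp(-x^2)
    return np.sqrt(2.0) * x, w / np.sqrt(np.pi)

def p_3pl(theta, a, b, c):
    """Assumed 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def lognormal_pdf(t, tau, alpha, beta):
    """Assumed lognormal RT density: log t ~ N(beta - tau, alpha^-2)."""
    z = alpha * (np.log(t) - (beta - tau))
    return alpha / (t * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * z ** 2)

def e_step(u, t, item, q=15):
    """Posterior weights on the Q x Q grid and expected counts per theta node."""
    a, b, c, alpha, beta = item
    X, wx = gh_standard_normal(q)   # theta nodes and weights
    Y, wy = gh_standard_normal(q)   # tau nodes and weights
    P = p_3pl(X, a, b, c)
    like_u = np.where(u[:, None] == 1, P[None, :], 1.0 - P[None, :])  # (N, Q)
    like_t = lognormal_pdf(t[:, None], Y[None, :], alpha, beta)       # (N, Q)
    prior = wx[:, None] * wy[None, :]                                 # (Q, Q)
    joint = like_u[:, :, None] * like_t[:, None, :] * prior[None, :, :]
    post = joint / joint.sum(axis=(1, 2), keepdims=True)              # (N, Q, Q)
    n_k = post.sum(axis=(0, 2))                       # expected examinees at X_k
    r_k = (u[:, None, None] * post).sum(axis=(0, 2))  # expected correct at X_k
    return post, n_k, r_k

# Hypothetical data: four examinees, responses and response times in seconds.
u = np.array([1, 0, 1, 1])
t = np.array([30.0, 55.0, 42.0, 20.0])
post, n_k, r_k = e_step(u, t, item=(1.2, 0.0, 0.2, 1.5, 3.7))
```

Each examinee's posterior weights sum to one over the grid, so the expected counts add up to the number of examinees and the number of correct responses, respectively.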

In the M-step, a better approximation to the true parameter is obtained as

where $H^{(t)}$ is the $5 \times 5$ Hessian matrix evaluated at $ξ^{(t)}$, and $Λ^{(t)}$ is the five-dimensional vector with the gradients of the marginal log-likelihood function with respect to $ξ^{(t)}$. The elements of $Λ^{(t)}$, $l_{m}$ ($m = 1, \dots, 5$), are obtained as follows:

where $P_{k} = P (X_{k})$ and $Q_{k} = 1 - P (X_{k})$. Equating these equations simultaneously to zero yields item parameter estimates that maximize the marginalized log-likelihood in (A.1). Analogously, the expected values of the elements of $H^{(t)} = \{l_{m m^{'}} : m, m^{'} = 1, \dots, 5\}$ are obtained as

The iterative algorithm continues until changes in the successive approximations become sufficiently small. The present study uses .001 as a stopping criterion for both the EM cycles and Fisher scoring iterations.
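The following sketch shows the general shape of a Fisher scoring loop with a .001 stopping rule. It is deliberately simplified and is not the appendix's five-parameter M-step: it fits a two-parameter logistic item in slope-intercept form with abilities treated as known, so the gradient and expected information have simple closed forms. The function name, the starting values, and the simulated data are illustrative assumptions. The update is written with the expected information (the negative expected Hessian), which is equivalent to subtracting the inverse expected Hessian times the gradient.

```python
import numpy as np

def fisher_scoring_2pl(theta, u, start=(1.0, 0.0), tol=1e-3, max_iter=50):
    """Fisher scoring for logit P = a * theta + d with known theta."""
    xi = np.asarray(start, dtype=float)            # (a, d)
    design = np.column_stack([theta, np.ones_like(theta)])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ xi)))
        grad = design.T @ (u - p)                  # score vector
        info = design.T @ (design * (p * (1.0 - p))[:, None])  # expected information
        step = np.linalg.solve(info, grad)
        xi = xi + step
        if np.max(np.abs(step)) < tol:             # stop when the change is below .001
            break
    return xi

# Simulated data under hypothetical generating values a = 1.2, d = -0.5.
rng = np.random.default_rng(0)
theta = rng.normal(size=5000)
true_xi = np.array([1.2, -0.5])
prob = 1.0 / (1.0 + np.exp(-(true_xi[0] * theta + true_xi[1])))
u = (rng.random(5000) < prob).astype(float)
est = fisher_scoring_2pl(theta, u)
```

In the full algorithm the same kind of loop would sit inside each EM cycle, with the artificial data from the E-step replacing the observed quantities used here.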

Appendix B. MMAP Estimation with EM Algorithm

The EM algorithm in the MMAP estimation is implemented analogously to the MML estimation. Several modifications are, however, needed to account for the transformation of the item parameters and to incorporate the prior density. Let $ξ^{*(t)}$ denote the tth approximation to the true value of $ξ^{*}$ that maximizes p in Equation 8. A better approximation, ${\hat{ξ}}^{*(t + 1)}$, to the true parameter value is obtained as

where $H^{*(t)}$ is the $5 \times 5$ Hessian matrix evaluated at $ξ^{*(t)}$, and $Λ^{*(t)}$ is the vector of first-order derivatives of the posterior distribution with respect to $ξ^{*(t)}$. Based on the expected values from the E-step, the elements of $Λ^{*(t)}$, $l_{m}^{*}$ ($m = 1, \dots, 5$), are calculated as follows:

Let $υ_{m m^{'}}$ denote the $(m, m^{'})$-th entry of the matrix $Σ_{I}^{*-1}$. The expected values of the elements of $H^{*(t)} = \{l_{m m^{'}}^{*} : m, m^{'} = 1, \dots, 5\}$ are obtained as follows:

otherwise, $E [l_{m m^{'}}^{*}] = - υ_{m m^{'}}$ . Again, the iterative procedure is repeated until changes in the parameter estimates between the successive cycles become sufficiently small.
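The way a normal prior enters the update can be sketched as follows. The code assumes a multivariate normal prior on the transformed item parameters with a hypothetical mean vector `prior_mean` and covariance `prior_cov`; the function name and the two-parameter toy log-likelihood are illustrative and are not the appendix's expressions. The prior subtracts the precision-weighted deviation from the gradient and adds the precision matrix to the expected information, which matches the $- υ_{m m^{'}}$ contribution to the expected Hessian noted above.

```python
import numpy as np

def map_step(xi, grad_loglik, info_loglik, prior_mean, prior_cov):
    """One scoring step on the log-posterior under an assumed normal prior.

    info_loglik is the expected information of the likelihood part,
    i.e., the negative of its expected Hessian.
    """
    precision = np.linalg.inv(prior_cov)
    grad = grad_loglik - precision @ (xi - prior_mean)  # prior shrinks toward its mean
    info = info_loglik + precision                      # prior adds its precision
    return xi + np.linalg.solve(info, grad)

# Toy check with a quadratic log-likelihood, for which the posterior mode
# is available in closed form and one step reaches it exactly.
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # hypothetical likelihood information
mle = np.array([1.0, -0.5])              # hypothetical likelihood maximizer
prior_mean = np.zeros(2)
prior_cov = np.array([[1.0, 0.5], [0.5, 1.0]])

xi0 = np.array([0.3, 0.3])
xi1 = map_step(xi0, A @ (mle - xi0), A, prior_mean, prior_cov)

precision = np.linalg.inv(prior_cov)
closed_form = np.linalg.solve(A + precision, A @ mle + precision @ prior_mean)
```

For a non-quadratic likelihood such as the joint model, the step would be repeated within each M-step until the change falls below the chosen tolerance.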

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This study received funding from Campus Research Board (grant ID RB15138).

ORCID iD

Hyeon-Ah Kang

References

Ban, J.-C., Hanson, B. A., Wang, & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38, 191–212.

Ban, J.-C., Hanson, B. A., & Harris, D. J. (2002). Data sparseness and on-line pretest item calibration-scaling methods in CAT. Journal of Educational Measurement, 39, 207–218.

Berger, M. P. F. (1991). On the efficiency of IRT models when applied to different sampling designs. Applied Psychological Measurement, 15, 293–306.

Berger, M. P. F. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521–538.

Berger, M. P. F. (1994). D-optimal sequential sampling designs for item response theory models. Journal of Educational Statistics, 19, 43–56.

Birnbaum (1968). Theories of mental test scores. In Lord, F. M., & Novick, M. R. (Eds.), Some latent trait models and their use in inferring an examinee’s ability (pp. 397–479). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: An application of an em-algorithm. Psychometrika, 46, 443–459.

Bolsinova, De Boeck, P. E., & Tijmstra (2018). Modelling conditional dependence between response time and accuracy. Psychometrika, 82, 1126–1148.

Buyske (2005). Optimal design in educational testing. In Berger, M. P. F., & Wong, W. K. (Eds.), Applied optimal designs (pp. 1–19). New York, NY: John Wiley.

Chang, Y.-c. I., & H.-Y. (2010). Online calibration via variable length computerized adaptive testing. Psychometrika, 75, 140–157.

Chen (2017). A comparative study of online item calibration methods in multidimensional computerized adaptive testing. Journal of Educational and Behavioral Statistics, 42, 559–590.

Chen & Wang (2016). A new online calibration method for multidimensional computerized adaptive testing. Psychometrika, 81, 674–701.

Chen, Wang, Xin, & Chang, H.-H. (2017). Developing new online calibration methods for multidimensional computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 70, 81–117.

Chen, Xin, Wang, & Chang, H.-H. (2012). Online calibration methods for the DINA model with independent attributes in CD-CAT. Psychometrika, 77, 201–222.

Chen, W.-H., & Thissen (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

Fan, Wang, Chang, H.-H., & Douglas (2012). Utilizing response time distributions for item selection in CAT. Journal of Educational and Behavioral Statistics, 37, 655–670.

Fox, J.-P., Klein Entink, R. H., & van der Linden, W. J. (2007). Modeling of responses and response times with the package CIRT. Journal of Statistical Software, 20, 1–14.

Fox, J.-P., & Marianti (2017). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54, 243–262.

Glas, C. A. W., & Suárez-Falcón, J. C. (2003). A comparison of item-fit statistics for the three parameter logistic model. Applied Psychological Measurement, 27, 87–106.

Glas, C. A. W., & van der Linden, W. J. (2010). Marginal likelihood inference for a model for item responses and response times. British Journal of Mathematical and Statistical Psychology, 63, 603–626.

Houts, C. R., & Edwards, M. C. (2013). The performance of local dependence measures with psychological data. Applied Psychological Measurement, 37, 541–562.

Jones, D. H., & Jin (1994). Optimal sequential designs for on-line item estimation. Psychometrika, 59, 59–75.

Kang, H.-A. (2016). Likelihood estimation for jointly analyzing item responses and response times (Unpublished doctoral dissertation). University of Illinois at Urbana-Champaign, Champaign, IL.

Kang, H.-A. (2017). Penalized partial likelihood inference of proportional hazards latent trait models. British Journal of Mathematical and Statistical Psychology, 70, 187–208.

Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21–48.

Klein Entink, R. H., Kuhn, J.-T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14, 54–75.

Klein Entink, R. H., van der Linden, W. J., & Fox, J.-P. (2009). A Box-Cox normal model for response times. British Journal of Mathematical and Statistical Psychology, 62, 621–640.

Liu & Maydeu-Olivares (2012). Local dependence diagnostics in IRT modeling of binary data. Educational Psychological Measurement, 73, 254–274.

Marianti, Fox, J.-P., Avetisyan, Veldkamp, B. P., & Tijmstra (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39, 426–451.

Mislevy, R. J., & Stocking, M. L. (1989). A consumers guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57–75.

Mulder & van der Linden, W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74, 273–296.

Ranger (2013). A note on the hierarchical model for responses and response times in tests of van der Linden (2007). Psychometrika, 78, 538–544.

Ranger & Kuhn, J.-T. (2012). A flexible latent trait model for response times in tests. Psychometrika, 77, 31–47.

Ren, van der Linden, W. J., & Diao (2017). Continuous online item calibration: Parameter recovery and item utilization. Psychometrika, 82, 498–522.

Segall, D. O. (2003, April). Calibrating CAT pools and online pretest items using MCMC methods. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Silvey, S. D. (1980). Optimal design. London, England: Chapman & Hall.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9, 1135–1151.

Stocking, M. L. (1988). Scale drift in on-line calibration (ETS Research Report Nos. 88–28). Princeton, NJ: ETS, 12.

van der Linden, W. J. (1988). Optimizing incomplete sampling designs for item response model parameters (Research Report No. 88-5). Enschede, the Netherlands: University of Twente.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308.

van der Linden, W. J. (2008). Using response times for item selection in adaptive testing. Journal of Educational and Behavioral Statistics, 33, 5–20.

van der Linden, W. J. (2009). Predictive control of speededness in adaptive testing. Applied Psychological Measurement, 33, 24–41.

van der Linden, W. J. (2011). Test design and speededness. Journal of Educational Measurement, 48, 44–60.

van der Linden, W. J., & Glas, C. A. W. (2010). Statistical tests of conditional independence between responses and/or response times on test items. Psychometrika, 75, 120–139.

van der Linden, W. J., & Guo (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384.

van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34, 327–347.

van der Linden, W. J., & Ren (2014). Optimal Bayesian adaptive design for test-item calibration. Psychometrika, 80, 263–288.

van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195–210.

van der Linden, W. J., & van Krimpen-Stoop, E. M. L. A. (2003). Using response times to detect aberrant response patterns in computerized adaptive testing. Psychometrika, 68, 251–265.

Wang, Fan, Chang, H.-H., & Douglas (2013). A semiparametric model for jointly analyzing response times and accuracy in computerized testing. Journal of Educational and Behavioral Statistics, 38, 381–417.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.

Zheng (2016). Online calibration of polytomous items under the generalized partial credit model. Applied Psychological Measurement, 40, 434–450.
