Abstract
Computer-based testing (CBT) is becoming increasingly popular in assessing test-takers’ latent abilities and making inferences regarding their cognitive processes. In addition to collecting item responses, an important benefit of using CBT is that response times (RTs) can also be recorded and used in subsequent analyses. To better understand the structural relations between multidimensional cognitive attributes and the working speed of test-takers, this research proposes a joint-modeling approach that integrates compensatory multidimensional latent traits and response speediness using item responses and RTs. The joint model is cast as a multilevel model in which working speed and accuracy are connected through their variance-covariance structures. The feasibility of this modeling approach is investigated via a Monte Carlo simulation study using a Bayesian estimation scheme. The results indicate that integrating RTs improved the accuracy and precision of model parameter recovery. In addition, Programme for International Student Assessment (PISA) 2015 mathematics standard unit items are analyzed to further evaluate the feasibility of the approach to recover model parameters.
Recently, the use of computers to administer tests has provided a platform not only for recording examinees’ responses to items but also for collecting response process data (RPD). Continuous RPD in the form of digital records such as response times (RTs) and eye-tracking characteristics are currently being used to capture problem-solving processes, strategies, and behaviors of test-takers (Man & Harring, 2019; Ercikan & Pellegrino, 2017). RTs, one of the critical types of RPD, have garnered considerable attention in improving current measurement practices because this type of data is thought to deliver a more comprehensive depiction of the performance and attributes of test-takers beyond what is available based on response accuracy (i.e., correct responses) alone (Bolsinova, De Boeck, & Tijmstra, 2017). RTs provide essential information regarding the amount of time spent by test-takers across assessment items, indicating, for instance, their level of engagement with the content. RTs have also been used to address several measurement issues and challenges. For example, RTs could be used to help select items that maximize Fisher’s information at the current estimate of the test-taker’s latent ability. This type of procedure may be most advantageous for computer adaptive tests, as the length of the test could ultimately be shortened (Meyer, 2010; van Rijn & Ali, 2017; Wise & Kong, 2005). In addition, in the context of test security, RTs have been used to help identify aberrant testing behaviors such as preknowledge cheating and lucky guessing, thereby helping to ensure test fairness (Bolt, Cohen, & Wollack, 2002; Man, Harring, Ouyang, & Thomas, 2018; Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014; Thissen, 1983; van der Linden, 2006b; van der Linden & Guo, 2008).
Much of the previous literature on integrating RTs and item responses has centered on the relation between a test-taker’s responding speed and responding accuracy within a unidimensional item response theory (IRT) modeling framework (e.g., Bolsinova et al., 2017; De Boeck, Chen, & Davison, 2017; Fox & Marianti, 2016; Meng, Tao, & Chang, 2015; Roskam, 1997; Thissen, 1983). Important among these methods is a two-level hierarchical framework for modeling item responses and RTs proposed by van der Linden (2006a). In this framework, RTs and item responses are modeled at the first level, whereas the dependencies between lognormal RT model parameters (van der Linden, 2006b) and IRT model parameters (i.e., two-parameter logistic [2-PL]) are specified at the second hierarchical level. To properly apply this model, an assumption is made that test-takers respond to items with constant speed. However, many researchers have challenged the veracity of this assumption by considering within-subject variation of item characteristics across RTs (e.g., Bolsinova et al., 2017; Fox & Marianti, 2016).
Some extensions based on this hierarchical joint modeling approach have been proposed (Bolsinova et al., 2017; Fox & Marianti, 2016; van Rijn & Ali, 2017). However, only a limited number of studies have examined the functional relation between RTs and the multidimensional cognitive constructs thought to drive item responses. Clearly, conventional unidimensional IRT models would be inadequate to capture the complexities inherent in these scenarios, whereas applying multidimensional IRT (MIRT) models would be theoretically defensible. In recent years, MIRT models have gained more prominence in educational and psychological testing due to the increasing needs of stakeholders to understand and interpret the multifaceted cognitive processes that test-takers use during the test period (e.g., Jiao, Kamata, Wang, & Jin, 2012; Reckase, 2009). For example, the Programme for International Student Assessment (PISA) assesses various content ability domains, including mathematics, reading literacy, and science. Within each domain, several cognitive processing constructs are measured, such as explaining phenomena and evaluating and interpreting data. In addition, an abundance of process data such as item RTs and action sequences are gathered as ancillary information. Yet, there currently exists no model that integrates process data like RTs into the MIRT framework for evaluating the relative dependency between test-takers’ responding speed and accuracy. It is this gap in the literature that the current study aims to fill.
Inspired by the work of van der Linden (2006a, 2006b) and acknowledging the importance of MIRT, this study proposes a joint modeling approach for multidimensional item responses and RTs that describes the relation between the speediness and accuracy of a person answering items in a multidimensional latent space. The proposed joint model extends the hierarchical modeling framework of van der Linden (2006a) to MIRT models. In this approach, a MIRT model and an RT model are specified separately at Level 1. The variance–covariance structures of the person and item parameters are jointly estimated at Level 2. A Bayesian estimation approach is used to investigate the proposed hierarchical model via a Monte Carlo simulation under a limited number of conditions thought to impact estimation accuracy.
To outline the focus of this manuscript, the hierarchical model for RTs and item responses within a MIRT framework is introduced first. In a subsequent section, a Bayesian approach to estimating the model via a Markov chain Monte Carlo (MCMC) algorithm is discussed, including the specification of prior distributions. A simulation study highlighting contexts where the proposed MIRT-RT model may be advantageous over a MIRT model is then outlined and its results are reported. Finally, an empirical example using data from the PISA 2015 mathematics test is provided to underscore the findings from the simulation, and results from the analyses are discussed.
Hierarchical Model Specification
Level 1: Measurement Model for Accuracy
Compensatory MIRT model
MIRT models describe the relation between item responses and two or more latent traits and are categorized as either being compensatory or noncompensatory (see, for example, Ackerman, 1989; Adams, Wilson, & Wang, 1997; Bolt & Lall, 2003; Fox, Entink, & Avetisyan, 2014; Wang & Nydick, 2015). Compensatory MIRT models are used when the latent traits compensate for each other when answering an item. In other words, high proficiency in one latent dimension is thought to compensate for low proficiency on other dimension(s). By contrast, in a noncompensatory MIRT model, a deficiency in one latent dimension cannot be offset by adequacy in other latent dimensions. As noncompensatory MIRT models are notoriously difficult to estimate (see, for example, Bolt & Lall, 2003; Wang & Nydick, 2015), in this study, the authors will focus on jointly modeling compensatory MIRT with RTs.
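The distinction between the two model families can be illustrated with a minimal sketch. It assumes a common slope-and-location parameterization (a weighted sum of abilities minus a location for the compensatory case, and a product of per-dimension logistic terms for the noncompensatory case); the function names and all numeric values are hypothetical and are not taken from the study.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def compensatory_prob(theta, slopes, location):
    """Compensatory 2-PL form: abilities enter through one weighted sum,
    so a surplus on one dimension can offset a deficit on another."""
    linear = sum(a * t for a, t in zip(slopes, theta)) - location
    return logistic(linear)


def noncompensatory_prob(theta, slopes, locations):
    """One common noncompensatory form: a product of per-dimension terms,
    so a deficit on any single dimension caps the overall probability."""
    prob = 1.0
    for a, t, b in zip(slopes, theta, locations):
        prob *= logistic(a * (t - b))
    return prob


# Hypothetical item with equal slopes on two dimensions.
slopes = [1.0, 1.0]

# Two test-takers with the same ability total but different profiles.
comp_balanced = compensatory_prob([0.0, 0.0], slopes, 0.0)
comp_uneven = compensatory_prob([2.0, -2.0], slopes, 0.0)

nonc_balanced = noncompensatory_prob([0.0, 0.0], slopes, [0.0, 0.0])
nonc_uneven = noncompensatory_prob([2.0, -2.0], slopes, [0.0, 0.0])
```

Under these assumed values, the compensatory probabilities are identical for the balanced and uneven profiles, whereas the noncompensatory probability drops for the uneven profile because the low second dimension is not offset by the high first dimension.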
At the first level of the modeling hierarchy, a 2-PL compensatory MIRT model (Reckase, 1985) is specified, which assumes that the probability of correctly answering an item is influenced by a weighted linear combination of abilities and is formulated as
where
Level 1: Measurement Model for Working Speed
RT modeling
The lognormal RT model proposed by van der Linden (2006b) is used to model RTs. Although many RT models have been proposed assuming a variety of distributions (see, for example, Roskam, 1997), this model assumes that the log-transformed RTs follow a normal distribution, allowing for joint modeling with item responses within a multivariate normal distributional framework. RTs could be modeled as being multidimensional as well. For example, if several items share a common stimulus, their RTs would be related to a certain degree, much the same way that items sharing a common reading passage would have a certain association. In this case, it would be necessary to measure an additional RT dimension. However, this study maintains a unidimensional perspective for now. A multidimensional RT model could be investigated as an extension in a future study.
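As a rough illustration of the lognormal assumption, the sketch below draws log-transformed RTs from a normal distribution whose mean is an item time intensity minus a person speed and whose standard deviation is the reciprocal of an item time discrimination, following the usual statement of van der Linden's model. The argument names and numeric values are hypothetical choices made for illustration only.

```python
import random


def simulate_log_rt(time_intensity, speed, time_discrimination, rng):
    """Draw one log response time: normal with mean (time intensity - speed)
    and standard deviation 1 / time discrimination."""
    mean = time_intensity - speed
    sd = 1.0 / time_discrimination
    return rng.gauss(mean, sd)


rng = random.Random(2019)  # fixed seed so the sketch is reproducible

# Hypothetical item: time intensity 4.0 on the log scale, time discrimination 2.0.
fast_person = [simulate_log_rt(4.0, 0.5, 2.0, rng) for _ in range(5000)]
slow_person = [simulate_log_rt(4.0, -0.5, 2.0, rng) for _ in range(5000)]

mean_fast = sum(fast_person) / len(fast_person)
mean_slow = sum(slow_person) / len(slow_person)
```

With these assumed values, the faster test-taker's average log RT sits near 3.5 and the slower test-taker's near 4.5, which shows how a higher working speed shifts the whole log RT distribution downward for the same item.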
The lognormal RT model is
where the latent parameter
Level 2: Modeling Person Parameters
The second-level model incorporates two correlational structures to account for the dependencies on both the item and person parameters, respectively. The relation between latent attributes,
with mean vector,
The parameters
Level 2: Modeling Item Parameters
To account for the item parameter dependency in this joint modeling approach, a multivariate normal distribution is defined for the item parameters,
where the mean vector and symmetric covariance matrix,
These moments are a restrictive version of the more general moment structures of parameter vector
respectively. There are several reasons for placing restrictions on these item parameters such that the only parameters to be estimated will be item location and time intensity. First, this type of reduction is often the convention used when estimating MIRT models within a Bayesian framework (see, for example, Bolt & Lall, 2003; Fox et al., 2014; Wang & Nydick, 2015). In those studies, the correlations between slope parameters of distinct ability dimensions were not specified. A compelling reason why this might be the case is that estimating the correlation of slope parameters provides neither significant information about item quality nor useful information for inferring test-takers’ abilities. Notably, the speed and accuracy trade-off is not addressed by the correlation of item slopes. Moreover, estimating the correlation between slope parameters could potentially lead to model overfitting. The estimation precision of person-side parameters would be reduced due to the lower degrees of freedom induced by needlessly estimating these correlations. Thus, item slopes are assumed not to be correlated. Furthermore, the correlation between the item slopes and item time discrimination is not considered either. Van der Linden (2006a) reported that the correlation between item discrimination and time discrimination was .04 in their real data set, which was not significant because 0 was a plausible value in the interior of the credible interval constructed from the correlation’s posterior density.
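The credible-interval check mentioned at the end of the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the posterior draws are simulated here from a normal distribution centered at .04 with a hypothetical spread of 0.10, and the equal-tailed percentile rule is one of several ways to construct a credible interval.

```python
import math
import random


def equal_tailed_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior draws, using
    conservative order statistics for the lower and upper bounds."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(math.floor(tail * (n - 1)))]
    upper = ordered[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lower, upper


rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical posterior draws of a correlation; the spread is an assumption.
draws = [rng.gauss(0.04, 0.10) for _ in range(4000)]

lower, upper = equal_tailed_interval(draws, level=0.95)
zero_is_plausible = lower < 0.0 < upper
```

When zero lies inside the interval, as it does for these simulated draws, the correlation would be treated as not significant in the sense used above.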
The hierarchical modeling framework proposed by van der Linden (2006a) has been extended in the current study to jointly model item responses within a MIRT model and RTs. This model is referred to as the MIRT-RT model throughout the remainder of the article. The RTs are used as ancillary information in the estimation of the MIRT model parameters. Figure 1 displays the graphical representation of the MIRT-RT model.

Figure 1. A hierarchical graphical representation of the MIRT-RT model.
Bayesian Estimation Using MCMC Sampling
An MCMC algorithm for a fully Bayesian specification of the model was used for model parameter estimation in Just Another Gibbs Sampler (JAGS; Plummer, 2015). All data for the simulation were generated in R (R Core Team, 2016). Two chains with a thinning interval of five were run for at least 15,000 total iterations, and parameter estimates and standard deviations were computed from the posterior densities using the final 2,000 iterations. Autojags(), an auto-updating JAGS function housed in the
Identification Constraints
Several constraints were set for identifying parameter scales and solving rotational indeterminacy issues related to estimation of MIRT models. To identify parameter scales, the population mean and variances of the latent attributes,
is applied for identifying different latent ability dimensions within the MCMC algorithm in which the first two items only load on the first dimension and the next two items only load on the second dimension. Following Bolt and Lall (2003), the remainder of the items load on both dimensions.
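The loading pattern described above can be written down as a small sketch. The function name and the 0/1 encoding (1 for a freely estimated slope, 0 for a slope fixed to zero) are illustrative assumptions rather than the authors' code.

```python
from typing import List


def loading_pattern(n_items: int) -> List[List[int]]:
    """Two-dimensional loading pattern: the first two items load only on
    dimension 1, the next two only on dimension 2, the rest on both."""
    if n_items < 4:
        raise ValueError("at least four items are needed to anchor both dimensions")
    pattern = []
    for item in range(n_items):
        if item < 2:
            pattern.append([1, 0])   # anchors the first dimension
        elif item < 4:
            pattern.append([0, 1])   # anchors the second dimension
        else:
            pattern.append([1, 1])   # free to load on both dimensions
    return pattern


# Pattern for the shorter of the two test lengths used in the simulation.
pattern_15 = loading_pattern(15)
```

Fixing a few slopes to zero in this way removes the rotational indeterminacy because each dimension is tied to items that can only reflect that dimension.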
Prior Distributions
The prior distribution of item parameters for the MIRT-RT model is assumed to be bivariate normal such that
The item slope parameters,
where
where
The population means and variances of the latent attributes,
where
The joint posterior probability for the proposed MIRT-RT model can be represented as
Simulation Study
In this study, simulated data were used for evaluating the recovery of both item and person parameters. Specifically, two levels of the number of examinees, (a) 500 and (b) 1,500, were simulated following previous methodological investigations (Bolt & Lall, 2003; Fox et al., 2014; Wang & Nydick, 2015). Two test lengths were considered: (a) 15 and (b) 30 items. Again, these values were chosen to align with previous MIRT research (Bolt & Lall, 2003; Wang & Nydick, 2015) and to conform to the number of items found in large-scale assessments. Response data were generated based on a two-dimensional structure following a MIRT model (see Equation 1) by applying the design matrix presented in Equation 6. RTs were generated based on the RT model given in Equation 2. Examinees’ latent attribute parameters
Table 1. Generated Item Parameters for the Compensatory Multidimensional Logistic Model and RT Model.
Note. MIRT = multidimensional item response theory; RT = response time.
For the 15-item condition, only the first 15 of the total of 30 items were used. Twenty-five replications were generated for each combination of test length, correlation, and the number of test-takers. There are in total
Table 2. Summary of Simulation Conditions.
Root mean square error (RMSE) was used to evaluate model parameter recovery. However, there are two types of model parameters to be recovered: item parameters and person parameters. RMSEs are calculated separately for item parameters,
where
Results
Table 3 shows the RMSE for the item and person parameter estimates of the joint MIRT-RT model, as well as the stand-alone MIRT model. In general, the parameters are recovered relatively well under the joint modeling approach, given that all the RMSE values are lower than 0.4. Notably, the RMSEs of the item slope and item location parameters are lower than 0.26 (see Figure 2, which is provided in the online supplement). Several general trends are observed in item parameter recovery. First, the RMSE of the item parameters is smaller under the joint MIRT-RT model than under the MIRT model in most conditions. In particular, the RMSE of the item variance components is consistently lower than the values based on the MIRT model alone. Second, the RMSE of the item parameters is lower with 1,500 test-takers than with 500, mainly for the item slope parameters. Third, the RMSE of the item parameters is lower when test length increases from 15 to 30 items. Table 3 also indicates that RMSE (efficiency) is generally smaller for the joint modeling approach than for the MIRT model alone. In other words, using RTs as ancillary information could improve the precision of item parameter estimation as measured by RMSE.
Table 3. Parameter Recovery Results: RMSE for the Joint MIRT-RT Model and MIRT Model.
Note. RMSE = root mean square error; MIRT = multidimensional item response theory; RT = response time.
Table 3 also presents the recovery of the variance–covariance components of the item domain and the population domain at Level 2 (see Figure 1). This is of interest because the second-level item domain covariance components between different item parameters show the dependencies between responses and RTs. In general, the recovery of all item variance–covariance parameters is quite good for the MIRT-RT model, with RMSE values ranging from 0.01 to 0.34. The recovery of person-side variance and covariance parameters is also satisfactory, with RMSE values lower than 0.4.
Overall, the results indicate that the joint modeling approach outperformed the MIRT model alone. Across the test lengths and sample sizes considered, the RMSE findings suggest that the joint modeling approach could improve the accuracy and precision of both item and person parameter estimates by incorporating RTs as ancillary information for MIRT model estimation.
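The recovery criterion used in this section can be sketched in a few lines. The sketch assumes the generic definition of RMSE as the square root of the mean squared difference between estimates and generating values; how the study pools values across items, persons, and replications is not reproduced here, and the numbers are hypothetical.

```python
import math


def rmse(estimates, true_values):
    """Root mean square error between estimates and generating values."""
    if len(estimates) != len(true_values):
        raise ValueError("estimates and true values must have the same length")
    squared_errors = [(e - t) ** 2 for e, t in zip(estimates, true_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical slope estimates for two items whose generating value is 1.0.
example_rmse = rmse([1.1, 0.9], [1.0, 1.0])
```

Because RMSE combines bias and sampling variability into a single value, a lower RMSE for the joint model is read here as better recovery overall.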
Analysis of PISA 2015 Computer-Based Mathematics Data
Data Set Description
The PISA 2015 computer-based mathematics data were used to fit both the MIRT-RT model and MIRT model. The PISA 2015 data set is well suited for illustrative purposes for a number of reasons, including that a 2-PL MIRT model was used to scale the test-takers’ responses, matching the authors’ proposed MIRT-RT model, and that RTs were collected. There are
The RTs were transformed to a logarithmic scale before fitting the RT model. The Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) was calculated to compare overall model fit between the joint MIRT-RT model and the separate MIRT and RT models. The results are displayed in Table 5.
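For readers unfamiliar with the DIC, the sketch below follows the definition in Spiegelhalter et al. (2002): the posterior mean deviance plus an effective number of parameters, where the latter is the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters. The function name and the numeric inputs are hypothetical, and software packages may compute the penalty term differently.

```python
def deviance_information_criterion(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, effective number of parameters) from posterior deviance
    draws and the deviance evaluated at the posterior mean of the parameters."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    effective_params = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + effective_params, effective_params


# Hypothetical deviance draws from an MCMC run.
dic_value, p_d = deviance_information_criterion([100.0, 102.0, 104.0], 99.0)
```

When comparing candidate models fitted to the same data, the model with the smaller DIC is preferred.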
Table 4 shows both item and person parameter estimates. The item difficulty parameter estimates ranged from −0.78 to 1.36. The time intensity parameter estimates, which are on the logarithmic scale, ranged from 3.75 to 5.05. The estimated correlation between item difficulty and item time intensity was −0.16 (covariance = −0.375) with a 95% credible interval of −0.537 to 0.085, which is not significant, suggesting that item difficulty was not associated with item time intensity in the authors’ data set. The person covariance
Table 4. Item Parameter Estimates: PISA 2015 Computer-Based Mathematical Literacy Standard Unit Items.
Note. PISA = Programme for International Student Assessment; MIRT = multidimensional item response theory; IRT = item response theory; RT = response time.
Table 5. Person and Item Variance and Covariance Estimates for MIRT-RT and MIRT Models: PISA 2015 Computer-Based Mathematical Literacy Standard Unit Items.
Note. MIRT = multidimensional item response theory; RT = response time; PISA = Programme for International Student Assessment; CI = confidence interval; DIC = Deviance Information Criterion.
Discussion
As is becoming increasingly evident, gaining a more comprehensive understanding of test-takers requires collecting and analyzing additional information beyond their responses to items. To this end, computer-based testing (CBT) permits the gathering of RPD, like RTs, that can be used to make more accurate inferences regarding item parameters and test-takers’ abilities (Fox & Marianti, 2016). Incorporating this type of supplementary information has been shown to improve estimation of item and person parameters in IRT (van der Linden et al., 2010) while providing insights regarding the behaviors of test-takers that cannot be identified by using item response information in isolation. A MIRT-RT model within a hierarchical framework was proposed to jointly model RTs and the multidimensional latent constructs underlying item responses. A 2-PL MIRT model was specified for the item responses and a lognormal RT model was chosen for RTs; both were specified at Level 1 of the hierarchy. The Level 2 model incorporated the mean vectors and variance–covariance structures for both item and person parameters. Moreover, the joint modeling approach allows the dependencies among the item parameters to be evaluated through an item domain model. Estimation was carried out within a Bayesian framework using an MCMC algorithm.
Results from the small simulation indicate that RTs are useful ancillary information for improving the estimation of the MIRT model parameters. In general, the MIRT-RT model yields more accurate parameter estimates than modeling item responses and RTs independently. This type of joint model specification has additional benefits. Both the dependencies among item parameters and the dependencies among person parameters can be assessed and may provide potential avenues of investigation for practitioners and substantive researchers. Of course, one limitation of the simulation study is its scope. Although the conditions and levels used in the study have a theoretical grounding in the methodological literature and correspond to reasonable conditions found in practice, they are not exhaustive, and a more comprehensive investigation is warranted.
Finally, several model extensions could be investigated further. A logical next elaboration might be to noncompensatory MIRT models and their functional relation to RTs. Noncompensatory IRT models are notoriously more challenging to estimate, so care will need to be taken in specifying the model and in choosing prior distributions for the parameters if a Bayesian estimation approach is used. A second elaboration is to response models for polytomous or graded responses, which is important for psychological testing where items are Likert-scaled. Finally, the current assumption that working speed is constant over the entire test could be relaxed. This may provide localized information regarding test-takers, as the focus may center on specific items of an assessment rather than on all items.
Supplemental Material
Supplemental material (Kaiwen_APM_suplimentary and Online_supplimentary_Kaiwen) for Joint Modeling of Compensatory Multidimensional Item Responses and Response Times by Kaiwen Man, Jeffrey R. Harring, Hong Jiao and Peida Zhan in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
