Abstract
In high-stakes testing, multiple test forms are often used and a common time limit is enforced. Test fairness requires that ability estimates must not depend on the administration of a specific test form. Such a requirement may be violated if speededness differs between test forms. The impact of not taking speed sensitivity into account on the comparability of test forms regarding speededness and ability estimation was investigated. The lognormal measurement model for response times by van der Linden was compared with its extension by Klein Entink, van der Linden, and Fox, which includes a speed sensitivity parameter. An empirical data example was used to show that the extended model can fit the data better than the model without speed sensitivity parameters. A simulation was conducted, which showed that test forms with different average speed sensitivity yielded substantially different ability estimates for slow test takers, especially for test takers with high ability. Therefore, the use of the extended lognormal model for response times is recommended for the calibration of item pools in high-stakes testing situations. Limitations to the proposed approach and further research questions are discussed.
In high-stakes assessments such as college admission tests (e.g., SAT; College Board, 2016) or language proficiency tests (e.g., TOEFL; Educational Testing Service [ETS], 2020), important consequences result from test scores, such as admission to university or other educational programs. The high stakes connected to the test outcome have important implications for the design and analysis of the respective tests. First, to increase test security, multiple parallel test forms are often used. This prevents cheating during testing sessions with multiple test takers and the sharing of knowledge about the test by former test takers (Luecht & Sireci, 2011). Second, for reasons of fairness, testing conditions are standardized across test takers and test occasions. For instance, the time limit for the test is equal regardless of the test form. Third, due to the high stakes, test takers are often assumed to be highly motivated. Therefore, missing responses are commonly considered informative, that is, they are scored as incorrect responses. This scoring rule is communicated to test takers to prevent them from strategically not responding to items they feel unable to answer correctly. Ignoring missing values as a scoring rule could incentivize test takers to omit these items and thereby lead to biased and unfair ability estimates.
When multiple test forms are used, they are often required to be parallel, which in the strict sense means that for every test taker, the test forms have the same true score and the same error variance (Lord & Novick, 1986). Within an item response theory (IRT) framework where maximum likelihood is used to estimate ability, the expected ability estimate
The example illustrates that the speededness of a test is an interaction of the time intensity of its items, the time limit set on the test, and the exerted working speed of the test taker (van der Linden, 2011b). As the speed level usually varies between persons, the degree of speededness of a test can also be expected to vary between persons. A fast and proficient test taker will score higher on a test with a time limit than an equally proficient but slower test taker who has to engage in one of the above-described strategies to deal with the insufficient time available. Consequently, the measured latent construct is no longer a pure ability measure, but a composite measure of speed and ability. Views differ on whether this is a conceptual property of the test or a byproduct of the testing conditions. This article makes no assumptions about the nature of speed differences between persons or about the degree to which they should affect ability measurement in high-stakes testing. Instead, the article focuses on how to hold the level of speededness constant across all test forms within each individual test taker.
In the following section, the typical test assembly process and analysis that is commonly performed to obtain individual ability estimates in high-stakes assessments is briefly outlined. Based on this, the state-of-the-art approach to prevent differentially speeded test forms, which uses latent response time modeling, is described. An important shortcoming of this model is explained and a common model extension that mitigates this shortcoming is discussed.
Assessment Framework
Test Assembly
The common process of creating multiple parallel test forms contains the following steps (College Board, 2016; van der Linden, 2005): (a) developing items, (b) administering the items to a pilot sample (Piloting), (c) item parameter estimation (Calibration), and (d) assembly of items from an item pool into parallel test forms (Test Assembly). Criteria for the assembly of tests, besides test speededness, include the test information function, comparability of content, and similar distribution of item types (van der Linden, 2005). Due to the emergence of computer-administered testing, balancing speededness has become substantially easier. In this article, it is assumed that response times are available from a computer-administered piloting study.
Ability Estimation
For the estimation of latent abilities, an often-used choice is the two-parameter logistic (2PL) model. As already described, it is assumed that missing responses are scored as incorrect. Throughout this article, the notation of Fox (2010) is adopted, denoting items as
Balancing Speededness
Several strategies have been proposed to balance speededness across the test forms of a test administration, for example, using observed response times from a piloting study (e.g., van der Linden, 2005). In the following section, the current state-of-the-art approach, which uses a latent measurement model for response times, is discussed.
Lognormal Measurement Model
Recently, van der Linden (2011b) proposed the use of a lognormal latent measurement model for response times (van der Linden, 2006) for balancing speededness across test forms. The model assumes response times to be lognormally distributed and parameterizes these lognormal response times,
where
Speed Discrimination
According to the 2PLN, items can have different intercept parameters
This article, however, will argue that items can differ in the extent to which they are sensitive to speed differences, and that this variability across items needs to be taken into account when assembling test forms that should have equal speededness for each test taker. In the next section, an extension of the lognormal measurement model for response times which allows differences in speed sensitivity across items is discussed.
Extension of the Lognormal Measurement Model
Klein Entink, Fox, and van der Linden (2009) proposed an extension of the 2PLN that this article refers to as the three-parameter lognormal model (3PLN). It introduces a slope parameter
Conceptually, the parameter
Difference between the 2PLN and the 3PLN
There has been some confusion around the 2PLN and the 3PLN and the meaning of their respective item parameters in the literature.
It is important to note that the 2PLN and the 3PLN models are not equivalent formulations of the same model. This can be illustrated by comparing the model implicit correlations between the response times of two items
In contrast, for the 3PLN, this correlation is defined as follows:
For the derivation of both formulas, see Online Appendix A. For a similar remark on the model implicit covariances of the response times of two items, see Fox and Marianti (2016).
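The difference between the two models can also be illustrated numerically. The following minimal sketch assumes the parameterization ln T_ik = lambda_k - phi_k * zeta_i + eps_ik with independent, normally distributed speed and residuals, and with phi_k fixed at 1 in the 2PLN. The function name and all numeric values are hypothetical and chosen for illustration only; they are not taken from Online Appendix A.

```python
import math


def implied_log_rt_correlation(phi_k, phi_l, var_eps_k, var_eps_l, var_zeta=1.0):
    """Model-implied correlation between the log response times of two items.

    Assumed parameterization: ln T_ik = lambda_k - phi_k * zeta_i + eps_ik,
    with zeta_i ~ N(0, var_zeta) and eps_ik ~ N(0, var_eps_k), independent.
    """
    cov = phi_k * phi_l * var_zeta
    var_k = phi_k ** 2 * var_zeta + var_eps_k
    var_l = phi_l ** 2 * var_zeta + var_eps_l
    return cov / math.sqrt(var_k * var_l)


# 2PLN: speed sensitivity is fixed at 1 for both items (hypothetical variances).
r_2pln = implied_log_rt_correlation(1.0, 1.0, 0.25, 0.25)

# 3PLN: same residual variances, but the items differ in speed sensitivity.
r_3pln = implied_log_rt_correlation(0.5, 1.5, 0.25, 0.25)

print(round(r_2pln, 3), round(r_3pln, 3))
```

Under these assumptions, the 2PLN correlation is driven by the residual variances and the speed variance alone, whereas the 3PLN correlation additionally depends on the speed sensitivities of both items.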
To illustrate the difference between the residual variance and the speed sensitivity parameter, Figure 1 shows response time distributions conditional on two different speed levels (

Figure 1. Conditional response time distributions for a fast speed level with
As an illustration of the conceptual meaning of the speed sensitivity of items, consider the following two hypothetical math items with equal time intensity (e.g.,
Hierarchical Framework
For model estimation in the context of test assembly, van der Linden (2011a) proposed embedding the lognormal latent measurement model for response times in a hierarchical framework (van der Linden, 2007). The resulting model assumes two latent dimensions, ability and speed, with common item and person parameter distributions. Conditional on these joint distributions, the model assumes independently distributed responses and response times. The framework benefits the estimation of the two dimensions, especially if the two dimensions are correlated (van der Linden et al., 2010). The joint person parameter distribution with either the 2PLN or the 3PLN is a multivariate normal distribution with
The joint item parameter distribution with the 2PLN together with a 2PL model for ability is also a multivariate normal distribution with
The joint item parameter distribution with the 3PLN together with a 2PL model for ability also includes
Research Questions
Two questions arise: (a) whether the hierarchical framework with the 3PLN as a measurement model for response times fits empirical response time data better than the hierarchical framework with the 2PLN and, if this is the case, (b) what the consequences would be for ability estimation in high-stakes assessments. To the authors’ knowledge, models with the 2PLN and the 3PLN have not yet been compared using data from educational competence tests. Moreover, only a few comparisons have used empirical data at all, so far focusing on intelligence tests (Goldhammer & Klein Entink, 2011), complex problem-solving tasks (Scherer et al., 2015), and mental rotation tasks (Debelak et al., 2014). In all three studies, the model with the 3PLN showed better fit than the model with the 2PLN according to the deviance information criterion (DIC; Spiegelhalter et al., 2002). In addition, the hierarchical framework with the 3PLN has been applied to noneducational vocational credentialing high-stakes data (Fox & Marianti, 2017) and low-stakes data of chess tasks (Fox & Marianti, 2016). In both cases, substantial variance in the speed sensitivity parameter was found across the items. The aforementioned studies provide general evidence for the relevance of the proposed model extension. However, they do not focus on educational assessment data. Therefore, an empirical data analysis was conducted, in which the models with the 2PLN and the 3PLN were applied to and compared on data from an educational assessment, to investigate whether items differ in their speed sensitivity. This analysis is discussed in the “Empirical Data Analysis” section.
If the appropriateness of the model extension indeed holds in educational competence testing and items vary in their speed sensitivity, those differences may also accumulate over test forms of educational high-stakes assessments. This could result in test forms that, despite having equal time intensities and similar average observed response times, differ in their sensitivity to speed differences and therefore in their conditional distributions of expected testing times. Especially the substantial differences in expected response times for slow test takers would be important, as they could lead to differences in ability estimates across test forms. In the section “Simulation Study,” the possible consequences of unbalanced test forms on ability estimation are investigated and described using simulated data from test forms with item properties as found in the empirical example.
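How two test forms with identical time intensities can still differ in expected testing time may be sketched as follows. The sketch assumes the parameterization ln T_k ~ N(lambda_k - phi_k * zeta, var_eps_k), for which the conditional expectation of a lognormal response time is exp(lambda_k - phi_k * zeta + var_eps_k / 2). The function name and all parameter values are hypothetical and do not reproduce the item properties of the empirical example.

```python
import math


def expected_total_time(lambdas, phis, var_eps, zeta):
    """Expected total testing time (in seconds) for a test taker with speed zeta.

    Assumes ln T_k ~ N(lambda_k - phi_k * zeta, var_eps_k), so that
    E[T_k | zeta] = exp(lambda_k - phi_k * zeta + var_eps_k / 2).
    """
    return sum(
        math.exp(lam - phi * zeta + var / 2.0)
        for lam, phi, var in zip(lambdas, phis, var_eps)
    )


n_items = 30
lambdas = [4.0] * n_items      # hypothetical, equal time intensities
var_eps = [0.1] * n_items      # hypothetical residual variances
low_form = [0.6] * n_items     # hypothetical low speed sensitivity form
high_form = [1.4] * n_items    # hypothetical high speed sensitivity form

for zeta in (-0.5, 0.0, 0.5):  # slow, average, and fast test taker
    t_low = expected_total_time(lambdas, low_form, var_eps, zeta)
    t_high = expected_total_time(lambdas, high_form, var_eps, zeta)
    print(f"zeta={zeta:+.1f}: low={t_low:8.1f} s, high={t_high:8.1f} s")
```

In this sketch the two forms coincide at average speed, while the high speed sensitivity form takes longer for the slow test taker and less time for the fast test taker.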
Empirical Data Analysis
Data Description
For the empirical data analysis, data from the 2015 Programme for International Student Assessment (PISA; Organisation for Economic Co-operation and Development [OECD], 2016) were used, for which responses and response times on item level are publicly available. The competences measured by PISA resemble competences that are often assessed in high-stakes educational assessments. Note that it is not uncommon to calibrate items for a high-stakes context based on data from low-stakes conditions, when piloting in high-stakes conditions is cumbersome or impossible (e.g., College Board, 2016; ETS, 2020). In those situations, it is implicitly assumed that items function similarly in low- and high-stakes conditions. In that sense, the results of this empirical low-stakes data analysis also have implications for high-stakes assessments. The Canadian subsample was chosen because it is the largest among the 72 countries participating in PISA.
To avoid substantial numbers of missing responses by design, test booklets were analyzed separately and only the test takers who had worked on the respective booklet were included. In PISA 2015, every test form consisted of four booklets, and booklets were assembled into a total of 66 different test forms in the computer-administered version. Returning to items within a booklet was only possible among items sharing a common stimulus and was otherwise prohibited. Response times were accumulated across multiple visits of the same item (OECD, 2016). All math booklets used in the assessment were analyzed (named “M01”–“M05” and “M06ab”), each of which appeared in eight different test forms, twice at every position. For simplicity, all polytomous items were dichotomized, scoring fully correct responses as correct and partially correct responses as incorrect. This resulted in data sets of 10 to 12 dichotomous items and 1,863 to 1,929 persons.
Method
The software JAGS (Plummer, 2003) together with the R package rjags (Plummer, 2016) was used for model estimation. The hierarchical framework with both the 2PLN and the 3PLN was used to analyze each data set. In the operational analysis of the PISA data, omitted responses are scored as incorrect and the number of not-reached responses is used as a manifest variable in the background model for the plausible value generation (OECD, 2016). Because the aim of this empirical example is the unbiased estimation of item parameters (as in an actual pilot study for a high-stakes assessment), all missing responses were treated as if the items were not administered to the corresponding persons, which is the recommended practice for estimating item parameters (Finch, 2008).
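The two treatments of missing responses contrasted above can be sketched in a few lines. This is an illustration only: the use of None as a missing code, the function names, and the example response vector are hypothetical and are not part of the PISA data format or of the JAGS model used in this study.

```python
import math


def score_missing_as_incorrect(responses):
    """Scoring rule for ability estimation: every missing response counts as 0."""
    return [0 if r is None else r for r in responses]


def treat_missing_as_not_administered(responses):
    """Calibration rule: missing responses are excluded, represented here as NaN."""
    return [math.nan if r is None else r for r in responses]


def proportion_correct(scored):
    """Proportion correct over all entries that are not NaN."""
    observed = [x for x in scored if not math.isnan(x)]
    return sum(observed) / len(observed)


# Hypothetical response vector: two observed and two missing responses.
raw = [1, 0, None, None]

as_incorrect = score_missing_as_incorrect(raw)
as_not_administered = treat_missing_as_not_administered(raw)

print(proportion_correct(as_incorrect))         # missing responses lower the score
print(proportion_correct(as_not_administered))  # missing responses are ignored
```

The same raw data thus yield different summaries depending on the rule, which is why the rule used for calibration and the rule used for scoring need to be stated separately.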
Model estimation
Priors were uninformative and chosen in correspondence to Fox (2010) and Pohl et al. (2019). An inverse Wishart distribution was used as a hyperprior for the distribution of the three (
Results
Inspections of the Markov chain Monte Carlo (MCMC) chains were conducted using the R packages coda (Plummer et al., 2006) and rjags. Trace plots indicate good convergence for all parameters in both models in all data sets. The point estimates of the univariate potential scale reduction factors (Gelman & Rubin, 1992) for all parameters in all booklets were below 1.03 (95% upper confidence interval limits at or below 1.10) and below 1.05 (95% upper confidence interval limits at or below 1.19), for the models with, respectively, the 2PLN and the 3PLN. This indicates satisfactory convergence (Gelman & Shirley, 2011). The correlation of the person ability and person speed parameter ranged between
Regarding model fit, the DIC indicated better fit with the 3PLN as a measurement model for all booklets (Online Appendix E). Table 1 shows the statistics for the resulting speed sensitivities for all booklets. The mean of speed sensitivities
The correlations of the speed sensitivities with other item parameters were also investigated. Table 1 displays the means of the posterior distributions of these correlations. Speed sensitivity correlated low but consistently over all booklets with the time intensity parameter
Table 1. Descriptive Statistics of Item Speed Sensitivity Within All Math Booklets.
Note. Descriptive statistics for speed sensitivity, including its mean M(
Simulation Study
Design
The empirical data analyses illustrate that it is plausible to assume differences between items regarding their speed sensitivity. Therefore, the question arises how the fairness of test forms is affected if this speed sensitivity is not controlled for between test forms. Based on the findings and parameter distributions in the empirical analyses, a simulation study was conducted to investigate how differences in speed sensitivity across test forms affect ability estimates. The simulation study reflects the main stage of a high-stakes assessment in which item properties are known from prior piloting and the sole interest lies in person parameter estimation. Three test forms were created, each with 30 items. The item parameters for the first test form were drawn from a multivariate normal distribution. Means, variances, and covariances of the item parameters were set to be in accordance with the results obtained from the empirical data analysis (see Online Appendix F).
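Drawing item parameters for one test form from a multivariate normal distribution may be sketched as follows. The means and the covariance matrix below are hypothetical placeholders and are not the values of Online Appendix F; the parameter order, the function names, and the seed are likewise assumptions made for illustration. The sketch uses only the standard library and does not enforce positive slopes.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def draw_item_parameters(n_items, means, cov, seed=1):
    """Draw n_items parameter vectors from a multivariate normal distribution."""
    rng = random.Random(seed)
    lower = cholesky(cov)
    dim = len(means)
    items = []
    for _ in range(n_items):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        items.append([
            means[r] + sum(lower[r][c] * z[c] for c in range(r + 1))
            for r in range(dim)
        ])
    return items


# Hypothetical values, NOT those reported in Online Appendix F.
# Order: discrimination a, difficulty b, speed sensitivity phi, time intensity lambda.
means = [1.0, 0.0, 1.0, 4.0]
cov = [
    [0.04, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.20],
    [0.00, 0.00, 0.04, 0.02],
    [0.00, 0.20, 0.02, 0.25],
]

form_1 = draw_item_parameters(30, means, cov)
print(len(form_1), len(form_1[0]))
```

Fixing the seed keeps the drawn test form reproducible across runs, which is convenient when further test forms are derived from the first one.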
Method
Person abilities were estimated according to the 2PL, with known item parameters using the weighted likelihood estimator (WLE; Warm, 1989) via the R package TAM (Robitzsch et al., 2017). Not-reached items were scored as incorrect. This approach reflects a high-stakes assessment, in which item parameters are obtained from a previously conducted calibration study and ability estimation is the focus (without specifically considering speed in the estimation). Numbers of not-reached items and estimated ability were compared for the four different speed groups between the three test forms.
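The effect of scoring not-reached items as incorrect on a weighted likelihood estimate can be illustrated with a small stand-alone sketch. It is not the TAM implementation used in this study. It assumes the 2PL parameterization a_k * theta - b_k, solves Warm's (1989) weighted score equation by simple bisection, and uses hypothetical item parameters and response vectors.

```python
import math


def p_correct(theta, a, b):
    """2PL success probability, assuming the parameterization a * theta - b."""
    return 1.0 / (1.0 + math.exp(-(a * theta - b)))


def wle_score_function(theta, responses, a_params, b_params):
    """Likelihood score plus Warm's (1989) correction term J / (2 * I)."""
    score = info = j_term = 0.0
    for x, a, b in zip(responses, a_params, b_params):
        p = p_correct(theta, a, b)
        pq = p * (1.0 - p)
        score += a * (x - p)
        info += a ** 2 * pq
        j_term += a ** 3 * pq * (1.0 - 2.0 * p)
    return score + j_term / (2.0 * info)


def wle(responses, a_params, b_params, lower=-10.0, upper=10.0, tol=1e-8):
    """Find the root of the weighted score function by bisection.

    Assumes the function changes sign between lower and upper, which holds
    for the moderate item parameters used in this sketch.
    """
    f_lower = wle_score_function(lower, responses, a_params, b_params)
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        f_mid = wle_score_function(mid, responses, a_params, b_params)
        if (f_mid > 0) == (f_lower > 0):
            lower, f_lower = mid, f_mid
        else:
            upper = mid
    return (lower + upper) / 2.0


# Hypothetical test form with ten identical items.
a = [1.0] * 10
b = [0.0] * 10

complete = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
# Same test taker, but the last four items were not reached and are scored 0.
speeded = complete[:6] + [0, 0, 0, 0]

theta_complete = wle(complete, a, b)
theta_speeded = wle(speeded, a, b)
print(theta_complete, theta_speeded)
```

In this sketch the test taker who runs out of time receives a lower ability estimate although the responses to all reached items are identical.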
Results
As can be predicted from the response time measurement model in Equation 4 and the response time characteristic curves described in the introduction, differences in cumulative response times between the three test forms were most severe for the fastest and slowest participants (Table 2). The fastest subgroup finished well within the time limit of 3,900 s, with mean cumulative response times of 1,310.08 and 1,953.53 s for the high and the low speed sensitivity test forms, respectively. In contrast, the slowest subgroup working on the high speed sensitivity test form exceeded the time limit substantially on average, with a mean of 5,419.48 s. In the faster subgroups, the differences in testing time did not result in different numbers of not-reached items because for all test forms the testing times were well below the time limit. For the slowest participants, however, the medium and high speed sensitivity test forms led to substantially more not-reached items than the low speed sensitivity test form. The average numbers of not-reached items are reported in Table 2 and are depicted in Online Appendix G for a single replication.
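How cumulative response times translate into not-reached items under a time limit can be sketched as follows. The sketch assumes that items are worked on in a fixed order and that an item counts as not reached if its cumulative finishing time exceeds the time limit; this rule, the function name, and the response times are hypothetical simplifications. Only the time limit of 3,900 s is taken from the simulation.

```python
from itertools import accumulate


def count_not_reached(response_times, time_limit):
    """Number of items that cannot be completed within the time limit.

    Assumption: items are worked on in order, and an item counts as not
    reached if its cumulative finishing time exceeds the time limit.
    """
    return sum(1 for t in accumulate(response_times) if t > time_limit)


TIME_LIMIT = 3900  # seconds, as in the simulation

# Hypothetical response times (in seconds) for one slow test taker on 30 items.
slow_low_form = [120] * 30   # low speed sensitivity form, 3,600 s in total
slow_high_form = [180] * 30  # high speed sensitivity form, 5,400 s in total

print(count_not_reached(slow_low_form, TIME_LIMIT))
print(count_not_reached(slow_high_form, TIME_LIMIT))
```

With these hypothetical times, the same slow test taker completes the low speed sensitivity form but runs out of time on the high speed sensitivity form.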
Table 2. Test Statistics per Test Form and per Speed Group, Averaged Across All Replications.
Note. Descriptive statistics are depicted for mean cumulative response times
These differences in the number of not-reached items also resulted in differences in ability estimates, mainly for the slowest subgroup. For this subgroup, the average difference in ability estimates was 0.09 between the test forms with low and medium speed sensitivity and 0.51 between the test forms with low and high speed sensitivity. Higher average speed sensitivity resulted in substantially lower ability estimates. A difference of 0.51 in the ability logit for a test taker with a true ability

True and estimated ability for the low and high speed sensitivity test form, across the four subgroups.
To conclude, the simulation shows that differences in speed sensitivity between test forms can lead to substantial differences in ability estimates, especially for slow and able test takers. This finding is independent of whether speed is seen as a nuisance parameter or as part of the construct to be measured. Furthermore, if speed is seen as a nuisance parameter, the high speed sensitivity test forms lead to a more biased and less precise ability measurement. If speed is seen as a substantial part of the construct to be measured, differences between true and estimated ability are in fact desirable for slow test takers, but they should be identical across test forms.
Discussion
High-stakes assessments often require multiple test forms with equal speededness at the level of the test taker. So far, the use of average response times and the use of the lognormal measurement model for response times by van der Linden (2006) have been proposed as strategies to control speededness across test forms (van der Linden, 2011b). In this article, the 2PLN was compared with its extension, the 3PLN by Klein Entink, Fox, and van der Linden (2009), which introduces a speed sensitivity parameter into the measurement model. It was investigated which measurement model, embedded in the hierarchical framework by van der Linden (2007), fits empirical competence data better. Indeed, the 3PLN showed better model fit and the estimated speed sensitivity parameters varied substantially across items. This implies that balancing test forms using either observed response times or the item parameters from the 2PLN can lead to unbalanced speed sensitivity across test forms. Moreover, the simulation study shows that when missing responses are treated as incorrect (a standard practice in high-stakes assessments), differences in speed sensitivity between test forms can lead to severe differences in ability estimation. Slow test takers with high ability were especially affected because they had increased numbers of not-reached items in the test forms that had higher speed sensitivities.
The issue of differential speed sensitivity can also be illustrated from an alternative perspective: As stated before, it is assumed that high-stakes tests usually are speeded power tests and therefore that the ability measured in the test is a composite measure of ability and speed. However, this composition changes between test forms if the test forms differ in their speed sensitivity. If a test form has a high speed sensitivity and a time limit induces time pressure for a certain speed level, the proportion of speed in the composite measure can be considered quite high. If in the same scenario a test form has low speed sensitivity, however, the proportion of speed in the composite measure for this test form will be rather low. This study argues that the influence of speed on the ability estimation has to be the same across test forms within each speed level.
Practical Implications
The following conclusions are drawn regarding the practice of assembling test forms for educational high-stakes assessments: Right now, the use of the hierarchical framework with the 2PLN is the state-of-the-art approach when balancing test forms. However, the findings of this study suggest that this approach should be considered sufficient only when (a) the model with the 2PLN fits the data better than the model with the 3PLN (e.g., according to the DIC) or (b) the 3PLN shows low variation in the speed sensitivity parameter across items. In cases where the model with the 3PLN shows better model fit and items differ in their speed sensitivity, using only the hierarchical framework with the 2PLN could lead to unfair testing situations. To be more precise, the ability estimates and the rank order of test takers could heavily depend on the administered test form, especially for slower test takers. Instead, the hierarchical framework with the 3PLN should be used when calibrating the items, and not only the average testing time but also the sensitivity to speed differences should be balanced across test forms.
Another common alternative for the assembly of fair test forms is to assemble unspeeded test forms. Because in most educational assessments speed is a nuisance parameter that is conceptually not part of the construct being measured, this strategy seems promising. However, the results of this study and results from previous studies (e.g., van der Linden & Xiong, 2013) indicate that this approach might be unfeasible because there are generally large differences in the time that test takers require to respond to all items in an assessment (see Table 2). Assuring that even the slowest test takers can work without time pressure would imply a time limit that is far too generous for fast test takers and problematic from both an economic and a motivational perspective. Furthermore, the results of this study have important implications for determining the speededness of a test: So far, experimental methods using different time limits or different numbers of items in the same time limit have often been used (e.g., Bridgman, Cline, & Hessinger, 2004; Bridgman, Trapani, & Curley, 2004; Harik et al., 2018). But while for the majority of the test takers more generous time limits might only have a small impact on the demonstrated ability, different time limits can still substantially affect the slowest part of the population. This effect can only be disentangled by explicitly modeling speed. If differences in ability estimation for different time limits are averaged over all test takers or calculated for different ability levels, the degree of speededness of the test for slow test takers could be severely underestimated. Therefore, tests that have been examined using the aforementioned experimental methods could have been falsely classified as unspeeded.
Limitations
There are a number of limitations to this study. First, the real data analysis is based on low-stakes data while implications are mainly relevant for high-stakes assessments. However, similar analyses on (non-educational) high-stakes data have reported similar findings (Fox & Marianti, 2017). In addition, it is not uncommon that pilot studies for item pool calibrations are conducted under low-stakes conditions. Furthermore, this study does not conclude that the hierarchical framework with the 3PLN will always demonstrate better model fit than the model with the 2PLN for item pools of high-stakes assessments. Rather, it argues that the assumption of equal speed sensitivity across items should be tested, just like the assumption of equal factor loadings should be tested in confirmatory factor analysis or structural equation modeling (Brown, 2006).
A second limitation relates to a general limitation of the hierarchical framework, namely, the assumption of stationarity (van der Linden, 2007). The model assumes that given the common distribution of the person and item parameters, residuals between responses and response times are independent. The assumption is, for example, violated if participants substantially speed up or slow down during the test. This could happen in high-stakes assessments with a time limit, if test takers speed up when they feel they are running out of time. However, for test assembly purposes, only item parameters and their relations across items are of interest. If position effects are controlled for (similar to controlling for position effects in the estimation of ability item parameters; e.g., Gonzalez & Rutkowski, 2010), speeding up might only affect the precision of item parameter estimation. Avoiding speeding up seems easiest if items are piloted in low-stakes settings.
A third limitation is that this study deals with a specific violation of the model assumptions of the 2PLN. In the past, assumptions of the hierarchical framework using the 2PLN or 3PLN for response times have been critically reviewed using empirical data analyses (Bolsinova & Tijmstra, 2018; Klein Entink, van der Linden, & Fox, 2009; Fox & Marianti, 2016; Ranger & Ortner, 2012). Criticism includes violations of the assumption of lognormally distributed response times and the stationarity assumption mentioned above. Although the lognormal distribution has been the standard for modeling response times in educational assessments, future research could explore alternatives to the 2PLN and 3PLN, possibly also embedded in the hierarchical framework.
Outlook
In the past, automated test assembly (ATA) procedures have been developed to enable the assembly of multiple test forms from large item pools under various constraints (van der Linden, 2005). These methods are already frequently used in practice (Luecht & Sireci, 2011). To enable the use of the 2PLN in ATA, van der Linden (2011a, 2011b) reparameterized the model. Future research should investigate how the 3PLN can best be used in ATA and whether a similar reparameterization approach might be feasible.
The 3PLN could also be useful to determine the speededness of assessments under various time constraints for different test-taker populations without having to experimentally investigate all possible combinations. This would especially be valuable for determining test accommodations for students with disabilities (Lovett, 2010). Furthermore, while this article focuses on fixed-test forms, the findings can also be applied to computerized adaptive testing or multistage testing. Studies have shown that differential speededness of test forms is an even greater challenge in these settings (van der Linden & Xiong, 2013). Investigating whether the 3PLN could contribute to the fairness of these assessments seems worthwhile as well.
Supplemental Material
Supplemental material, sj-pdf-1-apm-10.1177_01466216211008530 for On the Speed Sensitivity Parameter in the Lognormal Model for Response Times and Implications for High-Stakes Measurement Practice by Benjamin Becker, Dries Debeer, Sebastian Weirich and Frank Goldhammer in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.