Abstract
Two simple constraints on the item parameters in a response–time model are proposed to control the speededness of an adaptive test. As the constraints are additive, they can easily be included in the constraint set for a shadow-test approach (STA) to adaptive testing. Alternatively, a simple heuristic is presented to control speededness in plain adaptive testing without any constraints. Both types of control are easy to implement and require no real-time parameter estimation during the test other than the regular update of the test taker’s ability estimate. Evaluation of the two approaches using simulated adaptive testing showed that the STA was especially effective. It guaranteed testing times that differed by less than 10 seconds from those for a reference test across a variety of conditions.
Introduction
Adaptive tests are typically organized as fixed-length tests with test takers signing up for fixed time slots. However, the items in the pool may show considerable variation in the time they require to process all their information and find a solution. As different test takers get different selections of items from the pool, the time they need to reply to their selection may also vary considerably, with serious differential speededness of the test as a result. Because the time intensity of the items typically correlates with their difficulty, and the difficulty of the selected items correlates highly with the ability of the test takers, usually the more able test takers experience much greater time pressure during adaptive testing than their less able counterparts. Empirical evidence demonstrating this counterintuitive result has been presented, for instance, by Bridgeman and Cline (2003) and van der Linden (2009a).
A method for the control of speededness in adaptive testing was presented by van der Linden (2009a). The method was based on the same response–time (RT) model as in the current study (which will be reviewed below). It consisted of the following components: (a) use of the RTs to continuously update an estimate of the test taker’s speed during the adaptive test; (b) use of the speed estimate along with the estimates of the time intensities for the remaining items in the pool to predict the time necessary for each of them; and (c) selection of the items in the adaptive test subject to a constraint on their predicted times that guarantees completion of the test within the remaining time. In this method, Bayesian methodology was used for updating both the estimate of the speed parameter and the predicted times for the test taker on the remaining items in the pool.
Two major differences exist between this earlier study and the method presented in this article. First, the control in the earlier study was one sided in that it only imposed the time limit as an upper bound on the completion time for the test but did not prevent test takers from finishing too early. In fact, because of the low time intensity of the majority of items selected by the adaptive algorithm, the evaluation of the method showed many test takers who actually did finish much earlier (van der Linden, 2009a, figure 1). The new method is designed to offer two-sided control; it prevents candidates from running out of time but at the same time selects combinations of items that imply full use of the time available for the test. We believe this additional feature is important because it allows for more effective scheduling of test takers across the fixed time slots typically used in adaptive tests (no idle testing stations or unrest created by test takers leaving early).

Plot of the total-time distributions on the reference test for test takers working at speed τ = −.70, −.35, 0, .35, and .70.
Second, the new method does not require any real-time estimation of speed parameters for the test takers nor any updating of the predicted distributions of the time the test takers would spend on each of the remaining items in the pool. It thus entirely avoids the computational burden involved in the use of a Bayesian predictive methodology during adaptive testing and produces results not liable to any errors in the estimates of these parameters and distributions. Instead, the new method imposes two simple constraints on the RT parameters of the items that are selected. Consequently, the method is effective directly from the beginning of the test; we do not have to wait for stabilization of any parameter estimates more toward the end of it.
Another key element of the new method is its direct focus on the probability of a test taker running out of time on the adaptive test. Following van der Linden (2011a), the probability is defined as follows: Let
where τ is the test taker’s speed,
For a given test taker, the probability of running out of time in Equation 1 can be controlled by constraining the vectors α and β for the selection of items in the adaptive test. The same principle was applied in van der Linden (2011b) to three problems: assembling fixed test forms that are as speeded as a reference form, assembling fixed forms with the same degree of speededness after some of the other test specifications have changed, and adjusting the degree of speededness of a testing program with a fixed time slot to a newly selected level. As demonstrated below, application of the principle to adaptive testing becomes straightforward once the distribution function in Equation 1 is replaced by a direct approximation, which leads to two constraints, each imposed on a simple sum of item parameters. The only thing we need to do during adaptive testing is update these sums for the items selected in the test.
Total-Time Distribution
The response time for a test taker with speed
with parameters
The density of the distribution of the total time for a test taker with speed τ on two items follows from Equation 2 as
an expression known as the convolution integral of the separate densities for the two items. Repeated application of the integral, each time adding an extra item to the test, gives the distribution of
We reparameterize the RT model in Equation 2 and define new item parameters
and
As a result, although the distribution of
and
(van der Linden, 2011a, equations 20–21).
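The factorization of the mean and variance of the total time can be illustrated numerically. The sketch below is not taken from the article; it assumes the lognormal RT model in the form ln T ~ Normal(β − τ, α⁻²) for each item and conditional independence of the RTs given τ, so that the item-level moments follow from the standard lognormal formulas. The function names and the item parameter values are hypothetical.

```python
import math

def q_r(alpha, beta):
    """Item parameters on the time scale (assumed reparameterization).

    Assumes ln T ~ Normal(beta - tau, alpha**-2) for an item, so that
    E[T] = exp(-tau) * q and Var[T] = exp(-2 * tau) * r.
    """
    s2 = alpha ** -2
    q = math.exp(beta + s2 / 2.0)
    r = math.exp(2.0 * beta + s2) * (math.exp(s2) - 1.0)
    return q, r

def total_time_moments(items, tau):
    """Mean and variance of the total time on a form at speed tau.

    Sums of variances are used, which assumes conditionally
    independent RTs given tau.
    """
    sum_q = sum(q_r(a, b)[0] for a, b in items)
    sum_r = sum(q_r(a, b)[1] for a, b in items)
    return math.exp(-tau) * sum_q, math.exp(-2.0 * tau) * sum_r

# Hypothetical (alpha, beta) pairs for a three-item form.
form = [(1.8, 3.9), (2.1, 4.2), (1.5, 4.0)]
mean0, var0 = total_time_moments(form, tau=0.0)
mean1, var1 = total_time_moments(form, tau=0.35)
print(mean0, var0)
print(mean1, var1)
```

Under these assumptions the speed parameter enters only through the factors exp(−τ) and exp(−2τ), whereas the items enter only through the two sums.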
The standard family of lognormal distributions has density
with parameters μ and σ2. The member we need to approximate the density of
and
Observe that μ and σ2 are not the mean and variance of the (otherwise unknown) distribution of
Controlling Speededness in Adaptive Testing
A surprisingly useful feature of Equations 6 and 7 is their factorization into two parts depending exclusively on the test taker and the item parameters, respectively. The factorization guarantees that, whatever the speed τ of the test taker, the mean and variance of his or her total-time distribution on any two test forms with matching sums of these item parameters are identical. In fact, Equations 8 through 10 show that the same principle holds for the entire shape of the approximating distribution; for any τ, identical sums of
Another useful feature of each expression in Equations 7 and 8 is that they are based only on sums of the
In order to control the speededness of an adaptive test, our basic idea is to identify a reference form with a proven, excellent level of test speededness, and build the adaptive test to produce the same level. For instance, the reference form could be a fixed form used before a program became adaptive or the set of items in a previous adaptive test. It is essential that the actual total times on it have been checked for test takers working at a realistic range of speed and that they have been found to be all right, given the current time limit. In addition, an evaluation of the subjective time pressure experienced by the test takers during the test may play a role. Observe that we only have to check these data for a realistic range of τ values. As for any two matching sums of
In fact, as the two sums of the

a. Average difference between the time limits for the reference and adaptive tests for the conditions with no control, shadow-test approach (STA), and two versions of the heuristic in seconds (π = .05).

b. Average difference between the time limits for the reference and adaptive tests for the conditions with no control, STA, and two versions of the heuristic in seconds (π = .10).

c. Average difference between the time limits for the reference and adaptive tests for the conditions with no control, STA, and two versions of the heuristic in seconds (π = .15).
As identical total-time distributions on the adaptive and reference test for each test taker imply identical risks of running out of time, we control the speededness during the adaptive test administrations in the strongest possible sense. For instance, for any given population of test takers, the method of control guarantees that the same number of test takers will experience exactly the same levels of time pressure on both tests. Observe that we are able to guarantee this without actually having to specify any explicit limit on the risk in Equation 1. We thus entirely avoid the problem of having to specify a minimally acceptable level of speed required to calculate such a limit.
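The argument can be made concrete with a small sketch. It assumes that the approximating lognormal is obtained by matching the mean exp(−τ)·Σq and variance exp(−2τ)·Σr of the total time, which is one natural reading of the approximation described above and not a verbatim implementation of the article's equations. The function names, the sums, and the time limit are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_params(sum_q, sum_r, tau):
    """Moment-matched lognormal for the total time (an assumption)."""
    mean = math.exp(-tau) * sum_q
    var = math.exp(-2.0 * tau) * sum_r
    sigma2 = math.log(1.0 + var / mean ** 2)  # does not depend on tau
    mu = math.log(mean) - sigma2 / 2.0        # shifts by -tau
    return mu, sigma2

def risk_out_of_time(sum_q, sum_r, tau, t_lim):
    """Approximate probability that the total time exceeds t_lim."""
    mu, sigma2 = lognormal_params(sum_q, sum_r, tau)
    z = (math.log(t_lim) - mu) / math.sqrt(sigma2)
    return 1.0 - NormalDist().cdf(z)

# Hypothetical target sums (seconds, seconds squared) and time limit.
SUM_Q, SUM_R, T_LIM = 1500.0, 40000.0, 1800.0
for tau in (-0.70, -0.35, 0.0, 0.35, 0.70):
    print(tau, risk_out_of_time(SUM_Q, SUM_R, tau, T_LIM))
```

Because the items enter the approximation only through the two sums, any adaptive form that reproduces the sums of the reference form yields the same approximate risk at every τ, without a limit on the risk ever being specified.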
Let
and
Each adaptive test is built to meet these two values. In order to achieve this goal, two different implementations of the idea are suggested. The first is a shadow-test method in which constraints based on the target values are inserted in the test-assembly model for each shadow test. A key advantage of this approach is that the constraints can be easily combined with whatever other constraints are required to meet the full set of content, statistical, and practical specifications for the test. The second method is a simple heuristic. It can be used for a plain adaptive test without any other constraints than the target values in Equations 11 and 12.
Implementations
Shadow-Test Approach (STA)
In STA, the selection of each item is preceded by the assembly of a shadow test from the item pool. Shadow tests are defined as fixed-form tests of the same length as the adaptive tests that (a) are maximally informative at the test taker’s current estimate of θ, (b) meet all constraints to be imposed on the test, and (c) contain all items already administered to the test taker (van der Linden, 2005, chap. 9). Basically, each next shadow test involves a reassembly of the remaining portion of the adaptive test, such that it still meets all constraints but now is maximally informative at the new θ estimate. The next item to be administered is the most informative item in the shadow test not yet seen by the test taker. All other free items are returned to the item pool. Because each next shadow test meets all constraints and is maximally informative, the same automatically holds for the adaptive test.
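The logic of a single shadow-test step can be sketched for a toy pool. The sketch replaces the MIP solver by exhaustive search, which is feasible only for very small pools, and it assumes a 2PL information function; neither choice is taken from the article. The constraints are written as tolerances around target sums of the two RT parameters, here called q and r. All names and values are hypothetical.

```python
import math
from itertools import combinations

def fisher_info(a, b, theta):
    """2PL item information (assumed response model for this sketch)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def shadow_test(pool, administered, theta_hat, n,
                target_q, target_r, tol_q, tol_r):
    """Exhaustive-search stand-in for the MIP solver.

    pool: dict item_id -> (a, b, q, r); administered: set of item ids.
    Returns the most informative feasible form, or None if infeasible.
    """
    fixed = tuple(sorted(administered))
    free = [i for i in pool if i not in administered]
    best, best_info = None, -1.0
    for extra in combinations(free, n - len(fixed)):
        form = fixed + extra
        sum_q = sum(pool[i][2] for i in form)
        sum_r = sum(pool[i][3] for i in form)
        if abs(sum_q - target_q) > tol_q or abs(sum_r - target_r) > tol_r:
            continue  # violates a speededness constraint
        info = sum(fisher_info(pool[i][0], pool[i][1], theta_hat) for i in form)
        if info > best_info:
            best, best_info = form, info
    return best

def next_item(pool, form, administered, theta_hat):
    """Most informative item in the shadow test not yet administered."""
    free = [i for i in form if i not in administered]
    return max(free, key=lambda i: fisher_info(pool[i][0], pool[i][1], theta_hat))

# Hypothetical pool: item_id -> (a, b, q, r).
pool = {1: (1.2, -0.5, 40.0, 300.0), 2: (1.0, 0.0, 60.0, 500.0),
        3: (1.5, 0.3, 90.0, 900.0), 4: (0.8, 1.0, 70.0, 600.0),
        5: (1.3, 0.1, 50.0, 400.0), 6: (1.1, -1.0, 30.0, 200.0)}
administered = {2}
form = shadow_test(pool, administered, 0.0, 3, 180.0, 1600.0, 10.0, 200.0)
item = next_item(pool, form, administered, 0.0)
print(form, item)
```

In an operational setting the enumeration would be replaced by a solver call, and the content and other practical constraints would be added to the feasibility check.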
In principle, shadow tests can be assembled using any method of constrained test assembly that is fast enough for use in real time. In the example later in this article, we used mixed integer programming (MIP) in combination with a fast commercial solver (IBM ILOG OPL version 6.3; International Business Machines Corporation, 2009) to identify the optimal shadow tests. For an application in R based on a free solver, see Diao and van der Linden (2011). The first step in this approach is the definition of 0-1 decision variables
Suppose the shadow test prior to the selection of the kth item in the adaptive test needs to be reassembled. Let
subject to
where
In a real-world application, the model may have to be extended with several other constraints, for instance, on the content distribution of the items in the test, to guarantee a desired answer-key distribution, prevent the administration of “enemy items,” constrain the range of readability indices, and so on. The solution to the model found by the solver is the string of 0s and 1s for the variables
Heuristic
The heuristic we propose is analogous to a suggestion for automated test assembly by Luecht (1998). However, instead of the original application to a target for the test information function as the criterion for item selection, the current application is with respect to a combination of the targets
The method consists of the calculation of the differences between the targets in Equations 11 and 12 and the sums of the parameters for the
and
as the target values for the selection of the kth item. The kth item is selected to have both maximum information at
where
The weights
and
Evaluation Study
Simulation studies were conducted to evaluate the effectiveness of the STA and heuristic in eliminating differences in speededness between test takers under a variety of conditions. More particularly, we looked at the extent to which the two methods approximated the desired degree of speededness for the adaptive test for test takers with different abilities operating at different speeds. In addition, we assessed the possible price to be paid for the control of the speededness in the form of loss of precision of ability estimation due to the extra constraints or objectives enforced on the item selection.
Data
Adaptive test administrations were simulated from a pool consisting of a set of 185 items sampled from a retired pool from a large-scale testing program. The
Descriptive Statistics of Item Pool Used in Simulation Study
Adaptive tests with a fixed length of 20 items for test takers with a true ability parameter equal to
Four different conditions were simulated:
1. No control of speededness.
2. Control of speededness using the STA in Equations 13–19.
3. Control of speededness using the item-selection heuristic in Equation 22 with the standardized weights in Equations 23 and 24.
4. Control of speededness as in Condition 3, but now with weights that were 50% greater than the standardized weights in Equations 23 and 24.
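The heuristic used in Conditions 3 and 4 can be illustrated by a rough sketch. The exact criterion and the standardized weights are not reproduced here. The sketch assumes that the remaining distance to each target sum is divided equally over the items still to be selected, and that an item is scored by its information minus weighted absolute deviations from these per-item targets. The 2PL information function, the weights, and all names and values are hypothetical.

```python
import math

def fisher_info(a, b, theta):
    """2PL item information (assumed response model for this sketch)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_item(pool, selected, theta_hat, n, target_q, target_r, w_q, w_r):
    """Select the next item by a weighted-deviation criterion (assumed form).

    pool: dict item_id -> (a, b, q, r); selected: list of item ids so far.
    """
    remaining = n - len(selected)  # items still to be selected, incl. this one
    # Per-item targets: remaining distance to each target sum, spread evenly.
    t_q = (target_q - sum(pool[i][2] for i in selected)) / remaining
    t_r = (target_r - sum(pool[i][3] for i in selected)) / remaining

    def score(i):
        a, b, q, r = pool[i]
        return fisher_info(a, b, theta_hat) - w_q * abs(q - t_q) - w_r * abs(r - t_r)

    free = [i for i in pool if i not in selected]
    return max(free, key=score)

# Hypothetical pool: item_id -> (a, b, q, r).
pool = {1: (1.2, -0.5, 40.0, 300.0), 2: (1.0, 0.0, 60.0, 500.0),
        3: (1.5, 0.3, 90.0, 900.0), 4: (0.8, 1.0, 70.0, 600.0),
        5: (1.3, 0.1, 50.0, 400.0), 6: (1.1, -1.0, 30.0, 200.0)}
no_control = select_item(pool, [], 0.0, 3, 180.0, 1600.0, 0.0, 0.0)
strong_control = select_item(pool, [], 0.0, 3, 180.0, 1600.0, 1.0, 1.0)
print(no_control, strong_control)
```

With both weights set to zero the rule reduces to maximum-information selection. Larger weights pull the selection toward items whose RT parameters lie close to the per-item targets, which mirrors the trade-off explored in Condition 4.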
The fourth condition was included to explore the effects of relatively greater weights for the last two objectives in Equation 22. In principle, increasing these weights means better control of speededness, but possibly at the price of a less informative adaptive test. The STA was implemented using tolerances
The reference test consisted of 20 items randomly sampled from the pool. The target values for the reference test in Equations 11 and 12 were
For each of the five τ levels, we calculated the time limit
which is the value of the quantile function for the total-time distribution (inverse of the distribution function in Equation 1) at
Time Limits (in minutes) on the Reference Test Required to Realize Risk Levels
As the methods of control of differential speededness did not assume estimation of any speed parameter during testing, it was not necessary to manipulate this parameter in our study. The only thing we had to do to evaluate the two methods was to record the
The estimation errors
and
Results
Figure 2 shows the average difference between the actual and required time limits in seconds as a function of the τ values for a risk of
The shadow-test method did a superior job controlling the speededness of the adaptive test. No matter the level of risk, speed, or ability, the presence of the constraints in Equations 14 to 17 guaranteed a maximum absolute difference between the time limit for the reference test and the limits required for the adaptive test smaller than 10 seconds. The two versions of the heuristic method had more difficulty controlling the degree of differential speededness, but all plots show results that tend to be much closer to those for the shadow-test method than for the condition without control. Generally, as expected, the results for the higher weights were somewhat better, especially for the low ability group.
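The quantity plotted in these comparisons can be sketched as follows. The sketch again assumes a lognormal approximation matched to the mean exp(−τ)·Σq and variance exp(−2τ)·Σr of the total time, and it takes the required time limit to be the point at which the approximate risk of running out of time equals π. The function names and the two pairs of sums are hypothetical and are not results from the study.

```python
import math
from statistics import NormalDist

def time_limit(sum_q, sum_r, tau, pi):
    """Time limit at which the approximate risk of running out of time is pi."""
    sigma2 = math.log(1.0 + sum_r / sum_q ** 2)
    mu = -tau + math.log(sum_q) - sigma2 / 2.0
    z = NormalDist().inv_cdf(1.0 - pi)
    return math.exp(mu + math.sqrt(sigma2) * z)

def limit_difference(reference, adaptive, tau, pi):
    """Reference-test limit minus the limit required by an adaptive form."""
    return time_limit(*reference, tau, pi) - time_limit(*adaptive, tau, pi)

# Hypothetical (sum_q, sum_r) pairs in seconds and seconds squared.
reference = (1500.0, 40000.0)
uncontrolled = (1650.0, 52000.0)
for pi in (0.05, 0.10, 0.15):
    for tau in (-0.70, -0.35, 0.0, 0.35, 0.70):
        print(pi, tau, round(limit_difference(reference, uncontrolled, tau, pi), 1))
```

Under this approximation a form that reproduces both sums of the reference test exactly has a difference of zero at every τ and π, so any remaining differences for the STA reflect only the tolerances allowed in the constraints.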
In principle, small differences between average levels of speededness do not imply anything at the level of the individual test takers. The standard deviations of the time limits required by the adaptive tests in Figure 3, however, confirm the differences in control between the four conditions. The variation between the limits for the condition without any control was huge, whereas the shadow-test method showed negligible variation (all standard deviations smaller than 10 seconds). Again, the results for the heuristic were closer to those for the shadow-test method than for the condition without any control. For all ability groups, the two versions of the heuristic yielded approximately equal standard deviations, with the exception of some improvement for the low-ability group for the version with the higher weights. Thus, the better results for the average differences in time limit by the latter were accompanied by a somewhat smaller variability across the test takers. The same general pattern of results for the four methods is shown by the 1st and 99th percentiles of their distributions of the time limit for the reference test minus the limits for the adaptive tests in Table 3. Note that the case for
First and 99th Percentiles of Distributions of the Time Limits for the Reference Test Minus the Limits for the Adaptive Tests in Seconds (
aStandardized weights. bWeights 50% greater.

a. Standard deviation of the required time limits for the adaptive test for the conditions with no control, shadow-test approach (STA), and two versions of the heuristic (π = .05).

b. Standard deviation of the required time limits for the adaptive test for the conditions with no control, STA, and two versions of the heuristic (π = .10).

c. Standard deviation of the required time limits for the adaptive test for the conditions with no control, STA, and two versions of the heuristic (π = .15).
Generally, the control of speededness resulted in negligible loss of statistical quality of the ability estimates. The plots in Figure 4 show the estimated bias and MSE functions for the two control methods that are essentially the same as for adaptive testing without any control. The only differences are a positive bias for the shadow-test method and a somewhat larger MSE for the two versions of the heuristic method at

Estimated bias and MSE as a function of θ for the conditions with no control, shadow-test approach (STA), and two versions of the heuristic.
Concluding Remarks
The empirical study showed how serious the effects of differences in item selection on the testing time can be if we do not control for differential speededness in adaptive testing. It also showed that the effects can be effectively removed by imposing two simple constraints on the RT parameters of the items during item selection.
Another novelty of the method is the absence of any estimation of the test taker’s actual speed during the test. Nor does it require any projection of the time needed by the test taker for the remaining portion of the test. The idea of matching the time characteristics of the adaptive tests with a reference test of proven quality is also practically convenient: It avoids the need to set a minimally acceptable level of speed for the test takers, which would be required for direct control of the risk in Equation 1. Although the STA method controlled differential speededness to within a few seconds, there appears to be hardly any price in the form of deteriorated ability estimation.
It is possible to compare the results for the current STA method with those for the earlier method in van der Linden (2009a). Both studies had an item pool for the adaptive test sampled from the same inventory of items and an identical setup of the simulated adaptive tests. The only differences existed in a much smaller item pool for the current study (185 vs. 350 items) in combination with a longer test (20 vs. 15 items). The former favors the earlier study; the latter our current study. The relevant comparison is between the average differences between the actual and required time limits in our current study (Figures 2 and 3) and the average time spent on the adaptive test in the earlier study (van der Linden, 2009a, figure 1; this figure also shows the time limits that were simulated). While the current study demonstrated control up to differences smaller than 10 s for each of the ability groups working at any of the levels of speed, the earlier study yielded much greater variation. In fact, only the differences for the test takers with ability
Finally, observe again that the two constraints are only on two different sums of item parameters. It is thus not necessary to match these parameters with the ones for the reference test on an item-by-item basis. The leeway generated by this relaxation explains why there was no significant loss of statistical precision in the ability estimates for the two methods of speededness control.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
