Abstract
Most computerized adaptive testing (CAT) programs do not allow test takers to review and change their responses because it could seriously deteriorate the efficiency of measurement and make tests vulnerable to manipulative test-taking strategies. Several modified testing methods have been developed that provide restricted review options while limiting the trade-off in CAT efficiency. The extent to which these methods provided test takers with options to review test items, however, still was quite limited. This study proposes the item pocket (IP) method, a new testing approach that allows test takers greater flexibility in changing their responses by eliminating restrictions that prevent them from moving across test sections to review their answers. A series of simulations were conducted to evaluate the robustness of the IP method against various manipulative test-taking strategies. Findings and implications of the study suggest that the IP method may be an effective solution for many CAT programs when the IP size and test time limit are properly set.
Computerized adaptive testing (CAT) is rapidly becoming a popular choice for test administration, favored by test developers and test takers alike (Martineau & Dean, 2010; Poggio, 2010), because it delivers more accurate score estimates and requires relatively less testing time than conventional paper-and-pencil-based tests (PBTs). CAT usually does not allow test takers to review test items and/or change their responses as they can on PBTs, however, because its item selection algorithm relies on an interim score estimate that is updated after each item administration. Previous research has shown that having an opportunity to change their responses can help test takers reduce their anxiety and stress levels during test administration (Lunz, Bergstrom, & Wright, 1992; Papanastasiou, 2002; Stocking, 1997; Wise, 1996), which can result in fewer mistakes made during testing and test scores that may more accurately reflect test takers’ true proficiency (Papanastasiou, 2002). This especially can be the case when the test is high stakes (Stocking, 1997) and/or tightly timed. Test takers almost always prefer to have the option to change responses even if they do not always exercise that option (Vispoel, Henderickson, & Bleiler, 2000). More importantly, research shows that a substantial proportion of test takers actually benefit from response changing in terms of score improvement (Benjamin, Cavell, & Shallenberger, 1984; Waddell & Blankenship, 1995; Wise, 1996).
Eliminating unnecessary causes of a test taker’s anxiety and stress during CAT administration is an important step toward creating an ideal test environment. Reducing test takers’ anxiety and stress by giving them sufficient control over the test administration is critically important, and not only because a high level of test anxiety can contribute to increased measurement errors: providing test takers with a better testing experience is, educationally and morally, the right thing to do.
Manipulative Test-Taking Strategies
One of the main practical objections to response changing in CAT programs is that permitting test takers to change responses without safeguards could open the door to systematic test-taking strategies that render CAT administration less efficient (Wainer, 1993; Wise, 1996) and/or result in biased score estimates (Vispoel, Rocklin, Wang, & Bleiler, 1999).
Wainer Strategy
For the manipulative test-taking strategy that Wainer (1993) introduced, a test taker could intentionally answer all items incorrectly on the first round to make the CAT system administer only easy items. After that, the test taker could go back to each item to review and change his or her responses to get a perfect score on the second round. Simulation studies confirmed that this so-called ‘Wainer strategy’ could result in extremely large measurement errors with a considerable risk for positively biased score estimates, especially for high-proficiency test takers who are capable of implementing the strategy successfully (Bowles & Pommerich, 2001; Gershon & Bergstrom, 1995; Stocking, 1997; Vispoel et al., 1999; Wang & Wingersky, 1992). In practice, the Wainer strategy is difficult to implement successfully—test takers run the risk of getting significantly underestimated scores if they fail to respond to all items correctly on the second round (Wise, 1996). Notwithstanding the risks, studies with live data showed that high-proficiency test takers had a good chance to profit from the Wainer strategy (Vispoel et al., 1999; Wise, 1996).
Kingsbury Strategy
Another manipulative test-taking strategy for CAT involves examinee judgment on item difficulty. As pointed out by Green, Bock, Humphreys, Linn, and Reckase (1984), test takers, knowing that the current item’s difficulty depends heavily on the response made to the previous item, could use the difficulty of the current item as a clue to the correctness of their response to the earlier item. If a test taker thinks the current item is slightly more difficult than the previous item, it would be reasonable for the test taker to think his or her answer to the previous item was correct. However, if the current item seems easier than the previous item, it may indicate that the previous item was answered incorrectly, and thus, the test taker can go back to the previous item and change the answer. In his simulation study, Kingsbury (1996) modeled such a test-taking strategy based on two strong assumptions: (a) Test takers were assumed to make guesses if they saw an item whose difficulty was higher than their true proficiency by 1.0 theta unit or more, and (b) test takers also were assumed to go back to the previous item and change their responses if the next item’s difficulty was lower than that of the previous item by 0.5 theta unit or more. The simulation results showed that examinees could benefit from this test-taking strategy depending on their proficiency level: the lower the examinees’ proficiency levels, the greater the possible benefits they could see.
Generalized Kingsbury (GK) Strategy
Wise, Finney, Enders, Freeman, and Severance (1999) expanded Kingsbury’s test-taking strategy so that test takers were assumed to have speculated on the difficulty level of the next item not only for items with guessed responses but also for all previous items. Wise et al. called this the “generalized Kingsbury (GK) strategy” and simulated it in more realistic conditions where probabilities for correctly judging item difficulties derived from real data were less than 1.0. The simulation results suggested small possible benefits to test takers from using the GK strategy, but the strategy offered no meaningful improvement in score estimates. Vispoel, Clough, Bleiler, Henderickson, and Ihrig (2002) used live testing data to examine the effect of the Kingsbury and GK strategies on score estimates under various conditions and found that for a majority of test takers in the study, the Kingsbury and GK strategies were ineffective. This was mainly because the accuracy of test takers’ item-difficulty rating was much lower than what Kingsbury (1996) and Wise et al. assumed: Test takers were only 61% successful in distinguishing the difficulty difference within each item pair. The study did find, however, that some test takers were still able to improve their scores using either of these strategies.
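The decision rules simulated by Kingsbury (1996) and Wise et al. (1999) can be sketched in a few lines of Python. The fragment below is an illustration only, under stated assumptions: the function name and its arguments are hypothetical, item difficulty is judged without error (the idealized case that the live-data studies found unrealistic), and the restriction of the original Kingsbury rule to guessed items is inferred from the descriptions above rather than taken from either study’s code.

```python
def flags_previous_item(prev_b, next_b, true_theta,
                        generalized=False, guess_margin=1.0, drop=0.5):
    """Return True if the simulated test taker would go back and change
    the previous response (hypothetical helper, error-free difficulty judgment).
    """
    # The next item looks easier than the previous one by at least `drop`.
    next_is_easier = (prev_b - next_b) >= drop
    if generalized:
        # GK strategy: speculate after every item, guessed or not.
        return next_is_easier
    # Kingsbury strategy: only items answered by guessing are candidates,
    # i.e., items at least `guess_margin` harder than true proficiency.
    prev_was_guess = (prev_b - true_theta) >= guess_margin
    return prev_was_guess and next_is_easier
```

With the default thresholds, an item 1.5 theta units above a test taker’s proficiency that is followed by an item 0.5 units easier would be flagged under both strategies, whereas an item only 0.5 units above proficiency would be flagged under the GK strategy alone.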
CAT With Restricted Revision Options
To minimize the effect of the manipulative test-taking strategies, yet still provide test takers with reasonable options to review and change their responses, Stocking (1997) proposed three different models for limiting test takers’ practices in item review and response change. In the first model, test takers were allowed to change their responses at the end of the test but were limited to a maximum number of revisions. According to her simulation results, Stocking’s model effectively reduced the impact of the Wainer strategy, and the conditional standard error of measurement (CSEM) and bias were close to those with a zero-revision condition when the number of revisions was limited to 2 out of 28 test items. When the number of allowable revisions was greater than 2, however (in Stocking, 1997, the studied conditions were 0, 2, 7, 14, and 28 revisions for the 28-item test), Stocking’s model failed to control the effect of the Wainer strategy.
In Stocking’s second model, the test consisted of multiple separately timed sections, and test takers were allowed to revise their responses freely within each section. The simulation results showed that administering the test in two separate sections substantially reduced the effect of the Wainer strategy on the CSEM and bias. When the item revision option was available with four or more separated sections, the effect of the Wainer strategy was almost completely negated. Stocking’s third model, in which test takers were allowed to revise responses only within each item set associated with the common stimulus, showed its robustness against the Wainer strategy as well. The main disadvantage of Stocking’s third model, however, was that test takers were not allowed to revise responses for discrete (i.e., nonset) items (Stocking, 1997).
In their study using live test data, Vispoel et al. (2000) confirmed that Stocking’s second model (the restricted review within each block) successfully reduced the possible effect of the Wainer strategy. In addition, the majority of examinees (98.4%) in this study felt they had adequate opportunity to review and change their responses. In later studies by Vispoel and his colleagues, which also used live test data, the Kingsbury and GK strategies proved to be ineffective when the restricted review was permitted (Vispoel et al., 2002; Vispoel, Clough, & Bleiler, 2005).
Reviewing the overall results from Stocking (1997) and Vispoel et al. (2000, 2002, 2005), the restricted review approach (especially Stocking’s second model) seems to be one of several solutions that could allow test takers to change their responses during CAT to a certain degree without incurring unacceptable levels of sacrifice in measurement precision and efficiency. A serious limitation of the restricted review approach, however, is that allowing test takers to review and change their responses only within each section eventually causes them to involuntarily surrender access to items in the current section to proceed to the next section. If test sections are timed strictly and separately from one another as Stocking originally proposed, test takers would not necessarily feel pressure to move on to the next section before the section time expired. In a majority of operational CAT programs, however, small test sections (or item sets) usually are not timed separately. As a result, every time test takers finish one section (unless it is the last section), they must decide whether it is better to spend time revisiting test items in the current section and improving their initial responses or use their time to complete the remaining test sections. Being forced repeatedly to make such a decision, not knowing exactly how much time will be needed to complete the remaining test sections, very likely causes test takers the same kind of test anxiety as that observed among test takers in a regular CAT administration with no response revision option.
Another downside of the restricted review approach is that it still does not allow test takers to skip items (unless the items within each section are not adaptively administered). If test takers want to proceed further, even within one section, they must answer every item because the CAT program selects the following item based on the test taker’s initial response to the current item. The inability to skip items may not be terribly bothersome for test takers because they can simply answer randomly and move on to the next item, knowing they can return to the item before advancing to the next section. In terms of measurement efficiency, however, an item selection process that is based heavily on initial responses that do not necessarily reflect test takers’ best effort could seriously erode CAT’s level of adaptiveness. Moreover, some test takers might try different initial responses to find clues on correct response based on the difference in item difficulty between item pairs, which was the source of concern for Kingsbury (1996) and Wise et al. (1999). Vispoel et al. (2002, 2005) observed no meaningful gain from practicing Kingsbury and GK strategies in their CATs with the restricted review option, but their finding was based on results from low-stakes exams. In high-stakes CAT programs, test takers might be tempted to follow the Kingsbury or GK strategies with the intent of improving their scores, and CAT with the restricted review approach technically is vulnerable to successful implementations of either of these strategies.
Item Pocket (IP) Method
To address the shortcomings of the restricted review approach, this study proposes a new approach for allowing response change. This method, referred to as the “item pocket” (IP) method, provides test takers with IPs into which they can place items for later review and response change. Test takers can skip answering items by putting them in the IP. Once an item is placed in the IP, a test taker can go back to it anytime during the test until the test taker submits his or her final answer for the item. For example, in the CAT interface shown in Figure 1, a test taker is reviewing Item 4 among the three items in the IP. If a test taker wants to take Item 4 out of the IP, he or she “confirms” the final answer for it. Once removed from the IP, Item 4 cannot be placed back in. Test takers must confirm final answers for any items in the IP to empty the IPs before the test time expires or face the prospect that any remaining items in the IP will be counted as incorrect responses. CAT developers can determine the IP size based on the test length and time limit (more discussion on the IP size follows later in this article). The IP size was five in the example shown in Figure 1. As for CAT item selection, only items outside the pocket (in other words, items with final responses) are included in the interim score estimation procedure.

Figure 1. Example of a test interface for a CAT with the item pocket method.
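The pocket rules described above (limited capacity, no re-entry once an answer is confirmed, and leftover items scored as incorrect at time-out) can be captured in a small container. The following Python class is a minimal sketch, not the implementation used in the study; the class name, method names, and the choice to raise exceptions on invalid moves are all assumptions made for illustration.

```python
class ItemPocket:
    """Hypothetical container mirroring the IP rules described in the text."""

    def __init__(self, size):
        self.size = size      # maximum number of items held at one time
        self.pending = {}     # item_id -> tentative answer (None if skipped)
        self.final = {}       # item_id -> confirmed final answer

    def is_full(self):
        return len(self.pending) >= self.size

    def place(self, item_id, tentative=None):
        """Set an item aside; a finalized item can never re-enter."""
        if item_id in self.final:
            raise ValueError("item already has a final answer")
        if item_id not in self.pending and self.is_full():
            raise ValueError("item pocket is full")
        self.pending[item_id] = tentative

    def confirm(self, item_id, answer):
        """Lock in a final answer, removing the item from the pocket if present."""
        if item_id in self.final:
            raise ValueError("final answers cannot be changed")
        self.pending.pop(item_id, None)
        self.final[item_id] = answer

    def scored_items(self):
        """Only finalized items feed the interim score estimate."""
        return dict(self.final)

    def expire(self):
        """At time-out, return leftover pocket items, to be scored as incorrect."""
        leftovers = sorted(self.pending)
        self.pending.clear()
        return leftovers
```

In this sketch, a test taker who pockets Items 4 and 7 in a pocket of size 2 cannot pocket a third item until one of them is confirmed, and a confirmed item is rejected if the test taker tries to pocket it again.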
The IP method has several advantages over the restricted review approach. First, there is no restriction on the number and range of items that can be revisited for changing responses. In contrast to the restricted review approach, the IP method eliminates the need to break a CAT into smaller, separately timed sections, and test takers can return to any item as long as the item is in the IP. Even when the IP is full and a test taker must remove an item to make room for a new one, he or she still controls which item to remove. From a psychological point of view, this is a significant improvement. Test takers’ feeling of loss of control over the test, a major source of their anxiety and stress (Olea, Revuelta, Ximénez, & Abad, 2000; Stocking, 1997), potentially may be reduced with the IP method.
Another merit of the IP method is that test takers are not forced to provide an answer just to move forward but instead can skip items simply by adding them to the IP (as many as the IP size allows). The reduced anxiety by not being forced to answer each item before proceeding to the next would be one of the IP method’s possible direct psychological benefits, but the IP method’s psychometric benefits are equally noteworthy. In the IP method, test takers’ initial responses for the items in the IP, including skipped items, have no impact on item selection. The CAT system excludes items in the IP when computing interim score estimates. Therefore, CAT item selection is always based on test takers’ final responses, and as a result, CAT’s level of adaptiveness can be retained effectively. As test takers cannot change their answers once they are finalized—in other words, once items are removed from the IP with final answers—any attempt to apply the Kingsbury or GK strategies becomes ineffective.
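The exclusion of pocketed items from interim scoring can be illustrated with a short Python sketch. It is written under several assumptions that are not taken from the article: the function and argument names are hypothetical, the logistic scaling constant is omitted, and a simple grid search over the truncated θ range stands in for whatever numerical routine an operational MLE would use.

```python
import math


def p3pl(theta, a, b, c):
    """3PL probability of a correct response (scaling constant omitted)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def interim_estimate(administered, pocket_ids, lo=-3.0, hi=3.0,
                     step=0.01, fallback=0.0):
    """Grid-search MLE over finalized items only.

    administered: dict item_id -> (a, b, c, u), with u = 1 correct, 0 incorrect.
    pocket_ids:   ids still in the pocket; their tentative answers are ignored.
    """
    scored = [v for k, v in administered.items() if k not in pocket_ids]
    if not scored:
        return fallback  # nothing finalized yet, so keep the starting value

    def loglik(theta):
        total = 0.0
        for a, b, c, u in scored:
            p = p3pl(theta, a, b, c)
            total += math.log(p) if u == 1 else math.log(1.0 - p)
        return total

    n_steps = int(round((hi - lo) / step))
    grid = [lo + k * step for k in range(n_steps + 1)]
    return max(grid, key=loglik)
```

Because the tentative response to a pocketed item never enters the likelihood, a deliberately wrong placeholder answer cannot pull the interim estimate downward in this sketch; only confirmed answers move it.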
Compared with the restricted review approach or traditional CAT, it is easy to appreciate the possible psychological benefits that accrue to test takers using the new IP method given the greater degree to which it allows test takers to revisit items and change responses. The fact that Kingsbury and GK strategies naturally become ineffective with the IP method also makes this method appealing to test developers. As yet unknown, however, is whether the new IP method is robust enough to immobilize other manipulative test-taking strategies, such as the Wainer (1993) strategy. A series of simulation studies was therefore conducted to examine the robustness of the IP method against worst-case scenarios of test-taking strategy. The simulation studies also evaluated the effect of IP size.
Simulation Study
Research Design
For the simulation, 500 items were chosen from a real operational item bank built for a CAT-administered exam used for admissions to graduate-level educational programs. As shown in Table 1, a weak correlation was observed between a- and b-parameter values among the items in the item pool. All items were in multiple-choice format with five answer options. The items were calibrated using the three-parameter logistic model (3PLM), and the summary statistics for the item pool are reported in Table 1. In total, 10,000 simulated test takers were sampled from a normal distribution with a mean of 0 and a standard deviation of 1. Each test taker was administered a fixed-length CAT with 40 items.
Table 1. Descriptive Statistics for the Item Pool (500 Items).
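For reference, the 3PLM mentioned above is commonly written as shown below. The notation is the conventional one and is an assumption here, not taken from this article: a_i, b_i, and c_i denote the discrimination, difficulty, and pseudo-guessing parameters of item i, and D is a scaling constant (usually 1 or 1.702) whose value the text does not state.

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D\, a_i\, (\theta - b_i)\right]}
```

Under this model, the probability of a correct response approaches c_i, not zero, as θ decreases, which is how the model accommodates guessing on multiple-choice items.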
For the CAT administration, the maximized Fisher information (MFI) method was used as the item selection criterion, and the interim and final scores were estimated using the maximum likelihood estimation (MLE) method. The score estimates were truncated to the range from −3 to 3. The initial score estimate was randomly drawn from a uniform distribution that ranged from −0.5 to 0.5. During the first five item administrations, the absolute value of change in the interim score estimates from one item to another was limited so as not to exceed 1.0. This prevented excessive fluctuations in item selection in CAT’s early stage. This restriction was particularly important here because the use of the MLE method for score estimation could result in extreme values when all responses are the same (i.e., all 0s or all 1s), as often occurs in the early stage of CAT. In terms of item exposure control, the simulation was conducted under two different conditions: (a) no exposure control and (b) the Sympson and Hetter (1985) method. For the exposure control using the Sympson and Hetter method, the target exposure rate was set to 0.20, and the exposure parameter for each item was derived after 40 iterative simulations. Content balancing was ignored to eliminate extraneous factors and to make the implications from the study as generalizable as possible.
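The item selection and score-bounding rules described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the SimulCAT implementation: the function names are hypothetical, the scaling constant D defaults to 1, the pool is a plain list of (a, b, c) tuples, and Sympson and Hetter exposure control is left out entirely.

```python
import math


def p3pl(theta, a, b, c, D=1.0):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info3pl(theta, a, b, c, D=1.0):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_mfi(theta, pool, used):
    """Index of the unused item with maximum information at the interim theta."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info3pl(theta, *pool[i]))


def bounded_update(prev, new, n_administered, max_step=1.0, early=5,
                   lo=-3.0, hi=3.0):
    """Cap the change in the interim estimate during the first `early`
    administrations, then truncate the estimate to [lo, hi]."""
    if n_administered <= early:
        new = max(prev - max_step, min(prev + max_step, new))
    return max(lo, min(hi, new))
```

For example, if the second response would push an interim estimate from 0.25 to the upper bound of 3, `bounded_update` returns 1.25 instead, so the next item is chosen near 1.25 and not at the extreme of the scale.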
The IP method was implemented under three different conditions in terms of the IP size (i.e., the maximum number of items allowed in the IP at any one time). The studied IP sizes were 2, 4, and 6. To serve as a baseline, the IP method also was implemented with no IP condition, essentially the same as a conventional CAT that does not allow reviewing and changing.
An unlikely worst-case scenario using the Wainer-like manipulative test-taking strategy, as well as a more realistic scenario reflecting observations from literature in a probabilistic model, was simulated with the IP method to evaluate possible impacts of test-taker review and response change on measurement precision. Under each scenario, the CSEM using the mean absolute error (MAE) of θ estimation and bias across the score levels with a 0.5 interval on the θ scale were evaluated along with IP usages. The simulation was replicated 25 times, and the results were averaged across replications. The CAT administration and simulation were conducted using a modified version of SimulCAT, a comprehensive computer software package for CAT simulation, written by Han (2012).
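The conditional evaluation described above amounts to grouping simulated test takers by true θ and summarizing the estimation error within each group. The Python sketch below shows one way to do this; the function name is hypothetical, and aligning the bins on multiples of 0.5 is an assumption, since the text specifies only the interval width.

```python
import math
from collections import defaultdict


def conditional_error(true_thetas, estimates, width=0.5):
    """Bias and mean absolute error of theta estimates within theta bins.

    Each bin is keyed by its lower bound and covers [lower, lower + width).
    """
    bins = defaultdict(list)
    for t, e in zip(true_thetas, estimates):
        lower = math.floor(t / width) * width
        bins[lower].append(e - t)  # signed estimation error
    return {
        lower: {
            "bias": sum(errs) / len(errs),
            "mae": sum(abs(x) for x in errs) / len(errs),
            "n": len(errs),
        }
        for lower, errs in sorted(bins.items())
    }
```

A positive bias in a bin indicates that test takers in that θ range were overestimated on average, which is the quantity of interest when judging whether a test-taking strategy pays off.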
Simulation Study 1: Wainer-Like Test-Taking Strategy
To be comparable with earlier research, this study mimicked the “unrealistic worst case” scenario used in Stocking (1997). The study simulated test takers following the Wainer strategy to the extent possible within the IP system. To recap, the Wainer strategy assumes that test takers intentionally keep their interim score estimates low by providing incorrect initial answers to see more items that are easier than the test takers’ proficiency level. Supposedly, this gives them a better chance of answering those items correctly when they go back to change their initial responses. Within the IP system, one expects the effect of the Wainer strategy to be minimized because the IP size is limited and the intentionally incorrect initial responses for items in the IP do not influence item selection. Test takers might still be able to make their interim score estimates negatively biased by postponing their answers to as many easy items as the IP size allows, so the strategy’s impact on the final score estimates should be examined. The study simulated this situation assuming that all test takers would mechanically implement such a strategy throughout the test administration. This manipulative test-taking strategy is referred to in this article as “Test-Taking Strategy 1” (TTS1).
Simulation Study 2: Test-Taking Scenario With IP and No Time Limit
Simulation Study 1 was designed to provide knowledge about the robustness of the new IP method against test takers’ systematic attempts to use the Wainer-like manipulative test-taking strategy (TTS1). Such a gaming strategy, however, is not the way the IP method was intended to be used, nor was it expected to happen often in practice. According to Vispoel et al. (2000), when examinees were allowed to review their responses, their most frequently observed test-taking strategy could be described as follows: “I mark some of my answers that I review later” (p. 34). This test-taking strategy varied little from the ways examinees would respond to items on other tests (Vispoel et al., 2000). In fact, marking some answers (or items) for later review involves essentially the same process as putting items in the IP within the IP system. Although Vispoel et al. (2000) did not report which items were frequently marked for later review, it would be reasonable to assume that, within the IP system, examinees were likely to use the IP to set aside the items they found challenging (i.e., they were not confident about their answers). This allowed them a later opportunity to answer the items instead of giving up on the items by locking in their initial answers. This test-taking strategy, referred to in this article as “Test-Taking Strategy 2” (TTS2), is a legitimate use of the IP system (as opposed to the manipulative Wainer and Kingsbury gaming strategies) and, more important, represents what it was designed for—a less restrictive reviewing option for test takers. The second simulation study, therefore, was conducted to evaluate the performance of the IP method under a more realistic situation with TTS2.
Simulating test takers’ reviewing behavior involves several strong assumptions, and so it is important to make those assumptions as realistic as possible. Under TTS2, it was assumed that test takers first evaluated the relative difficulty of each item against their proficiency level. If the item difficulty was challenging given the test takers’ proficiency, it was assumed they would put the item in the IP. For items that were not challenging according to this definition, test takers were assumed to give their final answers and move on, not necessarily using the IP. For purposes of the simulation, the item was viewed as challenging if the difficulty (b-parameter value) of an item exceeded a test taker’s true proficiency (θ) by 0.5. In practice, researchers often find large errors associated with a test taker’s ratings (or judgment) of item difficulty, so it is important to incorporate that assumption in the simulation algorithm to ensure realistic results. Vispoel et al. (2005) observed that test takers showed success rates between 46% and 61% when asked to compare the difficulty of a pair of items when the difference in difficulty was less than 0.50. When the difference was larger than 0.50, the average rating success was between 63% and 82% across the studied conditions (Vispoel et al., 2005). In this study, test takers do not compare a pair of items but instead evaluate the difficulty of an item using their proficiency as a baseline to decide whether they want to put the item in the IP. If the difference between test takers’ proficiency and the item difficulty was less than 0.50, test takers were simulated to find the item challenging 50% of the time. If the difference between test-taker proficiency and the item difficulty was greater than or equal to 0.50, test takers were assumed to find the item challenging 70% of the time. 
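The probabilistic judgment rule for TTS2 can be sketched in Python as follows. This is one reading of the description above and rests on assumptions: the function name is hypothetical, the 50% and 70% figures are treated as rating-success rates (so that a clearly easy item is perceived as challenging 30% of the time), and the boundary case of a difference of exactly 0.5 is treated as challenging.

```python
import random


def perceives_challenging(b, theta, rng, margin=0.5,
                          acc_near=0.5, acc_far=0.7):
    """Simulate whether a test taker judges an item to be challenging.

    rng: any object with a random() method returning a float in [0, 1),
         for example random.Random(seed).
    """
    gap = b - theta
    truly_challenging = gap >= margin
    # Judgments are more reliable when item and proficiency are far apart.
    accuracy = acc_far if abs(gap) >= margin else acc_near
    judged_correctly = rng.random() < accuracy
    return truly_challenging if judged_correctly else not truly_challenging
```

Calling `perceives_challenging(1.0, 0.0, random.Random(1))` repeatedly with the same generator yields `True` in roughly 70% of draws, whereas an item whose difficulty equals the test taker’s proficiency is flagged in roughly half of the draws.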
If all IPs were full and a test taker found the current item challenging and was considering putting that item in the pocket, it was assumed that the test taker compared the current item with the easiest items in the IP. If the test taker found an item in the IP that was easier than the current item, the test taker was assumed to finalize his or her answer to the easiest item and remove it from the IP to make room for the current item. During the comparison between the easiest item in the IP and the current item, test-taker error associated with difficulty ratings was also simulated in the same way as the errors associated with the test taker’s decision on IP use (i.e., rating success at a rate of 0.50 if the difference in difficulty between the easiest and the current items was less than 0.50 and rating success at a rate of 0.70 if the difference was greater than or equal to 0.50).
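The full-pocket rule described above can be sketched in the same style. Again, this is an illustrative reading and not the study’s code: the function name and the dictionary layout are hypothetical, and the rating-success rates are applied to the comparison between the current item and the easiest pocketed item.

```python
def item_to_release(pocket_b, current_b, rng, margin=0.5,
                    acc_near=0.5, acc_far=0.7):
    """Return the id of the pocket item to finalize, or None to leave the
    pocket unchanged.

    pocket_b: dict item_id -> b parameter of the items in a full pocket.
    rng:      any object with a random() method returning a float in [0, 1).
    """
    easiest_id = min(pocket_b, key=pocket_b.get)
    gap = current_b - pocket_b[easiest_id]
    truly_easier = gap > 0  # the easiest pocket item is easier than the new one
    accuracy = acc_far if abs(gap) >= margin else acc_near
    judged_correctly = rng.random() < accuracy
    perceived_easier = truly_easier if judged_correctly else not truly_easier
    return easiest_id if perceived_easier else None
```

When the function returns an item id, the simulated test taker confirms a final answer for that item and pockets the current item in its place; when it returns `None`, the current item is answered directly.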
Simulation Study 2 was designed using research findings based on real empirical test data (Vispoel et al., 2000, 2005) to account for test-taker errors in item-difficulty rating. In spite of this parameter, test takers’ patterns observed in this simulation of the IP system were still considerably exaggerated due to the fact that the simulation study imposed no test time limit. In other words, the simulation did not take into account any speededness or test-taker fatigue. In addition, test takers in the simulation were modeled to take as much time as needed to review the items utilizing the IP system. For most operational real-world CAT programs, however, tests are strictly timed, often slightly speeded for some test takers, and subject to severe penalties on final scores (according to the number of omitted items) if test takers do not complete all items within the time limit. Similar to other simulation studies, which are essentially based on probabilistic models, this study was not able to recreate all possible testing behaviors that would be observed from real people, so readers should bear this in mind when they examine the findings.
A major benefit of this study’s research design using “worst-case” scenarios—essentially an extreme stress test of the new IP method—was the implication that results would be even more useful in real CAT situations because the findings would indicate the bottom-line performance of the new method. This article presents direct interpretations of observed simulation results under the studied “worst-case” conditions as well as a comprehensive discussion on their important implications for real CAT programs under more common operational testing situations.
Results and Discussion
As mentioned, the IP method simulation was conducted under two different exposure control conditions (no exposure control vs. Sympson and Hetter method). Because no meaningful difference was observed in CSEM, bias, or IP usage between the two conditions, this article presents only the cases with the Sympson and Hetter exposure control.
Under TTS1, which essentially was a modification of the Wainer strategy for the IP system, the CSEM and bias displayed in Figure 2 (which focuses on the −2 ≤ θ < 2 range to show the differences among the studied conditions more clearly) showed either no change or minimal changes with the IP system compared with those from the CAT with no response change (the condition with IP size = 0). The increase in CSEM due to the IP system was less than 0.10 throughout most of the θ range (−2.5 ≤ θ < 2.5) even when the IP size was 6, which was the largest IP size in this study. This was similar to findings observed in Stocking’s (1997) Restricted Model 1 (with the limit of two response changes) and Model 2 (with four or more separately timed sections). In terms of the average conditional bias in the score estimates (seen in the middle of Figure 2), test takers did not achieve any meaningful positive gain by implementing this Wainer-like strategy. In fact, higher proficiency test takers (θ > 0.5) tended to have slightly underestimated scores with this manipulative test-taking strategy because, as Wise (1996) pointed out, final score estimates could drop substantially if a test taker failed to respond to all items correctly when attempting the Wainer strategy. Therefore, it seems safe to conclude that the IP method was very robust against test takers’ attempts to implement the Wainer strategy even under a worst-case scenario. Figure 2 also shows the total number of items that test takers of each proficiency level placed in the IP (i.e., IP usage).

Figure 2. Conditional bias, CSEM, and IP usage under TTS1.
In Simulation Study 2, where it was assumed test takers would use the IP system systematically for challenging items (i.e., implementing TTS2), the change in the MAE for θ estimates due to the IP system was significant but minor in magnitude: the MAE increased by about 0.069, 0.083, and 0.087 when the IP size was 2, 4, and 6, respectively. Considering a typical standard error of θ estimation (often between 0.30 and 0.40 for many CAT programs), the increase in the MAE due to the use of the IP system, which was well below 0.10, seemed acceptable. Increases seen in the average bias in θ estimates due to the use of the IP system were 0.057, 0.075, and 0.080 when the IP size was 2, 4, and 6, respectively. Looking at the observed patterns of the MAE and bias together across the different IP sizes, it was apparent that the increase in MAE was due mainly to the systematic bias in the θ estimates. Although both the MAE and bias statistics showed that the impact of the IP system was negligible on average, it is also important to evaluate CSEM and bias across θ levels.
Figure 3 shows the details of the CSEM and conditional bias. The pattern of CSEM was similar to the pattern of conditional bias because, as noted earlier, the increase in CSEM was mainly due to the systematic bias. For higher proficiency test takers (θ > 0.5), TTS2 was ineffective in gaining any meaningful score improvement. However, for test takers at lower proficiency levels (e.g., in the θ < −0.5 range), a saturated, excessive use of the IP system with TTS2 could result in a positive score bias. As the result suggested a possible vulnerability of the IP system against an unrealistic case of TTS2, it was important to understand its implication in real-world situations.

Figure 3. Conditional bias, CSEM, and IP usage under TTS2.
Because the IP usage (i.e., the average number of items put in the IP) shown in the bottom of Figure 3 did not necessarily indicate the time and effort real test takers would expend taking advantage of the IP system under TTS2, this study analyzed the average conditional frequency of item review processes, which included comparing the difficulty of each new challenging item with the easiest preexisting item in the IP. As shown in Figure 4, the higher the test-taker proficiency level, the less time was likely to be needed to revisit items in the IP under the studied condition. There were two main reasons for this. First, given the initial θ value around zero, test takers with below-average proficiency were likely to see more items of challenging difficulty and, hence, use the IP system more frequently. Second, test takers with extremely high proficiency (e.g., θ > 1.5) saw fewer items of challenging difficulty and, hence, used the IP system less frequently than test takers of below-average proficiency. They had no need to compare new challenging items with preexisting items in the IP because often there was no item in the IP that needed to be removed to make room for a new, harder item. As a result, higher proficiency test takers reviewed items less frequently under TTS2. For example, for highly proficient individuals (θ > 1.5), the total load of item review tasks amounted to fewer than two occasions even when the IP size was six. For real CAT programs, test time limits often are set at a level at which about 80% to 90% of average test takers can finish the last item (Talento-Miller, Guo, & Han, 2010). This reduces the amount of wasted time both for test takers and for the CAT administration while minimizing possible speededness. In such a case, it is not unusual for highly proficient test takers to have a decent amount of time left to review the small number of items in the IP as needed.

Figure 4. Frequency of test taker revisiting (easiest) item under TTS2.
Although TTS2 could be a feasible strategy for high-proficiency groups in real, timed CAT administrations, use of the IP system would not necessarily result in positively biased scores. As seen in the middle of Figure 3, the bias in final θ estimation was next to nothing for highly proficient test takers. However, for groups at lower proficiency levels (e.g., θ < −1.0), use of the TTS2 strategy theoretically could result in slightly biased scores. Based on the average loads of review tasks that these test takers would need to process, however, it appears TTS2 would be an extremely unrealistic strategy for most CAT programs with test time limits (Figure 4). Although it was assumed in the TTS2 simulation that each new challenging item would be compared only with the easiest item in the IP, in reality test takers likely will need to revisit not only the easiest but also several (if not all) items in the IP. Because items in the IP are not ordered by item difficulty, test takers would need to determine which ones are the easiest among them. Determining the easiest item, which was not addressed in this simulation, could add a significant load to the item review process in practice, making the actual item review load much heavier than what was shown in Figure 4. In essence, test takers at lower proficiency levels might have to spend more time and effort analyzing item difficulties than solving the problems themselves to obtain a final score with a meaningfully positive bias.
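The comparison rule assumed in the TTS2 simulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulation code used in the study: the function name, the `is_challenging` predicate (how a test taker decides that an item is challenging), and the example difficulty values are hypothetical, and the adaptive item selection itself is omitted. The sketch also grants the test taker perfect knowledge of item difficulties, which the discussion above treats as unrealistic.

```python
import heapq


def simulate_tts2_pocket(difficulties, ip_size, is_challenging):
    """Apply a TTS2-like pocketing rule to a sequence of item difficulties.

    Hypothetical helper: returns (indices left in the pocket,
    number of comparison events, total number of items ever pocketed).
    """
    if ip_size < 1:
        raise ValueError("ip_size must be at least 1")
    pocket = []         # min-heap of (difficulty, index); the root is the easiest pocketed item
    comparisons = 0     # times a new challenging item was compared with the easiest pocketed item
    pocketed_total = 0  # counts every placement, including replacements
    for idx, b in enumerate(difficulties):
        if not is_challenging(b):
            continue    # item is answered right away and never pocketed
        if len(pocket) < ip_size:
            heapq.heappush(pocket, (b, idx))  # room left: no comparison needed
            pocketed_total += 1
            continue
        comparisons += 1                      # pocket is full: revisit the easiest item
        if b > pocket[0][0]:
            # The easiest pocketed item is answered and removed; the harder item takes its place.
            heapq.heapreplace(pocket, (b, idx))
            pocketed_total += 1
    return sorted(i for _, i in pocket), comparisons, pocketed_total


# Made-up difficulties; an item counts as "challenging" here if b > 0.2 (an assumption).
result = simulate_tts2_pocket([0.0, 0.5, -1.0, 1.2, 0.3, 2.0], 2, lambda b: b > 0.2)
print(result)  # ([3, 5], 2, 3)
```

In this toy run the pocket fills after two challenging items, and each later challenging item triggers one comparison. A real test taker would also have to work out which pocketed item is the easiest, a cost the heap hides entirely.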
If test time is unlimited, test takers theoretically can review items many times. However, if time is unlimited, an option to review and change responses later would not be necessary (or desired) for CAT in the first place. In that instance, test takers could spend as much time as needed on each item, thus minimizing any test anxiety resulting from having to rush to the next item. Therefore, it is reasonable to assume that the IP system is most likely to be employed and useful for CAT programs that are strictly and tightly timed, a condition that applies to most operational real-world CAT administrations. Under this assumption, typical test takers would have only a fraction of test time left for reviewing items within the IP system. Hence, a strategy such as TTS2 that test takers must mechanically implement would be impossible to complete in real time. Knowing that many CAT programs apply severe penalties on final scores when test takers fail to complete all items within a set time limit, test takers would be discouraged from pushing TTS2 to the extreme, because it would do more harm than good to their final scores.
Aside from the test time limit issue, it is important to understand that analyzing and comparing item difficulties to game a CAT system is a complicated and difficult process, one with a fairly poor success rate observed even for highly proficient test takers (Olea et al., 2000; Vispoel, 1998; Vispoel et al., 2002; Wise et al., 1999). Ironically, test takers who truly needed to review items most often under TTS2 were those at the lowest proficiency level, and the quality of their real-world performance in analyzing item difficulties is expected to be very poor, unlike the simulated study conditions in which all test takers performed the item-difficulty analyses with 100% accuracy. Thus, it is extremely unlikely that the magnitude of score bias observed with low-proficiency test takers in this simulation of the IP system could be replicated in the real world, even with unlimited test time.
Most test takers would not benefit from practicing TTS2 in typical, timed CAT programs, yet it presents some extreme scenarios worth thinking about. As shown at the bottom of Figure 3, test takers with extremely low proficiency (e.g., θ = −2.0) were likely to put fewer than three items into the IP during the test when the IP size was two. The items they most often placed in the IP were those administered at the beginning of CAT testing. For test takers of extremely low proficiency, the first few items, selected based on initial (randomly chosen) θ around 0, were the most difficult among all administered items. Because the interim score estimates for these low-proficiency test takers quickly declined from the initial θ, the first few items placed in the IP mostly remained the same throughout the CAT administration under TTS2. Therefore, test takers with very low proficiency levels may benefit from being coached (e.g., by test prep institutions) to put the first items into the IP, solve all other items at once, and skip comparing the new items with those in the IP. Once the test takers reach the last item, they can revisit the first few items in the IP and submit their final answers. Such a manipulative modification of TTS2, however, would yield only a negligible positive bias (<0.2 when IP size = 2) at the extremely low θ area, which for most CAT programs is far from the main population of consideration. This issue, however, could become more serious if the IP size were very large (e.g., more than 20% of the test length). Therefore, it is important to determine the right size for the IP (see further discussion of this later in the article). Moreover, assuming that test takers at the bottom-proficiency level are important for a CAT program, one could suggest lowering the initial θ value for item selection to reduce possible bias even further.
It bears repeating that the simulation study could not possibly measure various psychological effects of the IP method on real human test takers; the study’s main purpose was to understand and determine whether any negative effects of the IP method on measurement efficiency could be controlled sufficiently to be acceptable for real CAT programs under worst-case conditions.
Conclusion
Because of its efficiency, CAT has become widely accepted in the field of educational measurement; however, test takers’ dissatisfaction over not being allowed to review and change their responses (Baghi, Gabrys, & Ferrara, 1991; Legg & Buhr, 1992; Vispoel, 1998; Wise, 1996) has yet to be satisfactorily addressed. On one hand, reducing unnecessary test anxiety for test takers by allowing them to review and change their responses during a CAT administration is believed by some to have a positive effect on test validity (Olea et al., 2000; Papanastasiou, 2002; Stocking, 1997). On the other hand, the trade-off in CAT efficiency is simply unacceptable for most operational CAT programs, especially due to the possibility that test takers would attempt to game the CAT system (Wainer, 1993; Wise, 1996). Stocking’s (1997) restricted review models brought forward effective means of controlling the impact of some manipulative test-taking strategies such as the Wainer strategy while offering test takers a limited ability to review.
The new IP method presented in this study aimed to reduce the restrictions in reviewing even further and at the same time improve the robustness of the CAT system against manipulative test-taking strategies. With the IP system, test takers can go back and forth across sections to review items in the IP, and test developers do not need to time each section separately. The simulation results (with TTS1) showed that the IP method was as robust against the Wainer strategy as Stocking’s restricted review models. Moreover, unlike the restricted review models, the IP method is systematically immune to the Kingsbury and GK strategies because item selection is not influenced by items in the IP. The simulation study under TTS2 revealed that test takers with above-average proficiency would not benefit from saturated, excessive use of the IP system.
The simulation did reveal the possibility of slight score biases (<0.2 when IP size = 2) for test takers with very low proficiency (θ = −2) under the studied condition when abnormally excessive use of the IP system (TTS2) occurred. But, given the evidence of the IP method’s robustness against the Wainer, Kingsbury, and GK strategies as well as against the excessive use of the IP system seen in TTS2, it is highly unlikely that test takers will be tempted to waste their time and effort on such manipulative test-taking strategies in a real, tightly timed operational CAT. The unrealistic nature of the simulation condition (no time limit, no fatigue, and 100% accuracy in determining the easiest item) combined with the huge loads of item review that test takers would have to process to complete the test would seem to work against that. Even in the worst-case scenario under the studied conditions, the sacrifice in measurement efficiency was about one or two items at the extremely low proficiency level. Some CAT programs may be willing to accept such a trade-off to provide test takers (or clients) with a better testing experience, especially when the test market is customer (i.e., test taker or client) driven.
The process for determining a proper IP size is somewhat ambiguous but critical because it ultimately decides the IP system’s flexibility and affects measurement efficiency. If the IP size is too small, test takers’ feelings of a loss of control over the test may persist because their ability to review items is too limited. However, if the IP size is too large, the CAT optimality would decrease because items in the IP would not contribute information needed for item selection. IP size, however, is not the only factor determining IP usage. For example, if a CAT is tightly timed, test takers may not use the IP up to its size limit because they know they will not have enough time near the end of the test to review items in the IP if there are too many. So test length and test time limit should be considered together when determining the IP size.
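To make concrete why pocketed items contribute no information to item selection, the following Python sketch computes an interim θ estimate from non-pocketed responses only. It is a hedged illustration, not the study's procedure: the two-parameter logistic model, the expected a posteriori (EAP) estimator with a standard normal prior, the quadrature grid, and the names `eap_theta` and `interim_theta` are all assumptions made for the example.

```python
import math


def eap_theta(responses):
    """EAP estimate of theta under an assumed 2PL model and N(0, 1) prior.

    responses: list of (a, b, u) tuples, with discrimination a,
    difficulty b, and scored response u (1 = correct, 0 = incorrect).
    """
    grid = [(k - 40) / 10.0 for k in range(81)]  # theta points from -4.0 to 4.0
    numerator = 0.0
    denominator = 0.0
    for t in grid:
        weight = math.exp(-0.5 * t * t)          # unnormalized standard normal prior
        for a, b, u in responses:
            p = 1.0 / (1.0 + math.exp(-a * (t - b)))
            weight *= p if u == 1 else 1.0 - p   # likelihood of the observed response
        numerator += t * weight
        denominator += weight
    return numerator / denominator


def interim_theta(administered, pocketed_ids):
    """Interim estimate that ignores any item currently held in the pocket.

    administered: list of (item_id, a, b, u) tuples (hypothetical record layout).
    """
    usable = [(a, b, u) for item_id, a, b, u in administered
              if item_id not in pocketed_ids]
    return eap_theta(usable)
```

Under these assumptions, changing the provisional response to a pocketed item leaves `interim_theta` unchanged, so it cannot steer the selection of later items. The same property means that each pocketed item withholds information from the interim estimate, which is the efficiency cost of a large IP.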
This study focused mainly on evaluating the possible negative impacts of the IP method on measurement accuracy under worst-case scenarios. Several possible positive effects of the IP method on measurement accuracy, however, were not covered in this simulation study but are worth mentioning. First, providing an option to review items may help test takers reduce their test anxiety levels, even if they never exercise that option (Olea et al., 2000; Vispoel et al., 2000). With reduced anxiety, test takers typically perform as they are supposed to, with fewer mistakes, so an item review option conceivably could reduce measurement errors (Papanastasiou, 2002). Second, with a well-chosen IP size, the IP system may help test takers manage their time more wisely during CAT administration. A typical test taker mistake observed during a real, operational CAT administration is spending too much time on a few items and then rushing through the remaining items to finish the test on time. This may contribute to a large (often unobservable) measurement error because “speededness” might influence the test taker’s whole performance on the rest of the items. An IP system, however, would allow test takers to skip items they think will take longer to answer right at the outset. Even if a test taker finds a current item taking too long to finish in the middle of problem solving, he or she can put the item into the IP and move on to the next item. As test takers always can revisit the items in the IP and restart from where they left off, they would not need to gamble on whether to give up and guess or spend more time on the current item in which they have already invested a sizable chunk of time. The IP method can offer flexibility in test time management and help test takers minimize unintended speededness during CAT, which, as a result, may reduce related measurement errors.
These potential benefits of the IP method could not be investigated in this study due to the simulation limitations, but it is strongly suggested that future studies conduct an in-depth examination of the psychological and psychometric effects of the IP method using real empirical data. It should also be noted that the IP method may bring a new type of negative psychological effect if test takers focus on using the IP method rather than working on the test itself. The potential for the IP method to generate a new level of test anxiety deserves to be the subject of additional future investigations.
The simulation conditions in this study mainly reflected a CAT program with a fixed length for a high-stakes exam. Because even minor changes in the item selection algorithm, item bank, estimation method, and test length can make huge differences in the outcomes, the findings of this study should not be imprudently generalized. The impact of IP usage on test length when CAT administration is terminated based on the estimation precision would be an interesting topic for examination in future studies.
The primary goal of the IP method was not necessarily to give test takers a better chance at improving their scores but to create a less restrictive testing environment, allowing them more control over a CAT administration so that they can perform undistracted with less test anxiety.
Acknowledgements
The author wishes to thank Lawrence M. Rudner, Fanmin Guo, and Eileen Talento-Miller of Graduate Management Admission Council® (GMAC®) for their feedback and support. The author also is grateful to Ernie Anastasio of GMAC®, Hua-Hua Chang of University of Illinois, Wim J. van der Linden of CTB/McGraw-Hill, Paula Bruggeman of GMAC®, and the reviewers of Applied Psychological Measurement for review and valuable comments, which strengthened the article greatly.
Author’s Note
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of the Graduate Management Admission Council®. This is a revision of the original study that was presented at the 2011 annual meeting of the National Council on Measurement in Education (NCME). The article received the Alicia Cascallar Award for Outstanding Paper by Early Career Scholar from NCME.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
