Abstract
Computerized classification tests (CCTs) classify examinees into categories such as pass/fail, master/nonmaster, and so on. This article proposes the use of stochastic methods from sequential analysis to address item overexposure, a practical concern in operational CCTs. Item overexposure is traditionally dealt with in CCTs by the Sympson-Hetter (SH) method, but this method is unable to restrict the exposure of the most informative items to the desired level. The authors’ new method of stochastic item exposure balance (SIEB) works in conjunction with the SH method and is shown to greatly reduce the number of overexposed items in a pool and improve overall exposure balance while maintaining classification accuracy comparable with using the SH method alone. The method is demonstrated using a simulation study.
Computerized classification tests (CCTs) are measurement instruments that classify examinees into categories such as pass/fail, master/nonmaster, and so on, often for the purpose of professional licensure or certification. In many cases, the sequential probability ratio test (SPRT) of Wald (1947) is used in CCTs to terminate the test when a classification decision can be made with a high degree of certainty. Clearly, it is undesirable for testing to go on indefinitely and for the entire item pool to be exposed to a single examinee, so a maximum number of items is specified for almost all operational CCTs. A minimum number of items is often specified as well, and the test may be terminated according to the SPRT when the number of items is between the minimum and the maximum. (Some authors refer to SPRTs constrained by a maximum number of items as “truncated SPRTs,” to differentiate them from pure SPRTs with no such constraint. However, because many operational SPRTs do in fact impose a maximum number of items, the authors simply use the term SPRT.)
Finkelman (2008) proposed the method of stochastically curtailed SPRTs (SC-SPRTs), an innovation designed to make CCTs more efficient, that is, decrease the number of items administered to examinees while still classifying them accurately. He motivates the topic by demonstrating that under some conditions, the SPRT will continue to administer items although the current interim classification decision would not change before the maximum number of items is reached. Thus, the remaining items that are administered are wasted if there is no chance that the responses to them will change the classification decision. The method of the SC-SPRT computes the probability that the current interim classification decision will not change before the maximum number of items is reached. If this probability is very high, the SC-SPRT terminates the test, even if the traditional SPRT would have kept administering items. It is shown that the SC-SPRT is very effective in reducing the average test length (ATL) while maintaining classification accuracy equal or nearly equal to that of the traditional SPRT.
This article proposes the use of these stochastic methods to address another practical concern in operational CCTs, that of item overexposure. Item overexposure is traditionally dealt with in CCTs by the Sympson-Hetter (SH) method (Sympson & Hetter, 1985), but this method is unable to ensure that the exposure rate of all items will be kept at or below the desired maximum exposure rate (Barrada, Abad, & Veldkamp, 2009). In other words, when using the SH method, the most popular (i.e., informative) items will usually be administered to a larger proportion of examinees than was intended by the test developers.
To remedy this situation, the authors propose a method that uses the mathematical concepts of the SC-SPRT to work in conjunction with the SH algorithm to help balance item exposure rates in CCTs that specify a maximum and minimum number of items. As Thompson (2011) stated, a minimum serves a public relations function in that it prevents examinees from failing after responding to only a few items, which would likely result in complaints to the testing organization. From a psychometric standpoint, however, it is often possible to classify examinees accurately before they reach the minimum, such as in the case of examinees with either very high or very low ability levels. For example, for a CCT with a minimum test length of 50 items, a high-ability examinee may be classified correctly as passing by the 40th item, but 10 more items must still be administered to meet the minimum requirement. In this situation, it would be unnecessary to draw these last 10 items from the portion of the item pool consisting of frequently exposed items; these 10 positions may reasonably be filled with items that have been underexposed.
In the proposed method, before the minimum number of items is reached, an interim classification decision is made and the probability of this decision being retained for the remainder of the test is computed. If this probability is very high, the item selection algorithm will be routed to a subset of the item pool consisting of the most underexposed items, which are often the least informative. Responses to these items continue to contribute to the classification of the examinee; however, the classification is not likely to change, given the high probability of retention. In this manner, classification accuracy will remain high, and rarely administered items will have a much larger chance of being used, thereby preventing the use of more popular items. For examinees with interim classification decisions that are more uncertain, that is, interim decisions with a higher chance of changing, the test will proceed normally and terminate according to the traditional SPRT.
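The routing step described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name, the argument names, and the use of "greater than or equal to" for the threshold comparison are assumptions, and the retention probability `p_retain` is taken as an input because its computation is not reproduced here. The threshold `gamma` corresponds to the SIEB parameter γ reported later in the article, and `m_low` stands for the subset of underexposed items.

```python
def candidate_items(p_retain, gamma, n_administered, j_min,
                    full_pool, m_low, administered):
    """Return the item indices eligible for the next selection.

    p_retain       -- probability that the interim decision is retained
                      (assumed to be computed elsewhere)
    gamma          -- routing threshold; values close to 1 are conservative
    n_administered -- number of items given so far
    j_min          -- minimum test length
    full_pool      -- all item indices
    m_low          -- indices of the underexposed subset
    administered   -- set of item indices already given to this examinee
    """
    # Route to the underexposed subset only before the minimum is reached
    # and only when the interim decision is very likely to be retained.
    route_to_low = n_administered < j_min and p_retain >= gamma
    source = m_low if route_to_low else full_pool
    return [i for i in source if i not in administered]
```

For example, with `gamma = 0.99`, an examinee whose interim decision has a retention probability of 0.995 after 40 of 50 required items would draw the next item from `m_low`, whereas an examinee at 0.90 would continue with the full pool.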
This article is structured as follows: The section titled ‘Method’ briefly summarizes the concepts of Finkelman’s (2008) SC-SPRT and presents the proposed method, which applies these ideas to the problem of improving item exposure balance in CCTs. ‘Simulations’ describes two simulation studies designed to examine the performance of the method in contrasting CCT settings, and ‘Results’ presents the results of these simulations. ‘Discussion’ concludes the article with a discussion of the findings and avenues for further research.
Method
Traditional CCTs and SC-SPRTs
This article deals with CCTs that perform dichotomous classifications based on item response theory (IRT), specifically, the three-parameter logistic model (Birnbaum, 1968). For a detailed treatment of CCTs and the SPRT, see Wald (1947) and Spray and Reckase (1994, 1996). Practical considerations when designing CCTs include the aforementioned issue of item exposure control and item selection methods. A traditional method of item selection, and the one used in this article, is choosing the item that maximizes the Fisher information (FI) at the cut point θ0, which was shown by Spray and Reckase (1994) to compare favorably with other methods. This method of item selection is nonadaptive, as it does not depend on an examinee’s interim ability estimate. For cases in which a CCT uses an item selection method that is adaptive, Finkelman (2008) presented a slight modification of his method of stochastic curtailment. Thus, the general method of stochastic curtailment, as well as the methodology proposed below, would be applicable to CCTs using either method of item selection.
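To make the ingredients of a traditional CCT concrete, the following sketch shows the three-parameter logistic model, maximum-FI item selection at the cut point, and an SPRT decision rule with minimum and maximum test lengths. It is a minimal sketch under stated assumptions rather than the procedure used in the study: the logistic function omits the 1.7 scaling constant, the half-width `delta` of the indifference region and the error rates `alpha` and `beta` are generic SPRT inputs, and the forced decision at the maximum length (comparing the log likelihood ratio with the midpoint of the two bounds) is one common convention chosen here for illustration.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 constant; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_max_info(pool, theta0, administered):
    """Index of the unused item with maximum information at the cut point."""
    unused = [i for i in range(len(pool)) if i not in administered]
    return max(unused, key=lambda i: fisher_info(theta0, *pool[i]))

def sprt_decision(responses, items, theta0, delta, alpha, beta, j_min, j_max):
    """Return 'pass', 'fail', or 'continue' given the responses so far."""
    lo, hi = theta0 - delta, theta0 + delta
    log_lr = 0.0
    for u, (a, b, c) in zip(responses, items):
        p_hi, p_lo = p3pl(hi, a, b, c), p3pl(lo, a, b, c)
        if u == 1:
            log_lr += math.log(p_hi / p_lo)
        else:
            log_lr += math.log((1.0 - p_hi) / (1.0 - p_lo))
    upper = math.log((1.0 - beta) / alpha)   # pass boundary
    lower = math.log(beta / (1.0 - alpha))   # fail boundary
    n = len(responses)
    if n >= j_max:
        # Forced decision at the maximum length (illustrative convention).
        return "pass" if log_lr >= (upper + lower) / 2.0 else "fail"
    if n >= j_min:
        if log_lr >= upper:
            return "pass"
        if log_lr <= lower:
            return "fail"
    return "continue"
```

Because `select_max_info` depends only on the cut point and on which items have already been used, every examinee would see the same item order in the absence of exposure control, which is why exposure control matters so much for this selection rule.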
As mentioned above, Finkelman (2008) proposed shortening CCTs via the SC-SPRT. To review the method, the authors introduce the following notation. Items administered in a CCT are indexed by the letter j, and the minimum and maximum number of items allowed are denoted by Jmin and Jmax, respectively. The authors differentiate the interim classification decision after the administration of j′ items, denoted as
Stochastically Balancing Item Exposure Rates
As discussed previously, Finkelman (2008) proposed the use of the probability
The exact set of items contained in Mlow may be specified at the discretion of the psychometrician. For CCTs using the SH method for item exposure control, a reasonable way to construct the subset would be to include all items with an SH exposure parameter of 1, because it is these items that are exposed less often. Another simple way to construct Mlow would be to include the x least informative (and therefore most seldom exposed) items in the pool, for x = 100, 150, 200, and so on, depending on the size of the pool and the number of items that tend to be over-/underexposed.
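Both ways of constructing Mlow are easy to express in code. The sketch below is illustrative only: the function names are hypothetical, `sh_params` is assumed to hold the SH exposure parameter of each item, and `info_at_cut` is assumed to hold each item's Fisher information at the cut point.

```python
def m_low_by_sh_parameter(sh_params, tol=1e-9):
    """Items whose SH exposure parameter equals 1 (within a small tolerance)."""
    return [i for i, k in enumerate(sh_params) if k >= 1.0 - tol]

def m_low_by_information(info_at_cut, x):
    """The x least informative items at the cut point, as sorted indices."""
    order = sorted(range(len(info_at_cut)), key=lambda i: info_at_cut[i])
    return sorted(order[:x])
```

The first construction adapts automatically to the outcome of the SH simulations, whereas the second gives the psychometrician direct control over the size of the subset through x.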
Simulations
Two simulation studies were performed to examine the ability of the new SIEB method to mitigate item underexposure while maintaining classification accuracy comparable with that of the traditional SPRT. The two studies use CCTs with differing specifications so that the proposed methodology may be demonstrated in two contrasting situations. As described in the following, eight different test conditions were simulated in each study, and each condition was run for 50 replications. The item pools generated for each study were kept constant throughout all conditions and replications. Also, for each replication of each condition, n = 1,000 simulated examinees were generated from the Normal(0,1) distribution.
Study 1
For the first simulation study, an item pool consisting of 750 items was generated with IRT a, b, and c parameters drawn from the Normal(0.7, 0.2), Normal(−0.75, 2.0), and Normal(0.25, 0.03) distributions, respectively. These values were chosen to be similar to those used in previous CCT studies (Thompson, 2009). SH exposure control parameters were set for the items in the pool via initial simulations run according to the method outlined by Sympson and Hetter (1985), and the target maximum exposure rate was set at 0.2, a commonly chosen rate (Leung, Chang, & Hau, 2002). The exposure control parameters obtained at the 10th iteration of the SH algorithm were used. For four of the eight conditions, content domain constraints were imposed; in these cases, items in the pool were randomly assigned to one of three equally balanced content domains.
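The data generation described above might be set up as follows. This is a sketch under explicit assumptions, not the authors' simulation code: the second argument of each Normal distribution is treated as a standard deviation (the text does not say whether it is a standard deviation or a variance), no truncation is applied to the drawn parameters, and the seeds and function names are arbitrary.

```python
import random

def generate_item_pool(n_items=750, n_domains=3, seed=2024):
    """Draw 3PL parameters and assign items to equally sized content domains."""
    rng = random.Random(seed)
    # Balanced domain labels, shuffled so that assignment is random.
    domains = [i % n_domains for i in range(n_items)]
    rng.shuffle(domains)
    pool = []
    for d in domains:
        pool.append({
            "a": rng.gauss(0.7, 0.2),     # discrimination (SD assumed)
            "b": rng.gauss(-0.75, 2.0),   # difficulty (SD assumed)
            "c": rng.gauss(0.25, 0.03),   # pseudo-guessing (SD assumed)
            "domain": d,
        })
    return pool

def generate_examinees(n=1000, seed=1):
    """Draw true abilities from the standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

Fixing the pool seed while varying the examinee seed across replications mirrors the design in which the item pool is held constant and a fresh sample of examinees is generated for each replication.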
For all conditions, the following CCT parameters were set:
In summary, the conditions that were varied for Study 1 include two levels of content domain constraints (present and not present) and four different item exposure control methods, resulting in 2×4 = 8 conditions. For the conditions incorporating content domain constraints, content balance was achieved using the spiraling method proposed by Kingsbury and Zara (1989), with the three content domains being assigned equal importance.
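The content-balancing step can be illustrated with a short sketch. It follows the usual reading of the Kingsbury and Zara approach, in which the next item is drawn from the content domain whose administered proportion falls furthest below its target; the function name, the data structures, and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
def next_content_domain(counts, targets):
    """Pick the domain whose administered share is furthest below target.

    counts  -- dict mapping domain -> number of items administered so far
    targets -- dict mapping domain -> target proportion (summing to 1)
    """
    total = sum(counts.values())

    def gap(domain):
        current = counts[domain] / total if total else 0.0
        return targets[domain] - current

    # sorted() makes tie-breaking deterministic (first domain alphabetically).
    return max(sorted(targets), key=gap)
```

With three equally weighted domains, each target is 1/3, so the rule simply cycles toward whichever domain has been used least so far.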
Study 2
The design of the second study was similar to that of the first, but several CCT settings were varied for contrast. For this study, an item bank of 500 items was generated with IRT parameters drawn from the same distributions as in Study 1. The following CCT settings were implemented:
Results
The results of the simulation studies were evaluated according to five criteria: (a) the percentage of correct classifications (PCC); (b) ATL; (c) a chi-square statistic used by Wen, Chang, and Hau (2000) to summarize the item exposure rate balance (or imbalance) of the items in a pool for a variable-length computerized adaptive test or CCT (lower values signify better exposure balance); (d) the proportion of low-exposed items (LEXP), that is, items seen by less than 2% of examinees; and (e) the proportion of highly exposed items (HEXP), that is, items exposed to more than 20% of the examinees, the maximum rate specified when running the SH simulations.
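The three exposure criteria can be computed from per-item administration counts, as in the sketch below. The LEXP and HEXP thresholds follow the definitions above. The chi-square formula is an assumption: it uses the form common in the exposure-control literature, in which each item's observed exposure rate is compared with the rate expected under perfectly uniform exposure (taken here as ATL divided by pool size), and it is not copied from Wen, Chang, and Hau (2000).

```python
def exposure_summary(admin_counts, n_examinees, atl, low=0.02, high=0.20):
    """Return (chi_square, LEXP, HEXP) for one simulated condition.

    admin_counts -- number of examinees who saw each item
    n_examinees  -- number of simulated examinees
    atl          -- average test length
    """
    rates = [count / n_examinees for count in admin_counts]
    n_items = len(rates)
    expected = atl / n_items  # uniform-exposure rate (assumed form)
    chi_square = sum((r - expected) ** 2 / expected for r in rates)
    lexp = sum(r < low for r in rates) / n_items    # proportion below 2%
    hexp = sum(r > high for r in rates) / n_items   # proportion above 20%
    return chi_square, lexp, hexp
```

As a small worked case, four items seen by 500, 10, 100, and 190 of 1,000 examinees with an ATL of 0.8 give an expected rate of 0.2, one low-exposed item, and one highly exposed item.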
In general, the results of the simulation studies show that SIEB is capable of significantly improving item exposure balance, including decreasing the proportion of overexposed items in the pools without sacrificing classification accuracy. Table 1 displays the results of Study 1, averaged over the 50 replications for each condition (standard deviations are in parentheses). For conditions both with and without content domain constraints, the SH with SIEB(
Table 1. Results for Study 1
Note: PCC = percentage of correct classifications; ATL = average test length; LEXP = low-exposed items; HEXP = highly exposed items; SH = Sympson-Hetter; SIEB = stochastic item exposure balance.
Also, in Table 1, there is very little difference between the conditions with content constraints and those without. With the studied pool, the content constraints do not affect the classification accuracy. A difference might be seen if the bank were made smaller and/or a larger set of constraints were used.
The results for Study 2 are displayed in Table 2. The patterns described for the results of Study 1 all hold for Study 2: SIEB with γ values close to 1 yields significant improvement in χ2 and HEXP while maintaining classification accuracy similar to that of SH only, whereas SIEB(
Table 2. Results for Study 2
Note: PCC = percentage of correct classifications; ATL = average test length; LEXP = low-exposed items; HEXP = highly exposed items; SH = Sympson-Hetter; SIEB = stochastic item exposure balance.
In addition to the SIEB yielding lower χ2 and HEXP values across testing conditions in both studies, the ATLs decrease as lower values of γ are used in the SIEB. This is due to the fact that under the SIEB, the tests for examinees with
A question that arises naturally is how many examinees are routed to
Proportion of Examinees Routed to
Note: HEXP = highly exposed items; SH = Sympson-Hetter; SIEB = stochastic item exposure balance.
Discussion
The SH method is a tried and true means of limiting item overexposure in CCTs. One of the standard criticisms of this method is that many items may remain overexposed, exceeding the desired maximum item exposure rate. While this may be a security concern for test developers and stakeholders, it would be undesirable for any method aiming to reduce item overexposure to sacrifice a significant degree of classification accuracy, which is usually a much higher priority. This article proposes SIEB, a stochastic method of improving item exposure balance that works in conjunction with the SH method for CCTs that specify a minimum number of items to be administered to all candidates. This new combined method of exposure control has been shown to aid in balancing item exposure rates while maintaining classification accuracy virtually equal to that of the SH method used alone. Specifically, SIEB greatly reduces the number of items that display an exposure rate greater than the maximum desired exposure rate specified when using the SH method. The strength of the proposed method is due to the calculation of the probability
The idea of improving item exposure balance by administering low-discriminating and/or low-information items at a point in an exam when doing so is psychometrically “inexpensive,” that is, unlikely to negatively affect examinee ability measurement, is not a new one. For example, the technique of a-stratification (Chang & Ying, 1999) administers low-discriminating (and therefore underexposed) items at the beginning of a computerized adaptive test when an examinee’s theta value has not yet been estimated precisely. In contrast, SIEB administers rarely exposed items in a CCT after the examinee has been classified with certainty, but the two methods are similar in spirit and aim. a-Stratification has not yet been studied in a CCT setting; the randomization approach for exposure control used by Lin (2011) appears to be analogous. Further research may attempt to combine SIEB with such methods.
There are myriad ways to alter the design of a CCT, including varying the minimum and maximum test length, false negative and false positive rates α and β, length of the indifference region, item pool size, item parameter distribution, and content domain constraints. No single study can coherently examine all these factors; this article presented two contrasting simulation studies in which the item pool size and content domain constraints were varied, as well as the SIEB parameter γ. Researchers wishing to use the proposed methodology should experiment with different γ values for use with their particular CCTs. While the authors do not wish to overgeneralize, their results suggest that even very conservative values of γ (close to 1) may yield significant improvements in item exposure balance while maintaining classification accuracy.
An issue that would arise if this method were to be used operationally is how to select
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
