Abstract
This study compares the progressive-restricted standard error (PR-SE) exposure control procedure to three procedures commonly used in computerized adaptive testing: the randomesque, Sympson–Hetter (SH), and no exposure control methods. The performance of these four procedures is evaluated using the three-parameter logistic model under the manipulated conditions of item pool size (small vs. large) and stopping rule (fixed-length vs. variable-length). PR-SE offers exposure constraints similar to those of SH without requiring a preliminary simulation study. Across both item banks, the PR-SE method administered almost all of the items in the pool, whereas the other procedures administered about 52% or less of the large item bank and 80% or less of the small item bank. PR-SE yielded the smallest amount of item overlap between tests across conditions and administered fewer items on average than SH. PR-SE obtained these results with measurement precision that was acceptable and similar to that of the other exposure control procedures, while vastly improving item pool usage.
The use of computerized adaptive tests (CATs) and computer-based tests has increased in the last few decades due to advancements in technology, and these tests have replaced many traditional paper-and-pencil assessments. The Graduate Management Admission Test (GMAT) developed by the Graduate Management Admission Council, the Certified Public Accountant (CPA) exam from the American Institute of Certified Public Accountants, the North American Pharmacist Licensure Examination (NAPLEX) developed by the National Association of Boards of Pharmacy, and the National Council Licensure Examinations (NCLEX) developed by the National Council of State Boards of Nursing are all examples of CATs and computer-based tests currently implemented.
One component of CAT deals with item selection procedures to control for exposure rate and ensure content sampling. In high-stakes CATs especially, the overexposure of items is a major concern, and there are several methods that can be implemented to reduce item exposure rates as well as increase item pool usage. One of these methods is the progressive-restricted standard error (PR-SE) exposure control procedure (McClarty, Sperling, & Dodd, 2006), which is a variant of the progressive-restricted (PR) method developed by Revuelta and Ponsoda (1998). It is relatively easy to implement just like the original PR method, but it can be used with both fixed- and variable-length CATs (unlike the PR procedure). The ability to use the PR-SE procedure with variable-length CATs allows tests to end after an individual’s ability is estimated satisfactorily.
In its original form, CAT was used similarly to a traditional test, where all examinees received the same number of items, also referred to as a fixed-length test. However, a fixed-length CAT differs from a traditional assessment because each item administered is matched to the examinee’s estimated latent trait level. CAT re-estimates an examinee’s trait level after the response to each item, and the next item is selected based on that estimate. Alternatively, the CAT can be a variable-length test, where test termination can be decided by a variety of methods, such as the minimum information or the standard error (SE) stopping rule. Under the minimum information rule, the CAT stops once no items remain in the item pool that provide a predetermined amount of information for the examinee. Under the SE stopping rule, the CAT terminates once a predetermined SE for the ability estimate is achieved.
While the applicability of the PR-SE method to variable-length CATs (as well as fixed-length CATs) seems promising, it needs to be compared to other methods before it is implemented in a live CAT system. No studies to date have evaluated the performance of PR-SE against prevalent exposure control procedures, such as the Sympson–Hetter (Sympson & Hetter, 1985) or randomesque (Kingsbury & Zara, 1989) methods. That comparison is the aim of the current study.
Exposure Control Procedures
The present study focuses on exposure control, an item constraint usually required with any high-stakes or high-volume testing situation. Exposure control methods attempt to preserve the item bank by selecting and administering items based on how often examinees have seen each item. Exposure control procedures can be categorized into four main types that are commonly found in CAT practice and research: conditional, randomized, stratified, and combined procedures (Georgiadou, Triantafillou, & Economides, 2007). The exposure control procedures described for the study are widely known for their use operationally.
The maximum information method provides the basis for many of the item selection procedures found in CAT practice and research. As the sole criterion for item selection, maximum information, or no exposure control, typically provides the best trait estimation. However, no exposure control carries the inherent risk that many of the same items will be administered to examinees. This leaves much of the item bank unused (lower discriminating items are rarely administered) and heightens test security risk. Repeatedly administering an item can cause it to become known to examinees, so that its previously calibrated parameters no longer describe how it functions in the current testing situation, a problem called parameter drift (van der Linden & Glas, 2010). Employing information as the only item selection criterion is a potentially serious threat to a test’s validity and has led to the development of various exposure control procedures that can be incorporated into CAT administration.
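As a minimal sketch of maximum information selection, assuming the 3PL model used later in this study: the function below computes each item's Fisher information at the current theta estimate and picks the largest. The logistic metric without a 1.7 scaling constant, the function names, and the dictionary layout of the item pool are illustrative assumptions, not details taken from the study's software.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric, no 1.7 scaling; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def max_info_item(theta, items, administered):
    """Return the id of the most informative item not yet administered.

    items: dict mapping a hypothetical item id to its (a, b, c) parameters.
    administered: set of item ids already given to this examinee.
    """
    candidates = [i for i in items if i not in administered]
    return max(candidates, key=lambda i: info_3pl(theta, *items[i]))
```

Because every examinee with a similar theta estimate is routed to the same highly discriminating items, this rule on its own concentrates exposure on a small part of the pool.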
Conditional
Conditional procedures attempt to control exposure rates based on a given criterion, such as usage frequency. One of the most notable procedures in the conditional category is Sympson–Hetter (SH; Sympson & Hetter, 1985). The SH method assigns each item i an exposure parameter, $k_i$, which ranges from 0 to 1. These parameters are determined before operational administration through an iterative simulation study that estimates the probability that an item will be administered given that it has been selected. During the CAT, when an item is chosen by the selection procedure, a random number is generated from a uniform distribution between 0 and 1 and compared to the item’s exposure parameter. If the random number is less than or equal to the exposure parameter, the item is administered. Otherwise, the item is not administered and is disregarded for further administration.
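The administration check described above can be sketched as follows. This is an illustration only: the exposure parameters are taken as given (the preliminary simulation that produces them is not shown), and the function name, the ranked candidate list, and the choice to move on to the next candidate after a rejection are assumptions made for the example.

```python
import random

def sh_administer(ranked_items, exposure_params, rng=random):
    """Walk down the ranked candidates until one passes the SH check.

    ranked_items: item ids ordered from most to least preferred by the selection rule.
    exposure_params: dict mapping item id to its exposure parameter k_i in [0, 1].
    Returns (administered_item_or_None, rejected_items).
    """
    rejected = []
    for item in ranked_items:
        u = rng.random()  # uniform draw on [0, 1)
        if u <= exposure_params[item]:
            return item, rejected
        rejected.append(item)  # set aside, not administered to this examinee
    return None, rejected
```

An item with an exposure parameter of 1 is always administered once selected, whereas an item with a small parameter is usually passed over, which is how the procedure holds down the exposure of the most informative items.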
While the SH procedure can control item overexposure, it requires additional simulations to calculate the items’ exposure parameters that are used during the computer adaptive test. This “extra step” may be deemed too time-consuming for test developers, making other exposure control methods more attractive (Georgiadou et al., 2007).
Randomized
Randomized procedures introduce some randomization to the item selection process for a given subset of items. For example, McBride and Martin’s (1983) 5-4-3-2-1 method selects the five most informative items and randomly chooses one from that subset. After the ability is re-estimated, the procedure selects the four most informative items, from which one is randomly chosen. The method continues to shrink the group until it contains a single item, at which point selection proceeds by maximum information. Kingsbury and Zara (1989) introduced the randomesque procedure, where a fixed number of items, such as five, is chosen based on a person’s ability estimate and one item is randomly selected from that subset at every step of the test.
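A minimal sketch of randomesque selection, assuming item information at the current ability estimate has already been computed and stored in a dictionary keyed by hypothetical item ids:

```python
import random

def randomesque_select(info_by_item, group_size=5, rng=random):
    """Pick one item at random from the group_size most informative items.

    info_by_item: dict mapping item id to its information at the current theta estimate.
    """
    ranked = sorted(info_by_item, key=info_by_item.get, reverse=True)
    return rng.choice(ranked[:group_size])
```

Calling the same function with a group size that shrinks from 5 to 1 over the first five items would give the 5-4-3-2-1 variant, and a group size of 1 reduces to maximum information selection.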
Stratified
Stratified methods partition the item pool according to statistical properties of the items and constrain administration to items from a given stratum. The a-stratified procedure (Chang & Ying, 1999) first administers items with lower a parameter (discrimination) values because the theta estimate at the beginning of the test is more of a guess and therefore more variable. Another procedure, a-stratified with b blocking, accounts for the fact that b values are correlated with a values (Chang, Qian, & Ying, 2001).
Combined
Combined procedures occur when two or more exposure control methods are combined. The present study focuses on the progressive-restricted (PR) combined procedure, which evolved from two different types of exposure control methods proposed by Revuelta and Ponsoda (1998) to address the item exposure control issues that are inherent in computerized adaptive tests. The two methods that were combined to form the PR method (progressive and restricted maximum information) are designed to prevent overexposure and increase the usage of infrequently selected items while still maintaining precision in ability estimation.
Progressive-restricted procedure
Revuelta and Ponsoda’s (1998) progressive-restricted procedure was derived from the maximum information method, where the most informative item is always administered, and the restricted maximum information method, which selects items using maximum information but no longer administers an item once it has appeared in (100k)% of the tests. Exposure rates are tracked beginning with the first administration of the test. Once an item, i, has been administered $a_i$ times in the previous t tests, its exposure rate, $k_i$, is calculated as $k_i = a_i / t$. Only items whose exposure rate $k_i$ is below the maximum rate, k, are available for selection.
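The restriction step can be sketched as a filter over running administration counts. The function name and the handling of the very first test (when no exposure rates exist yet, every item is treated as available) are assumptions made for the example.

```python
def eligible_items(admin_counts, tests_given, k_max):
    """Return the items whose observed exposure rate is below the cap k_max.

    admin_counts: dict mapping item id to the number of times it has been administered.
    tests_given: number of tests administered so far (t).
    """
    if tests_given == 0:
        return list(admin_counts)  # no exposure history yet (assumption)
    return [i for i, a in admin_counts.items() if a / tests_given < k_max]
```

For example, with a cap of .30 and 10 previous tests, an item already given 3 times has an exposure rate of exactly .30 and is withheld, while an item given twice (.20) remains available.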
The progressive method brings a random component to maximum information by calculating a weight value for each item using:

$$w_i = (1 - s)\,R_i + s\,I_i \tag{1}$$

where $w_i$ is the weight assigned to item i, s is the relative serial position of the item, $R_i$ is a random number selected from a uniform distribution, and $I_i$ is the item information. s is calculated as h/m, where h is the number of previously administered items in a test with a maximum length of m items. The random component loses importance as the test progresses because it is multiplied by (1 − s). Conversely, the information component becomes more important because it is multiplied by s. Shifting the relative contribution of the two components in this way allows more of the items in the item bank to be used (due to the random component), while still maintaining estimation precision (due to the item information value).
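A small sketch of the progressive weighting, in which the random component is multiplied by (1 − s) and the information component by s, with s = h/m. The range of the uniform draw is not stated in the text; the sketch assumes the unit interval, and the function names and dictionary layout are illustrative.

```python
import random

def progressive_weights(info_by_item, h, m, rng=random):
    """Weight each item by (1 - s) * R_i + s * I_i with s = h / m.

    info_by_item: dict mapping item id to its information I_i.
    h: number of items already administered; m: maximum test length.
    R_i is drawn from a uniform distribution on [0, 1) (an assumption).
    """
    s = h / m
    return {item: (1.0 - s) * rng.random() + s * info
            for item, info in info_by_item.items()}

def progressive_select(info_by_item, h, m, rng=random):
    """Administer the item with the largest progressive weight."""
    weights = progressive_weights(info_by_item, h, m, rng)
    return max(weights, key=weights.get)
```

At the first item (h = 0) the weights are purely random, so any item can be chosen; at h = m the weights equal the item information and the rule coincides with maximum information.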
Using a real item pool, the restricted maximum information and progressive methods were compared to other exposure control methods on precision and exposure control capabilities (Revuelta & Ponsoda, 1998). Overall, the methods with the greatest precision were also the methods with the worst exposure rates, as seen in the excellent precision of maximum information. From a practical standpoint, test developers are not concerned only with having the most precise test, especially in high-stakes testing with large item banks. The progressive method decreased the number of unused items, and thus increased item bank usage, while maintaining precision in ability estimation. However, its maximum exposure rate was still considerably higher than desired.
Revuelta and Ponsoda (1998) conducted a second study, where the progressive and restricted maximum information methods were combined in an attempt to maintain the benefits of the progressive method while decreasing the high maximum exposure rate that afflicts it. In the PR method, the restricted maximum information part of this combined method is used to select the available items for the test, and then the formula for the progressive method in Equation 1 is used to calculate a weight, $w_i$, for each item. Item exposure rates are capped at k, and any item with a higher exposure rate is not included in the item bank for the next administration of the test.
The PR procedure showed better results for maximum exposure rate than the progressive method. The restriction component, k, that was introduced to the progressive method was able to keep maximum exposure rates under control, while still maintaining ability estimation precision and higher usage of seldom used items. While methods such as SH also perform similarly, the PR procedure is much easier to execute.
Progressive-restricted standard error procedure
A variant of the PR method was developed by McClarty et al. (2006), termed the progressive-restricted standard error exposure control method. This method uses the same formula as in Equation 1 but redefines s as the ratio of stopping rule SE over the current SE. Using the SE does not make selection dependent on serial position but rather uses SE to determine whether the random portion or information is more influential in the weighting. As with the randomized procedures, the combined procedures are easier to implement than conditional procedures.
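The redefinition of s can be sketched as below. Capping s at 1 when the current SE has already fallen below the stopping-rule SE (which can happen in a fixed-length test) is an assumption made for this example and is not described in the text, as is drawing the random component from the unit interval.

```python
import random

def prse_s(se_stop, se_current):
    """s for PR-SE: stopping-rule SE divided by the current SE, capped at 1 (assumption)."""
    return min(1.0, se_stop / se_current)

def prse_weights(info_by_item, se_stop, se_current, rng=random):
    """Progressive weights (1 - s) * R_i + s * I_i with the SE-based s."""
    s = prse_s(se_stop, se_current)
    return {item: (1.0 - s) * rng.random() + s * info
            for item, info in info_by_item.items()}
```

With a stopping-rule SE of 0.30, an examinee whose current SE is 0.60 has s = 0.5, so the random and information components contribute equally; as the SE approaches 0.30, selection is driven almost entirely by information.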
McClarty et al. (2006) compared the PR-SE method to the PR method from Revuelta and Ponsoda (1998) and no exposure control (i.e., maximum information procedure). The authors chose to examine two models, the dichotomous three-parameter logistic (3PL; Birnbaum, 1968) and polytomous partial credit (PC; Masters, 1982) models. They also manipulated the item pool size, trait distribution, and stopping rule.
Under the 3PL model, PR-SE administered fewer items than PR, where the difference in number of items administered was greater under the normally distributed data. Smaller item pool sizes resulted in increased SE of theta, with higher SEs for uniformly generated data. Correlations between estimated and known theta were similar for both PR-SE and PR in every condition, and percent item overlap was generally low for both methods for every condition. For the PC model, there were no consistent differences found between PR-SE and PR. PR-SE was found to use most of the item pool during administration. Overall, PR-SE performed similarly to the PR method in both models, supporting the use of PR-SE with variable-length CATs.
The PR-SE procedure offers a method that is straightforward to implement compared with other exposure control procedures used in the testing field. While PR worked similarly to other methods, it is more appropriate for fixed-length tests. Both combined procedures (i.e., PR and PR-SE) have been shown to work similarly to each other (McClarty et al., 2006). PR is appropriate for fixed-length tests because s is defined by serial position, whereas the PR-SE method is appropriate for both fixed-length and variable-length CAT, as s is defined by SE of ability estimates. The purpose of the study is to compare PR-SE to other notable procedures commonly used in operational settings. While the PR-SE is not a new procedure, no comparisons have been made between PR-SE and any other exposure control procedure except for the PR procedure. Therefore, the present study will evaluate the PR-SE, SH, and randomesque procedures that have been used in operational CATs. Stratification procedures were not investigated in the current research because those procedures are currently not used in operational CATs and both the SH and randomesque methods are used in operational testing programs. Additionally, the investigation will incorporate more realistic high-stakes settings with stricter content area requirements than previous research. The study will consider trait estimation, item overlap, pool usage, and exposure rates for comparison criteria.
Method
Overview of Study Design
The aim of the current study is to compare the PR-SE exposure control method (with the exposure rate set to .30) to the SH (with the exposure rate set to .30), randomesque (R-5; using a group of five items at a time), and no exposure control (None) procedures using the dichotomous 3PL model. Performance of these four exposure control methods is studied under varying conditions, including two item pool sizes (300 vs. 540 items) and two stopping rules (fixed-length with 50 items vs. variable-length with the SE set to 0.30 or a maximum administration of 50 items). The completely crossed design of this study resulted in 4 (exposure control procedures) × 2 (item pool sizes) × 2 (stopping rules) = 16 experimental conditions.
Item Pools
The item pools developed for this study are based on real data from a nationally administered mathematics achievement test. Item parameters for difficulty, discrimination, and guessing (b, a, and c, respectively) based on the 3PL model were provided for this test with an information function that peaked at a theta value of zero. Table 1 contains the means and standard deviations of these parameters for both item pools. A total of 540 items coming from nine different test forms, each containing 60 items, was used for the “large” item pool. The “small” item pool was constructed by choosing at random five of the forms, for a total of 300 items. Each test form used to produce the item pools contained items from six different content areas. Both item pools included the same proportions of items for their respective content areas. The six content item proportions are as follows: 24%, 16%, 15%, 15%, 23%, and 7%.
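The six content proportions sum to 100% and can serve as targets for content balancing. As a hedged sketch of the discrepancy rule used for content balancing in the CAT simulations, assuming the pool proportions above are the targets (the text does not state this explicitly) and breaking ties by taking the first area:

```python
# Target proportions for the six content areas, taken from the item pools.
TARGETS = [0.24, 0.16, 0.15, 0.15, 0.23, 0.07]

def next_content_area(admin_counts, targets=TARGETS):
    """Return the index of the content area furthest below its target proportion.

    admin_counts: number of items administered so far in each content area.
    """
    n = sum(admin_counts)
    observed = [c / n if n else 0.0 for c in admin_counts]
    gaps = [t - o for t, o in zip(targets, observed)]
    return max(range(len(targets)), key=gaps.__getitem__)
```

Before any item is given, the area with the largest target (24%) is chosen; after one item from that area, the rule moves to the 23% area, and so on.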
Table 1. Descriptive Statistics for the Item Parameter Estimates for Both Item Pools.
Data Generation
Data generation procedures implementing the SAS macro IRTGEN (Whittaker, Fitzpatrick, Williams, & Dodd, 2003) were used to simulate item responses for the two item pools based on parameter estimates obtained from the real data set described above. The data generation began with a sample of 1,000 simulees’ known theta levels drawn from the standard normal distribution with a mean of zero and standard deviation of one. The probability of a correct response to an item was computed using the simulee’s theta level and the item’s parameter estimates from the item pool. This probability was then compared to a random number chosen from a uniform distribution. If the probability was greater than the random number, the simulee was given a response score of one for that item; otherwise, the simulee was given a score of zero. This data generation procedure was repeated for every item and each simulee. For both item pools, each of the eight conditions (2 stopping rules × 4 exposure control methods) had 100 replications. Therefore, a total of 200 data sets (100 per item pool) were generated for a repeated-measures factorial design.
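The response generation rule above can be illustrated with a short Python sketch. It is not the IRTGEN macro: the item parameters, the seed, and the logistic metric without a 1.7 scaling constant are all placeholders chosen for the example.

```python
import math
import random

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(thetas, items, rng):
    """Score 1 when the model probability exceeds a uniform draw, else 0."""
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = p_3pl(theta, a, b, c)
            row.append(1 if p > rng.random() else 0)
        data.append(row)
    return data

rng = random.Random(2013)  # arbitrary seed for reproducibility
thetas = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # known thetas ~ N(0, 1)
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.7, 0.2)]  # hypothetical (a, b, c)
responses = simulate_responses(thetas, items, rng)
```

Each row of `responses` holds one simulee's scored answers to every item in the pool, which is the form an adaptive test simulation can later draw from item by item.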
CAT Simulations
The SAS program developed by Boyd, Dodd, and Fitzpatrick (in press) was used to administer the CATs using the data sets described above. The item pool information function peaks at a theta value of zero. Therefore, an initial value of zero for theta was assumed for each simulee. Content balancing was implemented using the Kingsbury and Zara (1989) method because the proportion of items in each of the six content areas varied. This content balancing procedure compares the targeted proportion for each content area with the proportion administered so far during the adaptive test and selects the next item from the content area with the largest discrepancy. Expected a posteriori (EAP) estimation, a Bayesian procedure, was used for ability estimation. Bayesian ability estimation methods multiply the likelihood function by the prior distribution to compute the posterior distribution. EAP is noniterative and uses the mean of the posterior distribution. Specifically, EAP evaluates the posterior at a set of quadrature points, weighting the likelihood at each point by the ordinate (height) of the normal curve at that point. A normal prior was used for EAP estimation because the ability distribution was assumed to be normally distributed. The stopping rule for fixed-length CATs was set to 50 items. The minimum SE stopping rule for variable-length CATs was set to 0.30 or a maximum administration of 50 items.
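A minimal sketch of EAP estimation with a standard normal prior, followed by the two stopping rules. The number of quadrature points (61), the theta range (−4 to 4), the logistic metric, and the function names are assumptions for illustration; they are not taken from the SAS program used in the study.

```python
import math

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric; an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_quad=61, lo=-4.0, hi=4.0):
    """Return (posterior mean, posterior SD) of theta over equally spaced quadrature points.

    responses: list of 0/1 scores; items: matching list of (a, b, c) tuples.
    """
    nodes = [lo + (hi - lo) * q / (n_quad - 1) for q in range(n_quad)]
    posterior = []
    for x in nodes:
        weight = math.exp(-0.5 * x * x)  # unnormalized N(0, 1) prior ordinate
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(x, a, b, c)
            weight *= p if u == 1 else 1.0 - p  # likelihood contribution
        posterior.append(weight)
    total = sum(posterior)
    mean = sum(x * w for x, w in zip(nodes, posterior)) / total
    var = sum((x - mean) ** 2 * w for x, w in zip(nodes, posterior)) / total
    return mean, math.sqrt(var)

def should_stop(n_administered, se, variable_length, se_target=0.30, max_items=50):
    """Fixed-length: stop at max_items. Variable-length: also stop once SE <= se_target."""
    if n_administered >= max_items:
        return True
    return variable_length and se <= se_target
```

The posterior standard deviation returned by `eap_estimate` plays the role of the SE that the variable-length stopping rule compares against 0.30.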
Data Analysis
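The SE of measurement and the Pearson product–moment correlation between known and estimated theta were calculated for each replication. Additionally, for the variable-length condition, descriptive statistics were calculated for the number of items administered. These statistics were then averaged for each of the four exposure control procedures and compared by item pool size and stopping rule. Bias and root mean squared error (RMSE) for the final trait estimate, averaged across the 100 replications for each condition, were calculated with the following formulas:

$$\text{Bias} = \frac{\sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)}{n}$$

$$\text{RMSE} = \sqrt{\frac{\sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)^2}{n}}$$

where $\hat{\theta}_j$ is the final trait estimate for simulee j, $\theta_j$ is the known trait level for simulee j, and n is the number of simulees.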
The present study also examined the item exposure rates, pool utilization, and the item overlap across test administrations. Item exposure rates were calculated by taking the number of times an item is administered divided by the number of administered CATs. Pool utilization refers to the percentage of items not administered during the CATs. Item overlap is the number of items that are the same for two examinees. These were averaged across the 100 replications for all conditions. In addition, conditional bias and RMSE plots are provided for the 300-item pool condition.
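The evaluation criteria can be sketched in a few lines. The sketch assumes each administered CAT is recorded as a list of integer item indices, that item overlap is averaged over all pairs of tests, and that bias is taken as estimated minus known theta; the function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def exposure_rates(tests, pool_size):
    """Times each item was administered divided by the number of CATs."""
    counts = Counter(item for test in tests for item in test)
    return [counts.get(i, 0) / len(tests) for i in range(pool_size)]

def percent_not_administered(rates):
    """Pool utilization as defined here: percentage of items never administered."""
    return 100.0 * sum(1 for r in rates if r == 0) / len(rates)

def mean_item_overlap(tests):
    """Average number of items shared by two examinees, over all pairs of tests."""
    pairs = list(combinations(tests, 2))
    return sum(len(set(a) & set(b)) for a, b in pairs) / len(pairs)

def bias(estimates, truths):
    """Mean signed error, estimated minus known theta (sign convention assumed)."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)

def rmse(estimates, truths):
    """Root mean squared error between estimated and known theta."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))
```

For three tests `[[0, 1, 2], [1, 2, 3], [2, 3, 4]]` drawn from a six-item pool, item 2 has an exposure rate of 1, item 5 is never administered, and the three pairs of tests share 2, 1, and 2 items. The pairwise loop is quadratic in the number of tests, which is acceptable for 1,000 simulees but would need a count-based shortcut for much larger samples.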
Results
Table 2 provides descriptive statistics averaged across the 100 replications under all conditions for the mean SE, correlation between known and estimated theta, bias, RMSE, and mean item overlap for the fixed-length tests. For both the large (540) and small (300) item pools using the fixed-length CATs, the grand mean SE increases slightly from the no exposure control condition (0.27 and 0.28 for the large and small pools, respectively) to R-5, SH, and then PR-SE (0.30 and 0.32 for the large and small pools, respectively). The correlations between known and estimated theta as well as bias were on average similar across conditions for the fixed-length CATs. RMSE increased slightly on average in the same order, from no exposure control to R-5, SH, and PR-SE, for both item pools with fixed-length CATs. Grand mean item overlap was smallest for PR-SE (11.5 and 12.4 for the large and small pools, respectively), followed by SH and R-5, and largest for no exposure control (24.5 and 26.8 for the large and small pools, respectively). Mean conditional bias and RMSE plots were very similar across all item exposure control conditions for the fixed-length CATs using both item pools: mean bias is zero and RMSE is smallest around a theta of 0 (see Figures 1 and 2 displaying the 300-item pool condition).
Table 2. Descriptive Statistics Averaged Across 100 Replications for the Fixed-Length Tests.
Note. None = no exposure control; R-5 = randomesque using a group of 5 items; SH = Sympson–Hetter; PR-SE = progressive-restricted standard error; SE = standard error; RMSE = root mean squared error.

Figure 1. Conditional bias for 13 equally spaced theta intervals for the 300-item pool fixed- and variable-length conditions averaged across 100 replications.

Figure 2. Conditional root mean squared error (RMSE) for 13 equally spaced theta intervals for the 300-item pool fixed- and variable-length conditions averaged across 100 replications.
For variable-length CATs, descriptive statistics averaged across the 100 replications are presented in Table 3, where results are very similar to those found for fixed-length CATs. Bias and correlation between known and estimated theta averaged across the replications were similar across the conditions for both item pools. Grand mean SEs were also very similar across all conditions for variable-length CATs using both item pools. The mean number of items administered averaged across the 100 replications was smallest for no exposure control (35.5 and 35.9 for the large and small pools, respectively), followed by R-5 and PR-SE, and highest for SH (41.1 and 44.4 for the large and small pools, respectively). Mean RMSEs were more similar across conditions for each of the item pools using variable-length CATs, but the no exposure control condition still resulted in the smallest RMSE and PR-SE in the largest. Mean item overlap averaged across replications is once again smallest for PR-SE (8.9 and 10.4 for the large and small pools, respectively), then increases from SH to R-5 to no exposure control (15.9 and 17.4 for the large and small pools, respectively) for both item pools with variable-length CATs. As with the fixed-length CATs, the conditional bias and RMSE plots were similar across all exposure control methods (see Figures 1 and 2 displaying the 300-item pool condition).
Table 3. Descriptive Statistics Averaged Across 100 Replications for the Variable-Length Tests.
Note. None = no exposure control; R-5 = randomesque using a group of 5 items; SH = Sympson–Hetter; PR-SE = progressive-restricted standard error; NIA = number of items administered; SE = standard error; RMSE = root mean squared error.
Tables 4 and 5 present the distribution of exposure rates for each of the exposure control procedures for both fixed- and variable-length CATs for the large and small item pools, respectively. Results reveal for the large and small item pools that exposure rates for the no exposure control condition are as high as 1 and for PR-SE they are as high as .31 to .35. SH has similar maximum exposure rates to PR-SE, except that they are slightly higher under the large item pool variable-length CAT condition. For the small pool, R-5 produces item exposure rates as high as .81 to .90 for fixed-length CATs and .71 to .80 for variable-length CATs. For the large pool, maximum item exposure rates for R-5 are slightly less.
Table 4. Count and Percentage Frequency Distribution of Item Exposure Rates Averaged Across the 100 Replications (Large Item Pool).
Note. None = no exposure control; R-5 = randomesque using a group of 5 items; SH = Sympson–Hetter; PR-SE = progressive-restricted standard error.
Table 5. Count and Percentage Frequency Distribution of Item Exposure Rates Averaged Across the 100 Replications (Small Item Pool).
Note. None = no exposure control; R-5 = randomesque using a group of 5 items; SH = Sympson–Hetter; PR-SE = progressive-restricted standard error.
The item pool utilization is also presented in Tables 4 and 5. The PR-SE procedure administered nearly the entire item pool under all conditions. For the large item pool, the other procedures administered 52% or less of the item pool. The no exposure control procedure resulted in the lowest percentage of administered items from the pool under all conditions. For the small item pool, the SH, R-5, and no exposure control procedures administered more of the item bank than they did for the large item pool. SH administered about 80% of the items, whereas R-5 administered about 70% of the item bank. It should be noted that differences in item pool utilization among the procedures are amplified as pool size increases. For the fixed-length condition, the percentage of items not administered ranged on average from 2.1% to 61.8% for the 540-item pool and from 0% to 42.2% for the 300-item pool across replications.
Discussion
A variant of the PR item exposure control procedure was developed by McClarty et al. (2006) for use with variable-length CATs. This variant, termed the progressive-restricted standard error method, uses the current SE of theta rather than the serial position of the administered item, leading to its capability to be used with both variable-length and fixed-length CATs, instead of only the latter. The PR-SE method has only been previously assessed against the PR procedure and no item exposure control. The current study examined the PR-SE method with other commonly used procedures: no exposure control (None), R-5, and SH methods. Overall, the PR-SE method was able to estimate theta with good precision, administer almost all of the items from the bank, and reduce item overlap among the administered CATs using a dichotomous 3PL model.
For the small item pools with both fixed-length and variable-length CATs, grand mean SEs and mean RMSEs were slightly larger for all exposure control conditions, similar to the results found in McClarty et al. (2006). Other results from the current study that replicate previous studies, such as McClarty et al. (2006) and Revuelta and Ponsoda (1998), were that the grand mean item overlap and mean item exposure rates were best for the PR-SE procedure: comparable to SH, but substantially lower than for the no exposure control and R-5 methods. The grand mean number of items administered across the replications for PR-SE was slightly lower than for SH, but higher than for both no exposure control and R-5. PR-SE administered almost the entire item pool, whereas the other procedures administered about half or less of the large item bank and about 80% or less of the small item bank.
In general, this study has demonstrated that the PR-SE procedure may be a good item exposure control procedure to use with both fixed- and variable-length CATs. In particular, this method works well with either a large or small item pool because it resulted in similar measurement precision with the other exposure control procedures, and in less time than SH. PR-SE also administered fewer items on average than SH, which would result in reduced test burden for the test taker. Item bank utilization was best for PR-SE because it used almost the entire item bank and prevented overexposure of the items, which are essential components to item exposure control procedures in order to reduce item development, increase test security, and increase longevity of the items.
As with any study, this one has limitations. One limitation is the examinee ability distribution: only a normal distribution was considered, and future research might investigate the impact of other ability distributions. Future research should also investigate PR-SE with other models, such as polytomous item response theory models. Real-data simulations could be conducted comparing the PR-SE procedure to other item exposure control procedures not evaluated in this study. Additionally, the aim of the current study was to compare PR-SE with other procedures commonly used in operational settings. Further studies could compare PR-SE with other exposure control procedures such as the genetic programming procedures developed by Chen and Doong (2008) or the proportional methods developed by Barrada, Olea, Ponsoda, and Abad (2008).
As we move toward technology-based tests, and in particular computerized adaptive testing such as that of the Smarter Balanced Assessment Consortium, studies like the current one should be informative to practitioners. If practitioners need minimal item exposure control with high precision and test efficiency, then the randomesque procedure would be useful because it obtained the smallest standard errors and RMSEs and bias closest to zero, and it administered the fewest items compared to PR-SE and SH. If the need is for more balance between precision of measurement and item exposure control, then SH would be a good choice because it results in only slightly higher standard errors and RMSEs than R-5 but, unlike R-5, limits the item exposure rate. If practitioners are looking for stronger item exposure control while still maintaining precision of measurement, then PR-SE might be the preferred procedure because the results of this study suggest that it is the only method of the three that administers nearly the entire item bank while limiting the item exposure rate. Although PR-SE results in slightly higher RMSEs and standard errors than SH, the values were still reasonable. Of course, before implementing these procedures in an operational testing program, simulations using the test’s item bank should be conducted to ensure that the present results generalize to that testing program.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
