Abstract
This study compared various panel designs of the multistage test (MST) using mixed-format tests in the context of classification testing. Simulations varied the design of the first-stage module. The first stage was constructed according to three levels of test information functions (TIFs) with three different TIF centers. Additional computerized adaptive test (CAT) conditions provided baseline comparisons. Three passing rate conditions were also included. The various MST conditions using mixed-format tests were constructed properly and performed well. When the levels of TIFs at the first stage were higher, the simulations produced a greater number of correct classifications. CAT with the randomesque-10 procedure yielded comparable results to the MST with increased levels of TIFs. Finally, all MST conditions achieved better test security results compared with CAT’s maximum information conditions.
Classification testing is used to make either dichotomous decisions (e.g., pass/fail or mastery/nonmastery) or polytomous decisions (e.g., below basic, basic, intermediate, or proficient). This type of testing determines whether examinees have the level of ability required by the standards of a given test setting (Bergstrom & Lunz, 1999; Parshall, Spray, Kalohn, & Davey, 2002; Thompson, 2007). For example, classification testing has been used for licensing and certification testing, in educational settings, and in the counseling field (Jiao, 2003; Spray & Reckase, 1996).
The multistage test (MST) has been considered an effective approach to administering classification tests. In general, an adaptive test selects and administers test items according to each examinee’s dynamically estimated proficiency level. MST adapts at the level of prebuilt sets of items rather than at the individual item level, as is typical with the computerized adaptive test (CAT). Thus, MST is often considered a compromise between fully adaptive CAT and paper-and-pencil (P&P) tests (Jodoin, Zenisky, & Hambleton, 2006).
MST also has several additional components beyond CAT, including panels, stages, modules, and pathways (Luecht, 2000). Modules are the smallest units that contain groups of items. Combining several modules creates a stage. Panels are composed of specified stages within which several modules belong. Finally, pathways are the routes that examinees take from module to module within a particular panel.
MST forms are constructed prior to the test being administered. Often, MST assembly can be conducted using computer software such as an automated test assembly program, which allows many varieties of constraints (e.g., Breithaupt & Hare, 2007; Jodoin, 2003; Xing & Hambleton, 2004; Zenisky, 2004). As a result, quality assurance in MST is feasible because test specialists and content committee experts control the test forms before the test is administered (Patsula, 1999). Furthermore, test developers can consider cognitive levels, item formats, and word counts (Hendrickson, 2007; Luecht & Nungester, 1998; Patsula, 1999). Unlike CAT, MST allows examinees to review and correct previous responses within a test stage as the test is administered. Because ability estimates with CAT are calculated after the examinee answers each item, skipping among test items is not allowed. MST, however, allows the examinee to review the items within a module before moving to another section. This is because an examinee’s proficiency is calculated after finishing an entire module (Hendrickson, 2007; Luecht & Nungester, 1998; Patsula, 1999).
This study investigated MST panel construction designs by varying the levels of test information functions (TIFs) at the first stage and the centers of TIFs at the first stage using mixed-format tests. The mixed-format design more closely reflects a real test setting because it includes both dichotomous and polytomous items (Rosa, Swygert, Nelson, & Thissen, 2001). Some studies (e.g., Grady & Dodd, 2009; Ho & Dodd, 2008) have investigated mixed-format tests in the context of CAT design but not in the context of classification testing. Most previous studies examining MST in the context of classification testing, however, have used purely dichotomous items (e.g., Xing & Hambleton, 2004; Zenisky, 2004).
In addition, adequately estimating an examinee’s initial ability at the first stage is particularly important because it affects the quality of adaptation throughout the MST (Zenisky, 2004). Kim and Plake (1993) suggested variations, for example, in the distribution of item difficulty in the first-stage module. They demonstrated that a rectangular item difficulty distribution produced a better estimate of ability. Zenisky (2004) found that providing more information in the first module yielded better results when reduced levels of test information were used. However, all of these studies used only a dichotomous item pool. In the context of classification testing based on MST, no studies have investigated first-stage module variations using mixed-format tests that include both dichotomous and polytomous items.
Compared with traditional CAT, MST offers greater control over test forms because they are constructed in advance. Using this advantage, the current study built test forms in various ways. In particular, this study sought to use the item pool more economically during MST construction, especially in the first stage, by increasing and decreasing TIFs while maintaining the test length for all conditions. The results should be useful to testing programs seeking economical and efficient pool maintenance and MST construction. Finally, different levels of TIFs at the first stage should give test developers insight into the relationship between the level of information and decision accuracy.
Previously, when item pools consisting only of dichotomous items or only of polytomous items were considered, the term “item” was usually used for both. “Test unit” is the measurement unit used by Ho and Dodd (2008) and Grady and Dodd (2009) to indicate items with two to four categories in the mixed-format CAT design. In the current study, an “item” in the mixed-format MST (and CAT) designs is likewise called a “test unit.”
For baseline comparisons, this study included CAT with the maximum information (MI) procedure and CAT with the randomesque-10 procedure (Kingsbury & Zara, 1989). CAT with the MI procedure chose and administered the most informative test unit at the examinee’s current ability estimate. CAT with the randomesque-10 procedure first identified the 10 most informative test units at the examinee’s current ability estimate and then randomly selected 1 of those 10 to administer. Kingsbury and Zara’s (1989) content balancing procedure was also applied for all CAT and MST conditions. This procedure made it possible to administer test units according to a predetermined proportion of each content area and test unit type for the test. All conditions were evaluated in terms of classification accuracy and exposure control properties under three passing rates (40%, 50%, and 60%) based on a normal distribution of examinees.
Method
Various MST design conditions for binary decision making (i.e., pass/fail) were considered in this study. The condition of 27 test units was applied as a stopping rule for both CAT simulations and constructing and simulating MST panels. Based on the percentage of each test unit type (i.e., dichotomous, three-category, and four-category test units) in the current pool, the 27 test units result in 43 possible score points if converted to dichotomous test units. The test length of 27 test units with 43 score points was selected to correspond approximately to the test length that Ho and Dodd (2008) used for their study of a realistic testing condition, which was based on the fixed-length, mixed-format CAT in a high-stakes test setting.
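The 43-point figure can be checked roughly against the pool composition reported under Mixed-Format Test Unit Pool. The following is a minimal sketch, assuming that a test unit with k categories contributes k − 1 score points and that a 27-unit test mirrors the pool proportions exactly; the variable names are illustrative only.

```python
# Pool composition by number of score categories (counts reported for the pool).
pool_counts = {2: 244, 3: 113, 4: 67}
test_length = 27

pool_size = sum(pool_counts.values())
# Assumption: a test unit with k categories is worth k - 1 score points.
points_per_unit = sum((k - 1) * n for k, n in pool_counts.items()) / pool_size
expected_points = test_length * points_per_unit

print(pool_size, round(points_per_unit, 3), round(expected_points, 1))
```

Under these assumptions the expected number of score points is about 42.7, which is in line with the reported 43 once whole test units are allocated to each type.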
As noted, three cutoff score points were set as passing rates of 40%, 50%, and 60%. The latent ability scale based on a normal distribution of examinees determined the location of cutoff score points. Thus, passing rates of 40%, 50%, and 60% correspond approximately to theta (θ) points (or cutoff scores) of 0.254, 0.000, and −0.254. These passing rates were chosen to investigate the performance of the MST design at various points along the ability scale, especially when the difference between cutoff theta points is very small. The current method of choosing passing rates based on the normal distribution of examinees has been used in numerous studies (e.g., Hambleton & Xing, 2006; Zenisky, 2004).
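The mapping from passing rates to cutoff points can be reproduced directly from the standard normal quantile function. This is a small sketch, assuming a passing rate p places the cutoff at the (1 − p) quantile of N(0, 1); the function name `cutoff_theta` is hypothetical.

```python
from statistics import NormalDist

def cutoff_theta(passing_rate):
    """Theta above which the given proportion of an N(0, 1) population falls."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - passing_rate)

for rate in (0.40, 0.50, 0.60):
    print(f"passing rate {rate:.0%}: cutoff theta = {cutoff_theta(rate):+.3f}")
```

The exact quantiles are about ±0.253, which agrees with the approximate cutoffs reported above to two decimal places.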
The simulation study included MST conditions of 3 (levels of TIFs at the first stage) × 3 (centers of the TIFs at the first stage) × 3 (passing rates). In addition, CAT conditions of 2 (exposure controls) × 3 (passing rates) were included. Thus, 33 conditions were evaluated for classification accuracy and test security.
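The condition count can be verified by enumerating the crossed factors. The sketch below uses illustrative labels for the factor levels; only the counts are taken from the design described above.

```python
from itertools import product

tif_levels = ("decreased", "regular", "increased")
tif_centers = (-1.0, 0.0, 1.0)
passing_rates = (0.40, 0.50, 0.60)
cat_procedures = ("maximum information", "randomesque-10")

mst_conditions = list(product(tif_levels, tif_centers, passing_rates))
cat_conditions = list(product(cat_procedures, passing_rates))

# 27 MST conditions plus 6 CAT conditions give 33 in total.
print(len(mst_conditions), len(cat_conditions),
      len(mst_conditions) + len(cat_conditions))
```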
Mixed-Format Test Unit Pool
The test unit pool for the current study included three test unit types and three subcontent areas, producing nine content cells. Among the 424 total test units, 244 (57.55%) were dichotomous test units; 113 test units (26.65%) had three categorical scores; and 67 test units (15.80%) had four categorical scores. Furthermore, three content areas consisted of 126 (29.72%) Content I test units; 148 (34.90%) Content II test units; and 150 (35.38%) Content III test units. Item parameter estimates for the generalized partial credit model (GPCM; Muraki, 1992) were obtained from the technical manual for a national test.
Data Generation
Using the GPCM parameters of the 424 test units, responses were generated using the IRTGEN SAS macro (Whittaker, Fitzpatrick, Williams, & Dodd, 2003). A random number representing a simulated examinee’s known ability was drawn from a standard normal distribution, N(0, 1). The probability of responding in each category given the simulated examinee’s known ability level was calculated for each test unit according to the GPCM. The probabilities were then summed to create a cumulative response probability, which ranged from 0 to 1. Next, a random number was drawn from a uniform distribution and compared with the cumulative response probability. The simulated examinee was assigned the category score corresponding to the location in the cumulative probability distribution that was at or below where the random number fell. This procedure was repeated for all simulated examinees and all test units. Forty replications, each with 1,000 simulated examinees, were generated for evaluation.
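The generation steps described above can be illustrated in Python. This is a sketch under stated assumptions, not the IRTGEN macro itself: the GPCM is parameterized by a discrimination `a` and step parameters b_1..b_m with no scaling constant D, the draw uses the usual inverse-CDF rule, and the three test units and their parameter values are hypothetical.

```python
import math
import random

def gpcm_probs(theta, a, steps):
    """Category probabilities under the GPCM.

    steps holds the step parameters b_1..b_m, giving m + 1 score categories.
    Assumption: no scaling constant D is applied.
    """
    z = [0.0]  # cumulative logit for category 0 is fixed at 0
    for b in steps:
        z.append(z[-1] + a * (theta - b))
    top = max(z)  # subtract the maximum for numerical stability
    expz = [math.exp(v - top) for v in z]
    total = sum(expz)
    return [v / total for v in expz]

def draw_response(theta, a, steps, rng):
    """Inverse-CDF draw: first category whose cumulative probability exceeds u."""
    u = rng.random()
    cumulative = 0.0
    probs = gpcm_probs(theta, a, steps)
    for category, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return category
    return len(probs) - 1  # guard against floating-point round-off

rng = random.Random(2024)
# Hypothetical parameters: one dichotomous, one three-category, one four-category unit.
units = [(1.0, [0.0]), (0.8, [-0.5, 0.5]), (1.2, [-1.0, 0.0, 1.0])]
thetas = [rng.gauss(0.0, 1.0) for _ in range(1000)]
responses = [[draw_response(t, a, steps, rng) for a, steps in units] for t in thetas]
print(responses[0])
```

Each row of `responses` holds one simulated examinee's category scores; repeating the whole procedure with fresh random draws would give one replication per run.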
Multistage Test Assembly and Simulations
The SAS program written by the authors assembled the panels and modules using target TIFs. Fundamentally, this SAS program for MST assembly is based on Luecht’s (2000) normalized weighted absolute deviations heuristic, but it was modified according to this study’s design. Three panels were constructed for each condition, and each panel included three stages and seven pathways. Each pathway met the requirements of Kingsbury and Zara’s (1989) content balancing because each pathway reflected the percentage of each content area and test unit type within the entire test. The first stage had one module, whereas the second and third stages were each composed of three modules. Thus, a 1–3–3 panel structure was used for the current study. A fixed test length of 27 test units was used for all conditions, with nine test units assigned to each module at each of the three stages.
The TIFs for the first-stage module were targeted at theta points of −1.0, 0.0, and 1.0. In addition, the decreased levels of TIFs were constructed to have 40% to 50% less information compared with the regular levels of TIFs. The increased levels of TIFs were constructed to have 40% to 50% more information compared with the regular levels of TIFs. For the second and third stages, no changes were made to the TIF levels. As constructing the test continued, test units were used only once across the three panels, each of which had seven modules.
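To make the TIF levels concrete, the sketch below computes a module information function under the GPCM and derives the bands implied by “40% to 50% less” and “40% to 50% more” information. It assumes the standard GPCM information formula without a scaling constant D (discrimination squared times the variance of the category score); the nine-unit module and its parameters are hypothetical, and the module's own information stands in for the regular level.

```python
import math

def gpcm_probs(theta, a, steps):
    """GPCM category probabilities for step parameters b_1..b_m (no constant D)."""
    z = [0.0]
    for b in steps:
        z.append(z[-1] + a * (theta - b))
    top = max(z)
    expz = [math.exp(v - top) for v in z]
    total = sum(expz)
    return [v / total for v in expz]

def unit_information(theta, a, steps):
    """GPCM information: a^2 times the variance of the category score."""
    probs = gpcm_probs(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second - mean * mean)

def module_information(theta, module):
    """The module TIF at theta is the sum of its unit information values."""
    return sum(unit_information(theta, a, steps) for a, steps in module)

# Hypothetical nine-unit first-stage module centered near theta = 0.
module = ([(1.0, [0.0])] * 5
          + [(0.9, [-0.4, 0.4])] * 3
          + [(1.1, [-0.8, 0.0, 0.8])])

regular = module_information(0.0, module)
decreased_band = (0.5 * regular, 0.6 * regular)  # 40% to 50% less information
increased_band = (1.4 * regular, 1.5 * regular)  # 40% to 50% more information
print(round(regular, 3), decreased_band, increased_band)
```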
During the simulation, one of the three panels was assigned randomly to each simulated examinee. Although the MST approach includes several routing methods, the current study used the modified approximate maximum information (M-AMI) method to route simulated examinees from one stage to another. The AMI (Luecht, Brumfield, & Breithaupt, 2006) method routes examinees by determining the intersection between cumulative TIFs of subsequent modules. Examinees are then routed to the next-stage module that provides the MI of cumulative TIFs given the examinee’s current ability estimate. AMI was modified for the current study so that after completing the first stage, the simulated examinee was routed to the next-stage module providing the maximum amount of information based on the maximum likelihood estimation (MLE) of ability. From the second to the third stage, simulated examinees were also routed using the M-AMI method. For this stage, however, a simulated examinee could only be routed to a module with the same or an adjacent difficulty level as the module taken in the second stage.
Under each condition, a fixed-length stopping rule of 27 test units was applied, and the cutoff score varied with the passing rate. After the simulations were complete, simulated examinees were classified as “pass” or “fail” by comparing their ability estimates with the cutoff score points. Table 1 presents the details of the MST simulation conditions.
Table 1. Simulation Designs for MST Conditions
Note: MST = multistage test; TIFs = test information functions. Table footnotes: levels of TIFs at the first stage; centers of TIFs at the first stage.
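The routing rule can be sketched as picking, among the reachable modules, the one with the most information at the current ability estimate. The code below is a simplified illustration rather than the authors' M-AMI implementation: it uses a toy information function for dichotomous units only, hypothetical module contents, and the assumption that the stage-3 restriction allows the same or a neighboring difficulty level, which reproduces the seven pathways per panel.

```python
import math

LEVELS = ("easy", "medium", "hard")

def allowed_next(level):
    """Stage-3 modules reachable from a stage-2 level (assumed: same or adjacent)."""
    i = LEVELS.index(level)
    return [LEVELS[j] for j in range(len(LEVELS)) if abs(i - j) <= 1]

def module_info(theta, module):
    """Toy information function for dichotomous (a, b) units only."""
    total = 0.0
    for a, b in module:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total

def route(theta_hat, candidates):
    """Pick the candidate module with the most information at theta_hat."""
    return max(candidates, key=lambda name: module_info(theta_hat, candidates[name]))

# Hypothetical next-stage modules, nine units each.
stage = {
    "easy": [(1.0, -1.0)] * 9,
    "medium": [(1.0, 0.0)] * 9,
    "hard": [(1.0, 1.0)] * 9,
}

first_route = route(-0.8, stage)  # unrestricted routing after stage 1
stage3_choices = {name: stage[name] for name in allowed_next("easy")}
second_route = route(1.5, stage3_choices)  # "hard" is not reachable from "easy"

pathways = [(s2, s3) for s2 in LEVELS for s3 in allowed_next(s2)]
print(first_route, second_route, len(pathways))
```

With these toy modules, an examinee with an estimate of −0.8 is routed to the easy module, and the restriction caps an examinee who took the easy stage-2 module at the medium stage-3 module even when the estimate is high.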
Computerized Adaptive Test Simulations
Six CAT simulations were also conducted according to different cutoff scores. The initial ability estimate of each simulated examinee was equal to zero (i.e., the mean of the population). The test units were selected based on the MI or randomesque-10 (Kingsbury & Zara, 1989) procedure, with Kingsbury and Zara’s (1989) content balancing procedure considering content areas and test unit types. The MLE procedure estimated each simulated examinee’s ability. A fixed-length of 27 test units terminated the tests.
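The two selection procedures differ only in how many top-ranked test units are eligible for the random draw. The following is a minimal sketch with hypothetical information values; it leaves out content balancing and the exclusion of already administered test units, and the function name `select_unit` is illustrative.

```python
import random

def select_unit(info_by_unit, rng, group_size=10):
    """Randomesque selection: random draw from the group_size most informative units.

    With group_size=1 this reduces to maximum-information selection.
    """
    ranked = sorted(info_by_unit, key=info_by_unit.get, reverse=True)
    return rng.choice(ranked[:group_size])

rng = random.Random(7)
# Hypothetical information values at the current ability estimate for 30 eligible units.
info_by_unit = {f"unit_{i:02d}": 1.0 / (1 + i) for i in range(30)}

mi_choice = select_unit(info_by_unit, rng, group_size=1)
randomesque_choices = {select_unit(info_by_unit, rng, group_size=10) for _ in range(200)}
print(mi_choice, sorted(randomesque_choices))
```

Spreading administrations over the 10 most informative test units is what lowers the maximum exposure rate relative to MI selection, at some cost in information per test unit.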
Data Analyses
All MST and CAT conditions were evaluated in terms of the following decision-making criteria: (1) correct classification rate (CCR), (2) false-negative error rate, (3) false-positive error rate, and (4) total error rate. First, in the MST and CAT context, simulated examinees in each condition were classified as either “pass” or “fail” by comparing their true and estimated abilities with the cutoff scores for the three passing rates of 40%, 50%, and 60%. A correct classification was considered an accurate decision: The true ability and the ability estimated from the MST or CAT approach both classify the simulated examinee as “pass,” or both classify the examinee as “fail.” A false-negative classification is the decision to classify examinees who “pass” according to their true abilities into the “fail” category based on their estimated abilities. A false-positive classification is the decision to classify examinees who “fail” according to their true abilities into the “pass” category based on their estimated abilities. The total classification error was the sum of these two classification errors. These four criteria were averaged across 40 replications for each condition, and each replication contained 1,000 simulated examinees.
In addition, pool usage, the frequency distribution, and descriptive statistics of test unit exposure rates (including the mean, the standard deviation, and the maximum value of the exposure rate) were included as exposure control properties. These values were also averaged across 40 replications for each condition.
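The four criteria can be computed from paired true and estimated abilities as follows. This is an illustrative sketch, assuming that an ability at or above the cutoff counts as “pass”; the five examinees are hypothetical.

```python
def classification_rates(true_thetas, est_thetas, cutoff):
    """Return CCR, FNER, FPER, and TER as percentages.

    Assumption: an ability at or above the cutoff counts as "pass".
    """
    counts = {"CCR": 0, "FNER": 0, "FPER": 0}
    for true_theta, est_theta in zip(true_thetas, est_thetas):
        true_pass = true_theta >= cutoff
        est_pass = est_theta >= cutoff
        if true_pass == est_pass:
            counts["CCR"] += 1    # both pass or both fail
        elif true_pass:
            counts["FNER"] += 1   # true pass classified as fail
        else:
            counts["FPER"] += 1   # true fail classified as pass
    n = len(true_thetas)
    rates = {name: 100.0 * count / n for name, count in counts.items()}
    rates["TER"] = rates["FNER"] + rates["FPER"]
    return rates

# Hypothetical true and estimated abilities for five examinees, cutoff at theta = 0.
true_thetas = [0.5, 0.3, -0.2, -0.6, 0.1]
est_thetas = [0.4, -0.1, 0.05, -0.7, 0.2]
rates = classification_rates(true_thetas, est_thetas, cutoff=0.0)
print(rates)
```

In this toy case three of five decisions are correct, with one false negative and one false positive, so the CCR and the total error rate sum to 100%.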
Results
Multistage Test Assembly
According to the different conditions, 1–3–3 MST panel structures with three panels were constructed. Constructing each panel incorporated Kingsbury and Zara’s (1989) content balancing into the pathway level. The first-stage modules were constructed to reflect three different TIF centers (i.e., theta points of −1.0, 0.0, and 1.0) and three levels of TIFs (i.e., increased, regular, and decreased). For example, Figure 1 illustrates the actual constructed information functions at the first-stage module according to different levels of TIFs when the TIFs were peaked at the theta point of 0.0. The second and the third stages for each condition were constructed to have the same amount of TIFs as the regular level of TIFs.

Figure 1. The first-stage module information function
Multistage Test/Computerized Adaptive Test Simulations
All classification rates in percentages were averaged across 40 replications, and each replication contained 1,000 simulated examinees. All of the MST conditions performed well by producing total error rates of less than 10% (see Table 2). In most of the conditions, the mean false-positive error rates and mean false-negative error rates were less than 5%.
Table 2. Comparisons for the Classification Error and Accuracy Rates of MST Conditions
Note: All statistics were averaged across 40 replications. Each replication contained 1,000 observations. MST = multistage test; TIFs = test information functions; CCR = correct classification rate; FNER = false-negative error rate; FPER = false-positive error rate; TER = total error rate. Table footnotes: levels of TIFs at the first stage; centers of TIFs at the first stage.
When conditions for the increased level of TIF were considered, the mean CCRs across centers and passing rates ranged from 90.74% to 91.53%. When conditions for the regular level of TIF were used, the mean CCRs across centers and passing rates ranged from 90.44% to 91.36%. When conditions for the decreased level of TIF were used, the mean CCRs across centers and passing rates ranged from 89.81% to 90.64%.
Within the increased levels of TIFs conditions, the maximum mean CCR differences across different centers were 0.31% when the passing rate was set at 40%; 0.61% when the passing rate was set at 50%; and 0.64% when the passing rate was set at 60%. The largest mean CCR differences for regular levels of TIFs conditions across different centers were 0.57% for the 40% passing rate condition; 0.62% for the 50% passing rate condition; and 0.79% for the 60% passing rate condition. Finally, the decreased level of information conditions across different centers produced maximum differences in mean CCRs of 0.41% when the passing rate was 40%; 0.38% when the passing rate was 50%; and 0.50% when the passing rate was 60%.
Furthermore, in most of the conditions, the 50% passing rate conditions produced the lowest mean CCRs, whereas the 40% and 60% passing rate conditions yielded similar results in their mean CCRs.
As expected, CAT with the MI procedure produced the highest mean CCRs, ranging from 93.06% to 93.50% according to the different passing rate conditions (see Table 3). Similar to the MST conditions, the passing rate of 50% produced the lowest mean CCRs. CAT with the randomesque-10 procedure yielded mean CCRs across different passing rate conditions ranging from 91.44% to 91.86%. The passing rate of 50% produced the highest mean total error rate of 8.56%.
Table 3. Comparisons for the Classification Error and Accuracy Rates of CAT Conditions
Note: All statistics were averaged across 40 replications. Each replication contained 1,000 observations. CAT = computerized adaptive test; MI = maximum information; CCR = correct classification rate; FNER = false-negative error rate; FPER = false-positive error rate; TER = total error rate. Table footnotes: CAT with the maximum information; CAT with the randomesque-10 procedure.
Exposure Control Properties
Table 4 displays the pool usage, frequency distribution, and descriptive statistics of test unit exposure rates for the MST conditions. Like the calculation for classification accuracy, all of the exposure control properties were averaged across 40 replications, with each replication containing 1,000 simulated examinees. The test unit exposure rate was calculated based on the number of times a particular test unit was administered to simulated examinees divided by the total number of simulated examinees. The pool usage rates were based on the percentage of the test unit pool that was not administered during the test.
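The exposure statistics defined above can be computed from administration records. The sketch below is illustrative only: the function name and the toy data are hypothetical, and the standard deviation is taken over test units with a population (divide-by-N) formula, which is an assumption.

```python
from collections import Counter

def exposure_summary(administered, pool):
    """Exposure statistics from per-examinee lists of administered unit ids.

    administered: one list of unit ids per simulated examinee.
    pool: all unit ids over which the rates are computed.
    """
    n_examinees = len(administered)
    counts = Counter(unit for test in administered for unit in test)
    rates = [counts[unit] / n_examinees for unit in pool]
    mean = sum(rates) / len(rates)
    # Assumption: population standard deviation across test units.
    sd = (sum((r - mean) ** 2 for r in rates) / len(rates)) ** 0.5
    return {
        "mean": mean,
        "sd": sd,
        "max": max(rates),
        "pct_not_administered": 100.0 * sum(r == 0 for r in rates) / len(rates),
        "n_below_10pct": sum(r < 0.10 for r in rates),
    }

# Toy data: four examinees, three units each, from a six-unit pool.
pool = ["u1", "u2", "u3", "u4", "u5", "u6"]
administered = [["u1", "u2", "u3"], ["u1", "u2", "u4"],
                ["u1", "u3", "u4"], ["u1", "u2", "u5"]]
summary = exposure_summary(administered, pool)
print(summary)
```

In the toy data the mean exposure rate equals the test length divided by the pool size (3/6), as it must for any fixed-length test.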
Table 4. Pool Utilization and Exposure Rates for MST Conditions
Note: All statistics were averaged across 40 replications. Each replication contained 1,000 observations. MST = multistage test; TIFs = test information functions; PR = passing rate; PS = pool size; ER = exposure rate; NA = not administered; ERA = exposure rate average; SD = standard deviation; ERM = exposure rate maximum. Table footnotes: levels of TIFs at the first stage; centers of TIFs at the first stage.
MST used only 189 (of 424) test units in assembling the panel for test length conditions of 27 test units.
Computing the exposure control properties of the MST conditions was based only on the proportion of the entire test unit pool used to construct the MST panels (i.e., 189 test units). Chen, Ankenmann, and Spray (2003) defined the mean exposure rate as the ratio of the test length to the pool size when the test length is fixed. Thus, the grand mean of test unit exposure rates was .143 (i.e., 27 divided by 189) across all conditions. Moreover, because test units were only used once across three panels with seven modules each, and each panel was randomly assigned to each simulated examinee, the mean maximum exposure rate was less than .35 (i.e., .337) across all conditions. An average of 61 to 95 (or 32.28% to 50.26%) of the test units had exposure rates less than .10. Thus, pool usage was excellent throughout all MST conditions because all of the test units were used to construct the panels (i.e., 189 test units).
Table 5 describes the pool usage, frequency distribution, and descriptive statistics of test unit exposure rates for the CAT conditions averaged across 40 replications with 1,000 simulated examinees. CAT with the MI procedure and CAT with the randomesque-10 procedure produced the same grand mean of test unit exposure rates (i.e., .064) across all conditions (Chen et al., 2003). CAT with the MI procedure, however, produced substantially lower mean pool usage in that an average of 64.62% of the test unit pool was not used as the test was administered. An average of 80 (i.e., 18.87%) test units had exposure rates less than .10. In addition, the CAT with the MI conditions had a higher mean maximum exposure rate (i.e., .843) than the CAT with the randomesque-10 conditions.
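The grand means reported for the MST and CAT conditions follow from the ratio of test length to pool size. The short check below also includes the expected exposure of a first-stage test unit, assuming the three panels are assigned with equal probability; the variable names are illustrative.

```python
test_length = 27
mst_units = 3 * 7 * 9   # panels x modules per panel x test units per module
cat_pool = 424

mst_mean_exposure = test_length / mst_units
cat_mean_exposure = test_length / cat_pool
# Assumption: each of the three panels is assigned with probability 1/3, so every
# first-stage test unit is seen by about one third of the examinees.
first_stage_exposure = 1 / 3

print(mst_units, round(mst_mean_exposure, 3),
      round(cat_mean_exposure, 3), round(first_stage_exposure, 3))
```

The expected first-stage exposure of about .333 is close to the reported mean maximum exposure rate of .337 for the MST conditions.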
Table 5. Pool Utilization and Exposure Rates for CAT Conditions
Note: All statistics were averaged across 40 replications. Each replication contained 1,000 observations. CAT = computerized adaptive test; MI = maximum information; SD = standard deviation. Table footnotes: CAT with the maximum information; CAT with the randomesque-10 procedure.
CAT with the randomesque-10 condition had a mean maximum exposure rate of .322, which was similar to the MST conditions (i.e., .337). The mean pool usage rates for CAT with the randomesque-10 procedure showed that an average of 17.69% of the test unit pool was not used during test administration. An average of 241 (i.e., 56.84%) test units had exposure rates less than .10.
Discussion
MST has many advantages compared with traditional CAT (e.g., assuring test form quality). Drawing on these advantages, the current study constructed and investigated various panel structures emphasizing first-stage module constructions using a mixed-format pool rather than single-format pool in the context of classification testing. This study’s results confirmed that various MST conditions were constructed properly and performed well using mixed-format tests in terms of classifying simulated examinees into dichotomous categories across testing conditions.
Most conditions obtained mean CCRs above 90%, with each mean error rate below 5%, across the different first-stage designs and passing rates. Some conditions, however, particularly those with the decreased level of information, produced mean error rates slightly higher than 5%. According to Jiao (2003), in actual licensure and certification testing situations, nominal false-positive and false-negative rates are often set at .05 (or 5%) to maintain a balance between the two error rates. Compared with these nominal error rates of 5%, the conditions of the current study’s simulations produced satisfactory results overall.
In particular, higher levels of TIFs at the first-stage module achieved better accuracy in the classification decision. The differences in mean CCRs between increased and decreased levels of TIFs at the first stage ranged from 0.70% to 1.44% given the center and passing rate conditions. This has economical implications for test construction. Using fewer items (or test units) with high measurement precision at the first stage can yield results comparable to using many items with less information. In other words, if the pool from which the MST is assembled is composed of highly informative items, the number of items can be relatively small, and test designers can obtain accuracy similar to that from a pool of relatively less informative items.
As expected, CAT with the MI conditions produced the best classification accuracy. This is because MST selected test units at the module level (i.e., a set of test units), not the individual test unit level. The accuracy in classification, therefore, can drop compared with CAT with MI, which chose the most informative individual test unit based on the simulated examinee’s current ability estimate. CAT with the randomesque-10 procedure, however, produced comparable results to the MST conditions with the increased levels of TIFs.
Finally, this study showed that the MST achieved better pool usage than the CAT conditions because every test unit used to construct the panels was administered. The CAT with the MI conditions yielded, on average, a higher percentage of test units not administered. The mean maximum exposure rate of the MST was less than .35 (i.e., .337), whereas it was approximately .843 for CAT with the MI procedure and .322 for CAT with the randomesque-10 procedure.
The current study’s designs have not previously been investigated for mixed-format MST in the context of classification testing. Thus, by replicating and broadening previous research that used only a single item type, the current study’s design with its satisfactory results will be useful to test administrators as they further develop flexible MST panel designs to meet test specifications.
Future research studies could compare MST conditions and other CAT methods with exposure controls, such as the progressive–restricted procedure (Revuelta & Ponsoda, 1998) or the Sympson–Hetter procedure (Sympson & Hetter, 1985). Such studies will be useful for test developers as they make decisions about the use of MSTs relative to CATs. Furthermore, the test unit pool of this study was somewhat negatively skewed, such that constructing the easy modules was challenging. Thus, it will be interesting to examine how pools with different characteristics interact with MST panel constructions.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
