Abstract
This study investigated a new pool utilization method of constructing multistage tests (MST) using the mixed-format test based on the generalized partial credit model (GPCM). MST simulations of a classification test were performed to evaluate the MST design. A linear programming (LP) model was applied to perform MST reassemblies based on the initial MST construction. Three subsequent MST reassemblies were performed. For each reassembly, three test unit replacement ratios (TRRs; 0.22, 0.44, and 0.66) were investigated. The conditions of the three passing rates (30%, 50%, and 70%) were also considered in the classification testing. The results demonstrated that various MST reassembly conditions increased the overall pool utilization rates, while maintaining the desired MST construction. All MST testing conditions performed equally well in terms of the precision of the classification decision.
The multistage test (MST) is considered an effective adaptive test approach for classification tests (Jodoin, 2003; Zenisky, 2004). Compared with computerized adaptive testing (CAT), the MST design has several additional components, such as panels, modules, stages, and pathways (Luecht, 2000). Modules, which are collections of items, are the smallest units of the MST. Adaptation in MST occurs after an examinee answers all items in a module. Items within a module usually possess similar statistical attributes, even though they vary in terms of content and other non-statistical attributes. A stage, the next unit of MST, contains one or more modules. Because an examinee takes only one of the modules within a stage, the statistical characteristics of modules vary according to the purpose of the test. Panels are the largest units of MST and contain several stages. An MST panel design with one module at the first stage and three modules at each of the second and third stages is called a 1-3-3 panel structure.
Because test forms are constructed prior to administration, MST supports quality control procedures such as exposure control and content balancing. As a result, MST allows test designers to make detailed adjustments and modifications to satisfy various test requirements. A further benefit, this time for examinees, is that they may be allowed to review and modify answers to previous items within the module (Patsula, 1999). This feature is possible because ability estimation is performed at the end of the module, rather than after each response to an individual item.
Properly constructing the MST is crucial for producing accurate results (Zenisky, 2004). Manual construction of MST is not an option considering the time and accuracy required. However, thanks to advances in computing, automated test assembly (ATA) is now readily available to assemble MSTs that satisfy all required constraints. In particular, the capability of ATA programs to produce fixed-length parallel tests while accommodating various constraints, including test length, content areas, and item exposure control, has proved essential to MST construction. ATA software uses optimization algorithms such as linear programming (LP) to produce multiple panels simultaneously in MST (Luecht & Nungester, 1998).
In terms of pool utilization, a large proportion of good quality items in the pool is often not selected and is ultimately discarded even though maintaining an item pool requires extensive time and effort, including evaluating pretest items and adjusting the current pool, among other activities (Parshall, Spray, Kalohn, & Davey, 2002). Rather than managing this costly process, utilizing more of the unused, good quality items to construct the MST is more practical and economical. Enhancing pool utilization has been accomplished through various methods of computerized tests. Under CAT, such efforts are related closely with content balancing (Kingsbury & Zara, 1989; Stocking & Swanson, 1993) and exposure control (Revuelta & Ponsoda, 1998; Sympson & Hetter, 1985). Because the CAT performs tailored item administration for each individual, high information items are used more frequently than low information items. High exposure rates of a small portion of items leave a large portion of unutilized items, which represents poor use of valuable resources. In addition, the overexposure of a small number of items threatens test security. Therefore, pool utilization is improved indirectly by constraining the administration of popular items. In contrast, content constraint gives a test content validity evidence according to a test blueprint. For this, additional constraints are added to the item selection procedure to achieve a fair assessment across content areas. Because the item use is distributed across varied content areas, pool utilization enhancement is achieved indirectly.
In MST, pool utilization is evaluated through optimum test construction. Pool utilization is defined as the number of items in the MST forms divided by the number of items in the pool. Using LP, the popular methods available for constructing MSTs are the sequential, simultaneous, and shadow approaches. Sequential assembly may not yield the maximum number of forms because items used for one form are removed from the pool, which may block the assembly of later forms (van der Linden, 2005). Simultaneous assembly constructs multiple forms at once, overcoming the greedy effect of sequential assembly (Boekkooi-Timminga, 1987). However, the number of equations may grow beyond a practical limit as the number of test forms increases. The shadow approach solves a large simultaneous problem as a series of smaller simultaneous assemblies. It alleviates the burden of simultaneous assembly by constructing several manageable test forms and a shadow test at each step.
MST assembly methods described thus far focus on solving an optimization problem and enhance pool utilization indirectly by constructing as many forms as possible (i.e., optimal test construction). In the present article, however, the authors suggest a new LP approach to enhance pool utilization after the initial MST construction through multiple executions of MST constructions. After the initial MST construction, unutilized items are incorporated into the subsequent MST reassemblies through systematic programming. This increases the total number of MST forms and overall pool utilization. The incremental enhancement of pool utilization is the unique advantage of the method proposed in the present article. In addition, this algorithm is applicable to incremental changes in the pre-existing MST forms for the utilization of newly developed items. For example, assume that 10% of items are newly developed and added to the pool to satisfy certain extraneous requirements. They must be incorporated into the test by replacing a portion of items already in the test. Performing MST construction from scratch does not guarantee that these freshly added items will be included.
The conventional assessment is generally constructed using a mixed format of items composed of both dichotomously scored items (e.g., multiple choice) and polytomously scored items (e.g., a constructed response; Rosa, Swygert, Nelson, & Thissen, 2001). Mixed-format tests tend to increase the test’s reliability, the validity of scores, and cost efficiency due to the enhanced psychometric features of the tests (Ercikan et al., 1998).
Although the importance of mixed-format tests has been highlighted, they have not been investigated extensively in MST. The current study thus applied an MST pool utilization method to the mixed-format design. The remainder of this article starts by describing item pools and data generation. Then, the MST assembly and reassembly methods, including content balancing and creating target test information functions (TIFs), are explained in detail. Finally, MST simulations with their results and a discussion are presented.
Method
The current study used the mixed-format test unit pool. Here, the term “test unit” denotes either dichotomously or polytomously scored items in the mixed-format design (Grady & Dodd, 2009). To increase pool utilization, additional MST constructions should be performed based on the initial one. To clarify that these additional constructions differ from the initial MST construction, the term MST reassembly is introduced in this study. In essence, MST reassembly replaces a proportion of utilized test units with unused test units, thereby increasing overall pool utilization while satisfying statistical and non-statistical constraints.
Three successive MST reassemblies based on the initial MST construction were performed to demonstrate the increase in pool utilization: the first reassembly, the second reassembly, and the third reassembly. Here, “successive” signifies that the first reassembly is performed based on the initial MST construction, the second reassembly is performed based on the first reassembly, and so on. The proportion of test units within a module that need to be replaced for the subsequent MST reassembly is defined as the test unit replacement ratio (TRR). For each reassembly, three different TRRs were investigated: 0.22, 0.44, and 0.66. Nine different MST constructions (3 reassemblies × 3 TRRs) were produced. Each MST contained three panels, and each panel contained seven modules, satisfying the 1-3-3 MST design.
For the present study, MST simulations were conducted in the context of classification accuracy. For these simulations, three cutoff score points were set as passing rates of 30%, 50%, and 70%. The cutoff score locations were the 30th, 50th, and 70th percentiles of normally distributed examinees. Classification decisions were made by comparing θ estimates with the cutoff θ values. Therefore, the design of this study included 3 (reassemblies) × 3 (TRRs) × 3 (passing rates) for a total of 27 conditions being evaluated for classification accuracy.
Mixed-Format Pool
Parameter estimates of the mixed-format test unit pool calibrated according to the generalized partial credit model (GPCM; Muraki, 1992) were obtained from the technical manual for a national test. The test unit pool consists of 424 total test units. Of the 424 total test units, 244 are dichotomous test units, 113 test units have three categorical scores, and 67 test units have four categorical scores. Furthermore, the test unit pool contains three content areas: 126 Area I test units, 148 Area II test units, and 150 Area III test units.
Data Generation
Test unit responses were generated using the IRTGEN SAS macro (Whittaker, Fitzpatrick, Williams, & Dodd, 2003). One hundred replications were performed, with 1,000 normally distributed simulees being generated for this study.
MST Assembly
A JAVA ATA program named JPLEX (Park, Kim, Dodd, & Chung, 2011) was used to assemble the panels and modules automatically. A 1-3-3 MST panel structure was used for the current study, and 27 fixed length test units were used for all conditions, with 9 test units assigned to each of the three stages. Three panels were constructed for each condition. In addition, each of the seven pathways met the requirements of Kingsbury and Zara’s (1989) content balancing procedure. The number of test units for the specific content for each module was pre-calculated and modeled as constraints in the LP modeling.
Target TIF construction
Two conflicting goals need to be balanced when the target TIFs are determined: (a) The MSTs must provide desired information to produce reliable scores, and (b) the item pool should be able to supply items that satisfy target TIFs. In other words, they should not be set to unrealistically large or small values across the θ scale (Luecht & Nungester, 1998). For the current study, absolute target TIFs for each module were created. Absolute target TIF specifies the exact height of information across the latent trait scale. Because the content areas and TIFs determine the parallelism among tests (Lord, 1977; Samejima, 1977) in the item response theory (IRT) environment, the absolute target TIF is used when parallel tests are assembled. The target TIFs are obtained from the test unit information function within the test unit pool, while also considering the proportion of content areas. Figure 1 illustrates the target TIFs for easy, medium, and hard difficulty modules in the 1-3-3 MST.

Figure 1. Target test information functions of different difficulties for 1-3-3 MST panel construction.
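The module information that a target TIF is compared against is the sum of the test unit information functions. A minimal sketch of that computation under the GPCM follows. It assumes the common parameterization without the scaling constant D, and the function names and parameter values are hypothetical, not taken from the study.

```python
import math

def gpcm_probs(theta, a, b_steps):
    """Category probabilities for one test unit under the GPCM.

    b_steps holds the step parameters b_1..b_m; category 0 has no step.
    The scaling constant D is omitted (an assumption of this sketch).
    """
    # Cumulative sums of a * (theta - b_v) give the category logits.
    logits = [0.0]
    for b in b_steps:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def gpcm_information(theta, a, b_steps):
    """Unit information: a^2 times the variance of the category score."""
    probs = gpcm_probs(theta, a, b_steps)
    mean = sum(k * p for k, p in enumerate(probs))
    second_moment = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second_moment - mean * mean)

def module_information(theta, units):
    """Module TIF value at theta: the sum of unit information values."""
    return sum(gpcm_information(theta, a, b_steps) for a, b_steps in units)

# Hypothetical module: one dichotomous unit and one three-category unit.
example_units = [(1.0, [0.0]), (0.8, [-0.5, 0.5])]
tif_at_zero = module_information(0.0, example_units)
```

For a dichotomous test unit the expression reduces to the familiar a²P(1 − P), which offers a quick sanity check on the implementation.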
Initial MST construction
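The minimax method (van der Linden, 1987), an LP modeling technique, was used to model the construction of each MST module. The minimax method is expressed as

$$\text{minimize } y \tag{1}$$

subject to

$$\sum_{i=1}^{N} I_i(\theta_k)\,x_i - T(\theta_k) \le y, \quad k = 1, \ldots, K, \tag{2}$$

$$\sum_{i=1}^{N} I_i(\theta_k)\,x_i - T(\theta_k) \ge -y, \quad k = 1, \ldots, K, \tag{3}$$

$$\sum_{i=1}^{N} x_i = n, \tag{4}$$

and

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, N, \tag{5}$$

$$y \ge 0, \tag{6}$$

where y is the real-valued variable that specifies the maximum deviation from the TIF; N is the number of test units in the pool; x_i is the binary decision variable indicating whether test unit i is included in the module; I_i(θ_k) is the information of test unit i at the θ point θ_k; T(θ_k) is the target TIF value at θ_k; K is the number of θ points at which the target is evaluated; and n is the number of test units per module.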
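To make the minimax objective concrete, the sketch below picks the subset of test units whose summed information has the smallest maximum deviation from a target TIF. It is an illustration only: it replaces the LP solver with exhaustive enumeration, which is feasible only for toy pools, and the information values and target are hypothetical.

```python
from itertools import combinations

def minimax_assemble(info, target, n):
    """Select n units minimizing the largest |module TIF - target|.

    info[i][k] is the information of unit i at theta point k.
    Exhaustive search stands in for the LP solver in this sketch.
    """
    best_y, best_set = float("inf"), None
    for subset in combinations(range(len(info)), n):
        # y is the maximum absolute deviation over all theta points.
        y = max(
            abs(sum(info[i][k] for i in subset) - target[k])
            for k in range(len(target))
        )
        if y < best_y:
            best_y, best_set = y, subset
    return best_set, best_y

# Hypothetical information values at theta = -1, 0, 1 for a six-unit pool.
info = [
    [0.10, 0.30, 0.10],
    [0.20, 0.20, 0.05],
    [0.05, 0.25, 0.20],
    [0.30, 0.10, 0.02],
    [0.02, 0.10, 0.30],
    [0.15, 0.35, 0.15],
]
target = [0.30, 0.90, 0.45]
chosen, max_deviation = minimax_assemble(info, target, n=3)
```

In this toy pool, units 0, 2, and 5 reproduce the target at every θ point, so the search returns them with a deviation of essentially zero. A real pool of 424 test units requires an integer programming solver rather than enumeration.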
During the first stage of the MST, a medium-difficulty module was constructed to peak at the θ point of 0.0. During the second and third stages, easy, medium, and hard modules were constructed to peak at the −1.0, 0.0, and 1.0 θ points, respectively, to accomplish the 1-3-3 MST panel design. Content constraints according to Kingsbury and Zara’s (1989) content balancing procedure were added to Equations 1 to 6. Contents are categorical attributes that partition the test unit pool into subsets of test units according to content domain. In the current study, content constraints were translated into multiple requirements stating the number of test units belonging to each categorical attribute for the module. Let c denote a category, V_c the set of test units in category c, and n_c the number of test units from category c that need to be included in the test. The content constraints are then expressed as

$$\sum_{i \in V_c} x_i = n_c \quad \text{for each category } c.$$
In practice, a few quantitative constraints must be considered for test construction. The response times for panels should be comparable, so that no panel takes dramatically longer to finish than the others. In addition, constraints on the items’ word counts are specified to achieve fairness in terms of the number of words examinees need to process. Because decision variables are either 0 (not included) or 1 (included), the weighted sum of an attribute equals the total amount of that attribute in the test form, which can be bounded using predefined values. For example, to bound the response time of the module at 20 min, the following constraint is added:

$$\sum_{i=1}^{N} t_i x_i \le 20,$$

where t_i denotes the response time for test unit i in minutes.
Two methods were incorporated to mitigate the greedy effect of sequential construction: (a) the sequence of module construction and (b) the stopping criterion for the LP solver. First, it is common for an item pool to have a larger number of medium-difficulty items than easy or hard items. Indeed, this was the case for the test unit pool under study. Therefore, under the sequential construction method, easy and hard modules were constructed before medium-difficulty modules, which are less challenging to assemble. In addition, skewness in the pool information indicates which of the easy or hard modules is the more challenging to assemble. For example, a negatively skewed item pool may indicate that there are more items at θ points above zero in the pool, providing higher measurement accuracy for examinees with above-average abilities. The test unit pool in the current study was, in fact, negatively skewed. Thus, easy modules were constructed before hard modules to alleviate the greedy effect of the sequential method.
Second, the LP solver used by the authors (Park et al., 2011) provides an option to terminate the search as soon as an integer solution is found that is acceptable in terms of a user-specified criterion. A strict criterion ensures a solution close to optimal, while a less strict criterion still produces acceptable solutions from the reduced search space. By specifying less strict criteria for the modules constructed in the early stage of sequential construction, the greedy effect was mitigated.
MST reassemblies
After the initial MST construction, the test unit pool was divided into two test unit sets: Vt, which is the set of test units utilized for the initial MST construction, and Vu, which is the set of test units that remained unused after the initial MST construction. The proportion of test units to be replaced from the initial construction is specified by the TRR, which was programmed into the LP model. The TRR is the design factor that determines how many test units are chosen from Vu and Vt, respectively, for the subsequent MST reassemblies. A large TRR replaces a large portion of the initial MST design, whereas a small TRR replaces a small portion of the initial MST.
In terms of the LP modeling of MST module reassembly, the following two constraints based on sets Vu and Vt were added to the minimax method, replacing Equation 4:

$$\sum_{i \in V_u} x_i = n_1$$

and

$$\sum_{i \in V_t} x_i = n_2,$$

where n1 is the number of test units that need to be replaced; n2 is the number of test units that should be retained in the MST design; and n1+n2 is the number of test units per module. Therefore, TRR is expressed as

$$\mathrm{TRR} = \frac{n_1}{n_1 + n_2}.$$

To prevent a test unit from appearing in more than one module, the n1+n2 test units selected for the current module are removed from Vu and Vt, respectively, before the next module is constructed.
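The split of a module into replaced and retained test units can be sketched as follows. With nine test units per module, TRRs of 0.22, 0.44, and 0.66 correspond to n1 = 2, 4, and 6 when rounded to whole units; this rounding is an assumption of the sketch. The toy reassembly again uses exhaustive search in place of the LP solver, and all data values are hypothetical.

```python
from itertools import combinations

def split_counts(trr, module_length):
    """Return (n1, n2): units drawn from the unused and used sets."""
    n1 = round(trr * module_length)  # rounding rule is an assumption
    return n1, module_length - n1

def reassemble_module(info, target, used, unused, trr, module_length):
    """Pick n1 units from `unused` and n2 from `used`, minimizing the
    maximum deviation from the target TIF (brute force, toy pools only)."""
    n1, n2 = split_counts(trr, module_length)
    best_y, best_set = float("inf"), None
    for new_units in combinations(sorted(unused), n1):
        for kept_units in combinations(sorted(used), n2):
            subset = new_units + kept_units
            y = max(
                abs(sum(info[i][k] for i in subset) - target[k])
                for k in range(len(target))
            )
            if y < best_y:
                best_y, best_set = y, subset
    return best_set, best_y

# Hypothetical six-unit pool with information at theta = -1, 0, 1.
info = [
    [0.10, 0.30, 0.10], [0.20, 0.20, 0.05], [0.05, 0.25, 0.20],
    [0.30, 0.10, 0.02], [0.02, 0.10, 0.30], [0.15, 0.35, 0.15],
]
used, unused = {0, 1, 2}, {3, 4, 5}
module, deviation = reassemble_module(
    info, [0.30, 0.90, 0.45], used, unused, trr=1 / 3, module_length=3
)
```

Because the counts n1 and n2 are fixed by equality constraints, every reassembled module contains exactly the intended share of previously unused test units, regardless of how attractive the retained units are.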
Test unit pool utilization after the first MST reassembly
The increase in test unit utilization for the first MST reassembly is derived analytically. Let Ui be the test unit utilization ratio of the initial MST. Because the test unit utilization increases by Ui multiplied by TRR after the first MST reassembly, the accumulated test unit utilization for the initial construction and the first MST reassembly, UR1, is

$$U_{R1} = U_i + U_i \times \mathrm{TRR} = U_i\,(1 + \mathrm{TRR}).$$
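As a worked check of this relationship, the sketch below computes the accumulated utilization after the first reassembly. It assumes that the initial construction uses 3 panels × 7 modules × 9 test units = 189 nonoverlapping test units from the pool of 424, and that the reported TRRs of 0.22, 0.44, and 0.66 stand for 2/9, 4/9, and 6/9; both are inferences from the design description rather than statements in the text.

```python
from fractions import Fraction

POOL_SIZE = 424
INITIAL_USED = 3 * 7 * 9  # panels x modules x units per module = 189 (assumed)

def utilization_after_first_reassembly(trr):
    """U_R1 = U_i * (1 + TRR), with U_i the initial utilization ratio."""
    u_initial = Fraction(INITIAL_USED, POOL_SIZE)
    return u_initial * (1 + trr)

for label, trr in [("0.22", Fraction(2, 9)),
                   ("0.44", Fraction(4, 9)),
                   ("0.66", Fraction(6, 9))]:
    u_r1 = utilization_after_first_reassembly(trr)
    print(f"TRR {label}: {float(u_r1):.2%}")
# prints:
# TRR 0.22: 54.48%
# TRR 0.44: 64.39%
# TRR 0.66: 74.29%
```

Under these assumptions the values for TRRs of 0.22 and 0.66 agree with the accumulated utilization rates of 54.48% and 74.29% reported in the Results section.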
Successive MST reassemblies
The MST reassembly method can be repeated to further increase pool utilization. For example, based on the first MST reassembly of the initial MST construction, the second MST reassembly can be constructed. Furthermore, the third MST reassembly can be performed based on the previous constructions, and so on. However, one cannot obtain a linear increase of overall pool utilization over several reassemblies because test units of statistical characteristics satisfying requirements are likely to be utilized more than once over the successive reassemblies.
During the first MST reassembly, a portion of the utilized test units is replaced with test units from the unutilized set. Therefore, there are four types of test units at the end of the first MST reassembly: (1) test units used for the initial MST construction that remained in use; (2) test units used for the initial MST construction that were put back into the unused set; (3) test units not used for the initial MST construction that were used for the first reassembly; and (4) test units that were never used. For simplicity, however, successive reassemblies only remember the test unit utilization status from the previous reassembly step. In other words, for the reassembly that follows the first MST reassembly, Types 1 and 3 form the used test unit set, and Types 2 and 4 form the unused test unit set. In this way, simple and consistent LP modeling for the successive reassemblies is achieved. The assumption is that successive reassemblies would draw test units from the unused test unit set, incrementally increasing the overall pool utilization. A simulation study was conducted to demonstrate that this assumption is valid.
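The bookkeeping described here can be sketched with plain sets. Only the status from the most recent step is carried forward, while accumulated utilization is tracked separately as the union of everything ever selected. The pool and the selections below are hypothetical toy data.

```python
def update_status(pool, selected):
    """Used/unused sets after a step, based only on that step's selection."""
    used = set(selected)
    unused = set(pool) - used
    return used, unused

pool = set(range(10))  # toy pool of ten test units
ever_used = set()      # accumulated utilization across all steps
accumulated = []

# Hypothetical selections: initial construction, then two reassemblies.
selections = [{0, 1, 2, 3}, {0, 1, 4, 5}, {0, 4, 5, 6}]
for selected in selections:
    used, unused = update_status(pool, selected)
    ever_used |= used
    accumulated.append(len(ever_used) / len(pool))

print(accumulated)  # [0.4, 0.6, 0.7]
```

Units 2 and 3 return to the unused set after the first reassembly and are therefore eligible as "new" units later. This is why accumulated utilization cannot grow linearly: a later reassembly may draw previously used units from the unused set.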
MST Simulation
Successful MST reassemblies can be determined using the following criteria: (a) the statistical characteristics of MST modules, (b) the accuracy of classification, and (c) the increase in overall pool utilization. Statistical characteristics of MST modules were evaluated through root mean squared errors (RMSEs), which measure how well the constructed tests fit the target TIFs. However, although RMSE is a readily available indicator of the quality of constructed MST designs, the ultimate criteria for the performance of any practical test design should be drawn from actual test administrations. An acceptable alternative to expensive and time-consuming actual administration is computer simulation. To this end, a SAS program was written to perform MST simulation in the context of a classification test.
Each of the 1,000 normally distributed simulees was randomly assigned to one of three panels. After completing each stage, simulees were routed to the next-stage module using the approximate maximum information (AMI; Luecht, Brumfield, & Breithaupt, 2006) method, based on the ability estimated using maximum likelihood estimation (MLE) with a variable step size (Koch & Dodd, 1989).
Data Analyses
Four decision-making criteria were averaged across 100 replications: (a) correct classification rate (CCR), (b) false positive error rate (FPER), (c) false negative error rate (FNER), and (d) total error rate (TER). In terms of pool utilization, the increase in pool utilization and the accumulated pool utilization rate were calculated for each MST reassembly. Finally, the exposure of test units after the third successive MST reassembly was investigated. These data indicate patterns of exposure of test units over successive MST reassemblies.
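The four classification criteria can be computed as in the sketch below. It assumes that a simulee passes when θ is at or above the cutoff and that each rate is a proportion of all simulees; the function name and the toy data are hypothetical.

```python
from statistics import NormalDist

def classification_rates(true_theta, est_theta, cutoff):
    """CCR, FPER, FNER, and TER for one replication.

    Assumes pass means theta >= cutoff and that each rate is a
    proportion of all simulees.
    """
    n = len(true_theta)
    pairs = list(zip(true_theta, est_theta))
    # False positive: truly failing but classified as passing.
    fper = sum(1 for t, e in pairs if t < cutoff <= e) / n
    # False negative: truly passing but classified as failing.
    fner = sum(1 for t, e in pairs if e < cutoff <= t) / n
    ter = fper + fner
    return {"CCR": 1 - ter, "FPER": fper, "FNER": fner, "TER": ter}

# Cutoffs at the 30th, 50th, and 70th percentiles of a standard normal.
cutoffs = [NormalDist().inv_cdf(p) for p in (0.30, 0.50, 0.70)]

# Toy data: four simulees, cutoff at theta = 0.
rates = classification_rates(
    true_theta=[-1.0, -0.2, 0.1, 0.8],
    est_theta=[-0.9, 0.1, -0.1, 0.7],
    cutoff=0.0,
)
```

In the toy data one simulee is a false positive and one is a false negative, so FPER and FNER are each 0.25 and CCR is 0.5. In a study such as this one, the rates would then be averaged over replications.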
Results
MST Assembly
Table 1 displays the RMSEs and pool utilizations for the given TRRs, cutoff scores, and number of MST reassemblies. A smaller RMSE indicates that the constructed modules are well matched to the target TIFs, maintaining the necessary statistical characteristics in terms of TIF. Although the RMSE remained small across the successive MST reassemblies, a larger TRR tended to increase the RMSE. However, the differences remained negligible. In terms of pool utilization, Table 1 shows the increased and accumulated pool utilizations for each design condition. For example, from the initial utilization rate of 44.57%, the pool utilization rate increased to 54.48% and 60.14% for the first and second MST reassemblies, respectively, at a TRR of 0.22. Although every successive MST reassembly increased the accumulated pool utilization rate, the size of the increase diminished as the reassembly was repeated. For example, at the TRR of 0.66, the increases in utilization rate were 29.72%, 3.77%, and 3.30% for the first, second, and third reassemblies, respectively.
Table 1. Comparisons for the RMSE and Pool Utilization Rates.
Note. TRR = test unit replacement ratio; RMSE = root mean squared error; NA = not applicable.
Small TRRs (e.g., 0.22), which replace only a small share of the test units within MST designs, produced small increases in pool utilization: 9.91%, 5.66%, and 4.24% for the first, second, and third reassemblies, respectively. However, although a large TRR is expected to bring a large increase in pool utilization, the pool utilization did not increase in proportion to the TRR. For example, at the TRR of 0.66, the largest increase in utilization rate, 29.72%, occurred only at the first MST reassembly; the increases remained small for the remaining reassemblies. In fact, for the second and third MST reassemblies, the increases were similar to those with TRRs of 0.22 and 0.44. This is because after the first reassembly, a large portion (74.29%) of the test unit pool had already been utilized, leaving little room for the pool utilization to grow. In other words, a large portion of the test units utilized in previous constructions ended up being included in the current construction. Figure 2 illustrates pool utilization growth across MST reassemblies.

Figure 2. Accumulated pool utilization ratio across MST reassemblies for TRR = 0.22, 0.44, and 0.66.
Table 2 displays the number of test units corresponding to each accumulated utilization frequency after the third MST reassembly. For example, at the TRR of 0.22, 120 test units were utilized in every MST construction/reassembly, whereas only 37 test units were utilized twice by the end of the third MST reassembly. Each TRR produced different results in terms of unutilized test units. Compared with 151 unutilized test units for the TRR of 0.22, only 79 test units were not utilized when the constructions/reassemblies were performed with the TRR of 0.66. In terms of the number of test units utilized in every MST construction/reassembly, only 46 test units fell into this category at the TRR of 0.66, compared with 120 test units at the TRR of 0.22. Three observations can be made. First, with small TRRs, a large number of test units were either never utilized or exposed in every MST design. Second, with the large TRR of 0.66, test units were concentrated at the accumulated utilization frequency of two, meaning that they were utilized in half of the constructions/reassemblies, while the numbers of test units that were never utilized or utilized in every design were reduced. Finally, with the medium TRR of 0.44, the frequencies of test unit exposure were more evenly distributed because this design achieved a balance between high overall pool utilization and the flexibility of choosing test units from the unutilized test unit pool.
Table 2. Comparing the Number of Test Units for Each Accumulated Utilization Frequency of the Third Reassembly.
Note. The total number of test units in the pool is 424. TRR = test unit replacement ratio.
MST Simulation
Table 3 displays the results of classification accuracy from the MST simulation. For instance, 92.02% of simulees were correctly classified from the simulation during the first MST reassembly, with a TRR of 0.22 and a 30% passing rate. In all conditions, CCRs exceeded 90%. In addition, in all conditions, mean FPERs and mean FNERs were less than 5%.
Table 3. Comparing the Classification Error and Accuracy Rates With 27 Test Units.
Note. All statistics were averaged across 100 replications. Each replication contained 1,000 observations. TRR = test unit replacement ratio; CCR = correct classification rate; FNER = false negative error rate; FPER = false positive error rate; TER = total error rate; NA = not applicable.
Conclusion and Discussion
The goal of this study was to apply LP modeling to MST construction to enhance test unit utilization. To this end, new linear equality constraints specifying the TRR were introduced. Based on this modeling, this study demonstrated that successive MST reassemblies could be performed to create multiple MST forms that shared similar statistical and non-statistical characteristics. By programming the TRR, test designers can control pool utilization and test unit exposure among MST designs. For instance, a large TRR results in high pool utilization while limiting the exposure of test units across MST designs, because a large TRR takes advantage of the flexibility of selecting test units when enough good test units remain in the unutilized pool. In particular, the largest gain in pool utilization occurred during the first MST reassembly with a large TRR. In contrast, a small TRR results in small gains in pool utilization because it limits the number of test units replaced through successive MST reassemblies.
The measurement accuracy of reassembled test forms depends on the following: (a) the quality and quantity of unused test units, (b) the statistical requirements imposed by the target TIFs, (c) non-statistical constraints, (d) the purpose of the test (i.e., norm-referenced or criterion-referenced test), (e) test length, and (f) MST structure factors (i.e., the number of modules and stages). The quantity of unused test units seems to be the most important factor to consider before the reassembly procedure is contemplated. For example, if the initial assembly used 378 of the 424 test units, it has used almost 90% of the test units in the pool and produced six nonoverlapping panels (i.e., 9 test units per module, seven modules per panel; 9 × 7 × 6 = 378). This does not leave much room to improve the overall pool utilization.
The quality of test units encompasses statistical and non-statistical criteria. For example, if a large proportion of unused test units is confined to a few specific content domains, the reassemblies might not be able to find test units suitable for replacement. Likewise, if the statistical characteristics of unused test units are poor, successive reassembly might produce MST designs with poor measurement precision. The CCR might be low for criterion-referenced tests, or the ability estimates might be inaccurate in norm-referenced tests. However, there are three reasons why this is unlikely. First, MST construction is accompanied by various non-statistical requirements that prevent the initial assembly from exhausting all of the good test units. Therefore, it is highly unlikely that the unused test unit pool is left only with test units whose discriminating power is so low that they deteriorate measurement precision. Second, test units that were not replaced in MST reassemblies might have high discriminating power, so that they mitigate the potential loss in performance of the reassembled test. Third, even if the unused test units possess poor discriminating power, the number of test units replaced might not be large enough to make a practical difference in terms of measurement precision, while still strengthening test security and enhancing pool utilization. Particularly with small TRRs (e.g., .22 and .44), changes in measurement precision might not be noticeable because only 22% and 44% of test units are replaced for each MST reassembly. Table 1 and Table 3 in the current study support these points: there was no noticeable difference in terms of the RMSE of constructed module information or classification accuracy. This result might have occurred because (a) the quality of unused test units was as good as that of the used ones, (b) the quality of the test units that remained throughout successive reassemblies was a determining factor for the overall measurement accuracy, and (c) the quantity of test units replaced was not a significant drawback for test accuracy, particularly for TRR = 0.22 and TRR = 0.44.
MST simulation demonstrated that MST designs created through the proposed method performed well in the context of a classification test. All MST constructions/reassemblies with the 27-unit test length produced satisfactory results. The average TERs were kept below 10% in all conditions, although they were highest when the passing rate was 50%. This pattern corresponds to the results of previous studies (e.g., Jodoin, 2003; Zenisky, 2004): the cutoff score for the 50% passing rate lies at the θ point where most examinees cluster in a normal distribution, unlike the cutoffs for the other passing rates (i.e., 30% and 70%). Thus, CCRs were lowest for the 50% passing rate conditions.
MST provides greater quality control when constructing test forms. Using this main advantage, the current study investigated a novel method to increase the pool utilization when constructing the MST. The results of this study demonstrated that various MST reconstructing conditions performed well by increasing the pool utilization while maintaining the desired MST construction and testing accuracy. These results will be useful for testing programs in terms of economically and efficiently maintaining test unit pools and constructing MSTs.
The current study measured the accuracy of a criterion-referenced test. Therefore, the performance comparisons of tests from MST assemblies are limited to the extent to which the accuracy of the criterion-referenced test changes due to the reassemblies. Although the study provides a reference for applications with a binary decision (i.e., pass or fail), further investigation should examine the impact of MST reassemblies on ability estimates. In addition, separate simulations were performed on each MST construction. Future research could extend this and investigate accuracy and exposure rates when a simulation is performed for all MST designs produced through successive MST reassemblies. Furthermore, studies of MST designs other than 1-3-3 and of other LP models should also advance our understanding of the proposed scheme. Finally, empirical evidence of the feasibility of this method when newly developed test units are added to the test unit pool should be provided to further strengthen the usefulness of this technique in real environments, where test unit pools are updated continually as test units are added or removed.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
