Abstract
Multistage testing (MST) has many practical advantages over typical item-level computerized adaptive testing (CAT), but there is a substantial tradeoff when using MST because of its reduced level of adaptability. In typical MST, the first stage almost always serves as a routing stage in which all test takers see a linear test form. If multiple test sections measure different but moderately or highly correlated traits, then a score estimate for one section might be used to adaptively select item modules for the following sections without having to administer a routing stage for each section. In this article, a new framework for developing MST with intersectional routing (ISR) was proposed and evaluated under several research conditions with different MST structures, section score distributions and relationships, and types of regression models for ISR. The overall findings of the study suggested that the MST with ISR approach could improve measurement efficiency and test optimality, especially for short tests.
Multistage testing (MST) is a special type of computerized adaptive testing (CAT), in which the adaptiveness of test form construction happens at the subsection level instead of at the individual item level. In MST, each subsection of a test is called a “stage,” and each test form/administration consists of multiple stages as the name MST implies (Yan, Lewis, & von Davier, 2014a). Although MST’s measurement efficiency is often not as good as the item-level measurement efficiency of CAT, MST nonetheless is becoming a popular test design in the educational measurement field because of distinct advantages it offers over CAT. For example, MST often offers test developers more control over test construction in terms of possible test forms and content combinations and gives test takers the opportunity to move back and forth among questions within each test stage (Jodoin, Zenisky, & Hambleton, 2006; van der Linden & Glas, 2010; Yan, Lewis, & von Davier, 2014a).
In MST, the set of test items to be administered in each stage is called a “module.” Although some new approaches to MST assemble modules for each stage “on-the-fly” (Han & Guo, 2014; Zheng, Wang, Culbertson, & Chang, 2014), in actual applications of MST, it is still most common to have multiple modules preassembled for each stage. Test takers are then routed to one of the modules in each stage based on their performance in the previous stages. Depending on the purpose of the test, MST employs two different strategies for routing modules (Zenisky & Hambleton, 2014). The first strategy is based on norm-referenced routing (Armstrong, Jones, Koppel, & Pashley, 2004; Zenisky, 2004). In norm-referenced routing, a test taker’s expected percentile based on performance in previous stages is compared against the routing cut score(s) from a prior distribution. The other strategy is criterion-based routing. In common practice, criterion-based routing is employed to select the module that has the most relevant difficulty level based on the test taker’s performance in previous stages or that is expected to result in minimal measurement error.
Criterion-based routing can be accomplished using the item response theory (IRT) framework with information-based targets (Luecht & Nungester, 1998; Weissman, 2014; Zenisky, 2004), under the number-correct scoring of classical test theory (CTT; Weissman, Belov, & Armstrong, 2007), or under a nonparametric adaptive approach such as the tree-based MST (Yan, Lewis, & von Davier, 2014b). IRT information-based routing is the theoretical and practical equivalent of the maximum Fisher information method for item selection in item-level CAT and is arguably one of the most widely used MST routing approaches. The number-correct-based routing method is also popular because the routing rule is easier to explain and communicate to test takers. Routing cut scores for number-correct-based routing are still usually derived from the IRT information-based approach. The differences between number-correct routing and IRT information-based routing tend to diminish as the number of items in each stage increases.
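One way a number-correct cut score might be derived from an IRT-based cut is to map the θ cut onto the expected number-correct score through the test characteristic curve. The sketch below illustrates that mapping only; it is an assumption about the derivation, not a procedure stated in this article. The seven-item routing module, the logistic metric without the 1.7 scaling constant, and the function names are all hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (logistic metric, D = 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def expected_number_correct(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

def number_correct_cut(theta_cut, items):
    """Smallest integer score that is at or above the expected score at the theta cut."""
    return math.ceil(expected_number_correct(theta_cut, items))

# Hypothetical routing module: seven items given as (a, b, c) with a = 1 and c = 0.2.
routing_items = [(1.0, b, 0.2) for b in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]

cut = number_correct_cut(0.0, routing_items)
print(f"Route upward when the number-correct score is at least {cut}")
```

With a rule of this kind, a test taker can be told the routing threshold as a raw score even though the threshold itself comes from the IRT scale.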
In MST applications with preassembled modules with IRT information-based routing, a test taker is always administered a fixed set of test items in the first stage, known as the “routing” stage. Once the routing stage administration is complete, the test taker is routed, in the second stage, to a test module that is expected to result in the maximized Fisher information given the latent trait score, θ, which is estimated based on the items and responses from the routing stage.
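The routing rule described above can be sketched in a few lines: compute each candidate module's information at the interim θ estimate and choose the module with the largest value. This is a minimal illustration, assuming the 3PL model on the logistic metric (no 1.7 scaling constant). The three five-item modules and all function names are hypothetical and are not taken from any operational panel.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (logistic metric, D = 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a single 3PL item at theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def module_information(theta, items):
    """Module information function: sum of the item information values."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def route(theta_hat, modules):
    """Return the name of the module with maximum information at theta_hat."""
    return max(modules, key=lambda name: module_information(theta_hat, modules[name]))

# Hypothetical second-stage modules; each item is (a, b, c).
modules = {
    "easy": [(1.0, b, 0.2) for b in (-1.5, -1.2, -1.0, -0.8, -0.5)],
    "medium": [(1.0, b, 0.2) for b in (-0.5, -0.2, 0.0, 0.2, 0.5)],
    "hard": [(1.0, b, 0.2) for b in (0.5, 0.8, 1.0, 1.2, 1.5)],
}

for theta_hat in (-1.0, 0.3, 2.0):
    print(theta_hat, route(theta_hat, modules))
```

In practice the comparison is usually precomputed as routing cut scores, the θ values at which adjacent module information functions intersect, so that no information values need to be evaluated during test delivery.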
Numerous research studies have evaluated the performance and behavior of MST in a variety of settings, especially regarding the number of stages and the number of modules per stage (Armstrong et al., 2004; Lee & Han, 2008; Patsula & Hambleton, 1999; Wang, Fluegge, & Luecht, 2012; Yan et al., 2014a; Zenisky & Hambleton, 2014). In theory, the more stages involved in an MST administration, the more adaptive it becomes, resulting in better measurement performance, all things being equal. The greatest number of stages possible in MST would be equal to the number of test items (for fixed-length tests). Although increasing the number of stages in MST tends to improve measurement efficiency due to increased adaptiveness, it undermines the benefits of MST over CAT. For example, if an MST-based exam with 30 items changes from two stages (with 15 items per stage) to 15 stages (with two items per stage), a test taker’s ability to move back and forth among items within each stage would be less meaningful. In addition, the test developer’s process for reviewing and controlling all possible test forms from 15 different stages becomes exponentially more complicated as the number of stages increases. Therefore, determining the best structure and design of MST involves more than simply satisfying a single psychometric aspect (e.g., measurement efficiency). Rather, one needs to find and strike the right balance among measurement efficiency, developmental complication, test-taking experience, user communication, and other factors to best serve the measurement needs and purposes of the MST program (Zenisky & Hambleton, 2014).
Determining the number of modules per stage in MST should follow the same logic. In theory, having more modules available per stage usually helps the overall performance of MST because it increases the likelihood that some module reaches its maximum Fisher information near the interim θ estimate from the previous stage. Developing a large number (e.g., more than five) of preassembled modules per stage to cover all possible interim θ areas with a higher resolution is not very cost-effective (in terms of overall item utilization), however. It would unnecessarily complicate the MST design and implementation unless the modules were assembled on the fly (Han & Guo, 2014; Zheng et al., 2014). There is no one-size-fits-all solution when it comes to MST design, but, in general, as previous studies suggest, having a maximum of about four modules per stage constitutes reasonable practice (Armstrong et al., 2004; Yan et al., 2014a).
MST With Intersectional Routing (ISR) for Short-Length Tests
In theory, the relative improvement of measurement efficiency in adaptive tests over linear tests is most apparent when the test length is short; in practice, however, the development of an efficient short-length test based on MST poses notable challenges. Test scenarios involving short test lengths (for example, fewer than 20 items per scale) limit test developers’ flexibility in designing MST. The number of stages in this example most likely would not exceed three. In fact, MST with two stages is a common design, even for longer test programs such as the Graduate Record Examination (GRE) revised General Test and the Comprehensive Testing Program 4 (CTP4) for K-12 students (Robin, Steffen, & Liang, 2014; Wentzel, Mills, & Meara, 2014). With only two stages, assuming the number of items per stage is the same across stages, a test based on MST essentially becomes half adaptive because, in typical MST designs, the items in the first (routing) stage are always fixed (nonadaptively selected and administered).
If the routing stage (with a single module for everyone) were skipped and the test module were adaptively selected from the first stage of MST, then the test’s level of adaptiveness would be greatly improved (for example, from 50% to 100% in the case of a two-stage MST), and the test’s measurement efficiency would increase. The question then becomes “On what basis can we start routing when no item of the test section has been administered to a test taker?”
Some ideas that have been discussed for item selection in CAT at the beginning of a test include using available collateral information, such as scores from previous test administrations, scores from other test programs measuring similar skills, usage data from learning systems, and response time data (Thompson & Weiss, 2011; van der Linden & Pashley, 2010). Technically, it is not unreasonable to consider using such collateral information for initial routing in MST to eliminate a linear routing stage, and it might indeed be a sound solution for low-stakes exams such as formative assessments. For test programs with high-stakes consequences, however, using collateral information for initial routing in MST may be impossible and/or inappropriate because (a) the availability and/or quality of collateral information is often not equal across all test takers and (b) test routing influenced by information external to the test can be legally and politically challenged in terms of test fairness. The next question thus becomes “How do we identify collateral information that is (a) available for all test takers, (b) generated from within the test, and (c) appropriate for initial routing in MST?”
Test programs that measure cognitive skills commonly include several test sections that measure different skills and traits such as quantitative and verbal skills. More often than not, those cognitive skills and traits are closely related, so one usually finds moderate to high correlation among the test section scores reflecting each trait. For example, it is typical to observe a correlation coefficient of .50 to .70 between math and verbal section scores in many educational test programs. Thus, if there is information about a test taker’s score for one trait and about the relationship between the traits, then his or her score on another trait can be predicted using predictive modeling methods such as regression. In MST practice, therefore, if a test taker’s score is known from one section, it would be interesting to see whether it makes any meaningful difference to start the following section not with a routing stage but, instead, with adaptive routing based on the predicted section score. This approach hereafter will be referred to as ISR and is illustrated in Figure 1 (upper).

Figure 1. Illustration of MST with ISR.
The basic logic behind ISR for MST is in some ways equivalent to multidimensional adaptive testing (MAT) for item-level CAT. In MAT (with Bayesian-based θ estimators), unless there is zero correlation among measured traits, a known covariance matrix can be used as a prior for estimating θ on one dimension based on another dimension even when no item measuring the dimension being estimated has been administered. There are several key differences, however, between MST with ISR and MAT. First, MAT involves the use of multidimensional IRT (MIRT) models, whereas, in MST with ISR, the unidimensional structure for each test section is retained. This is a key practical advantage of ISR: it can be implemented easily, with little modification, for test programs using existing items/banks/pools/modules that are already based on unidimensional IRT models. More important, in MAT, the adaptive selection and the θ estimation (both interim and final) are accomplished using a MIRT framework. Previous studies have found that MAT tends to result in considerable estimation bias, especially when tests were short (e.g., 10 and 30 items) and/or were based on a Bayesian procedure in which the prior density function played a critical role (Segall, 1996; van der Linden, 1999). In MST with ISR, however, the collateral information from the previous test section is used only for routing at the first stage of the following section and is not used for estimating interim and final θ for individual test takers. The complete separation of θ estimation for each section is critically important in actual applications, especially in high-stakes testing. Allowing a prior density function and a trait score from another dimension to have such a large influence on a different dimension’s θ estimate for each individual, as in MAT, is often not justifiable in terms of test fairness. Whether MST with ISR really is free of estimation bias is investigated later in this study.
The actual implementation of MST with ISR is not as straightforward as would be suggested by simply applying a regression model to predict an initial routing score based on a score from the previous section. In practice, one typically has well-established knowledge about the relationship between trait scores from different test sections based on long-term data observation of existing test forms and programs. A regression model based on such data, however, does not necessarily present an accurate reflection of the actual relationship of MST section scores that is to be applied with ISR, especially when test length is very short. For example, if the known latent trait score distributions for Traits X and Y follow a multivariate normal with the latent correlation of .7, the regression model to predict y on x would be $\hat{y} = \mu_Y + .7\,(\sigma_Y/\sigma_X)(x - \mu_X)$, which reduces to $\hat{y} = .7x$ when both traits are standardized.
1. Identify (almost) true latent distributions and the relationship of two traits (X and Y) based on large sample administrations of longer versions of the test and/or several different test forms.
2. Conduct a simulation study for the actual test section (Section X) for Trait X using the true latent distribution identified from Step 1.
3. Conduct a simulation study for the actual test section (Section Y) for Trait Y using the true latent distribution identified from Step 1. In the first stage of Section Y, the routing is based on the true y value.
4. Compute a regression model,
5. Use the regression model,

Figure 2. Example of difference in correlational relationship for latent and observed data.
Because different test forms will yield different observed data in Steps 2 and 3 when there are multiple test panels, this framework should be implemented for each MST test panel and the regression model for ISR should be integrated into each panel. Also, it is advised to have a large enough sample (e.g., 1,000 or more) in Steps 2 and 3 to result in a stable regression model in Step 4.
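The logic of the framework can be sketched with a small simulation. This is only an illustration under strong simplifying assumptions: the form-specific MST simulations of Steps 2 and 3 are replaced by adding normal error to the true scores, both traits are treated as standard normal, the regression in Step 4 is assumed to be an ordinary least squares fit of the observed Section Y estimates on the observed Section X estimates, and the error standard deviation of 0.6 is a made-up value. All function names are hypothetical.

```python
import random

def simulate_latent(n, rho, rng):
    """Step 1 stand-in: draw (x, y) from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def observe(value, sem, rng):
    """Steps 2 and 3 stand-in: a theta estimate is the true score plus normal error."""
    return value + rng.gauss(0.0, sem)

def fit_line(xs, ys):
    """Step 4: ordinary least squares for ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def build_isr_model(n=20000, rho=0.7, sem_x=0.6, sem_y=0.6, seed=1):
    """Run the simplified Steps 1 to 4 and return (intercept, slope)."""
    rng = random.Random(seed)
    latent = simulate_latent(n, rho, rng)
    x_hat = [observe(x, sem_x, rng) for x, _ in latent]
    y_hat = [observe(y, sem_y, rng) for _, y in latent]
    return fit_line(x_hat, y_hat)

def isr_score(theta_x_hat, intercept, slope):
    """Step 5: predicted Section Y score used only for first-stage routing."""
    return intercept + slope * theta_x_hat

intercept, slope = build_isr_model()
print(f"ISR model: y_hat = {intercept:.3f} + {slope:.3f} * x_hat")
```

Because the predictor contains measurement error, the fitted slope falls below the latent value of .7, which is why a regression built on latent relationships would not describe the scores that are actually available at routing time.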
This article presents a series of simulation studies designed to evaluate the behavior and effectiveness of applying ISR in MST with a short test length using the suggested framework. Studies 1 and 2 focus on evaluating the performance and behavior of MST with ISR when there are two and three modules to route to, respectively, and Study 3 looks at ISR cases that use other regression models, including polynomial regression and multiple regression. An actual implementation of the proposed framework for developing short-length MST with ISR is also introduced, followed by a comprehensive discussion of the implications of MST with ISR in relation to other MST approaches and designs.
Study 1: MST With 2-3 Structure
MST Design and Data
For Study 1, the MST design illustrated in Figure 1 was developed and implemented. Stages 1 and 2 each included seven items, each one based on the three-parameter logistic (3PL) model with an a-parameter value of 1 and a c-parameter value of 0.2. The items differed only in the b-parameter values. The shape of the module information function (MIF) was controlled to be the same across modules within each stage (middle of Figure 1) to minimize the possible impact of extraneous factors on the study. For the initial routing with ISR when the first stage began, the θ estimate from the previous section—measuring Trait X (
For simulation, the study generated 100,000 simulees to have two latent trait scores,
Results and Discussion
The actual observed correlation coefficients of
Table 1. Regression Models for ISR in Study 1.
Note. ISR = intersectional routing.
Table 2 displays the percentage of observed MST paths under each condition. Under ISR Condition 0, which was the baseline for the expected upper bound of ISR with perfect routing at the first stage, 95.1% of cases had optimal paths. The remaining 4.9% of cases showed less than optimal paths, where the choice of module for the first stage was not adjacent in module difficulty to the choice of module at the second stage. This indicated the impact of the measurement error from the seven first-stage items on the routing for the second stage even when the routing at the first stage was perfect. Using this as a baseline made the interpretation of the results from ISR Conditions 1 to 3 more meaningful and practical. As the observed correlation of
Table 2. Percentage of Observed MST Paths From Study 1.
Note. MST = multistage testing; ISR = intersectional routing.
Paths considered as less than optimal.
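The path tabulation behind a table of this kind can be sketched as follows. The sketch assumes the 2-3 structure of Study 1 and treats a path as optimal when the second-stage module is adjacent in difficulty to the first-stage module. The module labels, the adjacency mapping, and the sample paths are hypothetical illustrations, not data from the study.

```python
from collections import Counter

# Hypothetical labels: two first-stage modules and three second-stage modules.
# A first-stage module is assumed to be adjacent to the two second-stage
# modules that bracket it in difficulty.
ADJACENT = {
    "1-Easy": {"2-Easy", "2-Medium"},
    "1-Hard": {"2-Medium", "2-Hard"},
}

def path_percentages(paths):
    """Percentage of test takers observed on each (stage 1, stage 2) path."""
    counts = Counter(paths)
    total = sum(counts.values())
    return {path: 100.0 * n / total for path, n in counts.items()}

def percent_optimal(paths):
    """Percentage of paths whose second-stage module is adjacent to the first."""
    optimal = sum(1 for first, second in paths if second in ADJACENT[first])
    return 100.0 * optimal / len(paths)

# Made-up paths for four simulees.
paths = [
    ("1-Easy", "2-Easy"),
    ("1-Easy", "2-Medium"),
    ("1-Hard", "2-Hard"),
    ("1-Easy", "2-Hard"),  # less than optimal: skips over the medium module
]
print(path_percentages(paths))
print(percent_optimal(paths))
```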
The measurement efficiency and score estimation accuracy across the studied conditions were evaluated using mean absolute error (MAE) and bias statistics for θ estimation as well as the score reliability coefficient. As shown in Table 3, the MAE tended to decrease among the ISR conditions as the observed correlation increased. In comparison with the MST condition with routing stage, all ISR conditions exhibited smaller MAE. The MAE observed under ISR Condition 3, for example, was smaller than that observed for the MST condition with routing stage by about 0.017. Among the ISR conditions, the difference in MAE between Conditions 2 and 3 was only about 0.001, whereas the difference between Conditions 1 and 2 was about 0.009. This suggests that the measurement efficiency of MST with ISR improves as the correlation between section scores increases. Once the correlation becomes larger than that achieved with Condition 2 (R = 0.378), however, further increases in correlation between section scores will not necessarily lead to a meaningful increase in ISR measurement efficiency. ISR Condition 0, which essentially represented unrealistic, fictional cases with zero measurement error with
Table 3. Performance of MST With ISR Under Different Conditions in Study 1.
Note. MST = multistage testing; ISR = intersectional routing; MAE = mean absolute errors.
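The evaluation statistics can be computed from the true and estimated θ values of the simulees as sketched below. MAE and bias follow their usual definitions. The reliability function is an assumption: it uses the squared correlation between true and estimated θ, a common choice in simulation studies, because the exact coefficient used in the study is not spelled out in the text. The function names and the toy values are hypothetical.

```python
def mean_absolute_error(true_thetas, estimates):
    """Average absolute difference between estimated and true theta."""
    return sum(abs(e - t) for t, e in zip(true_thetas, estimates)) / len(true_thetas)

def bias(true_thetas, estimates):
    """Average signed difference (estimate minus truth); zero means no systematic error."""
    return sum(e - t for t, e in zip(true_thetas, estimates)) / len(true_thetas)

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def reliability(true_thetas, estimates):
    """Assumed definition: squared correlation between true and estimated theta."""
    return pearson(true_thetas, estimates) ** 2

# Toy values for three simulees.
true_thetas = [0.0, 1.0, -1.0]
estimates = [0.5, 0.5, -1.0]
print(mean_absolute_error(true_thetas, estimates))
print(bias(true_thetas, estimates))
print(reliability(true_thetas, estimates))
```

Conditional bias of the kind plotted against θ can be obtained by applying the bias function within narrow intervals of true θ rather than over the whole sample.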
In terms of estimation bias, all studied conditions showed minimal bias (≤0.042), as shown in Table 3. The conditional estimation bias patterns in Figure 3 showed no meaningful differences among the studied conditions for most areas of θ. The overall results suggest that use of the ISR approach did not introduce systematic estimation errors.

Figure 3. Conditional estimation bias in Study 1.
Study 2: MST With 3-4 Structure
In Study 1, the first stage of Section Y had only two modules and a routing cut score of about zero. The intercept of the regression models to predict ISR scores (
MST Design and Data
Study 2 involved the development of an MST with ISR design with a 3-4 structure. As in Study 1, all items in the test modules had an a-parameter value of 1 and a c-parameter value of 0.2, differing only in b-parameter value. The first stage of Section Y included three modules that had different peak locations for the MIF but the same shapes. The routing cut score was about −0.3 between the Stage 1 Easy and Stage 1 Medium modules and about 0.3 between the Stage 1 Medium and Stage 1 Hard modules. The second stage involved four modules, again with the same shape and differing only in the location of the MIF peak. All other study conditions remained the same as in Study 1.
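First-stage routing with the two cut scores described above can be sketched as a simple lookup. The cut scores of −0.3 and 0.3 come from the text; the regression coefficients passed to the scoring function are made-up placeholders, the handling of a score that falls exactly on a cut (routed to the harder module) is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

STAGE1_MODULES = ["Easy", "Medium", "Hard"]
ROUTING_CUTS = [-0.3, 0.3]  # Easy/Medium and Medium/Hard cut scores

def isr_score(theta_x_hat, intercept, slope):
    """Predicted Section Y score from the Section X theta estimate."""
    return intercept + slope * theta_x_hat

def route_first_stage(score):
    """Pick the first-stage module whose score interval contains the ISR score."""
    return STAGE1_MODULES[bisect_right(ROUTING_CUTS, score)]

# Placeholder regression coefficients, for illustration only.
for theta_x_hat in (-1.5, 0.2, 1.8):
    score = isr_score(theta_x_hat, intercept=0.0, slope=0.5)
    print(theta_x_hat, round(score, 2), route_first_stage(score))
```

Because the regression slope is less than 1 when scores are imperfectly correlated, predicted scores are pulled toward the mean, so the share of test takers routed to the Easy and Hard modules depends on how far apart the cut scores are relative to the spread of the ISR scores.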
Results and Discussion
Although the true latent scores,
Table 4. Regression Models for ISR in Study 2.
Note. ISR = intersectional routing.
The percentage of simulees routed to optimal paths in Study 2 (Table 5) was smaller than that observed in Study 1 (Table 2), because of the expanded number of paths in the 3-4 MST structure. Across the ISR Conditions 1 to 3, the higher the observed correlation coefficient between
Table 5. Percentage of Observed MST Paths From Study 2.
Note. MST = multistage testing; ISR = intersectional routing.
Paths considered as less than optimal.
The MAEs observed in Study 2 were generally smaller than those observed in Study 1 across all ISR conditions, by 0.010 to 0.017 (Table 6). Overall, the results of Study 2 suggest that the effectiveness of MST with ISR can be maintained or further improved when there are more than two modules in the first stage.
Table 6. Performance of MST With ISR Under Different Conditions in Study 2.
Note. MST = multistage testing; ISR = intersectional routing; MAE = mean absolute errors.
Study 3: ISR Scenarios With Other Regression Models
In Studies 1 and 2, simple linear regression models were used to compute the ISR scores (
In real-world applications, the correlational relationship between

Comparison of linear and polynomial (to 4th order) regression models when distribution of the correlational relationship between
A multiple regression approach also could be considered for MST with ISR implementations. Many test programs have more than two sections measuring different traits. For example, one could consider ISR for a test program with three test sections measuring traits X, Y, and Z, which, after sections measuring X and Y are finished, uses a multiple regression predicting
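Both alternatives discussed in this section, polynomial regression on a single previous section score and multiple regression on two previous section scores, are ordinary least squares fits that differ only in the predictor columns. The sketch below shows this with a small normal-equations solver. It is an illustration under assumptions: the solver, the function names, and the toy data are hypothetical, and an operational implementation would fit these models to simulated θ estimates from the actual test forms.

```python
def solve(a, b):
    """Solve the square linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(p)]
    return solve(xtx, xty)

def polynomial_features(x, degree):
    """Predictor row [1, x, x^2, ..., x^degree] for polynomial regression."""
    return [x ** k for k in range(degree + 1)]

def predict(coefficients, row):
    """Predicted score for one predictor row."""
    return sum(c * v for c, v in zip(coefficients, row))

# Polynomial regression of a Section Y score on a Section X score (toy data).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]
poly_coef = least_squares([polynomial_features(x, 2) for x in xs], ys)

# Multiple regression of a Section Z score on Section X and Y scores (toy data).
xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
zs = [1.0 + 2.0 * x - 3.0 * y for x, y in xy]
multi_coef = least_squares([[1.0, x, y] for x, y in xy], zs)

print(poly_coef)
print(multi_coef)
```

Higher-order polynomial terms make the normal equations increasingly ill-conditioned, so a numerically stabler routine (for example, a QR-based solver from a numerical library) would be preferable for anything beyond a low-order fit.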
Real-World Implementation of MST With ISR
The Graduate Management Admission Council® (GMAC®) developed an Executive Assessment (EA) to provide Executive MBA schools and programs with a new tool for admissions decisions. The EA exam has three test sections, which measure integrated reasoning (IR), verbal reasoning (VR), and quantitative reasoning (QR) skills, respectively. Each section is separately timed at up to 30 minutes and consists of two stages. Across all panels, the average correlation is about .55 between observed IR scores and VR scores and about .40 between observed IR scores and QR scores.
In actual administrations of EA with ISR, the average percentage of tests administered with optimal routing paths was about 86% for the VR section and 80% for the QR section. IR, VR, and QR section scores, which are linear transformations of IRT θ estimates, show reliability of about 0.77 to 0.80, and the EA total, which is a sum of IR, VR, and QR section scores, shows reliability of about 0.87, which is quite high for a composite score of multiple traits given the short test length (40 items in total). Compared with the experimental version of EA, which had a linear routing stage in each section without ISR, the operational EA with ISR resulted in a significant improvement in EA total score reliability across all test panels, by at least 0.04, without lengthening the test or changing the quality of items.
Discussion and Conclusion
Although the idea of utilizing preknowledge (i.e., prior) or collateral information of other trait dimensions for adaptively selecting items and estimating θ has been brought up under MIRT and MAT frameworks, there have been no studies to date that offer any real-world solutions for MST-based test programs. The ISR framework for MST with multiple modules in the first stage was proposed in this study as a means to improve the adaptiveness of MST especially when the test length is very short (<15 items) and the number of stages is small (two or three). This study proposed and evaluated the implementation of an ISR framework by developing different regression models based on observed score distributions from form-specific simulations and using them for ISR in the first stage of an MST section. The overall results of the study indicate that when there is a moderate level of observed correlational relationship between section scores (R > .37, for example), applying ISR can improve the measurement efficiency of MST and help reduce the number of items by about 10% of test length under the studied conditions. The study also confirmed that implementation of ISR did not cause any score estimation bias and helped to further reduce both systematic and nonsystematic estimation errors.
In all MST-based applications (with or without ISR), there is always a possibility of test takers being routed to a suboptimal module due to measurement error. Most such suboptimal routings happen for test takers whose θ is located near a routing cut score. In MST with ISR, the possibility of suboptimal routing in the first stage is expected to be greater than in typical MST because of the role of prediction error (from the regression model used in ISR) on top of the measurement error from the linked previous section(s). Therefore, in practice, it is very important to mitigate possible negative impacts of such suboptimal routing, especially in the first stage of MST with ISR. One of the most effective approaches to accomplish that goal is to design MST modules in a “staggering” manner between stages. The idea of “staggering” modules is similar to the process of staggering bricks when building a wall. When builders stack up bricks to construct a wall, they lay bricks staggered between courses; in other words, one brick sits on top of two bricks in the row below so that the weak spot of brick joints is reinforced by the staggered brick on top. In the example of Figure 1 (middle), the routing cut score for Stage 1 was at
Although all simulation studies in this article were based on the conventional MST with preassembled modules, the proposed framework for developing ISR-based MST with short test length is fully applicable to MST approaches with module assembly on-the-fly using CAT item selection methods (Zheng & Chang, 2011; Zheng et al., 2014), the module shaping method (Han & Guo, 2014), or a shadow test assembler (van der Linden & Diao, 2014). In MST with modules assembled on-the-fly, the target module difficulty and/or the location of peak MIF for the first stage could be determined using the ISR development framework and regression model,
The conditions studied in this research were extensive, covering different MST designs and different relationships and distributions of θ, as well as different types of regression models for ISR. It should be noted, however, that the findings of this study are not necessarily generalizable to all other MST conditions and designs, as there are numerous factors, such as the shapes of MIFs and their level of overlap in each stage, that can substantially affect MST performance and behavior. The good news is that, with the proposed ISR framework, test developers and practitioners can easily investigate and evaluate the expected ISR performance and behaviors with the specific designs and forms of their own MST-based test programs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
