Abstract
This paper evaluates the multistage adaptive test (MST) design of a large-scale academic language assessment (ACCESS) for Grades 1–12, with an aim to simplify the current MST design, using both operational and simulated test data. Study 1 explored the operational population data (1,456,287 test-takers) of the listening and reading tests of MST ACCESS in the 2018–2019 school year to evaluate the MST design in terms of measurement efficiency and precision. Study 2 is a simulation study conducted to find an optimal MST design with manipulation on the number of items per stage and panel structure. The results from operational test data showed that the test length for both the listening and reading tests could be shortened to six folders (i.e., 18 items), with final ability estimates and reliability coefficients comparable to those of the current test, with slight differences. The simulation study showed that all six proposed MST designs yielded slightly better measurement accuracy and efficiency than the current design, among which the 1-3-3 MST design with more items at earlier stages ranked first. The findings of this study provide implications for the evaluation of MST designs and ways to optimize MST designs in language assessment.
Introduction
Computerized multistage adaptive testing (MST) has gained great momentum in both educational assessment research and practice in the past decade (Chen et al., 2014; MacGregor et al., 2022; Yan et al., 2014; Yang & Reckase, 2020). Many assessments have made their transition from traditional paper-and-pencil testing or computerized adaptive testing (CAT) to MST, such as the Graduate Record Examination (GRE; Robin et al., 2014), the World-class Instructional Design and Assessment (WIDA) Assessing Comprehension and Communication in English State-to-State (ACCESS) for English language learners (ELLs) (hereafter ACCESS; MacGregor et al., 2022), the National Assessment of Educational Progress (NAEP; Yamamoto et al., 2019), the Program for International Student Assessment (PISA; Educational Testing Service, 2016; Yamamoto et al., 2018), and the revised General Test and Comprehensive Testing Program 4 (CTP4) for kindergarten through grade 12 (K-12) students (Wentzel et al., 2014).
Unlike CAT, where adaptation occurs at the individual item level, MST adapts using a prebuilt module of items, and is thus often regarded as a compromise between CAT and paper-and-pencil tests (Jodoin et al., 2006; Kim et al., 2012). Compared to CAT, MST offers several advantages, including enhanced test security, better item exposure control, well-controlled content balancing, and less demand on the item pool size (Hendrickson, 2007; Pohl, 2013). However, MST requires a more complex test assembly process. For instance, the number of items per module, the number of stages needed to achieve the desired psychometric properties, and the number of routing paths must all be preconfigured.
In language assessment, few studies have evaluated the performance of MST, although some studies have addressed the comparability of CAT scores with their computer-based or paper-and-pencil counterparts (e.g., He & Min, 2017; Mizumoto et al., 2019) and some have compared the measurement precision of MST with that of its paper-and-pencil counterparts (e.g., MacGregor et al., 2022). Nevertheless, the evaluation of MST should differ from that of CAT, because it involves determining the optimal preconfigured design, which is much more complicated than examining the score comparability or measurement precision typically addressed in CAT studies.
The central issue explored in MST studies in educational assessment is the optimal preconfigured design. Much research has been undertaken to assess how MSTs perform in various scenarios, considering factors such as the number of stages and the number of modules within each stage (Han, 2020; Luecht et al., 2006; Luo & Kim, 2018; Svetina et al., 2019; Zenisky & Hambleton, 2014). These studies show that assembling too many stages may increase the complexity of the MST design without contributing much to the measurement precision of the final test (Luecht & Nungester, 1998; Zenisky et al., 2010). However, some MST designs in language assessment, for example, the MST ACCESS design, feature a much larger number of stages than conventional MST designs. Although maximizing the number of stages in MST seems to enhance measurement efficiency through an increased number of adaptive points (MacGregor et al., 2022), it may undermine the advantages of MST over CAT (Han, 2020). For example, if we change an MST design with 30 items from three stages (10 items per stage) to 10 stages (three items per stage), higher chances of item skipping and less flexibility in item review may result. Furthermore, it may become exponentially more complicated for the test developer to ensure content and psychometric control of all paths as the number of stages increases. Therefore, determining the optimal MST design involves more than simply focusing on measurement efficiency. Researchers and practitioners need to strike a balance among measurement precision, measurement efficiency, user experience, and other practical issues (Zenisky & Hambleton, 2014). To the best of our knowledge, no research has been conducted to systematically evaluate the preconfigured design of MST in language assessment.
The aim of this study is to evaluate the MST design of a large-scale language assessment and seek ways to optimize the current design, using both simulated and operational test data. It is hoped that this study may provide a point of reference for future studies evaluating MST designs in language assessment.
Computerized MST
MST comprises a number of stages within which pre-constructed item sets at different difficulty levels are selected for test administration. Compared with CAT, the MST design has some new components, such as modules, stages, panels, and routes (Luo & Kim, 2018; Park et al., 2014). The pre-constructed item sets are called modules, which are the smallest units of the MST. The next unit of MST is the stage, which involves one or more modules that differ in difficulty level. Each test-taker takes only one of the modules within a stage, depending on their responses to the previous module(s). The entire collection of the modules completed by test-takers is named a panel, the largest unit of MST, and the pathway used by a test-taker is called a route. A 1-2-2 panel structure indicates that there is one module at the first stage and two modules at each of the second and third stages.
The reasons why MSTs have been increasingly adopted by testing corporations in the past decades are threefold. The first reason pertains to the fact that MSTs offer better measurement precision than linear fixed-item tests, and lower cost of test administration than CATs (Xiong, 2018). Compared with linear tests that consist of items targeting the average test-takers and may not offer precise measurement for test-takers at the two extremes of the ability continuum, CATs are tailored to individual’s needs through administering test items that closely match each test-taker’s estimated ability.
In this way, CATs increase the accuracy of ability estimates and reduce the test length by up to 50% (Han, 2018; He & Min, 2017; Huang et al., 2022; Mizumoto et al., 2019; Oladele & Ndlovu, 2021; Reckase, 2010). However, they are increasingly criticized in both research and operational fields in terms of the high cost of test administration and computational complexity (Xiong, 2018). As a result, MSTs, as a compromise between linear fixed-item tests and CATs, have gained great momentum in the past decade. For example, some assessments have replaced CATs with MST versions, such as the GRE and the NAEP (Robin et al., 2014; Yamamoto et al., 2019; Zeng, 2016).
A second major advantage of MSTs is that they can alleviate the ability underestimation or overestimation problem detected in earlier CAT applications. In 2000 and 2003, the CATs of the GRE and the Graduate Management Admission Test (GMAT) were found to yield severely overestimated or underestimated scores for thousands of test-takers (Carlson, 2000; Chang & Ying, 2008), a problem primarily caused by the item selection algorithm that maximized Fisher information (Chang & Ying, 2008; Zheng & Chang, 2015). Under this algorithm, responses to the first few items heavily influence final ability estimates in a short test (Zheng & Chang, 2015). For instance, if a high-proficiency test-taker accidentally gets the first few items wrong, or if a low-proficiency test-taker guesses the first few items correctly, the high-proficiency test-taker would be underestimated and the low-proficiency test-taker overestimated. MSTs overcome this problem in that they estimate ability only at the end of the entire first stage and are thus generally immune to unstable and inaccurate interim estimates.
Another important feature of MSTs is that they guarantee better quality control, as all test forms in MSTs are preassembled. Test developers can review all forms to make sure that the various content and psychometric constraints are met before administration (Luo & Kim, 2018; Zheng & Chang, 2015). However, if there are many stages and test forms, it would be cumbersome and expensive for test developers to review all test forms. Therefore, striking a balance between measurement precision and measurement efficiency is warranted.
A plethora of studies have been conducted to examine the performance of MSTs in different settings regarding the number of stages, the number of modules at each stage, and the routing method (Han, 2020; Luecht et al., 2006; Luo & Kim, 2018; Svetina et al., 2019; Zenisky & Hambleton, 2014). Theoretically, more stages and larger differences in modules’ difficulty allow greater adaptation with fewer routing errors and thus more flexibility and opportunity to recover from inappropriate routing (Yan et al., 2014). Practically, two, three, or four stages have been used in most MST research and applications (Jodoin et al., 2006; Yan et al., 2014). According to Betz and Weiss (1974), a two-stage design uses the first stage for initial ability estimation and the second stage for final measurement, which might be prone to error because it offers only a single adaptation point. A comparison showed that increasing the number of stages from two to three increased the accuracy of proficiency estimates (Patsula & Hambleton, 1999), as well as the efficiency of MST designs (Yan et al., 2014). However, adding more stages to the test increases the complexity of the test assembly without necessarily adding much to the measurement precision of the final test forms (Luecht & Nungester, 1998; Zenisky et al., 2010).
The same logic applies to the number of modules at each stage. Theoretically, the more modules there are at each stage, the better the MST will perform, as more modules per stage increase the chance of finding the optimal module that maximizes Fisher information at the interim ability estimates of the test-takers. However, assembling more than five modules per stage may complicate the MST design without adding much more precision, and is thus not recommended, especially considering that it is not cost-effective (Han, 2020). Previous research has shown that a reasonable practice is to assemble a maximum of four modules per stage (Armstrong et al., 2004; Han, 2020; Yan et al., 2014). These principles possibly explain why three-stage panel designs (e.g., 1-2-3, 1-3-3) are widely used in previous MST research (Jodoin et al., 2006; Sahin, 2020).
Another important issue in MST design concerns the method used to route test-takers into the next optimal module. Two routing strategies have generally been employed in MST design. The first is the number-correct method (Han, 2020; Sari & Raborn, 2018), where a test-taker’s number of correct responses is compared with a predetermined cut point to decide which module the test-taker will take next. The number-correct method is straightforward and thus easy to communicate to stakeholders (Sari & Raborn, 2018); however, it does not necessarily select the most tailored module for each test-taker because there is a discrepancy between the number-correct score and the test-taker’s true ability. The second is information-based routing, where the module that leads to minimal measurement error is selected for the test-taker using an item response theory (IRT) approach, such as maximum information (MI; Hendrickson, 2007; Yan et al., 2014). More specifically, it calculates the expected information gain from each remaining module and selects the one that is expected to provide the most information about the test-taker’s ability, given their current estimated ability. The MI method is arguably one of the most commonly used routing approaches in MST design (Han, 2020), and is generally recommended by researchers (Magis et al., 2017).
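The two routing strategies can be illustrated with a minimal Python sketch. It assumes Rasch-calibrated items, for which the Fisher information of an item at ability theta is p(1 − p); the module difficulties, the cut points, and all function names are hypothetical and chosen only for illustration.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def module_information(theta, difficulties):
    """Fisher information of a module at theta: sum of p * (1 - p) over its items."""
    total = 0.0
    for b in difficulties:
        p = rasch_prob(theta, b)
        total += p * (1.0 - p)
    return total

def route_number_correct(n_correct, cut_points):
    """Index of the next module: the number of cut points the raw score reaches."""
    return sum(n_correct >= cut for cut in cut_points)

def route_max_information(theta_hat, modules):
    """Name of the candidate module with the most information at theta_hat."""
    return max(modules, key=lambda name: module_information(theta_hat, modules[name]))

# Hypothetical three-item modules (difficulties in logits) for one stage.
stage_modules = {"A": [-1.5, -1.0, -0.5], "B": [0.0, 0.5, 1.0], "C": [1.5, 2.0, 2.5]}

print(route_number_correct(4, cut_points=[3, 5]))  # 1, i.e., the middle module
print(route_max_information(2.1, stage_modules))   # C
```

The number-correct rule needs only the raw score, whereas the MI rule needs an interim ability estimate and item parameters, which is why the former is easier to communicate and the latter is better tailored.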
WIDA ACCESS for ELLs
ACCESS for ELLs Online 2.0 is a large-scale, multistage adaptive English language proficiency test, administered to an annual test-taking population of over 1 million ELLs in Grades 1–12 in 40 states in the United States. It is designed on the basis of WIDA’s English language development (ELD) standards (WIDA Consortium, 2014), which describe the developing English language proficiency of K-12 ELLs in the academic language of five areas, including social and instructional language (SIL), language of language arts (LoLA), language of math (LoMA), language of science (LoSC), and language of social studies (LoSS). Anchored in WIDA’s ELD standards, test forms are generated for five grade clusters, namely, Grade Clusters 1, 2–3, 4–5, 6–8, and 9–12. Within each test form, items are developed to target five language proficiency levels (PLs), ranging from “entering” (PL1) to “bridging” (PL5).
The listening and reading tests of ACCESS are operated in an MST system that allows students to navigate among three different modules based on their performance throughout the test. Figure 1 displays the structure of the MST ACCESS listening test. The listening test has six to eight stages with a 1-1-3-3-3-3 (-2-2) panel structure. Each stage has three modules, and each module includes three items bundled around a common passage. The modules at the first and second stages are entry modules targeting SIL. Based on their performance on the six SIL items at the first two stages, test-takers are routed to one of the three leveled LoLA folders at Stage 3. Specifically, test-takers’ interim ability estimates are compared to two predetermined thresholds. Test-takers whose interim ability estimates exceed the higher cut are routed to the folder at the highest level (i.e., C in Figure 1), those whose interim ability estimates are below the lower cut are routed to the folder at the lowest level (i.e., A in Figure 1), and all others are routed to the B folder. Throughout the test, a test-taker’s interim ability is re-estimated after each stage, and the next module to be administered is selected on that basis. Test-takers are not restricted to routing through modules at adjacent levels. A test-taker whose interim ability estimate is much higher than that at the previous stage may be routed to a module more than one level higher than the current one (e.g., from A to C), and vice versa. At the sixth stage, the test is terminated if a test-taker’s ability estimate is below the lower cut (i.e., Tier A), and the remaining test-takers take two more stages (Stages 7 and 8).
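The cut-based routing rule described above can be sketched as follows. The two cut values and the sequence of interim estimates are hypothetical (the operational cuts are not given in the text), and the function names are illustrative only.

```python
def route_by_cuts(theta_hat, lower_cut, upper_cut):
    """Return the tier of the next folder given an interim ability estimate."""
    if theta_hat > upper_cut:
        return "C"
    if theta_hat < lower_cut:
        return "A"
    return "B"

def continues_after_stage_six(theta_hat, lower_cut):
    """The test stops at Stage 6 when the estimate falls below the lower cut."""
    return theta_hat >= lower_cut

# Hypothetical cuts (in logits) and interim estimates after successive stages.
LOWER_CUT, UPPER_CUT = 1.0, 3.0
interim_estimates = [0.4, 3.2, 2.0, 3.5]

path = [route_by_cuts(t, LOWER_CUT, UPPER_CUT) for t in interim_estimates]
print(path)  # ['A', 'C', 'B', 'C']
print(continues_after_stage_six(interim_estimates[-1], LOWER_CUT))  # True
```

The jump from A to C in the example path reflects the point that routing is not restricted to adjacent levels.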

Figure 1. Structure of MST ACCESS listening test.
The reading test uses the same navigating algorithm, except that it has 8 to 10 stages (see Figure 2). The language standards follow the same sequence as in the listening test (i.e., SIL–LoLA–LoMa–LoSS–LoSc–LoLA–LoMa), except that one more LoSS folder (i.e., Tier B or C) and one more LoSc folder (i.e., Tier B or C) are included in the last two stages for Tier B and C students in the reading test.

Figure 2. Structure of MST ACCESS reading test.
One noteworthy feature of this MST design is that the number of items in the MST ACCESS is even larger than that in the linear ACCESS. Tier A students take the same number of items in the two versions of ACCESS, but Tier B and C students take three more items in the MST ACCESS than in the linear ACCESS (MacGregor et al., 2022). A relevant question, then, is whether the MST ACCESS can be shortened to reduce the cognitive demand on students while maintaining the same level of measurement precision, since previous research shows that MSTs can yield measurement precision similar to that of linear fixed-item tests with fewer items (Oranje et al., 2014; Yamamoto et al., 2018).
Another issue is that the number of stages is much greater than what is suggested in the literature (Yan et al., 2014). There are 6 to 8 stages for the listening test and 8 to 10 stages for the reading test. Although including more stages in MST ACCESS decreases the chance of routing errors with multiple adaptation points (MacGregor et al., 2022), such a complicated design makes it difficult for the test developers to review all test forms’ content and psychometric quality before test administration. With a 1-1-3-3-3-3 (-2-2) panel structure for the listening test and 1-1-3-3-3-3-3-3 (-2-2) structure for the reading test, the number of possible pathways, theoretically, exceeds 100. Reviewing over 100 forms before test administration presents a challenging task for both content experts and psychometricians.
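A rough upper bound on the number of pathways can be obtained by multiplying the number of modules across stages. The sketch below assumes that any module at one stage can be followed by any module at the next (consistent with non-adjacent routing being allowed) and ignores whether every such route is actually reachable in practice, so the counts are illustrative rather than operational figures.

```python
from math import prod

def count_routes(modules_per_stage):
    """Upper bound on distinct routes: product of module counts across stages."""
    return prod(modules_per_stage)

# Panel structures as lists of module counts per stage.
listening_core = [1, 1, 3, 3, 3, 3]         # routes ending at Stage 6
listening_full = listening_core + [2, 2]    # routes continuing to Stage 8
reading_core = [1, 1, 3, 3, 3, 3, 3, 3]     # routes ending at Stage 8
reading_full = reading_core + [2, 2]        # routes continuing to Stage 10

print(count_routes(listening_core), count_routes(listening_full))  # 81 324
print(count_routes(reading_core), count_routes(reading_full))      # 729 2916
```

Under these assumptions, the full-length routes alone exceed 100 for both tests, which illustrates how quickly the review burden grows with each added stage.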
In addition, the routing decisions in MST ACCESS are based on the a priori difficulty of the folders, rather than their empirical difficulty, that is, the difficulty estimated from real test data (MacGregor et al., 2022). Items written by content experts to target Tier B may have empirical difficulty values within the Tier A range, although, on average, Tier C folders are more difficult than Tier B folders, and Tier B folders more difficult than Tier A folders (MacGregor et al., 2022). In such a design, the next module selected may not be the one best tailored to each test-taker. It is therefore necessary to examine whether the routing method can be improved by using the empirical difficulty values of the folders via the MI method, the most commonly used routing approach in MST design (Han, 2020).
The article presents two studies to (1) evaluate the current ACCESS MST design to see whether similar measurement precision can be reached with fewer items using operational test data and (2) propose ways to improve the current ACCESS MST design using simulated test data.
Study 1: Evaluating current MST design
The purpose of Study 1 was to explore the possibility of reducing the number of items administered in MST ACCESS while maintaining similar psychometric properties. Specifically, we aimed to find the minimum number of folders needed to achieve satisfactory measurement accuracy and precision, by comparing interim ability estimates at different stages with those at the last stage, and by comparing test reliability at different stages with those at the last stage.
Data
This study used operational population data from the 2018–2019 administration, comprising the responses of 1,456,287 test-takers to the listening and reading tests of MST ACCESS across five grade clusters in 38 WIDA states and US territories. Table 1 shows the number of test-takers by gender and ethnicity at different grade clusters.
Table 1. Participation by gender and ethnicity at different grade clusters.
Data analysis
We first compared the means and standard deviations of test-takers’ interim ability estimates at different stages with the final ability estimates to identify the stage from which the interim estimates remained stable and close to the final ones. We then compared the interim reliability estimates at different stages with the final reliability estimates to identify the stage from which test reliability remained close to the final value. Reliability was computed using the coefficient described by Thissen (2000), which is an estimate of the ratio of true measure variance to observed measure variance, similar to the Rasch separation reliability coefficient (Linacre, 1999). We did not use Cronbach’s coefficient alpha because it is best suited for situations where all respondents answer the same set of questions, which is not the case for the MST. After that, we employed scatterplots and correlation coefficients to examine whether the possible removal of folders would exert much influence on individual test-takers’ final ability estimates. Finally, we performed a mixed regression to see whether the possible removal of folders would affect test-takers differently across gender and race.
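The formula from Thissen (2000) is not reproduced in the text. The sketch below therefore assumes one common separation-type form of such a coefficient, in which true variance is approximated as the observed variance of the ability estimates minus the mean squared standard error; the function name and the toy values are hypothetical.

```python
from statistics import mean, pvariance

def separation_reliability(theta_hats, standard_errors):
    """Estimated true variance divided by observed variance of ability estimates.

    True variance is approximated as observed variance minus mean error variance
    (an assumed formulation, not necessarily the exact one used in the study).
    """
    observed_var = pvariance(theta_hats)
    error_var = mean([se ** 2 for se in standard_errors])
    return (observed_var - error_var) / observed_var

# Toy data: five ability estimates (logits) with a common standard error of 0.6.
estimates = [0.0, 1.0, 2.0, 3.0, 4.0]
errors = [0.6] * 5
print(round(separation_reliability(estimates, errors), 2))  # 0.82
```

Because the coefficient depends only on each test-taker's estimate and standard error, it can be computed even though test-takers answer different sets of items, which is the reason given above for not using coefficient alpha.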
Results and discussion
Table 2 shows all test-takers’ interim ability estimates at each stage and their final ability estimates in the listening and reading tests. Scale scores are presented in this table because WIDA reports scale scores to test-takers after applying a linear transformation to the ability estimates (logit scores) to make scores more understandable to stakeholders. The scale scores range from 100 to 600.
Table 2. Interim and final ability estimates in listening and reading.
Note: (1) S stands for stage; (2) S1 was not presented because the adaptive engine starts to estimate ability after test-takers finish the two entry folders.
As shown in Table 2, for listening, both the mean scale score and the standard deviation stayed close to the final values from the fifth folder (Stage 5) onward. For reading, the mean scale score stayed close to the final scale score from the third folder (Stage 3) onward, while the standard deviation started to stabilize from the fifth folder (Stage 5). It should be noted that the last two folders (i.e., S7 and S8 in listening, and S9 and S10 in reading) were only finished by Tier C students; it is therefore expected that the standard deviation after these two stages is smaller than that at all the other stages. The trend can be seen more clearly in Figure 3, which shows the boxplots of scale scores for reading and listening at each stage.

Figure 3. Scale scores at different stages in listening and reading.
We then broke down the population data by grade cluster and examined the change in scale scores across stages for each grade cluster. As shown in Figure 4, for listening, the scale scores stabilized after the fifth folder (Stage 5) for all grade clusters. For reading, the scale scores also stabilized after the fifth folder (Stage 5). This finding indicates that administering two more folders (i.e., LoLA, LoMa) to Tier B and C students may not have much effect on the mean ability estimates.

Figure 4. Ability estimates at different stages across grade clusters.
Table 3 shows the interim and final reliability estimates in the listening and reading tests. As expected, the more folders students take, the more reliable the score is. However, for listening, the reliability generally exceeded 0.8 from the sixth folder onward, with the exception of Grade Clusters 2–3 and 4–5. For reading, the reliability stayed above 0.8 from the sixth folder onward for all grade clusters. These findings suggest that removing the last two folders in the listening test and the last four folders in the reading test would not substantially compromise the test’s reliability. Even with these exclusions, the reliability would still attain a level comparable to that of the original test, albeit slightly diminished.
Table 3. Interim and final reliability estimates in listening and reading.
Truncated value; for simplicity, only coefficients at even stages are presented.
To examine whether such removal would influence individual test-takers’ scale scores, we compared the scale scores after the sixth folder with those after the final folder for listening and reading in Grade Clusters 6–8. The vertical axis in Figure 5 represents the interim ability estimates after the sixth folder in the listening (Panel a) or reading test (Panel b), and the horizontal axis represents the final estimates. The points in the scatterplots are tightly clustered around the regression line for both listening and reading. The scores at Stage 6 are highly correlated with those at the final stage for listening (r = 0.96, p < .001) and reading (r = 0.97, p < .001), indicating that the sixth folder scores account for 92% and 94% of the variance in the final folder scores for listening and reading, respectively. There is more variability at the higher end of the ability estimates, but the relationship is generally very tight. This suggests that adding more folders at the later stages does not have much effect on the ability estimates of individual test-takers.

Figure 5. Comparison of the interim ability estimates at the sixth and final stages.
Next, we conducted a mixed regression analysis on the listening scores at the sixth and final stages in Grade Clusters 6–8 as an example to see whether the removal of two folders would exert differential influence on test-takers across gender and race. The results showed that there are differences in student performance across gender, F(2, 394,986) = 205.31, p < .001, ethnic group, F(6, 394,986) = 481.01, p < .001, and stage, F(1, 394,986) = 607.64, p < .001. However, the significant effect might be due to the large sample size used in the current study. Effect sizes were then calculated using Cohen’s d, with a d value of 0.2 denoting a small effect, 0.5 a moderate effect, and 0.8 a large effect (Cohen, 1988). All effect sizes were small, ranging from 0.02 to 0.29, and falling well below the minimum threshold of 0.5 for a moderate effect.
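For readers who wish to reproduce this kind of effect-size check, the following sketch computes Cohen's d with a pooled standard deviation and labels it using the thresholds cited above. The pooled-variance form is an assumption (the text does not state which variant was used), the "below small" label is added here only for values under 0.2, and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def label_effect(d):
    """Label |d| with the benchmarks 0.2 (small), 0.5 (moderate), 0.8 (large)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"

# Toy scores for two hypothetical groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, label_effect(d))                      # 0.5 moderate
print(label_effect(0.29), label_effect(0.02))  # small below small
```

Unlike the F tests, d does not grow with sample size, which is why it is informative when nearly 400,000 observations make almost any difference statistically significant.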
Figure 6 demonstrates that the magnitude of performance differences across gender and ethnicity remained similar across Stages 6 and 8.

Figure 6. Ability estimates at the sixth and final stages across gender and race.
In sum, the main findings are that (1) students’ ability estimates remained stable and close to the final ability estimates after the fifth folder, (2) test reliability estimates were generally above 0.8 after the sixth folder, and (3) the vast majority of test-takers’ ability estimates at the sixth stage were similar to their final ability estimates. This suggests that reducing test length from 24 to 18 items in the listening test and from 30 to 18 items in the reading test would not exert much influence on the overall outcome.
Study 2: Simulating new MST design
This simulation was conducted to explore ways to improve the current ACCESS MST design by manipulating the number of items per stage and the panel structure without changing the item bank. The simulation focused on the listening test of MST ACCESS for Grade Clusters 6–8. This grade cluster was chosen to represent the population because it lies close to the middle of all grade clusters and its ability distribution spans a wide range, from −4 to 6 on the logit scale.
Data
Simulated data were used to examine the performance of different MST designs and to find the optimal one for MST ACCESS. Referring to the ability distribution in real test data, we simulated 10,000 examinees whose abilities were drawn from a normal distribution with a mean of 2.24 and a standard deviation of 1.28. The same ability distribution was used across all conditions in all simulated MST designs. This allows the different MST designs to be compared with enough test-takers at different ability levels.
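The sampling step can be sketched in a few lines of Python. The mean, standard deviation, and sample size come from the text; the seed and the function name are arbitrary choices made here for reproducibility of the illustration.

```python
import random
from statistics import mean, stdev

def simulate_abilities(n_examinees, mu=2.24, sigma=1.28, seed=2019):
    """Draw true abilities (logits) from a normal distribution.

    The seed is an arbitrary value chosen for this sketch, not one from the study.
    """
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_examinees)]

thetas = simulate_abilities(10_000)
# The sample mean and SD should land close to 2.24 and 1.28.
print(round(mean(thetas), 2), round(stdev(thetas), 2))
```

Fixing the seed and reusing the same list of abilities for every design mirrors the point that all conditions share one ability distribution, so differences in outcomes can be attributed to the designs.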
The item bank was composed of 18 items related to five content categories (i.e., SIL, LoLA, LoMa, LoSS, and LoSc) which were calibrated by Rasch model and used in the operational test administration of MST ACCESS in 2018–2019.
MST test assembly
A baseline design (see Figure 7) was first specified to represent the current MST ACCESS design, the only differences being that (1) 18 items (i.e., six folders) were used instead of 24 (i.e., eight folders) and (2) the MI routing method was employed to route test-takers into different modules to maximize the measurement precision of the test instead of comparing the ability estimates with predetermined difficulty cuts.

Figure 7. Design 1: Baseline MST panel design.
Then three MST panels were assembled based on module length, using the 1-3-3 panel design, which is one of the most commonly used designs in the MST literature (Kim et al., 2012; Luo & Kim, 2018; Sari & Raborn, 2018). The three levels of module length were set to be: (1) equal number of items at each stage (six items–six items–six items), (2) more items at the second stage (six items–nine items–three items), and (3) more items at the last stage (six items–three items–nine items) (see Figure 8).

Figure 8. Studied 1-3-3 MST panel designs. (a) Design 2. (b) Design 3. (c) Design 4.
Finally, the two best-functioning panels among the 1-3-3 MST panel designs were simplified into a 1-2-3 design to see whether the designs could be further streamlined without affecting measurement accuracy and precision. Figure 9 shows the two studied 1-2-3 MST panel designs. The difference between Figures 8 and 9 is that, compared with the corresponding 1-3-3 panel designs in Figure 8 (Designs 2 and 3), the 1-2-3 panel designs (Designs 5 and 6) had the Tier B folders at the second stage removed.

Figure 9. Studied 1-2-3 MST panel designs. (a) Design 5. (b) Design 6.
For all designs, the MI routing method was employed to route test-takers into different modules to maximize the measurement precision of the test, and test-takers’ responses were scored using the maximum likelihood estimator (MLE). The order in which the content language standards are administered in all six designs is the same as that in the operational MST ACCESS, namely, SIL–LoLA–LoMA–LoSS–LoSc (see Figures 7–9). The simulation process was completed via the R package mstR (Magis et al., 2017).
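To make the scoring step concrete, the following is a minimal Python sketch of maximum likelihood ability estimation under the Rasch model using Newton-Raphson iterations. It is not the mstR implementation used in the study; the item difficulties, the response pattern, and the function names are hypothetical, and all-correct or all-incorrect patterns (for which the MLE is not finite) are simply rejected here.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_theta(responses, difficulties, start=0.0, max_iter=50, tol=1e-6):
    """Newton-Raphson MLE of ability; returns (theta_hat, standard_error)."""
    score = sum(responses)
    if score == 0 or score == len(responses):
        raise ValueError("MLE is not finite for all-correct or all-incorrect patterns")
    theta = start
    for _ in range(max_iter):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = score - sum(probs)                    # first derivative of log-likelihood
        information = sum(p * (1.0 - p) for p in probs)  # negative second derivative
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    probs = [rasch_prob(theta, b) for b in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)

# Hypothetical three-item folder: two correct responses, one incorrect.
theta_hat, se = mle_theta([1, 1, 0], [-1.0, 0.0, 1.0])
print(round(theta_hat, 2), round(se, 2))
```

At the solution, the expected number-correct score equals the observed score, and the standard error is the inverse square root of the test information at the estimate, which is the quantity MI routing tries to make as small as possible.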
Data analysis
The results of the MST simulation were evaluated in terms of measurement accuracy (i.e., person ability parameter recovery) and measurement precision (i.e., reliability estimates). Measurement accuracy refers to how well the MST simulation estimates a test-taker’s true ability or trait, which was evaluated with reference to three overall statistics, namely, average signed bias (ASB), root mean square error (RMSE), and correlation between estimated and true theta. The ASB “is the mean difference between estimated and true ability levels” (Magis et al., 2017, p. 104), which indicates how much the MST tends to overestimate or underestimate the true abilities. RMSE “is the square root of the mean of the squared differences between estimated and true ability levels” (Magis et al., 2017, p. 104), which provides a measure of the typical error or discrepancy in ability estimates. The lower the ASB (in absolute value) and RMSE values are, the better the design. Pearson’s correlation coefficients between the estimated and true theta were also computed to assess how closely the MST’s estimates align with the true abilities of the test-takers. The higher the correlation coefficients are, the better the design.
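The three accuracy statistics follow directly from the definitions quoted above and can be sketched in Python as below. The function names and the toy ability values are hypothetical and serve only to illustrate the computations.

```python
from math import sqrt
from statistics import mean

def average_signed_bias(estimated, true):
    """Mean of (estimated - true); positive values indicate overestimation."""
    return mean([e - t for e, t in zip(estimated, true)])

def rmse(estimated, true):
    """Square root of the mean squared difference between estimated and true."""
    return sqrt(mean([(e - t) ** 2 for e, t in zip(estimated, true)]))

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Toy example: low abilities overestimated, high abilities underestimated.
true_theta = [0.0, 1.0, 2.0, 3.0]
est_theta = [0.5, 1.0, 2.0, 2.5]
print(average_signed_bias(est_theta, true_theta))  # 0.0
print(round(rmse(est_theta, true_theta), 3))       # 0.354
print(round(pearson_r(est_theta, true_theta), 3))  # 0.99
```

The toy data show why ASB alone is insufficient: overestimation at the low end and underestimation at the high end cancel out to a bias of zero, while RMSE still registers the errors.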
Measurement precision was evaluated in terms of reliability coefficients, which address whether the MST provides consistent results when measuring the same test-taker’s ability repeatedly. We used the formula provided by Thissen (2000), the same as that in Study 1. All the statistics were calculated within each replication for the 10,000 simulated test-takers, then averaged across all replications for all MST designs.
Results and discussion
Figure 10 presents the summary statistics of the six MST designs. The left panel displays the ASB values per true ability level and MST design, while the right panel displays the corresponding RMSE values. As shown in Figure 10, Designs 3 and 6 produced better overall results than the other four designs, as evidenced by their lowest ASB and RMSE values. Design 1 returned the worst result, with the highest ASB and RMSE values. This finding converges with previous research showing that more items at earlier stages tend to result in more stable and accurate ability estimates (Svetina et al., 2019). The finding that Design 6 ranked similarly to Design 3 indicates that simplifying the 1-3-3 design to a 1-2-3 design would not lose much measurement precision, which aligns with the previous finding that panel complexity did not affect the overall outcomes (Sari & Raborn, 2018).

Figure 10. ASB (left) and RMSE (right) values at each true ability level in six MST designs.
Despite slight differences, all six MST designs yielded a similar pattern. As shown in the left panel of Figure 10, there was slight overestimation for extremely low ability students and slight underestimation for extremely high ability students across all six designs. This seems to stand in contrast to the previous findings that, with the ML estimator, lower ability levels were underestimated and higher ability levels were overestimated (Lord, 1983, 1986; Magis et al., 2017). This may be attributed to the fact that the current item pool has insufficient items targeting the two extreme ability levels (i.e., [-4, -3], (5, 6)), thus resulting in a shrinkage of ability estimates at the lowest and highest ends. In all six MST designs, the smallest RMSEs were observed for ability levels within the range of (-1, 3), which is expected since the current item pool for Grade Clusters 6–8 has MI around an ability estimate of 2, and the prior distribution is centered around 2.24, with a standard deviation of 1.28, following the ability distribution in the operational population data.
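Conditional statistics of the kind plotted in Figure 10 can be obtained by grouping simulated test-takers by true ability before computing ASB and RMSE. The sketch below is illustrative only. The half-open bins [lower, lower + width), the function name `conditional_stats`, and the six ability values are assumptions, and the values were chosen merely to mimic the shrinkage pattern described above.

```python
import math
from collections import defaultdict

def conditional_stats(true_theta, est_theta, width=1.0):
    """Return {bin_lower_bound: (ASB, RMSE, n)} for each non-empty ability bin."""
    bins = defaultdict(list)
    for t, e in zip(true_theta, est_theta):
        # Assign each test-taker to the bin [lower, lower + width).
        lower = math.floor(t / width) * width
        bins[lower].append(e - t)
    stats = {}
    for lower in sorted(bins):
        errs = bins[lower]
        n = len(errs)
        asb = sum(errs) / n
        rmse = math.sqrt(sum(d * d for d in errs) / n)
        stats[lower] = (asb, rmse, n)
    return stats

# Hypothetical values: estimates pulled toward the middle at both extremes.
true_theta = [-3.5, -3.2, 1.1, 1.8, 5.4, 5.6]
est_theta = [-3.0, -2.9, 1.0, 1.9, 5.0, 5.1]
stats = conditional_stats(true_theta, est_theta)
```

With these values, the lowest bin has a positive ASB (overestimation), the highest bin has a negative ASB (underestimation), and the middle bin has the smallest RMSE.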
The correlation coefficients between true and estimated ability in the six MST designs are generally high (see Table 4), ranging from 0.954 to 0.972, indicating rather coherent final estimates of all the six designs. Most notably, the highest correlations were obtained for Designs 3 and 6, and the lowest for Design 1 (i.e., the baseline design).
Table 4. Correlation coefficients between true and estimated ability and reliability coefficients in six MST designs.
Nonetheless, Design 1 produced the highest reliability coefficient (see Table 4). This is expected as Design 1 has the most adaptive points and thus the highest possibility of administering the best fitting folder to test-takers, leading to the best measurement precision. This result supports MacGregor et al.’s (2022) claim that information can be maximized with more adaptive points. The reliability of Design 1 also exceeded that in the operational test (i.e., 0.82). This is also anticipated as the MI routing method was used in Design 1, which selected the best fitting folder with minimal measurement errors (Hendrickson, 2007; Yan et al., 2014), whereas in the operational test, folders were selected by comparing the ability estimates with predetermined difficulty cuts.
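The idea behind MI routing, selecting the folder that provides the most information at the current ability estimate, can be sketched as follows. The sketch assumes a Rasch model, under which item information is P(1 - P). The three folders, their item difficulties, and the function names are hypothetical and do not come from the ACCESS item pool.

```python
import math

def rasch_item_information(theta, b):
    """Fisher information of a Rasch item with difficulty b at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def folder_information(theta, difficulties):
    """Information of a folder is the sum of its item informations."""
    return sum(rasch_item_information(theta, b) for b in difficulties)

def route_by_max_information(theta_hat, folders):
    """Return the name of the folder with the largest information at theta_hat."""
    return max(folders, key=lambda name: folder_information(theta_hat, folders[name]))

# Hypothetical three-item folders (item difficulties in logits).
folders = {
    "easy": [-1.0, -0.5, 0.0],
    "medium": [1.0, 1.5, 2.0],
    "hard": [3.0, 3.5, 4.0],
}
chosen = route_by_max_information(1.4, folders)
```

Unlike routing by predetermined difficulty cuts, this rule compares the candidate folders directly on the information they provide at the interim estimate, so the selection depends on the item parameters rather than on fixed thresholds.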
When taking into account both measurement accuracy and measurement precision, the findings indicate that no one MST design consistently outperformed all the others, which is in line with previous research findings that no single design worked best across all conditions (Svetina et al., 2019). However, the fact that all six designs produced reliability coefficients at the same level as, or slightly higher than, that of the current MST design indicates that all six designs are viable solutions, although not all of them may be statistically higher than the original MST design. Considering that Design 3 yielded the smallest ASB and RMSE values, the second highest correlation coefficient between true and estimated ability levels, and the second highest reliability coefficient among the six MST designs, we recommend it as the optimal design.
In sum, the findings of the study are that (1) in terms of the mean person ability parameter recovery, Designs 3 and 6 outperformed other designs, but differences in ASB and RMSE were generally small, and correlation coefficients were similar among all six designs, and (2) in terms of reliability estimates, Design 1 outperformed all other designs, but differences were quite small among all six designs. The findings indicate that the six different MST designs are all viable alternatives to the current MST ACCESS design, with Design 3 (i.e., 1-3-3 with six items at the first stage, nine items at the second stage, and three items at the third stage) being the optimal one.
Conclusion
The current study is one of the first to evaluate MST designs in language assessment. The results from the operational test data suggest that the test length for both the listening and reading tests of MST ACCESS could be shortened to six folders (i.e., 18 items), with final ability estimates and reliability coefficients comparable to, though slightly lower than, those of the original test. The findings from simulated test data indicate that changing the routing method from the current predetermined cut point method to the MI method could enhance the overall test reliability. In addition, bundling some of the items targeting different content language standards into the same module at the early stage (e.g., Design 3) would result in improvement in measurement accuracy and efficiency of the MST.
This study has some implications for the optimization of MST ACCESS. While MacGregor et al. (2022) showed that the MST design enhanced the psychometric properties of the paper-and-pencil version of ACCESS, this study revealed the possibility of optimizing MST ACCESS by evaluating the preconfigured MST design itself. This study revealed that 18 items with a sensible design may provide sufficient information for decision-making, reducing the cognitive demand on young language learners. The decision on the exact number of items to be included, of course, needs to be made by ACCESS test developers, based on the minimum level of measurement precision desired at the overall level and each PL level, and in consideration of its influence on diverse groups of test-takers (e.g., gender, ethnicity) and local educational policies in different states. In addition, the six proposed MST designs provide alternative ways to upgrade the current ACCESS MST system. All of them simplify the design, as they shorten the test length, reduce the number of stages, and use the MI method to maximize measurement efficiency. However, changing an MST design on a large scale is a challenging task. We would recommend that more systematic studies be conducted before making the final call on upgrading the current MST design for ACCESS.
This study holds broader implications for MST research and practice in several ways. First, the MST design in the present study encompasses multiple stages to minimize routing errors with more adaptive points. Our study showed that having more items in the earlier stages of the test may lead to better measurement accuracy and efficiency than the current MST design, in addition to allowing test-takers more flexibility to review items and change responses than the current MST design. These findings highlight the need to make full use of the advantage of MST design, namely, being adaptive at the module level instead of item level. Second, the routing decisions in the current MST are made based on predetermined difficulty of folders with no reference to the empirical values, whereas the proposed MST designs employ the MI method to find folders that maximize the test information for each test-taker, which leads to higher measurement precision. These findings offer valuable insights for practitioners seeking to develop and implement new MST designs or enhance existing ones. Third, the study serves as one of the earliest examples in language assessment to comprehensively evaluate the performance of MST, highlighting the crucial importance of exploring the optimal preconfigured design for MST, which differs from CAT evaluation.
Notwithstanding the above points, there are several issues that warrant further attention. First, the study only focused on the effect of possible removal of folders on the change of test-takers’ scale scores and test reliability. Future MST evaluation studies are needed to examine how such change might influence the classification accuracy of test-takers, an important psychometric property when MST is used for classification purposes (Lim et al., 2021). Such investigation is especially important for ACCESS test-takers at the higher end of the ability continuum, such as PL5, which is the exit requirement of ELL programs in many states. Second, the simulation study only focused on the listening test of MST ACCESS for Grade Clusters 6–8, using item parameter estimates in the 2018–2019 administration as true item parameters and the ability distribution of the operational population data as true ability parameters. Future studies could cross-validate the findings of this study by conducting multiple simulations across grade clusters and across both listening and reading tests. The results need to be more systematically studied to reach a conclusion regarding the optimal design for MST ACCESS listening and reading tests. Third, the study only focused on evaluating the MST design at the overall level with indices of measurement accuracy and precision. More breakdown analysis at the path level with other psychometric properties, such as the test characteristic curve (TCC) and test information function (TIF), would yield more information on the performance of the MST designs. Such an investigation would also provide insights into how to assemble the best folders for test-takers at different ability levels. For example, future research could examine each panel’s TCC and TIF to find the best item difficulty levels per module to reach maximum measurement precision per path.
Fourth, different gender and ethnic groups were found to have slightly different performance in the ACCESS test. Further research is warranted to understand the underlying reasons for these differences by employing qualitative content analysis. Such investigation would help ensure that the test content remains unbiased across all subgroups of test-takers, and in particular, that any potential shortening of the test does not yield adverse consequences for diverse groups of individuals.
Supplemental Material
Supplemental material, sj-pdf-1-ltj-10.1177_02655322231225426 for A shortened test is feasible: Evaluating a large-scale multistage adaptive English language assessment by Shangchao Min and Kyoungwon Bishop in Language Testing
Footnotes
Author contributions
Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The second author is a lead psychometrician at WIDA, who oversees the psychometric quality of WIDA tests, including ACCESS for ELLs.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article (Figures S1 and S2) is available online at the following link: sj-pdf-1-ltj-10.1177_02655322231225426.pdf.
Notes
References
