Abstract
A classification method is presented for adaptive classification testing with a multidimensional item response theory (IRT) model in which items are intended to measure multiple traits, that is, within-dimensionality. The reference composite is used with the sequential probability ratio test (SPRT) to make decisions and decide whether testing can be stopped before reaching the maximum test length. Item-selection methods are provided that maximize the determinant of the information matrix at the cutoff point or at the projected ability estimate. A simulation study illustrates the efficiency and effectiveness of the classification method. Simulations were run with the new item-selection methods, random item selection, and maximization of the determinant of the information matrix at the ability estimate. The study also showed that the SPRT with multidimensional IRT has the same characteristics as the SPRT with unidimensional IRT and results in more accurate classifications than the latter when used for multidimensional data.
Computerized adaptive testing (CAT) estimates ability precisely or makes accurate classification decisions while minimizing test length. Much is known about unidimensional CAT (UCAT), and several classification methods are available (Eggen, 1999; Spray, 1993; Weiss & Kingsbury, 1984). However, knowledge about multidimensional CAT (MCAT) is still expanding, and classification methods are available only for some situations.
Seitz and Frey (2013) developed a multidimensional classification method that makes a decision for each dimension for items that are assumed to measure only one trait. Spray, Abdel-Fattah, Huang, and Lau (1997) investigated classification testing for items that are assumed to measure multiple traits and concluded that this was not feasible. In the present study, a new method was developed to make decisions for items that measure multiple traits. The advantages of making multidimensional classification decisions are that the multidimensional structure of the data is respected, adaptive testing principles can be used, and test length is reduced even more than in MCAT for estimating ability.
Item response theory (IRT), which is often used for CAT, is discussed in the “Multidimensional Item Response Theory” section of this article. IRT relates the score on an item to the item parameters and the examinee’s ability (van der Linden & Hambleton, 1997). In multidimensional IRT (MIRT), multiple person abilities describe the skills and knowledge the person brings to the test (Reckase, 2009). Classification methods are then discussed. These methods decide whether testing can be finished and which decision is made about the examinee’s level (e.g., insufficient/sufficient). A new classification method for MCAT is proposed. Item-selection methods are discussed in the “Item-Selection Methods” section of this article. These methods select the items based on a statistical criterion or on the examinee’s responses to previously administered items. New item-selection methods are proposed for multidimensional computerized classification testing (MCCT). The efficiency and effectiveness of the new classification and selection methods are shown using simulations in the “Simulation Study” section. In the “Discussion and Conclusion” section, remarks are made about MCCT and directions for future research are given.
Multidimensional Item Response Theory
CAT requires a calibrated item pool suitable for the specific test for which the model fit is established, item parameter estimates are available, and items with undesired characteristics are excluded (van Groen, Eggen, & Veldkamp, 2014a). MIRT assumes that a set of
The likelihood of a set of observed responses
where
Classification Methods
Classification methods determine whether testing can be stopped before the maximum test length is reached and which decision is made (van Groen et al., 2014a). The existing literature about classification methods for MCAT is described, and then a new classification method is proposed.
Existing Multidimensional Classification Methods
Two studies about making classification decisions using MIRT exist. These studies concern MCAT with multiple unidimensional decisions for between-dimensionality (Seitz & Frey, 2013) and the use of the sequential probability ratio test (SPRT) for within-dimensionality (Spray et al., 1997).
MCAT for between-dimensionality
Seitz and Frey (2013) used the SPRT to make multiple unidimensional decisions using the fact that, for between-dimensionality, the multidimensional two-parameter logistic model is a combination of UIRT models (W.-C. Wang & Chen, 2004). Seitz and Frey implemented the SPRT for each dimension. The SPRT (Wald, 1947/1973) was applied to classification testing using IRT by Reckase (1983). A cutoff point is set for the SPRT between adjacent levels with a surrounding indifference region. The region accounts for the uncertainty of the decisions, owing to measurement error, for examinees with an ability close to the cutoff point (Eggen, 1999). Two hypotheses are formulated for the cutoff point,
in which
in which
where
Seitz and Frey (2013) implemented the SPRT by setting cut scores,
in which
If the items load on multiple dimensions, Seitz and Frey’s (2013) method cannot be used because the ratio does not reduce to Equation 8. Furthermore, the method requires an additional decision rule if a decision on all or a set of dimensions is to be obtained. This implies that Seitz and Frey’s method can be used only for between-dimensional tests with no decisions based on multiple or all dimensions.
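The unidimensional SPRT decision described above can be illustrated with a minimal Python sketch. It is not the implementation used by Seitz and Frey (2013). The sketch assumes a two-parameter logistic model in discrimination/difficulty form, Wald's bounds on the log-likelihood ratio, and a forced decision at the maximum test length based on the midpoint of the two bounds. The function names (`p_2pl`, `log_likelihood`, `sprt_decision`) and the parameter names are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of dichotomous responses at ability theta."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_2pl(theta, a, b)
        total += x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total


def sprt_decision(items, responses, cutoff, delta, alpha, beta,
                  at_max_length=False):
    """Return 'above', 'below', or 'continue' for one examinee.

    The ratio compares the upper bound of the indifference region
    (cutoff + delta) with its lower bound (cutoff - delta).
    """
    log_lr = (log_likelihood(cutoff + delta, items, responses)
              - log_likelihood(cutoff - delta, items, responses))
    upper = math.log((1.0 - beta) / alpha)   # accept "above the cutoff"
    lower = math.log(beta / (1.0 - alpha))   # accept "below the cutoff"
    if log_lr >= upper:
        return "above"
    if log_lr <= lower:
        return "below"
    if at_max_length:
        # Assumed forced-decision rule: compare with the midpoint of the bounds.
        return "above" if log_lr >= (upper + lower) / 2.0 else "below"
    return "continue"
```

In a between-dimensional test, a rule of this kind would be evaluated once per dimension, using only the items that load on that dimension.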
MCAT for within-dimensionality
Spray et al. (1997) investigated the possibility of using the SPRT for MCAT. They specified a passing rate on a reference test with a standard setting method and obtained an equivalent latent passing score by solving for
A Classification Method for Within-Dimensionality
Because the SPRT requires unique values for updating the ratio, a method should be developed that results in unique values if the SPRT is to be applied. The reference composite (RC; Reckase, 2009; M. Wang, 1985, 1986) reduces the multidimensional space to a unidimensional line. By using the RC, the likelihood ratio can be updated with unique values after an extra item is administered.
RC
The RC relates the multidimensional abilities to a unidimensional line in the multidimensional space (Reckase, 2009). This line describes the characteristics of the discrimination parameter matrix for the item set. All
and the direction cosines for the line are calculated using (Reckase, 2009)
in which
Multidimensional decision making using the RC
Using the RC, abilities can be ranked on a unidimensional line. The position of the RC is fixed before administration based on all items in the item pool. By fixing the RC, ability is measured on the same scale for all examinees, and cutoff points can be set.
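As an illustration of how a fixed RC can be obtained and used, the following Python sketch takes the direction of the RC to be the eigenvector of a′a that belongs to the largest eigenvalue, which is the characterization given by Reckase (2009). Projecting an ability vector on the RC and mapping a point on the RC back to the multidimensional space are written as simple products with the direction cosines. The function names and the small discrimination matrix in the test values are hypothetical; the sketch makes no claim to reproduce the study's code.

```python
import numpy as np


def reference_composite(a_matrix):
    """Direction cosines of the RC for an (items x dimensions) matrix.

    Assumes the RC direction is the eigenvector of a'a with the largest
    eigenvalue; the sign is fixed so that the cosines are positive.
    """
    a = np.asarray(a_matrix, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(a.T @ a)
    direction = eigenvectors[:, np.argmax(eigenvalues)]
    if direction.sum() < 0:
        direction = -direction
    return direction / np.linalg.norm(direction)


def project_on_rc(theta, direction):
    """Position on the RC of the orthogonal projection of theta."""
    return float(np.dot(np.asarray(theta, dtype=float), direction))


def rc_point_to_theta(rc_value, direction):
    """Coordinates in the multidimensional space of a point on the RC."""
    return rc_value * np.asarray(direction, dtype=float)


def rc_angles_in_degrees(direction):
    """Angles between the RC and each dimension axis."""
    return np.degrees(np.arccos(direction))
```

A cutoff point and the bounds of its indifference region, all defined on the RC, could be passed through `rc_point_to_theta` to obtain the multidimensional points at which the likelihoods for the SPRT are evaluated.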
The SPRT requires specifying a cutoff point,
where
which can be used to make classification decisions with the following decision rules:
Item-Selection Methods
Selecting the correct items is important, because items that are too hard or too easy or provide little information result in tests that do not function well (Reckase, 2009). Several methods are available for MCAT (e.g., Luecht, 1996; Reckase, 2009; Segall, 1996) and for unidimensional computerized classification testing (UCCT; for example, Eggen, 1999; Spray & Reckase, 1994). However, item-selection methods for MCCT are scarce. Seitz and Frey (2013) selected items using Segall’s (1996) method for MCAT for estimating ability. This method is discussed in the next section. Item-selection methods for UCCT are described, and then these methods are adapted for MCCT using Segall’s method.
An Item-Selection Method for MCAT for Ability Estimation
The method that maximizes the determinant of the Fisher information matrix was developed for MCAT to estimate ability (Segall, 1996). This matrix is a measure of the information in the observable variables on the ability parameters (Mulder & van der Linden, 2009). The elements of
Segall’s (1996) method is based on the relationship between the information matrix and the estimates’ confidence ellipsoid (Reckase, 2009). The method selects the item that results in the largest decrement in the volume of the confidence ellipsoid (Segall, 1996). As the size of the confidence ellipsoid can be approximated by the inverse of the information matrix, the item is selected that maximizes (Segall, 1996)
which is the determinant of the information matrix of the administered items and the potential item
Item-Selection Methods for UCAT for Classification Testing
In UCCT, two methods are commonly used in addition to random selection. The first method maximizes Fisher information at the ability estimate by minimizing the confidence interval around the ability estimate using
where
In unidimensional settings with the SPRT, maximizing information at the cutoff point is considered the most efficient (Eggen, 1999; Spray & Reckase, 1994).
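The two information-based unidimensional methods differ only in the point at which information is evaluated. The sketch below assumes a 2PL model, for which the Fisher information of an item is a²P(1 − P), and selects the unadministered item with the largest information at a given point. Passing the cutoff point gives selection at the cutoff, and passing the current ability estimate gives selection at the estimate. The names `fisher_info_2pl` and `select_item` and the pool layout are hypothetical.

```python
import math


def fisher_info_2pl(theta, a, b):
    """Fisher information of one item at theta under an assumed 2PL model."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def select_item(pool, administered, point):
    """Index of the unadministered item with maximum information at `point`.

    pool: list of (a, b) tuples; administered: set of item indices.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info_2pl(point, *pool[i]))
```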
Item-Selection Methods for MCAT for Classification Testing
Segall’s (1996) method selects the item with the largest determinant of the information matrix at the ability estimate. This method can also be used for MCCT and will be referred to as the method that maximizes the determinant of the information matrix at the ability estimate. The method is adapted to select items that maximize the determinant at a point on the RC, analogous to the methods for UCCT.
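A determinant-based selection rule of this kind can be sketched in Python as follows. The sketch assumes a compensatory multidimensional two-parameter logistic model with discrimination vector a and easiness parameter d, for which the information matrix of one item is P(1 − P)aa′. The evaluation point is an argument, so the same function can be called with the ability estimate or with a point on the RC expressed in multidimensional coordinates. The optional `prior_precision` matrix is an added assumption that keeps the determinant informative when few items have been administered; all names are hypothetical.

```python
import numpy as np


def item_information(point, a, d):
    """Information matrix of one item at `point`: P(1 - P) * a a'."""
    a = np.asarray(a, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(a @ point + d)))
    return p * (1.0 - p) * np.outer(a, a)


def select_item_by_determinant(a_matrix, d_vector, administered, point,
                               prior_precision=None):
    """Index of the candidate item that maximizes the determinant of the
    information matrix of the administered items plus that candidate."""
    a_matrix = np.asarray(a_matrix, dtype=float)
    point = np.asarray(point, dtype=float)
    n_dim = a_matrix.shape[1]
    if prior_precision is None:
        info = np.zeros((n_dim, n_dim))
    else:
        info = np.array(prior_precision, dtype=float)
    for i in administered:
        info = info + item_information(point, a_matrix[i], d_vector[i])

    best_item, best_det = None, -np.inf
    for i in range(a_matrix.shape[0]):
        if i in administered:
            continue
        candidate = info + item_information(point, a_matrix[i], d_vector[i])
        det = np.linalg.det(candidate)
        if det > best_det:
            best_item, best_det = i, det
    return best_item
```

Without a prior term, the determinant is zero for every candidate until the administered items span all dimensions, so in a two-dimensional test the first item would have to be chosen by another rule.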
The first new item-selection method for MCCT maximizes the determinant of the information matrix at the projected ability estimate. The rationale is that interest is limited to the points that fall on the RC and does not extend to the other points in the multidimensional space. The ability is estimated using weighted maximum likelihood (WML) estimation (see the appendix). The estimate can be projected on the RC using Equation 11. To calculate
The second new item-selection method for MCCT maximizes the determinant of the information matrix at the cutoff point on the RC. This value is on the RC but has to be transformed to the multidimensional
The resulting objective function is
Simulation Study
The effectiveness and the efficiency of the classification and item-selection methods were investigated using simulations. The results with MCCT were evaluated on well-known characteristics of the unidimensional SPRT. A well-known characteristic of the unidimensional SPRT is that increasing
Simulation Design
An item pool from the ACT Assessment Program, which was used by Ackerman (1994) and Veldkamp and van der Linden (2002), was used to evaluate MCCT. The item pool consisted of 180 items, previously calibrated with a two-dimensional compensatory IRT model with within-dimensionality using NOHARM II (Fraser & McDonald, 1988). The fit of the MIRT model was established (Veldkamp & van der Linden, 2002). The means of the discrimination parameters were 0.422 and 0.454 with standard deviations 0.268 and 0.198. The observed correlation between the parameters was .093, which is explained by the orthogonal constraint in the calibration. The mean of the easiness parameter was −0.118 with a standard deviation of 0.568. The matrix of the discrimination parameters resulted in angles between the Dimension Axes 1 and 2 with the RC of 44.621 and 45.379 degrees.
Simulations were run for four item-selection methods: random selection (RA) and maximization of information at the cutoff point (CP), the projected ability estimate (PA), and the ability estimate (AE). The maximum test length was set at 50 items, following Veldkamp and van der Linden (2002). The acceptable decision error rates
A well-known characteristic of the unidimensional SPRT is that as ability becomes closer to the cutoff point, the test length increases (Eggen & Straetmans, 2000), and the proportion of correct decisions (PCD) nears 0.5 (van Groen & Verschoor, 2010). Additional simulations were run to investigate the effect of the distance between ability and the cutoff point. This study used 372,100 simulees: 100 at each of 61 evenly spaced points on
The classifications using multidimensional and unidimensional IRT were compared in a third simulation series. Although a two-dimensional model was required for model fit, which implied the use of MCCT, a comparison was made with UCCT. One hundred thousand simulees were generated using a multivariate standard-normal distribution with
The classifications with multidimensional and unidimensional IRT were compared. Unidimensional item parameters were obtained for the generated multidimensional data set using BILOG. The cutoff point and
Dependent Variables
The efficiency of MCCT was evaluated with the average test length (ATL), which was calculated per condition as the mean test length over 100 replications of 1,000 simulees each. Although reducing the test length reduces respondent burden, test development costs, and test administration costs, effectiveness was considered more important. Effectiveness was investigated using the PCD, which was calculated per condition as the mean of the PCD for each simulation over 100 replications. The PCD compared the true classification, based on the true proficiency, with the decision made by the SPRT. The PCD for UCCT compared the true classifications based on the proficiency on the RC with the observed classifications.
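The two dependent variables can be computed as in the following sketch, which assumes that the simulation output is stored as arrays with one row per replication and one column per simulee. True classifications are derived here from the true proficiency on the RC, and a simulee whose proficiency equals the cutoff is counted as above it; both choices, like the function names, are assumptions made for illustration.

```python
import numpy as np


def average_test_length(test_lengths):
    """Mean over replications of the mean test length per replication."""
    lengths = np.asarray(test_lengths, dtype=float)
    return float(lengths.mean(axis=1).mean())


def proportion_correct_decisions(true_rc_ability, decided_above, cutoff):
    """Mean over replications of the proportion of correct decisions.

    true_rc_ability: true proficiency on the RC, shape (replications, simulees)
    decided_above:   boolean SPRT decisions with the same shape
    """
    true_above = np.asarray(true_rc_ability, dtype=float) >= cutoff
    decisions = np.asarray(decided_above, dtype=bool)
    return float((true_above == decisions).mean(axis=1).mean())
```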
Simulation Results
Table 1 presents the ATL for different SPRT settings and the four selection methods. The performance of RA, AE, PA, and CP was evaluated. RA resulted in the highest ATL. CP resulted in the lowest ATL. An increase in
Table 1. Average Test Length for Different SPRT Settings and Item-Selection Methods.
Note. SPRT = sequential probability ratio test;
The effectiveness of the classification method is shown in Table 2. The PCD is given for simulations with different SPRT settings and four item-selection methods. RA was the least accurate method. The PCD was lower for the simulations with
Table 2. Proportion of Correct Decisions for Different SPRT Settings and Item-Selection Methods.
Note. SPRT = sequential probability ratio test;
Simulations were run to investigate whether the ATL and the PCD depended in the same way on the distance between ability and the cutoff point as in UCAT. In Figure 1, the ATL and the PCD are shown for different combinations of ability. The ATL increased considerably when the projection of the ability on the RC was close to the cutoff point and the PCDs decreased considerably and became close to 0.50 or lower.

Figure 1. Average test length and proportion of correct decisions with maximization at the cutoff point.
The ATL is shown in Table 3 for simulations in which classifications with UCAT and MCAT were compared for tests with a flexible test length. The ATL for the UCAT simulations was often lower than for the MCAT simulations. RA resulted in the highest ATL and CP in the lowest ATL. As shown in Table 4, the shorter UCAT tests were accompanied by a lower PCD than for MCAT. The decisions with MCAT and an information-based item-selection approach resulted in 5% higher accuracy than in UCAT. MCAT combined with CP resulted in the most accurate decisions followed by AE. RA resulted in 3% less accurate decisions for MCAT. In contrast, RA resulted in the most accurate decisions for UCAT, followed by CP.
Table 3. Average Test Length for Different SPRT Settings for UCCT and MCCT.
Note. Simulations for classifications with UIRT and MIRT with a flexible test length. SPRT = sequential probability ratio test; UCCT = unidimensional computerized classification testing; MCCT = multidimensional computerized classification testing; MIRT = multidimensional item response theory; UIRT = unidimensional item response theory;
Table 4. Proportion of Correct Decisions for Different SPRT Settings for UCCT and MCCT.
Note. Simulations for classifications with UIRT and MIRT with a flexible test length. SPRT = sequential probability ratio test; UCCT = unidimensional computerized classification testing; MCCT = multidimensional computerized classification testing; MIRT = multidimensional item response theory; UIRT = unidimensional item response theory;
Discussion of the Results
The main aim of the simulations was to investigate whether typical SPRT characteristics for UCAT also applied to the SPRT for MCAT. An increase of
The simulation results were in line with previous unidimensional findings by Spray and Reckase (1994), Eggen (1999), and Thompson (2009), in which item selection by CP was the most efficient. In MCAT, selecting items at the cutoff point on the RC resulted in the shortest tests. As expected, the other methods also outperformed RA.
In the third series, the SPRT for MCAT was compared with the SPRT for UCAT. It might be unexpected that the SPRT resulted, on average, in shorter tests in UCAT. This can be explained by the simpler structure of the likelihoods that are used for the SPRT. The CP resulted in the shortest tests for UCAT and MCAT. Although a reduced test length has a practical value, accuracy is often considered to be more important. MCAT resulted in more accurate decisions than UCAT. For MCAT, the CP resulted, as expected, in the most accurate decisions followed by the AE. Surprisingly, RA resulted in the most accurate classification decisions with the SPRT for UCAT. This is probably the result of optimization at incorrect points on the scale by the information-based methods or the reduced test length. Given the importance of making accurate decisions, if an MIRT model improves model fit for a specific data set, these item parameters should be used to make classifications instead of unidimensional parameters.
Discussion and Conclusion
A classification method was developed to make classification decisions in tests with items that are intended to measure multiple traits. The method can be used in testing situations in which the construct of interest is modeled using an MIRT model. An RC is constructed in the multidimensional space and is used to make classification decisions with the SPRT.
Segall’s (1996) item-selection method was adapted to select items that had the largest determinant of the information matrix at either the cutoff point or the current projected ability estimate. The methods use the
For item-selection methods that use an ability estimate, WML estimation was used. WML estimates (Tam, 1992) have a smaller bias than ML estimates. The Newton–Raphson method was used to find the estimates (see the appendix).
Simulations were used to investigate the ATL, the PCD, and the characteristics of the classification method. The efficiency and the accuracy were compared for different item-selection methods and different settings for the classification method. Independent of the settings for the SPRT, the classification method resulted in accurate decisions.
The differences in efficiency and effectiveness between the item-selection methods were small. The settings of the classification method had more influence on the ATL than on the PCD. Tests could be shortened considerably without much effect on the accuracy of the decisions. It was shown that the new classification method had the same characteristics as the unidimensional SPRT: When the projection of ability on the RC comes close to the cutoff point, test length increases, and the PCD nears 0.5. The settings of the new SPRT had the same influence as in unidimensional IRT.
When compared with the SPRT with unidimensional IRT, the SPRT with MIRT resulted in longer tests but more accurate decisions. Given the importance of making accurate classification decisions, the SPRT should be used with MIRT when model fit for the data set is improved by MIRT.
Future Directions and Further Remarks
If the items load on one dimension, the new classification method cannot be used. If each item measures just one dimension, the non-diagonal elements of the
Simulations were run with an item pool that was calibrated with a two-dimensional model. The classification method can be applied to models with additional dimensions. A fixed test length can also be used.
Decisions were made based on the total set of items administered. Reckase (2009) showed that RCs can be constructed for underlying domains as well. Investigating whether it is possible to classify on these domains as well would be interesting. Such classifications can provide information regarding the level of the examinees for the underlying domains.
The current version of the SPRT is used to classify into one of two levels. It is expected that the method can be adapted to classify examinees into one of multiple levels, such as basic, proficient, and advanced.
The simulations used an item bank in which the dimensions were restricted to be orthogonal to each other. The SPRT can also be used if orthogonality is not assumed. The effects of fitting an orthogonal model and a non-orthogonal model to the same data set should be investigated, and the best fitting model should be used.
A WML estimator was used in the current study. The effectiveness and efficiency of the estimator have not been studied extensively and should be compared with those of other estimators. Researchers who use this estimator in other studies should investigate whether it is appropriate for their study.
In testing programs, constraints have to be met for the test content, and attention has to be paid to item exposure. The effects of content and exposure control should be investigated before the classification method for within-dimensionality is applied in actual testing programs.
Appendix
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
