Abstract
Although multidimensional adaptive testing (MAT) has been proven to be highly advantageous with regard to measurement efficiency when several highly correlated dimensions are measured, there are few operational assessments that use MAT. This may be due to issues of constraint management, which is more complex in MAT than it is in unidimensional adaptive testing. Very few studies have examined the performance of existing constraint management methods (CMMs) in MAT. The present article focuses on the effectiveness of two promising heuristic CMMs in MAT for varying levels of imposed constraints and for various correlations between the measured dimensions. Through a simulation study, the multidimensional maximum priority index (MMPI) and multidimensional weighted penalty model (MWPM), as an extension of the weighted penalty model, are examined with regard to measurement precision and constraint violations. The results show that both CMMs are capable of addressing complex constraints in MAT. However, measurement precision losses were found to differ between the MMPI and MWPM. While the MMPI appears to be more suitable for use in assessment situations involving few to a moderate number of constraints, the MWPM should be used when numerous constraints are involved.
Computerized adaptive testing (CAT) is an approach that is used to measure person characteristics (van der Linden & Glas, 2010), whereby the item selection in CAT is based on the information acquired from responses to previously administered items. The approach has been proven to substantially increase measurement efficiency relative to linear tests involving a fixed number of items (Segall, 2005; W.-C. Wang, Chen, & Cheng, 2004). For this reason, over the last decade, the relevance of CAT has increased considerably and it is now used in numerous fields (e.g., educational assessment and psychological testing). In order to assess multiple latent traits simultaneously, CAT has been generalized to multidimensional adaptive testing (MAT; e.g., Frey & Seitz, 2009; Segall, 1996). In multiple simulation studies (Liu, 2007; Segall, 1996; W.-C. Wang & Chen, 2004; Yao, 2010), MAT was found to be more efficient for correlated traits than unidimensional CAT. Nevertheless, only a few operational assessments have employed MAT (e.g., Mulcahey, Haley, Duffy, Pengsheng, & Betz, 2008). Even in the field of large-scale assessments, in which dimensions are often highly correlated, there is currently no operational assessment for which MAT is used. Possible reasons for this limited use of MAT include the fact that the management of test specifications in MAT is more complex than in unidimensional adaptive testing and the fact that only a few studies have addressed the management of test specifications in the multidimensional case (Frey, Cheng, & Seitz, 2011; Su, 2015; Su & Huang, 2015; Veldkamp & van der Linden, 2002; Yao, 2014).
Stocking and Swanson (1993) described test specifications as rules governing the assembly of tests, whereby these rules are related to one or more item or test properties. Test specifications can be, for example, the proportion of administered items on a specific topic, the test length, or the total testing time. These specifications can be formulated as constraints or as objective functions that must be considered during the item selection process (van der Linden, 2005b). For standardized testing programs in particular, various test forms (which are typically needed due to the presence of large item pools) must be comparable with regard to a predefined set of test specifications. For linear tests, it is possible to assemble various test forms before administering tests. However, in adaptive testing, test specifications must be fulfilled over the course of a test. This challenging task can be addressed by modifying the item selection algorithm of an adaptive test to simultaneously consider statistical optimality criteria and the required test specifications. According to the literature, multiple constraint management methods (CMMs) can account for test specifications during the CAT item selection process.
He, Diao, and Hauser (2014) gave a brief overview of the existing CMMs and differentiated between two types. The first type of CMM, such as the constrained CAT method (Kingsbury & Zara, 1991) and the modified multinomial model (Chen & Ankenman, 2004), can only address mutually exclusive constraints. The weighted deviation model (WDM; Stocking & Swanson, 1993), shadow test approach (STA; van der Linden & Reese, 1998), weighted penalty model (WPM; Shin, Chien, Way, & Swanson, 2009), and maximum priority index method (MPI; Cheng & Chang, 2009) belong to the second group of CMMs, which are also capable of addressing complex sets of constraints. The present study focuses on the second group of CMMs, as they are more flexible and can therefore be used to address a broad variety of constraint management problems. The main difference between approaches of this type pertains to the ways in which future item selection consequence projections are incorporated into the item selection process (He et al., 2014). The STA is a very flexible approach that has been proven to be successful in the management of multiple constraints for unidimensional (van der Linden & Reese, 1998) and multidimensional adaptive testing (Veldkamp & van der Linden, 2002). However, its use requires considerable knowledge of linear programming, and solver software must be available. For practitioners, selecting solver software can be challenging, as multiple issues must be considered with regard to specific test assembly problems (Donoghue, 2014). Such issues relate to the frequency of software program use, the size of the problem considered (e.g., the number of items and constraints), one’s programming experience, and the financial resources available for purchasing licenses. Freeware such as lpSolveAPI could be an attractive alternative to commercial solvers. Diao and van der Linden (2011) demonstrated its capacity to carry out CAT with the STA for a smaller number of constraints. However, the authors argue that the performance of this software must be evaluated on a case-by-case basis.
Against this background, heuristic CMMs (e.g., the WDM, WPM, and MPI) are of particular interest to practitioners, as the requirements for their implementation are comparatively low. Nonetheless, there are still considerable differences between the performance and maintenance of heuristic CMMs. In a study by He et al. (2014), the performance of the STA and that of the three heuristic CMMs (WDM, WPM, MPI) were compared. With regard to measurement precision, no significant differences were found between the heuristic CMMs. However, with regard to how well imposed constraints were met, the WPM outperformed the other heuristic methods. Furthermore, the MPI was described as the most “low maintenance,” and the WPM was described as the most “high maintenance” method. Unfortunately, few results have been recorded with regard to the performance of heuristic CMMs for the multidimensional case (Su, 2015; Yao, 2014).
The present study addresses this issue. According to He et al.’s (2014) results, the MPI and WPM are very promising candidates for constraint management in MAT. For the MPI, a multidimensional extension already exists. It is called the multidimensional maximum priority index (MMPI) and was presented by Frey et al. (2011). The WPM, however, has not yet been extended to the multidimensional case. Therefore, the first objective of the present study is to render the WPM applicable in MAT.
As the size of test assembly problems is a crucial issue, it is important to determine whether all methods are equally well suited to a particular number of constraints. While numerous studies have addressed the performance of CMMs (Cheng & Chang, 2009; Cheng, Chang, Douglas, & Guo, 2008; He et al., 2014; Shin et al., 2009; Su, 2015; van der Linden, 2005a), no existing results detail the relationship between performance and the number of constraints. For this reason, the second objective of this study is to compare multidimensional extensions of the MPI and WPM with regard to the relationship between their performance and the number of constraints. From these results, we present recommendations on the use of the various approaches in MAT.
The remainder of the article is organized as follows. First, a brief introduction to MAT is given. Next, the two CMMs (MPI and WPM) are introduced, and their extensions to the multidimensional case are described. Finally, both approaches are evaluated through a simulation study, and recommendations for practitioners are presented.
Multidimensional Adaptive Testing
MAT is proposed as a means of simultaneously measuring several traits. When employing MAT, two important issues must be addressed: the psychometric model and the item selection procedure. Multidimensional item response theory (Reckase, 2009) models are typically used as psychometric models for MAT. One general multidimensional item response theory model is the multidimensional three-parameter logistic (M3PL) model, which specifies the probability that an examinee j will answer an item i correctly as a function of the ability vector
The elements
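For orientation, the response probability of the M3PL model can be sketched numerically. The sketch assumes the commonly cited compensatory form P = c + (1 - c) / (1 + exp(-(a'θ - b))); parameterizations differ across sources (e.g., scaling constants or the placement of the difficulty term), and the function name, argument names, and example values are illustrative assumptions rather than the article's notation.

```python
import math

def m3pl_probability(theta, a, b, c=0.0):
    """Probability of a correct response under an assumed compensatory M3PL form.

    theta: ability vector; a: discrimination vector of the same length;
    b: scalar difficulty; c: pseudo-guessing parameter.
    Setting c = 0 yields the M2PL special case.
    """
    logit = sum(a_d * t_d for a_d, t_d in zip(a, theta)) - b
    return c + (1.0 - c) / (1.0 + math.exp(-logit))

# Hypothetical between-item example: the item loads on the second dimension only.
p = m3pl_probability(theta=[0.3, -0.5, 1.0], a=[0.0, 1.2, 0.0], b=-0.6)
```

Because the item loads on a single dimension, only the second ability component enters the logit, which is the between-item multidimensionality case considered later in the article.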
The second important aspect in MAT pertains to the item selection method (see Yao, 2014 for an overview). Various approaches are used that differ with respect to the multivariable function that must be minimized or maximized (Yao, 2010): for example, maximizing the determinant of the Fisher information matrix (Segall, 1996), minimizing the trace of the inverse Fisher information matrix (van der Linden, 1999), maximizing the posterior expected Kullback-Leibler information (Veldkamp & van der Linden, 2002), and maximizing a simplified Kullback-Leibler information index (C. Wang, Chang, & Boughton, 2011). One frequently investigated item selection method for MAT is Segall’s Bayesian approach (1996), whereby the determinant of the Fisher information matrix is maximized. With regard to typical evaluation criteria (e.g., (conditional) bias and the measurement precision of ability estimates), this approach has been proven to be one of the best performing methods relative to other item selection methods (Mulder & van der Linden, 2009; Veldkamp & van der Linden, 2002; C. Wang & Chang, 2011; C. Wang et al., 2011; Yao, 2012, 2013, 2014). Nevertheless, according to some studies, other approaches perform slightly better (C. Wang & Chang, 2011). However, as Segall’s item selection method has been shown to be robust in several studies and for various MAT specifications, it is used as the item selection procedure for the present study.
For Segall’s Bayesian approach (1996), item selection is optimized by using the variance-covariance matrix
This matrix is determined by summing the information matrix of the previously t administered items
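The selection rule can be illustrated with a small sketch. It assumes the usual statement of Segall's criterion, namely choosing the candidate item that maximizes the determinant of the sum of the inverse prior covariance matrix, the information of the items administered so far, and the information of the candidate. M2PL item information (a a' P Q) is used for simplicity; all function and variable names, the identity prior, and the four-item pool are hypothetical.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information matrix of one M2PL item at theta: a a' P (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-(a @ theta - b)))
    return np.outer(a, a) * p * (1.0 - p)

def select_item(theta_hat, prior_cov, administered, candidates, a, b):
    """Return the candidate index with the largest posterior-information determinant."""
    base = np.linalg.inv(prior_cov)
    for i in administered:
        base = base + item_information(theta_hat, a[i], b[i])
    dets = [np.linalg.det(base + item_information(theta_hat, a[i], b[i]))
            for i in candidates]
    return candidates[int(np.argmax(dets))]

# Hypothetical pool: items 0 and 1 load on dimension 1, item 2 on 2, item 3 on 3.
a = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.2]])
b = np.zeros(4)
chosen = select_item(np.zeros(3), np.eye(3), administered=[0],
                     candidates=[1, 2, 3], a=a, b=b)
```

In this toy case an item from a dimension that has not yet been measured is preferred over a second item from the dimension already covered, which reflects the tendency of the determinant criterion to balance information across dimensions.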
Constraint Management Methods
In this section, the MPI and the WPM are described; their extensions to the multidimensional case (the MMPI and the MWPM) are introduced, and the similarities and differences between the two methods are outlined.
The Maximum Priority Index
The MPI (Cheng & Chang, 2009) is based on the constraint relevancy matrix
The PI for a candidate item
The scaled “quota left”
This ratio is equal to one if no item that is relevant for constraint k is presented
The MPI was extended to the MMPI (Frey et al., 2011) by replacing Fisher item information
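The priority index computation can be sketched as follows. The sketch assumes the form in which the MPI is commonly presented (Cheng & Chang, 2009): an item's information is multiplied by the weighted scaled quota left, f_k = (X_k - x_k) / X_k, of every constraint k that is relevant for the item. Only upper bounds are handled here; the treatment of lower bounds is omitted. Function names and the example numbers are hypothetical, and for the MMPI the `information` argument would hold the multidimensional information criterion instead of unidimensional Fisher information.

```python
def scaled_quota_left(upper_bound, n_administered):
    """f_k = (X_k - x_k) / X_k: 1 if nothing relevant was given yet, 0 if the quota is used up."""
    return (upper_bound - n_administered) / upper_bound

def priority_index(information, relevancy_row, weights, upper_bounds, counts):
    """PI_i = information_i * product over relevant constraints of (w_k * f_k)."""
    pi = information
    for c_ik, w_k, x_upper, x_given in zip(relevancy_row, weights, upper_bounds, counts):
        if c_ik:  # constraints that are irrelevant for the item contribute a factor of 1
            pi *= w_k * scaled_quota_left(x_upper, x_given)
    return pi

# Hypothetical constraints: [dimension 1, dimension 2, item format A].
weights = [1.0, 1.0, 1.0]
upper_bounds = [22, 22, 32]
counts = [22, 10, 5]           # items already administered per constraint

pi_a = priority_index(2.0, [1, 0, 1], weights, upper_bounds, counts)  # dimension 1 is full
pi_b = priority_index(1.5, [0, 1, 1], weights, upper_bounds, counts)
best = "A" if pi_a > pi_b else "B"
```

Item A carries more information, but its priority index is zero because the quota for dimension 1 is exhausted, so item B would be selected.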
The Weighted Penalty Model
Although not described explicitly in Shin et al.’s (2009) study, the WPM is also based on a constraint relevancy matrix
Compared with the PI, the calculation of the weighted penalty value
Small
In an additional step, the standardized information penalty
Finally, the weighted penalty value
To extend the WPM to the multidimensional case (MWPM), only the calculation of the standardized information penalty
Comparison between the MMPI and MWPM
The MMPI and MWPM, in addition to their unidimensional ancestors, can be understood as penalty-based approaches. However, they differ in the ways in which they calculate the overall desirability of an item. For the MMPI, the statistical information of a candidate item
The two examined CMMs are also similar in their selective appropriateness for specific item pool structures. Numerous educational and psychological tests are based on item pools with between-item-multidimensionality structures, whereby each item in a pool assesses only one latent trait. However, for some multidimensional assessments, items measure multiple latent trait dimensions. The MMPI and MWPM in their current forms are better suited to between-item-multidimensionality structures. For assessments based on an item pool in which some items measure one latent trait and others measure several traits, item selection based on the MMPI and MWPM tends to favor items with a single loading. This characteristic is attributable to the ways in which priority index and standardized total content penalty values are calculated, generating a smaller priority index or a higher penalty value for items that measure several traits. In reference to the MPI, Su and Huang (2015) recently described this problem and developed a modified MPI for item selection in cases of within-item multidimensionality. However, as we are focusing on between-item multidimensionality, which is often used in operational tests, a modification of the MMPI and MWPM is not necessary in the present study.
Research Questions
As CMMs are designed to fulfill desired test specifications while optimizing the statistical information of the presented items, CMM usage results in a more or less pronounced loss of measurement precision. The magnitude of this loss will depend on the CMM used (He et al., 2014), on the number of constraints imposed, and on the characteristics of the item pool. Unlike in unidimensional adaptive testing, the correlation structure of the measured dimensions is also crucial for measurement precision in MAT (W.-C. Wang & Chen, 2004; Yoo, 2011). Although several studies have examined the performance of various CMMs, very few have been conducted in the context of MAT (Frey et al., 2011; Su, 2015; Su & Huang, 2015; Veldkamp & van der Linden, 2002; Yao, 2014). Furthermore, no previous study has systematically varied the number of constraints or analyzed interactions between CMMs and the number of imposed constraints. In providing this information, which is essential for determining which CMMs should be used for MAT, the present study addresses four research questions.
Method
Study Design
To answer the four research questions presented above, a comprehensive study with simulated data was conducted. The study was based on a full factorial design with three independent variables (IVs). For all of the conditions,
Item Pools
For each replication, an item pool with 600 items (200 items per dimension) was constructed. Each item measures exactly one of three dimensions (between-item-multidimensionality). Item discrimination parameters were drawn from a uniform distribution on the interval of real numbers (0.5, 1.5), and item difficulty parameters were drawn from a standard normal distribution,
Furthermore, for each replication a constraint relevancy matrix
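The pool construction described above can be sketched in a few lines. The sketch draws one discrimination per item from U(0.5, 1.5) on the item's single dimension and difficulties from a standard normal distribution; the seed, the variable names, and the ordering of items by dimension are assumptions made for illustration, and the constraint relevancy matrix is not generated here because its specification is not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility
n_dims, items_per_dim = 3, 200
n_items = n_dims * items_per_dim

# Dimension index of each item (between-item multidimensionality).
dimension = np.repeat(np.arange(n_dims), items_per_dim)

# Discrimination matrix with exactly one nonzero loading per item.
a = np.zeros((n_items, n_dims))
a[np.arange(n_items), dimension] = rng.uniform(0.5, 1.5, size=n_items)

# Item difficulties from a standard normal distribution.
b = rng.standard_normal(n_items)
```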
Data Generation
For each replication, a sample of 1,000 simulees was generated. Ability parameters were randomly drawn from a multivariate normal distribution of
Three different levels of correlations ρ (.2, .5, .8) between the measured dimensions were used to study the effect of the correlation on the performance of the MMPI and MWPM. Binary responses on the items for the simulees were generated based on the M2PL model (Equation 12).
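A minimal sketch of this data generation step is given below. It assumes a zero mean vector and unit variances for the multivariate normal ability distribution (the exact specification is not recoverable from the text), a common correlation ρ between all pairs of dimensions, and the M2PL form P = 1 / (1 + exp(-(a'θ - b))). The function name, the seed, and the six-item toy pool are hypothetical.

```python
import numpy as np

def simulate_responses(a, b, rho, n_simulees=1000, seed=0):
    """Draw abilities from N(0, Sigma) with common correlation rho and simulate M2PL responses."""
    rng = np.random.default_rng(seed)
    n_dims = a.shape[1]
    cov = np.full((n_dims, n_dims), rho)
    np.fill_diagonal(cov, 1.0)              # assumed unit variances
    theta = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_simulees)
    p = 1.0 / (1.0 + np.exp(-(theta @ a.T - b)))
    responses = (rng.random(p.shape) < p).astype(int)
    return theta, responses

# Toy pool: two items per dimension, each with a single loading of 1.
a = np.kron(np.eye(3), np.ones((2, 1)))
b = np.zeros(6)
theta, responses = simulate_responses(a, b, rho=0.5)
```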
MAT Specifications
The simulations were performed using SAS® 9.4 for a fixed test length of 60 items. For all of the conditions, the ability vector
Table 1. Overall Test Blueprint for All Research Conditions.
The weights for all constraints were set to a value of one. For the number of items per dimension, the lower bound was set to 18 and the upper bound was set to 22; the bounds for each categorical property were 28 and 32.
Evaluation Criteria
As dependent variables (DV), the average mean squared error (MSE), the proportion of tests with at least one constraint violation (%Viol), and the average number of violations (#Viol) were used. MSE was calculated as the average squared difference between the ability estimates
The other two DVs were used to evaluate the extent to which the imposed constraints were fulfilled. %Viol was computed as the ratio of simulees taking a test with at least one constraint violation relative to all of the simulees multiplied by 100. #Viol was calculated as the average of constraint violations across all simulees N.
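The three dependent variables can be computed as in the following sketch. It assumes that a violation is counted whenever the number of administered items for a constraint falls outside its lower and upper bounds, and that the MSE is averaged over simulees and dimensions; the function names and the two-simulee example are hypothetical.

```python
def count_violations(counts, lower, upper):
    """Number of constraints whose administered-item count lies outside [lower, upper]."""
    return sum(not (lo <= n <= hi) for n, lo, hi in zip(counts, lower, upper))

def evaluate(theta_true, theta_hat, violations):
    """Return (MSE, %Viol, #Viol) across simulees."""
    n = len(violations)
    squared = [(est - true) ** 2
               for true_vec, est_vec in zip(theta_true, theta_hat)
               for true, est in zip(true_vec, est_vec)]
    mse = sum(squared) / len(squared)
    pct_viol = 100.0 * sum(v > 0 for v in violations) / n
    avg_viol = sum(violations) / n
    return mse, pct_viol, avg_viol

# Bounds for the number of items per dimension as stated in the text.
lower, upper = [18, 18, 18], [22, 22, 22]
violations = [count_violations([20, 20, 20], lower, upper),
              count_violations([25, 17, 18], lower, upper)]
mse, pct_viol, avg_viol = evaluate(
    theta_true=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    theta_hat=[[0.5, 0.0, 0.0], [1.0, 1.0, 0.0]],
    violations=violations,
)
```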
Results
In this section, the four research questions of the present study are answered. First, the results regarding the constraint violations are presented. Second, the measurement precision of the MMPI and MWPM is compared. Then, the performance of the CMMs in the various assessment situations is evaluated. Finally, the performance of the CMMs for the various correlation levels between the measured dimensions is analyzed.
Constraint Violations
The first research question focuses on the CMMs’ capacity to fulfill the desired test specifications. To answer this question, we evaluated the proportion of tests with at least one constraint violation (%Viol) and the average number of violations (#Viol). Tables 2 and 3 show the results for the various correlation levels.
Table 2. Percentage of Tests With at Least One Violation (%Viol) and Standard Error of Different Constraint Management Methods for Multidimensional Tests With Differently Correlated Dimensions.
Note. MMPI = multidimensional maximum priority index; MWPM = multidimensional weighted penalty model.
Table 3. Average Number of Violations (#Viol) and Standard Error of Different Constraint Management Methods for Multidimensional Tests With Differently Correlated Dimensions.
Note. MMPI = multidimensional maximum priority index; MWPM = multidimensional weighted penalty model.
It can be concluded that the MMPI and MWPM perfectly met all imposed constraints for all of the conditions. Accordingly, no test contained a constraint violation, and the average number of violations was zero. When item selection was based solely on statistical optimality (CMM = “none”), the proportion of tests with at least one violation was higher than 97% for almost all of the conditions. However, when constraints were only imposed on the number of administered items per dimension (constraints = 3), the percentage of tests with at least one violation was considerably lower. When no CMM was used, the average number of violations increased with a growing number of imposed constraints.
Measurement Precision
The second research question focuses on the effect the number of imposed constraints has on the measurement precision of the MMPI and MWPM. Table 4 presents the average mean squared error (MSE) for the various correlation levels.
Table 4. Average Mean Squared Error (MSE) and Standard Error of Different Constraint Management Methods for Multidimensional Tests With Differently Correlated Dimensions.
Note. MMPI = multidimensional maximum priority index; MWPM = multidimensional weighted penalty model.
We designated the condition wherein item selection was based solely on the statistical optimality criterion (“none”) as the baseline condition. As expected, the MSE of this condition was independent of the number of imposed constraints. For the MMPI and MWPM, the MSE increased and, accordingly, the measurement precision decreased when more constraints needed to be considered. When constraints only referred to the number of administered items per dimension, there was no loss in the measurement precision of the MMPI and MWPM in relation to the baseline condition. However, when numerous constraints were involved, the measurement precision of the two heuristic CMMs decreased significantly relative to the baseline.
Performance in Various Assessment Situations
In addressing the third research question, we aim to make recommendations with regard to the two CMMs for specific assessment situations. We thus examined how the CMM used and the number of test specifications interact in their effect on performance. Performance was assessed based on constraint violations and measurement precision. As stated above, both CMMs perfectly met all of the imposed constraints for all of the conditions. However, a closer examination shows that the loss in measurement precision resulting from an increasing number of imposed constraints differed between the two CMMs. For low to moderate numbers of imposed constraints, the measurement precision of the MMPI was higher than that of the MWPM. This changed when numerous constraints were involved. Here, the MWPM outperformed the MMPI. This interaction between the constraint management methods and the number of imposed constraints is shown in Figure 1. For all of the correlation levels, the curves intersect between 23 and 28 constraints. Thus, from this number of constraints upward, the MWPM performs better than the MMPI.

Figure 1. Average mean squared error (MSE) as a function of the number of imposed constraints for the different constraint management methods and correlation levels. MMPI = multidimensional maximum priority index; MWPM = multidimensional weighted penalty model.
Performance on Various Correlation Levels
The fourth research question concerns the effect that the correlation between the measured dimensions has on the performance of the MMPI and MWPM. Tables 2, 3, and 4 present the criteria for evaluating performance depending on the correlation levels between the measured dimensions. The two CMMs do not differ with regard to constraint violations (%Viol and #Viol) for various correlation levels. By contrast, measurement precision is affected by correlations between the measured dimensions. For all of the conditions, a higher correlation between the measured dimensions resulted in lower MSE values and thus in higher degrees of measurement precision. Furthermore, the loss in measurement precision derived from an increasing number of imposed constraints was related to correlations between the measured dimensions. Table 4 shows that less strongly correlated dimensions were associated with a greater loss of measurement precision.
Discussion
The present study focused on the effectiveness of two promising heuristic CMMs employed in MAT when varying numbers of imposed constraints and correlations between measured dimensions are involved. The multidimensional extension of the WPM was introduced, and its performance was compared to the MMPI through a simulation study. The performance of the two CMMs was evaluated based on constraint violation and measurement precision.
The study shows that both the MMPI and MWPM are capable of addressing complex sets of constraints in MAT without causing any violations. By contrast, when item selection is based only on a statistical optimality criterion, the number of tests with constraint violations is quite high. The small proportion of tests with at least one constraint violation found for the condition where no CMM was used and where only the number of administered items per dimension was constrained may erroneously suggest that CMMs are not needed to balance the proportion of items per dimension. However, this result is attributable to the built-in “minimax mechanism” of the D-optimal method (Mulder & van der Linden, 2009), which tends to select items belonging to the dimension with the least information. Consequently, the requested proportions of items per dimension are automatically balanced when (as in the present study) the item pool is balanced. Though we expected the two CMMs examined to perform better with regard to the fulfillment of test specifications, it is not a trivial finding that all of the constraints were met perfectly in each condition. Thus, this study highlights the capability of the MMPI and MWPM in meeting test specifications when numerous constraints are involved.
To fulfill imposed constraints, statistical information must be sacrificed to a particular degree. Accordingly, measurement precision decreased with an increasing number of constraints for both CMMs. This result is not surprising, because with an increasing number of constraints and a constant item pool size, the proportion of items with a specific combination of properties decreases. In turn, the penalty for non-statistical constraints becomes more critical to the item selection process, and the relevance of statistical information decreases. In particular, when numerous constraints are involved, the loss in measurement precision is significant compared to that found for MAT without CMM (“none”).
This study shows that the performance of the two heuristic CMMs for various correlation levels does not differ with regard to constraint violations. In accordance with the results of W.-C. Wang and Chen (2004) and Yoo (2011), measurement precision was affected by correlation levels among the dimensions. The loss in measurement precision that resulted from an increasing number of imposed constraints was related to the correlation level and to the CMM used. For tests with highly correlated dimensions, the resulting loss was considerably smaller than that for weakly correlated dimensions, especially when numerous constraints were imposed. With regard to the CMMs used, the MMPI performed slightly better than the MWPM for low to moderate numbers of constraints. Although the difference in measurement precision between the MMPI and MWPM was rather small, we recommend using the MMPI for assessment situations involving few constraints, as this is a low maintenance method (He et al., 2014). When numerous constraints are involved, the MWPM appears to be more suitable. These findings seem to contradict results presented by He et al. (2014), who found the WPM to perform considerably better for a moderate number of constraints in the unidimensional case. However, these results are not directly comparable, as the two studies differ in some major respects (e.g., the number of items in the pool per dimension, the item selection criterion, the use of constraint weights).
The results of the present study make a substantial contribution to the management of test specifications in several respects. First, through the extension of the WPM to the MWPM, a new heuristic CMM that can address complex sets of constraints was made available for MAT. Second, the results underline that the number of imposed constraints constitutes a crucial factor affecting the management of test specifications (Donoghue, 2014). Third, the study pointed out that for some assessment situations, one specific CMM is more suitable to use than another. In selecting a heuristic CMM to address complex constraints, the MMPI should be used when a low to moderate number of constraints is involved, and the MWPM should be employed when numerous constraints are involved.
The findings of the present study are limited to assessment situations in which the item pool is constructed so that the desired test specifications can be fulfilled. This assumption is appropriate, as test specifications are typically known prior to item pool construction. Furthermore, we used Segall’s Bayesian item selection method (1996), a powerful item selection method that is likely the most frequently studied and applied item selection procedure for MAT. Nevertheless, as other MAT item selection methods may have specific effects on CMM performance, a comparison between the presented CMMs and various item selection procedures should be conducted. As the CMMs used in this study are better suited to cases of between-item-multidimensionality, the extension of the methods to within-item-multidimensionality contexts represents another area for future inquiry.
In conclusion, the MWPM and MMPI are two heuristic CMMs that can manage complex sets of constraints in MAT. In applying these two methods, which can be selected depending on the number of constraints involved, the management of test specifications in MAT becomes much easier, and more operational applications of this powerful method may be generated as a result.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
