Abstract
Mathematical programming has been widely used by professionals in testing agencies as a tool to automatically construct equivalent test forms. This study introduces the linear programming capabilities (modeling language plus solvers) of SAS Operations Research as a platform to rigorously engineer tests to specifications in an automated manner. To that end, real items from a medical licensing test are used to demonstrate the simultaneous assembly of multiple parallel test forms under two separate linear programming scenarios: (a) constraint satisfaction (one problem) and (b) combinatorial optimization (three problems). In the four problems from the two scenarios, the forms are assembled subject to various content and psychometric constraints. The assembled forms are then assessed using psychometric methods to ensure equivalence with respect to all test specifications. Results from this study support SAS as a reliable and easy-to-implement platform for form assembly. Annotated code is provided to promote further research and operational work in this area.
Automated test assembly (ATA) is a process that uses mathematical procedures to first select items from an item bank and then package them into test forms subject to content and psychometric constraints. ATA plays an important role in two stages of test creation: (a) test form generation and (b) item bank development (Armstrong, Jones, & Wu, 1992). Testing agencies often need to administer various tests at multiple locations, sessions, and examination days. Due to test security concerns, it is necessary to generate multiple parallel test forms that are maximally equivalent on content coverage and psychometric properties. The parallel forms can be assembled automatically as well as manually.
Automated methods are gaining popularity due to innovations in computer technology, psychometric theory, and the increasing need for large-scale assessment tools. ATA has great advantages over manual test assembly (Verschoor, 2007). Featuring improved effectiveness and efficiency, automated methods can streamline the optimal item selection and packaging process while factoring more constraints into forms. Therefore, automated methods can support the mass production of test forms for continuous test administration (Breithaupt & Hare, 2007, pp. 5-7; van der Linden & Adema, 1998, p. 190). Generally speaking, forms from automated methods tend to be more statistically parallel (i.e., more equivalent) than forms from manual assembly (Cor, Alves, & Gierl, 2009, p. 16; Luecht & Hirsch, 1992, pp. 46-47, 51; Stocking, Swanson, & Pearlman, 1991, 1993). Empirical evidence from high-stakes medical licensing examinations also indicates that automated methods generate forms that assess examinees’ ability levels more accurately than manual methods, particularly around the cutoff score for classifying examinees (Choe & Denbleyker, 2014).
ATA can also work in reverse to support the design and improvement of the item bank by investigating the maximum number of forms that the item pool allows to be assembled. For instance, when an item shortage causes one or more constraints not to be met, the ATA process will lead to an infeasible solution. Analysis of the item bank (e.g., checking item frequencies in various breakdowns from the blueprint, as recommended by Luecht, Champlain, and Nungester, 1998) may reveal the item shortage areas, which in turn guides future item writing.
Finally, ATA serves to establish test reliability and validity in various ways. In terms of test reliability, numerous measures have been proposed under either item response theory (IRT) or classical test theory (CTT) (e.g., Parshall, Spray, Kalohn, and Davey, 2002). When a reliability measure is operationalized in a test, it becomes a test specification. ATA allows one or more such specifications to be taken into consideration simultaneously. As for test validity, ATA supports the balance of content and psychometric properties by creating parallel test forms, a critical content validity consideration especially in criterion-referenced, credentialing examinations (e.g., medical licensing examinations) (Luecht et al., 1998; van der Linden, 2005, p. ix). In addition, ATA can satisfy test specifications from the test blueprint more efficiently and rigorously than manual assembly, which provides additional support for content validity; for example, it is easier to experiment with a larger number of test constraints and to update constraint boundaries. ATA also contributes to test security (e.g., effectively controlling item exposure and test overlap), which is ultimately another matter of test validity (Choe, 2017, pp. 5-7). Lastly, ATA can automatically deal with enemy items (when one item gives away the answer to another through the stimulus or answer, one item is the enemy of the other) to prevent them from being administered to examinees on the same form, thereby contributing to test validity (Woo & Gorham, 2010).
Mathematical Programming for Binary Integers
There is extensive literature about ATA methods, which are largely based on mathematical programming (MP). Conceptually, different MP methods work in more or less a similar way (Drasgow, Luecht, & Bennett, 2006, p. 480). Given a mathematical target or criterion (if specified), an MP method searches for items optimizing the mathematical objective function (e.g., maximizing the test information function [TIF]) while simultaneously satisfying other test specifications formulated as constraints (Chen, Chang, & Wu, 2012; Drasgow et al., 2006; Luecht, 1998; Luecht & Hirsch, 1992; van der Linden, 2005).
ATA methods abound under the MP framework, such as mixed-integer linear programming (MILP), genetic algorithms (Finkelman, Kim, & Roussos, 2009; Sun, Chen, Tsai, & Cheng, 2008; Verschoor, 2007), simulated annealing (Veldkamp, 1999), and the normalized weighted absolute deviation heuristic (Luecht, 1998; Luecht & Hirsch, 1992), among many others. MILP is among the most popular MP methods for ATA, has been applied to a variety of test assembly problems, and has been implemented in many optimization software programs (Belov & Armstrong, 2005; Breithaupt & Hare, 2007; Chen et al., 2012; Cor et al., 2009; Diao & van der Linden, 2011; Luo & Kim, 2018; Luo, Kim, & Dickison, 2018; Theunissen, 1985, 1986; van der Linden, 2005).
Under MILP, ATA can be formulated as one of the following two problems: (a) constraint satisfaction (CS) and (b) combinatorial optimization (CO). With CO, the MILP program searches iteratively to identify the best possible solution to produce maximally equivalent test forms along the ability scale. The optimal solution here is in the form of feasible values of decision variables on test items that optimize the objective function while satisfying test constraints. By contrast, when ATA is formulated as CS, only test constraints need to be satisfied (no objective function is involved). So, CO and CS share two fundamental components: (a) decision variables and (b) constraints. For CO problems, the objective function needs to be specified as the third component. Methods to establish these components in an MILP-based ATA are elaborated below.
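The difference between the two formulations can be illustrated with a deliberately tiny example. The sketch below is not the authors' SAS implementation: it replaces the MILP solver with brute-force enumeration, and the six-item pool, the two content areas, the form length of three, and the function names are all hypothetical. It only shows that CS stops at any form satisfying the constraints, whereas CO ranks the feasible forms by an objective.

```python
from itertools import combinations
from math import exp

# Hypothetical toy pool: (item_id, Rasch difficulty, content area).
POOL = [(1, -1.0, "A"), (2, -0.5, "B"), (3, 0.0, "A"),
        (4, 0.3, "B"), (5, 0.8, "A"), (6, 1.5, "B")]

def info(b, theta=0.0):
    """Rasch item information at theta: p * (1 - p)."""
    p = 1.0 / (1.0 + exp(-(theta - b)))
    return p * (1.0 - p)

def feasible(form):
    """Constraints only: exactly 3 items, at least one from each content area."""
    areas = [area for _, _, area in form]
    return len(form) == 3 and "A" in areas and "B" in areas

def solve_cs():
    """Constraint satisfaction: return the first form that meets all constraints."""
    for form in combinations(POOL, 3):
        if feasible(form):
            return form
    return None

def solve_co(target):
    """Combinatorial optimization: among feasible forms, minimize |TIF - target|."""
    candidates = [f for f in combinations(POOL, 3) if feasible(f)]
    return min(candidates,
               key=lambda f: abs(sum(info(b) for _, b, _ in f) - target))
```

Enumeration is only workable for toy pools; with a realistic bank the number of candidate forms is astronomically large, which is why a dedicated MILP solver is needed.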
Test Specifications as Constraints
First, binary decision variables on item assignment are defined in the form of 1s and 0s:
where
Second, constraints are test specifications which are often formulated as linear equalities or inequalities on item assignment. Outlined below are typical constraints:
To restrict test length to be exactly
To ensure item
To limit the number of items on an attribute
An attribute is defined broadly as anything on which a constraint is to be specified, for example, whether an item covers a certain topic, belongs to the same testlet as other items, or is an enemy of other items.
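For orientation, the decision variables and the three kinds of constraints described above are commonly written as follows in the ATA literature (e.g., van der Linden, 2005). This is a sketch in assumed notation, not a reproduction of the article's own equations: the symbols $x_i$, $I$, $n$, $V_c$, $n_c^{\min}$, and $n_c^{\max}$ are illustrative choices.

```latex
% Assumed notation: x_i, I, n, V_c, n_c^{min}, n_c^{max} are illustrative symbols.
\begin{align*}
x_i &\in \{0, 1\}, \quad i = 1, \dots, I
  && \text{($x_i = 1$ if item $i$ is selected, $0$ otherwise)} \\
\sum_{i=1}^{I} x_i &= n
  && \text{(test length is exactly $n$)} \\
x_i &= 1 \ \text{or} \ x_i = 0
  && \text{(item $i$ is forced into or out of the form)} \\
n_c^{\min} \le \sum_{i \in V_c} x_i &\le n_c^{\max}
  && \text{(bounds on the number of items with attribute $c$)}
\end{align*}
```

When several forms are assembled simultaneously, the variables are typically indexed by form as well ($x_{if}$), and a constraint such as $\sum_f x_{if} \le 1$ for every item $i$ prevents overlap between forms.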
Test Specifications as an Objective Function
Similar to constraints, the objective function represents test specifications as well. Whereas test specifications formulated as constraints typically cover attributes with lower and upper bounds, those in the objective function often deal with attributes that need to be either maximized or minimized. The formulation of an objective function depends largely on the goal of the ATA problem. Testing professionals could choose to minimize the length of the test while satisfying all other test specifications or attempt to maximize test reliability while keeping the lengths of all forms fixed at a certain number. In this study, the objective function is formulated by requiring the TIF of each assembled form to be as close to the selected target value
Calculated by summing all item information functions (IIFs) together under the local independence assumption, TIF is an instance of the Fisher information reflecting the information in an examinee’s responses to test items regarding his or her unknown ability. In other words, TIF is an indicator of the quality of a test form over the range of examinees’ ability levels and reflects how well the test distinguishes one examinee’s ability level from another across the ability continuum (de Ayala, 2009, pp. 27-31; van der Linden, 2005, pp. 16-17). When a test is designed to make decisions about examinees with a cutoff, it needs to reach a pre-specified information level both at the cutoff and in its neighborhood so that the test allows an examinee whose ability level is above the cutoff to be clearly distinguished from another whose ability level is below the cutoff.
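As a minimal sketch of this summation, assuming the Rasch model (under which the information of an item at ability θ is p(1 − p), with p the probability of a correct response), the IIF and TIF can be computed as below. The function names are hypothetical.

```python
from math import exp

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def item_information(theta, b):
    """Rasch item information function (IIF): p * (1 - p)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

def form_information(theta, difficulties):
    """TIF at theta: the sum of the IIFs of all items on the form."""
    return sum(item_information(theta, b) for b in difficulties)
```

Under this model each item contributes at most 0.25, and it does so where θ equals its difficulty, so a form whose items are located near the cutoff concentrates its information there.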
Following Theunissen (1985) and van der Linden (2005, p. 110), the objective function is defined as follows to minimize the distance between the TIF and a target TIF value at
where
Since the objective function in Equation 5 is nonlinear with respect to
As noted by van der Linden (2005, p. 106), an objective function can also be approximated as constraints. Taking TIF as an example, it is a well-behaved smooth function for models such as Rasch. Therefore, if it is required that the TIF meets a smooth target
where
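One standard way to turn a distance-to-target objective into a linear program is the minimax reformulation (van der Linden, 2005): the absolute deviation is bounded by an auxiliary variable that is then minimized. The block below is a sketch of that general technique in assumed notation ($x_i$ for item selection, $I_i(\theta_k)$ for the IIF of item $i$ at ability point $\theta_k$, $T(\theta_k)$ for the target, $y$ for the auxiliary bound); whether it matches the article's equations term by term is an assumption.

```latex
% Assumed notation; a generic minimax linearization, not the article's own equations.
\begin{align*}
\text{minimize} \quad & y \\
\text{subject to} \quad
& \sum_{i=1}^{I} I_i(\theta_k)\, x_i - T(\theta_k) \le y, && k = 1, \dots, K, \\
& \sum_{i=1}^{I} I_i(\theta_k)\, x_i - T(\theta_k) \ge -y, && k = 1, \dots, K, \\
& y \ge 0, \quad x_i \in \{0, 1\}, && i = 1, \dots, I.
\end{align*}
```

Both inequalities are linear in $x_i$ and $y$, so the problem stays within MILP. Fixing $y$ to a small tolerance instead of minimizing it gives the constraint-only version in which the TIF is merely required to stay within a band around the target.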
Test Assembly Using SAS
Solving a complicated MILP problem requires sophisticated optimization algorithms (i.e., solvers). The performance of the MILP method in ATA relies heavily on the software program that provides the solver. Therefore, choosing the right optimization software is critical in ATA. Moreover, given the large number of MILP solvers available, the task can become overwhelming. A variety of solvers for linear programming have been made available and reviewed in the literature. For instance, Cor, Alves, and Gierl (2008) introduced the EXCEL-based Premium Solver version 7.0 platform. van der Linden (2005, p. 87) introduced multiple solvers including ConTEST, OPL Studio, OTD, LINDO/LINGO, AIMMS, and the linear programming options in EXCEL and solved all problems using CPLEX 9.0.
Many of these specialized tools are designed exclusively for optimization problems with little to no consideration of their ability to communicate with other software programs. R (R Core Team, 2018) is a likely exception here because it is versatile enough to communicate with many other software programs including SAS to transfer data and retrieve outputs, although some R optimization packages only provide wrapper functions serving as an interface to the well-known lp_solve version 5.5.2.5 program (Berkelaar, Eikland, & Notebaert, 2016) instead of offering their own solvers. To address this issue, the authors propose to use SAS as a possible one-stop shop to consolidate MILP-based ATA and many other psychometric analyses into one platform. Notably, SAS itself can be programmed to invoke other psychometrics and statistics software programs such as WINSTEPS (Linacre, 2018) for calibration and R for general statistical analyses and psychometric operational work. In other words, real-time data exchange and output sharing are possible between SAS and many of these other software programs.
SAS has been a popular analytic tool in the assessment industry and beyond. In many testing organizations, SAS has been used in various aspects of the operational work including test bank database query, initial response data cleaning, examinee scoring, score equating, score release, and reporting. However, SAS has not been widely utilized as a test form assembly platform. Many testing agencies have resorted to other programs for ATA work, such as LINDO/LINGO and EXCEL that cannot interact with SAS in any automated way to further process the data for form review and publishing purposes. Switching software significantly reduces work efficiency as it often incurs manual processes requiring many steps of pre-switch and post-switch data manipulation. It is therefore desirable to identify and utilize more effective and efficient tools in form assembly that can communicate with other psychometric analyses already conducted in SAS (and other software programs such as R).
In this article, the authors propose to use the SAS Operations Research (OR) software as a solution to linear programming in ATA. As the flagship procedure in SAS/OR, PROC OPTMODEL allows virtually unlimited numbers of decision variables and constraints (Pratt, 2018, September 25; SAS Institute, 2017a). The procedure integrates seamlessly and in real time with other SAS procedures and other software programs such as R and WINSTEPS to manage data and perform analysis while solving complex test assembly problems. In this study, the authors demonstrate the capabilities of SAS/OR in ATA by giving detailed information on how to code the objective function and various constraints under SAS/OR.
An empirical item bank is used to present four ATA exemplar problems formulated under CS (one problem) and CO (three problems) scenarios. PROC OPTMODEL is utilized to assemble forms. Form equivalence is evaluated by TIFs and test characteristic curves (TCCs). The authors conclude the study with findings, limitations, and recommendations for future work.
ATA Demonstrations in SAS
For the ATA demonstration here, an item bank of around 1,000 items is selected from the Comprehensive Osteopathic Medical Licensing Examination of the United States (COMLEX-USA). COMLEX-USA is a three-level computer-based series of examinations. Each level of the examination has multiple forms. The vast majority (around 90%) of items in this bank are standalone items and the rest are case items (i.e., a testlet of two or more items). In a sense, a standalone item can be treated as a special/fake case item (testlet) with only one item in it (vs. a true case item/testlet of two or more items). Finally, it is worthwhile mentioning how items are identified in the bank. Each and every item has an item ID (ITEM_ID) but those belonging to true testlets also share an additional case ID (CASE_ID). Although a standalone item can be viewed as a special case item with only one item in it, the bank does not assign a case ID to it.
Table 1 gives the information of 10 items in the pool including item and case IDs, sub-topics under the two dimensions of the COMLEX-USA master blueprint (BP1 and BP2: say, a 4 under BP1 indicates subtopic 4 under BP1 [BP1_4]; National Board of Osteopathic Medical Examiners [NBOME], 2018), calibrated Rasch difficulty parameters (RASCH_DIFF), and corresponding item IDs of enemies (ENEMY ITEM_ID). All items in the pool are also calibrated under CTT, but CTT difficulty and discrimination parameters are not displayed to save space. Finally, for programming purposes in SAS, a combo ID (COMBO_ID) is created for each item: for a standalone item, item ID prefixed with 888; for a true case item, case ID prefixed with 999. The combo ID allows each case to be uniquely identified and also distinguishes a true case item from a standalone item. As is to be seen, this new ID plays an important role in both the data preparation and the ATA processes.
Table 1. Ten Select Items from the Empirical Item Bank.
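The combo ID rule described above (prefix 888 for a standalone item, prefix 999 for a true case item) can be sketched as follows. The sample IDs are hypothetical, and treating the IDs as strings rather than numbers is an assumption made for illustration; the authors' actual data preparation is in SAS (Supplemental Appendix A).

```python
def combo_id(item_id, case_id=None):
    """Build COMBO_ID: '888' + ITEM_ID for standalone items,
    '999' + CASE_ID for items that belong to a true testlet."""
    if case_id is None:
        return f"888{item_id}"
    return f"999{case_id}"

def group_by_combo(items):
    """Map each COMBO_ID to the item IDs it covers.

    items: iterable of (item_id, case_id) pairs, with case_id None
    for standalone items.
    """
    groups = {}
    for item_id, case_id in items:
        groups.setdefault(combo_id(item_id, case_id), []).append(item_id)
    return groups
```

For example, `group_by_combo([(101, None), (102, 7), (103, 7)])` yields one single-item group for the standalone item and one two-item group for the testlet, which is the case-level view of the bank.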
ATA Problems and Constraints
PROC OPTMODEL is used to solve four ATA problems: (a) Problem 1: CS, (b) Problem 2: CO under one
Test length: 100 items for each form;
No item is used more than once across five forms (i.e., no overlap between forms);
Constraints on BP1, BP2, and life stage in clinical presentations. For example, subtopic 1 under BP1 (BP1_1) is at least 12%;
In each form, the average item difficulty under CTT should fall between 0.71 and 0.74;
In each form, the average item discrimination under CTT should be at least 0.15;
The difference between the average response time (average duration) of each assembled form and the average value of the item bank should not exceed 5%;
In each form, the number of (true) case items should be at most 10% of test length;
Items that are enemies to each other should not be assigned to the same form.
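Several of the constraints listed above can be verified after assembly with a simple audit. The sketch below is an illustration, not part of the authors' program: the dictionary keys (`id`, `p`, `disc`, `case`) and the function names are hypothetical, and only the length, CTT difficulty, CTT discrimination, case-item, enemy, and overlap constraints are covered.

```python
from statistics import mean

def violations(form, enemy_pairs, length=100):
    """Return the names of constraints that a non-empty candidate form violates.

    form: list of dicts with hypothetical keys 'id', 'p' (CTT difficulty),
    'disc' (CTT discrimination), and 'case' (True for true case items).
    enemy_pairs: set of frozensets, each holding two mutually exclusive item IDs.
    """
    failed = []
    if len(form) != length:
        failed.append("length")
    if not 0.71 <= mean(i["p"] for i in form) <= 0.74:
        failed.append("difficulty")
    if mean(i["disc"] for i in form) < 0.15:
        failed.append("discrimination")
    if sum(i["case"] for i in form) > 0.10 * length:
        failed.append("case_items")
    ids = {i["id"] for i in form}
    if any(pair <= ids for pair in enemy_pairs):
        failed.append("enemies")
    return failed

def overlap(forms):
    """Return item IDs that appear on more than one form."""
    seen, repeated = set(), set()
    for form in forms:
        for item in form:
            (repeated if item["id"] in seen else seen).add(item["id"])
    return repeated
```

A check of this kind is independent of the solver, so it can serve as a sanity test on any exported solution.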
Programming in PROC OPTMODEL and Output
After data preparations (preparation details can be found in Supplemental Appendix A) are completed, PROC OPTMODEL can be used next to import data (READ DATA statement) from the case-level SAS dataset, solve the ATA problem under MILP, and export any CS/CO results (CREATE DATA statement) into a new SAS dataset for additional analyses. When writing the CS/CO code, the VAR declaration with the BINARY option is used to create binary decision variables outlined in Equation 1 (named as “included” in the code). The decision variables are laid out in an
As a courtesy, sample SAS code (including specific constraint boundary percentages used in the study) is provided in Supplemental Appendix B. To fully understand the technical and syntactic details of the program, the authors recommend the official SAS guide on MP (SAS Institute, 2017a). Also, readers are welcome to contact the authors with questions, requests, and/or comments on the code.
Assessment of Form Equivalence
In all four problems, PROC OPTMODEL reached a feasible solution within 3 min on a regular desktop computer (CPU: Intel i5 2.40 GHz; Memory: 16.0 GB; OS: Windows 7 Professional 64-bit) under ABSOBJGAP = 0.01 using SAS version 9.4 and SAS/OR version 14.1. The feasible solution is in the form of an
The authors begin with graphical means of assessing form equivalence. Figures 1 and 2 present the TIF and TCC overlays of the form assemblies for all four problems. Under each problem, the TIFs and TCCs, respectively, exhibit almost identical patterns over the examinee ability continuum across the forms. That the TIF/TCC curves closely overlap with each other suggests that they are statistically parallel and thus provides strong evidence of form equivalence. Notably, the TIFs of the CS problem have slightly more variability around the neighborhood of the cutoff (this is as expected since the CS problem imposes no requirements on the TIFs of forms), but this variability is not clearly visible in the TIF overlays of the other three problems. Because the CS problem does not implement an objective function, but the other three problems all have an objective function in place to minimize the distance between the TIF and a target value at

Figure 1. Test information functions for all five forms under each ATA problem.

Figure 2. Test characteristic curves for all five forms under each ATA problem.
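The visual comparison of TCC overlays can be complemented by a numeric summary. The sketch below assumes the Rasch model, under which the TCC is the sum of the response probabilities (the expected raw score), and reports the largest vertical gap between two forms over a grid of ability values. The function names and the grid are hypothetical, and this summary is not a statistic reported in the study.

```python
from math import exp

def tcc(theta, difficulties):
    """Expected raw score at theta under the Rasch model."""
    return sum(1.0 / (1.0 + exp(-(theta - b))) for b in difficulties)

def max_tcc_gap(form_a, form_b, grid):
    """Largest absolute TCC difference between two forms over the grid."""
    return max(abs(tcc(t, form_a) - tcc(t, form_b)) for t in grid)
```

A gap close to zero over the whole grid corresponds to curves that overlap almost completely in the overlay plot.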
To supplement the TIFs and TCCs, the authors follow Luecht and Hirsch (1992) to further assess the extent to which the assembled forms are equivalent. To that end, the authors demonstrate the distributions (means and standard deviations) of CTT difficulty, CTT discrimination, and Rasch difficulty parameters (Table 2) and IIF values at selected ability levels (Table 3) across all five forms under each of the four problems. As is observed from the two tables, the between-form distributional differences on all three parameters and on IIF values at all selected ability levels are generally very minor. Evidently, PROC OPTMODEL is very consistent in matching item parameters and IIFs across test form assemblies. Besides, given that the CS problem does not impose any constraints on the TIF, it is probably not surprising to find that the five CS forms exhibit slightly more variability with regard to the average IIF value at
Table 2. Distributions of Item CTT/Rasch Parameters Across Five Form Assemblies.
Note. CTT = classical test theory; CS = constraint satisfaction; CO = combinatorial optimization.
Table 3. Distributions of IIF Values at Select Ability Levels Across Five Form Assemblies.
Note. IIF = item information function; CS = constraint satisfaction; CO = combinatorial optimization.
Notably, from time to time, PROC OPTMODEL may fail to find a feasible solution or cannot find one within a reasonable amount of time under specified constraints and/or the objective function. When that happens, it may be necessary to relax some constraint boundaries or adjust the threshold setting specified in ABSOBJGAP to give the procedure more flexibility in searching for the solution. Running an inventory analysis of the bank to see if there are enough items to meet a certain constraint is also recommended.
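An inventory analysis of this kind can be as simple as comparing supply with demand per blueprint category. The sketch below assumes that forms do not share items, so the demand for a category is the per-form minimum multiplied by the number of forms. The function name and the bank counts in the usage note are hypothetical.

```python
from collections import Counter

def shortages(bank_categories, min_per_form, n_forms):
    """Return {category: number of missing items} for every category whose
    supply in the bank cannot cover the minimum demand across all forms.

    bank_categories: one category label per item in the bank.
    min_per_form: {category: minimum number of items required on each form}.
    """
    supply = Counter(bank_categories)
    return {cat: need * n_forms - supply[cat]
            for cat, need in min_per_form.items()
            if supply[cat] < need * n_forms}
```

For example, a 12% minimum on BP1_1 in 100-item forms requires 12 items per form and hence 60 distinct items for five non-overlapping forms; a bank holding only 50 such items would be reported as 10 short. Passing this check is necessary but not sufficient for feasibility, because the constraints interact with one another.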
Discussion
The OPTMODEL procedure provides a general-purpose optimization modeling language and is capable of calling various solvers offered in SAS/OR to solve MP problems of many different types. Included solvers in PROC OPTMODEL are CLP for constraint logic programming, LP for linear programming, LSO for local search optimization, MILP for mixed-integer linear programming, Network for network optimization, NLP for general nonlinear programming, and QP for quadratic programming.
With PROC OPTMODEL, users can easily formulate an MP problem as an optimization model using a natural and transparent algebraic form that closely mimics the symbolic algebra of the formulation, pass the model together with the MP problem data directly to the appropriate solver, and review the solver results. Whether the problem data are encapsulated in SAS datasets or interwoven in the algebraic optimization model in PROC OPTMODEL, the MP results can be saved, easily changed, and solved again. PROC OPTMODEL produces SAS data tables containing the solutions for users to view and interact with. Because the OPTMODEL procedure does not use a RUN statement but instead operates on an interactive basis throughout, users can continue to interact with PROC OPTMODEL even after invoking a solver. For example, users could modify the problem to update the model before issuing another SOLVE statement.
The OPTMODEL procedure is able to create and solve optimization problems of substantial scope and detail. For the test assembly problems in NBOME that vary in size and complexity (usually thousands of test specification constraints under hundreds of thousands of items in the item bank; regular desktop computers as outlined above), if the solution is feasible, PROC OPTMODEL typically reaches optimality in a reasonable amount of time (anywhere from a couple of seconds to 5-10 min, depending on the number of constraints, the value of the stopping criterion specified in the ABSOBJGAP option, among other things). If the solution is infeasible, the procedure spends no more than a couple of seconds figuring it out.
Also, the documentation admits that sometimes optimization models under PROC OPTMODEL can run out of memory during problem generation, and the authors have also encountered such a memory issue in this study. In the last problem, the authors planned to achieve objectives at five θ points (i.e., CO-O5) but due to the system running out of memory, they had to reduce the number of
As for licensing and costs, PROC OPTMODEL is provided as part of SAS/OR, a component of SAS Optimization running on SAS Viya. To access PROC OPTMODEL, at least a SAS/OR license is required. In terms of system requirements, users should have at least SAS Base on which to run SAS/OR and other related SAS programs. Special features in SAS/OR may require additional SAS programs. For example, to enable graphics in SAS/OR, a SAS/GRAPH license is also required. To learn more about system requirements for SAS/OR, the authors recommend the official documentation such as SAS Institute (2017b, p. 27). Finally, when it comes to costs, like many other commercial optimization software packages (LINDO, IBM CPLEX, etc.), the license of SAS/OR and other related SAS programs needs to be purchased. The licensing costs depend on numerous factors and are subject to change from time to time. The authors refer those interested to the SAS company sales department and authorized vendors for the most up-to-date information on SAS/OR pricing and licensing options. For users interested in free alternatives, several are available, such as the R package xxIRT (Luo, 2018) and lp_solve version 5.5.2.5 with several of its API wrappers in R and Python.
Finally, the SAS-based ATA method proposed in this study focuses on in-house assembly of fixed forms that are to be administered (say, by computer) at a later point. Whether and how SAS/OR optimization can be extended to operationalize more complicated test designs (e.g., those requiring on-the-fly assembly during administration: adaptive testing, multi-stage testing, or a combination of both such as Chuah, Drasgow, and Luecht, 2006; Luecht, Brumfield, and Breithaupt, 2006; Zheng and Chang, 2014, 2015; Zheng, Wang, Culbertson, and Chang, 2014) needs additional, dedicated investigation. Among the technical difficulties, the main one is likely to be how to combine the capabilities of PROC OPTMODEL with simultaneous test administration platforms.
Conclusion
The study implements ATA in the SAS/OR software program. The proposed approach is effective and efficient and can be integrated into other aspects of data analyses in SAS and/or R to streamline the entire ATA process including pre-assembly and post-assembly data query, item data management and manipulation, form equivalence assessment, scoring, result presentation, among other things. Although the scope of the study is restricted to a simultaneous assembly of fixed forms measuring a single ability, the authors hope their work will bring about more research in this area to harness the power of SAS and its optimization software to investigate form assembly in more complicated test designs. Finally, the study addresses to some extent the issue of test reliability under IRT through minimizing the distance between the TIF evaluated at the cutoff and a corresponding target. Additional research is needed to further examine how ATA contributes to test validity as well as reliability. Such an investigation is imperative in high-stakes medical licensing examinations tasked to protect the public by ensuring examinees possess the level of knowledge and skills necessary to provide high-quality, safe and effective patient care (NBOME, 2018).
Supplemental Material
Supplemental material, Finalization_File3_ATA_Online_Supplement for Automated Test Assembly Using SAS Operations Research Software in a Medical Licensing Examination by Can Shao, Silu Liu, Hongwei Yang and Tsung-Hsun Tsai in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
