Abstract
Automated test assembly uses the methodology of mixed integer programming to select an optimal set of items from an item bank. Automated test-form generation uses the same methodology to optimally order the items and format the test form. From an optimization point of view, production of fully formatted test forms directly from the item pool using a simultaneous optimization model is more attractive than any of the current, more time-consuming two-stage processes. The goal of this study was to provide such simultaneous models both for computer-delivered and paper forms, as well as explore their performances relative to two-stage optimization. Empirical examples are presented to show that it is possible to automatically produce fully formatted optimal test forms directly from item pools up to some 2,000 items on a regular PC in realistic times.
Keywords
The two usual steps involved in test-form production consist of (a) selecting a set of items from an item bank and (b) ordering of the items and formatting the actual test form. So far, automated test assembly (ATA) has been focused mainly on the optimization of the first step using the methodology of mixed integer programming (MIP). However, the same methodology can be used to optimize the second step (automated test-form generation [ATFG]; van der Linden & Diao, 2011). As this step typically involves quite a bit of manual labor, automation of it is welcome.
More specifically, ATA involves the selection of the items for a test form from an item bank subject to a potentially large set of content and statistical specifications while optimizing their measurement properties. An MIP approach consists of the definition of 0-1 decision variables for the selection of the items and modeling the specifications as a set of constraints and an objective function that are linear in the variables (e.g., Chen, Batson, & Dang, 2010). The use of MIP for optimal test assembly is extensively reviewed in van der Linden (2005).
ATFG involves additional ordering and paging of the items upon their selection, as well as further formatting of their presentation. The result is typically required to satisfy several additional content, statistical, and layout specifications. Also, it is expected to be optimal in some sense. For instance, an obvious objective for a printed test form is minimization of the required number of pages. Full realization of this objective can lead to substantial savings, not only in the costs of printing but also of shipping and scanning of the returned forms. More details about test-form specifications relevant to ATFG can be found in van der Linden and Diao (2011).
Generally, simultaneous optimization of a sequence of two problems always outperforms their separate optimization. Hence, in the current context, the solution for a single MIP model that integrates ATA and ATFG can be expected to give better results than those produced by subsequent solution of two separate MIP models. For instance, when a completely formatted test form is generated directly from the item bank, it is likely to reflect a better ordering of the item attributes or to save more paper than through the separate steps of item selection and subsequent ordering and formatting of the much smaller set of selected items.
Obviously, when dealing separately with item selection and test-formatting problems, there are two optimization models each with a different objective function. But a simultaneous model can only have one function. General strategies for dealing with multiple objectives in one test-assembly model were presented in Veldkamp (1999). A simple strategy is optimizing the attribute in one of the objectives while constraining the other attribute to a set of satisfactory values. Other options include the use of weighted combinations of objective functions or goal programming (use of goal values for the attributes that are approximated as closely as possible). In the empirical examples that follow, the first approach is used. Although this choice was somewhat arbitrary, it had the advantage of not having to allow for potential scale differences when combining different test attributes into one objective.
A potential challenge to simultaneous modeling of ATA and ATFG, however, is the size of the MIP model involved. For example, as will become clear, for an item bank of size 2,000, test length of 50 items, and upper limit of the number of pages in the test form equal to 15, the number of variables in the model is equal to 1,500,000. Besides, although not necessarily critical, simultaneous modeling involves more constraints on the variables than either of the separate models for ATA and ATFG. Even though modern MIP solvers, such as the one in IBM ILOG OPL 6.3 (International Business Machines Corporation, 2009), have become exceptionally powerful, it is not yet clear if the size of the models involved in the integration of ATA and ATFG tend to be too large to find solutions in realistic time.
One of the goals of this research is to find out if test-production problems of realistic sizes can be solved. Following up on a modeling suggestion in van der Linden and Diao (2011), the authors first explore the possibilities of simultaneous optimization modeling of ATA and ATFG and document the nature of the candidate models, and then evaluate their performances against the case of separate models for ATA and ATFG for a series of applications with realistic item-bank sizes and test lengths. As shown later in this article, the running times necessary for the solver to obtain solutions for these applications were extremely encouraging.
Simultaneous MIP Models for ATA and ATFG
The field of educational and psychological testing now has two common modes of test delivery: computer and paper-and-pencil delivery. For computerized test delivery, the formatting is relatively simple. It basically consists of the ordering of the items, which are then presented on the computer screen one item at a time. For paper-and-pencil testing, the formatting process entails additional paginating and decisions about the layout of the pages. Both models of delivery thus involve potentially quite different sets of constraints and objectives. Simultaneous models for both modes of testing are provided.
MIP Modeling of Computer Delivery
As already noted, a computer screen needs to be fed one item at a time. The simultaneous model is thus for the selection of a set of items from the bank that satisfies all usual statistical and content specifications and is ordered with respect to possibly different quantitative and categorical item attributes.
Each possible test from a bank of I discrete items is labeled using a string of 0-1 variables x
ir
,
Using these variables, the following model illustrates several types of constraints that may be imposed on the selection of a formatted computerized test form from an item bank:
subject to
where
The objective function in Equation 1 maximizes quantitative attribute
The solution produced by the MIP solver consists of the formatted set of items that meet all constraints and has an optimal value for the objective function.
The example in Equations 1 to 9 is only to highlight several of the main types of constraints that may occur in the selection and formatting of computer forms. In real-world applications, multiple versions of these constraints are needed to deal with multiple categorical and quantitative item attributes, as well as possibly different additional types of constraints, for instance, to deal with items that share a stimulus (for formulations of these extensions and alternatives, see van der Linden, 2005). Also, for computer forms, we usually face a test-formatting problem with multiple objectives, for instance, a problem with both maximization of test information and approximation of an ideal order of its quantitative item attributes (typically item difficulty) or a quantitative item attribute nested within ordered categories of a content classification. In such cases,
MIP Modeling of Paper Forms
Paper forms tend to be more complicated to assemble, for instance, because we need to observe page breaks and typically want to minimize the use of paper.
To model this case, a new index p = 1, . . . , P is needed to denote the pages in the form, where P should be chosen large enough to produce a form. In addition, L p is used to denote the maximum number of lines available on page p and l i to represent the number of lines needed to print item i (including the white space following it). Decision variables x irp now take the value of 1 when item i is assigned to rank number r on page p; otherwise, they are equal to 0.
In addition, there are separate decision variables x p that take the value of 1 if any item is assigned to page p. These variables are used to minimize the number of pages in the test form.
An example of a model for the direct selection of a formatted paper form from an item bank is
subject to
The objective function in Equation 10 minimizes the number of pages in the form. Most of the constraints in the model represent the same type of specifications as those in the earlier model for a computerized test form. But the constraints in Equations 11, 12, and 17 are additional constraints necessary to keep the assignment of items to the ranks and pages consistent.
Both the example for the computer form and paper form are for a test without an item-set structure. For test with such structures, the authors introduce decision variables for the selection of their stimuli as well as constraints to keep the values for the variables for the items and stimuli consistent. For their definitions, see van der Linden and Diao (2011, Equations 7-10).
Empirical Examples
The authors demonstrate the methodology for the two different cases of a computerized and a paper form assembled to satisfy various content, statistical, and formatting specifications. More specifically, they used an existing set of specifications for the assembly of test forms, along with an existing item pool from a large-scale mathematics achievement testing program. The set of specifications was as follows:
The length of the test is constrained to a fixed number.
The test information function (TIF) has to match an absolute target at five equally spaced ability points.
The items in the forms belong to eight different content categories: measurement (M); problem solving and reasoning (PS&R); number and number relations (N&NR); operation concepts (OC); computation and numerical estimation (C&NE); geometry and spatial sense (G&SS); data analysis, statistics, and probability (DAS&P); and patterns, function, algebra (PFA).
The maximum length of a sequence of items from the same content category is constrained to a fixed number.
For each of the four possible answer keys, the number of items selected has to be higher than a given lower bound.
The length of a sequence of items with the same answer key has to be lower than a given upper bound.
All items have to have a p value between a given lower and upper bound.
The first 10 items are required to have p values higher than a given lower bound.
In addition, the paper form had to meet the following requirements:
The number of pages has to be minimal.
For each page in the paper form, the maximum number of lines available is 50.
The first goal of this study was to model all these specifications as a simultaneous optimization model integrating ATA and ATFG both for computer and paper forms. The model for the computer form is presented in Appendix A; the model for the paper form is given in Appendix B.
Second, the authors evaluated the results for this simultaneous model against those for the traditional two-step approach based on the separate ATA and ATFG models in Appendix C.
The last goal was to evaluate the performance of the solver for different sizes of these models. More specifically, 10 different scenarios were simulated, each existing of a combination of the following two factors:
item pool size (100, 200, 500, 1,000, and 2,000 items)
test length (25 and 50 items)
The scenarios were repeated both for the production of computer and paper form, and for each scenario, the results for the simultaneous model and the two separate ATA and ATFG models were compared. The different sizes of the item bank were obtained by sampling the items from the original bank of 253 items, which had been calibrated using data from a national field test with more than 10,000 students, and randomly perturbing the Item Response Theory (IRT) parameters. All other item attributes were maintained. For the actual bounds on the numbers of items from each content category and the numbers of items available in the item bank, see Table 1. For each item, the number of lines required to print it was known. To adjust for the differences in test length between the two conditions, the bounds in the constraints in the model were changed proportionally.
Content Category Requirements for the Main Item Bank.
Note: N&NR = number and number relations; C&NE = computation and numerical estimation; OC = operation concepts; M = measurement; G&SS = geometry and spatial sense; DAS&P = data analysis, statistics, and probability; PFA = patterns, function, algebra; PS&R = problem solving and reasoning.
Each of the applications was run using the MIP solver in IBM ILOG OPL Version 6.3 (International Business Machines Corporation, 2009) on a computer with Intel(R) Xeon(R) CPU, 2.8 GHz, and 12 GB of RAM, with the absolute gap parameter set equal to 0.001 and the relative gap parameter to 0.01.
Of course, each solution obtained satisfied all constraints and had an optimal value for its objective function. But the authors’ interest was primarily in the comparison of the objective function values for the solutions to the simultaneous models and the second models in the two-stage approach as well as the performances of the solver for the more demanding simultaneous model.
As for the former, the solution for the simultaneous model was always 1 to 2 pages shorter than for the ones for the two-stage approach. A typical example was the case of a paper form of 25 items from a bank of 200 items. The solution for two-stage production of the test form yielded a total of seven pages, whereas the solution produced by the simultaneous model required only five. Nevertheless, as shown in Figure 1, the TIFs for the forms selected by the two approaches were equally close to their common target (represented by *).

Test information functions for the separate and simultaneous models.
The running times of the solver for the simultaneous models are given in Table 2. The conclusions can be summarized as follows:
Running Times for the Simultaneous Models (rounded to nearest second).
Running time after problem slicing (see text).
For the computer forms of 25 items, the running times for the simultaneous model were always less than 2 s. The increase of the length of the forms to 50 led to times that were always less than 5 s. Both results held even for an item bank as large as 2,000 items.
For the more complicated models for the paper forms, the running times were generally longer: For the test forms of 25 items, they were always less than approximately 3 min. For the forms of 50 items, the running times increased to somewhat less than 9 min for the pool of 1,000 items. A further increase of the pool to 2,000 led to an initial “out of memory” message. But the size of the model could be reduced using slicing, that is, iterating efficiently over sparse sets, instead of overall combinations. For more details about the technique of slicing, one can refer to International Business Machines Corporation (2010). By doing so, the solver produced an optimal solution in approximately 24 min.
Conclusion and Discussion
Although the literature offers several ATA models for the selection of an optimal set of items from an item bank, and recently ATFG models for the formatting of the test form have been added, it still missed the more efficient option of a simultaneous modeling approach that would allow integration of the traditional two steps involved in test-form production. The goals of this research study was to provide simultaneous models both for computer and paper forms and to explore the increase in the quality of their solutions relative to those for the earlier separate models.
The solutions for all the separate and simultaneous models satisfied all content and other constraints, and had comparable TIFs. But, as anticipated, the values of the objective functions for the simultaneous model were consistently more favorable (1-2 fewer pages for the paper forms).
The performances by the solver for the simultaneous models were remarkable for pools up to 2,000 items and a test length of 50 items (which for a problem with a maximum of 10 pages amounts to a model with some 1,000,000 variables). The initial memory overflow for the biggest problem is typical of modern MIP solvers. They have now become so powerful that, rather than having to wait forever, overflow messages warn us when the problem has become too big. Also, techniques to improve model efficiency, such as reduction of the number of invalid combinations of the decision variable values using sparse model data and slicing, are now standard options for most commercial solvers.
For future applications, the performance of the solver needs to be assessed on a case-by-case basis. The total number of decision variables is only one of the factors with an impact on the performance of MIP solvers; others are the structure and sparseness of the coefficient matrix for the constraints (i.e., for a test-assembly problem, the composition of the item bank and the nature of the test specifications). However, in this research, the authors did not attempt to use more powerful machines; neither did they use the optimization settings available for the solver to tune its performance for the current problems.
Finally, the current examples were for the case of assembling a single form. The generalization to multiple test forms f = 1, . . . , F from the same item pool is straightforward. All that is to be done is extend the definition of the variables to binary variables xif for the decision whether or not item i is assigned to form f and, if necessary, add extra constraints to control the overlap between the forms. Although it may seem more attractive to just run a model for a single form repeatedly, the authors are not in favor to this alternative approach, which will tend to capitalize on the best items in the pool for the first forms and produce worse later forms. If the performance of the solver becomes an issue, they recommend a big-shadow test approach (van der Linden, 2005), which produces single forms but uses shadow tests to balance between the quality of earlier and later test forms, as an alternative.
Footnotes
Appendix A
Appendix B
Appendix C
Acknowledgements
The authors would like to thank the anonymous reviewers for their useful comments and suggestions.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
