Abstract
This rejoinder responds to the commentary by van der Linden and Li entiled “Comment on Three-Element Item Selection Procedures for Multiple Forms Assembly: An Item Matching Approach” on the article “Three-Element Item Selection Procedures for Multiple Forms Assembly: An Item Matching Approach” by Chen. Van der Linden and Li made a strong statement calling for the cessation of test assembly heuristics development, and instead encouraged embracing mixed integer programming (MIP). This article points out the nondeterministic polynomial (NP)–hard nature of MIP problems and how solutions found using heuristics could be useful in an MIP context. Although van der Linden and Li provided several practical examples of test assembly supporting their view, the examples ignore the cases in which a slight change of constraints or item pool data might mean it would not be possible to obtain solutions as quickly as before. The article illustrates the use of heuristic solutions to improve both the performance of MIP solvers and the quality of solutions. Additional responses to the commentary by van der Linden and Li are included.
Keywords
Using mixed integer programming (MIP) with commercially available solvers such as CPLEX (IBM, 2009) when formulating a problem provides great flexibility, and its general usefulness in optimization problems has encouraged its wide application in diverse fields such as transportation and manufacturing. This is especially true because such commercial programs began to be coupled with the advancement of computer technology in recent years. Although heuristics are usually constructed for specialized problems and require customized programming, they can quickly find a reasonable solution. Writing your own program with any programming language you like or choosing commercially available MIP solvers that require some training in optimization and integer programming presents various advantages and disadvantages. It may even be tempting to stop developing heuristics and only rely on MIP solvers to solve test assembly problems. However, before one makes such a strong statement as van der Linden and Li (2016) did, let us look into how the MIP approach works and how solutions found in heuristics can be used to improve the performance of MIP solvers.
MIP Approach Versus Heuristics
In the world of optimization, it is common practice to relate computation time to problem size, which is usually described in terms of the number of constraints and variables (Wolsey & Nemhauser, 1999). According to Stocking and Swanson (1993), item pool sizes range from 300 to 5,000 whereas the number of user-specified constraints ranges from 50 to 300 in test assembly practice. Phone consultations with current practitioners show that the operational item pool size can reach about 15,000 items. Notice that the number of user-specified constraints and item pool sizes are among the factors that determine the number of constraints and variables in MIP. Table 1 shows how the number of constraints and variables varies under different item pool sizes, test length, number of forms, and problem types. Integer programming belongs to the class of NP-complete (nondeterministic polynomial–complete, which is a subset of the NP-hard class), which means that no polynomial time algorithms can be found unless P = NP. In other words, P = NP is deemed unlikely, and solution time may grow exponentially as the problem size grows, in the worst case scenario. The statement above that Chen (2016) mentioned does not try to “predict” computation time for an MIP problem, but rather to emphasize its computation intensiveness, which may require faster CPU speed and larger RAM as compared with heuristics.
Summary Table for Problem sizes, Relative Gap Settings, Solutions Type, and Objective Function Obtained in Three Different MIP Methods Under Different Conditions.
Note. MIP = mixed integer linear programming; IM = item matching; TIF = test information function; TCC = test characteristic curve.
In fact, an MIP solver is a potentially exhaustive search; in many cases, its search space may be reduced by implementing special techniques. There is no guarantee that these special techniques will reduce the search space significantly enough to make the problem solvable within a reasonable time frame. For example, for a binary integer programming problem such as a test assembly with item pool size N, one may need to enumerate
The Mixed Integer Problem Library (MIPLIB; Konrad-Zuse-Zentrum für Informationstechnik Berlin, 2016) is an official public website that maintains both MIP problems that are currently solvable with today’s codes and unsolved MIP problems that are still open. Many of the open problems have quite small problem sizes. In other words, there are still many MIP problems that can require hours or even days to obtain an optimal or near optimal solution even with state-of-the-art optimization solvers and high-speed machines (Klotz & Newman, 2013). That is part of the reason why research in the optimization field continues to progress and why optimization software developers continue to update solvers’ performances every year.
To make it more specific, the fact that a test assembly problem is NP-hard means there is no guarantee that one may not run into the worst case scenario of enumerating all the
Here is one reason why heuristics are useful in practice when one deals with NP-hard problems. If one relies solely on MIP, one might make a small change to the model or only to the data over which one optimizes, and suddenly, the solution time can explode. This is because the solver has to enumerate those
On the other hand heuristics are usually polynomial time algorithms and require less computation time in general. Furthermore, solutions obtained from heuristics can be used in MIP solvers such as CPLEX to jump-start its search. Even if heuristics fail to satisfy all constraints, partially feasible solutions obtained from heuristics can also be used as a jump-start to speed up MIP solvers. An illustration of how solutions obtained from heuristics can improve the performance of MIP solvers will be described in the next section.
Using Solutions Obtained From Heuristics to Improve MIP Solvers’ Performance: An Example
The author ran a test assembly problem using the data from Chen’s (2016) Study 3. The item pool consisted of 6,000 items, and five 30-item parallel forms were required. The author used conditions including content constraints, cognitive constraints, response time restraints, and enemy items as used in Study 3. In addition, the constraint was added of a maximum of five overlapping items among different forms with none of the items appearing in more than two forms. CPLEX 12.6 was used in an Intel i7-3700 3.4 GHz CPU on a 16GB RAM desktop computer. With the item matching (IM) objective function and the test assembly requirement described above, the IM MIP had 3,666,555 constraints and 1,050,002 variables. CPLEX took 30 min and 39 s to obtain an optimal solution with the relative gap setting of 2%. The solution gap to the best optimal solution possible was 0.67%. Using the solution obtained from heuristics (Chen & Huang, 2016a) to initiate the jump-start, CPLEX took only 3 min and 36 s to obtain an optimal solution with a zero solution gap to the best optimal solution possible. Both the running time and the quality of the solution were thus significantly improved.
Response to Comments on Study 1
Chen’s (2016) Study 1 was replicated here. The IM MIP objective function was replaced by the sum of interitem distances between one form and the target test as the new objective function, to obtain better direct control of the form-to-target distances per form. The results of Chen’s Study 1 suggest that the commonly used TIF MIP method only considers TIFs and does not yield similar test characteristic curves (TCCs); therefore, an additional test equating procedure is required. The TIF MIP method was replaced by TIF+TCC MIP to address the practical requirements in TCCs and TIFs. The maximum value of the sum of TIF deviations and TCC deviations from the target form at 13 ability points was used as the objective function for TIF+TCC MIP. The TIF and TCC results for each of the five item selection methods are shown in Figure 1 and Figure 2. TIF+TCC MIP performed better in terms of its TIFs and TCCs because its objective function addressed the two criteria directly. The results of the IM MIP and IM heuristics all look quite similar in terms of hitting the TIFs and TCCs.

Test information functions for the five item selection methods.

Test characteristic curves for the five item selection methods.
In Chen’s (2016) Study 1, the purpose of adding the D into Equations 9 to 13 was to reduce form-to-form differences. Since D cannot be larger than 2 times y, D can be dropped from the objective function. Even if D in Equations 9 to 13 was later dropped from the objective function to make it easier to solve, the solver still could not prove to be optimal within 30 min.
The IM Heuristics, IM MIP, and TIF MIP
The purpose of the Chen (2016) study was to propose an IM heuristic and compare it with Armstrong’s IM heuristic. Chen also wanted to replicate Armstrong’s IM MIP method to see how the running time could be improved after 20 years of modern advances in computer technology. It was not the main purpose of the Chen study to compare the IM heuristic and TIF MIP method. That is why the author only included the TIF MIP in the first study. Adding the results of the TIF MIP method served three purposes. First, because TIF MIP is a common practice in test assembly, it was used as a baseline comparison for the IM approach. Second, given the test assembly problem and data used in Chen’s Study 1, the results showed that the TIF MIP could not provide an optimal solution within a realistic time period, and additional parameter setting tuning is required to obtain a feasible solution. Third, comparing the TCC results highlighted the strength of an IM approach which fits the TIF and TCC by mimicking item parameters distributions (Armstrong, Jones, & Wu, 1992). Matching item parameters directly produced strictly parallel test forms (Lord, Birnbaum, & Novick, 1968), whereas forms produced by matching TIF were considered weakly parallel (Samejima, 1977). In item response theory (IRT), TCC relates the IRT ability to a number-correct true score and is used when equating true scores. When automated test assembly can produce parallel forms with similar TCCs and TIFs, it implies that forms are psychometrically parallel and no extra equating procedure is necessary.
The IM MIP method and IM heuristics are considered to be the same kind of problem because the purpose of each is to reduce interitem distances from a seed test. On the other hand, the TIF MIP method uses a totally different objective function, and it is therefore not the same kind of problem as IM. The author does not agree that the IM MIP method is inefficient when compared with the TIF MIP method. These are simply different problems with different objective functions.
Of course, using certain points in TIF and TCC as objective functions or item-to-target distances (IM) shows major differences in problem size in terms of the number of constraints and variables. Table 1 summarizes the problem size, objective functions, and relative gap settings for three different methods: IM MIP, 5-point TIF+TCC MIP, and 13-point TIF+TCC MIP. The huge differences in problem sizes between TIF MIP and IM MIP may make the latter look like an undesirable model. The author ran 13-point and 5-point TIF+TCC MIP methods with the data and test assembly constraints used in Chen’s (2016) Study 3. Table 2 summarizes the total running time for the R-R-B, R-MD-B, IM MIP, 5-point TIF+TCC MIP, and 13-point TIF+TCC MIP. The IM MIP used the revised objective function described earlier. It can be seen that for the same kind of test assembly specifications, in general, the IM MIP took more time in solving the problem than did the two heuristics.
The Total Running Time (in Seconds) of the Five Item Selection Methods Under Different Conditions.
Note. IM = item matching; MIP = mixed integer linear programming; TIF = test information function; TCC = test characteristic curve; R-R-B = Random item sequence assigned to random form with content balancing; R-MD-B = Random item sequence assigned to maximum distance form with content balance.
The running time of the two heuristics methods was recorded after 100 successful trials were obtained.
Used a larger relative gap of 1% to obtain solutions more quickly.
The 5-point and 13-point TIF+TIF MIP’s running times were preset to 1 hr to 10,000 s, and all stopped after reset time limit exceeded.
However, the modification of objective function for the IM MIP does not change the fact that IM MIP takes more time to find a solution than the R-R-B or R-MD-B because of its computational extensiveness nature. Under almost all conditions for the 5-point TIF+TCC MIP and 13-point TIF+TCC MIP, the solver could not close the gap between the best feasible solution and the best possible optimal solution within 1 or 3 hrs. In short, problems were unsolvable within a reasonable time. Setting up the relative gap to be 2% as van der Linden and Li (2016) suggested did not work at all in this case because the best possible optimal solution is zero. In this case, one instead needs to set up an acceptable absolute gap based on his or her past experience. The current default setting for absolute gap is
Response to Comments on Study 2
Due to the page limitation for articles published in Applied Psychological Measurement, the table of the total running time of the five item selection methods in Study 2 was deleted in Chen (2016). Those results are shown in Table 3 as requested. Readers should refer to the discussions of the results in Chen’s Study 2.
The Total Running Time (in Seconds) of the Five Item Selection Methods Under Different Conditions in Chen’s (2016) Study 2.
Note. The running time of the four heuristics methods was recorded after 100 successful trials were obtained. IM = item matching; MIP = mixed integer linear programming. R-R-R = Random item sequence assigned to random form with Armstrong’s replacement; R-MD-R= Random item sequence assigned to maximum distance form with Armstrong’s replacement.
Indicates that the algorithm ran for 1,000 infinite loops and could not obtain any solution.
Indicates that the CPLEX could not find any feasible solutions and stopped due to running of out of memory while solving the model.
Response to Comments on Study 3
The purpose of Chen’s article is not to provide a general purpose heuristic approach that seeks to solve every test assembly problem. Chen’s (2016) Study 3 added some of the most commonly used constraints in practice, such as content and cognitive constraints, enemy items, and response time. The item overlap was set to zero because, in most cases, test forms are usually linked by means of prechosen anchor items by psychometricians and content experts. Letting the solver (or robot) decide the anchor items may not be desirable in practice.
Actually, heuristics with a form-to-form overlap with the additional requirement that common items cannot appear in more than two forms have been developed (Chen & Huang, 2016a) on top of Chen (2016). In response to van der Linden and Li (2016), the R-R-B and R-MD-B item selection algorithms in Chen (2016) regarding content, cognitive constraints, enemy items, and response time are described in the appendix. For dealing with passage constraints such as set-based items, Chen and Huang (2016b) extended the IM heuristic method to incorporate complex test specifications, including set-based item constraints. In addition, the IM approach has recently been implemented in a large-scale Chinese proficiency test to assemble 50 forms (Wang, Zheng, Zheng, Su, & Li, 2016).
Response to Other Comments
Is Matching 3 to 5 Ability Points Enough?
Van der Linden and Li (2016) stated that matching 3 to 5 ability points to fit TIFs and TCCs could produce satisfactory results. However, the author used Study 1 item pool and test specifications and matched the target test’s TIFs and TCCs on 3 (

TIFs for the eight MIP conditions.

TCCs for the eight MIP conditions.
Optimizer Time Limit
Van der Linden and Li (2016) mentioned that the default optimizer time limit in CPLEX is 1 min. According to the IBM knowledge center (IBM, 2016) and the CPLEX software, the optimizer time limit is
Conclusion
In summary, neither the heuristics approach nor the MIP approach should be intended to completely replace the other. In fact, heuristics could serve to complement MIP methods when MIP solvers are having difficulty solving a problem within reasonable time. When MIP solvers can quickly find good solutions, one does not need to develop heuristics. However, when MIP problems become unsolvable, it makes sense to develop heuristics to quickly find a feasible solution. In addition, one can use heuristics to obtain a feasible solution while reducing the running time and increasing the quality of solutions found by MIP solvers. Even partially feasible solutions obtained from heuristics can be used in MIP solvers. In other words, solutions obtained from heuristics can be used to improve the performance of MIP solvers.
With advancements in optimization and computational technology, MIP solvers have been widely used in various fields. In an ideal world, users can simply formulate their problem as an optimization problem, write some syntax, and use MIP solvers like a black box to obtain solutions. However, MIP approach may not always provide solutions while users do not have access to the source code, and other approaches should also be considered (Chang, 2007). The author appreciates the opportunity for further discussion. Discussions in Chen (2016), van der Linden and Li (2016), and this article have clearly shown that users could not use MIP solvers without any basic knowledge and training in integer programming. Practitioners should be trained to know how to adjust various parameter settings when they run into problems such as the ones that are unsolvable within a reasonable time period or that cause the platform to run out of memory. Test assembly problems are considered difficult MIP problems, and readers who want to know more about solving such problems could refer to practical guidelines for solving difficult MIP problems (Klotz & Newman, 2013).
Footnotes
Appendix
Acknowledgements
The author would like to thank two anonymous reviewers and the editor-in-chief for their helpful comments on earlier versions of this article. The author is grateful for the time and care given by those colleagues during phone consultations. Thanks also goes to Hsu Chia-Wei and Huang Cheng-Yi for their assistance in running programs. Special thanks to Daniel Junglas for his support and assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
