Abstract
Optimal test-design methods are applied to rule-based item generation. Three different cases of automated test design are presented: (a) test assembly from a pool of pregenerated, calibrated items; (b) test generation on the fly from a pool of calibrated item families; and (c) test generation on the fly directly from calibrated features defining the item families. The last two cases do not assume any item calibration under a regular response theory model; instead, entire item families or critical features of them are assumed to be calibrated using a hierarchical response model developed for rule-based item generation. The test-design models maximize an expected version of the Fisher information in the test and control critical attributes of the test forms through explicit constraints. Results from a study with simulated response data highlight both the effects of within-family item-parameter variability and the severity of the constraint sets in the test-design models on their optimal solutions.
The field of educational and psychological testing has shown a recent trend toward automation, for instance, in the form of new technology for automated item generation (Irvine & Kyllonen, 2002), automated test assembly (van der Linden, 2005), computerized adaptive testing (van der Linden & Glas, 2010), and automated scoring and feedback (Xi, 2010). As for automated item generation, if the cognitive processes involved in solving the test items are known, rules can be defined that create the possibility of real-time automated item generation during test taking (Irvine & Kyllonen, 2002). In addition to a potential increase in cost-effectiveness for large-scale testing programs, the use of explicit item-construction rules in automated item generation also furthers the content validity of tests.
The main goal of this article is to look ahead and explore the integration of rule-based item generation and automated test assembly. This integration is expected to be the next step in the trend toward automation of testing. Successful integration avoids the more cumbersome process of having a computer generate an entire inventory of new test items in a few seconds, followed by days of manual work to design and assemble the desired test forms.
However, for this goal to be feasible, several issues need to be addressed. One important issue is how to deal with the fact that the psychometric properties of rule-based generated items are not necessarily a priori known. The solution adopted in this research is calibration of entire families of rule-based generated items that share critical features but differ in minor, less relevant aspects, using hierarchical item response theory (IRT) modeling. More specifically, in addition to a regular response model, a different second-level model is adopted for each family representing its distribution of item parameters, which will be taken to be multivariate normal.
For this setup, item calibration can be replaced by “family calibration,” that is, estimation of the vectors of mean item parameters and covariance matrices for the families using only samples of items from each family (Glas & van der Linden, 2003). Once this has been done, any new item generated from a family does not need to be calibrated; instead, its known distribution can be used to design, assemble, and score test forms. In addition, an extension of the hierarchical model will be used, with a linear structure on the mean difficulties of the items in each family representing the critical features (or radicals) of the families used to generate items (Geerlings, Glas, & van der Linden, 2011). This extension allows for the effects of the radicals on the item families to be estimated and used to precalibrate the families.
The fact that only the family parameters are known requires an adjustment of regular optimality criteria for the design or assembly of test forms. For the case of adaptive testing from a pool with item families, this issue was already addressed by Glas and van der Linden (2003). They used a Bayesian approach with a minimum expected posterior variance criterion for the selection of the items, where the expectation was taken over the family distributions of the items to allow for the remaining uncertainty about their parameter values. In this research, the same idea was followed, but Fisher’s information integrated over the family distributions was used. Thus, instead of the regular item-information function, a “family-information function” for each of the families was used.
Obviously, the success of family calibration and the use of family-information functions depend on the degree of within-family item-parameter variability across all families. The former has already been investigated by Geerlings et al. (2011) and Glas, van der Linden, and Geerlings (2010). One of the goals of the current research was to assess the effect of within-family item-parameter variability on optimal test design and assembly.
In the next sections, rule-based item generation, a model for the calibration of item families, and the family-information function used are reviewed first. Then, optimal test design for three different cases of item generation is discussed: (a) test assembly from a pool of pregenerated, individually calibrated items; (b) test generation on the fly from pools with calibrated item families; and (c) test generation on the fly using calibrated radicals that define the item families. The first case is straightforward; the only difference from current automated test-assembly practices resides in the way the items are obtained. The case serves as a baseline to evaluate the more innovative next two cases, which do not assume any items that have already been generated but specify what item families should be sampled or what item-generation rules should be used to generate the test forms. Using simulated data, the effects of within-family uncertainty about the item parameters and the presence of different numbers of content constraints on the assembled test forms will be evaluated. The article concludes with some practical observations regarding the use of optimal test design for rule-based item generation.
Rule-Based Item Generation
An important distinction is between item features that influence the difficulty of the items and those that have only negligible effects. The former have been called radicals (Irvine, 2002); their systematic use can help to ensure the content validity of the items. The latter are known as incidentals; they only create surface variation and do not have any systematic effect on item difficulty (Irvine, 2002). Hence, they will be treated as random effects. Generating items with identical radicals but different incidentals is often referred to as “item cloning.” Items cloned from the same combination of radicals are thus expected to have similar psychometric properties but a different “look” (Bejar et al., 2003; Bejar & Yocom, 1991). For example, in the context of statistical word problems, the computations required to solve the problem could be treated as radicals but the topics used in the context stories are assumed to result in incidental variation only (Zeuch, 2011, chap. 7).
Hypotheses regarding the distinction of item features as either radicals or incidentals are typically developed through a cognitive analysis of the item domain. In the literature, such hypotheses have often been tested by coding existing items on the presence or absence of the specified radicals and performing, for example, a regression analysis on IRT-based difficulty parameters (Gorin & Embretson, 2006). Alternatively, the items have been analyzed with Fischer’s linear logistic test model (Fischer, 1973). The size of the estimated effects or the amount of variance explained by them is then taken as a measure of the validity of the hypotheses. A drawback of this approach is the potential difficulty of objectively coding the presence of the radicals in the items, which may manifest itself in low interrater reliabilities (Gorin & Embretson, 2006).
A more effective approach would be to test the hypotheses directly on data collected on items generated according to the hypothetical radicals and incidentals. This predictive, model-checking approach is possible using, for example, the hierarchical model discussed in the next section. The approach forces the developers of rule-based item-generation software to explicitly define their radicals and incidentals and show how they are to be translated into computer algorithms.
Item generators based on the use of radicals and incidentals have been developed both for the generation of open-ended items (e.g., algebra word problems, Arendasy & Sommer, 2007; statistical word problems, Boer Rookhuiszen, 2011; Theune, Boer Rookhuiszen, op den Akker, & Geerlings, 2011) and for multiple-choice items (e.g., figural matrices, Arendasy, 2005; directions and distances items, Dennis, Handley, Bradon, Evans, & Newstead, 2002; literacy items, Dennis et al., 2002). In general, whereas multiple-choice items have the advantage of simple scoring, the need to generate distractors for them as well may complicate automated generation. The effects of distractor features on the difficulty of an item are generally difficult to predict, and this topic certainly deserves more research.
If certain combinations of radicals and incidentals result in invalid items, the item generator can be supplied with constraints to exclude such combinations. Also, constraints can be added to avoid the presence of cognitive processes in item solving that are not construct related (Arendasy & Sommer, 2007).
For more information on rule-based item generation, the reader is referred to the edited volumes by Irvine and Kyllonen (2002) and Gierl and Haladyna (2012).
Modeling Rule-Based Item Generation
The joint use of radicals and incidentals in item generation results in families of items, with between-family variation caused by combinations of radicals and within-family variation by the incidentals. The hierarchical structure of items grouped into families can be explicitly taken into account through multilevel response modeling. Let
In this article, for numerical reasons, the three-parameter normal-ogive (3PNO) model is used as first-level model:
where
The item parameters are transformed as
to facilitate the assumption of a multivariate normal distribution as second-level model:
with
A linear structure on the mean difficulty parameters
where
Observe that the use of Equation 4 involves the specification of an
The basic model restricted by Equation 4 will be labeled the linear item cloning model (LICM). Furthermore, the model with family-specific covariance matrices will be labeled the LICM-F, whereas the model with a common covariance matrix is referred to as the LICM-C.
A Gibbs sampling algorithm to estimate the parameters of a reparameterization of the models in a Bayesian manner was presented in Geerlings et al. (2011, Appendix). Unlike regular item calibration, where the item parameters are treated as fixed effects, estimation of the family parameters involves random samples of items from the families. This means that any constraints on specific combinations of radicals and incidentals in the item generator should also be imposed when generating a calibration sample of items. Geerlings et al. presented a parameter recovery study that shows the precision with which the family parameters are estimated for specific numbers of families, items per family, and persons taking the items.
As already indicated, this model framework can be used to test hypotheses regarding the effects of the radicals and incidentals used in the item-generation rules. In this regard, it is important to check the general validity of the first-level response models (unidimensionality, monotonicity, and local independence) as well as the second-level models (LICM: linearity; no residual variance in the prediction of the family difficulties by the radicals; multivariate normality of the family distributions; LICM-C: homogeneity of within-family item-parameter variability across families). In addition, the hypotheses on the impact of the radicals on the item families can be tested by checking the validity of versions of the model with alternative design matrices. For statistics to test such hypotheses and empirical examples, see Geerlings (2012, chap. 4).
Family-Information Function
Generally, item-information functions are defined as
(see, for example, van der Linden & Pashley, 2010, Section 1.2.1). This type of information function can be used to assemble test forms from a pool with items pregenerated by the computer and individually calibrated upon their generation.
In the two other cases of test design, the parameters of the individual items are unknown but the item families have been calibrated; that is, their second-level parameters
A convenient approximation of the integral can be obtained by Monte Carlo integration. Details of the approximation are given in Appendix A. Besides, Appendix B shows the results from a small simulation study investigating the number of iterations needed to obtain a required precision.
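The Monte Carlo idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of Appendix A: it assumes the 3PNO response function has the form c + (1 - c)Φ(aθ - b), that the transformed item parameters are (log a, b, logit c), and that the family distribution is multivariate normal on that scale. The function names, the default number of draws, and the seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi and phi


def item_information(theta, a, b, c):
    """Fisher information of one item, assuming P = c + (1 - c) * Phi(a*theta - b)."""
    z = a * theta - b
    p = c + (1.0 - c) * _STD_NORMAL.cdf(z)
    if p >= 1.0:  # numerical guard for extreme z
        return 0.0
    dp = (1.0 - c) * a * _STD_NORMAL.pdf(z)  # derivative of P with respect to theta
    return dp * dp / (p * (1.0 - p))


def cholesky(cov):
    """Lower-triangular factor of a (possibly singular) covariance matrix."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(cov[i][i] - s, 0.0))
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j] if low[j][j] > 0 else 0.0
    return low


def family_information(theta, mean, cov, draws=2000, seed=0):
    """Average item information over draws from the family distribution.

    mean and cov describe (log a, b, logit c); this ordering is an assumption.
    """
    rng = random.Random(seed)
    low = cholesky(cov)
    total = 0.0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        xi = [mean[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        a = math.exp(xi[0])
        b = xi[1]
        c = 1.0 / (1.0 + math.exp(-xi[2]))
        total += item_information(theta, a, b, c)
    return total / draws


if __name__ == "__main__":
    mean = [0.0, 0.0, math.log(0.2 / 0.8)]  # a = 1, b = 0, c = 0.2
    cov = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # difficulty variance only
    print(family_information(0.0, mean, cov))
```

With a zero covariance matrix every draw equals the family mean, so the sketch returns the ordinary item information evaluated at the family means.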
Three Cases of Automated Test Design
The approach to automated test design is based on mixed integer programming (MIP; for a general introduction, see van der Linden, 2005). Three different cases are distinguished, differing in flexibility of test generation.
The first case is assembly of test forms from pregenerated, calibrated item pools. In this case, the computer has already been used to generate all the items in the pool and each of them has been calibrated prior to operational testing. Whereas a stand-alone response model such as the 3PNO model could be used to estimate the item parameters, the authors suggest the use of the (L)ICM, because, in a Bayesian fashion, it allows the item-parameter estimates to borrow strength from their family means, leading to more precise estimates. The test forms are assembled from the pool using the regular item-information functions calculated from the estimated item parameters. The functions are used to define the objective function in the MIP model for the assembly of the tests. The content constraints in the model are derived from the radicals and incidentals that define the item families.
The second case is test generation on the fly from a pool of calibrated item families. In this case, the family parameters
The third case goes one step further and involves test generation on the fly directly from calibrated radicals defining the item families. This case is possible for the model with a linear structure on the mean difficulty parameters and a common covariance matrix, in combination with the assumptions of common family means of the discrimination and guessing parameters. Once these common parameters and the effect parameters for the radicals,
The next sections discuss each of the three cases in more detail. For each case, the type of MIP model that can be used is discussed.
Test Assembly From a Pregenerated Item Pool
Assembling a test from a Pregenerated Item Pool with families of items with calibrated parameters
The decision variables are used to formulate the objective function and constraints in the optimization model presented below. The objective function maximizes the total information in the test subject to a set of weights
The following model can be used to assemble a test form with
subject to
with
The equations only represent the core of an optimization model for test assembly from a pregenerated pool of items. In a real-world application, several other types of constraints may have to be added; for examples, see van der Linden (2005). A solution to the optimization problem in Equations 7 to 15 is a string of values for the 0-1 decision variables that maximizes its objective function. It can be found using a solver of the branch-and-bound type (Williams, 1993, Section 6.2). In the later examples, the MIP solver in CPLEX 9.0 (ILOG, 2003) was used.
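To make the structure of such a model concrete, the following Python sketch selects a fixed-length test that maximizes the height of the test-information function relative to a set of target weights. It is a toy stand-in, not the MIP formulation of this article: it enumerates all subsets instead of calling a branch-and-bound solver, so it is only feasible for very small pools. The function name, the optional per-family limit, and the example numbers are hypothetical.

```python
from itertools import combinations


def assemble(info, weights, n_items, family=None, max_per_family=None):
    """Exhaustive maximin selection for a relative information target.

    info[i][k]  information of item i at ability point k
    weights[k]  positive relative target weight at ability point k
    Returns the best subset of item indices and its objective value y,
    where y is the largest value with sum_i info[i][k] >= weights[k] * y for all k.
    """
    best_y, best_set = -1.0, None
    for subset in combinations(range(len(info)), n_items):
        # Hypothetical content constraint: limit the number of items per family.
        if family is not None and max_per_family is not None:
            counts = {}
            for i in subset:
                counts[family[i]] = counts.get(family[i], 0) + 1
            if max(counts.values()) > max_per_family:
                continue
        y = min(
            sum(info[i][k] for i in subset) / weights[k]
            for k in range(len(weights))
        )
        if y > best_y:
            best_y, best_set = y, subset
    return best_set, best_y


if __name__ == "__main__":
    info = [[0.5, 0.1], [0.1, 0.5], [0.3, 0.3], [0.2, 0.2]]  # made-up values
    print(assemble(info, weights=[1.0, 1.0], n_items=2))
    print(assemble(info, [1.0, 1.0], 2, family=[0, 0, 1, 1], max_per_family=1))
```

Adding the family constraint in the second call removes the unconstrained optimum from the feasible set, so the objective value can only stay the same or decrease.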
Test Generation on the Fly From Calibrated Families
For the second case, the problem changes from the assembly of a test from a Pregenerated Item Pool to one of designing a test, that is, identifying the families from which to generate the items for the test. The solution to the MIP model identifies these families. The actual items in the form are generated by randomly sampling the incidentals.
The MIP model for this type of generation on the fly contains decision variables for the families,
subject to
Again, the model is for the selection of
Test Generation on the Fly Using Calibrated Radicals Only
This time, the decision variables
The test-design model in Equations 16 to 21 can now be used to generate items from families for the selected (admissible) combinations of radicals. In principle, this is possible even for combinations that have never been used before. The only difference resides in an extension of the hierarchical response model, which now involves prediction of the mean difficulties for each family through the estimated effects of the radicals in Equation 4.
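As a rough illustration of how calibrated radicals can stand in for calibrated families, the sketch below enumerates all combinations of dichotomous radicals and predicts a mean difficulty for each family as a linear function of the radical effects. The effect values, the optional intercept, and the admissibility rule are hypothetical and only show the mechanics; they are not estimates from this study.

```python
from itertools import product


def family_mean_difficulties(effects, intercept=0.0, admissible=lambda combo: True):
    """Predicted mean difficulty for every admissible combination of radicals.

    effects     one (hypothetical) effect parameter per dichotomous radical
    intercept   optional constant term; an assumption, set to 0 by default
    admissible  predicate that excludes combinations the generator must not use
    """
    means = {}
    for combo in product((0, 1), repeat=len(effects)):
        if admissible(combo):
            means[combo] = intercept + sum(d * e for d, e in zip(combo, effects))
    return means


if __name__ == "__main__":
    effects = [0.5, -0.25, 1.0, 0.75, -1.0]  # made-up radical effects
    all_families = family_mean_difficulties(effects)
    print(len(all_families))  # five dichotomous radicals, fully crossed

    # Hypothetical rule: the first two radicals may not be present together.
    allowed = family_mean_difficulties(
        effects, admissible=lambda c: not (c[0] == 1 and c[1] == 1)
    )
    print(len(allowed))
```

Because the prediction depends only on the radical effects, a mean difficulty is available even for a combination of radicals from which no item has been generated before.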
Evaluation of Results
The evaluation consists of two parts. First, the effect of the within-family variability of the item parameters on the family-information functions used in the last two test-design models is analyzed. This is done by studying a few examples with different values for the critical second-level parameter that reflects the variability, namely the family covariance matrix. Second, results are presented from different applications of the test-design models based on simulated response data. Because all true parameter values are known, the effect of using different target information functions and constraint sets on the results could be evaluated.
Effect of Within-Family Item-Parameter Variability on Family Information
The variability of the item parameters within a family has a potentially large impact on the value of the family-information function in Equation 6. The effects of within-family variability are explored for the 3PNO model as first-level model.
Setup of the study
Mean family discrimination was fixed at 1, mean family guessing at 0.2 (which corresponds to

Figure 1. Family information for the 3PNO as the first-level model as a function of within-family variance in the item parameters and ability level

Figure 2. Family information for the 3PNO as the first-level model as a function of the correlation between the item parameters and ability level
Results
Figure 1 presents the curves for the case of zero correlations in combination with zero variance for all item parameters (solid line) or positive variance for only one type of parameter (dashed and dotted lines). An increase in the variance in the difficulty parameters
The first plot in Figure 2 is for the case of
The second plot in Figure 2 is for the case of
The third plot in Figure 2 is for the case of
Note that when all variances are 0 (solid line in Figure 1), all item parameters are equal to their respective family means, and family information equals item information (Equation 5) with the family values substituted for the item parameters. If the (co)variances are unequal to zero, use of item information in this way leads to overestimation of the actual amount of information on
Examples Using Simulation Data
The goal was to illustrate the use of the models for the cases of test assembly from a Pregenerated Item Pool and test generation on the fly from pools with calibrated item families. At the same time, the examples provide empirical estimates of the decrease in test information due to the use of family-information instead of item-information functions in the latter.
Item pools
A total of 32 item families was created through the use of five dichotomous radicals with a fully crossed design. All family-mean parameters used in this study are shown in Table 1. The mean difficulties of the families in this table were computed from Equation 4, with
Table 1. Simulated Family Parameters
Setup of the study
Two different kinds of test-assembly models were formulated: a baseline model without any of the constraints on radicals in Equation 11 (M1) and an alternative model with constraints on the distribution of the radicals in the test (M2). Both for M1 and M2, the general formulation in Equations 7 to 15 was used. In either case, the objective was maximization of the height of the test-information function while maintaining the relative shape defined by the weights
To compare the results for test assembly from a Pregenerated Item Pool and test generation on the fly from Calibrated Families, all models were used twice, once with a target based on the item-information functions in Equation 8 and a second time with a target based on the family information in Equation 17. Both types of information functions were calculated at the three ability values,
As the true values of the item parameters for the simulated items were known, the objective function values for the solutions produced by the models for the case of test generation on the fly from Calibrated Families could be recalculated replacing the family-information functions by the item-information functions with these known parameters. The difference between the two results allows an assessment of the effects of test assembly based on knowledge of item families only.
Results
Figures 3 (uniform target at

Mean value of the objective function (

Mean value of the objective function (
A comparison between Pregenerated Item Pool and Calibrated Families shows that, with increasing variance of the item parameters per family, the information in a test produced by the former increased. This can be explained as follows: Increasing within-family variance means increasing the variance in the information in its items as well, from which the most informative items are selected. However, family information decreases because of the larger uncertainty about the item-parameter values—an effect already discussed in the previous section. Besides, for the case of test generation on the fly from Calibrated Families, the comparison between Fam Inf and True Item Inf shows minimal differences for small within-family variances, whereas for larger within-family variances True Item Inf was sometimes larger than Fam Inf. In other words, the results for test generation on the fly from Calibrated Families tended to be somewhat conservative for the larger within-family variances.
As expected, the addition of the extra constraints on the distribution of the radicals in M2 caused a decrease in the values of the objective functions, both for the assembly from a Pregenerated Item Pool and the generation on the fly from Calibrated Families. The effect was more pronounced for the latter, though.
Doubling the number of items per family,
All results for the two targets for the test-information functions were comparable. In general, for the uniform target, larger values of the objective function were obtained for both cases of test assembly.
Figures 5 (uniform target at

Simulated family and item parameters (left column) and family and item-information curves (right column) for within–between covariance ratios of 0.01 (upper row), 0.05 (second row), 0.10 (third row), and 0.20 (last row), for a test selected based on a uniform relative target

Simulated family and item parameters (left column) and family and item-information curves (right column) for within–between covariance ratios of 0.01 (upper row), 0.05 (second row), 0.10 (third row), and 0.20 (last row), for a test selected based on a relative target with twice as much information at
Discussion
The main goal of this research was to explore the possibility of automated test generation, that is, the integration of automated item generation and test design/assembly. Three different cases of test design were discussed, with each successive case offering an increase in flexibility of automated test generation but at the cost of an increasing number of model assumptions. For instance, the two cases of on-the-fly test generation have the advantages of not having to calibrate newly generated items and of generating unique tests in real time for different test takers, but they do require the assumption of known distributions of the item parameters within families. The assumption of multivariate normality for these distributions can be facilitated by using an appropriate transformation of the item parameters, such as the logit transformation for the guessing parameter in Equation 2.
The most ambitious case (“Test Generation on the Fly Using Calibrated Radicals Only”) allows for a further reduction of item calibration efforts because of its additional assumption of a linear relationship between the radical effects and the family means.
In practical applications, the assumptions can be tested using, for example, the model fit assessment methods discussed in Geerlings (2012). The results can help to decide which case(s) of test design are appropriate for the intended domain of test items.
Another goal of this research was to assess the degree to which within-family item-parameter variability affects the solutions to the on-the-fly test-design models. In practice, the advantages of not having to calibrate every single item have to be weighed against the loss of information in the scores of the test takers. Other factors should also play a role in the decision. For example, items from families with small item-parameter variability may be remembered more easily by test takers and are thus vulnerable to test security breaches in the form of “family disclosure.” To mitigate the possible effects of low item-parameter variability on family disclosure, an exposure-control method could be used that, with increasing exposure of a family over time, gives lower weight to it in the test-assembly process.
The current research has a link to previous research on the effects of item-parameter uncertainty on optimal test assembly, which focused almost exclusively on uncertainty due to the use of small calibration samples (Hambleton et al., 1993; van der Linden & Glas, 2000). These effects can be described as capitalization-on-chance effects: Small calibration samples lead to larger positive estimation errors in the discrimination parameters of some of the items, which makes their selection for the test more likely. As a result, the actual test information is lower than its nominal value and the standard errors of the ability estimates are underestimated (Tsutakawa & Johnson, 1990; Zhang, Xie, Song, & Lu, 2011). The magnitude of the capitalization-on-chance effect depends not only on the size of the calibration sample but also on the variability of the true item parameters (van der Linden & Glas, 2001) and the item bank size–test length ratio (Hambleton et al., 1993; Hambleton & Jones, 1994). However, in real-world applications, the effect of these factors is mitigated by the content constraints added to the test-assembly problem (Hambleton et al., 1993). In the current context of optimal assembly from families of items, the opposite of a capitalization-on-chance effect can be expected: Instead of selecting the most informative items from the pool, the items are randomly selected from the families that are most informative in the sense of high mean information. As the actual items in these families are distributed around these means, the true test-information functions obtained for family selection will be lower than those for optimal selection of individually calibrated items from the same pool.
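The capitalization-on-chance effect described above can be illustrated with a small simulation. The sketch is not taken from the cited studies: the pool size, test length, error standard deviation, and the uniform distribution of the true discrimination parameters are all hypothetical choices, and selection is simplified to picking the items with the highest estimated discrimination.

```python
import random


def capitalization_demo(pool_size=200, test_length=20, error_sd=0.3, seed=1):
    """Select items on noisy discrimination estimates and compare with the truth.

    Returns the mean estimated and the mean true discrimination of the
    selected items. All settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_a = [rng.uniform(0.5, 1.5) for _ in range(pool_size)]
    # Estimation error stands in for the effect of a small calibration sample.
    est_a = [a + rng.gauss(0.0, error_sd) for a in true_a]
    # Pick the items that look most discriminating according to the estimates.
    chosen = sorted(range(pool_size), key=lambda i: est_a[i], reverse=True)[:test_length]
    mean_est = sum(est_a[i] for i in chosen) / test_length
    mean_true = sum(true_a[i] for i in chosen) / test_length
    return mean_est, mean_true


if __name__ == "__main__":
    mean_est, mean_true = capitalization_demo()
    print(mean_est, mean_true)
```

Because items with large positive errors are more likely to be selected, the mean estimated discrimination of the selected items exceeds their mean true discrimination. When families are selected and items are then drawn at random from them, no such selection on item-level errors takes place.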
Another effect of item-parameter uncertainty specific to adaptive testing is that of systematic errors in the ability parameters when the distribution of the item difficulty parameters within the pool is nonuniform (Doebler, 2012). The systematic errors occur because the nonuniformity implies an unequal probability of selecting an item with a positive or negative random error of the difficulty parameter. With regard to within-family item-parameter variability, the effect does not occur when the two-stage item-generation procedure discussed in this article is used; that is, when the item families are optimally selected and the incidentals are randomly applied to generate the items. In this case, the effect only occurs due to any imprecision with which the family parameters are estimated. As these parameters are usually estimated with high precision, the effect is expected to be small in most cases.
Finally, this article proposed a few alternative combinations of automated item generation and test design for domains where distinctions between item features that can be treated as radicals and incidentals are appropriate. Although automated item generators based on this distinction are already available, the authors are not aware of any earlier efforts to combine them with optimal test-design methods. To decide on the feasibility of such combinations, evaluation studies for different areas of application are needed.
Footnotes
Appendix A
Appendix B
Editor’s Note
This manuscript was reviewed and accepted under the editorship of Mark Davison.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the Deutsche Forschungsgemeinschaft (Grant Number HO 1286/5-1).
