The nonequivalent groups anchor test (NEAT) design is often used to scale item parameters from two different test forms. A subset of items, called the anchor items or common items, is administered as part of both test forms. These items are used to adjust the item calibrations for any differences in the ability distributions of the groups taking different forms. One method for scaling the item parameters is the fixed-anchor item method, in which the parameters for the anchor items are held constant at the values calibrated in the base form. With this method, the parameters for the new form should be on the metric of the old form, with no need for further scaling.
The fixed-anchor procedure may be more complicated when the base-form and new-form ability distributions show large differences, as they would for vertical scaling. Several studies have demonstrated that in BILOG (Zimowski, Muraki, Mislevy, & Bock, 2003), if one fixes the anchor item parameters and specifies NOADJUST on the CALIB line, and if the new-form ability distribution is not close to N(0,1) relative to the metric of the anchor items, the scaling for the new-form nonanchor items and the new-form ability distribution will not be accurate (Kang & Peterson, 2009; Kim, 2006).
Before discussing these findings and proposing an alternative, several BILOG options will be discussed. These options will be described in the context of separate calibration for the new form. Some options operate differently in a multiple-group context, but that is not relevant to fixed-anchor item parameter calibration.
On the TEST command, the user may fix item parameters at their starting values by specifying FIX followed by a list of 0s and 1s indicating whether the parameters should be free or fixed. Starting values may be listed on the TEST command or read from an external file. The FIX option is thus used to fix the parameter values of the anchor items to their values as estimated for the base form of the test.
On the CALIB command, the options for the ability distribution are EMPIRICAL, NORMAL, or FIXED. EMPIRICAL approximates the density as a histogram at a series of quadrature points after each step. This ability distribution is termed the posterior ability distribution because it is the product of the prior distribution, initially a normal distribution, and the likelihood function based on the data and the item parameter estimates. The posterior distribution from one cycle is used as the prior ability distribution at the next cycle. The mean and standard deviation of the ability distribution are estimated from this density, and the scaling constants needed to adjust the mean to 0 and the standard deviation to 1 are calculated. These scaling constants are then applied to the item parameter estimates to transform them to the same metric. EMPIRICAL should not be used with data sets that are too small for accurate estimation of the ability density. Alternatively, with the NORMAL option, a standard normal ability distribution is used throughout the item parameter estimation; it is not updated after each cycle. After convergence, the mean and the standard deviation of the posterior ability distribution are estimated. As with EMPIRICAL, the scaling constants are calculated to adjust the mean and standard deviation to 0 and 1, and the item parameter estimates are transformed to the same metric. The FIXED option keeps the ability distribution fixed to user-specified values. If the user does not specify the shape of the density, this option is the same as NORMAL. This option should not be confused with FIX for the item parameters.
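The rescaling step described for EMPIRICAL and NORMAL is an ordinary linear transformation, which can be sketched in Python. This is not BILOG's internal code; it assumes the ability density is available as a list of quadrature points and weights, and the function name standardize_metric is hypothetical.

```python
import math

def standardize_metric(points, weights, a, b):
    """Rescale quadrature points and item parameters so that the
    ability density has mean 0 and standard deviation 1."""
    total = sum(weights)
    w = [x / total for x in weights]            # normalize the weights
    mean = sum(p * wi for p, wi in zip(points, w))
    var = sum(wi * (p - mean) ** 2 for p, wi in zip(points, w))
    sd = math.sqrt(var)
    new_points = [(p - mean) / sd for p in points]
    new_a = [ai * sd for ai in a]               # slopes are multiplied by the SD
    new_b = [(bi - mean) / sd for bi in b]      # thresholds are shifted and divided
    return mean, sd, new_points, new_a, new_b

# Toy density with mean -1 on the input metric (illustrative values only)
mean, sd, pts, new_a, new_b = standardize_metric(
    [-2.0, -1.0, 0.0], [0.25, 0.5, 0.25], a=[1.0], b=[-1.0]
)
```

Because the same constants are applied to the ability scale and to every item, the quantity a(θ − b) is unchanged by the transformation, so the fitted response probabilities are unaffected.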
NOADJUST is another option on the CALIB line. With NOADJUST, a standard normal ability distribution is used throughout the item parameter estimation. After convergence, the posterior ability distribution is approximated along the series of quadrature points. The mean and the standard deviation of this distribution are NOT adjusted to 0 and 1, although if no item parameters have been fixed, the mean and the standard deviation will generally be close to 0 and 1. Because the ability distribution is not rescaled, the item parameters are not rescaled either. Thus, NOADJUST can be used to avoid rescaling the fixed-anchor items at the end of the estimation.
Understanding these BILOG options clarifies the problem discussed in Kang and Peterson (2009) and Kim (2006): The NOADJUST option is needed to keep the anchor items fixed at their input values so that they will not be rescaled. However, when this option is used, the ability distribution for the marginal maximum likelihood (MML) estimation of the nonanchor items is not updated; NOADJUST overrides the EMPIRICAL option, which would otherwise adjust the prior ability distribution after each cycle. Thus, if the new-form ability distribution is appreciably different from the old-form ability distribution used to scale the metric of the anchor items, the weights (from the density of the ability distribution) used in the MML item calibration will be wrong and the parameter estimates of the nonanchor items will be biased.
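To see why the quadrature weights matter, the following Python sketch computes the marginal probability of a correct response to one 2PL item under two sets of weights: a standard normal density and an N(−1, 1.25²) density. It is only an illustration under assumed values (49 equally spaced points from −6 to 6, an item with a = 1 and b = 0); the helper names are hypothetical and nothing here reproduces BILOG's estimation routine.

```python
import math

D = 1.7  # scaling constant used with the logistic model

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def normal_weights(points, mean, sd):
    """Normalized normal-density weights at the quadrature points."""
    dens = [math.exp(-0.5 * ((x - mean) / sd) ** 2) for x in points]
    total = sum(dens)
    return [d / total for d in dens]

def marginal_p(points, weights, a, b):
    """Marginal probability of a correct response given the weights."""
    return sum(w * p_correct(x, a, b) for x, w in zip(points, weights))

points = [-6.0 + 0.25 * k for k in range(49)]   # assumed grid, -6 to 6
w_std = normal_weights(points, 0.0, 1.0)        # weights implied by NOADJUST
w_new = normal_weights(points, -1.0, 1.25)      # weights for the new-form group

p_std = marginal_p(points, w_std, a=1.0, b=0.0)
p_new = marginal_p(points, w_new, a=1.0, b=0.0)
```

The same item parameters imply different marginal proportions correct under the two sets of weights. If the estimation uses the first set while the data come from a group described by the second, the nonanchor item parameters must shift to fit the observed responses.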
Paek and Young (2005) proposed a method to adjust the ability distribution in BILOG while keeping the item parameters fixed with NOADJUST. They suggested multiple runs. After each BILOG run, the quadrature weights from the estimated ability distribution would serve as input for the following run by listing them on the PRIORS line. After several runs, the estimated weights should converge and the nonanchor items should be on the same metric.
Kim (2006) suggested a somewhat simpler way to adjust for this problem in BILOG, using just two BILOG runs. In the first run, the item parameters should not be fixed and the ability distribution should be estimated, presumably using the EMPIRICAL option on the CALIB line. The user can then find the scaling constants to put the anchor items on the original metric, perhaps using the Stocking–Lord or other scaling procedure. After applying the rescaling constants to the quadrature distribution, the quadrature weights can be read into BILOG and held constant with NOADJUST while the anchor item parameters are held constant with FIX and the parameters for the other items are reestimated in a second BILOG run.
In this software note, the authors illustrate a less complicated method, which requires only one BILOG run. For this method, the anchor items are fixed to their original values with FIX on the TEST command. But instead of specifying NOADJUST on the CALIB command, EMPIRICAL is specified. With these specifications, the parameters of the anchor items will be fixed relative to each other, but they will adjust with the ability distribution. As a result, all of the anchor items will differ from the original metric by exactly the same constants. Any item can be selected to obtain the scaling constants by a simple linear transformation. After applying these constants, the items and the ability scale will be on the original metric. There is no need for more complicated scaling or a second BILOG run.
An illustration is shown for 15 anchor items in Table 1. Responses were simulated to follow a 2-parameter logistic (2PL) model. There were 25 unique items, with bs evenly spaced from −2 to 2 and as equal to 1.0 or 1.5 (with D = 1.7). There were 5,000 simulees drawn from a N(−1, 1.25²) new-form ability distribution. The true item parameters for the anchor items, the first 15 items for convenience, were specified and fixed on the TEST line in the BILOG syntax, and EMPIRICAL was specified on the CALIB line:
>TEST SLOPE = (1,1.5,1,1.5,1,1.5,1,1.5,1,1.5,1,1.5,1,1.5,1),
THRESH = (-1.4,-1.2,-1,-0.8,-0.6,-0.4,-0.2,0,0.2,0.4,0.6,0.8,1,1.2,1.4),
FIX = (1(0)15,0(0)25);
>CALIB CYCLES = 100, NEWTON = 20, FLOAT, TPRIOR, EMPIRICAL;

Table 1. Anchor Item Parameters Before Rescaling, Using FIX and EMPIRICAL
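For readers who want to generate comparable data, the simulation design described above can be sketched in Python. This is a minimal sketch, not the authors' simulation code. The anchor slopes and thresholds follow the values listed on the TEST line; the alternating pattern of 1.0 and 1.5 for the unique-item slopes, the seed, and the function name simulate_2pl are assumptions.

```python
import math
import random

D = 1.7  # scaling constant used with the logistic model

def simulate_2pl(n_persons, a, b, mean=-1.0, sd=1.25, seed=0):
    """Simulate 0/1 responses to 2PL items for abilities drawn from N(mean, sd^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(mean, sd)
        row = []
        for ai, bi in zip(a, b):
            p = 1.0 / (1.0 + math.exp(-D * ai * (theta - bi)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

# 15 anchor items: thresholds -1.4 to 1.4 in steps of 0.2, slopes alternating 1 and 1.5
anchor_b = [round(-1.4 + 0.2 * k, 1) for k in range(15)]
anchor_a = [1.0 if k % 2 == 0 else 1.5 for k in range(15)]

# 25 unique items: thresholds evenly spaced from -2 to 2
unique_b = [-2.0 + 4.0 * k / 24 for k in range(25)]
# ASSUMPTION: the source says only "1.0 or 1.5"; an alternating pattern is used here
unique_a = [1.0 if k % 2 == 0 else 1.5 for k in range(25)]

responses = simulate_2pl(5000, anchor_a + unique_a, anchor_b + unique_b)
```

Each row of responses holds one simulee's answers to the 40 items, anchor items first, matching the item order implied by the FIX list.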
Notice in Table 1 that the item parameters remained fixed in a relative sense. It does not make any difference which item is used to derive the scaling constants, and there is literally nothing to be gained by using all of the items and applying a more complicated scaling method such as the Stocking–Lord method. Given the ability distribution of the new-form examinees, the anchor item b-parameters should have been “off” by a multiplicative constant of 1.25 and an additive constant of −1. In the single replication reported in Table 1, the constants are 1.20 and −0.99. Averaged across 500 replications, the constants were 1.21 and −0.98. Essentially, relative to the base metric, the variance of the new-form abilities was slightly underestimated. When data were simulated to follow a 3PL model (c = .2), not shown in the table, the estimation was a bit worse; the constants were 1.15 and −0.94.
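The single-item derivation of the scaling constants can be written out as a short Python sketch. It assumes the reported (EMPIRICAL) values relate to the base metric through b_base = A × b_reported + B and a_base = a_reported / A. The function names are hypothetical, and the reported values in the example are computed from the population constants 1.25 and −1 rather than taken from Table 1.

```python
def scaling_constants(a_base, b_base, a_reported, b_reported):
    """Derive the multiplicative (A) and additive (B) constants
    from one anchor item."""
    A = a_reported / a_base
    B = b_base - A * b_reported
    return A, B

def to_base_metric(a_reported, b_reported, A, B):
    """Place a reported slope and threshold on the base metric."""
    return a_reported / A, A * b_reported + B

# Hypothetical anchor item: base values a = 1.5, b = -1.2;
# values as they would be reported if the constants were exactly 1.25 and -1
A, B = scaling_constants(a_base=1.5, b_base=-1.2,
                         a_reported=1.875, b_reported=-0.16)

# Hypothetical nonanchor item reported on the same run
a_new, b_new = to_base_metric(a_reported=1.25, b_reported=0.8, A=A, B=B)
```

Because every fixed anchor item shifts by the same constants, any anchor item gives the same A and B. Ability estimates can be placed on the base metric in the same way, as A × θ_reported + B.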
For comparison, the item parameters were also estimated using NOADJUST on the CALIB line, again specifying FIX and listing the anchor item parameters on the TEST line. Based on Kim’s (2006) or Paek and Young’s (2005) work, the authors anticipated considerable bias. The right panels of Figures 1 and 2 show the bias for the 2PL data, averaged across 500 replications, of the nonanchor items using these specifications. The b-parameters (Figure 1) were overestimated, except for the most difficult items. The as (Figure 2) were also overestimated, relative to the scale of the fixed-anchor items, and the bias was larger for easier or more discriminating items. In contrast, the left panels of Figures 1 and 2 show the bias for the nonanchor items using EMPIRICAL and rescaling based on the constants by which the fixed-anchor items had shifted. This bias was negligible compared with the bias shown in the panels on the right.

Figure 1. Bias in b-parameter for the nonanchor items. The FIX option was used for the anchor items on the TEST line, along with either EMPIRICAL or NOADJUST on the CALIB line.

Figure 2. Bias in a-parameter for the nonanchor items.
It should be noted that the bias shown in Figures 1 and 2 arose because the new-form ability distribution was far from standard normal on the metric of the fixed item parameters. NOADJUST would not be problematic if the new-form ability distribution were similar to that used to scale the fixed-anchor items.
This note was not intended to advocate the fixed parameter method over other scaling methods or concurrent calibration or to advocate BILOG over other programs. Instead, the purpose was to provide practitioners with a simple method of using BILOG if they choose to use fixed-anchor item scaling.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
