Abstract
In item response theory (IRT), when two groups from different populations take two separate tests, there is a need to link the two ability scales so that the item parameters of the tests are comparable across the groups. To link the two scales, information from common items is utilized to estimate linking coefficients which place the item parameters on the same scale. For polytomous IRT models, the Haebara and Stocking–Lord methods for estimating the linking coefficients have commonly been recommended. However, estimates of the variance for these methods are not available in the literature. In this article, the asymptotic variances of linking coefficients for polytomous IRT models with the Haebara and Stocking–Lord methods are derived. The results are presented in a general form and specific results are given for the generalized partial credit model. Simulations which investigate the accuracy of the derivations under various settings of model complexity and sample size are provided. They show that the derivations are accurate under the conditions considered and that the Haebara and Stocking–Lord methods outperform several moment methods, with performance close to that of concurrent calibration.
Keywords
Item response theory (IRT) is a powerful framework for analyzing categorical data in educational and psychological testing. The results from IRT analyses are often used to report scores and for inferring population characteristics. When two groups from different populations take two tests, the item parameters for the respective groups are on different scales and cannot be directly compared. To enable the comparison between the parameters of the two tests, the parameters must be placed on a common scale (Kolen & Brennan, 2014). This is often accomplished by having some items which are common between the two tests.
One way to estimate the item parameters such that all parameters are expressed on the same scale is to use a multiple-group estimation procedure, also called concurrent calibration. With such a method, the differences between the ability distributions of the groups are taken into account and the item parameters are estimated with one of the groups as the reference group. A second way to obtain item parameters expressed on the same scale is to use separate estimation for the groups. Because the parameters of the common items should be identical for both groups if the parameters were on the same scale, the information contained in the estimated common items for each group can be used to estimate linking parameters
When using separate estimation for the item parameters from two groups, two types of estimators of
The purpose of this article is to derive the asymptotic variance of response function estimators of linking coefficients when using polytomous IRT models. The results of the article are important because linking coefficients are utilized in several areas in educational and psychological measurement such as equating and linking of scales and tests. Having estimates of the variance of the response function methods for estimating linking coefficients for polytomous IRT models will add to the utility of the equating and linking methodology.
The article is structured as follows. First, polytomous IRT models are introduced briefly. Then linking coefficients are described and response function estimators of these are defined. The asymptotic covariance matrices of the Haebara and Stocking–Lord methods are then derived. The derivations are illustrated using a simulation study and lastly the results are discussed.
Polytomous IRT Models
In educational and psychological testing, items on a test are often scored in two or more categories. To model data from such a test, a polytomous IRT model is often considered suitable. A general model for such data is the generalized partial credit model (GPCM; Muraki, 1992), where the probability to obtain category
where
The test response function indicates the expected score for a given ability level and is used when estimating the linking coefficients with the Stocking–Lord method.
Linking Coefficients
Let
In an alternative formulation, the geometric mean can be used in place of the arithmetic mean when estimating the
A third moment method uses the standard deviation of the
The variances of
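The moment methods referred to in this section can be sketched in a few lines of Python. The sketch assumes the usual linking relations θ* = Aθ + B, a* = a/A, and b* = Ab + B, with `a_from` and `b_from` holding the common-item parameters on the scale to be transformed and `a_to` and `b_to` holding the same items on the target scale. These names, the direction of the transformation, the pooling of all step parameters into one list, and the use of the population standard deviation are assumptions, not the article's notation.

```python
import math
from statistics import mean, pstdev


def _intercept(A, b_from, b_to):
    """B chosen so that the transformed mean location matches the target."""
    return mean(b_to) - A * mean(b_from)


def mean_mean(a_from, b_from, a_to, b_to):
    """Mean-mean: A is the ratio of arithmetic means of the slopes."""
    A = mean(a_from) / mean(a_to)
    return A, _intercept(A, b_from, b_to)


def mean_geometric_mean(a_from, b_from, a_to, b_to):
    """Mean-geometric mean: A is the geometric mean of the slope ratios."""
    A = math.exp(mean([math.log(x / y) for x, y in zip(a_from, a_to)]))
    return A, _intercept(A, b_from, b_to)


def mean_sigma(b_from, b_to):
    """Mean-sigma: A is the ratio of standard deviations of the locations."""
    A = pstdev(b_to) / pstdev(b_from)
    return A, _intercept(A, b_from, b_to)


# Hypothetical error-free common-item parameters linked by A = 1.2, B = 0.5.
a_from = [0.8, 1.1, 1.5]
b_from = [-1.0, 0.2, -0.3, 0.9, 0.4, 1.6]  # all step parameters, pooled
a_to = [a / 1.2 for a in a_from]
b_to = [1.2 * b + 0.5 for b in b_from]

estimates = {
    "MM": mean_mean(a_from, b_from, a_to, b_to),
    "MGM": mean_geometric_mean(a_from, b_from, a_to, b_to),
    "MS": mean_sigma(b_from, b_to),
}
```

With error-free parameters all three methods recover the same coefficients; they differ only once the item parameters are estimated with sampling error.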
Estimating Linking Coefficients With Response Function Methods
Let
and
Because
and
When the parameters are estimated, the equalities in Equations 8 and 9 will not hold exactly for all values of
The Haebara and Stocking–Lord methods estimate the linking coefficients
where
and
For the Stocking–Lord method, consider the objective function
where
and
When estimating
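To make the two response function criteria concrete, here is a minimal Python sketch. It is a simplified, one-directional version that compares the transformed parameters with the target-scale parameters over a fixed set of ability nodes; the article's general forms are not reproduced here. The GPCM parameterization, the helper names, the equally weighted nodes, and the crude grid search (standing in for a proper numerical optimizer) are all assumptions made for illustration.

```python
import math


def gpcm_probs(theta, a, b):
    """GPCM category probabilities (assumed slope/step parameterization)."""
    z = [0.0]
    for bv in b:
        z.append(z[-1] + a * (theta - bv))
    zmax = max(z)
    e = [math.exp(v - zmax) for v in z]
    s = sum(e)
    return [x / s for x in e]


def rescale(items, A, B):
    """Move item parameters to the target scale: a/A and A*b + B."""
    return [(a / A, [A * bv + B for bv in b]) for a, b in items]


def haebara(A, B, items_from, items_to, nodes, weights):
    """Weighted sum of squared differences between category probabilities."""
    moved = rescale(items_from, A, B)
    total = 0.0
    for theta, w in zip(nodes, weights):
        for (a1, b1), (a2, b2) in zip(moved, items_to):
            p1 = gpcm_probs(theta, a1, b1)
            p2 = gpcm_probs(theta, a2, b2)
            total += w * sum((x - y) ** 2 for x, y in zip(p1, p2))
    return total


def stocking_lord(A, B, items_from, items_to, nodes, weights):
    """Weighted sum of squared differences between test response functions."""
    def score(theta, items):
        return sum(k * p for a, b in items
                   for k, p in enumerate(gpcm_probs(theta, a, b)))
    moved = rescale(items_from, A, B)
    return sum(w * (score(t, moved) - score(t, items_to)) ** 2
               for t, w in zip(nodes, weights))


def grid_minimize(criterion, A_grid, B_grid, *args):
    """Crude grid search over (A, B); a stand-in for a real optimizer."""
    return min(((A, B) for A in A_grid for B in B_grid),
               key=lambda ab: criterion(ab[0], ab[1], *args))


# Hypothetical common items, error-free, linked by A = 1.2 and B = 0.5.
items_from = [(1.0, [-0.5, 0.5]), (1.4, [0.0, 1.0]), (0.7, [-1.0, 0.3])]
items_to = rescale(items_from, 1.2, 0.5)
nodes = [-4 + 0.4 * i for i in range(21)]
weights = [1.0 / len(nodes)] * len(nodes)
A_grid = [0.8 + 0.1 * i for i in range(9)]
B_grid = [0.1 * i for i in range(11)]
best_h = grid_minimize(haebara, A_grid, B_grid, items_from, items_to, nodes, weights)
best_sl = grid_minimize(stocking_lord, A_grid, B_grid, items_from, items_to, nodes, weights)
```

The Haebara criterion sums squared differences item by item and category by category, whereas the Stocking–Lord criterion first aggregates to the test response function, so item-level discrepancies can cancel in the latter.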
Asymptotic Covariance of Response Function Estimators of Linking Coefficients
Let
The matrices
When estimating
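One standard route to an asymptotic covariance of this kind is the delta method: if the linking coefficients are a smooth function of the item parameter estimates, their covariance is approximately J Σ J′, where Σ is the covariance matrix of the item parameter estimates and J is the Jacobian of the linking coefficients with respect to the item parameters. The Python sketch below illustrates that computation with a numerical central-difference Jacobian in place of analytic derivatives. The function names are hypothetical, and whether this mirrors the article's derivation step for step is an assumption.

```python
def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f: R^n -> R^m at the point x.

    f takes a list of n floats and returns a list of m floats.
    """
    m = len(f(x))
    jac = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        up = list(x)
        dn = list(x)
        up[j] += h
        dn[j] -= h
        fu, fd = f(up), f(dn)
        for i in range(m):
            jac[i][j] = (fu[i] - fd[i]) / (2 * h)
    return jac


def delta_method_cov(f, x, cov, h=1e-6):
    """Approximate covariance of f(x_hat) as J * cov * J'."""
    J = numerical_jacobian(f, x, h)
    n, m = len(x), len(J)
    # JS = J * cov (m x n), then JS * J' (m x m).
    JS = [[sum(J[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
          for i in range(m)]
    return [[sum(JS[i][k] * J[j][k] for k in range(n)) for j in range(m)]
            for i in range(m)]


# Toy check with a linear map, for which the delta method is exact.
def linear_map(x):
    return [2 * x[0] + 3 * x[1], x[0] - x[1]]


toy_cov = delta_method_cov(linear_map, [1.0, 2.0], [[1.0, 0.0], [0.0, 4.0]])
```

For estimators defined as minimizers of a criterion function, such as the response function methods, the Jacobian is not available in closed form from the estimator itself and is usually obtained through the implicit function theorem applied to the first-order conditions; the numerical Jacobian above sidesteps that step at the cost of repeated optimization.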
Simulation Study
Design
Data for two tests
The item parameters were estimated using marginal maximum likelihood with the statistical programming language R (R Development Core Team, 2016), either separately for each group or simultaneously in a multigroup setting (concurrent calibration). Version 1.20.1 of the R package mirt (Chalmers, 2012) was used for the item parameter estimation. With separate estimation, both groups were assumed to have an underlying
was also calculated, where MSE denotes the Mean Squared Error. The nonparametric bootstrap was used to calculate the standard errors of the MCSE and ASE and the confidence intervals for the RE in the simulation study.
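The summary measures of the simulation can be sketched in Python as follows. The relative efficiency is taken here as the MSE of concurrent calibration divided by the MSE of the estimator, which matches the interpretation given with the results (values above 1 favor the estimator). The paired resampling of replications, the percentile interval, the function names, and the simulated toy estimates are assumptions made for illustration and do not reproduce the article's data or its exact bootstrap scheme.

```python
import random
from statistics import mean, stdev


def bias(estimates, truth):
    """Average deviation of the estimates from the true value."""
    return mean(estimates) - truth


def mcse(estimates):
    """Monte Carlo standard error: SD of the estimates over replications."""
    return stdev(estimates)


def mse(estimates, truth):
    """Mean squared error around the true value."""
    return mean([(e - truth) ** 2 for e in estimates])


def relative_efficiency(est, est_cc, truth):
    """MSE of concurrent calibration relative to the MSE of the estimator."""
    return mse(est_cc, truth) / mse(est, truth)


def bootstrap_ci(est, est_cc, truth, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the RE, resampling replications."""
    rng = random.Random(seed)
    n = len(est)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(relative_efficiency([est[i] for i in idx],
                                         [est_cc[i] for i in idx], truth))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Toy replications (simulated here, not taken from the article).
toy_rng = random.Random(0)
truth_A = 1.2
est_A = [truth_A + toy_rng.gauss(0, 0.05) for _ in range(200)]
est_A_cc = [truth_A + toy_rng.gauss(0, 0.05) for _ in range(200)]
re_point = relative_efficiency(est_A, est_A_cc, truth_A)
re_interval = bootstrap_ci(est_A, est_A_cc, truth_A)
```

An interval that excludes 1 would indicate a detectable difference in MSE between the estimator and concurrent calibration for the given number of replications.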
Results
The results from the simulation with the three-category items are given in Table 1 for the case of five common items and in Table 2 for the case of 10 common items. The ASEs are accurate for all sample sizes and methods considered and hence there is no difference in the accuracy of the ASE between the five different linking coefficient estimators. The standard errors for the Haebara and Stocking–Lord methods are lower than those for the moment methods and the standard errors are smaller with 10 common items compared with the case of five common items. The Haebara method has smaller MCSE than the Stocking–Lord method for all conditions. The largest difference between the two response function methods is for the
Table 1. ASE and MCSE (×10) and Bias (×10) for Estimators of Linking Coefficients A and B With the Three-Category Items and Five Common Items.
Note. ASE = asymptotic standard error; MCSE = Monte Carlo standard error; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma; CC = concurrent calibration.
Table 2. ASE and MCSE (×10) and Bias (×10) for Estimators of Linking Coefficients A and B With the Three-Category Items and 10 Common Items.
Note. ASE = asymptotic standard error; MCSE = Monte Carlo standard error; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma; CC = concurrent calibration.
The results from the simulation with the five-category items are given in Table 3 for five common items and in Table 4 for 10 common items. The ASEs are accurate for all settings and methods. Overall, the results show that there are virtually no differences between the Haebara and Stocking–Lord methods. For all settings, the moment methods perform the worst with higher standard errors and higher bias. In contrast to the results with three-category items, the mean–sigma method performs the best among the moment methods with respect to the standard errors. The differences between the moment methods and the other estimators are largest when estimating the
Table 3. ASE and MCSE (×10) and Bias (×10) for Estimators of Linking Coefficients A and B With the Five-Category Items and Five Common Items.
Note. ASE = asymptotic standard error; MCSE = Monte Carlo standard error; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma; CC = concurrent calibration.
Table 4. ASE and MCSE (×10) and Bias (×10) for Estimators of Linking Coefficients A and B With the Five-Category Items and 10 Common Items.
Note. ASE = asymptotic standard error; MCSE = Monte Carlo standard error; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma; CC = concurrent calibration.
In Tables 5 and 6, the confidence intervals for the relative efficiencies of concurrent calibration compared with each of the other linking coefficient estimators are displayed. For the RE, a value larger than 1 means that the estimator has lower MSE than concurrent calibration and a value smaller than 1 means that the estimator has higher MSE than concurrent calibration. Overall, the differences between concurrent calibration and the response function methods are small. The best-performing estimator relative to concurrent calibration is the Haebara method, which has an efficiency which is comparable to concurrent calibration overall. The Stocking–Lord method is almost as good, except that for the tests with three-category items the RE is lower than that for the Haebara method. The moment methods have lower relative efficiencies for all settings compared with the response function methods, with the mean–sigma method performing worse than the mean–mean and mean–geometric mean methods with the three-category items but performing better than them with the five-category items.
Table 5. Confidence Intervals for the Relative Efficiency of Linking Coefficient Estimators From Separate Estimation Compared With Concurrent Calibration, Three-Category Items.
Note. Bold font indicates that the confidence interval does not cover 1. CI = confidence interval; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma.
Table 6. Confidence Intervals for the Relative Efficiency of Linking Coefficient Estimators From Separate Estimation Compared With Concurrent Calibration, Five-Category Items.
Note. Bold font indicates that the confidence interval does not cover 1. CI = confidence interval; H = Haebara; SL = Stocking–Lord; MGM = mean–geometric mean; MM = mean–mean; MS = mean–sigma.
Discussion
In this article, the asymptotic variances of linking coefficients estimated with the Haebara and Stocking–Lord methods were derived for polytomous IRT models. While the specific results were given only for the GPCM, it is straightforward to apply the results to other polytomous IRT models such as the graded response model (GRM). For score reporting with IRT, the results of this article can be applied to observed-score equating, as described in Andersson (2016), and to true-score equating, as described in Wong (2015).
There are several versions of the Haebara and Stocking–Lord methods used in the literature. The most general forms of these two methods were used in this article, meaning that the results for the other versions follow directly from the forms considered here. Note that the derivations in the article also apply to the dichotomous IRT models which are special cases of the GPCM, such as the two-parameter logistic (2-PL) model. In this sense, the results of the article generalize the results of Ogasawara (2001) to all the versions of the Haebara and Stocking–Lord methods in the literature.
The results of the numerical study indicate that the ASEs are accurate for sample sizes as low as 250, suggesting that the derivations are appropriate to use in practice. Adding to several studies which have shown the superiority of the response function methods compared with the moment methods, this study indicates that the Haebara and Stocking–Lord methods outperform the moment methods mean–mean, mean–geometric mean, and mean–sigma with respect to both the sampling variance and the bias. Nevertheless, the ASEs for the moment methods were as accurate as those for the response function methods. The biases for the Haebara and Stocking–Lord methods were negligible and approximately the same as for concurrent calibration. Even so, a useful extension to this line of work is to derive the asymptotic bias of estimators of linking coefficients under correctly and incorrectly specified models, as has been done for correctly specified dichotomous IRT models with the mean–mean and mean–sigma methods (Ogasawara, 2011).
This study indicates that when using marginal maximum likelihood estimation, the response function methods are almost as good as concurrent calibration with respect to the bias and standard error of the linking coefficients. For sample size 250 with both the three-category and five-category items, the Haebara method was sometimes even better than concurrent calibration, although the improvement was small. The Haebara method had better performance than the Stocking–Lord method for the tests with three-category items but not for the tests with five-category items. Some previous studies have indicated that the concurrent calibration method has overall better performance than using linking coefficients with separate estimation (Hanson & Béguin, 2002; Kim & Kolen, 2007), but examples of studies indicating the opposite are also available (S. H. Kim & Cohen, 1998). However, these studies used estimation methods which differed from the one used in this article, where the standard marginal maximum likelihood method was used. Furthermore, this study utilized simulated data from groups with differences in both the latent mean and the latent variance, which the referenced studies did not.
The asymptotic variances of linking coefficient estimators derived in this article and in previous articles only account for the variability in estimating the item parameters. Other sources of variability such as the selection of the common items from a pool of items remain unaccounted for (Haberman, Lee, & Qian, 2009; Michaelides & Haertel, 2014). For large sample sizes, the common item selection could be the main source of variability because the variability of the item parameter estimation reduces with the sample size while the variability of the common item selection does not.
Last, the method of using separate estimation and then calculating the linking coefficients has clear benefits compared with concurrent calibration. For example, with many successive calibrations it may not be possible to achieve convergence for the full data using concurrent calibration even though it is possible to achieve convergence for each individual group. It is also easier to diagnose potential problems when estimating the item parameters separately (Hanson & Béguin, 2002). It should also be noted that in this study the estimation with concurrent calibration took approximately 10 times as long as separate estimation. Hence, concurrent calibration may be computationally infeasible in practice, especially for large data sets.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary material is available for this article online.
References
