Abstract
A common practice of linking uses estimated item parameters to calculate projected scores. This procedure fails to account for the carry-over sampling variability. Neglecting sampling variability could consequently lead to understated uncertainty for Item Response Theory (IRT) scale scores. To address the issue, we apply a Multiple Imputation (MI) approach to adjust the Posterior Standard Deviations of IRT scale scores. The MI procedure involves drawing multiple sets of plausible values from an approximate sampling distribution of the estimated item parameters. When two scales to be linked were previously calibrated, item parameters can be fixed at their original published scales, and the latent variable means and covariances of the two scales can then be estimated conditional on the fixed item parameters. The conditional estimation procedure is a special case of Restricted Recalibration (RR), in which the asymptotic sampling distribution of estimated parameters follows from the general theory of pseudo Maximum Likelihood (ML) estimation. We evaluate the combination of RR and MI by a simulation study to examine the impact of carry-over sampling variability under various simulation conditions. We also illustrate how to apply the proposed method to real data by revisiting Thissen et al. (2015).
Introduction
Linking refers to methods that use results obtained from one test to infer what the results would have been if another test had been used (Linn et al., 2009). Linking is widely used in both small and large scale assessments (e.g., LaFlair et al., 2017; Livingston & Kim, 2010; OECD, 2016; Phillips et al., 2014). Among multiple types of linking, this paper focuses on calibrated projection.
If two scales measure distinct but possibly highly correlated constructs, scores on one scale can be predicted from the other by regression or projection. Regression yields a point-to-point prediction, whereas projection predicts the score distribution. Projection is a relatively weak form of linking: it does not require the two tests being linked to measure exactly the same construct. Calibrated projection, as a special case of projection, combines both calibration and projection steps (Thissen et al., 2011, 2015). In calibrated projection, item responses are assumed to follow a two-dimensional simple structure Item Response Theory (IRT) model: Items from the two tests load separately on two correlated latent factors. Projected IRT scores for one scale can be obtained given only responses to the other scale due to the presence of the inter-factor correlation. Linking can be in either direction, but the predicted scores from opposite directions are typically asymmetric. Applications of calibrated projection can be found in Thissen et al. (2011) and Thissen et al. (2015). Depending on whether the scales to be linked have been calibrated previously, model parameters in calibrated projection can be estimated in either one or two stages. In Thissen et al. (2011), item and latent variable (LV) density parameters are simultaneously estimated using one data set. In contrast, Thissen et al. (2015) attempted to link previously calibrated scales and estimated the LV density parameters conditional on the published item parameter estimates.
The original calibrated projection procedure used in Thissen et al. (2015) ignores the sampling variability in calibration, which may result in understated uncertainty for IRT scale scores and consequently affect the statistical inferences of linking. Thissen et al. (2015) showed the summed score conversion tables when projecting from one scale to another. Due to sampling variability, the estimated Posterior Standard Deviations (PSDs) in the conversion tables may need further adjustment. Several methods can be used to account for the impact of sampling variability in this adjustment, such as the multiple imputation approach (MI; e.g., Yang et al., 2012) and the fully Bayesian method (e.g., Patz & Junker, 1999). In the context of Maximum Likelihood (ML) scoring, Cheng and Yuan (2010) also proposed an analytical approach that adjusts the standard errors (SEs) of scores upward. Note that ML scores are computed without any LV distribution, whereas calibrated projection is not possible without a bivariate LV distribution. Therefore, Cheng and Yuan's (2010) method is not further considered in the present work.
The MI approach has been used in previous studies to address the parameter estimation uncertainty in test equating designs. In a common-item linking design, Li and Lissitz (2004) used MI to obtain SEs of IRT equating coefficients, assuming no covariance among parameter estimates for different items. Zhang and Zhao (2019) adopted a similar MI approach but, unlike Li and Lissitz (2004), used the covariance matrix among item parameter estimates. Our study differs from Li and Lissitz (2004) and Zhang and Zhao (2019) in three aspects: (1) we considered a calibrated projection design, whereas they used a common-item nonequivalent group equating design; (2) their studies focused on the performance of MI in estimating the SEs of IRT parameter transformation coefficients, while we used MI to obtain the adjusted IRT scores; and (3) we used a two-stage estimation similar to Thissen et al. (2015) in our study.
Item Response Theory scoring in calibrated projection may benefit from using published item parameters. Kolen and Brennan (2014) indicated that there are potential benefits if the published item parameters from a previously calibrated item pool are re-used. For instance, across studies using independent samples, a common ability scale can be established by fixing the item parameters of common items at the same values and adjusting the LV scales accordingly. As another example, PROMIS studies collected samples in a principled way and published their calibrated item parameters based on a representative sample from the reference population. In this way, researchers always use the same items with the same set of parameters so that scores from separate studies are comparable.
A number of calibration studies of PROMIS scales have been published (e.g., Huang et al., 2016; Irwin et al., 2010; Pilkonis et al., 2011; Tucker et al., 2014), and these calibrated PROMIS scales have been used in many subsequent studies. It is possible to perform item recalibration by collecting a new sample for the purpose of linking. However, a complete recalibration may not be ideal when the scores from the previously established scales have already been used. In addition, new item parameter estimates obtained by complete recalibration of a new sample could be computationally less stable because of small sample sizes and different distribution characteristics (Liu et al., 2019). As such, a complete recalibration is less preferred than using pre-existing item parameters.
The current work aims to address two issues in calibrated projection: (1) characterizing the sampling variability when estimation is done in two stages, especially when the calibration dataset is not accessible; and (2) accounting for sampling variability in the IRT model parameters when computing IRT scale scores and PSDs. The first issue is addressed by a two-stage calibrated projection method using a conditional estimation procedure, which is a special case of the Restricted Recalibration (RR). The second issue is addressed by the MI method. The rest of the paper is organized as follows. We first present in the context of a two-dimensional simple structure IRT model the methods we use to address the two issues mentioned above. Then, we report a simulation study with a goal to show the extent to which sampling variability is carried over to the scoring process. We also revisit a calibrated projection empirical example in Thissen et al. (2015) by applying the proposed methods.
Method
Calibrated Projection
When two scales measure somewhat distinct constructs, calibrated projection can be used to translate scores between the two scales. Under a two-dimensional simple structure IRT model (see Figure 1), scores on one scale and the corresponding precision estimates can be obtained given the responses on the other scale. Calibrated projection is based on multidimensional IRT models. When items are not calibrated in previous studies, calibrated projection can be done by fitting a two-dimensional model, with LV correlation and item parameters estimated simultaneously in one stage as in Thissen et al. (2011). When scales are previously calibrated, calibrated projection can be done by fixing the item parameters from the two scales at their published values, and estimating LV density parameters conditional on the fixed item parameters in two stages (Thissen et al., 2015). We focus on the latter case. Model parameters (
We focus on a two-dimensional Graded Response Model (De Ayala, 1994) in the present work; the techniques described in the sequel can be straightforwardly generalized to other MIRT models.
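To make the item model concrete, the following minimal Python sketch computes graded response category probabilities for a single item at a single LV value. It assumes the slope-intercept form P(Y ≥ k | θ) = logistic(aθ + c_k) with decreasing intercepts; the function name, argument names, and the item parameter values are hypothetical and are not taken from the published calibrations.

```python
import math


def grm_category_probs(theta, slope, intercepts):
    """Return [P(Y = 0), ..., P(Y = K - 1)] for one graded item.

    Assumes P(Y >= k | theta) = logistic(slope * theta + intercepts[k - 1]),
    with intercepts listed in strictly decreasing order.
    """
    cum = [1.0]  # P(Y >= 0) is always 1
    for c in intercepts:
        cum.append(1.0 / (1.0 + math.exp(-(slope * theta + c))))
    cum.append(0.0)  # P(Y >= K) is always 0
    # Adjacent differences of the cumulative curve give category probabilities
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical five-category item evaluated at theta = 0
probs = grm_category_probs(theta=0.0, slope=1.5, intercepts=[2.0, 1.0, -1.0, -2.0])
print([round(p, 3) for p in probs])
```

Under simple structure, each item depends on only one of the two LVs, so the same function applies to items on either scale with the corresponding LV value plugged in.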
Suppose that the n × J data matrix
Suppose LV follows a multivariate normal distribution with the mean vector
We adjust the LV mean vector and covariance matrix such that the reference LV in the current study is on the same scale as the LV in the original calibration sample. Following Thissen et al. (2015), suppose the projection is from Scale 1 to Scale 2. To compute the projected scores on Scale 2 in a hypothetical calibration population that is the same as the reference population for Scale 1, the LV mean vector and covariance matrix are rescaled as follows.
Calculation steps for Equations (5) and (6) can be found in Equations (1.3)–(1.9) in Thissen et al. (2015). The T-score unit with a mean of 50 and a standard deviation of 10 (Pilkonis et al., 2011) is used throughout this paper for the summed-score expected a posteriori (SS-EAP) in both scales.
Item Response Theory Scoring
We focus on SS-EAP in this paper, but the proposed method can be adapted to response-pattern EAP scores. The joint likelihood of summed score s and
If the estimates of
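As an illustration of summed-score scoring under a two-dimensional simple structure model, the sketch below computes projected SS-EAPs and PSDs for the Scale 2 LV given each summed score on Scale 1. It is a rough sketch under stated assumptions, not the authors' implementation: the summed-score likelihood is built with the Lord-Wingersky recursion, the Scale 1 LV is integrated over an equally spaced grid, and the Scale 2 LV is handled analytically through the bivariate normal conditional moments. All function names, the grid settings, and the item parameters are hypothetical.

```python
import math


def grm_probs(theta, slope, intercepts):
    # Graded response category probabilities (slope-intercept form, assumed)
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-(slope * theta + c))) for c in intercepts]
    cum.append(0.0)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def summed_score_likelihood(theta, items):
    """Lord-Wingersky recursion: P(summed score = s | theta), s = 0..max."""
    like = [1.0]  # with no items, the summed score is 0 with probability 1
    for slope, intercepts in items:
        p = grm_probs(theta, slope, intercepts)
        new = [0.0] * (len(like) + len(p) - 1)
        for s, ls in enumerate(like):
            for k, pk in enumerate(p):
                new[s + k] += ls * pk
        like = new
    return like


def projected_ss_eap(items_1, mean, cov, n_quad=31, lo=-6.0, hi=6.0):
    """(EAP, PSD) of theta_2 for each summed score on Scale 1."""
    mu1, mu2 = mean
    v1, c12, v2 = cov[0][0], cov[0][1], cov[1][1]
    grid = [lo + (hi - lo) * q / (n_quad - 1) for q in range(n_quad)]
    # Unnormalized marginal normal weights for theta_1
    w = [math.exp(-0.5 * (t - mu1) ** 2 / v1) for t in grid]
    # Conditional mean and variance of theta_2 given theta_1
    beta = c12 / v1
    cond_var = v2 - beta * c12
    cond_means = [mu2 + beta * (t - mu1) for t in grid]
    like = [summed_score_likelihood(t, items_1) for t in grid]
    out = []
    for s in range(len(like[0])):
        post = [w[q] * like[q][s] for q in range(n_quad)]
        total = sum(post)
        post = [p / total for p in post]
        eap = sum(p * m for p, m in zip(post, cond_means))
        # Law of total variance over the theta_1 posterior
        var = cond_var + sum(p * (m - eap) ** 2 for p, m in zip(post, cond_means))
        out.append((eap, math.sqrt(var)))
    return out


# Hypothetical two-item Scale 1 with five categories per item
items_1 = [(1.5, [2.0, 1.0, -1.0, -2.0]), (2.0, [1.5, 0.5, -0.5, -1.5])]
table = projected_ss_eap(items_1, mean=(0.0, 0.0), cov=((1.0, 0.89), (0.89, 1.0)))
for s, (eap, psd) in enumerate(table):
    # Report on the T-score metric (mean 50, standard deviation 10)
    print(s, round(50 + 10 * eap, 1), round(10 * psd, 1))
```

Because the Scale 2 items are unobserved in projection, the PSD of the projected score can never fall below the conditional standard deviation of the Scale 2 LV given the Scale 1 LV, which is why projected scores are less precise when the LV correlation is lower.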
Multiple Imputation
We applied an MI-based approach to adjust SS-EAPs and their corresponding PSDs to reflect sampling variability. The procedure is in line with the predictive inference framework in Liu and Yang (2018), which considers predicting the plausible values of
The general method to obtain the MI-based SS-EAPs and PSDs is as described. Let
As can be found in the RR section, by the asymptotic theory of pseudo ML estimator, the resulting two-stage estimator of model parameters
In practice, the MI-based SS-EAPs and PSDs can be obtained by Monte Carlo approximation. Suppose M sets of parameters are drawn from the multivariate normal distribution
The values computed by equation (14) and the square root of equation (15) are referred to as MI (i.e., adjusted) SS-EAPs and PSDs.
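The Monte Carlo combination step can be sketched as follows. This is a minimal illustration that assumes Equations (14) and (15) take the usual total-variance form: the MI SS-EAP is the average of the M imputation-specific SS-EAPs, and the MI variance is the average squared PSD (measurement error) plus the variance of the SS-EAPs across imputations (sampling error). The ratio returned here, sampling error over measurement error on the standard deviation scale, is likewise an assumed form of r. The function name and the input values are hypothetical.

```python
import math


def mi_combine(eaps, psds):
    """Combine M imputation-specific SS-EAPs and PSDs for one summed score.

    Returns (MI SS-EAP, MI PSD, ratio of sampling SD to measurement SD).
    """
    m = len(eaps)
    mi_eap = sum(eaps) / m
    # Within-imputation part: average posterior variance (measurement error)
    within = sum(p ** 2 for p in psds) / m
    # Between-imputation part: spread of the EAPs across draws (sampling error)
    between = sum((e - mi_eap) ** 2 for e in eaps) / m
    mi_psd = math.sqrt(within + between)
    ratio = math.sqrt(between / within)
    return mi_eap, mi_psd, ratio


# Hypothetical T-score results from M = 4 parameter draws
mi_eap, mi_psd, ratio = mi_combine([50.0, 52.0, 48.0, 50.0], [4.0, 4.0, 4.0, 4.0])
print(mi_eap, round(mi_psd, 3), round(ratio, 3))
```

In practice M would be much larger than 4, and each imputation-specific SS-EAP and PSD would come from rescoring with one set of parameters drawn from the approximate sampling distribution.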
We compute the ratio
Restricted Recalibration
One of the problems we encounter when using the MI approach to account for sampling variability is that model parameters are computed in two stages, which is a special case of RR. Restricted Recalibration concerns fitting a model parameterized by
The full ACM from RR can be constructed by the following procedures. Let n′ denote the sample size of
Under certain regularity conditions (Gong & Samaniego, 1981)
Simulation
In this section, we report a simulation study to examine the impact of carry-over sampling error on IRT scores in both scales under various combinations of simulation conditions.
Design
The manipulated variables in the simulation study include: (1) LV correlations, (2) calibration sample size, and (3) scoring sample size. For simplicity, the projection in the simulation study is from Scale 2 to Scale 1.
In the simulation study, the published item parameters in Irwin et al. (2010) and Pilkonis et al. (2011) were used as true model parameters to generate the calibration and scoring sample data. All items have a 5-point Likert-type scale. Items were calibrated separately, and hence, LV mean and variance for each of the two scales were fixed to 0 and 1 in the calibration samples. Two LV correlation conditions were examined: 0.89 and 0.96. These two correlation values come from two linking studies in Thissen et al. (2015) and Thissen et al. (2011). In the scoring samples, the true means for LVs in Scale 1 and Scale 2 are 0.631 and 0.868, respectively, based on the estimates reported in Thissen et al. (2015). The true covariance matrix for LVs is
All computations were done in R (R Core Team, 2019). Data generation and model fitting were conducted using the mirt package (Chalmers, 2012). The expectation-maximization (EM) algorithm (Bock & Aitkin, 1981) was used to find the ML estimates of the two-dimensional simple structure item and LV density parameters. The default numbers of quadrature points in mirt were adopted: 61 quadrature points for one-dimensional models and 31 quadrature points per dimension for two-dimensional models. The default convergence tolerance of .0001 and maximum of 500 iterations were also adopted. The expected Fisher information was approximated by the Monte Carlo method proposed in Monroe (2019). The evaluation criterion is the r value. A larger r reflects a larger impact of sampling error on IRT scoring and thus indicates a greater need to adjust the PSDs.
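The data-generating step for a scoring sample can be illustrated outside of mirt. The following is a minimal Python sketch, not the R code used in the study: it draws correlated LVs from a bivariate normal distribution and then samples graded responses item by item. The LV means (0.631 and 0.868) and the correlation (0.89) are taken from the design above; the unit standard deviations, the item parameters, the seed, and all function names are placeholders assumed for illustration only.

```python
import math
import random


def grm_probs(theta, slope, intercepts):
    # Graded response category probabilities (slope-intercept form, assumed)
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-(slope * theta + c))) for c in intercepts]
    cum.append(0.0)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def draw_lvs(rng, mean, sd, corr):
    """One draw from a bivariate normal via its Cholesky factor."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    t1 = mean[0] + sd[0] * z1
    t2 = mean[1] + sd[1] * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)
    return t1, t2


def simulate_responses(n, items_1, items_2, mean, sd, corr, seed=1):
    """Simulate an n x J response matrix under simple structure."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t1, t2 = draw_lvs(rng, mean, sd, corr)
        row = []
        # Scale 1 items depend only on t1, Scale 2 items only on t2
        for theta, items in ((t1, items_1), (t2, items_2)):
            for slope, intercepts in items:
                p = grm_probs(theta, slope, intercepts)
                row.append(rng.choices(range(len(p)), weights=p)[0])
        data.append(row)
    return data


# Placeholder item parameters: two five-category items per scale
items_1 = [(1.5, [2.0, 1.0, -1.0, -2.0]), (2.0, [1.5, 0.5, -0.5, -1.5])]
items_2 = [(1.8, [2.5, 1.0, 0.0, -1.5]), (2.2, [2.0, 0.5, -0.5, -2.0])]
data = simulate_responses(250, items_1, items_2,
                          mean=(0.631, 0.868), sd=(1.0, 1.0), corr=0.89)
print(len(data), len(data[0]))
```

A calibration sample for a single scale could be generated the same way with one LV fixed at mean 0 and variance 1, matching the separate calibration described in the design.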
We considered two heuristic ACM constructions per the suggestion of a reviewer: the focal-only ACM approach and the block-diagonal ACM approach. Although only the full ACM approach is theoretically correct, we are interested in whether these two simplified ACM approaches yield comparable results. The focal-only ACM approach is identical to the naive method presented in Liu et al. (2019). In this approach, the estimates of the item parameters are considered known, and thus there is no carry-over impact of the sampling variability within the estimated item parameters. Only the per-observation ACM from equation (18) is used to impute the focal parameters. In the block-diagonal ACM approach, the focal parameters and nuisance parameters are treated as if they were independently estimated from separate samples, and thus the
Results
Figures 2 and 3 summarize the changes of r across summed scores when projecting from Scale 2 to Scale 1. Figure 2 shows the r values of Scale 1 and Scale 2 when the LV correlation is 0.89, and Figure 3 shows them when the LV correlation is 0.96. It can be found that r can be large in this projection. In Figure 2, under the condition when n′ = 250 and n = 250, the maximum r in Scale 2 can reach 0.515. That means in this projection, the sampling error is 51.5% of the measurement error; it is indeed necessary to do the MI adjustment to the PSD in this case.

Figure 2. r for Scale 1 and Scale 2 across summed scores when projecting from Scale 2 to Scale 1. LV correlation is 0.89. The calibration sample sizes are n′ = 250, 500, and 1000. The scoring sample sizes are n = 250, 500, and 1000. Scale 1 and Scale 2 are depicted in different colors: gray for Scale 1, and black for Scale 2. ACMs are constructed by the full ACM (solid), focal-only ACM (dashed), and block-diagonal ACM (dotted) approaches. Note. ACM = asymptotic covariance matrix.

Figure 3. r for Scale 1 and Scale 2 across summed scores when projecting from Scale 2 to Scale 1. LV correlation is 0.96. The calibration sample sizes are n′ = 250, 500, and 1000. The scoring sample sizes are n = 250, 500, and 1000. Scale 1 and Scale 2 are depicted in different colors: gray for Scale 1, and black for Scale 2. ACMs are constructed by the full ACM (solid), focal-only ACM (dashed), and block-diagonal ACM (dotted) approaches. Note. ACM = asymptotic covariance matrix.

It is also noticeable in Figures 2 and 3 that regardless of the LV correlation, fixing one calibration sample size and one ACM approach, the r for the reference scale (e.g., Scale 2 when projecting from Scale 2 to Scale 1) is the same across different scoring sample size conditions. The reason is that the mean and variance of the reference scale are fixed to 0 and 1, respectively, and the only carry-over impact comes from the uncertainty in the item-calibration step. Therefore, when the calibration sample size is fixed, the r values are the same for the reference scale across different scoring sample size conditions. Moreover, Figures 2 and 3 show that when projecting from Scale 2 to Scale 1, the r values in Scale 1 (i.e., the projected scale) are larger when the LV correlation is larger. There is not much difference in the r values in Scale 2 between the two LV correlation conditions. This shows that a larger LV correlation could result in a smaller measurement error, and thus a larger r in Scale 1 when the sampling error does not differ much.
It can also be found that in both figures, the r for Scale 2 gradually increases until the summed score gets around 28 and drops after that. r increases when the sampling error increases and/or measurement error decreases, which means the magnitude of sampling error relative to measurement error becomes larger. It is found in Scale 2 that both the sampling error and the measurement error continue to increase after the summed score reaches 28, but the measurement error grows faster than the sampling error, and thus the r drops after that point. For Scale 1, r decreases a little bit and then increases gradually as summed score increases, but the adjustment is not always needed if we use
Across the three ACM approaches in Figures 2 and 3, r is similar under the full ACM and the block-diagonal ACM approaches, but both differ substantially from the r obtained using the focal-only ACM approach.
Summary
We used calibrated projection to link two scales when underlying constructs are correlated but not exactly the same. Restricted Recalibration was conducted to obtain the estimates of LV density parameters conditional on published item parameters of the two scales. The ACMs from different approaches have block structures. An MI-based procedure was used to account for the sampling variability that is carried over to the scoring process.
The results shown in Figures 2 and 3 can be used as a guideline for researchers to use the MI adjustment under different LV correlations, calibration sample sizes, and scoring sample sizes. Our results indicate that the adjustment is larger for both scales when the calibration sample is smaller. In the current projection setting, when projecting from Scale 2 to Scale 1, r can be larger than 51% using the full ACM approach. This means that sampling error is over 51% of the measurement error, and the PSD adjustment is indeed needed in such cases. When the new sample is small and the calibration sample is large, r can still be about 15%. The patterns found in the current study are consistent with the findings in Yang et al. (2012) that the impact of sampling variability on the uncertainty in SS-EAPs can be small. However, depending on how small the calibration sample size is relative to the number of estimated parameters and whether the model is multidimensional, the impact of sampling variability on the PSDs can be large. In general, if r is larger than 0.1, we consider the adjustment to be needed.
The r values obtained using the full ACM approach are similar to those obtained using the block-diagonal ACM approach, but both differ substantially from those obtained by the focal-only ACM approach. Among these three ACM approaches, only the full ACM approach is technically the correct way to construct the ACM. Moreover, a higher LV correlation results in a larger r in the projected scale.
Empirical Example
Data
We demonstrate how to account for the sampling variability of estimated model parameters that is carried over to projected scores in one of the scales being linked in Thissen et al. (2015). There are two test scales involved in the study: the PROMIS pediatric and adult anxiety test scales. Both of these scales were previously studied and their item parameters were published. Item parameter estimates for these two scales are uncorrelated since the samples used to calibrate these two scales were collected separately. Researchers have access to the published item parameters in these two scales but not the original calibration data.
Items in PROMIS pediatric and adult anxiety tests were calibrated and evaluated by representative samples in previous studies. On the one hand, the PROMIS pediatric anxiety test contains 18 items in total. The items were randomly split into two test forms with equal number of items in each form. Each item was measured on the same five-point Likert-scale ranging from never (0) to always (4). 1529 participants aged 8–17 were recruited, and were randomly assigned to answer either test form 1 or test form 2. In total, 759 participants took test form 1 and 770 took test form 2. On the other hand, PROMIS adult anxiety test targets participants older than 18. The test contains 56 items for each of the 14 health-related measures including anxiety (Cella et al., 2010; Pilkonis et al., 2011). There are two test designs: full-bank and block testing designs. Participants in the full-bank testing group took all anxiety items. Other participants in the block testing group received content-balanced 98 items (i.e., 7 items each, for all the 14 domains). In total, 788 participants took full-bank anxiety items and 14,048 participants took items from the block testing design. The original article provided limited information about samples, so we approximated the sample size for the PROMIS adult anxiety scale by
In the calibrated projection research, Thissen et al. (2015) collected new data that contain responses to both PROMIS pediatric and adult anxiety scales for the purpose of linking, which were also used as our scoring sample. In the new data, 874 teenagers aged 14–20 responded to eight short-form PROMIS pediatric-scale and eight short-form PROMIS adult-scale anxiety items (Thissen et al., 2015). Among the eight pediatric-scale items, three items are from pediatric test form 1, and five items are from test form 2. All of the eight adult-scale items are from a published short-form PROMIS adult anxiety test.
Constructing ACM
The per-observation ACM in equation (18) can be constructed using the aforementioned calibration and scoring samples. Model parameters include item parameters estimated from original calibration samples and LV density parameters estimated from the new scoring sample. The ACM can be divided into three block-diagonal ACMs:
The three components in full ACM can be further decomposed into sub-blocks. The block-structured
Results
The Unadjusted and MI Adjusted SS-EAPs and PSDs for Summed Scores on the Pediatric Scale. The Projection Direction is from Pediatric to Adult Scale.
Note. MI = Multiple Imputation; SS-EAP = summed-score expected a posteriori; PSD = Posterior Standard Deviation.
The Unadjusted and MI Adjusted SS-EAPs and PSDs for Summed Scores on the Adult Scale. The Projection Direction is from Adult to Pediatric Scale.
Note. MI = Multiple Imputation; SS-EAP = summed-score expected a posteriori; PSD = Posterior Standard Deviation.

The distribution of raw summed scores in pediatric and adult scales. The raw summed scores in both scales are grouped into bins with equal width of 4 ranging from 0 to 32.
PSD adjustment is less necessary for both scales when the calibration sample size is large enough, regardless of the projection direction. Both tables show that the PSDs are on average larger for the projected scale, especially when projecting from the adult to the pediatric scale. Across both projection directions, the largest r is 0.054, in the adult scale when projecting from the adult scale to the pediatric scale. This means that the largest sampling error is only about 5% of the measurement error.
Discussion
Overall, the current research focuses on accounting for the carry-over sampling variability in a calibrated projection design when the item parameters of the two scales to be linked are fixed at their published values. This study extends the original calibrated projection study in Thissen et al. (2015) in two ways: it provides a way to account for the impact of sampling variability, and it presents findings and implications for IRT scoring that uses information from previous studies.
The value of r indicates how necessary the PSD adjustment is; it reflects the magnitude of the sampling error relative to the measurement error. When r values are large, there are big differences between the adjusted and unadjusted scoring values, and we understate the uncertainty in IRT scores if we do not make the adjustment. Thus, it is crucial to make the adjustment in this calibrated projection design if r is large. We recommend always calculating r in a calibrated projection design, since it helps to decide whether the needed adjustment is substantial.
There are three potentially fruitful future research directions to address current limitations. First, besides the MI-based method, other methods that account for the impact of sampling variability, such as a fully Bayesian method, are worth exploring. Investigations are needed to compare the performance of other methods with the MI-based method in the framework of calibrated projection. Second, in the real data analysis, the construction of the ACM is based on an approximation without knowing the exact design. Because detailed information on missing data is not available, it is not guaranteed that the approximation accurately reflects the real scenario. Further work is needed to evaluate the performance of the ACM approximation in real data. Third, the method in this paper can be extended to many other situations when the model is more complex. In addition to one-to-one linking, calibrated projection procedures can be extended to many-to-one (Thissen et al., 2015) and many-to-many linking. Moreover, the performance of the proposed methods can be further explored in more complicated statistical models such as multilevel IRT models (e.g., Fox, 2005) and two-tier models (Cai, 2010).
Supplemental Material
Supplemental Material for Characterizing Sampling Variability for Item Response Theory Scale Scores in a Fixed-Parameter Calibrated Projection Design by Shuangshuang Xu and Yang Liu in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
