Abstract
For mixed-type tests composed of dichotomous and polytomous items, polytomous items often yield more information than dichotomous items. To reflect the difference between the two types of items and to improve the precision of ability estimation, an adaptive weighted maximum-a-posteriori (WMAP) estimation is proposed. To evaluate the performance of WMAP, a Monte Carlo simulation comparison is presented with maximum likelihood estimation, maximum-a-posteriori estimation, and Jeffreys modal estimation. Simulation results show that the proposed method is much less biased than any of the other estimators, with relatively smaller standard errors and root mean square errors.
Item response theory (IRT) is a powerful tool for describing the influence of a latent trait (θ), such as ability, on an individual’s responses to a multiple-choice test. Trait estimation is one of the most important components of IRT because a major goal is to obtain a measure of the ability of each examinee administered an educational test or psychological instrument within the context of a test theory. Various θ estimation methods have been studied extensively for IRT (e.g., Baker & Kim, 2004; Birnbaum, 1968; Bock & Aitkin, 1981; Bock & Mislevy, 1982; Fox, 2010; Lord, 1980; Mislevy, 1986; Muraki, 1992; Thissen, 1982). In general, unbiased parameter estimation is desirable. Reducing bias in θ estimation can be important in several situations (Penfield & Bergeron, 2005; Wang, Hanson, & Lau, 1999; Warm, 1989). For example, when comparability between computerized adaptive testing (CAT) and paper-and-pencil tests is needed, unbiased θ estimates might help to reduce the need to equate these different versions (Eignor & Schaeffer, 1995; Segall, 1995; Segall & Carter, 1995; Wang & Kolen, 2001). Most previous bias reduction studies have focused separately on tests composed of either dichotomous items or polytomous items. However, mixed-type tests composed of dichotomous and polytomous items are commonly used in large-scale assessments, for instance, in the National Assessment of Educational Progress (NAEP). Unfortunately, few studies have been performed to exploit the potential type-difference information hidden in mixed-type tests.
In a mixed-type test composed of dichotomous and polytomous items, polytomous items usually provide more information concerning the level of a latent trait than dichotomous items (Donoghue, 1994; Embretson & Reise, 2000; Jodoin, 2003; Penfield & Bergeron, 2005). Hence, assigning larger weights to polytomous items is expected to produce more accurate estimates of the latent traits than equally weighting all items. Based on this rationale, Tao, Shi, and Chang (2010a, 2012) proposed an item-weighted likelihood method to better assess examinees’ ability levels under the assumption that the item parameters are known.
In their method, the weights were preassigned and known. On the one hand, the fixed weights may not be statistically optimal in terms of the precision and accuracy of ability estimation (Tao, Shi, & Chang, 2010b). On the other hand, like the usual maximum likelihood methods, the item-weighted likelihood method is not feasible for response patterns consisting of all lowest or all highest score categories (i.e., all zeros or all $m_i$s, where $m_i$ is the maximum score of the ith item).
In this article, an adaptive weighted maximum-a-posteriori (WMAP) estimation method for mixed-type tests is proposed. Instead of using a set of preassigned weights, WMAP automatically selects statistically optimal weights for the different item types in a mixed-type test. Because it is derived from a Bayesian framework, WMAP overcomes the estimation difficulty that maximum likelihood estimation (MLE) encounters with extreme response patterns. Simulation results show that the new estimator has smaller bias and is computationally efficient.
The outline of this article is as follows: First, two models used in the article are briefly summarized and the WMAP estimation procedure is proposed. Second, under various testing conditions, simulation studies are described to evaluate the performance of the proposed WMAP estimation by comparing it with the usual MLE, maximum-a-posteriori (MAP) estimation, and Jeffreys modal estimation (JME). Third, a real data set from a large-scale reading assessment is used to demonstrate the estimation difference among the four procedures. Finally, the authors conclude the article with discussion and possible directions for further research.
Method
Three-Parameter Logistic Model (3PLM) and Generalized Partial-Credit Model (GPCM)
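In this article, a mixed-type model is considered that combines the 3PLM and the GPCM:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]} \qquad (1)$$

and

$$P_{ik}(\theta) = \frac{\exp\left[\sum_{t=1}^{k} a_i(\theta - b_{it})\right]}{\sum_{r=1}^{m}\exp\left[\sum_{t=1}^{r} a_i(\theta - b_{it})\right]}, \quad k = 1, 2, \ldots, m, \qquad (2)$$

where $P_i(\theta)$ in Equation 1 is the probability of a correct response to item i, determined by the 3PLM for dichotomously scored items; $a_i$, $b_i$, and $c_i$ in Equation 1 are the discrimination, difficulty, and guessing parameters of item i, respectively. In Equation 2, $P_{ik}(\theta)$ is the probability of a response in category k of item i, determined by the GPCM for polytomously scored items; $a_i$ is the discrimination parameter of item i; and $b_{it}$ is the step parameter of item i for category t.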
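The two response functions can be sketched in a few lines of code. This is a minimal illustration, assuming the logistic metric without a scaling constant and the convention that the first step parameter of each polytomous item is fixed at 0; the function names `p_3pl` and `p_gpcm` are hypothetical and not taken from the authors' code.

```python
import math

def p_3pl(theta, a, b, c):
    """3PLM probability of a correct response (no scaling constant assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    """GPCM category probabilities for one item.

    steps holds the step parameters b_i1, ..., b_im (b_i1 = 0 by assumption),
    so the item has len(steps) score categories.
    """
    # Cumulative sums of a * (theta - b_t) give the numerator exponents.
    exponents, total = [], 0.0
    for b_t in steps:
        total += a * (theta - b_t)
        exponents.append(total)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    norm = sum(weights)
    return [w / norm for w in weights]

probs = p_gpcm(0.5, 1.0, [0.0, -1.0, 0.0, 1.0])
```

The category probabilities returned by `p_gpcm` sum to 1 for any θ, which is a quick sanity check on the normalization.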
WMAP Estimation
To facilitate the presentation of the estimation method, the relevant technical aspects of the IRT ability estimation methods are described in the following.
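In the context of IRT ability estimation, given a response matrix $\mathbf{u}$, the likelihood function of the mixed-type model is

$$L(\mathbf{u}\mid\theta) = \prod_{i=1}^{n_1} P_i(\theta)^{u_i}\,[1 - P_i(\theta)]^{1-u_i}\;\prod_{i=n_1+1}^{n}\prod_{k=1}^{m} P_{ik}(\theta)^{u_{ik}}, \qquad (3)$$

where $P_i(\theta)$ and $P_{ik}(\theta)$ are the response probabilities of Equations 1 and 2, the responses to dichotomously scored items are

$$u_i = \begin{cases} 1, & \text{if the response to item } i \text{ is correct},\\ 0, & \text{otherwise}, \end{cases}$$

for i = 1, 2, . . . , n1, with n1 being the number of dichotomously scored items, and the responses to polytomously scored items are

$$u_{ik} = \begin{cases} 1, & \text{if the response to item } i \text{ falls in category } k,\\ 0, & \text{otherwise}, \end{cases}$$

for i = n1 + 1, . . . , n, k = 1, 2, . . . , m, where n is the total number of items.

The usual maximum likelihood estimator is obtained by maximizing the log-likelihood function

$$\log L(\mathbf{u}\mid\theta) = \sum_{i=1}^{n_1}\left\{u_i \log P_i(\theta) + (1 - u_i)\log[1 - P_i(\theta)]\right\} + \sum_{i=n_1+1}^{n}\sum_{k=1}^{m} u_{ik}\log P_{ik}(\theta), \qquad (4)$$

or by equating the first derivative of the log-likelihood function (Equation 4) to zero:

$$\frac{\partial \log L(\mathbf{u}\mid\theta)}{\partial\theta} = 0.$$

Given the likelihood function $L(\mathbf{u}\mid\theta)$ and a prior density $f(\theta)$, the posterior density of θ is

$$p(\theta\mid\mathbf{u}) = \frac{L(\mathbf{u}\mid\theta)\,f(\theta)}{\int g(\theta)\,d\theta}, \qquad (5)$$

where $g(\theta) = L(\mathbf{u}\mid\theta)\,f(\theta)$.

The MAP estimator (also called the Bayesian modal estimator) is the mode of the posterior distribution (Equation 5); that is,

$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta\mid\mathbf{u}).$$

Because $p(\theta\mid\mathbf{u})$ is proportional to $g(\theta)$, the MAP estimator equivalently maximizes

$$\log g(\theta) = \log L(\mathbf{u}\mid\theta) + \log f(\theta). \qquad (6)$$

The MAP estimator can be computed by taking the derivative of Equation 6 with respect to θ and setting the derivative to 0. Then, the MAP estimator is the solution of the following equation:

$$\frac{\partial \log L(\mathbf{u}\mid\theta)}{\partial\theta} + \frac{\partial \log f(\theta)}{\partial\theta} = 0. \qquad (7)$$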
This equation can be solved using an iterative numerical method such as the Fisher scoring method (Baker & Kim, 2004; Rao, 1965).
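As an illustration of the Fisher scoring iteration, the sketch below solves the MAP equation for dichotomous 3PLM items only, assuming a standard normal prior and the logistic metric without a scaling constant. The function name `map_fisher_scoring`, the starting value, and the convergence tolerance are illustrative choices, not the authors' implementation.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def map_fisher_scoring(responses, items, theta0=0.0, tol=1e-8, max_iter=100):
    """MAP estimate of theta under an assumed N(0, 1) prior for 3PLM items.

    responses: list of 0/1 scores; items: list of (a, b, c) tuples.
    """
    theta = theta0
    for _ in range(max_iter):
        score = -theta   # derivative of the log N(0, 1) prior density
        info = 1.0       # prior precision, added to the test information
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            dp = a * (p - c) * (1.0 - p) / (1.0 - c)   # dP/dtheta for the 3PLM
            score += (u - p) * dp / (p * (1.0 - p))
            info += dp * dp / (p * (1.0 - p))
        step = score / info          # Fisher scoring update
        theta += step
        if abs(step) < tol:
            break
    return theta

items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
estimate = map_fisher_scoring([1, 1, 1], items)
```

Note that the all-correct pattern used here still yields a finite estimate, because the prior term pulls the solution back toward zero; the maximum likelihood estimate does not exist for such a pattern.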
The choice of an appropriate prior distribution f(θ) of the ability level θ is the key point of the MAP estimation method. The standard normal distribution N(0, 1) is a potential prior distribution candidate that is an informative prior. With this prior distribution, it is assumed that the ability level is symmetrically distributed around the central value of zero with a standard deviation (SD) of one unit. Using the standard normal distribution as the prior density implies that the estimated ability levels are shrunken toward the prior mean value (zero in this context) and that the MAP estimator is less variable (has lower standard errors [SEs]) than is the maximum likelihood estimator (Baker, 1992; Swaminathan & Gifford, 1986). However, it also implies an increase in the estimation bias, especially at the extremes of ability level (Chen, Hou, & Dodd, 1998; Kim & Nicewander, 1993; Lord, 1983, 1986; Wang & Vispoel, 1998). Another typical prior distribution choice is a noninformative prior, such as the uniform distribution (restricted to a finite interval), which is frequently used in Bayesian statistics (Gelman, Carlin, Stern, & Rubin, 2004). This study is focused on a commonly used noninformative prior, that is, the Jeffreys prior (Jeffreys, 1939, 1946). The Jeffreys prior is proportional to the square root of the information function:
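$$f(\theta) \propto \sqrt{I(\theta)}, \qquad (8)$$

where $I(\theta)$ is the test information function, that is, the expected Fisher information about θ provided by the test.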
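The Jeffreys prior is often called a noninformative prior distribution because it only requires the specification of the item response model (for instance, the mixed-type model combining the 3PLM of Equation 1 and the GPCM of Equation 2) and the item parameter values. It can therefore be seen as a “test-driven” prior, adding more prior belief to levels that are more informative with respect to the test. In the rest of this article, the term Jeffreys modal estimator is used for the MAP estimator with the Jeffreys prior distribution, and it is denoted by $\hat\theta_{\mathrm{JME}}$. Because the logarithm of the Jeffreys prior equals $\tfrac{1}{2}\log I(\theta)$ up to an additive constant, the Jeffreys modal estimator is the solution of

$$\frac{\partial \log L(\mathbf{u}\mid\theta)}{\partial\theta} + \frac{1}{2I(\theta)}\,\frac{dI(\theta)}{d\theta} = 0, \qquad (9)$$

where dI(θ)/dθ is the first derivative of I(θ) with respect to θ.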
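The “test-driven” character of the Jeffreys prior can be seen numerically. The sketch below evaluates the unnormalized prior, the square root of the test information, for a small set of 3PLM items; it assumes the logistic metric without a scaling constant, and the function names are hypothetical.

```python
import math

def item_info_3pl(theta, a, b, c):
    """Fisher information of one 3PLM item at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    dp = a * (p - c) * (1.0 - p) / (1.0 - c)   # dP/dtheta
    return dp * dp / (p * (1.0 - p))

def jeffreys_prior_unnormalized(theta, items):
    """Square root of the test information; items are (a, b, c) tuples."""
    return math.sqrt(sum(item_info_3pl(theta, a, b, c) for a, b, c in items))

items = [(1.0, 0.0, 0.2)] * 3            # three items centered at theta = 0
center = jeffreys_prior_unnormalized(0.0, items)
extreme = jeffreys_prior_unnormalized(4.0, items)
```

For a test whose items are centered at zero, the prior is larger near θ = 0 than at θ = 4, reflecting where the test is most informative.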
As the two types of models in IRT, the dichotomous model and the polytomous model, often provide different test information and play different roles in the likelihood function of mixed-type models, it seems reasonable to assign weights to the items according to their information. Therefore, a function of the ratio of test information functions of the two types of models as weights was taken to adjust the effect of different types of models on the likelihood function.
The primary feature of the estimation method proposed here is that a weighted information function is used instead of the traditional likelihood estimation based on mixed-type models. The goal is to make the estimation method less biased and still yield a smaller root mean square error (RMSE). With this estimation method, the weighted function is intended to serve as a tool to achieve technical qualities such as reduced bias.
The weighted likelihood function of a mixed-type model can be expressed as
where
and
When α = β = 0, the weighted likelihood function (Equation 10) reduces to the traditional likelihood function (Equation 3), so the weighted likelihood function can be regarded as a generalized likelihood function.
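The reduction to the ordinary likelihood can be illustrated with a generic weighted log-likelihood in which the dichotomous and polytomous parts are multiplied by two weights. This is only a sketch: the weights `w1` and `w2` are passed in as plain numbers (standing for the weighting functions evaluated at the current θ), their functional form in terms of α, β, and the information ratio is not reproduced here, and all function names are hypothetical.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    exponents, total = [], 0.0
    for b_t in steps:
        total += a * (theta - b_t)
        exponents.append(total)
    top = max(exponents)
    weights = [math.exp(z - top) for z in exponents]
    norm = sum(weights)
    return [w / norm for w in weights]

def weighted_log_likelihood(theta, dich, poly, w1=1.0, w2=1.0):
    """dich: (u, a, b, c) tuples with u in {0, 1}.
    poly: (k, a, steps) tuples with k the observed category, 1..len(steps).
    """
    ll_d = 0.0
    for u, a, b, c in dich:
        p = p_3pl(theta, a, b, c)
        ll_d += math.log(p) if u == 1 else math.log(1.0 - p)
    ll_p = 0.0
    for k, a, steps in poly:
        ll_p += math.log(p_gpcm(theta, a, steps)[k - 1])
    # A weight used as an exponent on the likelihood multiplies the log-likelihood.
    return w1 * ll_d + w2 * ll_p

dich = [(1, 1.0, 0.0, 0.2), (0, 1.2, 0.5, 0.1)]
poly = [(3, 0.8, [0.0, -1.0, 0.0, 1.0])]
unweighted = weighted_log_likelihood(0.3, dich, poly)
```

With both weights equal to 1, the function returns the ordinary log-likelihood of the mixed-type model, mirroring the case α = β = 0.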
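Now, consider a weighted estimation method, called the WMAP estimation method. The WMAP estimator is the value that maximizes the function $WL(\mathbf{u}\mid\theta)\,f(\theta)$, where $f(\theta)$ is the Jeffreys prior.
Replacing $L(\mathbf{u}\mid\theta)$ with $WL(\mathbf{u}\mid\theta)$ in Equation 6 and setting the derivative with respect to θ to zero gives

$$\frac{\partial \log WL(\mathbf{u}\mid\theta)}{\partial\theta} + \frac{\partial \log f(\theta)}{\partial\theta} = 0. \qquad (13)$$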
The WMAP estimator is the solution of Equation 13 and can be obtained using the Fisher scoring method (for details, see Appendix).
When α = β = 0, the WMAP estimator is the MAP estimator with the Jeffreys prior, that is, the Jeffreys modal estimator.
Determination of Weights
A key issue is how to determine α and β in Equations 11 and 12. The main objective is to determine a weighting scheme for the ability estimation procedure, such that it generates more accurate estimates. A general rationale is that high-quality items should carry larger weights and low-quality items should carry smaller weights. Therefore, a common practice currently endorsed by many state assessments is investigated: Polytomously scored items carry a larger weight and dichotomously scored items carry a smaller weight. Specifically, the process determining α, β, and the two weights consists of the following three steps:
First, a suitable initial estimator 2a. In general, the polytomous-item information Ip(θ) is significantly larger than the dichotomous-item information Id(θ); in other words, λ1(θ) and λ2(θ) in Equations 11 and 12 usually satisfy λ1(θ) < λ2(θ) for any ability level θ. If 2b. If
Next, the solution of Equation 13 is obtained, denoted as
The goal is to find α and β such that the weight assigned to polytomously scored items is larger than that assigned to dichotomously scored items. The computer code for the preceding algorithm is available on request.
Simulation
To investigate the performance of the proposed WMAP estimation method, an intensive simulation study was conducted to cover a wide range of test conditions, such as the total number of items and the proportion of dichotomous and polytomous items in a mixed-type test. Four θ estimation methods were considered: MLE, MAP (with a standard normal prior distribution), JME, and WMAP, all based on a mixed-type model composed of the 3PLM and the GPCM. To compare the four estimators under a wide range of test lengths, nine tests were constructed: three short (30 items with 20, 15, and 10 dichotomously scored items), three medium (50 items with 30, 25, and 20 dichotomously scored items), and three long (70 items with 45, 35, and 25 dichotomously scored items). For each test length, the item discrimination parameters of the 3PLM were generated from the uniform distribution U[0.5, 1.5], the item difficulty parameters bi of the 3PLM were drawn from the uniform distribution U[−4, 4], and the guessing parameters were randomly generated from the uniform distribution U[0.1, 0.3]. The item discrimination parameters of the GPCM were generated from the uniform distribution U[0.5, 1.5], and the step parameters of each polytomous item were randomly generated from four uniform distributions: bi2~U(−4, −2), bi3~U(−2, 0), bi4~U(0, 2), and bi5~U(2, 4).
At each of 17 values of θ, ranging from −4.0 to 4.0 by steps of 0.5, 500 simulated examinees were administered all nine tests. The same item responses were used for all four estimators.
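The data-generation step of such a design can be sketched as follows. The parameter ranges and the θ grid follow the description above; the seed, the function names, the logistic metric without a scaling constant, and the convention that the first step parameter is fixed at 0 (giving five score categories) are assumptions made for this illustration.

```python
import math
import random

def make_test(n_dich, n_poly, rng):
    """Draw item parameters for one mixed-type test."""
    dich = [(rng.uniform(0.5, 1.5), rng.uniform(-4.0, 4.0), rng.uniform(0.1, 0.3))
            for _ in range(n_dich)]
    poly = []
    for _ in range(n_poly):
        a = rng.uniform(0.5, 1.5)
        # First step fixed at 0 (assumption); the other four follow the stated ranges.
        steps = [0.0, rng.uniform(-4.0, -2.0), rng.uniform(-2.0, 0.0),
                 rng.uniform(0.0, 2.0), rng.uniform(2.0, 4.0)]
        poly.append((a, steps))
    return dich, poly

def simulate_examinee(theta, dich, poly, rng):
    """Return (0/1 scores, category scores in 1..5) for one simulated examinee."""
    dich_resp = []
    for a, b, c in dich:
        p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
        dich_resp.append(1 if rng.random() < p else 0)
    poly_resp = []
    for a, steps in poly:
        exponents, total = [], 0.0
        for b_t in steps:
            total += a * (theta - b_t)
            exponents.append(total)
        top = max(exponents)
        weights = [math.exp(z - top) for z in exponents]
        # Sample a category with probability proportional to its GPCM weight.
        draw = rng.random() * sum(weights)
        category, cum = len(weights), 0.0
        for k, w in enumerate(weights, start=1):
            cum += w
            if draw < cum:
                category = k
                break
        poly_resp.append(category)
    return dich_resp, poly_resp

rng = random.Random(2012)                       # arbitrary seed for reproducibility
dich, poly = make_test(30, 20, rng)             # one of the 50-item conditions
thetas = [-4.0 + 0.5 * j for j in range(17)]    # -4.0, -3.5, ..., 4.0
d_resp, p_resp = simulate_examinee(thetas[8], dich, poly, rng)
```

Repeating `simulate_examinee` 500 times at each of the 17 θ values would give the response data to which all four estimators are applied.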
Error Indices
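Three error indices were used to evaluate the θ estimation methods: bias, SE, and RMSE. For any given true ability level and any estimator, the bias is computed as the average of the differences between the ability estimates and the true ability level, whereas the empirical SE is given by the SD of these differences. Finally, RMSE is calculated as the square root of the mean squared difference between the true ability and the corresponding ability estimate. These are presented against the true ability levels:

$$\mathrm{Bias}(\hat\theta) = \frac{1}{N}\sum_{j=1}^{N}\left(\hat\theta_j - \theta\right),$$

$$\mathrm{SE}(\hat\theta) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat\theta_j - \bar{\hat\theta}\right)^2},$$

and

$$\mathrm{RMSE}(\hat\theta) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat\theta_j - \theta\right)^2},$$

where θ is the true ability level, $\hat\theta_j$ is the ability estimate for the jth simulated examinee at that level, $\bar{\hat\theta}$ is the mean of the N estimates, and N = 500 is the number of simulated examinees per ability level.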
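The three indices are straightforward to compute from the estimates obtained at one true ability level. The sketch below assumes the SD is taken with divisor N (the population form), under which the squared RMSE equals the squared bias plus the squared SE; the function name and the example values are illustrative only.

```python
import math

def error_indices(estimates, true_theta):
    """Return (bias, SE, RMSE) for the estimates at one true ability level."""
    n = len(estimates)
    diffs = [est - true_theta for est in estimates]
    bias = sum(diffs) / n
    # SD of the differences with divisor n (an assumption of this sketch).
    se = math.sqrt(sum((d - bias) ** 2 for d in diffs) / n)
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return bias, se, rmse

# Made-up estimates for illustration, not simulation results.
bias, se, rmse = error_indices([0.9, 1.1, 1.0, 1.2], 1.0)
```

The decomposition makes the trade-off discussed in this article explicit: an estimator can achieve a low RMSE through a small SE even when its bias is large.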

Figure 1. Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 2. Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 3. Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 4. Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 5. Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 6. Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 7. Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items

Figure 8. Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items

Figure 9. Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items
Results
The results of all three test lengths show similar trends for the four estimators. Therefore, only the 50-item test is presented.
Figures 1 to 9 show the results of absolute bias, RMSE, and SE calculated from 50-item tests for the following three simulation scenarios:
30 dichotomous items and 20 polytomous items,
25 dichotomous items and 25 polytomous items, and
20 dichotomous items and 30 polytomous items.
Across Figures 1 to 9, the absolute bias, SE, and RMSE of the four estimators each follow a similar pattern in the three scenarios. The absolute bias values presented in Figures 1, 4, and 7 show that, among the four θ estimation methods, WMAP has the smallest bias over a wider range of θ and that JME has a smaller bias than MAP and MLE. The SEs plotted in Figures 2, 5, and 8 show that WMAP has a slightly higher SE than MAP but a substantially smaller SE than MLE at extreme levels of the latent trait. The SE of WMAP is very similar to that of JME. At both extremes of θ, MAP has a substantially lower SE than WMAP, JME, and MLE. However, this reduction in SE comes at the expense of increased bias. Finally, the RMSE plots in Figures 3, 6, and 9 show that WMAP has a slightly higher RMSE than MAP in the middle range of the θ scale but a substantially smaller RMSE than MLE, particularly at extremely low or extremely high levels of θ. The RMSE of WMAP is similar to that of JME, which is consistent with the SE results described earlier.
Furthermore, as auxiliary information to use in evaluating varying degrees of scale comparability, the means and SDs of ability estimates via the four estimation methods are provided in Tables 1 to 3.
Table 1. Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 30 Dichotomous Items and 20 Polytomous Items
Table 2. Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 25 Dichotomous Items and 25 Polytomous Items
Table 3. Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 20 Dichotomous Items and 30 Polytomous Items
Note: MLE = maximum likelihood estimation; MAP = maximum-a-posteriori; JME = Jeffreys modal estimation; WMAP = weighted maximum-a-posteriori.
To summarize the differences among the four estimation procedures, consider the following average absolute difference between the estimator and the examinee’s true ability:
where [·] corresponds to one of the four procedures (i.e., WMAP, MLE, MAP, or JME).
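One plausible reading of this summary index is the mean, over the true ability levels, of the absolute difference between the mean estimate at each level and the true ability. The sketch below computes it under that assumption; the function name and the example values are hypothetical and are not taken from Tables 1 to 3.

```python
def average_absolute_difference(mean_estimates, true_thetas):
    """Mean absolute gap between per-level mean estimates and true abilities."""
    pairs = list(zip(mean_estimates, true_thetas))
    return sum(abs(m - t) for m, t in pairs) / len(pairs)

# Made-up values for three ability levels, for illustration only.
aad = average_absolute_difference([-0.9, 0.1, 1.2], [-1.0, 0.0, 1.0])
```

A smaller value indicates that, on average across ability levels, the procedure's mean estimate lies closer to the true ability.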
Based on Tables 1 to 3, the average absolute differences are obtained for the following three simulation scenarios:
30 dichotomous items and 20 polytomous items,
25 dichotomous items and 25 polytomous items,
20 dichotomous items and 30 polytomous items,
From these results and Tables 1 to 3, the estimated ability based on the WMAP procedure is more consistent with the examinee’s true ability than that based on the other methods.
Real Data Analysis
To investigate the applicability of the WMAP estimation method in operational large-scale assessments, consider a real data set of 2,000 examinees from a recent state reading assessment composed of 50 dichotomous items and 1 polytomous item (with five categories), in which the item parameters are known. Note that the means of item discrimination, difficulty, and guessing parameters for the first 50 dichotomous items are 0.9453, −0.3339, and 0.0800, respectively. For the five-category polytomous item, the discrimination parameter is 0.7662, and the item step parameters are 0, 2.5491, 0.5740, −0.9242, and −2.3953, respectively. The data set was used by Tao et al. (2012) to illustrate their item-weighted likelihood method.
Based on the four estimation procedures (i.e., WMAP, MLE, MAP, and JME), the estimates of ability levels of the 2,000 examinees can be obtained. The estimated means of the ability levels of the 2,000 examinees based on the four procedures are −0.6932, −0.6031, −0.4526, and −0.5729, respectively, and the medians are −0.7794, −0.7192, −0.7192, and −0.7048, respectively.
In addition, consider the total absolute difference and the total relative difference of estimated abilities between WMAP and the other three methods. These two indices have the following form:
and
where
Finally, the rank orders of the ability estimates of all the methods were compared to see if there was any discrepancy among the four methods (i.e., WMAP, MLE, MAP, and JME). The Kendall’s tau-b rank order correlations between WMAP and each of the other three methods were calculated, and the values are .9726, .9693, and .9761, respectively. It is clear that the rank order of WMAP estimates is very consistent with those generated by the other three methods. The practical importance of WMAP is that it yields less-biased estimates while maintaining high precision (low RMSE; see Tables 1-3).
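For readers who wish to reproduce this kind of rank-order comparison, Kendall's tau-b between two vectors of ability estimates can be computed directly from pair counts. The following is a plain, quadratic-time sketch with a hypothetical function name; it is not the authors' code, and the inputs shown are toy values rather than the assessment data.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, with the standard correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied on both variables: not counted
            if dx == 0:
                ties_x += 1           # tied on x only
            elif dy == 0:
                ties_y += 1           # tied on y only
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom

tau = kendall_tau_b([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

For 2,000 examinees the double loop visits about two million pairs, which is still fast enough for a one-off comparison.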
Discussion and Conclusion
Improving the precision or accuracy of ability estimation is an important problem in IRT. Reducing the bias of Bayesian methods and controlling the SE of MLE still remain challenges and are always issues in real applications. In this article, an adaptive WMAP estimation method is proposed for mixed-type tests composed of dichotomous and polytomous items. By comparing WMAP with MLE, MAP, and JME under various conditions (such as different total items and different combinations of dichotomous and polytomous items), for all of the tests, WMAP is clearly a less-biased estimator of θ than the other estimators. In addition, WMAP has a relatively small variance over the entire range of θ and a small mean square error even at noncentral θ. In fact, WMAP corrects the severe bias of MAP without sacrificing much of MAP’s low SE and RMSE. In short, the proposed estimation method is a feasible solution to the large bias of MAP and the large SE of MLE. The relatively unbiased performance of WMAP makes it particularly appropriate in applications of IRT for which the parameter invariance property is important.
The core of the WMAP estimation method is its weighting functions, w1(θ) and w2(θ), which are functions of the ratio of test information functions of two types of items in the mixed-type model. They are also functions of θ and the item parameters and are specific and adaptive to each test. The WMAP estimator has the smallest bias among the four ability estimation methods considered in this study because as much information as possible is used from the information obtainable through administering items to each examinee in a mixed-type test. Furthermore, rather than weighting each item separately, just two types of items (dichotomous and polytomous) of the mixed-type test are weighted in this article. This is because the model would have to include too many parameters if each item was weighted separately. For example, if the number of items in the mixed-type test is 50, 50 unknown weight parameters would have to be estimated. When there are too many unknown weights, WMAP becomes very complex, and it becomes more difficult to obtain satisfactory estimation results.
The essential difference between the weighting rationale of WMAP and that of the weighted likelihood estimation (WLE) given by Warm (1989) needs to be clarified. WLE is well known for effectively reducing the bias and the SE of MLE. WLE has been used successfully for bias correction by solving a weighted log-likelihood equation. However, the objective is to develop a weighting technique for ability estimation in a mixed-type test that consists of dichotomous and polytomous items. As polytomous items usually provide more information than do dichotomous items, assigning larger weights to polytomous items should lead to more accurate estimates of abilities than equally weighting all items. WMAP is developed by differentiating the information obtained from the different item types. More specifically, in WMAP, items of the same type have the same weight, which is an essential difference from the weighting rationale of WLE.
Although WMAP is preferable to the other estimation methods considered, from the point of view of achieving a balance between accuracy (bias) and precision (RMSE) of ability estimates among the four estimation methods, it is possible to produce better estimation results by incorporating some new, effective techniques into the item-weighting scheme. For example, the iterative maximum-a-posteriori (IMAP; Magis & Raîche, 2010) method is an appealing ability estimation method that detects multiple local likelihood maxima to find the true proficiency level in dichotomous models. In the mixed-type tests composed of dichotomous and polytomous items, for all of the dichotomous items, the IMAP technique can be used to overcome the possible shortcomings of the MAP method itself. However, for the polytomous items, a similar IMAP technique needs to be developed, and its performance needs to be evaluated.
Finally, the proposed weighting scheme can be generalized to a broad range of applications. For example, it can be applied to CAT, not only to lower item exposure rates but also to improve ability estimation (e.g., Chang, Tao, & Wang, 2010), as well as to multistage linear testing. Although, in the current study, WMAP was only used in a combination of 3PLM and GPCM, it should work well for other models such as 1PLM, 2PLM, the partial-credit model (PCM), graded response model (GRM), and their various combinations.
Appendix
Acknowledgements
The authors would like to thank Editor in Chief Dr. Mark L. Davison and three anonymous reviewers for their valuable comments and constructive suggestions.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work is supported by the National Natural Science Foundation of China (Grants 11171059 and 10931002), the Natural Science Foundation of Jilin Province (201115005), the Science and Technology Development Foundation of Jilin Province (20120665 and 20100181), and the Fundamental Research Funds for the Central Universities (10JCXK007).
