Weighted Maximum-a-Posteriori Estimation in Tests Composed of Dichotomous and Polytomous Items

Abstract

For mixed-type tests composed of dichotomous and polytomous items, polytomous items often yield more information than dichotomous items. To reflect the difference between the two types of items and to improve the precision of ability estimation, an adaptive weighted maximum-a-posteriori (WMAP) estimation is proposed. To evaluate the performance of WMAP, a Monte Carlo simulation comparison is presented with maximum likelihood estimation, maximum-a-posteriori estimation, and Jeffreys modal estimation. Simulation results show that the proposed method is much less biased than any of the other estimators, with relatively smaller standard errors and root mean square errors.

Keywords

item response theory, Bayesian estimation, maximum-a-posteriori estimation, mixed-type models

Item response theory (IRT) is a powerful tool for describing the influence of a latent trait (θ), such as ability, on an individual’s responses to a multiple-choice test. Trait estimation is one of the most important components of IRT because a major goal is to obtain a measure of the ability of each examinee administered an educational test or psychological instrument within the context of a test theory. Various θ estimation methods have been studied extensively for IRT (e.g., Baker & Kim, 2004; Birnbaum, 1968; Bock & Aitkin, 1981; Bock & Mislevy, 1982; Fox, 2010; Lord, 1980; Mislevy, 1986; Muraki, 1992; Thissen, 1982). In general, unbiased parameter estimation is desirable. Reducing bias in θ estimation can be important in several situations (Penfield & Bergeron, 2005; Wang, Hanson, & Lau, 1999; Warm, 1989). For example, when comparability between computerized adaptive testing (CAT) and paper-and-pencil tests is needed, unbiased θ estimates might help to reduce the need to equate these different versions (Eignor & Schaeffer, 1995; Segall, 1995; Segall & Carter, 1995; Wang & Kolen, 2001). Most previous bias reduction studies have focused separately on tests composed of either dichotomous items or polytomous items. However, mixed-type tests composed of dichotomous and polytomous items are commonly used in large-scale assessments, for instance, in the National Assessment of Educational Progress (NAEP). Unfortunately, few studies have been performed to exploit the potential type-difference information hidden in mixed-type tests.

In a mixed-type test composed of dichotomous and polytomous items, polytomous items usually provide more information concerning the level of a latent trait than dichotomous items (Donoghue, 1994; Embretson & Reise, 2000; Jodoin, 2003; Penfield & Bergeron, 2005). Hence, assigning larger weights to polytomous items is expected to produce more accurate estimates of the latent traits than equally weighting all items. Based on this rationale, Tao, Shi, and Chang (2010a, 2012) proposed an item-weighted likelihood method to better assess examinees’ ability levels under the assumption that the item parameters are known.

In their method, the weights were preassigned and known. On the one hand, the fixed weights may not be statistically optimal in terms of the precision and accuracy of ability estimation (Tao, Shi, & Chang, 2010b). On the other hand, like the usual maximum likelihood methods, the item-weighted likelihood method is not feasible for response patterns consisting entirely of the lowest or the highest score categories (i.e., all zeros or all m_i, where m_i is the maximum score of the ith item).

In this article, an adaptive weighted maximum-a-posteriori (WMAP) estimation method for mixed-type tests is proposed. Instead of using a set of preassigned weights, WMAP automatically selects statistically optimal weights for the different item types in a mixed-type test. Derived from a Bayesian framework, WMAP overcomes the estimation difficulty that maximum likelihood estimation (MLE) encounters for response patterns consisting entirely of the lowest or the highest score categories. Simulation results show that the new estimator has smaller bias and is computationally efficient.

The outline of this article is as follows: First, two models used in the article are briefly summarized and the WMAP estimation procedure is proposed. Second, under various testing conditions, simulation studies are described to evaluate the performance of the proposed WMAP estimation by comparing it with the usual MLE, maximum-a-posteriori (MAP) estimation, and Jeffreys modal estimation (JME). Third, a real data set from a large-scale reading assessment is used to demonstrate the estimation difference among the four procedures. Finally, the authors conclude the article with discussion and possible directions for further research.

Method

Three-Parameter Logistic Model (3PLM) and Generalized Partial-Credit Model (GPCM)

In this article, a mixed-type model is considered that is the combination of the following 3PLM and the GPCM:

P_{i} (θ) = c_{i} + (1 - c_{i}) \frac{1}{1 + exp [- a_{i} (θ - b_{i})]},

and

P_{ik} (θ) = \frac{exp [\sum_{t = 1}^{k} a_{i} (θ - b_{it})]}{\sum_{c = 1}^{m} \exp [\sum_{t = 1}^{c} a_{i} (θ - b_{it})]}, k = 1, 2, \dots, m,

where P_i(θ) in Equation 1 is the probability of a correct response to dichotomously scored item i under the 3PLM, and a_i, b_i, and c_i are the discrimination, difficulty, and guessing parameters of item i, respectively. In Equation 2, P_ik(θ) is the probability of responding in category k of polytomously scored item i under the GPCM, a_i is the discrimination parameter of item i, and b_it is the step parameter of category t of item i.
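As a minimal sketch of Equations 1 and 2 (not the authors' code), the two response functions can be written in Python as follows. The function names and the example item parameters are hypothetical and chosen only for illustration.

```python
import math

def p_3pl(theta, a, b, c):
    """Equation 1: probability of a correct response under the 3PLM."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    """Equation 2: category probabilities under the GPCM.

    steps holds b_i1, ..., b_im; entry k-1 of the result is P_ik(theta).
    """
    cum, z = 0.0, []
    for b in steps:
        cum += a * (theta - b)      # running sum over t = 1, ..., k
        z.append(cum)
    top = max(z)                    # subtract the maximum for numerical stability
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

# Hypothetical item parameters, for illustration only.
print(p_3pl(0.0, a=1.0, b=0.0, c=0.2))
print(p_gpcm(0.0, a=1.0, steps=[0.0, -1.0, 1.0]))
```

Because the term a_i(θ − b_i1) enters every category's exponent in Equation 2, it cancels in the ratio, so the value chosen for the first step parameter does not affect the category probabilities.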

WMAP Estimation

To facilitate the presentation of the estimation method, the relevant technical aspects of the IRT ability estimation methods are described in the following.

In the context of IRT ability estimation, given a response matrix U to a set of items with known parameters, the likelihood function of a mixed-type test (Baker & Kim, 2004) is

L (U | θ) = L_{d} (θ) L_{p} (θ) = Π_{i = 1}^{n_{1}} P_{i} {(θ)}^{u_{i}} {(1 - P_{i} (θ))}^{1 - u_{i}} Π_{i = n_{1} + 1}^{n} Π_{k = 1}^{m} P_{ik} {(θ)}^{u_{ik}},

where $L_{d} (θ) = Π_{i = 1}^{n_{1}} P_{i} {(θ)}^{u_{i}} {(1 - P_{i} (θ))}^{1 - u_{i}}$ and $L_{p} (θ) = Π_{i = n_{1} + 1}^{n} Π_{k = 1}^{m} P_{ik} {(θ)}^{u_{ik}}$ are the likelihood functions of the dichotomous model and the polytomous model of a mixed-type test, respectively. For an examinee with ability θ, the response matrix U contains the responses of dichotomously scored items:

u_{i} = \begin{cases} 1, & \text{if the examinee gave the correct response on dichotomous item } i, \\ 0, & \text{otherwise}, \end{cases}

for i=1, 2, . . . , n₁ with n₁ being the number of dichotomously scored items and the responses of polytomously scored items:

u_{ik} = \begin{cases} 1, & \text{if the examinee chose response } k \text{ to polytomous item } i, \\ 0, & \text{otherwise}, \end{cases}

for i = n₁ + 1, . . . , n and k = 1, 2, . . . , m, where n is the total number of items.

The usual maximum likelihood estimator is obtained by maximizing the log-likelihood function as follows:

\begin{matrix} \log L (U | θ) = \log L_{d} (θ) + \log L_{p} (θ) \\ = \sum_{i = 1}^{n_{1}} [u_{i} \log P_{i} (θ) + (1 - u_{i}) \log (1 - P_{i} (θ))] + \sum_{i = n_{1} + 1}^{n} \sum_{k = 1}^{m} (u_{ik} \log P_{ik} (θ)), \end{matrix}

or by equating the first derivative of the log-likelihood function (Equation 4) to zero:

\frac{\partial \log L (U | θ)}{\partial θ} = 0 .
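The log-likelihood in Equation 4 can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the data layout (tuples of response and item parameters), the function names, and the grid search that stands in for solving the likelihood equation are all assumptions. A polytomous response is stored as the chosen category k, which is equivalent to the indicator u_ik.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    cum, z = 0.0, []
    for b in steps:
        cum += a * (theta - b)
        z.append(cum)
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def log_lik(theta, dich, poly):
    """Equation 4. dich: (u, a, b, c) per item; poly: (k, a, steps), k in 1..m."""
    ll = 0.0
    for u, a, b, c in dich:
        p = p_3pl(theta, a, b, c)
        ll += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    for k, a, steps in poly:
        ll += math.log(p_gpcm(theta, a, steps)[k - 1])  # only u_ik = 1 contributes
    return ll

def mle_grid(dich, poly, lo=-6.0, hi=6.0, n=1201):
    """Grid-search stand-in for solving the likelihood equation (an assumption)."""
    grid = [lo + (hi - lo) * j / (n - 1) for j in range(n)]
    return max(grid, key=lambda t: log_lik(t, dich, poly))

# Hypothetical responses and item parameters.
dich = [(1, 1.0, -1.0, 0.2), (1, 1.2, 0.0, 0.2), (0, 0.8, 1.0, 0.2)]
poly = [(2, 1.0, [0.0, -1.0, 1.0])]
theta_hat = mle_grid(dich, poly)
print(round(theta_hat, 2))

# A pattern of all highest categories has no interior maximum:
# the search simply runs to the upper end of the grid.
print(mle_grid([(1, 1.0, 0.0, 0.2)], [(3, 1.0, [0.0, -1.0, 1.0])]))
```

The second call illustrates why maximum likelihood is not feasible for response patterns made up entirely of the highest (or lowest) score categories: the log-likelihood keeps increasing toward the boundary.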

Given the likelihood function L(U|θ) and a prior distribution of the ability level θ, say, f(θ), the posterior distribution can be expressed as

P (θ | U) = \frac{L (U | θ) f (θ)}{\int L (U | θ) f (θ) d θ} \hat{=} \frac{g (θ)}{p (U)},

where g(θ) = L(U|θ)f (θ) is proportional to the posterior distribution P(θ|U) and p(U) = ∫ L(U|θ)f (θ)dθ.

The MAP estimator (also called the Bayesian modal estimator) is the mode of the posterior distribution (Equation 5); that is, ${\hat{θ}}_{MAP} = \arg \max_{θ} P (θ | U)$.

Because the p(U) in Equation 5 does not contain θ, the MAP estimator is the value that maximizes g(θ) = L(U|θ)f (θ) or its logarithm:

\log g (θ) = \log L (U | θ) + \log f (θ) .

The MAP estimator can be computed by taking the derivative of Equation 6, with respect to θ, and setting the derivative to 0. Then, the MAP estimator is the solution of the following equation:

\frac{\partial \log g (θ)}{\partial θ} = \frac{\partial \log L (U | θ)}{\partial θ} + \frac{\partial \log f (θ)}{\partial θ} = 0 .

This equation can be solved using an iterative numerical method such as the Fisher scoring method (Baker & Kim, 2004; Rao, 1965).

The choice of an appropriate prior distribution f(θ) of the ability level θ is the key point of the MAP estimation method. The standard normal distribution N(0, 1) is a potential prior distribution candidate that is an informative prior. With this prior distribution, it is assumed that the ability level is symmetrically distributed around the central value of zero with a standard deviation (SD) of one unit. Using the standard normal distribution as the prior density implies that the estimated ability levels are shrunken toward the prior mean value (zero in this context) and that the MAP estimator is less variable (has lower standard errors [SEs]) than is the maximum likelihood estimator (Baker, 1992; Swaminathan & Gifford, 1986). However, it also implies an increase in the estimation bias, especially at the extremes of ability level (Chen, Hou, & Dodd, 1998; Kim & Nicewander, 1993; Lord, 1983, 1986; Wang & Vispoel, 1998). Another typical prior distribution choice is a noninformative prior, such as the uniform distribution (restricted to a finite interval), which is frequently used in Bayesian statistics (Gelman, Carlin, Stern, & Rubin, 2004). This study is focused on a commonly used noninformative prior, that is, the Jeffreys prior (Jeffreys, 1939, 1946). The Jeffreys prior is proportional to the square root of the information function:

f (θ) \propto \sqrt{I (θ)},

where $I (θ) = \sum_{i = 1}^{n} I_{i} (θ)$ with I_i(θ) denoting the information function of item i.

The Jeffreys prior is often called a noninformative prior distribution because it only requires the specification of the item response model, for instance, the mixed-type model combined with the 3PLM (Equation 1) and the GPCM (Equation 2) and the item parameter values. It can therefore be seen as a “test-driven” prior, adding more prior belief to levels that are more informative with respect to the test. In the rest of this article, the term Jeffreys modal estimator is used for the MAP estimator with Jeffreys prior distribution, and it is denoted by ${\hat{θ}}_{JM}$ . Inserting Equation 8 into Equation 7, ${\hat{θ}}_{JM}$ must satisfy the condition,

\frac{\partial \log L (U | θ)}{\partial θ} + \frac{\frac{d I (θ)}{d θ}}{2 I (θ)} = 0,

where dI(θ)/dθ is the first derivative of I(θ) with respect to θ.
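Since log f(θ) equals (1/2) log I(θ) plus a constant, the prior contributes the term I′(θ)/(2I(θ)) in Equation 9. The sketch below evaluates that term numerically. It relies on the standard item information expressions for the 3PLM and the GPCM written without a scaling constant, as in Equations 1 and 2; the article's own derivation is in its Appendix, which is not reproduced here. The function names, the tuple layout, and the central-difference step h are assumptions.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    cum, z = 0.0, []
    for b in steps:
        cum += a * (theta - b)
        z.append(cum)
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def info_3pl(theta, a, b, c):
    """3PLM item information: a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def info_gpcm(theta, a, steps):
    """GPCM item information: a^2 times the variance of the category score."""
    probs = p_gpcm(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs, start=1))
    second = sum(k * k * p for k, p in enumerate(probs, start=1))
    return a * a * (second - mean * mean)

def total_info(theta, dich, poly):
    """I(theta); dich holds (a, b, c) tuples and poly holds (a, steps) tuples."""
    return (sum(info_3pl(theta, *item) for item in dich)
            + sum(info_gpcm(theta, *item) for item in poly))

def jeffreys_score(theta, dich, poly, h=1e-5):
    """Second term of Equation 9, I'(theta) / (2 I(theta)), by central difference."""
    d_info = (total_info(theta + h, dich, poly)
              - total_info(theta - h, dich, poly)) / (2.0 * h)
    return d_info / (2.0 * total_info(theta, dich, poly))

# Hypothetical test: two dichotomous items and one four-category item.
dich = [(1.0, -0.5, 0.2), (1.2, 0.5, 0.2)]
poly = [(1.0, [0.0, -1.0, 0.0, 1.0])]
print(total_info(0.0, dich, poly), jeffreys_score(0.0, dich, poly))
```

The sign of the second value shows the direction in which the Jeffreys prior pulls the estimate: toward ability levels at which the test is more informative.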

Because the two types of models in IRT, the dichotomous model and the polytomous model, often provide different amounts of test information and play different roles in the likelihood function of mixed-type models, it seems reasonable to assign weights to the items according to their information. Therefore, a function of the ratio of the test information functions of the two types of models was taken as the weight to adjust the effect of each type of model on the likelihood function.

The primary feature of the estimation method proposed here is that an information-weighted likelihood function is used in place of the traditional likelihood function of the mixed-type model. The goal is to make the estimation method less biased while still yielding a small root mean square error (RMSE). The weight functions are thus intended as a tool for achieving technical qualities such as reduced bias.

The weighted likelihood function of a mixed-type model can be expressed as

WL (U | θ) = L_{d}^{w_{1} (θ)} (θ) L_{p}^{w_{2} (θ)} (θ),

where

w_{1} (θ) = λ_{1}^{α} (θ), λ_{1} (θ) = \frac{I_{d} (θ)}{I_{d} (θ) + I_{p} (θ)},

and

w_{2} (θ) = λ_{2}^{β} (θ), λ_{2} (θ) = 1 - λ_{1} (θ) = \frac{I_{p} (θ)}{I_{d} (θ) + I_{p} (θ)};

$I_{d} (θ) = \sum_{i = 1}^{n_{1}} I_{i} (θ)$ and $I_{p} (θ) = \sum_{i = n_{1} + 1}^{n} I_{i} (θ)$ are the test information functions of the dichotomous and polytomous models based on the mixed-type model, respectively; α and β are the ratio parameters, that is, the exponents applied to the information proportions λ₁(θ) and λ₂(θ) to form the weight functions w₁(θ) and w₂(θ), respectively.

When α = β = 0, the weighted likelihood function (Equation 10) reduces to the traditional likelihood function (Equation 3), so the weighted likelihood function can be regarded as a generalized likelihood function.
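A small sketch of Equations 10 to 12 on the log scale makes the reduction explicit. The function names and the numerical inputs are hypothetical; the information values I_d and I_p and the two log-likelihoods are assumed to have been computed elsewhere.

```python
def weights(i_d, i_p, alpha, beta):
    """Equations 11 and 12: w1 = lambda1^alpha and w2 = lambda2^beta."""
    lam1 = i_d / (i_d + i_p)
    lam2 = 1.0 - lam1
    return lam1 ** alpha, lam2 ** beta

def weighted_log_lik(log_ld, log_lp, w1, w2):
    """Logarithm of Equation 10: w1 * log L_d + w2 * log L_p."""
    return w1 * log_ld + w2 * log_lp

# Hypothetical information values and log-likelihoods at some theta.
i_d, i_p = 2.0, 6.0
log_ld, log_lp = -3.1, -1.4

print(weights(i_d, i_p, 1.0, 1.0))   # information proportions used directly
w1, w2 = weights(i_d, i_p, 0.0, 0.0)
print(weighted_log_lik(log_ld, log_lp, w1, w2), log_ld + log_lp)
```

With α = β = 0 both weights equal 1, so the weighted log-likelihood coincides with the ordinary log-likelihood, as stated above.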

Now, consider a weighted estimation method, called the WMAP estimation method. The WMAP estimator is the value that maximizes the function, WL(U|θ)f (θ), where WL(U|θ) is the weighted likelihood function of the mixed-type model (Equation 10) and f(θ) is the Jeffreys prior density. The JME technique is employed (i.e., the MAP estimation method with the Jeffreys prior) to obtain the WMAP estimator.

Replacing L(U|θ) with WL(U|θ) in Equation 9, the following is obtained:

\frac{\partial \log WL (U | θ)}{\partial θ} + \frac{\frac{d I (θ)}{d θ}}{2 I (θ)} = 0 .

The WMAP estimator is the solution of Equation 13 and can be obtained using the Fisher scoring method (for details, see Appendix).

When α = β = 0, the WMAP estimator is the MAP estimator with the Jeffreys prior, that is, the Jeffreys modal estimator.
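The following sketch locates the WMAP estimate by directly maximizing log WL(U|θ) + (1/2) log I(θ), the logarithm of WL(U|θ)f(θ) up to a constant. It is not the authors' algorithm: a bounded grid search is substituted for the Fisher scoring method, and the data layout, function names, item parameters, and response patterns are hypothetical. The item information expressions are the standard ones for the 3PLM and the GPCM.

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    cum, z = 0.0, []
    for b in steps:
        cum += a * (theta - b)
        z.append(cum)
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def wmap_objective(theta, dich, poly, alpha=1.0, beta=1.0):
    """log WL(U|theta) + 0.5 * log I(theta).

    dich: (u, a, b, c) per item; poly: (k, a, steps) per item, k in 1..m.
    """
    log_ld = log_lp = info_d = info_p = 0.0
    for u, a, b, c in dich:
        p = p_3pl(theta, a, b, c)
        log_ld += math.log(p) if u else math.log(1.0 - p)
        info_d += a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
    for k, a, steps in poly:
        probs = p_gpcm(theta, a, steps)
        log_lp += math.log(probs[k - 1])
        mean = sum(j * q for j, q in enumerate(probs, start=1))
        second = sum(j * j * q for j, q in enumerate(probs, start=1))
        info_p += a * a * (second - mean * mean)
    lam1 = info_d / (info_d + info_p)
    w1, w2 = lam1 ** alpha, (1.0 - lam1) ** beta      # Equations 11 and 12
    return w1 * log_ld + w2 * log_lp + 0.5 * math.log(info_d + info_p)

def wmap_grid(dich, poly, alpha=1.0, beta=1.0, lo=-6.0, hi=6.0, n=1201):
    """Grid-search stand-in for Fisher scoring (an assumption of this sketch)."""
    grid = [lo + (hi - lo) * j / (n - 1) for j in range(n)]
    return max(grid, key=lambda t: wmap_objective(t, dich, poly, alpha, beta))

# Hypothetical test: five dichotomous items and two four-category items.
params_d = [(1.0, b, 0.2) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
steps = [0.0, -1.0, 0.0, 1.0]
high = ([(1, *p) for p in params_d], [(4, 1.0, steps)] * 2)   # all highest scores
low = ([(0, *p) for p in params_d], [(1, 1.0, steps)] * 2)    # all lowest scores
est_high, est_low = wmap_grid(*high), wmap_grid(*low)
print(est_high, est_low)
```

Unlike the unweighted likelihood, this objective has an interior maximum even for the two extreme response patterns, because the Jeffreys prior term decreases as the test information vanishes at extreme θ. Calling `wmap_grid` with `alpha=0.0, beta=0.0` gives the corresponding grid-based Jeffreys modal estimate.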

Determination of Weights

A key issue is how to determine α and β in Equations 11 and 12. The main objective is to determine a weighting scheme for the ability estimation procedure, such that it generates more accurate estimates. A general rationale is that high-quality items should carry larger weights and low-quality items should carry smaller weights. Therefore, a common practice currently endorsed by many state assessments is investigated: Polytomously scored items carry a larger weight and dichotomously scored items carry a smaller weight. Specifically, the process determining α, β, and the two weights consists of the following three steps:

1. A suitable initial estimator ${\hat{θ}}_{0}$ is obtained. Here, the Jeffreys modal estimator is recommended.

2a. In general, the polytomous-item information I_p(θ) is considerably larger than the dichotomous-item information I_d(θ); in other words, λ₁(θ) and λ₂(θ) in Equations 11 and 12 usually satisfy λ₁(θ) < λ₂(θ) for any ability level θ. If $λ_{1} ({\hat{θ}}_{0}) < λ_{2} ({\hat{θ}}_{0})$, then $λ_{1} ({\hat{θ}}_{0})$ and $λ_{2} ({\hat{θ}}_{0})$ are used directly as the weight functions, that is, $w_{1} ({\hat{θ}}_{0}) = λ_{1} ({\hat{θ}}_{0})$ and $w_{2} ({\hat{θ}}_{0}) = λ_{2} ({\hat{θ}}_{0})$. In this case, the ratio parameters are α = β = 1, and no special adjustment is needed.

2b. If $I_{p} ({\hat{θ}}_{0})$ is smaller than $I_{d} ({\hat{θ}}_{0})$ in a mixed-type test, that is $λ_{2} ({\hat{θ}}_{0}) < λ_{1} ({\hat{θ}}_{0})$ , $λ_{1} ({\hat{θ}}_{0})$ and $λ_{2} ({\hat{θ}}_{0})$ cannot be used directly as weight functions of the WMAP estimation method because they will no longer satisfy the requirement that the weight of the polytomous likelihood function L_p(θ) should be larger than that of the dichotomous likelihood function L_d(θ). As a result, the ratio parameters should be adjusted to make the weight functions satisfy the above requirement. However, the difference between the larger and smaller weights should not be too great. Initially, the ratio parameters α and β are set to a small value ε (usually, ε < 0.2). Then, the ratio parameters are adapted based on whether $w_{1} ({\hat{θ}}_{0})$ is smaller than $w_{2} ({\hat{θ}}_{0})$ . If $w_{1} ({\hat{θ}}_{0}) < w_{2} ({\hat{θ}}_{0})$ , no change is needed for either α or β. Otherwise, the values of α or β should be adjusted slightly (e.g., α may be increased in increments of .05 or smaller, or β may be decreased in increments of .05 or smaller) until $w_{1} ({\hat{θ}}_{0}) < w_{2} ({\hat{θ}}_{0})$ .

3. The solution of Equation 13, denoted as ${\hat{θ}}^{*}$, is then obtained using the α and β values determined above, and a check is carried out to determine whether $w_{1} ({\hat{θ}}^{*}) < w_{2} ({\hat{θ}}^{*})$. If so, ${\hat{θ}}^{*}$ is the desired value, that is, the WMAP estimator. Otherwise, the ratio parameters are adjusted further using the process described in Step 2b.

The goal is to find α and β such that the weight assigned to polytomously scored items is larger than that assigned to dichotomously scored items. The computer code for the preceding algorithm is available on request.
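Steps 2a and 2b can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the function name, the default ε = 0.1, and the choice to adjust only α (rather than also decreasing β) are assumptions. The input is the information proportion λ₁ evaluated at the initial estimate.

```python
def choose_exponents(lam1, eps=0.1, step=0.05, max_iter=1000):
    """Return (alpha, beta) such that w1 = lam1**alpha < w2 = (1 - lam1)**beta."""
    lam2 = 1.0 - lam1
    if lam1 < lam2:
        return 1.0, 1.0                 # Step 2a: use the proportions directly
    alpha = beta = eps                  # Step 2b: start from a small common value
    for _ in range(max_iter):
        if lam1 ** alpha < lam2 ** beta:
            return alpha, beta
        alpha += step                   # larger alpha lowers w1 since 0 < lam1 < 1
    raise RuntimeError("no admissible alpha found within max_iter steps")

print(choose_exponents(0.3))            # polytomous items already dominate
alpha, beta = choose_exponents(0.7)     # dichotomous items carry more information
print(alpha, beta, 0.7 ** alpha, 0.3 ** beta)
```

The final check in the procedure, re-evaluating the weights at the solution of Equation 13 and adjusting again if necessary, would wrap this function in an outer loop together with an estimator; it is omitted here to keep the sketch short.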

Simulation

To investigate the performance of the proposed WMAP estimation method, an extensive simulation study was conducted to cover a wide range of test conditions, such as the total number of items and the proportions of dichotomous and polytomous items in a mixed-type test. Four θ estimation methods were considered: MLE, MAP (with a standard normal prior distribution), JME, and WMAP, all based on a mixed-type model composed of the 3PLM and the GPCM. For the comparison of the four estimators under a wide range of test lengths, nine tests were constructed: three short (30 items with 20, 15, and 10 dichotomously scored items), three medium (50 items with 30, 25, and 20 dichotomously scored items), and three long (70 items with 45, 35, and 25 dichotomously scored items). For each test length, the item discrimination parameters a_i of the 3PLM were generated from the uniform distribution U[0.5, 1.5], the item difficulty parameters b_i of the 3PLM were drawn from the uniform distribution U[−4, 4], and the guessing parameters c_i were randomly generated from the uniform distribution U[0.1, 0.3]. The item discrimination parameters of the GPCM were generated from the uniform distribution U[0.5, 1.5], and the step parameters of each polytomous item were randomly generated from four uniform distributions: b_i2 ~ U(−4, −2), b_i3 ~ U(−2, 0), b_i4 ~ U(0, 2), and b_i5 ~ U(2, 4).

At each of 17 values of θ, ranging from −4.0 to 4.0 by steps of 0.5, 500 simulated examinees were administered all nine tests. The same item responses were used for all four estimators.
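The data-generating design can be sketched for one of the nine tests (30 dichotomous and 20 polytomous items). The sketch assumes five-category polytomous items with the first step parameter fixed at 0, which is harmless because that parameter cancels in Equation 2. The seed, function names, and data layout are arbitrary choices for illustration and are not taken from the study.

```python
import math
import random

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_gpcm(theta, a, steps):
    cum, z = 0.0, []
    for b in steps:
        cum += a * (theta - b)
        z.append(cum)
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def make_test(n_dich, n_poly, rng):
    """Draw item parameters from the uniform distributions described above."""
    dich = [(rng.uniform(0.5, 1.5), rng.uniform(-4, 4), rng.uniform(0.1, 0.3))
            for _ in range(n_dich)]
    poly = []
    for _ in range(n_poly):
        steps = [0.0,                                   # b_i1 = 0 (assumption)
                 rng.uniform(-4, -2), rng.uniform(-2, 0),
                 rng.uniform(0, 2), rng.uniform(2, 4)]
        poly.append((rng.uniform(0.5, 1.5), steps))
    return dich, poly

def simulate(theta, dich, poly, rng):
    """One simulated examinee: 0/1 scores and chosen categories k in 1..5."""
    u = [1 if rng.random() < p_3pl(theta, a, b, c) else 0 for a, b, c in dich]
    ks = [rng.choices(range(1, len(steps) + 1),
                      weights=p_gpcm(theta, a, steps))[0]
          for a, steps in poly]
    return u, ks

rng = random.Random(2024)                               # arbitrary seed
dich, poly = make_test(30, 20, rng)
thetas = [-4.0 + 0.5 * j for j in range(17)]            # -4.0, -3.5, ..., 4.0
data = {t: [simulate(t, dich, poly, rng) for _ in range(500)] for t in thetas}
print(len(data), len(data[0.0]))
```

Each of the four estimators would then be applied to the same stored responses in `data`, mirroring the use of identical item responses for all estimators.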

Error Indices

Three error indices were used to evaluate the θ estimation methods: bias, SE, and RMSE. For any given true ability level and any estimator, the bias is computed as the average of the differences between the ability estimates and the true ability level, whereas the empirical SE is given by the SD of these differences. Finally, RMSE is calculated as the square root of the mean squared difference between the true ability and the corresponding ability estimate. The three indices, which are presented against the true ability levels, are defined as

Bias (\hat{θ}) = \frac{1}{N} \sum_{j = 1}^{N} ({\hat{θ}}_{j} - θ),

SE (\hat{θ}) = \sqrt{\frac{1}{N} \sum_{j = 1}^{N} {({\hat{θ}}_{j} - \frac{1}{N} \sum_{k = 1}^{N} {\hat{θ}}_{k})}^{2}},

and

RMSE (\hat{θ}) = \sqrt{\frac{1}{N} \sum_{j = 1}^{N} {({\hat{θ}}_{j} - θ)}^{2}},

where θ is the true ability level, ${\hat{θ}}_{j}$ is the estimate of θ for the jth replication, and N is the number of replications (in the simulation comparison of this study, N = 500). These indices were plotted as a function of θ (see Figures 1-9).
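The three indices translate directly into code. In this minimal sketch the function name and the example estimates are hypothetical; the divisor N (rather than N − 1) follows the SE formula above.

```python
import math

def error_indices(estimates, theta):
    """Return (bias, SE, RMSE) of the estimates at a given true ability level."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = sum(e - theta for e in estimates) / n
    se = math.sqrt(sum((e - mean) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - theta) ** 2 for e in estimates) / n)
    return bias, se, rmse

# Hypothetical estimates for three replications at true theta = 1.0.
bias, se, rmse = error_indices([0.9, 1.1, 1.3], theta=1.0)
print(bias, se, rmse)
print(rmse ** 2, bias ** 2 + se ** 2)   # the two values agree up to rounding
```

With these definitions, the squared RMSE equals the squared bias plus the squared SE, which is why a reduction in SE obtained at the price of a larger bias does not necessarily lower the RMSE.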

Figure 1.

Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 2.

Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 3.

Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 30 dichotomous items and 20 polytomous items

Figure 4.

Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 5.

Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 6.

Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 25 dichotomous items and 25 polytomous items

Figure 7.

Comparison of absolute bias among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items

Figure 8.

Comparison of SE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items

Figure 9.

Comparison of RMSE among the MLE, the MAP estimation, the JME, and the WMAP estimation for a 50-item test composed of 20 dichotomous items and 30 polytomous items

Results

The results for all three test lengths show similar trends for the four estimators. Therefore, only the results for the 50-item tests are presented.

Figures 1 to 9 show the results of absolute bias, RMSE, and SE calculated from 50-item tests for the following three simulation scenarios:

30 dichotomous items and 20 polytomous items,

25 dichotomous items and 25 polytomous items, and

20 dichotomous items and 30 polytomous items.

For each index, the absolute bias, SE, and RMSE of the four estimators presented in Figures 1 to 9 exhibit a similar profile across the three scenarios. The absolute bias values presented in Figures 1, 4, and 7 show that, among the four θ estimation methods, WMAP has the smallest bias over wider ranges of θ and that JME has a smaller bias than MAP and MLE. The SE plotted in Figures 2, 5, and 8 shows that WMAP has a slightly higher SE than MAP but a substantially smaller SE than MLE at extreme levels of the latent trait. The SE of WMAP is very similar to that of JME. At both extremes of θ, MAP has a substantially lower SE than WMAP, JME, and MLE. However, this reduction in SE comes at the expense of increased bias. Finally, the RMSE plotted in Figures 3, 6, and 9 shows that WMAP has a slightly higher RMSE than MAP in the middle range of the θ scale but a substantially smaller RMSE than MLE, particularly at extremely low or extremely high levels of θ. The RMSE of WMAP is similar to that of JME, which is consistent with the SE results described earlier.

Furthermore, as auxiliary information to use in evaluating varying degrees of scale comparability, the means and SDs of ability estimates via the four estimation methods are provided in Tables 1 to 3.

Table 1.

Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 30 Dichotomous Items and 20 Polytomous Items

	Methods
True θ	WMAP		MLE		MAP		JME
	M	SD	M	SD	M	SD	M	SD
−4.0	−3.9956	0.4810	−3.9570	0.2999	−3.4099	0.4948	−3.9716	0.4556
−3.5	−3.4958	0.4212	−3.4679	0.3015	−3.0489	0.4412	−3.4846	0.4194
−3.0	−2.9959	0.3870	−2.9851	0.3071	−2.6782	0.3964	−2.9908	0.3753
−2.5	−2.4990	0.3512	−2.4897	0.2941	−2.2486	0.3574	−2.4935	0.3451
−2.0	−1.9995	0.3281	−1.9912	0.2833	−1.8223	0.3203	−1.9982	0.3171
−1.5	−1.4912	0.3256	−1.4890	0.2834	−1.3986	0.3188	−1.4899	0.3152
−1.0	−0.9879	0.3248	−0.9875	0.2837	−0.9006	0.3128	−0.9859	0.3125
−0.5	−0.4908	0.3200	−0.4901	0.2793	−0.4476	0.3115	−0.4875	0.3056
0.0	0.0040	0.3107	0.0108	0.2726	0.0099	0.2983	0.0108	0.2975
0.5	0.5072	0.3121	0.5110	0.2735	0.5421	0.3119	0.5118	0.3168
1.0	1.0091	0.3132	1.0111	0.2747	1.0976	0.3205	1.0126	0.3203
1.5	1.5110	0.3125	1.5142	0.2712	1.6321	0.3206	1.5187	0.3178
2.0	2.0126	0.3115	2.0166	0.2690	2.1987	0.3026	2.0224	0.3000
2.5	2.5088	0.3412	2.5132	0.2814	2.7514	0.3501	2.5112	0.3348
3.0	3.0055	0.3681	3.0108	0.2945	3.3186	0.3624	3.0098	0.3543
3.5	3.5092	0.3847	3.5301	0.2900	3.9512	0.4012	3.5112	0.3812
4.0	4.0106	0.4257	4.0466	0.2844	4.5616	0.4465	4.0145	0.4177

Note: MLE = maximum likelihood estimation; MAP = maximum-a-posteriori; JME = Jeffreys modal estimation; WMAP = weighted maximum-a-posteriori.

Table 2.

Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 25 Dichotomous Items and 25 Polytomous Items

	Methods
True θ	WMAP		MLE		MAP		JME
	M	SD	M	SD	M	SD	M	SD
−4.0	−3.9837	0.4353	−3.9213	0.4631	−3.5257	0.2737	−3.9811	0.4230
−3.5	−3.4850	0.3891	−3.4413	0.3914	−3.1888	0.2701	−3.4811	0.3854
−3.0	−2.9862	0.3193	−2.9623	0.3174	−2.7531	0.2626	−2.9810	0.3109
−2.5	−2.4922	0.3051	−2.4799	0.3049	−2.3016	0.2635	−2.4889	0.3001
−2.0	−1.9984	0.2970	−1.9910	0.2929	−1.8533	0.2647	−1.9964	0.2906
−1.5	−1.4972	0.2814	−1.4899	0.2801	−1.3979	0.2553	−1.4930	0.2801
−1.0	−0.9955	0.2685	−0.9895	0.2643	−0.9370	0.2446	−0.9909	0.2640
−0.5	−0.4979	0.2735	−0.4921	0.2712	−0.4616	0.2501	−0.4936	0.2712
0.0	0.0003	0.2866	0.0056	0.2779	0.0052	0.2580	0.0054	0.2773
0.5	0.5012	0.2801	0.5060	0.2748	0.5401	0.2554	0.5066	0.2749
1.0	1.0029	0.2791	1.0065	0.2726	1.0792	0.2523	1.0076	0.2723
1.5	1.5041	0.2856	1.5081	0.2841	1.6245	0.2611	1.5101	0.2845
2.0	2.0057	0.3029	2.0093	0.2963	2.1615	0.2693	2.0142	0.2938
2.5	2.5029	0.3112	2.5065	0.3110	2.7101	0.2655	2.5162	0.3054
3.0	3.0005	0.3232	3.0029	0.3139	3.2757	0.2638	3.0199	0.3079
3.5	3.5031	0.3945	3.5214	0.4059	3.8324	0.2884	3.5171	0.4001
4.0	4.0051	0.4523	4.0367	0.4767	4.4920	0.3071	4.0159	0.4460

Note: MLE = maximum likelihood estimation; MAP = maximum-a-posteriori; JME = Jeffreys modal estimation; WMAP = weighted maximum-a-posteriori.

Table 3.

Means and Standard Deviations of the MLE, the MAP Estimation, the JME, and the WMAP Estimation for a 50-Item Test Composed of 20 Dichotomous Items and 30 Polytomous Items

	Methods
True θ	WMAP		MLE		MAP		JME
	M	SD	M	SD	M	SD	M	SD
−4.0	−3.9932	0.3964	−3.9338	0.4064	−3.5891	0.2761	−3.9817	0.3853
−3.5	−3.4962	0.3551	−3.4686	0.3514	−3.1795	0.2698	−3.4889	0.3549
−3.0	−2.9986	0.3020	−2.9772	0.3070	−2.7803	0.2622	−2.9929	0.3019
−2.5	−2.4889	0.2812	−2.4711	0.2810	−2.3432	0.2554	−2.4842	0.2810
−2.0	−1.9716	0.2694	−1.9663	0.2688	−1.8982	0.2468	−1.9710	0.2667
−1.5	−1.4822	0.2600	−1.4779	0.2594	−1.4109	0.2401	−1.4802	0.2588
−1.0	−0.9889	0.2589	−0.9900	0.2545	−0.9280	0.2378	−0.9887	0.2544
−0.5	−0.4914	0.2570	−0.4902	0.2530	−0.4489	0.2361	−0.4899	0.2524
0.0	0.0075	0.2552	0.0093	0.2505	0.0088	0.2351	0.0092	0.2500
0.5	0.5124	0.2550	0.5121	0.2458	0.5498	0.2345	0.5158	0.2497
1.0	1.0187	0.2547	1.0202	0.2496	1.0809	0.2339	1.0212	0.2496
1.5	1.5079	0.2654	1.5111	0.2610	1.6032	0.2401	1.5115	0.2599
2.0	2.0002	0.2764	2.0003	0.2714	2.1314	0.2497	2.0045	0.2696
2.5	2.5031	0.3012	2.5058	0.2998	2.6857	0.2614	2.5154	0.3001
3.0	3.0055	0.3310	3.0083	0.3256	3.2462	0.2768	3.0229	0.3202
3.5	3.5044	0.3611	3.5169	0.3714	3.8521	0.2731	3.5178	0.3605
4.0	4.0036	0.3936	4.0297	0.4020	4.4293	0.2707	4.0140	0.3815

Note: MLE = maximum likelihood estimation; MAP = maximum-a-posteriori; JME = Jeffreys modal estimation; WMAP = weighted maximum-a-posteriori.

To summarize the differences among the four estimation procedures, consider the following average absolute difference between the estimator and the examinee’s true ability:

δ_{[\cdot]} = \frac{1}{17} \sum_{t = 1}^{17} | {\hat{θ}}_{[\cdot]}^{(t)} - θ_{True}^{(t)} |,

where [·] corresponds to one of the four procedures (i.e., WMAP, MLE, MAP, or JME).

Based on Tables 1 to 3, the average absolute differences are obtained for the following three simulation scenarios:

30 dichotomous items and 20 polytomous items,

δ_{WMAP} = 0.0072, δ_{MLE} = 0.0181, δ_{MAP} = 0.2417, and δ_{JME} = 0.0130;

25 dichotomous items and 25 polytomous items,

δ_{WMAP} = 0.0053, δ_{MLE} = 0.0197, δ_{MAP} = 0.1942, and δ_{JME} = 0.0122; and

20 dichotomous items and 30 polytomous items,

δ_{WMAP} = 0.0090, δ_{MLE} = 0.0199, δ_{MAP} = 0.1770, and δ_{JME} = 0.0150.

From these results and Tables 1 to 3, the estimated ability based on the WMAP procedure is more consistent with the examinee’s true ability than that based on the other methods.
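The average absolute difference can be reproduced from the tabulated means. The sketch below uses the WMAP and MLE mean columns of Table 1 (30 dichotomous and 20 polytomous items); the variable and function names are illustrative.

```python
true_theta = [-4.0 + 0.5 * t for t in range(17)]

# Mean estimates (column M) transcribed from Table 1.
wmap = [-3.9956, -3.4958, -2.9959, -2.4990, -1.9995, -1.4912, -0.9879,
        -0.4908, 0.0040, 0.5072, 1.0091, 1.5110, 2.0126, 2.5088, 3.0055,
        3.5092, 4.0106]
mle = [-3.9570, -3.4679, -2.9851, -2.4897, -1.9912, -1.4890, -0.9875,
       -0.4901, 0.0108, 0.5110, 1.0111, 1.5142, 2.0166, 2.5132, 3.0108,
       3.5301, 4.0466]

def delta(means, truth):
    """Average absolute difference between mean estimates and true abilities."""
    return sum(abs(m - t) for m, t in zip(means, truth)) / len(truth)

print(f"{delta(wmap, true_theta):.4f}")   # 0.0072
print(f"{delta(mle, true_theta):.4f}")    # 0.0181
```

Both values agree with the δ_{WMAP} and δ_{MLE} reported for the first scenario.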

Real Data Analysis

To investigate the applicability of the WMAP estimation method in operational large-scale assessments, consider a real data set of 2,000 examinees from a recent state reading assessment composed of 50 dichotomous items and 1 polytomous item (with five categories), in which the item parameters are known. Note that the means of item discrimination, difficulty, and guessing parameters for the first 50 dichotomous items are 0.9453, −0.3339, and 0.0800, respectively. For the five-category polytomous item, the discrimination parameter is 0.7662, and the item step parameters are 0, 2.5491, 0.5740, −0.9242, and −2.3953, respectively. The data set was used by Tao et al. (2012) to illustrate their item-weighted likelihood method.

Based on the four estimation procedures (i.e., WMAP, MLE, MAP, and JME), the estimates of ability levels of the 2,000 examinees can be obtained. The estimated means of the ability levels of the 2,000 examinees based on the four procedures are −0.6932, −0.6031, −0.4526, and −0.5729, respectively, and the medians are −0.7794, −0.7192, −0.7192, and −0.7048, respectively.

In addition, consider the total absolute difference and the total relative difference of estimated abilities between WMAP and the other three methods. These two indices have the following form:

κ = \sum_{l = 1}^{2000} | {\hat{θ}}_{WMAP}^{(l)} - {\hat{θ}}_{[\cdot]}^{(l)} |,

and

ζ = \sqrt{\sum_{l = 1}^{2000} {(\frac{{\hat{θ}}_{WMAP}^{(l)} - {\hat{θ}}_{[\cdot]}^{(l)}}{{\hat{θ}}_{WMAP}^{(l)}})}^{2}},

where ${\hat{θ}}_{[\cdot]}^{(l)}$ corresponds to the estimated ability level of examinee l based on one of the three procedures (i.e., MLE, MAP, or JME). The total absolute differences of estimated abilities between WMAP and the other three procedures are 84.99, 58.39, and 80.74, respectively, and the total relative differences are 23.11, 15.05, and 20.44, respectively. From the computational results, the differences between WMAP and the other three methods are quite apparent although there is only one polytomous item among the 51 items.
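The two indices are straightforward to compute once the estimates are available. In this sketch the function names and the three example estimates are hypothetical; the real data set is not reproduced.

```python
import math

def kappa(wmap, other):
    """Total absolute difference between WMAP and another set of estimates."""
    return sum(abs(w - o) for w, o in zip(wmap, other))

def zeta(wmap, other):
    """Total relative difference; undefined if any WMAP estimate equals zero."""
    return math.sqrt(sum(((w - o) / w) ** 2 for w, o in zip(wmap, other)))

# Hypothetical estimates for three examinees.
wmap_est = [1.0, -2.0, 0.5]
other_est = [0.5, -1.0, 0.5]
print(kappa(wmap_est, other_est), zeta(wmap_est, other_est))
```

Because ζ divides by the WMAP estimate, examinees whose WMAP estimates lie close to zero can dominate the sum, which is worth keeping in mind when interpreting its magnitude.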

Finally, the rank orders of the ability estimates of all the methods were compared to see if there was any discrepancy among the four methods (i.e., WMAP, MLE, MAP, and JME). The Kendall’s tau-b rank order correlations between WMAP and each of the other three methods were calculated, and the values are .9726, .9693, and .9761, respectively. It is clear that the rank order of WMAP estimates is very consistent with those generated by the other three methods. The practical importance of WMAP is that it yields less-biased estimates while maintaining high precision (low RMSE; see Tables 1-3).
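For reference, Kendall's tau-b can be computed with a direct pairwise count, as in the sketch below. The function name and the example vectors are hypothetical, and the quadratic-time loop is chosen for clarity rather than speed.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b with the usual correction for ties."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                # tied on both variables: not counted
            if dx == 0:
                tie_x += 1              # tied on x only
            elif dy == 0:
                tie_y += 1              # tied on y only
            elif dx * dy > 0:
                conc += 1               # concordant pair
            else:
                disc += 1               # discordant pair
    denom = math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom

# Hypothetical ability estimates from two methods for four examinees.
print(kendall_tau_b([1, 2, 3, 4], [1, 3, 2, 4]))
```

A value near 1, such as those reported above, means that nearly all pairs of examinees are ordered in the same way by the two methods.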

Discussion and Conclusion

Improving the precision or accuracy of ability estimation is an important problem in IRT. Reducing the bias of Bayesian methods and controlling the SE of MLE remain challenges in real applications. In this article, an adaptive WMAP estimation method is proposed for mixed-type tests composed of dichotomous and polytomous items. In comparisons of WMAP with MLE, MAP, and JME under various conditions (such as different total numbers of items and different combinations of dichotomous and polytomous items), WMAP is clearly a less-biased estimator of θ than the other estimators for all of the tests. In addition, WMAP has a relatively small variance over the entire range of θ and a small mean square error even at noncentral θ. In fact, WMAP corrects the severe bias of MAP without sacrificing much of MAP’s low SE and RMSE. In short, the proposed estimation method is a feasible solution to the large bias of MAP and the large SE of MLE. The relatively unbiased performance of WMAP makes it particularly appropriate in applications of IRT for which the parameter invariance property is important.

The core of the WMAP estimation method is its weighting functions, w₁(θ) and w₂(θ), which are functions of the ratio of the test information functions of the two types of items in the mixed-type model. They are also functions of θ and the item parameters and are specific and adaptive to each test. The WMAP estimator has the smallest bias among the four ability estimation methods considered in this study because as much information as possible is used from the information obtainable through administering items to each examinee in a mixed-type test. Furthermore, rather than weighting each item separately, only the two types of items (dichotomous and polytomous) of the mixed-type test are weighted in this article. This is because the model would have to include too many parameters if each item were weighted separately. For example, if the number of items in the mixed-type test is 50, 50 unknown weight parameters would have to be estimated. When there are too many unknown weights, WMAP becomes very complex, and it becomes more difficult to obtain satisfactory estimation results.

The essential difference between the weighting rationale of WMAP and that of the weighted likelihood estimation (WLE) given by Warm (1989) needs to be clarified. WLE is well known for effectively reducing the bias and the SE of MLE, and it has been used successfully for bias correction by solving a weighted log-likelihood equation. The objective here is different: to develop a weighting technique for ability estimation in a mixed-type test that consists of dichotomous and polytomous items. Because polytomous items usually provide more information than dichotomous items, assigning larger weights to polytomous items should lead to more accurate ability estimates than weighting all items equally. WMAP therefore distinguishes between the information obtained from the different item types. More specifically, in WMAP, items of the same type have the same weight, which is the essential difference from the weighting rationale of WLE.
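The idea that items of the same type share one weight can be sketched as a posterior mode computed from a type-weighted log-likelihood. The Python sketch below is a simplified illustration under stated assumptions, not the estimator as defined in this article: the weights w_d and w_p are fixed constants rather than functions of θ, the prior is standard normal, the scaling constant is D = 1, the mode is found by a plain grid search, and all function names are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM (D = 1 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def gpcm_probs(theta, a, steps):
    """Category probabilities under the GPCM; steps holds b_1..b_m."""
    z = [0.0]
    for b in steps:
        z.append(z[-1] + a * (theta - b))
    top = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]


def weighted_log_posterior(theta, dich, poly, w_d, w_p, prior_sd=1.0):
    """Type-weighted log-likelihood plus a normal log-prior (up to a constant).

    dich: list of ((a, b, c), u) with u in {0, 1}
    poly: list of ((a, steps), x) with x in {0, ..., m}
    """
    ll_d = 0.0
    for (a, b, c), u in dich:
        p = p_3pl(theta, a, b, c)
        ll_d += math.log(p if u == 1 else 1.0 - p)
    ll_p = sum(math.log(gpcm_probs(theta, a, steps)[x]) for (a, steps), x in poly)
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return w_d * ll_d + w_p * ll_p + log_prior


def weighted_map(dich, poly, w_d=1.0, w_p=1.0, lo=-4.0, hi=4.0, n=8001):
    """Grid-search mode of the weighted posterior; w_d = w_p = 1 gives ordinary MAP."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return max(grid, key=lambda t: weighted_log_posterior(t, dich, poly, w_d, w_p))


# Hypothetical responses: three incorrect 3PLM items, one GPCM item in the top category.
dich = [((1.0, 0.0, 0.2), 0)] * 3
poly = [((1.0, [-0.5, 0.5]), 2)]
theta_map = weighted_map(dich, poly)                 # equal weights
theta_weighted = weighted_map(dich, poly, w_p=2.0)   # polytomous part weighted more
```

With only two weights, the estimate moves toward whatever the more heavily weighted item type suggests, while the number of quantities to be chosen stays fixed regardless of test length.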

WMAP is preferable to the other three estimation methods considered in terms of balancing the accuracy (bias) and precision (RMSE) of ability estimates. Nevertheless, better estimation results may be possible by incorporating new, effective techniques into the item-weighting scheme. For example, the iterative maximum-a-posteriori (IMAP; Magis & Raîche, 2010) method is an appealing ability estimation method that detects multiple local likelihood maxima to find the true proficiency level in dichotomous models. In mixed-type tests composed of dichotomous and polytomous items, the IMAP technique can be applied to the dichotomous items to overcome possible shortcomings of the MAP method itself. For the polytomous items, however, a similar IMAP technique still needs to be developed and its performance evaluated.

Finally, the proposed weighting scheme can be generalized to a broad range of applications. For example, it can be applied to CAT, not only to lower item exposure rates but also to improve ability estimation (e.g., Chang, Tao, & Wang, 2010), as well as to multistage linear testing. Although WMAP was used in the current study only with a combination of the 3PLM and the GPCM, it should work well for other models, such as the 1PLM, the 2PLM, the partial-credit model (PCM), the graded response model (GRM), and their various combinations.

Acknowledgements

The authors would like to thank Editor in Chief Dr. Mark L. Davison and three anonymous reviewers for their valuable comments and constructive suggestions.

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work is supported by the National Natural Science Foundation of China (Grants 11171059 and 10931002), the Natural Science Foundation of Jilin Province (201115005), the Science and Technology Development Foundation of Jilin Province (20120665 and 20100181), and the Fundamental Research Funds for the Central Universities (10JCXK007).

References

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York, NY: Marcel Dekker.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed., revised and expanded). New York, NY: Marcel Dekker.

Birnbaum (1968). Some latent ability models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.

Chang, H.-H., Tao, & Wang (2010, July). The item-weighted likelihood method for computerized adaptive testing. Paper presented at the 75th meeting of the Psychometric Society, The University of Georgia, Athens.

Chen, S.-K., Hou, & Dodd, B. G. (1998). A comparison of maximum likelihood estimation and expected a posteriori estimation in CAT using the partial credit model. Educational and Psychological Measurement, 58, 569-595.

Donoghue, J. R. (1994). An empirical examination of the IRT information of polytomously scored reading items under the generalized partial credit model. Journal of Educational Measurement, 31, 295-311.

Eignor, D. R., & Schaeffer, G. A. (1995, April). Comparability studies for GRE General CAT and the NCLEX using CAT. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.

Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York, NY: Springer.

Gelman, Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC Press.

Jeffreys (1939). Theory of probability. Oxford, UK: Oxford University Press.

Jeffreys (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186, 453-461.

Jodoin, M. G. (2003). Measurement efficiency of innovative item formats in computer-based testing. Journal of Educational Measurement, 40, 1-15.

Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58, 587-599.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48, 233-245.

Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157-162.

Magis & Raîche (2010). An iterative maximum a posteriori estimation of proficiency level to detect multiple local likelihood maxima. Applied Psychological Measurement, 34, 75-89.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.

Muraki (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Penfield, R. D., & Bergeron, J. M. (2005). Applying a weighted maximum likelihood latent trait estimator to the generalized partial credit model. Applied Psychological Measurement, 29, 218-233.

Rao, C. R. (1965). Linear statistical inference and its applications. New York, NY: Wiley.

Segall, D. O. (1995, April). Equating the CAT-ASVAB: Experiences and lessons learned. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Segall, D. O., & Carter (1995, April). Equating the CAT-GATB: Issues and approach. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Swaminathan & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.

Tao, Shi, N.-Z., & Chang, H.-H. (2010a, April). Item-weighted WLE for ability estimation in tests composed of both dichotomous and polytomous items. Paper presented at the annual meeting of the American Educational Research Association, Denver, CO.

Tao, Shi, N.-Z., & Chang, H.-H. (2010b, April). Optimal item-weighted WLE methods for ability estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO.

Tao, Shi, N.-Z., & Chang, H.-H. (2012). Item-weighted likelihood method for ability estimation in tests composed of both dichotomous and polytomous items. Journal of Educational and Behavioral Statistics, 37, 298-315.

Thissen (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.

Wang, Hanson, B. A., & Lau, C.-M. A. (1999). Reducing bias in computerized adaptive testing trait estimation: A comparison of approaches. Applied Psychological Measurement, 23, 263-278.

Wang & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria, and an example. Journal of Educational Measurement, 38, 19-49.

Wang & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35, 109-135.

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.