Abstract
Plausible values can be used to either estimate population-level statistics or compute point estimates of latent variables. While it is well known that five plausible values are usually sufficient for accurate estimation of population-level statistics in large-scale surveys, the minimum number of plausible values needed to obtain accurate latent variable point estimates is unclear. This is especially relevant when an item response theory (IRT) model is estimated with MCMC (Markov chain Monte Carlo) methods in Mplus and point estimates of the IRT ability parameter are of interest, as Mplus only estimates the posterior distribution of each ability parameter. In order to obtain point estimates of the ability parameter, a number of plausible values can be drawn from the posterior distribution of each individual ability parameter and their mean (the posterior mean ability estimate) can be used as an individual ability point estimate. In this note, we conducted a simulation study to investigate how many plausible values were needed to obtain accurate posterior mean ability estimates. The results indicate that 20 is the minimum number of plausible values required to obtain point estimates of the IRT ability parameter that are comparable to marginal maximum likelihood estimation(MMLE)/expected a posteriori (EAP) estimates. A real dataset was used to demonstrate the comparison between MMLE/EAP point estimates and posterior mean ability estimates based on different number of plausible values.
The mathematical equivalency between the categorical confirmatory factor analysis (CCFA; Wirth & Edwards, 2007) and item response theory (IRT; Lord, 1980) is well documented in the psychometric literature (e.g., Kamata & Bauer, 2008; Muthén, 1984; Muthén & Christoffersson, 1981; Takane & de Leeuw, 1987). In CCFA, factor loadings and item thresholds correspond to item discrimination and item difficulty, respectively, and the factor score is often referred to as ability (or trait) in IRT (e.g., Hambleton & Swaminathan, 1985). As CCFA can be viewed as a special case of structural equation modeling (SEM; Bollen, 1989) without the structural model, it has been recommended for users of SEM and IRT to utilize the advantages of both SEM and IRT for the modeling of categorical data (e.g., Finney & DiStefano, 2013; Glockner-Rist & Hoijtink, 2003).
Traditionally, CCFA with categorical data are estimated with limited-information methods such as maximum likelihood estimation with robust standard errors (MLR) or weighted least squares methods (WLSM) that only require information in the two-way contingency table. In IRT, categorical data are usually estimated with full-information methods such as marginal maximum likelihood estimation (MMLE; Bock & Aitkin, 1981) or Markov chain Monte Carlo (MCMC) that utilize information in the whole response pattern (the full multi-way contingency table). Numerous studies (e.g., Bolt, 2005; Forero & Maydeu-Olivares, 2009; Knol & Berger, 1991; Luo, 2018a, 2018b; Paek, Cui, Öztürk Gübeş, & Yang, 2018) have investigated the feasibility of using limited information methods for IRT model estimation and found that they provide comparable estimation results to MMLE and MCMC. Likewise, SEM software programs have incorporated some full-information estimation methods as an option for the estimation of CCFA. For example, the latent variable modeling software Mplus (Muthén & Muthén, 1998-2012) added the MLR estimator, which implements MMLE estimation for categorical data (in version 3), and the Bayes estimator, which implements MCMC estimation (in version 6).
The addition of full information estimation methods in Mplus, along with limited information estimation methods such as the weighted least squares adjusted by mean and variance (WLSMV; B. Muthén, du Toit, & Spisic, 1997), makes Mplus increasingly popular as a viable software for estimation of various IRT models (e.g., Finch, 2010; Huggins-Manley & Algina, 2015). Under the one-factor CCFA model, Mplus produces estimates of factor loadings and item thresholds and provides their IRT parameterization based on CCFA-IRT relationships, using the estimators of interest here (MLR or WLSMV) with an appropriate (logit or probit) link function (Asparouhov & Muthén, 2015; see also the Note in Appendix A). For multidimensional IRT models (MIRT; Reckase, 2009), Mplus estimates of factor loadings and thresholds can be converted into their IRT counterparts using formulae provided in the literature (e.g., Finch, 2010; Luo, 2018a; McDonald, 1999).
For estimates of IRT ability parameters, Mplus produces factor scores that are equivalent to IRT ability estimates and can be directly used as such without further conversion under some restrictions on the mean and variance of the latent variable (factor, ability) (see the Note in Appendix A). In frequentist IRT, ability is often estimated as a single value used to represent the individual latent variable distribution. In Bayesian IRT via MCMC, the ability is often estimated as an empirical posterior latent variable distribution, which can also be summarized as a single value. The computation of such single values, called IRT ability point estimates, within the two (frequentist and Bayesian) IRT frameworks is discussed next.
In the frequentist paradigm, MMLE only estimates item parameters, whereas IRT ability estimates can be obtained via three approaches, namely, maximum likelihood estimation (MLE), expected a posteriori (EAP), and maximum a posteriori (MAP). As all three approaches use a single point to summarize the individual latent variable distribution, the resulting IRT ability estimates are all point estimates. It should be noted that MLE has the inherent flaw of being unable to provide estimates for examinees with zero or perfect scores, whereas EAP and MAP (hereafter referred to as MMLE/EAP and MMLE/MAP) introduce ancillary information about the latent distribution via a prior distribution and can provide ability estimates of such examinees. Also, MMLE/EAP computes the mean of the posterior distribution via quadrature points and uses this posterior mean as a point estimate, whereas MMLE/MAP computes the mode of the posterior distribution via an iterative method and uses this posterior mode as a point estimate.
In the Bayesian framework, the use of MCMC usually provides a large number of drawn values that form the empirical posterior distribution of individual ability. Also, the computation of the mean of the posterior distribution (referred to as MCMC/EAP hereafter) is straightforward by simply taking the average of the drawn values. This is not the case with the computation of MMLE/EAP and MMLE/MAP, which requires either an iterative method or quadrature points. Conceptually, the MCMC/EAP computation is not different from drawing a sample from the population and using the sample mean to infer the population mean. Consequently, the number of drawn values (sample size) affects the inferential power of the sample mean. Popular Bayesian software programs such as WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003) and Stan (Carpenter et al., 2017) automatically provide such MCMC/EAP point estimates based on the drawn values. Also, as the draws used for MCMC estimation are also utilized to compute point estimates of the latent variables, the sample size is usually not a concern due to the large number of draws (the number of iterations minus the burn-in iterations, multiplied by the number of parallel chains) used for computation of MCMC/EAP point estimates. For WinBUGS or other Bayesian software programs that implement the Metropolis-Hastings algorithm or the Gibbs sampler, thousands of iterations are usually required for MCMC estimation. As a result of this, MCMC/EAP estimates are also based on thousands of drawn values. For Stan, an emerging Bayesian software program that implements the efficient Hamiltonian Monte Carlo (HMC; Neal, 2011) algorithm, hundreds of iterations may suffice for common IRT models. Also, the MCMC/EAP point estimates are based on hundreds of drawn values, a sample size large enough to ensure accuracy of MCMC/EAP estimates. For an illustration of the sampling efficiency of Stan for IRT models, the reader may refer to Luo and Jiao (2018). However, this is not the case with Mplus, where the sample size issue becomes relevant considering that the draws for MCMC estimation are not used for the computation of latent variable point estimates and the users are at liberty to specify the number of draws from the posterior distribution used for the computation of latent variable point estimates.
Compared to estimators like MLR and WLSMV, the Bayes estimator in Mplus offers some special advantages. For example, Mplus automatically implements the posterior predictive model checking (PPMC; Rubin, 1984) procedure for model check. In addition, the MCMC method implemented in the Bayes estimator has computational advantages when dealing with high-dimensional models and/or a large number of items, scenarios where estimators such as MLR and WLSMV may become infeasible due to either issues with dimensionality or difficulty with inverting an excessively large matrix. When the Bayes estimator is used, Mplus does not automatically provide MCMC/EAP point estimates as WinBUGS and Stan do. Instead, Mplus estimates the entire posterior distribution of each individual latent variable.
To obtain MCMC/EAP point estimates in Mplus, plausible values can be used. The idea of plausible values was originally developed for the analysis of National Assessment of Educational Progress data in 1983-1984 (Mislevy, 1991). Essentially, multiple scores are provided for each student to assess the measurement error associated with each individual, and as stated by Wu (2005), “If measurement error is small, then multiple scores for an individual will be close together. If measurement error is large, then multiple scores for an individual will be far apart” (p. 115). Formally, plausible values are random draws from the posterior distribution of individual score given one’s response pattern and represent “the likely distribution of a student’s proficiency” (von Davier, Gonzalez, & Mislelvy, 2009, p. 11).
Asparouhov and Muthén (2010) stated that the posterior mean factor score (MCMC/EAP point estimate) can be computed by using multiple plausible values drawn from the posterior distribution and demonstrated that MCMC/EAP estimates based on 500 plausible values have desirable psychometric properties over MMLE/EAP with small sample size and nonnormal latent distributions. Regarding the number of plausible values, they suggested that “it is necessary to use many imputed values. For example 100 or 500 such values can yield a precise posterior distribution for a latent variable which can be used to compute the posterior mean factor score estimate” (Asparouhov & Muthén, 2010, p. 2). Although it is intuitive that more plausible values can approximate the posterior distribution better, drawing plausible values in Mplus takes time and requires postprocessing by the researcher to compute statistics of interest such as MCMC/EAP point estimates. Also, researchers may face computer hardware constraints and time limitation when drawing plausible values. For example, when analyzing a dataset with half a million examinees, which is not uncommon in large-scale testing organizations, drawing 500 plausible values for each examinee takes excessively long time and results in a matrix with half a million rows and 500 columns that may be too large for the computer memory to handle. Therefore, it is of practical interest to have guidelines on the minimum number of plausible values required to compute accurate MCMC/EAP point estimates.
Although it is well known that five plausible values are usually sufficient for accurate estimation of population-level statistics in large-scale surveys (e.g., von Davier et al., 2009), one can expect that this guideline is not applicable to the case of MCMC/EAP computation as plausible values are used differently in the following two scenarios. First, for the estimation of population-level statistics, plausible values are usually used via multiple imputation techniques (Rubin, 1987). Second, when MCMC/EAP point estimates are of interest, plausible values are simply used to compute their arithmetic mean.
Based on the above discussion, the purpose of this study is to investigate the number of plausible values needed to obtain MCMC/EAP point estimates under common large-scale testing conditions (e.g., sufficiently large sample size and normally distributed latent variable) that are comparable to MMLE/EAP point estimates produced when the MLR estimator is used in Mplus. We do not consider the WLSMV estimator for CCFA in Mplus as it produces MAP estimates, which is conceptually less aligned with MCMC/EAP estimates and has been shown to be inferior to EAP estimates in many ways (Bock & Mislevy, 1982; Mislevy & Bock, 1997).
Method
Simulation Design
We used a small-scale simulation study to explore the number of plausible values needed to obtain accurate MCMC/EAP point estimates in the sense that they are comparable with MMLE/EAP point estimates. Specifically, we investigated MCMC/EAP estimates based on 5, 10, 20, 50, 100, 200, and 500 plausible values (abbreviated as PV5, PV10, PV20, PV50, PV100, PV200, and PV500, respectively). The number of five plausible values, which is recommended for estimations of population-level statistics, was chosen here as the minimum number of plausible values as we suspected that five plausible values were not enough to produce accurate MCMC/EAP point estimates. The number of 500 plausible values, used by Asparouhov and Muthén (2010), was chosen to be the maximum number of plausible values as we suspected that 500 plausible values might be an overkill when the goal is to obtain accurate MCMC/EAP point estimates. Due to the conceptual equivalence between drawing plausible values from a posterior distribution and drawing a sample from the population, we expected that when the initial number of plausible values was small, increasing it would result in considerably better MCMC/EAP point estimates. However, when the number of plausible values reached certain thresholds, increasing it would have only negligible returns in terms of improvement of estimation accuracy.
The IRT model of choice is the two-parameter logistic model (2PLM). This model was chosen for the purpose of illustration. More complex IRT models, such as multidimensional or polytomous models, can also be used as long as model convergence is achieved. Drawing plausible values from the estimated posterior distribution is essentially the same regardless of the complexity of the IRT model. Under the 2PLM model, the probability of correct item response is
where
We simulated a test of 40 dichotomously scored items taken by 1,000 examinees. This particular test length was chosen to mimic a multiple-choice question test of a medium length; the sample size of 1,000 was selected to ensure accurate estimation of item parameters with the 2PLM (Stone, 1992). The ability parameters were generated from a standard normal distribution; that is, θ ~ N(0, 1). For the item parameters listed in Table 1, the discrimination parameters were generated from a normal distribution N(1, 0.04), and the difficulty parameters from a standard normal distribution N(0,1). We generated 100 datasets based on Equation 1. For each dataset, we fit the 2PLM with both the MLR and Bayes estimators in Mplus. With the Bayes estimator, the default uninformative prior N(0, 5) in Mplus was used for the factor loading and item threshold parameters. We specified Mplus to run four parallel chains with each containing a minimum of 5,000 iterations. In addition, we requested Mplus to draw from the estimated posterior distribution of each individual latent variable 500 plausible values, from which we used the first 5 values to compute MCMC/EAP point estimates for PV5, first 10 for PV10, first 20 for PV20, first 50 for PV50, first 100 for PV100, first 200 for PV200, and all 500 for PV500. The Mplus syntax codes for estimating the 2PL IRT model with the MLR estimator and the Bayes estimator are provided in Appendixes A and B, respectively.
Item Parameter Values Used for Data Generation.
Outcome Variables
We evaluate the quality of point estimates for the ability parameter in terms of the correlation between estimated and true values, Bias, standard error (SE), and root mean square error (RMSE). The Bias, SE, and RMSE statistics were chosen as they account for systematic error, random error, and total error of parameter recovery, respectively. These statistics are defined as follows:
and
where
Results
The boxplot in Figure 1 provides a visual presentation of the correlation between the eight sets of ability estimates and the generating true ability values. As can be seen, the correlation between MCMC/EAP estimates based on plausible values and true values increases with an increase of the number of plausible values. However, the increase of the number of plausible values has diminishing returns regarding the increase of their correlation with the true values. With only five plausible values, the mean correlation between PV5 and the true values is 0.922, whereas the mean correlation between MMLE/EAP estimates and the true values is 0.937. When the number of plausible values is greater than 50, the mean correlation with the true values remains approximately 0.936, which only differs from the correlation between MMLE/EAP and the true values at the third decimal place. Such an observation is consistent with our expectation that an increase of the number of plausible values results in a noticeable improvement of the point estimation accuracy, but only to a certain point.

Comparison of ability estimates based on their correlation with generating values.
The Bias, SE, and RMSE for the eight sets of ability estimates are provided with Table 2. For all eight methods, the absolute value of the mean Bias for estimation of the ability parameter is smaller than 0.005, a value small enough to be considered close to zero. Also, the mean Bias randomly fluctuates around zero with a different number of plausible values, suggesting that such MCMC/EAP estimates are essentially unbiased and their discrepancies with the true values are caused by sampling error. As shown in Table 2, the standard deviation of Bias, as well as the mean and standard deviation of the RMSE, tend to decrease slightly with the increase of number of plausible values
Bias, RMSE, and SE of Ability Estimates.
Note. RMSE = root mean square error; SE = standard error; MMLE = marginal maximum likelihood estimation; EAP = expected a posteriori; PV = plausible value.
To further compare the estimation quality between the eight methods, we conducted analysis of variance (ANOVA) using Bias, SE, and RMSE as dependent variables, respectively. The ANOVA results with Bias as the dependent variable indicated that there were no significant differences in estimation biases between the eight methods, F(7, 7,992) = 0.899, p = .506. The ANOVA results with the RMSE as the dependent variable indicated that there were significant differences, F(7, 7,992) = 13.111, p < .001, f = 0.105. According to Cohen (1992), such an f value indicates a small effect size. Subsequent Tukey post hoc tests revealed that (a) there were no significant differences between the RMSE of PV20, PV50, PV100, PV200, PV500, and MMLE/EAP, and (b) the RMSE of PV10 and PV20 was significantly lower than those of PV5. The ANOVA results with SE as the dependent variable indicated that there were significant differences, F(7, 7,992) = 5.926, p < .001, f = 0.071 (this effect size is practically negligible). The Tukey post hoc tests showed that there were no significant differences between the SE of all seven sets of MCMC/EAP estimates based on plausible values, whereas the SE of MMLE/EAP estimates were significantly greater than those of their MCMC/EAP counterparts.
Based on the above analyses, it can be stated that PV20 provides IRT ability estimates with Bias and RMSE comparable to those based on MMLE/EAP and significantly smaller SE. Therefore, we concluded that 20 plausible values are required to obtain point estimates as accurate as MMLE/EAP estimates.
An Illustration With Real Data
In this section, we illustrate with real data how MCMC/EAP point estimates of IRT abilities based on different numbers of plausible values compare with MMLE/EAP point estimates. The dataset used in the illustration was drawn from a test form of the verbal section of the General Aptitude Test (GAT-V), a high-stakes test used for college admission purpose in Saudi Arabia and other Middle-Eastern countries. GAT-V consists of 52 multiple-choice items distributed across four domains, namely, reading comprehension, sentence completion, verbal analogy, and synonymy. We randomly sampled 1,200 students from the examinees, and the subsequent analyses were based on a matrix of zeros and ones with a dimension of 1,200 by 52.
We fit the 2PLM with both the MLR and Bayes estimators in Mplus. The configuration for the Bayes estimator is the same as in the simulation section; that is, four parallel chains with a minimum of 5,000 iterations were specified and 500 plausible values were imputed. The iteration history in the Mplus output showed that the potential scale reduction factor (PSRF; Gelman & Rubin, 1992) values for all model parameters were smaller than 1.05 with less than 2,000 iterations, whereas with 5,000 iterations, the highest PSRF value was 1.022. We concluded that model convergence had been reached and proceeded to compare the IRT ability point estimates based on the eight methods.
As an illustration, Table 3 lists the IRT ability point estimates for the first 20 students with MMLE/EAP and MCMC/EAP based on different number of plausible values. As can be seen, except for PV5 and PV10, which have mean differences of 0.07 and 0.03 relative to MMLE/EAP, the other five sets of IRT ability point estimates based on plausible values are very similar to MMLE/EAP, with the mean difference being smaller than 0.01.
Ability Estimates (on the IRT Logit Scale) for the Real Data.
Note. IRT = item response theory; MMLE = marginal maximum likelihood estimation; EAP = expected a posteriori; PV = plausible value.
Figure 2 provides a visual presentation of how the correlation between the MCMC/EAP point estimates based on plausible values and the MMLE/EAP point estimates for all 1,200 students changes with the change of the number of plausible values. Even with only five plausible values, the correlation between PV5 and MMLE/EAP is 0.9892 and it increases with more plausible values, but the increase is negligibly small. The correlation between MMLE/EAP and PV20, the MCMC/EAP estimates based on 20 plausible values, is as high as 0.9969.

Comparison of point estimates of ability based on real data.
Conclusion
The Bayes estimator in Mplus implements the Gibbs sampler and allows researchers and practitioners to estimate latent variable models such as IRT models with MCMC methods with relatively little programming efforts. When the Bayes estimator is used, Mplus automatically performs model check via PPMC, a feature that is especially attractive to those who are not familiar enough with general Bayesian programming software such as WinBUGS or Stan to program such procedures themselves. In addition, MCMC methods may be the only computationally feasible estimation methods with either multidimensional models, which cause dimensionality problems for MMLE estimation, or a large number of indicators that make limited information methods such as WLSMV too slow.
It should be noted that for latent variables such as the individual ability in IRT context, Mplus estimates the posterior distribution of each ability parameter but does not provide MCMC/EAP point estimates. Instead, such estimates can be obtained by taking the average of a certain number of plausible values (specified by the user) drawn from the posterior distribution. However, it was unclear how many plausible values were needed to obtain accurate MCMC/EAP point estimates. In this note, we investigated the minimum number of plausible values required to obtain MCMC/EAP point estimates that are comparable to those based on MMLE/EAP, which is a common method for the estimation of IRT ability.
We found that MCMC/EAP estimates based on more plausible values had higher correlation coefficients with the generating values. This makes intuitive sense as more plausible values result in a larger sample size, and thus, higher precision of the sample mean as a population mean estimator. As is dictated by the central limit theorem, increasing the sample size results in diminishing returns in the precision of the sample mean estimator (e.g., Pyrczak, 2003). Consequently, when the sample size reaches a certain level, its increase only produces negligible returns in precision, which is exactly what was observed in the simulation study. Specifically, increasing the number of plausible values from 5 to 10 resulted in the biggest increase (greater than 0.05) in the correlation between MCMC/EAP point estimates and the true values, which only changed at the third decimal place when the number of plausible values was greater than 50.
We also found that estimation biases based on different number of plausible values are not significantly different, which is hardly surprising. Indeed, as noted earlier, drawing plausible values from the posterior distribution is analogous to drawing a sample from the population. On the other side, it is known that the sample mean is an unbiased estimator of the population mean regardless of the population distribution. Therefore, the shape of the posterior distribution does not make a difference regarding the generalizability of the current finding; nor does the test length, as different test lengths only result in posterior distributions with varying standard deviations for individual latent ability, and the sample mean remains an unbiased estimator of the population mean. Although the mean of Bias does not exhibit a discernable pattern with an increase of the number of plausible values, the standard deviation of Bias was found to decrease with more plausible values. This observation also makes sense as more plausible values (i.e., a larger sample size) always result in smaller sampling errors, which in turn cause the mean and standard deviation of RMSE to drop.
Regarding the research question of how many plausible values can produce MCMC/EAP estimates comparable to MMLE/EAP estimates, the simulation study results indicated that MCMC/EAP point estimates based on five plausible values produce comparable estimation Bias, significantly smaller SE, and significantly greater RMSE. At least 20 plausible values are needed to reduce the RMSE to a level comparable to MMLE/EAP. As also shown in the simulation study, MCMC/EAP point estimates based on at least 20 plausible values produce comparable estimation Bias and RMSE and significantly smaller SE than their MMLE/EAP counterparts. Therefore, if researchers or practitioners are interested in obtaining accurate MCMC/EAP point estimates of IRT ability parameters with the Bayes estimator in Mplus, but do not want to draw a large number of plausible values due to practical constraints, they can use twenty plausible values.
Footnotes
Appendix A
Appendix B
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
