Abstract
Item response theory (IRT) models provide an appropriate alternative to the classical ordinal confirmatory factor analysis (CFA) during the development of patient-reported outcome measures (PROMs). Current literature has identified the assessment of IRT model fit as both challenging and underdeveloped. This study evaluates the performance of Ordinal Bayesian Instrument Development (OBID), a Bayesian IRT model with a probit link function approach, through applications in two breast cancer-related instrument development studies. The primary focus is to investigate an appropriate method for comparing Bayesian IRT models in PROMs development. An exact Bayesian leave-one-out cross-validation (LOO-CV) approach is implemented to assess prior selection for the item discrimination parameter in the IRT model and subject content experts’ bias (in a statistical sense and not to be confused with psychometric bias as in differential item functioning) toward the estimation of item-to-domain correlations. Results support the utilization of subject content experts’ information in establishing evidence for construct validity when sample size is small. However, the incorporation of subject content experts’ information in the OBID approach can be sensitive to the level of expertise of the recruited experts. More stringent efforts need to be invested in the appropriate selection of subject experts to efficiently use the OBID approach and reduce potential bias during PROMs development.
Researchers often build a few candidate models and seek to select the most useful one for a given problem. The process of model comparison and selection requires rigorous model checking or assessment that is an integral part of any statistical analysis. In the development of psychometric instruments, apart from reliability, establishing evidence of validity is essential to ensuring an instrument’s psychometric integrity. Developing an evidence-based argument that scores are accurate for their intended use requires acquiring data specific to content, construct, and predictive aspects (Nunnally & Bernstein, 1994). Historically, validity has been presented as three distinct but related components—content, criterion, and construct. Today validity is viewed as a unitary concept (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014) where propositions for test score interpretation and use are supported by evidence unique to the measurement goal. Although developing a comprehensive picture of score validity often includes content and predictive components, construct validity receives the most attention from a statistical modeling perspective. This is because any score validity argument is impossible to make without evidence that the construct is relevant to the proposed interpretation and use of the scores.
Two approaches can be implemented to establish evidence of construct validity. When the participant sample size is adequately large, classical (i.e., frequentist) confirmatory factor analysis (CFA) is fairly reliable and easy to implement via statistical software such as Mplus or the free R package lavaan (Rosseel, 2012). A Bayesian approach often becomes advantageous when classical CFA is challenged by a small sample size (Gajewski, Price, Coffland, Boyle, & Bott, 2013; Garrard, Price, Bott, & Gajewski, 2015; Jiang et al., 2014), which may result in model convergence issues and unreliable parameter estimates.
An emerging topic in recent literature focuses on the development of patient-reported outcome measures (PROMs) or patient-reported outcome (PRO) instruments that often are designed as questionnaires with ordinal response options. PROMs have gained increasing public awareness in promoting patient-centered care, an important driving force behind the current U.S. health care. For instance, the pharmaceutical industry is required by the U.S. Department of Health and Human Services (DHHS) Food and Drug Administration (FDA) to submit evidence collected through PRO instruments in support of labeling claims. Detailed industry guidelines are provided by the FDA to assist pharmaceutical companies regarding the psychometric evaluation of any new or adapted PRO instruments (FDA, 2009).
Ordinal or binary (a special type of ordinal data) responses often are collected from PROMs and require a different modeling approach from the classical CFA (which relies on, e.g., a normality assumption) for assessing an instrument’s construct validity. Literature has shown that the categorical version of the classical CFA model with ordinal data is equivalent to a two-parameter item response theory (IRT) model with a probit link function, when all items on an instrument are ordinal (Johnson & Albert, 1999; Quinn, 2004). IRT parameter and person ability estimates are invariant (i.e., person ability estimates are not test dependent and item indices are not group dependent; Hambleton, Swaminathan, & Rogers, 1991; Price, 2016). Importantly, the invariance property provides a way for users of Ordinal Bayesian Instrument Development (OBID) to directly use or compare item and ability information acquired in one study with that acquired in another. Assessing IRT model fit has been considered a challenging and underdeveloped area (Sinharay & Johnson, 2003; Sinharay, Johnson, & Stern, 2006). Recent literature has paid increasing attention to limited-information goodness-of-fit testing for IRT model fit (Cai, Maydeu-Olivares, Coffman, & Thissen, 2006; Joe & Maydeu-Olivares, 2010; Maydeu-Olivares & Joe, 2005). This article extends the literature by focusing on an alternative approach, Bayesian IRT model comparison.
The fit of Bayesian models can be evaluated in several ways. One popular method is posterior predictive model checking (PPMC; Rubin, 1984), which is closely related to classical goodness-of-fit tests (Gelman, Meng, & Stern, 1996; Sinharay & Johnson, 2003). Other methods include graphical posterior predictive checks, assessing the posterior predictive p value, and/or the utilization of Bayes factors (Gelman, Hwang, & Vehtari, 2014). However, as pointed out by Gelman et al. (2014), when the objective is to compare models, the predictive model accuracy needs to be estimated. Cross-validation (CV) and information criteria measures are commonly used for Bayesian model comparison (Gelman et al., 2014; Vehtari & Lampinen, 2002; Vehtari & Ojanen, 2012). Information criteria are typically defined as deviance measures and represented by some variation of the log likelihood or log predictive density. Stone (1977) showed the asymptotic equivalence of the two approaches, such that information criteria can be viewed as approximations to various types of CV (Gelman et al., 2014).
The deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002; Spiegelhalter, Best, Carlin, & van der Linde, 2014) remains a popular choice in the Bayesian literature despite criticisms and can be computed easily via the software WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000). DIC can be viewed as analogous to the Akaike information criterion (AIC; Akaike, 1973): both are based on a point estimate, with DIC conditioning on the posterior mean and AIC conditioning on the maximum likelihood estimate. Watanabe (2010) recently proposed a more fully Bayesian approach, known as WAIC (widely applicable or Watanabe–Akaike information criterion). WAIC is considered more appealing than AIC and DIC as it not only conditions on the entire posterior distribution but also works well with hierarchical and mixture structure models (Gelman et al., 2014). Among other CV methods for evaluating out-of-sample prediction performance, Bayesian leave-one-out cross-validation (LOO-CV; Vehtari & Lampinen, 2002) has been shown to be asymptotically equivalent to WAIC (Watanabe, 2010) and more applicable to problems with small sample size (n).
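As an illustration of how WAIC uses the entire posterior rather than a point estimate, the following is a minimal sketch on the deviance scale, following the general form described by Gelman et al. (2014). It assumes that a matrix of pointwise log-likelihood values (posterior draws in rows, observations in columns) is already available from an MCMC run; the function name `waic` and the input layout are illustrative assumptions, not part of the OBID software.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale from pointwise log-likelihoods.

    log_lik: array of shape (S, N), where S is the number of posterior
    draws and N the number of observations (hypothetical input layout).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior mean of the
    # likelihood, computed with a max shift for numerical stability.
    shift = log_lik.max(axis=0)
    lppd = shift + np.log(np.exp(log_lik - shift).sum(axis=0)) - np.log(n_draws)
    # Effective number of parameters: posterior variance of the
    # log-likelihood, one term per observation.
    p_waic = log_lik.var(axis=0, ddof=1)
    return -2.0 * np.sum(lppd - p_waic)
```

Smaller values indicate better estimated predictive accuracy, as with the other deviance-type criteria discussed above.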
Although both WAIC and Bayesian LOO-CV exhibit appealing properties, they are applied less in practice as Bayesian CV approaches can become very computationally intensive due to Markov chain Monte Carlo (MCMC) simulation for all validation units. The computation burden might be tolerable for studies with smaller sample sizes. For large sample sizes, several approximation approaches have been proposed in the literature for Bayesian LOO-CV, such as importance sampling (IS; Gelfand, Dey, & Chang, 1992), expectation propagation and Laplace approximation (Vehtari, Tolvanen, Mononen, & Winther, 2014), Bayesian K-fold CV (Vehtari, Gelman, & Gabry, 2015), and a more recent Pareto smoothed importance sampling (PSIS) approach for regularizing importance weights (Vehtari & Gelman, 2015; Vehtari et al., 2015).
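To make the importance sampling idea concrete, the sketch below approximates the leave-one-out predictive density for each observation from a single full-data MCMC run, using raw importance ratios proportional to the inverse likelihood. This is the basic IS estimator only; it does not include the Pareto smoothing step, and raw weights can be unstable when a few draws dominate. The function name and the input layout are assumptions made for illustration.

```python
import numpy as np

def is_loo(log_lik):
    """Raw importance-sampling LOO from full-data posterior draws.

    log_lik: array of shape (S, N) of pointwise log-likelihoods
    (hypothetical layout: draws in rows, observations in columns).
    Returns (criterion on the -2 * log scale, per-observation log densities).
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws = log_lik.shape[0]
    # The IS estimate of p(y_i | y_-i) is the harmonic mean of the
    # likelihood over draws: 1 / mean_s(1 / p(y_i | theta_s)).
    neg = -log_lik
    shift = neg.max(axis=0)
    log_mean_inverse = shift + np.log(np.exp(neg - shift).sum(axis=0)) - np.log(n_draws)
    log_loo = -log_mean_inverse
    return -2.0 * log_loo.sum(), log_loo
```

An exact LOO-CV, by contrast, refits the model once per validation unit, which is why it is feasible mainly for small samples.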
Within the context of latent variable modeling, excellent research recently has been conducted that approximates Bayesian LOO-CV for Gaussian latent variable models (Li, Qiu, Zhang, & Feng, 2014; Vehtari et al., 2014). However, common data collected from PROMs are ordinal in nature, which calls for an extension of the Gaussian model method to ordinal models (i.e., IRT models). Yet, there is a lack of Bayesian LOO-CV approximation with ordinal models in the current literature (A. Vehtari, personal communication, July 20, 2015). In addition, Bayesian model comparison should be evaluated from the perspective of prior selection for the IRT model parameters. The choice of prior distribution is relevant to posterior parameter inferences and model predictions when data are sparse (Gelman et al., 2014).
When developing PROMs for target populations with small sample sizes (e.g., in cases of rare disease), a novel method called OBID recently has been proposed to overcome the small sample size challenge, appropriately model participants’ ordinal responses, and expedite the development of PROMs (Garrard et al., 2015). OBID is developed within a Bayesian two-parameter IRT with a probit link modeling framework. Prior distributions derived from content experts’ data or prior studies (for establishing the instrument’s content validity) are updated with participants’ data to obtain a posterior distribution for IRT model parameters. Thus, OBID may alleviate the need for large sample sizes, especially for studies with target populations that are small to begin with. Reducing the number of participants will expedite the overall instrument development process and alleviate patients’ burden.
The current work is motivated by the need for an appropriate method for comparing Bayesian IRT models in PROMs development, with the goal of expediting the development process when sample sizes become a concern (e.g., small and/or non-normally distributed data). The OBID approach is evaluated through real data applications, and the specific aims include (a) comparing the OBID models with both informative and flat priors using exact Bayesian LOO-CV, and (b) assessing subject content experts’ bias through an exact CV information criterion (CVIC) measure. All real data used in the current study were collected for prior research purposes and provided to the authors in a de-identified fashion. Thus, this study was determined to be non-human subject research by a Midwestern Academic Medical Center Institutional Review Board (IRB).
Method
The main objective of this article is to evaluate further the OBID approach via Bayesian model comparison using real data applications. First, the OBID participant model and how an exact Bayesian LOO-CV can be applied to scenarios used in the current study will be briefly reviewed.
OBID Participant Model
OBID is an ordinal CFA-based approach under the Bayesian probabilistic framework. Continuing the notations from Garrard et al. (2015), a two-parameter IRT model with the probit link is expressed by
where
Under the local independence or conditional item independence assumption (Price, 2016), the likelihood for the underlying continuous latent variable
In the unidimensional (i.e., single-factor) OBID approach, the prior distribution of the item discrimination parameter
Bayesian LOO-CV
Bayesian CV is a common method used to evaluate out-of-sample prediction performance and compare models. The idea behind CV is quite intuitive, and our description of the method intentionally is kept consistent with the work by Gelman et al. (2014) and Li et al. (2014). First, the full data set repeatedly can be partitioned into holdout data
Following the work by Li et al. (2014), the CV posterior predictive evaluation is defined as the expectation of the evaluation function
Suppose the evaluation function is taken as the value of the predictive density function at the actual holdout observation
The participant model (Equations 1 and 2) in the current study is a single-factor two-parameter IRT model, where
Finally, the CVIC (Li et al., 2014) is computed by −2 times the sum of the log of the CV posterior predictive density, over all validation units. The model with the smaller CVIC value is preferred.
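The CVIC computation described above can be sketched as a simple loop over validation units. The sketch assumes two user-supplied callables that are hypothetical placeholders: `fit_posterior`, which runs MCMC on the training data and returns posterior draws, and `holdout_density`, which evaluates the predictive density of a holdout unit at one draw. Neither corresponds to a function in MCMCpack or any other package.

```python
import numpy as np

def exact_loo_cvic(units, fit_posterior, holdout_density):
    """Exact leave-one-out CVIC.

    units: list of validation units (e.g., one entry per participant).
    fit_posterior(training): hypothetical callable returning posterior
        draws obtained by refitting the model without the holdout unit.
    holdout_density(unit, draw): hypothetical callable returning the
        predictive density of the holdout unit given one posterior draw.
    """
    total_log_density = 0.0
    for i in range(len(units)):
        training = units[:i] + units[i + 1:]
        draws = fit_posterior(training)  # one full refit per unit
        # CV posterior predictive density: average over posterior draws.
        density = np.mean([holdout_density(units[i], d) for d in draws])
        total_log_density += np.log(density)
    # -2 times the summed log density; smaller values are preferred.
    return -2.0 * total_log_density
```

Because the model is refit once per unit, the cost grows linearly with the number of validation units, which is tolerable for the small samples considered here.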
To demonstrate the computation of the CV posterior predictive density
Holdout data
Real Data Applications
In this section, data collected from two breast cancer-related instrument development studies will be described and analyzed using the OBID approach. An exact Bayesian LOO-CV is applied to compare the choice of prior for the item discrimination parameter
Patient Assessment of Mammography Services (PAMS)–Short Form Satisfaction Survey
Routine utilization of mammography is the most widely recommended method for breast cancer screening and offers patients a chance of early detection that is critical for overall survival. However, potential factors, such as prior experiences and satisfaction with mammography, influence patients’ decisions to use mammography on a regular basis. The PAMS satisfaction survey was developed due to the lack of mammography-specific satisfaction assessments (Engelman et al., 2010; Engelman et al., 2016). The full PAMS survey consists of four factors with 20 items, and the PAMS-Short Form is a single factor with seven items. The seven items are designed to measure overall satisfaction. Other items on the full survey are added only when one needs to measure a specific domain (e.g., comfort). Items on the full survey are designed with scales ranging from two to six response categories. The seven short-form items can be rated on a 5-point Likert-type scale (i.e., 1 = poor, 2 = fair, 3 = good, 4 = very good, and 5 = excellent).
PAMS experts and participants
Six subject experts were consulted and instructed to rate the relevancy of each item (ranging from 1 = “not relevant” to 4 = “highly relevant”) to the domain of interest. Recruited experts consist of individuals who have published or worked in some type of breast cancer research, including several physicians (Ndikum-Moffor et al., 2016). Participant data were collected from female patients to establish construct validity of the PAMS-Short Form instrument. Complete data (i.e., participants responded to all items) are used for the current study. Patients represented four ethnicity backgrounds: Hispanic (n = 36), Non-Hispanic White (n = 2,768), Black (n = 34), and American Indian (n = 287).
PAMS CV—Prior selection
For this study, analyses focused on the Hispanic, Black, and American Indian populations. First, the distribution of response options (potential range = 1 to 5) in the raw participant data was examined. Very few respondents selected the poor to good response options; thus, a decision was made to collapse some response categories. Potential loss of information due to scale reduction is acknowledged; however, this decision should not affect the general trend in the data. For the Hispanic and Black data, the 5-point scale is reduced to a 3-point scale by collapsing the poor, fair, and good response options; for the American Indian data, the poor and fair response options are collapsed, turning the scale into a 4-point scale.
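The category collapsing described above can be sketched as follows. The function name and the relabeling of the collapsed scale to start at 1 are illustrative assumptions; the source does not state how the collapsed categories were coded.

```python
import numpy as np

def collapse_low_categories(responses, n_merge):
    """Merge the lowest n_merge categories of a 1-to-5 scale into one.

    The merged category is relabeled 1 and higher categories are shifted
    down so that the result remains consecutive integers (an assumption).
    """
    r = np.asarray(responses)
    return np.maximum(r - (n_merge - 1), 1)

# Hispanic and Black data: poor, fair, and good merged (3-point scale).
three_point = collapse_low_categories([1, 2, 3, 4, 5], n_merge=3)
# American Indian data: poor and fair merged (4-point scale).
four_point = collapse_low_categories([1, 2, 3, 4, 5], n_merge=2)
```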
The OBID approach promotes the incorporation of content experts’ information (when appropriate) for the item discrimination parameter
PAMS CV—Expert bias
It is beneficial to assess experts’ bias toward the item-to-domain correlation (or item relevancy), especially for smaller sample sizes. Figure 1 displays the CVIC value for each selected number of experts K. CVIC is calculated by both randomly selecting one to five experts from the pool of six experts and artificially inflating the prior sample size to represent information from 12 experts. K = 0 denotes the flat prior, which is added to the plots for comparison purposes. As the number of experts increases, the majority of CVIC values under the unequally spaced transformation are smaller than those under the equally spaced transformation. The selected experts appear to be less biased for both Hispanic and Black populations. However, the same group of experts is slightly more biased for the American Indian population. The CVIC value sharply increases after five experts for the unequally spaced transformation, whereas the CVIC value continues to decrease for the equally spaced transformation. In addition, all equally spaced transformation plots indicate that six experts are adequate, which is consistent with the suggestion in the current literature (Polit & Beck, 2006).

Figure 1. PAMS expert bias comparison under both equally spaced (left panel) and unequally spaced (right panel) transformations.
Nutrition Literacy Assessment Instrument for Breast Cancer (NLit-BCa) Study
Motivated by the lack of a nutrition literacy instrument for female breast cancer survivors, pilot work conducted by Gibbs et al. (2015) initiated the development of the NLit-BCa, an adapted version of the NLit (Gibbs & Chapman-Novakofski, 2013). The NLit-BCa consists of six individual domains with 75 items. A larger validity study is currently in progress to evaluate further the NLit-BCa instrument (H. D. Gibbs, personal communication, August 25, 2015). Considering item revisions and/or deletions based on content experts’ review, four domains with 39 items (i.e., 10 macronutrients [Macro] items, nine household food measurement [HFM] items, 10 food label and numeracy [FLN] items, and 10 consumer skills [CS] items) are deemed appropriate for analysis in this study. Items are designed with either three or four response options, and all participant responses are further classified as 0 = “incorrect” and 1 = “correct” based on an answer key provided by the instrument developers.
NLit-BCa experts and participants
Four nutrition experts were consulted for the larger validation study and rated the relevancy for each of the 75 items. Recruited experts consist of individuals who have published expertise in cancer nutrition. Because the larger validation study is ongoing, participant data for this article will come from the pilot work. Data originally were collected from two groups of participants: weight loss intervention and non-intervention. Due to data sparsity concerns, complete data from 71 patients are used after combining both groups (n = 25 and 46 for the intervention and the non-intervention groups, respectively).
NLit-BCa CV—Prior selection
A decision was made prior to analysis to exclude both Item 3 from the macronutrients domain (Macro03) and Item 2 from the FLN domain (FLN02) to avoid potential issues for LOO-CV analyses. Only one respondent answered Macro03 incorrectly, and everyone correctly answered FLN02. Thus, the total number of items was 37. The choice of flat prior versus an informative prior under both transformations was compared using exact Bayesian LOO-CV. CVIC values for the flat prior, equally spaced transformation prior, and the unequally spaced transformation prior are 504.101, 506.641, and 507.355 for Macro; 927.941, 947.594, and 941.578 for HFM; 633.835, 660.888, and 664.568 for FLN; and 716.069, 720.986, and 719.551 for CS, respectively. Across all four domains, the flat prior produces smaller CVIC values than both types of informative prior; however, the differences in CVIC values are much smaller for the Macro and CS domains (less than 5 points) than for the HFM and FLN domains.
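The comparison can be reproduced from the reported values with a short sketch. The CVIC values are taken from the text above; the variable names and the summary of each domain as the gap between the flat prior and the better informative prior are illustrative choices, not part of the original analysis.

```python
# Reported CVIC values by domain and prior (from the text above).
cvic_values = {
    "Macro": {"flat": 504.101, "equal": 506.641, "unequal": 507.355},
    "HFM": {"flat": 927.941, "equal": 947.594, "unequal": 941.578},
    "FLN": {"flat": 633.835, "equal": 660.888, "unequal": 664.568},
    "CS": {"flat": 716.069, "equal": 720.986, "unequal": 719.551},
}

# The prior with the smallest CVIC is preferred within each domain.
preferred = {d: min(v, key=v.get) for d, v in cvic_values.items()}

# Gap between the better informative prior and the flat prior.
gap = {
    d: round(min(v["equal"], v["unequal"]) - v["flat"], 3)
    for d, v in cvic_values.items()
}

for domain in cvic_values:
    print(domain, preferred[domain], gap[domain])
```

The flat prior is preferred in every domain, and the gaps make the relative size of the differences across domains easy to compare.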
NLit-BCa CV—Expert bias
Results from the prior selection analysis seem to suggest that content experts are more biased toward item-to-domain correlations for all four domains. Figure 2 shows the CVIC value for each selected number of experts K. Similar to the PAMS study, the CVIC is calculated by both randomly selecting one to three experts from the pool of four experts and artificially inflating the prior sample size to represent information from eight experts. The use of a flat prior again is indicated by K = 0. For the Macro domain, the CVIC value continues to decrease under the equally spaced transformation prior after four experts, whereas the opposite is observed with the unequally spaced transformation prior. No large differences in CVIC values are observed among two to four experts for both HFM and FLN domains, under both transformations. For the CS domain, apart from the flat prior model, two experts produce the smallest CVIC value under the equally spaced transformation, whereas the smallest CVIC value occurs with three experts under the unequally spaced transformation. Overall, recruited experts seem to be more biased toward relevancy ratings on the items, across all four domains.

Figure 2. NLit-BCa expert bias comparison under both equally spaced (left panel) and unequally spaced (right panel) transformations.
Discussion
The current study evaluates the performance of OBID through applications in two breast cancer-related instrument development studies. The primary focus is to investigate an exact Bayesian LOO-CV approach for comparing Bayesian IRT models in PROMs development. Six subject experts are consulted in the PAMS-Short Form study for four different patient populations. Among the three populations investigated in this study, using an informative prior (i.e., incorporating experts’ information) has been shown to be superior to using a flat prior. One interesting observation arises from the original focus of the six content experts, as experts originally were recruited with the purpose of validating the PAMS instrument for American Indian women. Results from the PAMS study indicate that experts are less biased for both Hispanic and Black populations, which supports the appropriate utilization of experts’ information to form a “general prior” as suggested by Garrard et al. (2015). Experts appear to be slightly more biased for the American Indian population despite their original focus. Although findings suggest that five experts would be sufficient, the use of six experts does not pose any substantial concerns for the purpose of instrument validation. Overall results indicate that incorporating information from the six selected subject experts is appropriate for the construct validity analysis in the PAMS study.
Findings from the NLit-BCa study present more complexity as the current study suggests the use of a flat prior as opposed to an informative prior. Among four domains examined, only the FLN domain CVIC results slightly support incorporating experts’ information. The four selected experts appear to hold more biased opinions regarding the item-to-domain correlations for items in all domains. Although four experts were recruited, results have shown that even two to three experts would be sufficient. One thing worth noting is that the design of the NLit-BCa study differs from that of the PAMS study. The PAMS items are more subjective (i.e., eliciting satisfaction), whereas the NLit-BCa items have a distinct correct answer. Nonetheless, despite the seemingly “opposite” results from the NLit-BCa study, the importance of appropriate prior selection and expert bias evaluation has been demonstrated for the OBID approach.
One limitation of the current study is associated with the selection of content experts, which remains an important yet challenging aspect in the development of psychometric instruments (Grant & Davis, 1997; Lynn, 1986). Apart from unidimensional instruments, subject experts often are asked to rate items from multiple domains. It usually is assumed that experts have expertise in all areas of interest. The current study assumes that content validity has been thoroughly assessed for both instruments. Thus, the focus is entirely on model selection during the construct validity phase of instrument development. Yet, based on current findings, subject experts’ bias may hinder the efficient utilization of experts’ information in the recently proposed OBID approach. Another limitation comes from the primary focus on using an exact Bayesian LOO-CV approach to compare different IRT models. As previously mentioned, several methods can be used to help assess and compare Bayesian models. The OBID approach certainly can be evaluated further via other established approaches in the literature. The third limitation can be viewed as a constraint associated with using the R package MCMCpack, as normal priors are required for IRT model parameters. Future work can consider other types of prior distributions.
An implication from the current study is the selection of an appropriate tuning parameter to ensure 20% to 50% acceptance rates during the MCMC procedure. Simulation results from Garrard et al. (2015) showed an inverse relationship between the tuning parameter and the sample size. Although not discussed in the main text of the article, based on sample size information from 11 real data sets and four simulation data sets from Garrard et al., a power function is fitted for the tuning parameter t as a function of sample size n, that is,
Additional future work may involve a more thorough evaluation of the equally spaced and unequally spaced transformations in other real applications and an approximation to the Bayesian LOO-CV for ordinal latent variable models. In addition, more skewed participant data structures and other prior distributions for the OBID subject experts’ model need to be evaluated through simulation. The simulation study by Garrard et al. (2015) considers a more balanced participant data structure and assumes that experts’ item ratings follow a normal distribution. For instruments with more subjective response scales (e.g., satisfaction), participants tend to select more positive response options. Experts also potentially can disagree with each other regarding the relevancy of proposed items.
Footnotes
Acknowledgements
The authors thank both Dr. Kimberly Engelman for use of the Patient Assessment of Mammography Services (PAMS) data from Grants CCE-103763 and 5P20MD004805, and Dr. Heather Gibbs for use of the Nutrition Literacy Assessment Instrument for Breast Cancer (NLit-BCa) data from Grants P30CA168524 and IRG-09-062-04.
Authors’ Note
This article reflects the views of the authors and should not be construed to represent the U.S. Food and Drug Administration’s (FDA) views or policies. Lili Garrard completed this work as a PhD student in the Department of Biostatistics at the University of Kansas Medical Center. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the National Institute of Nursing Research of the National Institutes of Health under Award R03NR013236.
