Abstract
The Presence-Severity (P-S) format refers to a compound item structure in which a question is first asked to check the presence of the particular event in question. If the respondent provides an affirmative answer, a follow-up is administered, often about the frequency, density, severity, or impact of the event. Despite the popularity of the P-S format in areas such as patient reported outcomes, little attention has been paid to its psychometric analysis, which is necessary for making key design decisions about a scale. In this study, an item response theory–based framework is proposed to perform item analysis involving P-S data, which improves psychometric analysis for (a) scoring response categories, (b) calibrating items, (c) calculating reliability or internal consistency, and (d) selecting and revising items. A real-data example involving the Memorial Symptom Assessment Scale–Short Form, which is used as a symptom distress measure for terminally ill cancer patients, demonstrates how the new framework can be used to address various psychometric issues in practice.
The Presence-Severity (P-S) format—the simplest of a larger family of filter question designs broadly seen in clinical instruments, census data collection, and survey research—uses a compound item to assess an event, such as occurrence of a symptom (e.g., Chang, Hwang, Feuerman, Kasimis, & Thaler, 2000; Green, 1996). P-S items come in two parts: First, a filter, the Presence part, is used to check whether the respondent experiences the particular event in question. If the answer of the Presence part is yes, then a follow-up question, the Severity part, is asked, often about the frequency, density, severity, or impact of the event. For example,
Presence: In the last week, did you have dry mouth? (yes/no).
Severity: If yes, how much did it bother you? (not at all/a little/somewhat/a lot).
P-S is particularly common in clinician-administered and patient-reported outcome (PRO) contexts because it adapts existing verbal interview protocols to structured questionnaires. It is used to reduce respondent burden and frustration by not requiring respondents to answer inapplicable questions. Respondents such as young children, nonnative speakers, or those with working memory impairment may benefit from the initial probing provided by the Presence part. In addition, P-S explicitly acknowledges the reality of what Reise and Waller (2009) term a quasi trait, which is
A unipolar construct in which one end of the scale represents severity and the other pole represents its absence (depressed versus not depressed). This in contrast to a bipolar construct, where both ends of the scale represent meaningful variation (depression versus happiness). (p. 31)
Although the P-S format is widely used, it has not been subject to much careful psychometric analysis to help users decide key questions, such as
What is a meaningful internal consistency reliability for the scale (or information curve)?
Are there items that do not behave appropriately given reasonable assumptions about the response process, in particular that lack of Presence should imply no Severity?
Is the format desirable or would it make sense to revise to a different format?
Psychometric item analysis could help answer all these questions, but P-S data have largely been subject to ad hoc analysis. For instance, one common strategy is to combine the two parts by treating a negative Presence response as the bottom category of the Severity part. For the “dry mouth” item, this yields a composite item with the categories (did not have/had but not at all bothered/bothered a little/bothered somewhat/bothered a lot). Then, assuming these labels are coded numerically, a classical test theory reliability coefficient such as Cronbach’s alpha could be computed for the resulting total score or the items could be subject to factor analysis (McDonald, 1999). Two-part regression models can also be used to predict outcomes of this type (for cross-sectional data, see Manning et al., 1981, and Duan, Manning, Morris, & Newhouse, 1983; for longitudinal data, see Olsen & Schafer, 2001). Another strategy that might be used is to analyze only the Presence component using a binary item response theory (IRT) model and ignore the Severity part altogether. This is wasteful of data and respondent time. An IRT analysis that does not ignore the Severity component might make use of ordinal IRT models such as the graded response model or generalized partial credit model (GPCM) on the combined item. This strategy preserves information but leaves ordinality as an untested assumption. If ordinality is violated, the IRT model may provide misleading item statistics and/or measurement of the respondents on the latent trait. Ordinality should not simply be assumed.
Bock’s (1972) nominal response model (NRM) has been widely used to test ordinality in an IRT context. The NRM assumes minimal structure among response categories and provides great flexibility to derive nested ordinal or partially ordinal models by placing constraints on the parameters (Thissen & Steinberg, 1986). Nesting makes it possible to use the likelihood ratio test and information criteria to examine parameter restrictions implied by assuming ordinality.
The NRM has been used in clinical or survey contexts before to assess the relationship among response categories or consider groups of similar responses. Preston, Reise, Cai, and Hays (2011) use the NRM on data from the Patient-Reported Outcomes Measurement Information System (PROMIS) Emotional Distress Item Bank to verify the ordinality of response categories. Similarly, Anderson, Verkuilen, and Peyton (2010) use the NRM to determine how “Don’t Know” responses fit in a knowledge scale given in a national survey as well as to assess the effect of changes to the prompt. Finally, Steinberg and Thissen (1996) use the NRM to consider grouped items in an assessment of violent behavior. Thissen, Cai, and Bock (2010) provide an up-to-date survey of the current literature on the NRM and its special cases, such as the GPCM.
The NRM was originally proposed to analyze simple multiple choice items, but has been more broadly used to analyze thematically grouped bundles of items to a common stimulus, known as a testlet (Thissen & Steinberg, 2010). The P-S format bundles items by common symptoms, but, unlike a standard testlet, where it is assumed that all items in the testlet are administered, the Severity part is not administered if the Presence part is negative. This creates systematically missing responses. Such missingness induces conditional dependence between the Presence and Severity parts on a given symptom. The authors refer to this as conditional dependence by design throughout the article.
In this article, a framework is presented that maps the P-S format to a compound item structure represented by a stratified contingency table with fused cells, which is analyzed using the NRM. This framework allows users to obtain IRT-based item analysis and scoring for scales of this type using commercially available software. It also offers means to (a) investigate the necessity of the P-S format, especially in comparison of competing designs, and (b) examine the separation of the Presence and the Severity parts of the structure. When the filter or the follow-up question involves ordinal response categories, the framework also enables decisions such as whether the combined P-S items can be meaningfully thought of as ordinal. If so, the commonly used practice of appending the Presence part to the Severity part would be justified. The authors illustrate their approach using data from the Memorial Symptom Assessment Scale–Short Form (MSAS-SF), a widely used P-S format instrument to assess symptom distress in hospital patients. Finally, they offer some concluding remarks about this approach.
A Framework for P-S Items
The authors consider the P-S format to show how it is a special case of a stratified incomplete two-way contingency table, which can in turn be viewed as a compound item. They start with a baseline scenario in which all items are administered, and show how the P-S design creates a particularly strong kind of missingness. This missingness induces conditional dependence by design because it forces certain cells in the partial tables to be unobserved. For simplicity, they cast their argument in the form of two binary items at a given θ level. However, the principles generalize to cases in which the Presence or Severity part is polytomous with K categories (possibly varying by item stem).
Consider a pair of binary items P and S, denoting Presence and Severity, respectively. For two items, there are a total of four response patterns possible: (a) (0,0), (b) (1,0), (c) (0,1), and (d) (1,1), shown in the four 2 × 2 tables contained in the left of Table 1. Each of these response patterns has probability πij. Conditional on θ, an aggregated table is obtained as on the right of Table 1 by combining over all possible examinees at a particular level of θ. If local independence were to hold, all cells in the aggregated table should be independently and fully observed (i.e., the top right table in Table 1). However, when P is a filter for S, the table at a given θ is incompletely observed because only the margin for P = 0 is observed. That is to say that the two cells shown in the middle table on the right-hand column are fused because response patterns (a) and (b) are empirically indistinguishable. A more direct but less informative route to seeing the conditional dependence by design is to note that the sample space of the observed table is not rectangular, and hence the two random variables cannot be independent conditional on θ.
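The fusing of cells can be written out explicitly. As a sketch, let π_ps(θ) denote the probability of the pair (P = p, S = s) at a given θ if both parts were always administered. The index order (Presence first, Severity second) is an assumption adopted here for illustration. Under the filter design, only three outcomes are observable:

```latex
\begin{aligned}
\Pr(\text{not present} \mid \theta) &= \pi_{00}(\theta) + \pi_{01}(\theta), \\
\Pr(P = 1,\, S = 0 \mid \theta) &= \pi_{10}(\theta), \\
\Pr(P = 1,\, S = 1 \mid \theta) &= \pi_{11}(\theta).
\end{aligned}
```

The two cells with P = 0 enter only through their sum, so their separate probabilities cannot be recovered from the data. Because the observable outcomes do not form a product set of P values and S values, the observed pair cannot be independent given θ.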
Table 1. Viewing a Matched Pair of Presence and Severity Parts in a Stratified Two-Way Contingency Table Approach (Conditional on θ)
Table 2(a) shows how responses to a pair of P and S parts can be recoded as a single compound item, with a mapping from an incompletely observed 2 × 2 table to a 1 × 3 table, when P and S are binary. With such a one-way mapping, treating the item ordinally can be justified if the category usage fits that which would be expected of an ordinal item. Table 2(b) demonstrates an example where S involves more than two categories. Especially when S is polytomous with unordered categories, the expected ordering of the recoded responses might not be straightforward. For example, in the right-hand table in Table 2(b), P = 0 may be considered as a lower category of the scale, but (P = 1, S = 0), (P = 1, S = 1), and (P = 1, S = 2) may reflect the unordered relationship among different values of S. A more complex situation arises when P is polytomous because the administration of S may be granted by one (e.g., Table 2(c)) or more than one (e.g., Table 2(d)) possible responses to P. Clearly, the situation could be further complicated when P and S have more than two categories (e.g., Table 2(d)).
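The one-way mapping for a binary Presence part can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `recode_ps` and the coding conventions (Presence as 0/1, Severity as 0, 1, 2, ... when asked, `None` for an unanswered part) are assumptions made here.

```python
from typing import Optional


def recode_ps(presence: Optional[int], severity: Optional[int]) -> Optional[int]:
    """Map a (Presence, Severity) pair to one compound-item category.

    Assumed coding (hypothetical): presence is 0 or 1; severity is
    0, 1, 2, ... when the follow-up was asked and None otherwise.
    """
    if presence is None:
        return None          # the filter itself was not answered
    if presence == 0:
        return 0             # fused cell: Severity was never administered
    if severity is None:
        return None          # filter passed but the follow-up is missing
    return 1 + severity      # shift Severity categories above "not present"


# Binary P and three-category S give a 1 x 4 layout: 0, 1, 2, 3.
pairs = [(0, None), (1, 0), (1, 1), (1, 2)]
print([recode_ps(p, s) for p, s in pairs])
```

Returning `None` when the follow-up is missing keeps genuine nonresponse separate from the structural missingness created by the filter.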
Table 2. Transforming a Pair of Presence and Severity Parts (Left) to a Compound Item (Right, With 1st Column Indicating Recoded Response After Combining P and S)
When the Presence and the Severity parts are generalized to have multiple options, the corresponding one-way layout would have, say, K levels. The ordinality of certain recoded response categories can be checked by using a nominal IRT model to see whether the categories order in the proper way. Of them, Bock’s (1972) NRM is the best studied and has clear advantages in terms of clarity, relative parsimony, and the fact that it nests a number of important models as special cases.
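For simplicity, assume that each compound item has a common K possible response options, although this can be relaxed to allow differing numbers of response options in different items. The item response function of the NRM for item j = 1, . . . , J, category k = 0, . . . , K − 1 is

$$P(X_j = k \mid \theta) = \frac{\exp(a_{jk}\theta + c_{jk})}{\sum_{h=0}^{K-1} \exp(a_{jh}\theta + c_{jh})},$$

where X_j denotes the response to compound item j, and a_jk and c_jk are the slope and intercept parameters of category k.
To identify the model, it is usual to constrain the parameters within each item, for example by requiring them to sum to zero across categories:

$$\sum_{k=0}^{K-1} a_{jk} = \sum_{k=0}^{K-1} c_{jk} = 0.$$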
Despite their appearance as category slopes, the a parameters determine the ordering of the response categories (Thissen et al., 2010). Specifically, for a combined P-S item the ordering of “not present” (i.e., P = 0) and “present but least severe” (i.e., P = 1 and S = 0) can be examined by comparing a0 and a1. This comparison is crucial for determining whether the Presence part can be meaningfully thought of as the bottom category of the combined item or not. The rest of the a parameters should be ordered appropriately as well, but they reflect scaling of the Severity component, conditional on the Presence part being endorsed.
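To make the role of the a parameters concrete, the following sketch evaluates NRM category probabilities, assuming the usual multinomial-logit form in which category k has linear predictor a_k θ + c_k. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def nrm_probs(theta: float, a: Sequence[float], c: Sequence[float]) -> List[float]:
    """Category probabilities under the nominal response model at one theta."""
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    top = max(z)                              # subtract the max for stability
    w = [math.exp(v - top) for v in z]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical parameters for a compound item with categories 0..3.
a = [0.0, 0.5, 1.0, 1.6]
c = [0.0, -0.4, -0.9, -1.8]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 3) for p in nrm_probs(theta, a, c)])
```

Adding the same constant to every a_k leaves the probabilities unchanged, which is why an identification constraint is needed and why only differences among the a parameters carry information about category ordering.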
Consider the case when P is binary and S is ordinal (with a one-way mapping similar to that shown in Table 2(b)). Without loss of generality, similar rationales can be easily extended to the relevant parts when P or S is partially ordinal. If a0 < a1 < . . . < aK−1, then the response categories behave ordinally, with the difference a1 − a0 being particularly notable. However, it is not the case that proper ordering of the a parameters implies that a well-known ordinal IRT model fits. For instance, the GPCM imposes the additional restriction that item steps have an equal unit within a given item and the partial credit model (PCM) additionally requires all items to have equal units. In short, there are many possible ordinal models in between the NRM and the GPCM, created by constraining the NRM’s parameters in various ways. Agresti (2010) and Thissen et al. (2010) thoroughly discuss the restriction of the multinomial logit and show that many possible restrictions of the multinomial logistic form in the NRM result in ordinal models, not just ones obeying proportional odds as the GPCM and PCM do. The orthogonal basis of the generalized stereotype model is one such example (Johnson, 2007). The GPCM represents a very useful benchmark model, and even if it does not fit well, it might be an adequate approximation for many practical purposes. A graphical way of assessing this fit will be subsequently introduced.
The NRM is estimated using maximum marginal likelihood (MML) methods, and so it is possible to compute likelihood ratio tests between nested models. For instance, one could test whether the GPCM is adequate because the GPCM is nested within the NRM. However, global tests are frequently rejected due to relatively minor deviations from the model and should not be considered definitive (Embretson & Reise, 2000). Often, more focused, local analyses are more informative about a particular misfit of interest than a global test. This is especially crucial to investigations of certain aspects of the use of the P-S format because minor deviations from ordinality for the Severity part may not be particularly important but the location of the Presence part on the continuum is crucial. In the following section and the empirical example, the authors show how local analyses can be used to examine the separation and ordering between the Presence and the Severity parts, which consequently helps explore the utility of the P-S format and provides directions for scale modification.
It is not difficult to compute a Wald test for dk,k−1 = ak − ak−1, or to obtain the corresponding confidence interval, using the asymptotic covariance matrix of the a parameters and the relevant contrast matrix. The difference between the a parameters of adjacent categories is referred to as the category boundary distance, and the test for it can be used to examine the ordering of any pair of adjacent response categories (Preston et al., 2011). This test is especially informative for d10 = a1 − a0 because it reflects whether the Presence part is distinct from the bottom category of the Severity part. If a1 > a0, then the Presence part is ordinally distinguished from the bottom category of the Severity part. That is, “not present” is located below the Severity component on the latent trait, as it should be if it makes sense to append it to the bottom of the Severity part. By contrast, if a1 ≤ a0, the ordering required to append the filter item to the bottom of the Severity part is not met. However, sampling variability blurs this pattern, so the authors base a simple decision rule on the confidence interval for d10 at a nominal confidence level. Ordering implies a1 > a0, and so the item is judged properly ordered when the confidence interval for d10 lies entirely above 0. Conversely, a clear violation occurs when the confidence interval for d10 lies entirely below 0. If 0 is in the confidence interval, the categories are not well distinguished, and the authors prefer to reserve judgment and call the item indeterminate.
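The decision rule can be sketched as follows, assuming that the estimated a parameters of one item and their asymptotic covariance matrix are available as plain Python lists. The function names, the normal-theory multiplier of 1.96, and the numerical inputs are assumptions made for illustration and are not taken from the study.

```python
import math
from typing import List, Sequence, Tuple


def boundary_distance_ci(a: Sequence[float], cov: Sequence[Sequence[float]],
                         k: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Wald-type interval for d_{k,k-1} = a_k - a_{k-1}."""
    if not 1 <= k < len(a):
        raise ValueError("k must index a category above the lowest one")
    contrast: List[float] = [0.0] * len(a)
    contrast[k], contrast[k - 1] = 1.0, -1.0
    d = sum(w * ak for w, ak in zip(contrast, a))
    # Variance of the contrast: w' * Cov * w
    var = sum(contrast[i] * cov[i][j] * contrast[j]
              for i in range(len(a)) for j in range(len(a)))
    half_width = z * math.sqrt(var)
    return d, d - half_width, d + half_width


def classify(lower: float, upper: float) -> str:
    """Three-way rule: interval above 0, below 0, or covering 0."""
    if lower > 0:
        return "ordered"
    if upper < 0:
        return "violation"
    return "indeterminate"


# Hypothetical estimates for a three-category compound item.
a_hat = [0.0, 0.8, 1.5]
cov_hat = [[0.04, 0.01, 0.00],
           [0.01, 0.09, 0.02],
           [0.00, 0.02, 0.16]]
d10, lo, hi = boundary_distance_ci(a_hat, cov_hat, k=1)
print(round(d10, 3), round(lo, 3), round(hi, 3), classify(lo, hi))
```

The covariance term matters: ignoring the off-diagonal entries would misstate the standard error of the difference whenever the two estimates are correlated.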
In addition to the contrasts considered by the Wald tests just discussed, the authors have found it useful to have a graphical method to get a wider scale sense of the items. Category response functions (CRFs) are more usual to plot, but it is very difficult to ascertain ordinality from these plots and they are cognitively overwhelming. Because of their focus on a parameters, the authors developed the following plot. For each item, plot
Response category, k = 0, . . . , K − 1, on the x-axis.
δk = ak − a0 on the y-axis. (This just offsets the origin of the estimated a.)
This graph allows rapid diagnosis of the response category orderings because, under proper ordering, the δk increase with k and all lie above δ0 = 0. As an added benefit, the authors can guess how likely special case models such as the GPCM or PCM will fit because these models imply equal units between adjacent categories, either within a given item or across all items, respectively. This in turn implies that the δk lie on a line through the origin, with different slopes across items for GPCM, and a common slope for all items for PCM. It is worth noting that they chose δk = ak − a0 over dk,k−1 = ak − ak−1 because δk and its associated standard error are directly available in MULTILOG output. To compare the bottom two categories, look at δ1 = d10 = a1 − a0. For the remaining response categories, it is straightforward to derive dk,k−1 from δk, because dk,k−1 = δk − δk−1.
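The quantities behind the plot are simple to compute. The sketch below derives the offsets, the adjacent boundary distances, and the residuals from a least-squares line through the origin, which gives a rough numerical counterpart to judging by eye whether the points fall on a line. The function names and the example values are hypothetical, and only point estimates are used (standard errors are ignored).

```python
from typing import List, Sequence


def delta_offsets(a: Sequence[float]) -> List[float]:
    """delta_k = a_k - a_0, so that delta_0 is 0 by construction."""
    return [ak - a[0] for ak in a]


def adjacent_distances(delta: Sequence[float]) -> List[float]:
    """d_{k,k-1} = delta_k - delta_{k-1} for k = 1, ..., K - 1."""
    return [delta[k] - delta[k - 1] for k in range(1, len(delta))]


def origin_line_residuals(delta: Sequence[float]) -> List[float]:
    """Residuals of delta_k from the least-squares line through the origin."""
    ks = range(len(delta))
    slope = sum(k * d for k, d in zip(ks, delta)) / sum(k * k for k in ks)
    return [d - slope * k for k, d in zip(ks, delta)]


# Hypothetical estimates under a sum-to-zero constraint.
a_hat = [-0.9, -1.1, 0.1, 0.7, 1.2]
delta = delta_offsets(a_hat)
print([round(d, 2) for d in delta])
print([round(d, 2) for d in adjacent_distances(delta)])
print([round(r, 2) for r in origin_line_residuals(delta)])
```

In this made-up example the first boundary distance is negative, which is the pattern the plot is designed to expose for the “not present” category.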
Example
The proposed framework is used to analyze a data set based on the MSAS-SF (Chang et al., 2000). MSAS-SF is a shortened version of the Memorial Symptom Assessment Scale that was originally developed by Portenoy et al. (1994) to assess symptom distress for cancer patients. The use of this instrument was later extended to other populations with chronic diseases, such as AIDS patients (Vogl et al., 1999). MSAS-SF involves 28 symptoms classified by the scale’s authors as physical and 4 symptoms classified as psychological, each assessed by a compound item in P-S format. (However, 1 physical symptom, “problems with sexual interest or activity,” was not administered in the study.) For each symptom, respondents are asked a “yes/no” question on whether they have had the symptom in the past 2 weeks. For the physical items, respondents who report having the symptom are further asked to evaluate “how much did it bother or distress you” on a 5-point Likert-type scale, ranging from 1 for not at all to 5 for very much. For the psychological items, respondents who report having the symptom are further asked to evaluate “how often did it occur” on a 4-point scale, ranging from 1 for rarely to 4 for almost constantly. In accordance with the previous discussion, the Presence and Severity parts were merged into one compound item, with “not present” coded as the lowest number (0) and the Severity scale coded by the integers above that.
The data used in this study were collected by Olden, Rosenfeld, Pessin, and Breitbart (2009). The sample in this study consists of 355 terminally ill cancer patients. There are 302 complete cases (85.1%) on the physical items and 346 complete cases (97.5%) on the psychological items. Only a small percentage of cases (4.2% for physical, 1.1% for psychological) were missing on more than one symptom. However, 4 respondents did not answer any of the psychological items and were thus excluded from analysis of the psychological scale. For all likelihood-based analysis (i.e., fitting IRT models), full information methods were used.
The mean age of respondents was 65.4, with a standard deviation (SD) of 13.56, ranging from 21 to 94. Females constitute 56.6% of the respondents. The racial breakdown is 68.5% White, 23.9% African American, 6.0% Hispanic, and 1.7% Asian. Table 3 shows the observed relative frequency for combined response categories. The observed sample proportions of “not present” vary from 0.20 to 0.83 across items, implying that symptoms vary greatly in prevalence. In addition, category usage is not even. In particular, many items (e.g., constipation) have relatively high category usage at the bottom and top of the scale, with relatively minimal usage in the middle, particularly for the categories involving low severity ratings.
Table 3. Observed Relative Frequency for Combined P-S Responses (in Descending Order of Prevalence)
Note: P-S = Presence-Severity.
The authors used joint correspondence analysis (JCA) to perform a basic screening and descriptive analysis of the physical and psychological symptoms. This is in accordance with Stout’s (2002) recommendation to use a less model-dependent technique prior to running IRT analysis to assess the plausibility of what he calls “essential unidimensionality.” JCA is a multivariate nominal data analysis technique based on an iterative weighted least squares analysis of the two-way margins (Greenacre, 2007; Vermunt & Anderson, 2005). JCA focuses on the two-way associations as measured by Pearson chi-square (called inertia in this literature). All analyses were performed with the JCA program in Stata 11.2 (StataCorp, 2009). JCA requires that the number of dimensions to be extracted be chosen in advance. The authors requested three-dimensional solutions, but the results do not appear to change much if this number is varied.
The four psychological items appear to be strongly unidimensional, and have well discriminated and strongly ordinal response categories for each item, all of which are in the correct ordering and are far apart from each other. The percentage breakdown of inertia by dimension is 47.85%, 29.42%, and 11.95%, respectively. In addition, the “horseshoe effect” appeared for these items. This happens when ordinal scales are subject to correspondence analysis or multidimensional scaling. The second dimension is a quadratic function of the first dimension, and the categories are ordered properly along the first dimension. This makes a U-shape with the smallest category for a particular item on one end of the U and the largest on the other. The same direction and shape are true for all items.
In contrast, the physical items are much messier. The breakdown of inertia is less clear-cut, with a percentage breakdown by dimension of 29.66%, 12.66%, and 4.80%, respectively, which is indicative of a dominant first dimension, albeit not as well fitting as for the psychological items. Unlike for the psychological items, the scores for the response categories for a number of the physical items are not in the proper order and are indistinct. Nonetheless, JCA suggests the physical scale is perhaps “essentially unidimensional,” in that, while the solution does not fit as well as for the psychological items, the second dimension is still a horseshoe, by and large.
Now, the authors turn their attention to the IRT fitting. JCA results suggest that the psychological items are more likely to fit well and perform ordinally, whereas the physical items are more problematic, and this is indeed what they found. They used MULTILOG 7.03 to fit the NRM and PARSCALE 4.1 for the GPCM (SSI, 2003). No problems were observed in estimation, and all models converged quickly to regular optima. There is no evidence for an improper solution for any of these data. They increased the maximum number of iterations and the number of quadrature points to check whether this altered the solution, but it did not, and so all information reported here is based on program default values.
The authors start with the analysis of the psychological items, by fitting the NRM to determine whether the Presence part is the bottom category and, more broadly, whether the expected ordinal structure holds. Figure 1 is the δ-plot based on the NRM for these items, showing ascending order of estimates for all items, although only weakly ascending order for “Worrying” in the categories “rarely” and “occasionally.” For all items, “not present” has a significantly lower value than “rarely,” meaning that the filter is acting as the lower category. That is, the category boundary distance d10 (equal to δ1) is positive and its 95% confidence interval does not include 0. In addition, the categories representing values on the Severity part are in ascending order. This is perhaps unsurprising given that the Presence and the Severity parts’ wordings line up because they are both based on the frequency of a particular symptom. However, the δ (and thus a) estimates within an item do not appear to lie on a line, and therefore, the GPCM is unlikely to fit perfectly, as indeed it does not compared with the NRM according to the likelihood ratio test (−2log L: NRM, 3,428.7; GPCM, 3,447.4; G²(8) = 18.7, p = .017). The Akaike information criterion (AIC) is 3,492.7 for the NRM and 3,495.4 for the GPCM, which weakly favors the NRM. In addition, item fit statistics for the GPCM suggest that none of the items fits the GPCM, so lack of fit is not isolated to one item alone. However, the GPCM still might not be a bad approximation to the data, as the AIC comparison shows.
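The arithmetic behind this model comparison can be checked directly from the reported deviances and AIC values. The sketch below is only an illustration: the helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form chi-square tail probability that holds for even degrees of freedom, so it needs nothing beyond the standard library.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Values reported for the psychological items.
dev_nrm, dev_gpcm = 3428.7, 3447.4      # -2 log L
aic_nrm, aic_gpcm = 3492.7, 3495.4

g2 = dev_gpcm - dev_nrm                  # likelihood ratio statistic
p_value = chi2_sf_even_df(g2, df=8)

# AIC = -2 log L + 2 * (number of parameters), so the counts can be recovered.
n_par_nrm = (aic_nrm - dev_nrm) / 2
n_par_gpcm = (aic_gpcm - dev_gpcm) / 2

print(round(g2, 1), round(p_value, 3))
print(round(n_par_nrm), round(n_par_gpcm), round(n_par_nrm - n_par_gpcm))
```

The difference in implied parameter counts equals the 8 degrees of freedom used in the likelihood ratio test, which is a quick consistency check on the reported figures.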

Figure 1. The δ-plot based on the NRM for all four psychological items: x-axis for responses (0 = not present, 1 = rarely, 2 = occasionally, 3 = frequently, 4 = almost constantly) and y-axis for estimated δk values
Next, the authors consider the effect of the P-S design on the physical items. As they would expect from the JCA results, the situation is markedly messier. The likelihood ratio test shows that the GPCM is not a good model for these data (−2log L: NRM, 23,121.0; GPCM, 23,360.2; G²(81) = 239.2, p < .001). The AIC is 23,553.0 for the NRM and 23,684.2 for the GPCM, which strongly favors the NRM.
A local analysis based on plots is more informative. Figure 2 shows the δ-plot for these items. Of the 27 physical items, only 5 items have δ1 estimates greater than 0. However, none of these are statistically different from 0. Among the rest, 17 items have negative δ1, but these are not statistically significant. More troubling, however, 5 items (whose titles appear in bold italics in the figure) have δ1 estimates that are negative and statistically different from 0. These symptoms clearly violate the ordering that would be expected if “not present” were below “present but no distress.” Estimated δ for the rest of the responses (from “a little bit” to “very much”) are generally in ascending order, suggesting that a higher response category reflects higher severity, which roughly confirms the authors’ expectation of the ordinal relationship for categories 1 to 5, although the category boundary distances for these later categories are often not statistically significant. Respondents appear to have difficulty distinguishing different levels of distress on this scale.

Figure 2. The δ-plot based on the NRM for 27 physical items: x-axis for responses (0 = not present, 1 = present but not at all distressed, 2 = a little bit, 3 = somewhat, 4 = quite a bit, 5 = very much) and y-axis for estimated δk values
In addition to the δ-plot, the authors show CRF plots (Figure 3) for the NRM estimates of selected symptoms from the psychological and physical subscales. These provide examples of what well-fitting and poorly fitting items look like, illustrating four distinct scenarios for items in this case. “Difficulty concentrating” is an item for which ordering makes sense. The bottom and top categories of the item have appropriately monotone CRFs, and intermediate categories are ordered in the right way and are reasonably distinct. This is an example of a good item. “Lack of energy” illustrates an example of a weakly violating item. Its δ1 estimate is negative but not statistically significant, and thus, the CRF for “not at all” is not monotone decreasing. Nonetheless, the item seems to perform reasonably well, despite the fact that the filter is not perfectly distinguished. “Dry mouth” illustrates a strongly violating item. Its δ1 estimate is negative and statistically significant. The CRF for “not present” clearly peaks below that for “present but no distress.” Nonetheless, the ordinal structure of the item emerges in the upper categories. Finally, “Diarrhea” clearly illustrates a noninformative item, which is not surprising given its relative lack of prevalence in the sample. Here, the CRFs are nearly flat.

Figure 3. Item characteristic curves for four items
One point of potential interest involves using a model such as the GPCM, despite the fact that it does not adequately fit the data, which is particularly true for the physical items. The GPCM is, of course, much simpler than the NRM. Although the NRM shows clear differences at the level of items, as a scoring model, the GPCM and even the simple unweighted mean score end up giving what may appear to be relatively similar results to the NRM. For the physical items, the Pearson correlation between empirical Bayes scores from the NRM and the GPCM is .937, between the NRM and the total score is .923, and between the GPCM and the total score is .982. For the psychological items, these correlations are .981, .963, and .978, respectively. They might suggest that the simple total score with the strategy of appending the filter at the bottom is adequate in the sense that the resulting scores are essentially the same. However, in Figure 4, the authors show scatterplots of the empirical Bayes scores from the NRM and the GPCM plotted against the standardized total score for the physical and psychological items, respectively. There is notable heteroscedasticity and nonlinearity in the physical scale concentrated at the low end of the total score. This is the location where the filter would make the most difference. It is also where the scale information functions are weakest. In sum, these seemingly high correlations should not mislead users into assuming that the scores provided by these models are the same, especially at the low end of distress.

Figure 4. Comparing IRT-based distress estimates to the standardized sum scores
Another point that may interest many researchers is to compare the P-S format with competing designs, or in particular, to examine the loss when the follow-up question is ignored. This could be achieved by comparing the test information and thus the posterior SD of the latent trait estimates across designs. As shown in Figure 5, the P-S format with NRM fitting is contrasted to a simple scenario in which only the Presence part is considered and the two-parameter logistic (2PL) model is used to analyze the binary data. For physical items (in the top left plot in Figure 5), the test information for the P-S format considerably dominates that for the binary case when estimated distress is roughly above −1 in standardized units. The difference in test information achieves its maximum when distress is approximately 1 in standardized units. Such loss in test information when the Severity component is ignored translates into a loss in measurement accuracy (as shown in the top right panel in Figure 5). For psychological items, the difference is less severe, but the bottom left plot in Figure 5 still shows a serious loss in test information when distress is roughly above 0.5 in standardized units. However, the measurement accuracy (in the bottom right panel in Figure 5) is comparable for distress within 1 SD around its mean, the range in which one often expects to see the majority of distress estimates fall. The huge gap between the posterior SD curves when distress is distant from the mean (i.e., distress between ±3) may be partially due to the fact that the psychological subscale only has four items.

Test information curve and posterior SD for the distress estimate: solid line for NRM with all combined response categories and dashed line for 2PL model with Presence part only (i.e., “not present” vs. “present”)
Discussion
Because scale design and item analysis go hand in hand, extending IRT methodology to areas outside educational measurement, such as clinical assessment, survey research, and PROs, provides useful impetus for new developments. Many well-understood methods need to be adapted or at least rethought before they will be useful in new contexts. In this article, the authors developed a framework to perform item analysis on P-S scales. They demonstrated that P-S items can be thought of as compound items that can be analyzed using Bock’s (1972) NRM. Furthermore, they showed how the appropriateness of ad hoc strategies, such as appending the Presence part as the lowest category of the Severity component, can be tested empirically. They now offer some lessons learned from the empirical example, discuss the choice of models, and finally suggest future directions for research.
One key benefit of having a model is the ability to make decisions about scale format using a fine-grained analysis. For instance, the analysis of the MSAS-SF suggests that it should be revised. For the psychological items, it is probably simpler to use a more typical Likert-type response format than the P-S format. Similarly, for most physical items, the Presence part does not seem to provide any information in addition to the Severity component. For a few physical items, the design introduces confusion for respondents with low distress. For these items, it seems inappropriate to treat “not present” as the lowest category of Severity. More broadly, the scale seems to have too many categories that are not differentiated empirically. Decreasing the number of categories would reduce clinicians’ and respondents’ burden and improve accuracy.
Livote (2011) analyzed the Condensed Memorial Symptom Assessment Scale (CMSAS; Chang, Hwang, Kasimis, & Thaler, 2004), a condensed subset of 14 symptoms from the MSAS-SF. Her analysis backed the conclusion that the P-S format in the MSAS-SF may not be efficiently designed. She reported that in a palliative care setting in a large Veterans Administration hospital, about 40% of the affirmative responses to the Presence part were missing the corresponding Severity response. Suggested reasons included differences in clinician interviewing style, compliance with completing records, and respondent fatigue. Of course, other scales may function better by switching to the P-S format from a different one. Which format is better could be studied by randomizing respondents into different formats and considering the effect on item properties, as was done in Anderson et al. (2010).
There are also important questions about the dimensionality of the MSAS-SF, and this is likely to be an issue in other contexts as well, particularly where symptom clusters are present. The scale’s authors assigned each symptom to be physical or psychological, but not both. However, many symptoms assigned to be physical may be psychological, or a mixture of the two, for example, “difficulty sleeping,” “lack of energy,” or “difficulty concentrating.” Unfortunately, assessment of ordinality is confounded with multidimensionality or local dependence among symptoms. The authors analyzed these data in the way that the scale’s authors conceived of them, based on the instructions given to the respondents. It would clearly be very useful to consider this question further, but estimation of multidimensional nominal models is not easy. Further results in Livote (2011), based on the bifactor model applied to the Presence part, show that essential unidimensionality holds. To make the information curve more realistic when multidimensionality is present but a dominant first dimension exists, the robust covariance matrix method given in Cheng and Yuan (2010) could be used during scoring.
The authors use Bock’s (1972) parametric model as the basis for modeling for two primary reasons. First, it is reasonably well understood and its parameters have known interpretations. The analysis depends on the a parameters from the NRM, which can be interpreted directly in the CRFs, focusing attention on the comparisons that actually matter. Second, the availability of the NRM in commercial software is a substantial advantage. Nearly all analyses shown here can be performed using widely available software with minimal special coding, which is very helpful to applied researchers who may lack the expertise or time to code their own models. The newly released IRTPRO greatly extends what is possible. IRTPRO estimates a multidimensional NRM, allows specification of a broad array of restricted nominal models, generates improved goodness-of-fit measures, and can generate standard errors for the δ-plot.
Although the authors prefer the NRM estimated by MML, there are alternatives or generalizations that bear mentioning. Anderson et al. (2010) develop a pseudolikelihood approach to fitting the multidimensional NRM that is computationally efficient (although not as statistically efficient). However, their approach is not implemented in standard software. Suh and Bolt (2010) generalize the NRM using a nested logit approach to allow a broader range of choice patterns among options. The discrete choice literature has additional models that could be used as the basis for an IRT model, for instance, incorporating category- and/or respondent-specific covariates (Train, 2009). Finally, Ramsay proposed a nonparametric IRT model for nominal data, estimated by kernel smoothing methods, that makes less stringent assumptions than the NRM and thus would provide more empirical fidelity to the given data set. Indeed, this model has been used in clinical applications to test the ordinality of items (Santor, Ramsay, & Zuroff, 1994). Rossi, Wang, and Ramsay (2002) extend the nonparametric approach to the likelihood framework using the EM algorithm, and considering their approach would be useful for future research.
Looking past the P-S format, the general approach here can be extended to other filter question designs. The authors sketch a few possibilities. As implied in Table 2(d), when the numbers of response categories in the Presence and the Severity parts increase, the number of recoded categories may expand drastically. The strategy adopted for P-S would be severely complicated by the increasing number of cells in the contingency table. For instance, the Clinician-Administered PTSD Scale (CAPS; Blake et al., 1995) has five points for each part, which creates a 5 × 5 table for each symptom. Turning this into a one-way layout directly is impractical because the number of categories, and hence the number of parameters, is very large. Parameter restrictions such as those used in log-multiplicative models (Vermunt & Anderson, 2005) or the generalized stereotype model (Johnson, 2007) may be useful.
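The growth in categories and parameters can be made concrete with a small counting sketch. It assumes an unrestricted unidimensional NRM, in which an item with K categories has K slopes and K intercepts and one identification constraint on each set, leaving 2(K − 1) free parameters per item. The function names and the row-major coding of the two-way table are hypothetical choices for illustration.

```python
def crossed_category(row, col, n_cols):
    """Recode a cell of a two-way response table into one nominal category (row-major)."""
    return row * n_cols + col

def nrm_free_parameters(n_categories):
    """Free item parameters in an unrestricted unidimensional NRM: 2 * (K - 1)."""
    return 2 * (n_categories - 1)

# P-S item: one 'not present' category plus four Severity categories.
ps_categories = 1 + 4
# Fully crossed design with five points on each part, as in a 5 x 5 table.
crossed_codes = {crossed_category(r, s, 5) for r in range(5) for s in range(5)}

ps_params = nrm_free_parameters(ps_categories)            # 8 per item
crossed_params = nrm_free_parameters(len(crossed_codes))  # 48 per item
```

Under these assumptions, a one-way recoding of the 5 × 5 table needs six times as many free parameters per item as the P-S recoding, which is why restricted parameterizations become attractive.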
The P-S format involves only a single branching within the compound item, but questionnaires often have sequential branching patterns. For instance, in a survey about alcohol use, if the respondent reports not drinking alcohol, it is not informative to ask how often he or she has been drunk in the last 30 days or whether he or she has had six drinks in a row. Reardon and Raudenbush (2006) propose fusing a Rasch model and a discrete survival model to address this problem. Their approach may be helpful in the context of a scale with skipping, and it represents an initial foray into modeling these more complex structures. The resulting model can be estimated using ordinary multilevel logistic regression, which is a substantial advantage. However, it requires a great deal of judgment in model specification and maintains the restrictive properties of the Rasch model. Böckenholt (in press) provides methods that might greatly broaden the reach of sequential branching methods and relax the restrictions imposed by the Rasch model. However, these models are likely to be even more complex than Reardon and Raudenbush’s, and thus out of the reach of nonmethodologists.
Footnotes
Acknowledgements
The authors would like to thank Barry Rosenfeld for sharing the data, Charles Lewis for key discussion, and Amy Racanello for helpful editing and providing insight into why clinicians use the Presence-Severity format. Remaining errors are the responsibility of the authors.
Authors’ Note
The authors contributed equally, and their names are listed alphabetically.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
