Abstract
We evaluated the use of the nominal response model (NRM) to score multiple-choice (also known as “select the best option”) situational judgment tests (SJTs). Using data from two large studies, we compared the reliability and correlations of NRM scores with those from various classical and item response theory (IRT) scoring methods. The SJTs measured emotional management (Study 1) and teamwork and collaboration (Study 2). In Study 1, the NRM scoring method was superior to three classical test theory–based and four other IRT-based methods, both in reliability and in the size of its correlations with external measures. In Study 2, only slight differences between scoring methods were observed. One explanation for the discrepancy is that where item keys are ambiguous (as in Study 1), the NRM accommodates that ambiguity, whereas where item keys are clear (as in Study 2), the different methods provide interchangeable scores. We characterize ambiguous and clear keys using category response curves based on parameter estimates of the NRM and discuss the relationships between our findings and those from the wisdom-of-the-crowd literature.
A situational judgment test (SJT) is one comprising items in which a situation is presented followed by several possible responses to that situation. Consider the following example, which uses a multiple-choice response format (also known as “select the best option”), which is the focus of the current investigation: You are nervous about a speech that you need to give to the whole school about a project you’ve been working on. You are worried that some students will not understand your speech, as it deals with some very difficult topics. What would you do in this situation? A. Work on your speech to make it easier to understand and ask for questions afterwards. B. Practice your speech in front of your family or close friends. C. Just give the speech. D. Be positive and confident, knowing it will go well.
A characteristic of SJTs is that they are often designed to measure judgments in situations for which there is no consensus on the correct response, or there might be multiple correct or acceptable responses. This is not a flaw in their design, as it might be considered in conventional standardized ability or achievement testing (e.g., mathematics, figural matrices). Rather, accommodating situations in which there is no consensus correct answer is a feature that enables the measurement of constructs not easily amenable to conventional testing, such as tacit knowledge (Polanyi, 1966), practical intelligence (Sternberg et al., 2000), or the things “they don’t teach you in Business School” (Nelson, 2011). SJTs are not constructs (Arthur & Villado, 2008); they are measurement methods. But SJTs may be distinctively useful for identifying and measuring constructs that might otherwise be difficult to identify and measure.
SJTs are typically constructed with critical incidents (Flanagan, 1954), which elicit knowledge or behavior related to a construct (e.g., teamwork, leadership, interpersonal skills) based on real-world situations. There might be some agreement or a majority opinion (the wisdom of the crowd; Surowiecki, 2004) on the best course of action in those situations, or there might be situations where even reasonable people with good intentions can disagree. Because of this feature of SJTs, scoring SJTs can be challenging. For example, in the example item, Options A, B, and D might all be deemed acceptable, and opinions might vary on which alternative is more justifiable. One response to this situation by the assessment community might be to design items so that only clear, objectively correct keys can be established, a kind of “looking for the keys under the lamppost because that is where the light is” strategy. Another response is to acknowledge the complexity of situational judgment as inherent to much real-world human decision making and identify a useful scoring strategy for assessments that address this activity.
Thus, the purpose of this study is to compare strategies for scoring situational judgment tests empirically, across two studies with large-scale real data sets. We first propose a taxonomy for scoring multiple-choice SJTs, which includes both classical test theory–based and item response theory (IRT)–based methods. Among these methods, we argue that the nominal response model (NRM; Bock, 1972) is an ideal candidate for scoring SJTs because the NRM does not require a scoring key or a priori ordering of response options. The flexibility of the NRM for yielding information on response options, which then can be used to score those responses, makes it ideally suited for scoring SJTs, which often do not have clear-cut answer keys. We evaluate the efficacy of the NRM for scoring SJTs by comparing it to other IRT and classical scoring methods that are used in the literature or that could be used.
SJT Response Formats
For situational judgment items, there are many response format variations, but most often, examinees are asked either to (a) select the best (and optionally, also the worst) from among two or more options (multiple-choice SJTs), (b) rank options from best to worst, (c) rate each option using a rating scale (e.g., Likert scale; rating-scale SJTs), or (d) rate each option as acceptable or not (true-false SJTs). Of these, multiple-choice SJTs (select the best option or select the best and select the worst from among several options) and Likert-scale option rating formats are relatively common response methods. Ranking options is rarely used, perhaps because of a belief that distinguishing middle-level options is not very informative for the additional time it takes to make such judgments, a belief in line with findings from Arthur et al. (2014). The true-false (or acceptable–not acceptable) format is also less frequently used.
Both rating-scale and true-false SJT formats are subject to response-style effects, that is, the tendency to respond on rating scales in ways independent of one’s standing on the construct being measured (He, 2015). Response styles include acquiescence bias or yea-saying (the tendency to select true or respond positively independently of the construct being measured), extreme response style (the tendency to select extreme responses on the response scale), or modest response style (the tendency to select rating categories in the middle). The magnitude of response style effects can vary, being more pronounced with some constructs than with others (van Dijk, Datema, Piggen, Welten, & van de Vijver, 2009) and more pronounced with diverse cultural subgroups (e.g., Bachman & O’Malley, 1984; Chen, Lee, & Stevenson, 1995). Response style effects may be controlled to some degree through standardization to reduce elevation and scatter differences (McDaniel et al., 2011), but these and other statistical adjustments cannot be guaranteed to solve the response style problem because there is no unambiguous way to separate construct-relevant from construct-irrelevant variance.
There may be evidence for response style effects on SJTs. Arthur et al. (2014) found that although ranking and multiple-choice formats were associated with higher correlations (compared to rating formats) with cognitive ability, the rating scale format (compared to ranking and multiple choice) was associated with higher internal consistency and higher correlations with personality scales. Both of these findings favoring rating scales could be attributable to response style effects. Response styles are a source of consistent score variance, which therefore boosts estimates of internal consistency. Response styles also are expressed across constructs in a consistent manner, which would boost correlations between other measures that also used rating scales, which the personality measures reported by Arthur et al. do (had Arthur et al. used forced-choice personality measures, this source of common variance would not be present).
In addition to response style problems, there are additional problems with rating-scale SJTs. McDaniel et al. (2011) point out that middle location items (i.e., items for which the best answer is at a middle point of the Likert response rating scale) have lower validity, at least partly due to adoption of a construct-irrelevant minimax strategy for option selection. McDaniel et al. recommend not using such items, eliminating a large pool of potential items.
A solution to all these problems is to avoid the use of rating scales or true-false judgments altogether and simply adopt the use of multiple-choice (or ranking) methods for assessing situational judgment. Consequently, in this study, we focus on multiple-choice SJTs. (For scoring rating-scale SJTs, see De Leng et al., 2017; Legree, Kilcullen, Psotka, Putka, & Ginter, 2010.)
Scoring Multiple-Choice SJTs
It is useful to consider the variety of approaches that can be taken to score multiple-choice SJTs. As suggested in Table 1, scoring methods can be categorized by (a) whether they provide dichotomous (right-wrong) or polytomous (partial credit) item scores; (b) whether items contribute equally (unit weighted) or differentially (varying) to the final score; (c) for partial-credit scoring only, whether categories within an item are considered to have equal discrimination or not; (d) whether scoring requires the prior ordering of response options within each item (e.g., best to worst or simply, which is best); and (e) whether scoring is based on classical test theory or item response theory. Scoring methods associated with each set of design attributes, which are the methods examined in this study, are shown in Table 1.
Table 1. Taxonomy of Classical and Item Response Theory–Based Scoring Methods for Situational Judgment Tests.
Note: 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; RSM = rating scale model; PCM = partial credit model; GPCM = generalized partial credit model; GRM = graded response model; NRM = nominal response model.
a In this study, we used principal components weighting, but other methods are possible; see text.
b Empirical methods (Bergman, Drasgow, Donovan, Henning, & Juraska, 2006): regression-based item weighting to a criterion; see text (not used in this study).
c Empirical methods (Bergman et al., 2006): regression-based item and option weighting to a criterion; see text (not used in this study).
The taxonomy of approaches listed in Table 1 complements previous reviews of SJT scoring approaches. For example, Bergman et al. (2006) reviewed “the scoring strategies that have been described in the literature” (p. 223), all of which were classical test theory approaches. Bergman et al. identified three categories of SJT scoring methods: empirical, theoretical, and expert-based. We do not adopt that terminology here, but we cover the basic scoring methods and show how Bergman et al.’s categories are aligned with those discussed in the following. Table 1 includes both IRT scoring methods as well as classical test theory–based methods. We next describe the scoring methods in more detail.
Classical Test Theory–Based Methods
Number correct scores or sum scores more generally
To compute number correct scores or sum scores requires a key, typically given by experts (Bergman et al.’s [2006] “expert-based” scoring category). It is also possible that the key can be derived from a theory (Bergman et al.’s [2006] “theoretical scoring”). For example, MacCann and Roberts (2008) developed a key based on appraisal theory to score their Situational Test of Emotional Understanding. A key also could be defined by response popularity, that is, the key is the most popular response. An item can also have more than one key, such as scoring responses for identifying both best option and worst option responses (a point is subtracted for every time a worst option is selected as best or vice versa). These can similarly be identified by experts, theory, or popularity. Bergman et al. obtained worst option keys by comparing experts and novices and identifying items in which novices and experts differed on what they considered best, and at least one-third of the novices selected that option as best. Regardless of method, a total score is the number of correct responses an examinee selected, or if worst option judgments are also elicited, the total score is the sum of the number of best options correctly identified and worst options correctly identified minus the number of worst options selected as best and best options selected as worst (e.g., Oswald, Schmitt, Kim, Ramsay, & Gillespie, 2004). There are other ways to combine best option and worst option responses, such as treating them as separate items and weighting them differentially, which is a weighted sum score.
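As a minimal sketch of number correct scoring with a popularity-derived key, the following Python fragment finds the modal option for each item and counts the matches for each respondent. The function names (`popularity_key`, `number_correct`) and the four-respondent data set are hypothetical and serve only to illustrate the computation; ties in popularity are not handled specially.

```python
from collections import Counter

def popularity_key(responses):
    """Return the most frequently chosen option for each item (column)."""
    n_items = len(responses[0])
    return [Counter(row[i] for row in responses).most_common(1)[0][0]
            for i in range(n_items)]

def number_correct(row, key):
    """Count the items on which the respondent chose the keyed option."""
    return sum(1 for choice, keyed in zip(row, key) if choice == keyed)

# Hypothetical data: 4 respondents x 3 items, options labeled A-D.
responses = [
    ["A", "B", "D"],
    ["A", "C", "D"],
    ["B", "B", "A"],
    ["A", "B", "D"],
]
key = popularity_key(responses)
scores = [number_correct(row, key) for row in responses]
```

An expert-based or theory-based key would simply replace `key` with a fixed list; the scoring step is unchanged.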
Weighted sum scores
Weighted sum scores are sums of the weighted number correct, in which each item (or item response in the case of best option and worst option responses) is weighted according to its importance or contribution to the total score. Bergman et al. (2006) propose a “factorial scoring” method but do not describe it in detail. Factor analysis (or principal components) can provide a basis for weighting items. In this study, items are weighted based on a principal components analysis of the right-wrong (1-0) item covariance matrix; the weighted sum score is the component score based on the first principal component.
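The principal components weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the leading eigenvector of the item covariance matrix is approximated by power iteration (which assumes a unique dominant eigenvalue), and the function names and the small 1-0 data set are hypothetical.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix and column means of right-wrong item scores."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    cov = [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)
            for k in range(p)] for j in range(p)]
    return cov, means

def first_component(cov, n_iter=500):
    """Unit-length leading eigenvector of cov, found by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def component_score(row, weights, means):
    """Weighted sum of mean-centered item scores."""
    return sum(w * (x - m) for w, x, m in zip(weights, row, means))

# Hypothetical right-wrong (1-0) item scores: 6 respondents x 3 items.
item_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]]
cov, means = covariance_matrix(item_scores)
weights = first_component(cov)
scores = [component_score(row, weights, means) for row in item_scores]
```

Items that covary more strongly with the rest of the test receive larger weights, so they contribute more to the total score than under unit weighting.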
Regression-based item and option weights (empirical methods)
The weighted sum score is one example of Bergman et al.’s (2006) “empirical methods” to determine item weights. However, both items and response options can be weighted or determined empirically, for example, by regressing an outcome (criterion) variable on dummy-coded options and items. Options and items with greater regression weights contribute proportionately more to the total score. However, this approach requires an outcome variable, and changing the outcome variable will in general change the crediting scheme (i.e., options and items receiving greater credit with one outcome variable will not in general be the same ones that get greater credit with a different outcome variable). We do not use this approach in the current study.
Proportion-consensus scores
Although not reviewed by Bergman et al. (2006), another empirical method, known as proportion-consensus scoring, does not require a key or a response order, nor does it need an outcome variable (Legree, Psotka, Tremble, & Bourne, 2005; MacCann, Roberts, Matthews, & Zeidner, 2004). A respondent’s score for an item is the proportion of the sample (in this study, it is the sample of respondents, but in principle, it could be an external sample, such as experts) who selects the same response the respondent chooses. For example, if for a particular item the proportions of the sample selecting Options A, B, C, and D are .40, .28, .10, and .22, respectively, then a respondent selecting A gets a score of .40 for that item, a respondent selecting B gets a score of .28 for that item, and so on. The total score is the sum of the item scores.
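The worked example in the preceding paragraph can be reproduced with a short sketch. The function names are hypothetical, and the single-item sample of 50 respondents is constructed only so that the option proportions equal .40, .28, .10, and .22.

```python
from collections import Counter

def option_proportions(responses):
    """For each item, map every chosen option to its proportion in the sample."""
    n = len(responses)
    n_items = len(responses[0])
    return [
        {opt: count / n for opt, count in Counter(row[i] for row in responses).items()}
        for i in range(n_items)
    ]

def consensus_score(row, proportions):
    """Sum, over items, the sample proportion that chose the same option."""
    return sum(proportions[i].get(choice, 0.0) for i, choice in enumerate(row))

# One item, 50 respondents: 20 chose A, 14 chose B, 5 chose C, 11 chose D.
sample = [["A"]] * 20 + [["B"]] * 14 + [["C"]] * 5 + [["D"]] * 11
proportions = option_proportions(sample)
score_a = consensus_score(["A"], proportions)
score_b = consensus_score(["B"], proportions)
```

Passing proportions computed from an external sample (e.g., experts) instead of the test-taking sample changes only the input to `consensus_score`.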
Item Response Theory–Based Methods
Suppose a multiple-choice situational judgment test has $I$ scenarios (each scenario is an item), the $i$th item has $m_i$ options, and examinees are asked to pick the best option for each item. Let $\theta$ be the latent ability that generates the responses on these items. The IRT models summarized in Table 1 are described in detail in the following subsections.
One- and two-parameter logistic models
If a key for each item is available, obtained through expert knowledge, by in- or out-of-sample popularity, or by other means, the responses to item $i$ can be coded as

$$X_i = \begin{cases} 1, & \text{if the keyed option is selected,} \\ 0, & \text{otherwise.} \end{cases}$$

The 1PL model gives the probability of answering the $i$th item correctly given the ability $\theta$ as follows:

$$P(X_i = 1 \mid \theta) = \frac{\exp[a(\theta - b_i)]}{1 + \exp[a(\theta - b_i)]}, \tag{1}$$

where $a$ is the common slope parameter for all the items in the test, and $b_i$ is the difficulty parameter for item $i$. That is, items vary in difficulty, but all items are assumed to have the same discrimination value.

The 2PL model allows items to contribute differentially to the measurement of the latent ability by having their own slope parameters ($a_i$), that is,

$$P(X_i = 1 \mid \theta) = \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}. \tag{2}$$

Note that Equation 2 is Equation 1 with subscripts on the item slopes ($a_i$) indicating different weights for different items. Also note that both the 1PL and 2PL assume there is a single correct response for each item and that item responses are coded in a binary fashion as either being correct ($X_i = 1$) or incorrect ($X_i = 0$).
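To make the contrast between the two dichotomous models concrete, the sketch below evaluates the standard logistic response function with a common slope (1PL) and with item-specific slopes (2PL). All parameter values are hypothetical, and the function name `p_correct` is introduced here only for illustration.

```python
import math

def p_correct(theta, slope, difficulty):
    """Logistic probability of a keyed (correct) response."""
    return 1.0 / (1.0 + math.exp(-slope * (theta - difficulty)))

# Hypothetical parameters for three items.
difficulties = [-1.0, 0.0, 1.0]
common_slope = 1.2             # 1PL: one slope shared by all items
item_slopes = [0.6, 1.2, 2.0]  # 2PL: one slope per item

theta = 0.5
p_1pl = [p_correct(theta, common_slope, b) for b in difficulties]
p_2pl = [p_correct(theta, a, b) for a, b in zip(item_slopes, difficulties)]
```

Under the 1PL, the items differ only in where their curves are located; under the 2PL, items with larger slopes separate examinees near their difficulty more sharply and therefore carry more weight in the ability estimate.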
Multiple choice response data, as are obtained with a multiple-choice SJT, can also be treated as polytomous. If treated as polytomous, then which category the respondent chooses is the response for the item. Treating the data as polytomous enables partial credit scoring.
Nominal response model
The nominal response model (Bock, 1972) is a very flexible polytomous IRT model. It does not require a scoring key or an ordering of responses; it enables partial credit for different option selections and allows for differential item weights and varying category discriminations.
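The NRM assumes that a continuous latent variable ($\theta$) accounts for the covariances among items with unordered (i.e., nominal) responses. Let $X_i$ be the response on the $i$th item and let $k = 1, \ldots, m_i$ index its response options. The probability of selecting option $k$ given $\theta$ is

$$P(X_i = k \mid \theta) = \frac{\exp(a_{ik}\theta + c_{ik})}{\sum_{h=1}^{m_i} \exp(a_{ih}\theta + c_{ih})}, \tag{3}$$

where $a_{ik}$ and $c_{ik}$ are the slope and intercept parameters for option $k$ of item $i$, subject to an identification constraint (e.g., $a_{i1} = c_{i1} = 0$).

By fitting the NRM (Equation 3) to the response data, item parameters ($a_{ik}$ and $c_{ik}$) can be estimated.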
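As an illustration of how NRM parameters translate into category response curves, the sketch below evaluates the standard NRM form, in which each option has its own slope and intercept and the option probabilities are a softmax of the resulting linear terms. The parameter values are hypothetical (not estimates from either study), and the first option is taken as the reference category with slope and intercept fixed at zero.

```python
import math

def nrm_probabilities(theta, slopes, intercepts):
    """Option probabilities under the NRM: softmax of slope * theta + intercept."""
    logits = [a * theta + c for a, c in zip(slopes, intercepts)]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-option item (A, B, C, D); option A is the reference.
slopes = [0.0, 1.1, -0.4, 1.0]
intercepts = [0.0, 0.3, -0.8, 0.2]

for theta in (-2.0, 0.0, 2.0):
    probs = nrm_probabilities(theta, slopes, intercepts)
    print(theta, [round(p, 3) for p in probs])
```

In this made-up item, Options B and D have similar positive slopes, so both become more likely as ability increases. That is the kind of pattern one might expect from an item with an ambiguous key, and it is estimated from the data rather than imposed by a key.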
Other IRT models (e.g., 1PL, 2PL, rating scale, partial credit, generalized partial credit models) can be seen as special cases of the NRM. For example, the 2PL model is the NRM applied to items that a known key has scored dichotomously into two response categories (0 and 1). Parameterizations of the partial credit model and the generalized partial credit model as constrained versions of the NRM are provided as follows.
Partial credit model and generalized partial credit model
The partial credit model (PCM; Masters, 1982) generalizes the Rasch (1960) dichotomous model (see Equation 1) to handle the case of polytomous data. The generalized partial credit model (GPCM; Muraki, 1992, 1997) generalizes the PCM to allow for differential item weights. The GPCM can also be seen as a generalization of the dichotomous 2PL model to handle polytomous data. The difference between the PCM and the GPCM is that the PCM requires all items to have the same item slope, whereas the GPCM allows different items to have different item slopes.
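Both the PCM and the GPCM require responses to be ordered from best to worst with respect to the latent ability, which could be accomplished through prior knowledge, expert ratings, in- or out-of-sample response popularity, or other means. With the ordered options of item $i$ indexed $k = 1, \ldots, m_i$ from worst to best, the GPCM can be written as the NRM with the category slopes constrained to be proportional to the category order:

$$P(X_i = k \mid \theta) = \frac{\exp[(k-1)a_i\theta + c_{ik}]}{\sum_{h=1}^{m_i} \exp[(h-1)a_i\theta + c_{ih}]}, \tag{4}$$

where $a_i$ is the slope parameter for item $i$ and $c_{ik}$ is the intercept for category $k$ (with $c_{i1} = 0$ for identification). The number of parameters for item $i$ under the GPCM is the number of response categories, $m_i$.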
The PCM can be expressed by further constraining all ai in Equation 4 to be equal across all items. (The rating scale model [Andrich, 1978] makes the additional assumption that adjacent threshold differences are constant within and across items; however, this model is typically used with Likert-scale type items, not multiple choice, as in this study.)
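The claim that the GPCM is a constrained NRM can be checked numerically. The sketch below, which uses hypothetical parameter values and function names, computes GPCM probabilities from the familiar step (threshold) parameterization and then from an NRM whose category slopes are fixed at 0, a, 2a, and so on, with intercepts built from the cumulative steps. The two sets of probabilities agree up to rounding error.

```python
import math

def softmax(logits):
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def nrm_probabilities(theta, slopes, intercepts):
    """NRM: one slope and one intercept per category."""
    return softmax([a * theta + c for a, c in zip(slopes, intercepts)])

def gpcm_probabilities(theta, slope, steps):
    """GPCM in step form: category k accumulates slope * (theta - b_v) for v <= k."""
    cumulative = [0.0]
    for b in steps:
        cumulative.append(cumulative[-1] + slope * (theta - b))
    return softmax(cumulative)

def gpcm_as_nrm(slope, steps):
    """Re-express GPCM parameters as constrained NRM slopes and intercepts."""
    slopes = [k * slope for k in range(len(steps) + 1)]
    intercepts = [0.0]
    for b in steps:
        intercepts.append(intercepts[-1] - slope * b)
    return slopes, intercepts

# Hypothetical four-category item: one slope plus three steps (four parameters).
slope, steps = 1.3, [-0.8, 0.1, 0.9]
nrm_slopes, nrm_intercepts = gpcm_as_nrm(slope, steps)
max_gap = max(
    abs(p - q)
    for theta in (-1.5, 0.0, 2.0)
    for p, q in zip(gpcm_probabilities(theta, slope, steps),
                    nrm_probabilities(theta, nrm_slopes, nrm_intercepts))
)
```

Setting `slope` to the same value for every item gives the PCM; freeing the category slopes from the 0, a, 2a pattern gives back the full NRM.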
Applications of the Nominal Response Model
In the literature, the NRM or its extensions are mainly used for two purposes, taking advantage of its two unique strengths. One is for modeling item responses that are nominal, or as Samejima (1996) put it: for “discovery of implicit orders behind nominal responses.” In Bock’s (1972) original proposal of the NRM, it was used to model all response alternatives, including omits, of multiple-choice items. Murray, Booth, and Molenaar (2016) studied the meaning of the response option “?” of the 16 Personality Factor Questionnaire, Version 5 (16PF5) by fitting the NRM to the U.S. and UK standardization samples of the 16PF5. By comparing the NRM category slope parameter estimates for the three response options (positive, ?, and negative), they found that for many items (e.g., items in the Warmth and Openness to Change scales for the U.S. sample), the ? option behaved as the lowest response option. Another example is to model response styles in self-report Likert-scale data using multidimensional extensions of the NRM (Bolt & Johnson, 2009; Johnson & Bolt, 2010). In this model, categorical item responses are modeled as a function of both latent traits that the instrument is designed to measure and response style factors. The NRM is a good candidate because the relationship between item score categories and the response style factor(s) might not be ordered. Further extensions of the multidimensional NRM have been made to model latent traits and response styles from multiple scales (Bolt & Newton, 2011) or longitudinally (Deng, McCarthy, Piper, Baker, & Bolt, 2018).
Another application of the NRM has been to empirically check whether items with ordered responses behave as expected or whether more constrained models for ordered responses (e.g., GPCM) are appropriate. For example, Preston, Reise, Cai, and Hays (2011) tested whether the within-item category discrimination is constant for items in the Emotional Distress item pools. By testing the parameter estimates from the NRM model, they found differential within-item category discrimination for 25 of the 86 items. Cole, Turner, and Gitchel (2018) studied reversed-worded matched item pairs by fitting the NRM and the GPCM to Likert-scale items from three scales (anxiety, depression, and perceived stress). They found the NRM fit significantly better than GPCM, and positively worded items appeared to have higher discrimination than the paired negatively worded items.
Objectives
The purpose of the empirical studies is to evaluate the efficacy of the nominal response model as a method for scoring situational judgment tests. SJTs are often used to assess hard to measure abilities, such as emotional intelligence or teamwork, using tests that do not have clear right-wrong answers or have several correct answers. The flexibility of the NRM in being able to identify a variety of relationships between proficiency and response patterns suggests that NRM may be ideally suited for scoring SJTs. To evaluate the efficacy of the NRM, we computed several alternative scores based on restricted versions of the NRM, as outlined previously, to ascertain the requirement for the full flexibility of the NRM versus the suitability of more parsimonious scoring methods. We also compared NRM and other IRT scoring methods with the more commonly used classical test theory–based methods. We compared all these methods on reliability and their correlations with a wide range of psychological factors.
The validity of this approach rests on an assumption that given two methods for scoring a test of situational judgments, a noncognitive measure, the superior method is the one that tends to yield higher correlations with other noncognitive measures, a convergent validity argument. This argument depends on the idea that there is a general factor measured by the SJT and that the general factor also underlies scores on many of the other psychological factors reflected in the other noncognitive measures. In this study, the general factor is a kind of broad social-emotional factor. The argument also relies on the idea that measurement error is generally unique and uncorrelated with other psychological constructs of interest. The superior scoring method is the one that reduces measurement error and tends to correlate more highly with other social-emotional measures.
Study 1
The data analyzed here were obtained from reports by Burrus and Roberts (2012) and Burrus, Roberts, Brenneman, and Lipnevich (2012).
Method
Participants
A sample of N = 2,081 students was administered an SJT along with other measures. Only data from the subsample (N = 2,048; 51% female; ages 9–18, 99% between 11 and 14 years old; M = 12.3, SD = 1.0) who responded to all 11 items on the SJT were included in the analysis.
Situational judgment test
The Situational Test of Emotional Management for Youths (STEM-Y) was designed to measure students’ ability to manage emotions, that is, to moderate negative emotions and enhance positive ones (MacCann, Fogarty, Zeidner, & Roberts, 2011; MacCann & Roberts, 2008; MacCann, Wang, Matthews, & Roberts, 2010). It consists of 11 four-choice SJT items. Examinees were asked to pick one option that best matched what they would do in the situation. The item provided in the introduction is an item from STEM-Y.
Scoring schemes
Eight scoring methods (number correct, weighted sum, proportion-consensus, 1PL, 2PL, PCM, GPCM, and NRM) were applied to the SJT data. For methods requiring item keys (number correct, weighted sum, 1PL, and 2PL), the keys were the most popular response in the sample. For methods requiring response option ordering (PCM and GPCM), ordering was based on ranked response category popularity. IRT analyses were conducted using both IRTPRO 2.1 (Cai, du Toit, & Thissen, 2011) and flexMIRT 3.5 (Cai, 2017). Item parameters were estimated using marginal maximum likelihood via the Bock-Aitkin expectation-maximization algorithm (Bock & Aitkin, 1981). Then, each examinee’s latent ability was estimated using the expected a posteriori (EAP) estimator (Bock & Mislevy, 1982) and treated as the model-based score. The two software packages yielded the same results for the models considered in this study (i.e., Akaike information criterion [AIC], Bayesian information criterion [BIC], item parameter estimates, reliability, and EAP estimates of latent abilities). Annotated syntax for fitting the NRM using flexMIRT is provided in Appendix A in the Supplemental Materials (available in the online version of the journal). Screenshots for fitting the NRM using IRTPRO are shown in Appendix B in the Supplemental Materials (available in the online version of the journal).
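To illustrate what EAP scoring involves once item parameters are in hand, the sketch below computes an EAP ability estimate for a dichotomous response pattern under the 2PL, using a standard normal prior evaluated on an evenly spaced grid. It is a simplified stand-in for what IRTPRO and flexMIRT do internally: the grid (61 points from −4 to 4), the item parameters, and the function name `eap_2pl` are all assumptions made for this example.

```python
import math

def eap_2pl(pattern, slopes, difficulties, n_points=61, lo=-4.0, hi=4.0):
    """Posterior mean of theta given a 0/1 response pattern and fixed 2PL parameters."""
    step = (hi - lo) / (n_points - 1)
    numerator = denominator = 0.0
    for q in range(n_points):
        theta = lo + q * step
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) density
        likelihood = 1.0
        for x, a, b in zip(pattern, slopes, difficulties):
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
            likelihood *= p if x == 1 else 1.0 - p
        numerator += theta * prior * likelihood
        denominator += prior * likelihood
    return numerator / denominator

# Hypothetical item parameters for a three-item test.
slopes = [0.8, 1.2, 1.6]
difficulties = [-0.5, 0.0, 0.7]
theta_all_keyed = eap_2pl([1, 1, 1], slopes, difficulties)
theta_none_keyed = eap_2pl([0, 0, 0], slopes, difficulties)
```

For polytomous models such as the NRM, the only change is that each item contributes the probability of the option actually chosen rather than the probability of a correct or incorrect response.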
Unidimensionality of the SJT
Whether the SJT can be considered unidimensional was evaluated with a scree plot and parallel analysis (Hayton, Allen, & Scarpello, 2004). Parallel analysis involves comparing the eigenvalues of the sample correlation matrix among item scores with the 95th percentile of eigenvalues derived from random data sets that parallel the real data in terms of sample size and number of variables. Because different scoring methods used different item scores (and thus different correlation matrices for eigenvalues), four ways of item scoring were used: dichotomous item scores using popularity keys (the building blocks for number correct, 1PL, and 2PL), consensus item scores, polytomous item scores by response popularity ordering (for PCM and GPCM), and polytomous item scores by NRM-slope ordering. For each scoring method, 100 random data sets were generated. For ordinal item scores, polychoric correlation was used, and for consensus item scores, the Pearson correlation matrix was used. Analyses were conducted in R (R Core Team, 2017).
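The logic of parallel analysis can be sketched in a few lines. The example below is an illustration in Python with NumPy rather than the R code used in the study, and it uses Pearson correlations throughout (the study used polychoric correlations for ordinal item scores). The simulated one-factor data, the seeds, and the function name are assumptions made for the example.

```python
import numpy as np

def parallel_analysis(data, n_sets=100, percentile=95, seed=0):
    """Compare observed eigenvalues with a percentile of random-data eigenvalues."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_sets, p))
    for s in range(n_sets):
        noise = rng.standard_normal((n, p))  # same sample size and number of variables
        random_eigs[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    retained = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        retained += 1
    return observed, threshold, retained

# Simulated data: 500 respondents, 6 variables driven by one common factor.
rng = np.random.default_rng(1)
factor = rng.standard_normal((500, 1))
data = 0.7 * factor + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal((500, 6))
observed, threshold, retained = parallel_analysis(data)
```

When only the first observed eigenvalue exceeds its random-data threshold, as in this simulated case, a single dimension is retained.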
Background and psychological variables
The STEM-Y situational judgment test was one of many tests administered to students as part of a larger study. We developed a total of 24 additional external variables from the data set. They included grades, cognitive test scores, personality and attitude measures, and other factors. These served as criterion variables to be correlated with the various SJT scores as a way to compare the SJT scoring methods.
Some of the variables (e.g., grades, cognitive test scores) were observed variables. The remaining variables (e.g., cooperation, leadership, teacher evaluations) were latent variables obtained from exploratory factor analyses (EFA) of five item sets of 10 to 95 items per set. Items were 4-point Likert-scale items. Analyses were conducted using Mplus version 6 (Muthén & Muthén, 1998-2010). All items with the same labels for the rating categories were treated as a set regardless of the questionnaire in which they might have originally been administered. (One set was split into two because the number of items, 185, resulted in analyses exceeding the Mplus memory capacity.) The goal of the EFA was to produce as many distinct but interpretable variables as possible to maximize the criterion space available for the correlational analysis.
Items, all of which had rating-scale response formats, were treated as ordered categorical variables. The number of factors for each of the item sets was determined by a parallel analysis (Hayton et al., 2004, with polychoric correlations) and factor interpretability. The weighted least squares mean and variance-adjusted estimator, a robust weighted least squares estimator (WLSMV; Muthén, du Toit, & Spisic, 1997), and oblique Geomin rotation were used for the analysis. An EFA for the determined number of factors was carried out using exploratory structural equation modeling (ESEM) to estimate the factor scores (Asparouhov & Muthén, 2009). The factor scores produced by the analysis just described served as the criterion variables for comparing SJT scoring methods for the purposes of this study. The scores we produced are similar to but not identical with the scores reported in Burrus and Roberts (2012).
Validation by splitting the data
Because scoring keys for SJTs are often derived from samples (e.g., the test-taking sample or a smaller sample of experts), one may question the degree to which findings are sample specific. We addressed this question by randomly splitting the current sample in half, with a randomly sampled n = 1,000 as Sample 1 and the remaining n = 1,048 as Sample 2. We scored Sample 2 using two sets of scoring keys: keys developed from Sample 1 and keys developed from Sample 2. We did that for the 2PL, GPCM, consensus, and NRM scoring methods. More specifically, to score Sample 2 with IRT methods using Sample 1 keys, we coded item scores for the whole sample using Sample 1 keys (dichotomously for 2PL and polytomously for GPCM), fit the IRT models to Sample 1, and then fixed the item parameters at the estimates obtained from Sample 1 to score Sample 2 using EAP.
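For the consensus method, the split-sample procedure can be sketched as follows. The data are randomly generated placeholders with the same dimensions as the Study 1 sample (2,048 respondents, 11 four-option items), and the seeds and function names are assumptions; the sketch shows only the mechanics of deriving the scoring information from one half and applying it to the other.

```python
import random
from collections import Counter

def option_proportions(responses):
    """Per-item option proportions, used here as the consensus scoring key."""
    n = len(responses)
    return [
        {opt: count / n for opt, count in Counter(row[i] for row in responses).items()}
        for i in range(len(responses[0]))
    ]

def consensus_score(row, proportions):
    return sum(proportions[i].get(choice, 0.0) for i, choice in enumerate(row))

def split_sample(n_total, n_first, seed=0):
    """Randomly partition respondent indices into two non-overlapping samples."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_first], indices[n_first:]

# Placeholder responses (random choices, not real data).
rng = random.Random(42)
responses = [[rng.choice("ABCD") for _ in range(11)] for _ in range(2048)]

first, second = split_sample(len(responses), 1000)
sample1 = [responses[i] for i in first]
sample2 = [responses[i] for i in second]

# Score Sample 2 twice: with Sample 1 proportions and with its own proportions.
cross_scores = [consensus_score(r, option_proportions(sample1)) for r in sample2[:50]]
own_scores = [consensus_score(r, option_proportions(sample2)) for r in sample2[:50]]
```

Comparing `cross_scores` with `own_scores` (e.g., by correlating them) indicates how sample specific the scoring information is. Only the first 50 respondents are scored here to keep the example fast.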
Results
Unidimensionality of SJT
Parallel analysis results for the SJT using the four ways of scoring (dichotomous, consensus, polytomous with popularity-ordering, and polytomous with NRM slope ordering) are shown in Figure 1. Solid dots represent eigenvalues from real data, and hollow dots are the 95th percentile of eigenvalues from 100 random data sets. Results showed that only the first eigenvalue was higher than the 95th percentile from random data sets. This indicates that unidimensionality of the SJT is a reasonable assumption.

Figure 1. Parallel analysis results (Study 1). Solid dots are eigenvalues from the sample, and hollow dots are the 95th percentile of eigenvalues from 100 random data sets.
Reliability
Descriptive statistics, reliability estimates, and correlations among scores generated from the eight scoring methods applied to the SJT responses are summarized in Table 2. To compare reliability of scores, we used Cronbach’s alpha for the classical methods and marginal reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984) for the item response theory methods. Table 2 shows that as is commonly found with SJTs, reliabilities tended to be relatively low. The NRM score had the highest reliability (.63), followed by the consensus score (.60). The reliabilities based on the number correct score and 1PL were in the .40s, substantially lower than those of the other scoring methods.
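For the classical scores, Cronbach’s alpha can be computed directly from the item-score matrix, as sketched below. The function names and the tiny data sets are hypothetical; marginal reliability for the IRT scores depends on the fitted model and is not shown.

```python
def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(item_scores):
    """Alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)."""
    k = len(item_scores[0])
    item_variances = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical item scores: 4 respondents x 3 items.
mixed = [[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
identical = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha_mixed = cronbach_alpha(mixed)
alpha_identical = cronbach_alpha(identical)
```

The same function applies to consensus item scores, because it requires only numeric item scores, not right-wrong coding.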
Table 2. Descriptive Statistics, Reliability, and Intercorrelations of Different SJT Scores (Study 1).
Correlations among SJT scores
The various methods yielded fairly highly correlated scores (see Table 2), as might be expected given that they are all based on the same data. Correlations among scores from the non-NRM scoring methods were all greater than .85. The correlations between the NRM scores and other scores were lower: The NRM scores correlated around r = .85 with consensus and GPCM scores, around r = .80 with PCM and 2PL scores, r = .78 with weighted sum score, and r = .68/.69 with number correct and 1PL scores; thus, the NRM score was the most distinct from the others.
Model fit
For the IRT models, two model fit indexes, the AIC and the BIC, are summarized in Table 3. For both, lower values indicate better fit. Of the dichotomous models, the 2PL fit the data better than the 1PL. Of the polytomous models, the NRM fit the data best. Because both the AIC and the BIC penalize the number of parameters, the better fit of the NRM is not merely a consequence of its larger number of parameters.
Model Fit of Item Response Theory Models for Study 1.
Note: Lower AIC and BIC values indicate better fit. However, AIC and BIC are useful only for within model type (dichotomous, polytomous) comparisons. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion; 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; PCM = partial credit model; GPCM = generalized partial credit model; NRM = nominal response model.
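The penalty each index applies can be made concrete with a short sketch. The formulas are the standard definitions of AIC and BIC. The log-likelihoods and parameter counts below are hypothetical placeholders (they are not the values in Table 3), and the counts assume 11 four-option items under one common parameterization of each model. Only the sample size of 2,048 is taken from the study.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: -2 log L + 2k."""
    return -2 * log_lik + 2 * n_params

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: -2 log L + k ln(n)."""
    return -2 * log_lik + n_params * math.log(n_obs)

N_OBS = 2048  # Study 1 sample size

# Hypothetical (log-likelihood, number of parameters) for each polytomous model.
models = {
    "PCM": (-28000.0, 33),
    "GPCM": (-27900.0, 44),
    "NRM": (-27700.0, 66),
}

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, N_OBS), 1))

best_by_bic = min(models, key=lambda m: bic(*models[m], N_OBS))
print("lowest BIC:", best_by_bic)
```

In this made-up example the model with the most parameters still has the lowest BIC because its gain in log-likelihood outweighs the larger penalty, which is the pattern of reasoning applied to the NRM above.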
Correlations with external (criterion) variables
Table 4 shows the correlations between five (of the eight) SJT scores and the 24 criterion factors. (We did not show all eight because some of the SJT score patterns were nearly identical.) The 1PL and number correct scores were identical to the second decimal place, weighted sum and 2PL scores were within .02, and PCM and GPCM scores were within .02.
Correlation Among Scores of the SJT and Criterion Variables in Study 1 (N = 2,048).
Note: Boldface indicates that the highest absolute correlation in a row is significantly higher than the second highest absolute correlation in that row according to Williams’s (1959) t test of the difference between two dependent correlations. Variables 1 to 5 are observed; Variables 6 to 24 are factors obtained from exploratory factor analyses. Items with the highest factor loading are in parentheses. SJT = situational judgment test; 2PL = two-parameter logistic model; GPCM = generalized partial credit model; NRM = nominal response model; ERB = Educational Records Bureau.
Criterion factors were sorted into three groups: social-emotional factors for which NRM scores were significantly more predictive compared to other scores (10 criterion variables) according to Williams’s (1959) t test of the difference between two dependent correlations, p < .05, see also Howell (2013, p. 287); social-emotional factors for which that was not true (9 criterion variables); and grades and standardized cognitive test scores (5 criterion variables). We noted that the NRM score seemed to have higher comparative correlations for variables with high correlations (r > .20).
To examine this systematically, we plotted the difference between the absolute correlations of two sets of scoring methods with external variables (Figure 2). A solid triangle indicates the correlations are significantly different, while a hollow circle indicates nonsignificance by Williams’s (1959) t test, and the numbering of criterion variables matches that in Table 4. Plots suggest that (a) as the absolute correlation with the external variables increases, the superiority of the consensus score over the number correct score emerges and increases (see Figure 2a), and (b) as the absolute correlation with the external variables increases, the superiority of the NRM score over the other scores emerges and increases (see Figures 2b-2d).

Comparison of (a) consensus score and number correct scores, (b) nominal response model (NRM) and consensus scores, (c) NRM and two-parameter logistic model scores, and (d) NRM and generalized partial credit model scores in correlations with external variables (Study 1). | r | represents the absolute correlation of the situational judgment test with an external variable. A solid triangle indicates the two absolute correlations are statistically significantly different, while a hollow circle indicates the correlations are not statistically different.
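The significance test behind the solid and hollow markers can be sketched as follows. This is a minimal illustration of Williams's t statistic for two dependent correlations that share one variable, in the form given in standard textbooks. The function names and the example correlations are hypothetical. The p value uses a normal approximation, which is an assumption made here to keep the sketch dependency-free and is reasonable only because the degrees of freedom (n - 3) are large.

```python
import math
from statistics import NormalDist

def williams_t(r12, r13, r23, n):
    """Williams's t for H0: rho12 == rho13, with df = n - 3.

    Variable 1 is the criterion; variables 2 and 3 are the two SJT scores.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)

def two_sided_p_normal_approx(t):
    """Two-sided p value, approximating the t distribution by a normal."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical example: a criterion correlates .35 with one score and .30
# with another, and the two scores correlate .85 with each other.
t_stat = williams_t(0.35, 0.30, 0.85, n=2048)
print("t:", t_stat, "p (approx.):", two_sided_p_normal_approx(t_stat))
```

Because the two scores are highly correlated with each other, even a small difference in their correlations with a criterion can be statistically significant in a sample of this size.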
We also calculated the correlation with external variables separately by gender. Within-gender SJT-score intercorrelation suggested small differences between groups but did not moderate the main conclusion—the superiority of the NRM score with
Validation by splitting the data
Sample 2 scores using Sample 1 and Sample 2 keys correlated at .95 for GPCM, .98 for 2PL, and .99 for consensus and NRM scoring methods. Reliability of Sample 1 and Sample 2 scores when estimated separately were within .02 difference from those reported in Table 2, which are based on the full sample. The NRM scores still had the highest reliability. Correlations of Sample 2 scores using Sample 1 keys with external variables confirmed the patterns reported in Table 4. The superiority of the NRM scores with
Study 2
The data analyzed here were reported in MacCann and Roberts (2013). However, we rescored the situational judgment test and recomputed factor scores for the criterion variables.
Method
Participants
A sample of N = 1,036 students from 24 community colleges (n = 462) and universities (n = 574) drawn from five regions across the United States (64% female; ages 15-85, 83% in the 18- to 25-year-old range; 53% White, 23% African American, 14% Hispanic, and 6% Asian) were administered the SJT along with many other measures. All students were included in the analysis.
Situational judgment test
The SJT in this study was designed to measure teamwork and collaboration. For each of 24 situations, four possible responses were presented, and examinees were asked to choose the answer that most closely matched how they would respond. A single expert key was available for each situation. An example item is: You’re one of the managers for a large volunteer agency. In a discussion about how to find new volunteers, you bring up what you think is a great new idea. But the other managers tell you that the idea is “off base” and not workable. How would you handle this situation? A. Drop your idea because the group is probably right. B. Point out several good reasons why your idea might work. C. Drop your idea for now, but tell it to your boss later. D. Tell the other managers that lots of people don’t recognize great ideas at first.
Scoring schemes
The same scoring methods described in the introductory section and used in Study 1 were applied to the SJT data for this study. Expert keys (which matched the most popular response for all 24 items) were used for the number correct, 1PL, weighted sum, and 2PL scores. All analysis procedures were the same as those used in Study 1.
Unidimensionality of the SJT
As in Study 1, the unidimensionality of the SJT was examined with scree plots and parallel analysis for four sets of item scores: dichotomous item scores by expert keys, consensus item scores, polytomous item scores by response popularity ordering, and polytomous item scores by NRM-slope ordering.
Background and psychological variables
There were 53 criterion factors from the data set that served to compare the various SJT scoring methods. They included variables similar to those in Study 1, such as grades, cognitive test scores, and personality and attitude measures, and also included additional factors measuring test anxiety, time management, mathematics attitude, attribution patterns, career interest, sleep and diet habits and attitudes, and other factors. The methodology for computing factor scores from these variables was the same as that used in Study 1.
Results
Unidimensionality of SJT
Parallel analysis results for the SJT using the four sets of item scores are shown in Figure 3. The results confirmed a salient first factor but also showed that the second eigenvalues were slightly greater than the 95th percentile of those from 100 random data sets. We subsequently fit EFAs with one and two factors to the data. Results suggested that one-factor models fit the data well. Across the different sets of item scores, the RMSEA for the one-factor models ranged from .025 to .034, indicating satisfactory fit. The two-factor models lacked interpretability. Thus, we treated the SJT as sufficiently unidimensional for unidimensional IRT analysis.

Parallel analysis results (Study 2). Solid dots are eigenvalues from the sample, and hollow dots are the 95th percentile of eigenvalues from 100 random data sets.
Reliability
Descriptive statistics, reliability estimates, and correlations among scores generated from the eight scoring methods applied to the SJT responses are summarized in Table 5. Reliabilities of the SJT scores from the different scoring methods were all in the .70s, higher than those in Study 1. The SJT in Study 2 was more than twice as long as the one in Study 1 (24 vs. 11 items), but this length difference did not entirely account for the difference in reliability; that is, the Spearman-Brown length-adjusted reliabilities from Study 1 did not match the observed Study 2 reliabilities. The consensus score had the highest reliability (.79) and the 1PL score the lowest (.72), with the NRM score in the middle (.74).
Descriptive Statistics, Reliability, and Intercorrelations of Different SJT Scores (Study 2).
Note: rxx’ = reliability. Reliability for number correct and consensus scores is Cronbach’s alpha; for all other scores, it is marginal reliability. SJT = situational judgment test; 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; PCM = partial credit model; GPCM = generalized partial credit model; NRM = nominal response model.
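The length adjustment mentioned above can be reproduced with the Spearman-Brown prophecy formula. The sketch below projects the two Study 1 reliabilities reported earlier (.63 for NRM, .60 for consensus) from 11 to 24 items. The function name is hypothetical, and the projection assumes the added items would be parallel to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

length_factor = 24 / 11  # Study 2 length relative to Study 1

for label, study1_reliability in [("NRM", 0.63), ("consensus", 0.60)]:
    projected = spearman_brown(study1_reliability, length_factor)
    print(label, round(projected, 2))
```

Under these assumptions the projections are about .79 for the NRM score and .77 for the consensus score, compared with observed Study 2 values of .74 and .79, so test length alone does not reproduce the Study 2 reliabilities.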
Correlation among scores from different scoring methods
Correlations among scores from different methods were all higher than .95, indicating that, unlike in Study 1, there were no substantial differences between scores.
Model fit
Table 6 shows that 2PL was the better fitting dichotomous model; for the polytomous scoring methods, the NRM best fit the data.
Model Fit of Item Response Theory Models for Study 2.
Note: Lower AIC and BIC values indicate better fit. However, AIC and BIC are useful only for within model type (dichotomous, polytomous) comparisons. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion; 1PL = one-parameter logistic model; 2PL = two-parameter logistic model; PCM = partial credit model; GPCM = generalized partial credit model; NRM = nominal response model.
Correlation with external (criterion) variables
We calculated the correlations between the eight SJT scores and the 53 criterion variables. There were only small differences in correlations as a function of scoring method. Results are provided in the Supplemental Materials (Appendix C, available in the online version of the journal). NRM scores might have been slightly better than other scores at the highest correlation levels, but these differences were very small. This is in contrast to Study 1, where clear differences could be seen.
Within 2- and 4-year schools, SJT-score intercorrelations showed small differences between groups. But the main conclusion we drew from the whole sample—the interchangeability of scores—was true for the within-group analyses as well.
Conclusions and Discussion
There is a growing literature on the use of SJTs for purposes such as selection and program evaluation in education and the workforce. The purpose of this study was to evaluate classical test theory and item response theory–based strategies for scoring situational judgment tests, based on data from two large-scale studies with different populations using two different SJTs. We suggested that because SJTs often are used to assess judgment in cases where the correct answer is not clear or there are multiple correct answers, the NRM might be an ideal scoring model, being more flexible and capable of identifying varying proficiency-response relationships compared to classical scoring methods and other IRT methods. The NRM is designed for polytomous items whose order of response options with respect to the latent ability is unknown (i.e., nominal) and is therefore a good candidate for scoring SJTs because the NRM does not require a key or a prior ordering of options.
We applied various scoring methods to two different SJTs from two samples. Both samples also had available scores from extensive achievement and psychological test batteries, from which we obtained various scores to serve as criteria against which to compare the various SJT scores.
In Study 1, the NRM score was shown to be superior to other scores in that it had higher reliability and its correlations with external variables were higher, particularly for variables for which the correlations were high. Of the classical test theory–based scores, consensus scoring had a similar pattern to the one shown by the NRM score—its superiority over number correct scoring increased as correlations increased. However, the NRM score was superior to the consensus score. In Study 2, the story was different. All scores performed approximately equally.
This suggests a hypothesis: The value of the NRM might best be realized in cases where the key is ambiguous or there are multiple acceptable answers. In cases where the key is clear (e.g., in typical standardized tests), NRM scoring, or for that matter consensus scoring, adds no value over number correct scoring.
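The contrast between keyed and consensus scoring for an ambiguous item can be illustrated with a short sketch. It uses the option popularities reported for Study 1 Item 2 (A = 40%, B = 28%, D = 22%, C = 10%) and treats the most popular option as the key. It assumes the proportion form of consensus scoring, in which an option earns credit equal to the proportion of the sample choosing it. The function names are hypothetical.

```python
def number_correct_item_score(choice, key):
    """Dichotomous scoring: full credit for the keyed option, none otherwise."""
    return 1.0 if choice == key else 0.0

def consensus_item_score(choice, proportions):
    """Proportion consensus scoring: credit equals the option's popularity."""
    return proportions[choice]

# Option popularities for Study 1 Item 2; the key is the most popular option.
item2_proportions = {"A": 0.40, "B": 0.28, "C": 0.10, "D": 0.22}
item2_key = max(item2_proportions, key=item2_proportions.get)

for option in sorted(item2_proportions):
    print(
        option,
        number_correct_item_score(option, item2_key),
        consensus_item_score(option, item2_proportions),
    )
```

Keyed scoring treats B, C, and D as equally wrong, whereas consensus scoring gives B and D more credit than C. When one option dominates, as with a clear key, the two schemes rank examinees almost identically.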
Defining Ambiguous or Multiple Keys
What exactly do we mean by an ambiguous key? The option response curves based on parameter estimates from the IRT models provide information that can be used to identify items with ambiguous keys.
Consider Figure 4, which contains the response curves based on NRM, 2PL, and GPCM for Items 2 and 7 of the SJT used in Study 1. For Item 2, the NRM response category curves suggest that as ability increases, the probability of choosing A, B, and D all increase while the probability of choosing C decreases. Estimates of the category slopes based on NRM for A, B, and D were not significantly different (the estimates were 0.00, 0.00, and 0.02, respectively, whereas the estimate for C was –1.86). This suggests the key for this item is ambiguous, with A, B, and D all being correct in a sense and differing minimally from one another, while C is clearly a wrong option. Fitting a 2PL model by treating the most popular response category (A, 40%) as the key and all other options as equally wrong would not fairly capture the relationships between answer choice and latent ability. This is clearly seen by comparing the response curve based on the 2PL model (the solid line in Figure 4, Panel b) with the response curve for choosing A based on the NRM model (the solid line in Figure 4, Panel a). By examining the content of Item 2, which is the example item provided at the beginning of this article, we can see that indeed, A, B, and D all demonstrate good emotional management to a certain degree while option C does not.

Category response curves for Item 2 (a to c) and Item 7 (d to f) in Study 1 based on the nominal response model, two-parameter, and generalized partial credit model.
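How NRM slopes translate into category response curves can be sketched as follows. The slopes are the Item 2 estimates quoted in the text (A = 0.00, B = 0.00, D = 0.02, C = -1.86). The intercepts were not reported, so the values below are hypothetical: they are set to the log of each option's popularity purely so that the curves have plausible heights. The function name is also hypothetical.

```python
import math

def nrm_probs(theta, slopes, intercepts):
    """NRM category probabilities: softmax of slope * theta + intercept."""
    logits = {k: slopes[k] * theta + intercepts[k] for k in slopes}
    largest = max(logits.values())  # subtract the max for numerical stability
    expo = {k: math.exp(v - largest) for k, v in logits.items()}
    total = sum(expo.values())
    return {k: e / total for k, e in expo.items()}

# Slopes reported for Study 1 Item 2.
slopes = {"A": 0.00, "B": 0.00, "C": -1.86, "D": 0.02}

# Hypothetical intercepts (assumption): log of the reported option popularity.
popularity = {"A": 0.40, "B": 0.28, "C": 0.10, "D": 0.22}
intercepts = {k: math.log(p) for k, p in popularity.items()}

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    probs = nrm_probs(theta, slopes, intercepts)
    print(theta, {k: round(p, 3) for k, p in probs.items()})
```

With nearly equal slopes for A, B, and D, the ratio of their probabilities barely changes across ability, so the model does not favor any one of them as the key. Only C, with its strongly negative slope, is separated from the rest.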
Under GPCM, the options were ordered by their popularity (for Item 2, A = 40%, B = 28%, D = 22%, and C = 10%) and the discriminations among the adjacent options were constrained to be the same. These constraints lead Item 2 response curves to be of different shapes (Figure 4, Panel c) than those from the NRM (Figure 4, Panel a). Under the GPCM, as ability increases, the probability of choosing the most popular option, A, increases with a steeper slope than under the NRM; the probability of choosing the second most popular option, B, increases for ability lower than 0 but decreases for ability higher than 0; and the probability of choosing the third most popular option, D, monotonically decreases (under the NRM, the probabilities of choosing B and D monotonically increase). The difference between the response curves based on the GPCM and the NRM and the better model fit of NRM (see Table 3) suggest that the equal discrimination constraint in GPCM was not adequate. By using parameter estimates from NRM, one could also test the adequacy of fitting a GPCM using the methods described in Preston et al. (2011).
For Item 7, the response curves based on NRM (Figure 4, Panel d) show that as the latent ability increases, the only option with an increasing probability of being chosen is C, indicating that C is the key. Because the key is rather clear for this item, the 2PL response curves (Figure 4, Panel e) are not very different from those based on NRM. The GPCM response curves (Figure 4, Panel f) are also similar to those based on NRM.
Now consider Study 2. Almost all items in the SJT administered in that study have a clear key; thus, the NRM and the other models provide similar results. The response curves based on NRM, 2PL, and GPCM for Item 1 are provided in Figure 5. Note that these behave similarly to those of Item 7 in Study 1 (shown in Figure 4, Panels d, e, and f).

Category response curves for Item 1, a typical item in Study 2, based on (a) nominal response model, (b) two-parameter logistic model, and (c) generalized partial credit model.
The response curves for all items in both studies based on NRM are provided in the Supplemental Materials (Appendix D, available in the online version of the journal). From the NRM plot (Figure D1, available in the online version of the journal), Study 1 Items 2, 5, 8, 10, and 11 (5 out of 11 items) all show ambiguous keys in the sense that two or more responses are attractive to high ability (i.e., θ > 1) examinees. For Study 2, from the NRM plots (Figures D2 and D3, available in the online version of the journal), only Items 12 and 23 (2 out of 24 items) might have ambiguous keys.
The hypothesis that the value of NRM scoring emerges as keys become ambiguous could be evaluated by examining other tests with ambiguous keys. Many SJTs have ambiguous keys, which explains the popularity of consensus scoring. Tests other than SJTs also have ambiguous keys. Some performance tests of emotional intelligence seem to have this character, such as the Reading the Mind in the Eyes test (Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001) or the Mayer-Salovey-Caruso Emotional Intelligence Test (Mayer, Salovey, & Caruso, 2004; see also MacCann et al., 2004), suggesting a follow-up test of the value of NRM scoring.
There are some cautions regarding the use of the NRM for scoring situational judgment tests. First, it is important to note that complex models such as NRM, or IRT models more generally, involve the estimation of more parameters compared to classical methods and thus entail higher sample size requirements. Rules of thumb for minimum sample sizes range from 200 for the 2PL (Drasgow, 1989), to 250 for the partial credit model (de Ayala, 2009), to 500 for the graded response and generalized partial credit models (de Ayala, 2009), and to 600 for the nominal response model (de Ayala, 2009).
Second, the NRM is a data-driven method, and as such, it is appropriate to question the degree to which the derived findings generalize to other samples and populations. Here the NRM parameter estimates can indicate best option choices and the rank ordering of option choices. However, it is an open question whether those same best options and rank orderings would hold up with another sample and, more importantly, with a different population (note that this problem also exists with the classical counterparts of NRM scoring, such as the consensus scoring approach). We are currently investigating this issue to determine the degree of consistency in determining rank ordering of option choices across subgroups (e.g., male-female, racial/ethnic subgroups) and over time (across grades).
Finally, using NRM for scoring situational judgment tests or potentially other tests in which the true answer is not known can be seen as a method for exploiting the wisdom of the crowd (Surowiecki, 2004). As such, it can be placed alongside other crowd wisdom methods such as the Bayesian truth serum (BTS; Prelec, 2004), which relies on comparisons of one’s preferred choices with one’s perceptions of others’ preferred choices, and small crowd selection methods (Budescu & Chen, 2014; Olsson & Loveday, 2015), which rely on giving more weight to crowd members with good track records for making good choices. The advantage of the NRM approach over BTS is that there is no requirement to collect additional data on perceptions of others’ choices. The advantage of NRM over small crowd methods is that NRM does not require identifying experts or identifying expertise based on a track record of accurate predictions. Instead, the NRM in the application discussed here simply exploits the phenomenon that the most popular answer in a multiple-choice test is often, even though not always, the best or most appropriate one, and so differential weighting of respondents according to their tendency to choose popular answers is a useful strategy both for determining option appropriateness and crediting option selection in scoring.
Supplemental Material
Supplemental Material, Appendix_A,_B,_C_and_D for Nominal Response Model Is Useful for Scoring Multiple-Choice Situational Judgment Tests by Jiyun Zu, and Patrick C. Kyllonen in Organizational Research Methods
Footnotes
Acknowledgments
We thank Richard Roberts and Jeremy Burrus for making the data sets available to us, which were obtained from schools participating in the Elementary Schools Research Collaborative (ESRC) and Benchmark Research. We also thank Jonas Bertling, Isaac Bejar, Shelby Haberman, and Jonathan Weeks for their comments on earlier versions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by ETS research funding from the Workforce Readiness and Next Generation Higher Education R&D initiatives.
Supplemental Material
Supplemental material for this article is available online.
