Abstract
Extreme response set, the tendency to prefer the lowest or highest response option when confronted with a Likert-type response scale, can lead to misfit of item response models such as the generalized partial credit model. Recently, a series of intrinsically multidimensional item response models have been hypothesized, wherein tendency toward extreme response set is simultaneously estimated alongside one or more psychological constructs of interest. The multidimensional nominal response model (MNRM) is a divide-by-total model that allows person parameters for response sets, including extreme response set. The proportional thresholds model (PTM) is a difference model with response set parameters. The present study introduces a two-decision model (TDM) as an alternative to the MNRM and PTM and compares all three on data from assessments used in employee selection.
Introduction
Since its origin in seminal work by Thurstone (1928) and Likert (1932), formal measurement of attitudes and personality has most often depended on the idea that the items measure a single construct dimension identified by the item stem. The items are often questions that provide a choice of graded responses on what is usually referred to as a Likert-type scale. Contemporary item analysis and scoring for Likert-type data using item response theory (IRT) have tended to depend on the assumption of an ordered placement of response categories along such a dimension.
However, it has also become well established that item response categories may elicit different interpretations depending on other individual difference variables, one of which is reflected in a tendency to endorse the lowest and highest categories; Cronbach (1946) called this “extreme response set.” One way to conceptualize how respondents with a strong tendency toward extreme responses differ from those who prefer milder responses is to posit an item response model with a latent dimension representing that tendency, which may be unrelated to the item stem.
The ultimate goal of this article is to present a new IRT model, the two-decision model (TDM), for Likert-type responses in contexts in which tendency toward extreme response is an important dimension of individual differences. Before doing that, however, we review the substantial literature that has accrued over the past six decades, which provides empirical evidence for variation in extreme response set. Then we describe several other IRT models that have been proposed for the phenomenon. Finally, we describe the TDM and two studies of its usefulness with empirical data.
Correlates of Extreme Response Measures
Since its identification by Cronbach (1946), extreme response set (or style), the tendency to prefer the lowest or highest responses when confronted with a Likert-type item, has been examined in several ways. Questions of reliability, stability, and content independence have been investigated, while other studies have tested relationships with demographic and cultural variables, psychological traits, states, and test conditions. Although investigation of the correlates of extreme response set is beyond the scope of this article, a brief review of known correlates follows.
Preferences for extreme and midpoint responses have been found to differ across nations, with some disagreement about the direction and magnitude of the differences (Buckley, 2009; Chen, Lee, & Stevenson, 1995; Clarke, 2000, 2001; Meisenberg & Williams, 2008; Pope, 1991; Van Herk, Poortinga, & Verhallen, 2004). Two large cross-national studies linked differences in extreme response set to Hofstede’s (2001) descriptive dimensions for nations (De Jong, Steenkamp, Fox, & Baumgartner, 2008; Johnson, Kulesa, Cho, & Shavitt, 2005), following Smith’s (2004) study on culture and acquiescence. In support of the culture hypothesis, Chen, Lee, and Stevenson (1995) linked extreme and midpoint responding to individualist and collectivist orientations within each of four nations.
A second line of research documents the prevalence of extreme response set in various demographic groups or subnational cultural contexts. In the United States and Belgium, differences in extreme response set have been linked to region, ethnicity, race, and language (Bachman & O'Malley, 1984; Bachman, O'Malley, & Freedman-Doan, 2010; Berg & Collier, 1953; Clarke, 2000; Hui & Triandis, 1989; Marin, Gamba, & Marin, 1992; Moors, 2003, 2004). Age is positively correlated with extreme response set in several nations (Greenleaf, 1992a; Meisenberg & Williams, 2008; Van Rosmalen, Van Herk, & Groenen, 2010; Weijters, Geuens, & Schillewaert, 2010b), but children also show extreme response set (De Jong et al., 2008; Hamilton, 1968). Some studies find that women choose extreme responses more than men (Berg & Collier, 1953; Hamilton, 1968; Weijters et al., 2010b), but a few find the opposite (Bachman et al., 2010; Meisenberg & Williams, 2008) and some find no difference (Bachman & O'Malley, 1984; Clarke, 2000; Crandall, 1982).
Educational level is generally negatively associated with extreme response set across nations (Bolt & Johnson, 2009; Meisenberg & Williams, 2008; Paulhus, 1991; Van Rosmalen et al., 2010; Weijters et al., 2010b). However, Bachman and O’Malley (1984), in a survey of U.S. high school seniors, did not find extreme response set to be associated with either future educational plans or high school grades. Socioeconomic status may be negatively related to extreme response set (Greenleaf, 1992a) or not related (Bachman & O'Malley, 1984).
Psychological traits, including both anxiety and intelligence, form a third set of correlates of extreme responding. High anxiety subjects have been shown to produce more extreme responses on a variety of instruments (Berg & Collier, 1953; Crandall, 1965; Hamilton, 1968; Lewis & Taylor, 1955), but Grimm and Church (1999) were unable to replicate the finding. Crandall (1982) linked extreme responding to low “social interest,” a composite of variables such as altruism and empathy. Hamilton (1968) found in a review that intelligence correlated negatively with extreme responding; Meisenberg and Williams (2008) inferred agreement from national level statistics, but Bachman and O’Malley (1984) inferred no effect from the lack of a relationship with high school grades.
Measurement of Extreme Response Set
All of these findings depend on a notion of extreme response set as a trait with a great deal of permanence. Two studies showed test stability at intervals from 1 week to 1 year (Berg, 1953; Weijters et al., 2010b). Extreme response set measures are typically highly internally consistent (Berg, 1953; Hamilton, 1968; Messick, 1968), although Meisenberg and Williams (2008) describe an exception. However, certain test conditions affect the expression of extreme response set, including the number of response options available (Clarke, 2001; Messick, 1991), ambiguity, emotional arousal and speededness (Paulhus, 1991).
Although extreme response set is expected to be content-independent (Hamilton, 1968; Messick, 1968), only moderate generalizability has been observed for extreme response set measures derived from different content clusters even within an administration of a single questionnaire (Berg, 1953; Crandall, 1965; Grimm & Church, 1999; Rundquist, 1950; Weijters, Geuens, & Schillewaert, 2010a). Hui and Triandis (1985) suggested fatigue effects, and Parducci (1968) described contextual order effects for a related class of items. Messick (1991) suggested that introspective ambiguity, a person-by-content interaction, might produce an effect like item ambiguity in which some stems elicit greater extreme response set than others. Some authors have suggested estimating response set across items of diverse content in order to better distinguish content effects from response sets (Bolt & Newton, 2011; De Jong et al., 2008).
A particular test condition that may elicit extreme responding is knowledge of the scoring mechanism. Where a score is calculated by summation, extreme response set renders high scores higher and low scores lower (Baumgartner & Steenkamp, 2001; Bolt & Johnson, 2009; Bolt & Newton, 2011). In a high-stakes test where a summed score is used and item content is even slightly transparent, extreme responding is adaptive; such a cheating strategy is easily coached (Landers, Sackett, & Tuzinski, 2011; Thissen-Roe, Scarborough, Chambless, & Hunt, 2006) and well known (van Hooft & Born, 2012; see also O’Connell, 2009, for a spontaneously given example).
If ignored, extreme responding can enable score inflation strategies, reorder individuals at the high end of a scale (Landers et al., 2011), introduce spurious correlations between constructs (Paulhus, 1991), distort the factor structure of a multidimensional assessment (Cheung & Rensvold, 2000), and introduce a non-content-related form of differential item functioning (Bolt & Johnson, 2009). Several strategies have been proposed to ameliorate these effects. Ipsative rescaling or normalizing within subject is a common approach, matching a conceptualization of extreme response set as within-subject standard deviation, but it can eliminate construct differences along with the response set (Cheung & Rensvold, 2000). A second strategy involves the creation of an extreme response scale; Greenleaf (1992b) recommends using items of dissimilar content in order to avoid the conflation of a variable of interest with extreme response set. A third strategy is to model response sets as additional psychological variables in simultaneous operation with traits, states, or attitudes of interest.
Models for Extreme Responding
Cronbach (1946) suggested modeling response sets, including extreme response set, as distinct individual difference variables operating within items, a practice that IRT makes feasible. A series of four or five decisions are made to construct models that involve latent individual-difference variables for both the construct that is the primary measurement target, and for tendency toward extreme responding. Integrated models for construct measurement and extreme responding have been developed as extensions of existing models for construct measurement, so the first set of decisions involves (a) the choice of a parametric model for construct measurement and (b) the number of latent variables (one or more) for construct measurement. Then a decision is required about (c) the parametric form describing the effect of the extreme-responding individual differences variable on the item response. After those first three fundamental decisions yield a general parametric form for an item response model, decisions must be made about (d) the extent to which item and response-category parameters are assumed (or constrained) to be equal across items. Finally, if the construct model is multidimensional, some choice of constraints is required to (e) provide identification constraints for the fundamentally indeterminate multiple common factor model.
Böckenholt (2001) dealt with middle preference in trinary items by adding scale and shift parameters for the thresholds of an ordinal polytomous item response model; Lenk, Wedel, and Böckenholt (2006) considered the possibility that the additional parameters need not be item-specific. Johnson (2003) extended the model to items with more than three response categories, and introduced four sets of possible constraints, one being the proportional thresholds model (PTM). For the fundamental decisions as listed above, these models use (a) ordinal polytomous (or graded), (b) unidimensional construct item response models that yield categorical item responses depending on whether or not a latent response variable exceeds some threshold. The (c) model for extreme responding shifts the thresholds to produce more or fewer extreme responses.
Javaras and Ripley (2007) modeled within- and between-culture differences among British and American survey respondents by adding scale and shift parameters to a polytomous unfolding model, with group-specific priors. For the fundamental decisions as listed above, these models use (a) an ideal point, (b) unidimensional construct item response model that yields categorical item responses depending on whether or not a latent response variable exceeds some threshold. The (c) model for extreme responding shifts the thresholds to produce more or fewer extreme responses.
Bolt and Johnson (2009) introduced a multidimensional nominal response model (MNRM), capable of modeling multiple traits with differential influence on each of several response options. This MNRM is flexible: Under different constraints, it has been demonstrated to model response sets in a survey (Bolt & Newton, 2011; Johnson & Bolt, 2010) or multiple sequential skill components in educational items (Bolt & Newton, 2010). In these capabilities, Bolt and Johnson’s MNRM is distinct from the MNRM of Thissen, Cai, and Bock (2010), which requires the item response categories to be related to all latent variables in the same order. For the fundamental decisions as listed above, the MNRMs use (a) a nominal, (b) unidimensional construct item response model that yields categorical item responses in a way that depends on the extremity of an underlying latent response process (see Bock’s appendix in Thissen, Cai, & Bock, 2010). The (c) model makes use of another latent variable that is related to the response categories in order of their extremity, instead of their dominance order. Given reordering of the response categories for the construct-related and extreme-responding dimensions of the model, this is a compensatory model in that either a very high level on the construct or very high extreme responding yields an extremely high response.
In this study, we introduce a sequential decisions model for extreme response set. For the fundamental decisions as listed above, this TDM uses (a) an ordinal polytomous (or graded); (b) unidimensional or multidimensional construct item response model that yields a (latent) decision about the direction of a response depending on whether or not a latent response variable exceeds some threshold; and (c) a second or graded item response model that yields (latent) extremity. The two components combine multiplicatively in a non-compensatory fashion to produce the observed response.
This research compares the performance of unidimensional and multidimensional TDM, PTM and MNRM models on data from two assessments. Versions of the TDM, PTM, and MNRM models that are unidimensional for construct measurement are referred to as unidimensional and those that are multidimensional for construct measurement are referred to as multidimensional even though any integrated model for construct measurement and tendency to extreme responding has at least two latent variables—(at least) one for construct measurement plus one for extreme responding. For the multidimensional models in this research, bifactor (or hierarchical) models are used to provide identification constraints for the fundamentally indeterminate multiple common factor model, decision (e) above. The following sections provide notational definitions for the three models considered in Studies 1 and 2.
A Multidimensional Nominal Categories Model
Bolt and Johnson (2009) defined their MNRM as:
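In the notation used in the following paragraphs (item $j$, category $k$, dimension $d$), a divide-by-total model of this kind is commonly written as below. The response variable $U_j$, the number of dimensions $D$, and the category count $K$ are symbols assumed here for illustration.

```latex
P(U_j = k \mid \boldsymbol{\theta}) =
  \frac{\exp\left(\sum_{d=1}^{D} a_{jkd}\,\theta_d + c_{jk}\right)}
       {\sum_{h=0}^{K-1} \exp\left(\sum_{d=1}^{D} a_{jhd}\,\theta_d + c_{jh}\right)}
```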
Thissen et al. (2010) gave nearly the same definition, but decomposed $a_{jkd}$ into $a_{jd}s_{jk}$. This decomposition enforces a single vector of scoring function parameters, $s_{jk}$, within each item; the scoring function values are $0, 1, \ldots, K - 1$ for $K$ categories if the responses are graded, and deviate from that pattern otherwise. This parameterization precludes modeling items with categories in different orders along different axes, a condition necessary to fit a unipolar item simultaneously with extreme response set. However, this decomposition remains useful if modified to become $a_{jd}s_{jkd}$, so that the scoring function values can differ across dimensions. For example, if the scoring function values are 1 for extreme responses and 0 for middle responses, the corresponding dimension becomes tendency toward extreme responding.
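The effect of dimension-specific scoring values can be illustrated with a minimal Python sketch. It assumes a softmax (divide-by-total) form with per-dimension slopes and per-category intercepts; the function name, the parameter values, and the four-category item are hypothetical and are not taken from the studies reported here.

```python
import math

def mnrm_probs(theta, slopes, scores, intercepts):
    """Category probabilities for one item under a divide-by-total model.

    theta:      latent values, one per dimension
    slopes:     one slope per dimension
    scores:     scores[d][k] is the scoring value of category k on dimension d
    intercepts: one intercept per category
    """
    n_cat = len(intercepts)
    logits = [
        sum(a * s[k] * t for a, s, t in zip(slopes, scores, theta)) + intercepts[k]
        for k in range(n_cat)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category item: dimension 0 is the construct (graded scores),
# dimension 1 is extreme responding (1 for the end categories, 0 for the middle).
SCORES = [[0, 1, 2, 3], [1, 0, 0, 1]]
SLOPES = [1.2, 1.0]
INTERCEPTS = [0.0, 0.0, 0.0, 0.0]

mild = mnrm_probs([0.0, -2.0], SLOPES, SCORES, INTERCEPTS)
extreme = mnrm_probs([0.0, 2.0], SLOPES, SCORES, INTERCEPTS)
```

With the construct held fixed, raising the second latent value moves probability from the two middle categories to the two end categories, which is the sense in which that dimension represents tendency toward extreme responding.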
Johnson and Bolt (2010) worked with the special case of one construct dimension plus one extreme response set dimension. They decomposed $a_{jk}$ for the construct dimension into $a_j s_k$, enforcing a cross-item scoring function equivalence. For the extreme response set dimension, $a_j$ was constrained to $a$, equal across items, and the scoring function values for the extreme response dimension with index $x$, $s_{xk}$, were defined to be 1 in the most extreme categories and 0 elsewhere, reducing the problem of estimation substantially. They also decomposed $c_{jk}$ into $s_k b_{jk}$, and further, following Anderson’s (1984) regression model, into $s_k b_j + d_k$, yielding item- and category-specific effects. This decomposition and Johnson and Bolt’s preferred identifiability constraints form the stereotype factor-analytic multinomial logit model (FAMLM).
An item model is a mathematical statement of a cognitive theory of item response. The MNRM and constraints used for Likert-type items imply trait-like response sets operating in a compensatory manner with item content; that is, at any level of extreme response set, there is some level of the underlying construct at which any individual will choose the extreme response. Following most of the earlier work, Johnson and Bolt (2010) and Bolt and Newton (2011) constrained the parameters dealing with extreme response set to be equal across items, implying that extreme response set is a trait measured by the response categories to the extent that stem ambiguity allows, and not otherwise elicited by content.
A Difference Model With a Scale Parameter
Like Bolt and Johnson (2009), Böckenholt (2001) added cross-item response set parameters to a flexible polytomous response model. However, he extended Samejima’s (1969) graded response model, with ordered response categories, rather than the nominal response model. Böckenholt considered trinary items, and allowed both thresholds to vary by respondent.
Johnson (2003) subsequently extended Böckenholt’s model to more than three item response categories, permitting a symmetric vector of thresholds parameterized as differences from a central location. For a Likert item, the first threshold parameter defines the upper and lower bounds of the middle category, if present, relative to a location halfway between them. Subsequent thresholds separate the remaining categories in order of extremity. For a five- or six-category item, there are two individual threshold parameters; for a four-category item there is only one, and for a seven-category item there are three.
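The parameter counts in this paragraph follow from the symmetry of the threshold vector. The sketch below is one reading of that description: it assumes that, for an even number of categories, the central location itself serves as the boundary between the two innermost categories. The function and argument names are hypothetical.

```python
def symmetric_thresholds(center, diffs, n_categories):
    """Build the K - 1 ordered thresholds from positive differences around a center."""
    expected = (n_categories - 1) // 2
    if len(diffs) != expected:
        raise ValueError(f"{n_categories} categories need {expected} difference parameters")
    if any(d <= 0 for d in diffs) or list(diffs) != sorted(diffs):
        raise ValueError("differences must be positive and in increasing order")
    lower = [center - d for d in reversed(diffs)]
    upper = [center + d for d in diffs]
    # ASSUMPTION: with no middle category (even K), the central location is
    # itself the threshold between the two innermost categories.
    middle = [center] if n_categories % 2 == 0 else []
    return lower + middle + upper

four = symmetric_thresholds(0.0, [1.0], 4)        # one difference parameter
five = symmetric_thresholds(0.0, [0.5, 1.5], 5)   # two difference parameters
```

Under this reading, a four-category item needs one difference parameter, five- and six-category items need two, and a seven-category item needs three, matching the counts given above.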
Johnson identified four related models with increasingly flexible constraints on the threshold parameters. The PTM (Rossi, Gilula, & Allenby, 2001) requires that the threshold parameters be proportional across individuals. This amounts to a decomposition of the item response category thresholds into a vector of values applicable across all items and individuals, associated only with the responses, and a single individual-differences scale parameter $\alpha_i$, which expands or contracts the scale. Rossi, Gilula, and Allenby (2001) called for the scale parameter to be lognormally distributed such that $\ln(\alpha_i)$ has a common variance $\sigma^2$. The variance is constrained for model identification. We reparameterize $\alpha_i = \exp(\theta_x / \sigma)$ so that the location parameter $\theta$ and the extreme response parameter $\theta_x$ may be jointly normally distributed; in this reparameterization, $\theta_x$ is the individual differences latent variable for extreme responding.
The PTM may be expressed as:
A Sequential Decisions Model for Extreme Response Set
Before innovative work by Likert (1932) and Thurstone (1928) created modern personality and attitude assessment, some studies attempted to measure degrees of assent and dissent with statements by methods involving multiple steps. If one responded to the questionnaire of Hart (1923), one made a first pass through the items to indicate assent, dissent, or neutrality, followed by a second pass to underline and double underline some responses for relative emphasis. Neumann (1926) followed a similar procedure, but in a second study devised a 5-point scale of the general form of a Likert-type scale as a time-saving measure.
Two-response items now appear infrequently in psychometric measures, although they are still used in the study of judgment and decision making, for example, in eyewitness testimony. Such an item might be a lineup of suspects, in which the respondent—the witness—must identify an individual in the lineup, and then indicate confidence in the identification.
An alternative to the cognitive process model underlying the MNRM is that respondents internally represent each Likert-type item as two separate questions. “Do I agree with this statement?” one might ask, and then, “How strongly do I want to emphasize my position?” A respondent’s first choice may or may not influence the second. O’Connell (2009) quoted an individual who said he had committed to choosing between the highest and lowest response options for every item on a test, before even beginning to take the test! The TDM treats Likert-type item responses as though the respondents answer two (internal) questions.
The TDM to be introduced in this section has precursors in the literature: The “compound models” described by Tutz (1989, p. 262) were simpler antecedents of the TDM that, in one example, regressed on observed variables first a decision about direction of response and second a decision about strength of response. Drawing ideas from econometrics (McFadden, 1981, 1982), “nested models” for the responses to multiple-choice items were introduced by Suh and Bolt (2010); these models combined a two-parameter logistic (2PL) or three-parameter logistic (3PL) model for a first stage, distinguishing between correct and incorrect responses, with a nominal model for a second stage that described distractor selection. Suh and Bolt’s (2010) proposal used the same latent variable at both stages; Bolt, Wollack, and Suh (2012) elaborated the model with different (correlated) latent variables for the first and second stages, as in the TDM and other recent work with tree-based models.
“IRTrees” are “tree-based item response models of the GLMM family” (De Boeck & Partchev, 2012, p. 1): another set of latent-variable models that explain observed responses with a series of decisions. In the language of De Boeck and Partchev (2012, p. 3), the TDM is a “nested response tree” model. To remain within the class of generalized linear mixed models (GLMMs), the models suggested by De Boeck and Partchev exclude differential item discrimination parameters and other parameters that are included in the TDM; however, simpler models may be useful in some contexts.
Böckenholt (2012) provides a general theoretical overview of the use of item response models for multiple response processes; like De Boeck and Partchev (2012), Böckenholt uses tree diagrams to express models for multiple processes with branched outcomes. One of Böckenholt’s examples involves Likert-type responses with a neutral point, in which the first process produces a decision to endorse neither agree nor disagree, a second process selects the direction of response (if there is one) between agree and disagree, and a third process decides the intensity of the response (strongly or not). Böckenholt’s (2012) models, like those of De Boeck and Partchev (2012), avoid bilinear forms, but that work with the simpler models (and simpler data sets) suggests the value of the general approach, because the multiple-process models fit the data well.
The TDM for Even Numbers of Categories
It is convenient to define the TDM separately for items with even and odd numbers of response alternatives. The left panel of Figure 1 shows a tree diagram for four response categories $k = 0, 1, 2, 3$: The first decision, whether the response is positive ($k_d = 1$) or negative ($k_d = 0$), depends on $T_d(k_d)$. The notational element $d$ indicates that $T_d(k_d)$ is a function of the (possibly vector-valued) latent variable $\theta$ representing individual differences on the construct the items are intended to measure. The (possibly multidimensional variant of the) 2PL model is used for this probability as a function of $\theta$:
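In slope-intercept form, a 2PL model for the direction decision can be sketched as follows. The symbols $\mathbf{a}_d$ and $c_d$ for the direction slope (vector) and intercept are assumed here for illustration, chosen to parallel the second-stage slope and intercept.

```latex
T_d(k_d = 1) =
  \frac{1}{1 + \exp\left[-\left(\mathbf{a}_d^{\prime}\boldsymbol{\theta} + c_d\right)\right]},
\qquad
T_d(k_d = 0) = 1 - T_d(k_d = 1)
```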
Figure 1. Tree diagrams of the two-decision model for four- and five-category items.
At the second stage, as shown in the left panel of Figure 1, the response is made more extreme ($k_x = 1$) or less so ($k_x = 0$) with probabilities $T_x(k_x)$. This part of the model depends on another latent variable $\theta_x$, which reflects individual differences in the tendency to select extreme responses. For a four-category item like that illustrated in Figure 1, $T_x(k_x)$ is a modified 2PL model. However, in general, for six or eight or more response categories, a modified version of Samejima’s (1969) graded model is used:
The $c_x$ parameters are the overall response category intercepts, and the slope parameters $a_x$ describe the item-specific relationships of extreme responding with the latent variable describing an individual’s tendency toward extreme responses. To introduce a compensatory aspect to the model, a parameter $v$ is added, regressing the extreme-response logit on the logit for response direction. This can also be interpreted as a shift of the intercept parameter $c_x$
as |
Then the complete model for the observed responses is given by:
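For a four-category item, the multiplicative combination of the two stages can be sketched in Python as below. This is an illustration under stated assumptions, not the estimation code used in the studies: the construct is taken to be unidimensional, the function and argument names are hypothetical, and the compensatory term is assumed to enter as $v$ times the absolute value of the direction logit.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def tdm_four_category(theta, theta_x, a_d, c_d, a_x, c_x, v=0.0):
    """Probabilities of responses k = 0, 1, 2, 3 for one four-category item."""
    direction_logit = a_d * theta + c_d
    p_positive = logistic(direction_logit)  # first stage: T_d(k_d = 1)
    # ASSUMPTION: the compensatory term is v times the absolute direction
    # logit, so a strong leaning in either direction favors extreme responses.
    p_extreme = logistic(a_x * theta_x + c_x + v * abs(direction_logit))
    return [
        (1.0 - p_positive) * p_extreme,          # k = 0: negative, extreme
        (1.0 - p_positive) * (1.0 - p_extreme),  # k = 1: negative, not extreme
        p_positive * (1.0 - p_extreme),          # k = 2: positive, not extreme
        p_positive * p_extreme,                  # k = 3: positive, extreme
    ]

probs = tdm_four_category(theta=0.5, theta_x=1.0, a_d=1.5, c_d=0.2,
                          a_x=1.0, c_x=-0.5, v=0.3)
```

Because each observed response is the product of one direction probability and one extremity probability, the four probabilities sum to one by construction, and the two positive categories together recover the first-stage probability of a positive response.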
All latent variables are assumed to follow a joint multivariate normal distribution; the estimated parameters include the correlation between
The Two-Decision Model for Odd Numbers of Categories
Items with an odd number of response categories usually have a middle response indicating neutrality of direction. In the TDM, there are two ways of selecting a middle category: The first is to neither agree nor disagree with the statement; the second is to de-emphasize one’s agreement or disagreement, even though it exists. That duality is reflected in the model for items with odd numbers of categories.
As a result, the first-stage model has three latent responses, $k_d = 0$, 1, or 2 for negative, neutral, and positive. Samejima’s (1969) graded model is used for those probabilities as a function of the (possibly vector-valued) latent variable $\theta$:
The two intercepts $c_1$ and $c_2$ are associated with response category boundaries in this three-category model; the slope parameter (vector)
At the second stage for a five-alternative item, as shown in the right panel of Figure 1, the response is made more extreme ($k_x = 2$), less so ($k_x = 1$), or neutral ($k_x = 0$) with probabilities $T_x$, again depending on $\theta_x$, which reflects individual differences in the tendency to select extreme responses. A modified version of Samejima’s (1969) graded model that is essentially the same as that used for the model for even numbers of categories is used:
The parameters of the second-stage model are all the same for odd numbers of categories as for even numbers, except that the two first-stage intercepts are averaged to yield c', which is used as the intercept in the compensatory regression term with coefficient v:
The complete model for the observed responses is the same as for even numbers of categories for all but the middle (neutral) response:
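The two routes to the middle category can be made concrete with a Python sketch for a five-category item. It is one reading of the description above rather than the authors' implementation: the construct is taken to be unidimensional, both stages use cumulative logits in slope-intercept form, the compensatory term is assumed to be $v$ times the absolute value of the logit formed with the averaged intercept $c'$, and all names are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def graded_three(z_lower, z_upper):
    """Three category probabilities from two cumulative logits (z_lower >= z_upper)."""
    p_ge1, p_ge2 = logistic(z_lower), logistic(z_upper)
    return [1.0 - p_ge1, p_ge1 - p_ge2, p_ge2]

def tdm_five_category(theta, theta_x, a_d, c1, c2, a_x, cx1, cx2, v=0.0):
    """Probabilities of responses k = 0..4 for one five-category item."""
    # First stage: negative, neutral, or positive direction.
    t_neg, t_neutral, t_pos = graded_three(a_d * theta + c1, a_d * theta + c2)
    # ASSUMPTION: the compensatory shift uses the averaged intercept c' and
    # the absolute value of the resulting direction logit.
    c_prime = (c1 + c2) / 2.0
    shift = v * abs(a_d * theta + c_prime)
    # Second stage: neutral, less extreme, or more extreme.
    x_neutral, x_mild, x_extreme = graded_three(a_x * theta_x + cx1 + shift,
                                                a_x * theta_x + cx2 + shift)
    return [
        t_neg * x_extreme,                        # k = 0
        t_neg * x_mild,                           # k = 1
        t_neutral + (t_neg + t_pos) * x_neutral,  # k = 2: both routes to the middle
        t_pos * x_mild,                           # k = 3
        t_pos * x_extreme,                        # k = 4
    ]

probs5 = tdm_five_category(theta=0.0, theta_x=0.5, a_d=1.2, c1=1.0, c2=-1.0,
                           a_x=1.0, cx1=0.5, cx2=-1.0, v=0.2)
```

The middle response collects the probability of a neutral first-stage decision plus the probability of a directional decision that is then de-emphasized to neutral; the other four responses are products of one direction and one extremity probability, as in the even-category case.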
Diagnostic Characteristics of Undetected Extreme Response Set
If present and ignored, extreme response set can cause a number of measurement issues, from spurious correlations of scores with other variables to lack of factorial invariance between groups to apparent differential item functioning. One effect not noted by previous authors, though unsurprising, is misfit of models that assume ordered unidimensional responses—because the combination of extreme response set with a set of items that measure asymmetrically around the mean, when reduced to a single measurement dimension, results in unordered responses.
Consider a 4-point Likert-type item that distributes responses according to two latent dimensions, a personality trait dimension—say extraversion—and extreme response set. The item is “easy,” so that most respondents choose the upper two response categories. The best model will primarily fit the latent difference between the top two response categories, a line between the centers of Agree and Strongly Agree, a vector that contains both an extraversion component and an extreme response component. Strongly Disagree, which attracts respondents low on extraversion but high on extreme response set, may be placed higher on the best fit line than Disagree or even Agree, depending on the relative strength of the extreme response set component. The item model, if it has the freedom to do so, will “fold”: The middle response categories, Disagree and Agree will be associated with one end of the latent variable, while the extreme response categories Strongly Disagree and Strongly Agree with the other end (see the right panel of Figure 2 for an illustration).

Figure 2. Left panel: Dominant regions for responses to an example item, in two dimensions. Right panel: Trace lines for an example item under the nominal model.
To demonstrate this effect, 50,000 persons’ responses were simulated to a fictional 86-item test using the TDM and reasonable parameters. Bolt and Johnson’s model could have been substituted with few changes. The correlation between the construct dimension and extreme response set was zero. The distribution of responses to a typical item, in the original two dimensions, is shown in the left panel of Figure 2.
IRTPRO (Cai, Thissen, & duToit, 2011) was used to fit the generalized partial credit model (GPCM; Muraki, 1992) and the nominal response model (NRM; Bock, 1972) to the resulting data set. GPCM was chosen because it is a nested submodel of NRM (Thissen & Steinberg, 1986). The likelihood ratio test (Neyman & Pearson, 1928; Wilks, 1938) can be used with nested IRT models; if it is significant, the additional parameters allowed for the unconstrained model contribute to better fit and the constraint is inappropriate to the data (Bock & Aitkin, 1981; Bock & Lieberman, 1970; Thissen, 1982). The difference between the doubled log likelihoods in this case was 239,538.80; the threshold for significance at α = .05 was 210.2 (df = 172). The order constraint on the responses was untenable.
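The arithmetic of this test can be checked with a short Python sketch. It uses the Wilson-Hilferty approximation to the chi-square quantile function in place of an exact routine, which is a substitution made here for self-containment; the function name is hypothetical. The degrees of freedom follow from the nesting: for each item, the NRM frees the $K - 2$ interior scoring function values that the GPCM fixes.

```python
from statistics import NormalDist

def chi_square_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile function."""
    z = NormalDist().inv_cdf(p)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * h ** 0.5) ** 3

n_items, n_categories = 86, 4
# The NRM frees the K - 2 interior scoring values that the GPCM fixes, per item.
df = n_items * (n_categories - 2)
lr_statistic = 239_538.80  # difference in doubled log likelihoods, from the text
critical_95 = chi_square_quantile(0.95, df)
critical_975 = chi_square_quantile(0.975, df)
rejects_order_constraint = lr_statistic > max(critical_95, critical_975)
```

Under this approximation, the quoted threshold of 210.2 is reproduced by the 0.975 quantile of the chi-square distribution with 172 degrees of freedom, while the 0.95 quantile is about 203.6. The conclusion does not depend on which is used, since the observed statistic is larger by three orders of magnitude.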
Item folding can be seen in the $s_{jk}$ sequence for the NRM, because the order of the scoring function values indicates the ordering of the response categories. The first and last category scoring function parameters, $s_{j0}$ and $s_{j3}$, are fixed to 0 and 3 for identification; the middle categories are expected to be between 0 and 3 in ascending order. In this case, the $s_{jk}$ values are $\{0, -3.63, -2.37, 3\}$. The right panel of Figure 2 shows the trace lines for the four categories. The nominal response model has achieved better fit at the price of placing the responses out of order.
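A check of this kind is easy to automate. The following Python sketch sorts the category indices by their scoring function values and flags an item as folded when the result departs from ascending category order; the function names are hypothetical.

```python
def category_order(scores):
    """Category indices sorted from the low end to the high end of the latent variable."""
    return sorted(range(len(scores)), key=lambda k: scores[k])

def is_folded(scores):
    """True when the scoring values are not in ascending category order."""
    return category_order(scores) != list(range(len(scores)))

reported = [0.0, -3.63, -2.37, 3.0]  # scoring function values quoted in the text
order = category_order(reported)     # [1, 2, 0, 3]
```

For the values reported here, the two middle categories (1 and 2) sit at the low end of the fitted latent variable and the two extreme categories (0 and 3) at the high end, which is the folded pattern described above.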

Dominant regions for responses to a five-category item under MNRM, PTM, and TDM.
Empirical Study 1
Method
Responses to 42 Likert-type items were collected as a part of electronic job applications to entry-level hourly field positions at 75 large and mid-size organizations, primarily in grocery and specialty retail. Job applicants were citizens or residents of the United States. Applicants overall reflected the diversity of the United States population; 42% were male and 45% were White. Additional demographic data beyond gender and ethnicity were not collected consistently across the sample.
Five disjoint representative subsamples were selected for replication of study procedures. Subsets were defined by the remainder of the application’s sequential database key after division by a number N > 5. Not all possible subsets were used.
Many applicants submit more than one application, which may be nearly simultaneous (e.g., applications for the same position at two nearby stores) or separated by months or years. Repeat applications violate a fundamental assumption in parameter estimation, namely local independence of individual response patterns. Therefore, repeats were identified and only the first application by an individual applicant was used. Because repeats were identified across the full data set, replication subsets are disjoint in terms of applicants as well as applications.
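The sampling and de-duplication steps can be sketched together in Python. The sketch assumes that a smaller database key means an earlier application and that the remainders 0 through 4 are the ones retained; the modulus, the choice of remainders, and all names are hypothetical, since the source does not report them.

```python
def assign_subsamples(applications, modulus, n_subsamples=5):
    """Split first applications into disjoint subsamples by key remainder.

    applications: iterable of (database_key, applicant_id) pairs.
    """
    if modulus <= n_subsamples:
        raise ValueError("modulus must exceed the number of subsamples")
    seen = set()
    subsamples = {r: [] for r in range(n_subsamples)}
    for key, applicant in sorted(applications):  # earliest key first
        if applicant in seen:
            continue  # drop repeat applications across the full data set
        seen.add(applicant)
        remainder = key % modulus
        if remainder in subsamples:  # other remainders go unused
            subsamples[remainder].append(key)
    return subsamples

records = [(1, "a"), (2, "b"), (8, "a"), (9, "c"), (12, "d"), (13, "e")]
split = assign_subsamples(records, modulus=7)
```

Removing repeats before the split is what makes the subsamples disjoint in applicants as well as in applications: a second application by the same person is dropped even when its key would fall in a different subsample.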
After repeats were removed, about 675,000 records remained in each subsample. Sample sizes are given in Table 1.
Table 1. Sample Size.
The items assessed reliability in work behavior, mostly conscientiousness. Twenty-eight of the items were used to generate scores predicting retention if hired, and termination reason. Fourteen items were not scored. All items used a 4-point Likert-type scale with response options Strongly Agree, Agree, Disagree, and Strongly Disagree.
An example item stem is: “Most of your decisions have been good ones.”
Unidimensional and multidimensional versions of six item models were fit within each subsample. The models were the graded response model (GRM), GPCM, and NRM; the unidimensional stereotype FAMLM and a multidimensional MNRM with matching constraints; the PTM; and the TDM. The multidimensional models are bifactor models for the construct subset of the latent space; items had non-zero slopes on one of three cluster-specific factors that were identified in previous analyses of these item sets.
All parameters were estimated using an algorithm similar to that described by Bock and Aitkin (1981), with fixed quadrature. For unidimensional GRM, GPCM, and NRM, 49-point quadrature spanned [−6, 6]; for the FAMLM, unidimensional PTM, and unidimensional TDM, 41-point quadrature spanned [−5, 5]; for bifactor GRM, GPCM, and NRM, 21-point quadrature spanned [−5, 5]; for the multidimensional MNRM, PTM, and TDM, 13-point quadrature spanned [−4.5, 4.5]. Parameters for the multidimensional MNRM, PTM, and TDM were estimated using an algorithm modeled after that used to fit a two-tier bifactor model (Cai, 2010).
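A fixed quadrature grid of the kind listed above can be sketched as follows. The sketch assumes equally spaced nodes weighted by a normalized standard normal density, which is a common choice for marginal maximum likelihood but is not confirmed by the text; the function names are hypothetical.

```python
import math


def quadrature(n_points, lo, hi):
    """Equally spaced nodes on [lo, hi] with normalized standard-normal weights."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + i * step for i in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in nodes]
    total = sum(dens)
    weights = [d / total for d in dens]
    return nodes, weights


def product_grid_size(n_points, n_dims):
    """Number of nodes in a full product grid over n_dims dimensions."""
    return n_points ** n_dims


nodes, weights = quadrature(49, -6.0, 6.0)   # 49 points on [-6, 6], spacing 0.25
variance = sum(w * x * x for x, w in zip(nodes, weights))
```

The product-grid count grows as a power of the number of jointly integrated dimensions (for example, `product_grid_size(13, 3)` is 2197), which is one plausible reason for using fewer points per dimension in the multidimensional models; how many dimensions are integrated jointly depends on the dimension-reduction scheme, which is not reproduced here.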
Results
Tables 2 and 3 show the increase in Bayesian Information Criterion (BIC) for each model and sample, compared to the lowest BIC for the sample, for unidimensional models and multidimensional models, respectively. As expected, GRM fit the data poorly, and GPCM was worse; the likelihood ratio test statistic for the GPCM against the unconstrained NRM exceeded three million for all samples and both factor structures. (With 84 degrees of freedom available, the threshold for significance at α = .05 was 111.2.) NRM, MNRM, PTM, and TDM all fit better. By smaller margins, MNRM fit better than NRM, PTM fit better than MNRM, and TDM fit better than PTM. No prediction was made about the relative fit of these models.
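The quantity tabulated here, the increase in BIC over the best-fitting model, can be computed as in the sketch below, using the standard definition BIC = −2 log L + k ln(n). The fit values, parameter counts, and sample size are hypothetical placeholders chosen only to exercise the code; they are not results from the study.

```python
import math


def bic(neg2_loglik, n_params, n_obs):
    """Bayesian Information Criterion from -2 log L, parameter count, sample size."""
    return neg2_loglik + n_params * math.log(n_obs)


def bic_increase(bics):
    """Difference between each model's BIC and the lowest BIC in the set."""
    best = min(bics.values())
    return {model: value - best for model, value in bics.items()}


# Hypothetical (-2 log L, parameter count) pairs, not values from the study.
fits = {"GPCM": (1000.0, 10), "NRM": (900.0, 20), "TDM": (880.0, 22)}
n_obs = 1000

bics = {model: bic(d, k, n_obs) for model, (d, k) in fits.items()}
deltas = bic_increase(bics)
```

The model with the lowest BIC has an increase of exactly zero, so each table row reads as a distance from the best model within that sample.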
Table 2. Increase in BIC for Each Unidimensional Model and Sample, Compared to the Lowest BIC for the Sample.
Note. BIC = Bayesian Information Criterion; GRM = graded response model; GPCM = generalized partial credit model; NRM = nominal response model; MNRM = multidimensional nominal response model; PTM = proportional thresholds model; TDM = two-decision model.
Table 3. Increase in BIC for Each Multidimensional Model and Sample, Compared to the Lowest BIC for the Sample.
Bock’s NRM displayed item folding behavior as expected. For the item “Most of your decisions have been good ones,” sjk = {0.00, −11.95, −9.70, 3.00}. Similar behavior was observed for the other 41 items and in the corresponding bifactor model. The highest middle-category scoring parameter was −5.08; the lowest was −1513.13.
The MNRM, PTM, and TDM had noteworthy differences in how they fit these data. Under MNRM, the correlation between the reliability construct dimension and extreme response set averaged −0.206 across replications, with a standard deviation of 0.002. Under PTM, the correlation averaged 0.008 (SD = 0.002), and under TDM, the correlation averaged 0.320 (SD = 0.041). Most of the differences occurred in the reliability construct dimension; the extreme response set estimates correlated 0.85–0.96 between models.
Empirical Study 2
Method
Responses to 86 items, of which 56 were Likert items, were collected as a part of electronic job applications to hourly field positions, not including supervisory or nursing roles, at 17 large and mid-size organizations in the long-term care industry. Job applicants were citizens or residents of the United States. Applicants were mostly (80%) female; 48% were White, 31% were Black, and 10% were Hispanic. Additional demographic data beyond gender, race, and ethnicity were not collected consistently across the sample.
As in Study 1, disjoint representative subsamples consisting only of first-time applicants were selected for replication of study procedures; three subsamples were used here. Each subsample comprised about 64,000 applicants. Sample sizes are given in Table 4.
Table 4. Sample Size.
The items were written to assess several personality and work style constructs previously shown to relate to five performance competencies common to health care jobs. Three item formats were used. Fifty-six 5-point Likert items were accompanied by 26 adjective dyads and 4 forced-choice work preference items. Adjective dyads asked applicants to describe themselves by selecting one of a pair of equally socially desirable adjectives, such as “serious” and “resilient.” Work preference items asked applicants to indicate a preference between a pair of work conditions or requirements, such as “Put up with rude or aggressive comments from others” and “Receive no feedback about how you are doing in your job.”
Unidimensional and multidimensional versions of six item models were fit to the Likert items within each subsample: the GRM, GPCM, and NRM; the stereotype FAMLM and a multidimensional MNRM with matching constraints; the PTM; and the TDM. In all cases, the 2PL model was used for the adjective dyads and work preference items.
Parameters for the multidimensional MNRM, PTM, and TDM were estimated using an algorithm modeled after that used to fit a two-tier bifactor model (Cai, 2010). For comparison, bifactor models were fit using the GRM, GPCM, and NRM. Items were assigned to nine cluster-specific dimensions consistently across models based on previous analyses of these item sets. All parameters were estimated using the same algorithms used in Study 1.
Results
Under all conditions, the NRM and all three models with extreme response variables outperformed the GPCM and GRM by large margins; the likelihood ratio test statistics for GPCM nested within NRM were in the hundreds of thousands. The differences between extreme response models were smaller but consistent; TDM outperformed PTM, which in turn outperformed MNRM. Tables 5 and 6 show the fit of each model across replications for the unidimensional and multidimensional models, respectively.
Table 5. Increase in BIC for Each Unidimensional Model and Sample, Compared to the Lowest BIC for the Sample.
Table 6. Increase in BIC for Each Multidimensional Model and Sample, Compared to the Lowest BIC for the Sample.
The nominal response model, as before, displayed clear item folding. For a sample Likert item, “I dislike having to change my plans because of other people’s mistakes,” sjk = {0.00, −13.51, −13.73, −10.84, 4.00} based on Sample A data. Similar behavior was observed for the other 55 Likert items. The highest middle-category scoring parameter was −4.89; the lowest was −1896.83.
Scores computed for each specific factor had, with few exceptions, similar reliabilities across PTM, MNRM, and TDM, as shown in Table 7. Correlations between observed scores on specific factors were low, supporting the nine specific-factor models tested.
Table 7. Reliabilities for General and Specific Factors Under PTM, MNRM, and TDM.
On the whole, the three extreme response models behaved more similarly than differently. However, as in Study 1, the modeled correlations between the main construct and extreme response set dimensions differed across the multidimensional models: −0.052 (SD = 0.007) under MNRM, 0.091 (SD = 0.009) under PTM, and 0.331 (SD = 0.017) under TDM. This pattern indicates a difference in the meaning of the traits estimated.
Discussion
Both Study 1 and Study 2 clearly demonstrate the utility of simultaneously estimating an extreme response set trait. The fit of all three tested extreme response models to the data is unambiguously superior to that of any of the conventional models: GRM, GPCM, and NRM.
Figure 3 shows the distribution of modal item responses across general trait and extreme response set dimensions for the same example item under all three extreme responding models, as well as the corresponding item parameter estimates. Because all of the extreme-responding models involve many parameters that functionally interact, graphical displays of their joint consequences for the item response surfaces are more useful interpretive devices than are tables of item parameters. In practice, when these models are used in test construction, we use graphics such as those in Figure 3.
Differences in form and function exist among the MNRM, PTM, and TDM. Some of these were better tested by the present data than others. Under MNRM, extreme response behavior consisted only of Strongly Agree and Strongly Disagree responses, with no special treatment for exact middle responses. PTM treats middle responding in 5-point Likert-type items as a negative counterpart to extreme responding; TDM allows for middle responses by individuals high on extreme response set as well as consistent middle responding. The difference in placement of middle responses is readily apparent in Figure 3, as is the asymmetry of MNRM, discussed below. It is unclear which definition was most advantageous.
Both the MNRM and the PTM assume that elicitation of extreme response set is a function of response text only, not stem text; cross-item parameters are used to define the geometry of the responses. Among the models considered in this study, the TDM alone assigns extreme response slope and intercept parameters to each item; only under TDM can an item be a poor measure of all assessed constructs at once. Future research involving less-constrained versions of the MNRM and PTM may be useful, although each will present challenges with respect to model identification.
Inheriting as it does from the nominal response model, MNRM allows a category-spacing asymmetry that PTM and TDM do not. Partial item folding was observed under MNRM; sk for the lowest-middle category (but not its next-door neighbor) was slightly negative across all samples and factor models in both Study 1 and Study 2. The two lowest categories remain separated on the extreme responding dimension. The item is “folded” only in that the set of vectors along which the response categories are ordered as intended does not include the construct dimension. This asymmetry—and the positive correlations between construct dimension and extreme response set under TDM and PTM—may be due to the presence of another unmeasured response set such as socially desirable responding, or the use of online keys to obtain a higher score. Such a “response set” would need to be more correlated with extreme response set than the actual construct dimension, and also more consistently present in items regardless of factor loading. Further research is needed to test these hypotheses.
The TDM provided the best fit to the data of all models tested in both Study 1 and Study 2, lending support to the idea that a non-compensatory two-stage process is a plausible model for responses to Likert-type items, at least in some contexts, such as the employment testing situation that yielded the data considered here. Further research on a variety of assessment and survey data, under controlled conditions, is needed to resolve the questions raised in this section; even so, the TDM holds promise as a theory of item response.
Acknowledgments
Our thanks to John Morrison, Phil Mangos, Martin Jetton, David Scarborough, Bjorn Chambless, Sandip Sinharay, and three anonymous reviewers for helpful comments on an earlier draft. Any errors that remain are, of course, our own.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
