The Generalized Thurstonian Unfolding Model (GTUM): Advancing the Modeling of Forced-Choice Data

Abstract

Forced-choice (FC) measurement has become increasingly popular due to its robustness to various response biases and reduced susceptibility to faking. Although several current Item Response Theory (IRT) models can extract normative person scores from FC responses, each has its limitations. This study proposes the Generalized Thurstonian Unfolding Model (GTUM) as a more flexible IRT model for FC measures to overcome these limitations. The GTUM (1) adheres to the unfolding response process, (2) accommodates FC scales of any block size, and (3) manages both dichotomous and graded responses. Monte Carlo simulation studies consistently demonstrated that the GTUM exhibited good statistical properties under most realistic conditions. Particularly noteworthy findings include (1) the GTUM's ability to handle FC scales with or without intermediate statements, (2) the consistently superior performance of graded responses over dichotomous responses in person score recovery, and (3) the sufficiency of 10 mixed pairs to ensure robust psychometric performance. Two empirical examples, one with 1,033 responses to a static version of the Tailored Adaptive Personality Assessment System and the other with 759 responses to a graded version of the Forced-Choice Five-Factor Markers, demonstrated the feasibility of the GTUM to handle different types of FC scales. To aid in the practical use of the GTUM, we also developed the R package “fcscoring.”

Keywords

forced-choice, Item Response Theory, unfolding response process, GTUM

The Likert rating scale is undoubtedly the most widely used format to assess organizationally relevant constructs for its relative ease of development, administration, and scoring. Each time, respondents are presented with a single statement and are asked to indicate their absolute degree of agreement with it on a graded scale (e.g., 1 = “Strongly disagree”; 2 = “Disagree”; 3 = “Agree”; 4 = “Strongly agree”). However, this format has long been criticized for its susceptibility to various response biases (Podsakoff et al., 2003; Zettler & Lang, 2015; Zettler et al., 2016) and faking (Hu & Connelly, 2021; Speer et al., 2023; Sun et al., 2022; Zickar & Drasgow, 1996; Zickar et al., 2004), which threatens the validity of scores derived from such a format.

To overcome these issues, researchers in the US Army first developed the forced-choice (FC) format (Sisson, 1948). FC scales show respondents a block of at least two statements measuring different latent factors. When there are two statements per block, respondents are required to pick the one that is “most like me” (PICK); when there are three or more statements per block, respondents are often asked to either choose the one that is “most like me” and the one that is “least like me” (MOLE), or rank all statements from “most like me” to “least like me” (RANK; Hontangas et al., 2015). As respondents must choose one over the other(s), it is hard to be all high or all low on all dimensions, thus reducing halo effects. Given the dichotomous nature of choices, it is also impossible for respondents to display response styles—the idiosyncratic ways of using response options regardless of statement content (Li et al., 2021). If a further step is taken to ensure the statements within a block are matched on social desirability, FC scales are also faking-resistant (Cao & Drasgow, 2019). As a result, scores derived from FC scales are more strongly related to job performance (Speer et al., 2023) and have higher inter-rater agreement (Brown et al., 2017; Wetzel & Frick, 2020) compared to scores derived from Likert scales. Given these advantages, FC measurement has been playing a critical role in various high-stakes contexts, such as global talent management (Bartram, 2013; Boyce et al., 2014), and soldier selection (Drasgow et al., 2012) and placement (Kirkendall et al., 2020) in the United States.

However, the dichotomous FC format discussed above is not without issues. First, scores derived from such scales are often less reliable than scores from Likert rating scales with identical statements because dichotomous responses offer more limited information (Brown & Maydeu-Olivares, 2011; Bürkner, 2022). Second, people may respond unfavorably to such a dichotomous FC format because they are forced to make decisions even if the statements within a block are similarly like/unlike them (Dalal et al., 2021). Lower reliability and less favorable respondent reaction will likely reduce the validity of scores derived from FC scales and further threaten the accuracy of decisions based on these scores. To address these issues, Brown and Maydeu-Olivares (2018) introduced the graded FC format. Instead of choosing between two statements (1 = “A describes me better”; 2 = “B describes me better”), respondents can indicate the degree to which they prefer one statement over the other (e.g., 1 = “A describes me much better”; 2 = “A describes me slightly better”; 3 = “B describes me slightly better”; 4 = “B describes me much better”). It has been shown that the graded format can improve reliability (Brown & Maydeu-Olivares, 2018; Zhang et al., 2023a) and respondent reaction (Dalal et al., 2021).

Given the variety of block sizes and response formats (PICK, MOLE, RANK, and graded) and that each has unique advantages, it would be beneficial to have a versatile psychometric model capable of handling all variants. This would enable FC users to rely on a unified framework to model responses to various FC scales so that researchers can directly compare across different FC scales. It is also essential for the psychometric model to accurately depict the underlying response process (see detailed discussion below), as using a scoring model that misaligns with the underlying response process can lead to detrimental effects (Nye et al., 2020; Tay & Drasgow, 2012). However, as will be elaborated on later, the two main psychometric models for FC responses, namely the Multi-Unidimensional Pairwise Preference Model (MUPP; Stark et al., 2005) and the Thurstonian Item Response Theory Model (TIRT; Brown & Maydeu-Olivares, 2011), and their extensions, are limited in certain aspects. The MUPP model is currently restricted to handling FC scales with a block size of two and the dichotomous response format. The GGUM-RANK model (Lee et al., 2019), an extension of the MUPP model, can accommodate FC scales of any block size (e.g., pairs, triplets, quadruplets). Yet, it does not apply to the graded FC format. The TIRT model can handle FC scales of any block size and response format. However, it assumes that individuals respond to statements using a dominance response process, inconsistent with many studies suggesting that the unfolding response process better characterizes how people respond to self-report assessments (Drasgow et al., 2010).

For these reasons, there is a compelling need for a more versatile psychometric model that not only consolidates the advantages of prior models but also advances beyond them. Ideally, this model should (1) adhere to the unfolding response process, (2) handle FC scales of any block size, and (3) model both dichotomous and graded responses. In addition, the new model should be easily accessible so that fellow researchers can readily apply it (Borsboom, 2006). With these considerations in mind, we present a new Item Response Theory (IRT) model that fulfills these objectives, supplemented by an accompanying R package. In doing so, we offer FC users a flexible and easily accessible tool for analyzing FC data and enhancing FC measurement. This aim aligns fundamentally with the recognized significance of measurement in organizational research (Cortina et al., 2017; Foster et al., 2017; Zickar, 2020) and the recent advocacy for novel modeling techniques in personnel assessments, and more specifically, in FC assessments (Joo et al., 2020, 2021; Speer & Delacruz, 2021).

Modeling Forced-Choice Data

Traditionally, FC scales are scored by counting how many statements of a specific trait are chosen as “most like me” after adjusting for the direction of statement wording (Baron, 1996; Meade, 2004). However, scores derived in this approach are ipsative because the sum of scores across all factors is the same for everyone. Ipsative scores are fundamentally flawed and inappropriate for between-person comparisons (Hicks, 1970), which is what personnel selection relies on. Therefore, the use of FC scales waned soon after people recognized the issue of ipsative scores. Fortunately, multiple IRT models have since been developed to recover normative scores from FC data. Table 1 presents a summary of key features of existing FC IRT models that are capable of producing normative scores from FC data. Among the 15 existing models listed, the MUPP and the TIRT models have been the most influential. Below, we will review the MUPP model and its GGUM-RANK extension and the TIRT model in detail because our new model integrates and extends their strengths.

Table 1.
A Summary of Existing FC IRT Models and the Current One.

| No. | FC IRT Model | Authors | Block Size | Decision Model | Response Process | Dimensionality | Graded Response | Response Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Multi-Unidimensional Pairwise Preference Model | Stark et al. (2005) | 2 | Andrich's Forced Endorsement Model | Unfolding | Multi | No | No |
| 2 | GGUM-RANK | Lee et al. (2019) | ≥2 | Andrich's Forced Endorsement Model | Unfolding | Multi | No | No |
| 3 | Zinnes-Griggs Model | Zinnes and Griggs (1974) | 2 | Bradley-Terry | Unfolding | Uni | No | No |
| 4 | Andrich's Squared Difference Model | Andrich (1989) | 2 | Bradley-Terry | Unfolding | Uni | No | No |
| 5 | Andrich's Hyperbolic Model | Andrich (1995) | 2 | Bradley-Terry | Unfolding | Uni | No | No |
| 6 | Zinnes-Griggs Pairwise Preference Item Response Theory Model | Joo et al. (2021) | 2 | Thurstone's Law of Comparative Judgement | Unfolding | Multi | No | No |
| 7 | Forced-Choice Ranking Models | Hung and Huang (2022) | ≥3 | Andrich's Forced Endorsement Model | Unfolding | Multi | No | No |
| 8 | Generalized Thurstonian Unfolding Model | Current study | ≥2 | Thurstone's Law of Comparative Judgement | Unfolding | Multi | Yes | No |
| 9 | Thurstonian Item Response Theory Model | Brown and Maydeu-Olivares (2011) | ≥2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | Yes | No |
| 10 | Multi-Unidimensional Pairwise Preference-2PL | Morillo et al. (2016) | 2 | Andrich's Forced Endorsement Model | Dominance | Multi | No | No |
| 11 | Bayesian Random Block Item Response Theory Model | Lee and Smith (2020) | ≥3 | Andrich's Forced Endorsement Model | Dominance | Multi | No | No |
| 12 | Thurstonian D-diffusion Item Response Model | Bunji and Okada (2020) | 2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | No | Yes |
| 13 | Linear Ballistic Accumulator Item Response Theory Model | Bunji and Okada (2022) | 2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | No | Yes |
| 14 | JRT-TIRT | Guo et al. (2023) | ≥2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | No | Yes |
| 15 | The Faking Mixture Model | Frick (2022) | ≥2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | No | No |
| 16 | The Careless Response Mixture Model (M-TCIR) | Peng et al. (2023) | ≥2 | Thurstone's Law of Comparative Judgement | Dominance | Multi | No | No |

Using the framework proposed by Brown (2016), we can describe a FC IRT model along three axes: (1) the block size used, (2) the decision model for choice behaviors, and (3) the measurement model for the relationship between statements and the underlying psychological attributes. Regarding the first axis, the MUPP model can only handle FC scales with a block size of two, while the GGUM-RANK and the TIRT models work for FC scales with any block size.

Regarding the second axis, the MUPP model is built upon Andrich's Forced Endorsement model (Andrich, 1989); the GGUM-RANK model is based on an integration of Luce's Choice Axiom model (Luce, 1959) and Andrich's Forced Endorsement model; the TIRT model is based on Thurstone's Law of Comparative Judgement (Thurstone, 1927). Specifically, Andrich's Forced Endorsement model assumes that when presented with a pair of statements (i and k), respondents will first form a dichotomous response to each statement separately, producing 2² possible outcomes: agree with both statements ($y_{i} = 1$ and $y_{k} = 1$), disagree with both statements ($y_{i} = 0$ and $y_{k} = 0$), agree with statement i and disagree with statement k ($y_{i} = 1$ and $y_{k} = 0$), or disagree with statement i and agree with statement k ($y_{i} = 0$ and $y_{k} = 1$). The choice is easy in the latter two scenarios (i.e., choose statement i in the third scenario and statement k in the fourth scenario). However, the first two scenarios, though likely to occur, are not admissible in a FC task. Andrich further assumed that respondents would reevaluate the two statements until they reached a preference. Then, the probability that respondent j will choose statement i over statement k is given by
$P (y_{j {i, k}} = 1) = \frac{P (y_{j i} = 1) P (y_{j k} = 0)}{P (y_{j i} = 1) P (y_{j k} = 0) + P (y_{j i} = 0) P (y_{j k} = 1)} .$
(1)
The current MUPP model is strictly built upon Andrich's Forced Endorsement model, which is designed only for dichotomous responses, thus constraining the MUPP model from being able to handle graded responses where individuals can indicate the degree of preference. Also, the current MUPP model cannot handle FC scales with larger blocks.
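As a minimal sketch of Equation 1, the following Python function computes the preference probability from the two single-statement endorsement probabilities. The function name and the numeric inputs are illustrative assumptions and do not come from any existing package.

```python
def forced_endorsement(p_i: float, p_k: float) -> float:
    """Probability of choosing statement i over statement k (Equation 1).

    p_i and p_k are the probabilities of endorsing each statement when it
    is judged on its own, i.e., P(y_ji = 1) and P(y_jk = 1).
    """
    agree_i_only = p_i * (1.0 - p_k)   # agree with i, disagree with k
    agree_k_only = (1.0 - p_i) * p_k   # disagree with i, agree with k
    # "Agree with both" and "disagree with both" are inadmissible in a forced
    # choice, so they are left out of the normalizing constant.
    return agree_i_only / (agree_i_only + agree_k_only)


# Illustrative (made-up) endorsement probabilities.
p_choose_i = forced_endorsement(0.8, 0.5)
print(round(p_choose_i, 3))
```

Because only the two admissible outcomes enter the denominator, the probabilities of choosing i and of choosing k always sum to one. The function is undefined when both admissible outcomes have zero probability (e.g., both endorsement probabilities equal 1).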

To accommodate larger blocks, Lee et al. (2019) extended the MUPP model to the GGUM-RANK model. According to Luce's Choice Axiom, a complete ranking of K statements can be achieved by K−1 choice tasks. Specifically, it is assumed that respondents first choose one statement that is “most like me” from the K statements. Then they exclude this chosen statement and choose another one that is “most like me” from the remaining K−1 statements. This process goes on until only two statements remain, and respondents make the last choice among the two. It is further assumed that these consecutive choice tasks are independent. Therefore, the probability of the complete ranking is given by the joint probability of these choice tasks. For example, the probability that person j ranks the three statements as A > B > C can be obtained as follows:
$P (A > B > C) = P (A [A, B, C]) P (B [B, C]) .$
(2)
$P (A [A, B, C])$ refers to the probability of choosing statement A as the “most like me” from A, B, and C; $P (B [B, C])$ denotes the probability of choosing statement B as the “most like me” from B and C. These probabilities are given by
$P (A [A, B, C]) = \frac{P (y_{j A} = 1) P (y_{j B} = 0) P (y_{j C} = 0)}{P (y_{j A} = 1) P (y_{j B} = 0) P (y_{j C} = 0) + P (y_{j A} = 0) P (y_{j B} = 1) P (y_{j C} = 0) + P (y_{j A} = 0) P (y_{j B} = 0) P (y_{j C} = 1)} .$
(3)
$P (B [B, C]) = \frac{P (y_{j B} = 1) P (y_{j C} = 0)}{P (y_{j B} = 1) P (y_{j C} = 0) + P (y_{j B} = 0) P (y_{j C} = 1)} .$
(4)
The GGUM-RANK model's ability to model responses to larger blocks has substantially broadened the scope of the MUPP model. Yet, the GGUM-RANK model has limitations, notably not accommodating the graded FC format. Additionally, augmenting the GGUM-RANK model for graded responses appears challenging, primarily because both Luce's Choice Axiom and Andrich's Forced Endorsement Model are exclusively tailored for dichotomous choices.
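The sequential-choice logic of Equations 2 to 4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the GGUM-RANK authors' implementation: the function names are hypothetical, and the endorsement probabilities are made-up numbers that would in practice come from Equation 8.

```python
from itertools import permutations
from math import prod


def most_like_me(p, chosen, remaining):
    """P(`chosen` is picked as "most like me" from `remaining`); cf. Equations 3 and 4.

    p maps each statement label to its stand-alone endorsement probability.
    """
    def pattern(winner):
        # Probability of agreeing with `winner` only and disagreeing with the rest.
        return prod(p[s] if s == winner else 1.0 - p[s] for s in remaining)

    return pattern(chosen) / sum(pattern(s) for s in remaining)


def ranking_probability(p, ranking):
    """Probability of a complete ranking as a product of K - 1 choices (Equation 2)."""
    remaining = list(ranking)
    prob = 1.0
    while len(remaining) > 1:
        prob *= most_like_me(p, remaining[0], remaining)
        remaining = remaining[1:]  # drop the chosen statement and choose again
    return prob


# Illustrative (made-up) endorsement probabilities for a triplet.
p = {"A": 0.7, "B": 0.5, "C": 0.2}
print(ranking_probability(p, ("A", "B", "C")))

# The probabilities of all 3! = 6 possible rankings sum to one.
total = sum(ranking_probability(p, r) for r in permutations(p))
print(round(total, 6))
```

For a block of two statements, `ranking_probability` reduces to the pairwise preference probability in Equation 1.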

The TIRT model is based on Thurstone's Law of Comparative Judgement. Specifically, Thurstone adopted the notion of item utility and argued that it is the difference between the utilities of the two statements under comparison that determines the preference so that people will choose the statement with higher utility (Thurstone, 1927). Denote the utilities of statements i and k for person j as $t_{j i}$ and $t_{j k}$, the latent utility difference as
$y^{*}_{j {i, k}} = t_{j i} - t_{j k}$
(5)
and the observed dichotomous choice as $y_{j {i, k}}$ . Then, the latent utility difference determines the observed choice via the following threshold process:
$y_{j {i, k}} = \begin{cases} 1, & \text{if } y^{*}_{j {i, k}} > 0 \\ 0, & \text{if } y^{*}_{j {i, k}} \leq 0 \end{cases} .$
(6)
When there are more than two statements within a block (n > 2), researchers need to decompose responses to such blocks into n(n−1)/2 pseudo dichotomous items representing all unique pairwise comparisons. For example, the ranking A > B > C will be decomposed into three pseudo items AB = 1, AC = 1, and BC = 1, where 1 means that the first statement is preferred over the second one and 0 indicates the other way around. Equations 5 and 6 then model these pseudo items. When respondents are allowed to indicate their degree of preference on an M-point scale (M ≥ 3), the polytomous threshold process
$y_{j {i, k}} = \begin{cases} M, & \text{if } y^{*}_{j {i, k}} \geq τ_{{i, k} (M - 1)} \\ M - 1, & \text{if } τ_{{i, k} (M - 2)} \leq y^{*}_{j {i, k}} < τ_{{i, k} (M - 1)} \\ \vdots \\ 2, & \text{if } τ_{{i, k} (1)} \leq y^{*}_{j {i, k}} < τ_{{i, k} (2)} \\ 1, & \text{if } y^{*}_{j {i, k}} < τ_{{i, k} (1)} \end{cases}$
(7)
is adopted to map latent utility differences onto observed graded responses (Brown & Maydeu-Olivares, 2018). In this way, the TIRT model can handle FC scales with any block size and response format, making it an adaptable framework from which we can glean insights.
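The pairwise decomposition of a ranking and the threshold process in Equation 7 can be illustrated with a short Python sketch. The helper names and the threshold values are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from itertools import combinations


def decompose_ranking(block, ranking):
    """Recode a ranking into n(n-1)/2 pseudo dichotomous items.

    block:   statements in their presented order, e.g. ["A", "B", "C"]
    ranking: the same statements ordered from "most like me" to "least like me"
    Returns a dict mapping each pair (first, second) to 1 if the first
    statement is preferred over the second and 0 otherwise.
    """
    position = {statement: rank for rank, statement in enumerate(ranking)}
    return {(a, b): int(position[a] < position[b]) for a, b in combinations(block, 2)}


def graded_response(y_star, thresholds):
    """Map a latent utility difference onto categories 1..M (Equation 7).

    thresholds holds the M - 1 pair thresholds in ascending order; the
    category is one plus the number of thresholds not exceeding y_star.
    """
    return bisect_right(thresholds, y_star) + 1


print(decompose_ranking(["A", "B", "C"], ["A", "B", "C"]))
print(decompose_ranking(["A", "B", "C"], ["C", "A", "B"]))
print(graded_response(0.5, [-1.0, 0.0, 1.0]))  # made-up thresholds for M = 4
```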

As for the third axis, the MUPP model and the GGUM-RANK model adopt an unfolding response process while the TIRT model assumes a dominance response process, which are two fundamentally different response processes on how people respond to statements measuring various types of constructs (Drasgow et al., 2012; Roberts et al., 2000; Tay & Ng, 2018; Zhang et al., 2020a). The unfolding response process assumes an inverse U-shaped relationship between people's latent trait levels and the probability of endorsing a statement, such that people whose standings on the latent continuum are closer to the statement's location are more likely to endorse the statement than those who are further away from the statement's location (either from above or below). For example, people who feel that they work as hard as an average person are more likely to endorse the statement “I am as hardworking as an average person,” while those who are very hardworking or lazy are more likely to reject this statement. Both the MUPP model and the GGUM-RANK model adopt the dichotomous version of the Generalized Graded Unfolding Model (GGUM; Roberts et al., 2000) to model the probability that individual j will endorse statement i:
$P (y_{j i} = 1 | θ_{j}) = \frac{\exp (α_{i} [(θ_{j} - δ_{i}) - τ_{i}]) + \exp (α_{i} [2 * (θ_{j} - δ_{i}) - τ_{i}])}{1 + \exp (α_{i} [3 * (θ_{j} - δ_{i})]) + \exp (α_{i} [2 * (θ_{j} - δ_{i}) - τ_{i}]) + \exp (α_{i} [(θ_{j} - δ_{i}) - τ_{i}])} .$
(8)
Here, $α_{i}$, $δ_{i}$, and $τ_{i}$ are the discrimination, location, and threshold parameters of statement i, and $θ_{j}$ refers to the latent trait score of person j. In contrast, the dominance response process assumes that there is a monotonic relationship between people's latent trait levels and the probability of endorsing a statement, such that people with higher trait scores are more likely to endorse the statement. The TIRT model assumes that the latent utility is linked to the latent trait following a linear factor analysis model:
$t_{j i} = λ_{i} θ_{j} + ε_{j i} .$
(9)
where $λ_{i}$ represents the factor loading of statement i, $θ_{j}$ represents the latent trait score of person j, and $ε_{j i}$ represents statement uniqueness.
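To make the contrast between the two response processes concrete, the sketch below evaluates Equation 8 and the systematic part of Equation 9 over a grid of trait levels. The parameter values are made up for illustration, and the uniqueness term in Equation 9 is ignored.

```python
from math import exp


def ggum_endorse(theta, alpha, delta, tau):
    """Dichotomous GGUM endorsement probability (Equation 8)."""
    d = theta - delta
    agree = exp(alpha * (d - tau)) + exp(alpha * (2 * d - tau))
    disagree = 1.0 + exp(alpha * 3 * d)
    return agree / (agree + disagree)


def dominance_utility(theta, lam):
    """Expected utility under Equation 9, leaving out the uniqueness term."""
    return lam * theta


# Illustrative parameters: alpha = 1, delta = 0, tau = -1, lambda = 1.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta,
          round(ggum_endorse(theta, 1.0, 0.0, -1.0), 3),
          dominance_utility(theta, 1.0))
```

With these values, the unfolding probability peaks where the trait level equals the statement location and falls off symmetrically on both sides, whereas the dominance utility keeps increasing with the trait level.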

Despite the wide use of dominance models, dozens of studies have shown that the unfolding response process better describes how people respond to measures of noncognitive constructs (Drasgow et al., 2010), such as personality (Cao et al., 2015; Zhang et al., 2020a), vocational interests (Tay et al., 2009), emotional intelligence (Cho et al., 2015), job satisfaction (Carter & Dalal, 2010), attitudes (McGrane, 2019; Roberts et al., 2000), attachment styles (Sun et al., 2021), and motivation (Freund & Lohbeck, 2021). These are typically the types of constructs to which FC measurement is applied. It has been further shown that the misapplication of dominance models to unfolding data reduces the accuracy of selection decisions (Stark et al., 2006), attenuates criterion-related validity (Sun et al., 2021), and lowers the power to detect moderation effects (Cao et al., 2018; Carter et al., 2017). This suggests that the MUPP model and the GGUM-RANK model are theoretically more appropriate for noncognitive assessment than the TIRT model because they adopt the unfolding response process. Another practical advantage of the MUPP model and the GGUM-RANK model due to the use of the unfolding response process is that they allow intermediate statements (e.g., “My extraversion is about the average level”) in addition to extreme statements (e.g., “I am very extraverted” or “I am very introverted”). While extreme statements are often socially desirable or undesirable, intermediate statements are more neutral, making them good candidates for developing faking-resistant FC scales. Besides, intermediate statements also provide more information for extreme latent trait levels (Roberts et al., 2000) and thus increase measurement accuracy for people located on the two ends, often the focus of personnel selection.

In sum, the MUPP model, though theoretically more attuned to the underlying response process, was not designed for graded responses or FC scales with larger block sizes. While the GGUM-RANK model was designed for various block sizes, it cannot handle graded responses. Further complicating its use, there is currently no user-friendly software publicly available for the GGUM-RANK model, which limits its utility for researchers. On the other hand, while the TIRT model can analyze graded responses and any block size, it assumes a dominance response process, which may not align with how individuals respond to noncognitive items. As a result, there is a need for a new model that incorporates the strengths of these existing models. This model should be able to (1) handle both dichotomous and graded responses, (2) score FC scales of any block size, and (3) accommodate the unfolding response process. Additionally, this proposed model should be supported by readily accessible software designed with user-friendly features.

The Generalized Thurstonian Unfolding Model (GTUM)

To meet the goals outlined above, we propose the GTUM to incorporate the strengths of earlier models. The GTUM is based on Thurstone's Law of Comparative Judgement (Thurstone, 1927) and adopts the same notion of item utility as the TIRT model. Therefore, the decision model for FC behavior is the same as in the TIRT model, and it can handle FC scales with any block size and response format (dichotomous and graded). The major difference between the GTUM and the TIRT model lies in the relationship between statements and the underlying psychological attributes. Assuming that in block d, statements i and k measure Agreeableness and Extraversion, instead of using a dominance model (Equation 9) as in the TIRT model, we use the negative Euclidean distance between person j's trait score and the statement location to define statement utility¹ in the GTUM as
$\begin{aligned} t_{j i} &= - λ_{i} | θ_{A j} - δ_{i} | \\ t_{j k} &= - λ_{k} | θ_{E j} - δ_{k} | \end{aligned}$
(10)
where $λ$ and $θ$ represent the statement factor loading and the latent trait level, respectively, and $δ$ is the statement location parameter. Note that there is a negative sign in the equations to ensure that the statement utility is the highest when $θ$ and $δ$ are equal, or put in another way, when there is a perfect match between person characteristics and the statement content. We assume that the latent traits follow a multivariate normal distribution with latent factor means fixed at zero and factor variances fixed at 1, and correlations among factors freely estimated. The latent utility difference between two statements is given by
$y^{*}_{j d {i, k}} = t_{j i} - t_{j k} = - λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} |$
(11)
Then, we can map the latent utility difference to the observed response following the logic of the 2-Parameter Logistic model if respondents were asked to make dichotomous choices,
$P (y_{j d {i, k}} = 1 | (θ_{A j}, θ_{E j})) = \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d})},$
(12)
where $τ_{d}$ is the pair threshold parameter. Note that the statement uniqueness difference does not appear in Equation 12 because the randomness is implied by the probabilistic function. When respondents are asked to indicate their degree of preference, we follow the logic of the Graded Response Model (Samejima, 1969) to model the probability of choosing each response category given $θ_{A j}$ and $θ_{E j}$ as
$P (y_{j d {i, k}} = m | (θ_{A j}, θ_{E j})) = \begin{cases} \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d 1})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d 1})}, & m = 1 \\ \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d m})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d m})} - \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d (m - 1)})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d (m - 1)})}, & 2 \leq m \leq M - 1 \\ 1 - \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d (M - 1)})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{d (M - 1)})}, & m = M \end{cases}$
(13)
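As a minimal sketch, Equations 10, 12, and 13 can be evaluated in Python as shown below. This is not the fcscoring implementation; the function names and all parameter values are hypothetical, and the graded thresholds are assumed to be in ascending order.

```python
from math import exp


def logistic(x: float) -> float:
    return 1.0 / (1.0 + exp(-x))


def utility(theta: float, lam: float, delta: float) -> float:
    """Statement utility (Equation 10): highest, at zero, when theta equals delta."""
    return -lam * abs(theta - delta)


def gtum_dichotomous(u_i: float, u_k: float, tau_d: float) -> float:
    """Probability that the first statement is chosen (Equation 12)."""
    return logistic(u_i - u_k + tau_d)


def gtum_graded(u_i: float, u_k: float, taus):
    """Probabilities of the M response categories (Equation 13).

    taus holds the M - 1 pair thresholds in ascending order.
    """
    cumulative = [logistic(u_i - u_k + t) for t in taus] + [1.0]
    return [cumulative[0]] + [
        cumulative[m] - cumulative[m - 1] for m in range(1, len(cumulative))
    ]


# Hypothetical person and statement parameters.
u_i = utility(theta=0.5, lam=1.0, delta=0.4)    # Agreeableness statement
u_k = utility(theta=-1.0, lam=1.2, delta=1.5)   # Extraversion statement
print(gtum_dichotomous(u_i, u_k, tau_d=0.0))
print(gtum_graded(u_i, u_k, taus=[-1.5, -0.5, 0.5, 1.5]))
```

With a single threshold (M = 2), the first category probability from `gtum_graded` coincides with `gtum_dichotomous`, which shows how the graded formulation nests the dichotomous one.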
The above model applies to two-alternative FC scales where each statement appears just once in a single pair throughout the entire FC scale. However, researchers often design FC scales with more than two statements per block to increase the amount of information. In that case, each block needs to be decomposed into different pairs (e.g., ABC → AB, BC, and AC). Therefore, each statement will appear in multiple decomposed pairs, resulting in local dependence between pairs (e.g., pairs AB and AC are locally dependent even after controlling for the latent traits because they share the common statement A). To account for local dependence, we added a random block factor to all pairs from the same block. Specifically, suppose block d has three statements i, k, and l measuring Agreeableness, Extraversion, and Openness, respectively. It will be decomposed into three pairs $y_{j d {i, k}}$ , $y_{j d {i, l}}$ , and $y_{j d {k, l}}$ , whose probabilities of the first option being chosen are defined as
$\begin{aligned} P (y_{j d {i, k}} = 1 | (θ_{A j}, θ_{E j})) & = \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{i, k} + γ_{j d})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{k} | θ_{E j} - δ_{k} | + τ_{i, k} + γ_{j d})} \\ P (y_{j d {i, l}} = 1 | (θ_{A j}, θ_{O j})) & = \frac{\exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{l} | θ_{O j} - δ_{l} | + τ_{i, l} + γ_{j d})}{1 + \exp (- λ_{i} | θ_{A j} - δ_{i} | + λ_{l} | θ_{O j} - δ_{l} | + τ_{i, l} + γ_{j d})} \\ P (y_{j d {k, l}} = 1 | (θ_{E j}, θ_{O j})) & = \frac{\exp (- λ_{k} | θ_{E j} - δ_{k} | + λ_{l} | θ_{O j} - δ_{l} | + τ_{k, l} + γ_{j d})}{1 + \exp (- λ_{k} | θ_{E j} - δ_{k} | + λ_{l} | θ_{O j} - δ_{l} | + τ_{k, l} + γ_{j d})} \end{aligned}$
(14)
Equation 14 differs from Equation 12 in that a person-specific random effect, $γ_{j d}$, is added to all pairs from the same block d. Across blocks, the person random block factors are assumed to be orthogonal to one another, uncorrelated with focal latent factors, and normally distributed with a mean of zero. The variance of the random factor quantifies the degree of local dependence. To reduce model complexity, we assume all random factors have the same variance. This idea was inspired by Lee and Smith (2020), who used the same approach to account for the local dependence of FC pairs that share the same statement. The random block factors equally apply to graded FC scales. They are only a statistical device to handle local dependence and have little substantive meaning.
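The role of the random block factor in Equation 14 can be sketched as follows. The function name, the parameter values, and the standard deviation of 0.5 for the random block factor are assumptions made for illustration only.

```python
import random
from itertools import combinations
from math import exp


def block_pair_probs(theta, statements, tau, gamma):
    """P(first statement preferred) for every pair in one block (Equation 14).

    theta:      trait name -> the person's trait score
    statements: statement label -> (trait name, lambda, delta)
    tau:        (first, second) pair -> pair threshold
    gamma:      the person's random effect for this block, shared by all pairs
    """
    probs = {}
    for a, b in combinations(statements, 2):
        trait_a, lam_a, delta_a = statements[a]
        trait_b, lam_b, delta_b = statements[b]
        eta = (-lam_a * abs(theta[trait_a] - delta_a)
               + lam_b * abs(theta[trait_b] - delta_b)
               + tau[(a, b)] + gamma)
        probs[(a, b)] = 1.0 / (1.0 + exp(-eta))
    return probs


# Hypothetical triplet measuring Agreeableness (A), Extraversion (E), Openness (O).
theta = {"A": 0.3, "E": -0.8, "O": 1.1}
statements = {"i": ("A", 1.0, 0.5), "k": ("E", 0.9, -1.0), "l": ("O", 1.1, 2.0)}
tau = {("i", "k"): 0.0, ("i", "l"): 0.1, ("k", "l"): -0.1}

rng = random.Random(1)
gamma = rng.gauss(0.0, 0.5)  # assumed SD of 0.5 for the random block factor
print(block_pair_probs(theta, statements, tau, gamma))
```

Because the same `gamma` enters all three linear predictors, the pairs from one block remain correlated after conditioning on the focal traits; setting `gamma` to zero recovers the form of Equation 12 for each pair.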

Model Estimation

While the TIRT model can be easily estimated using limited information methods, such as the unweighted least squares estimator (Brown & Maydeu-Olivares, 2011, 2018) in common latent variable modeling software, the estimation of the GTUM proves to be more complex due to its inherent nonlinearity. Consequently, we opted for a Bayesian approach. This estimation method is particularly well-suited for complex models like the GTUM, as it facilitates the incorporation of reasonable priors to aid in estimation. For instance, it is straightforward to distinguish between extreme positive statements (e.g., “I am very sociable”), extreme negative statements (e.g., “I hate to be the center of attention”), and intermediate statements (e.g., “I am as talkative as an average person”). Such information can be integrated into model estimation by assigning moderately informative priors to location parameters. Furthermore, the creation of FC scales necessitates calibration studies on the psychometric properties of single statements administered in a Likert format. Statement parameters obtained from such calibration studies can also be used as priors for GTUM estimation. This approach is predicated on the assumption that statement parameters from Likert scales are moderately associated, albeit not necessarily identical, with parameters in the FC format. Therefore, they offer additional useful, though not perfectly accurate, information for the estimation of statement parameters in the FC format.

To facilitate applications of the GTUM, we developed an R package “fcscoring” that uses Stan, a probabilistic programming language for Bayesian inference and optimization (Gelman et al., 2015), as the backend estimation engine. Stan is efficient and computationally stable for large datasets and large models because it uses the no-U-turn sampler (Hoffman & Gelman, 2014), a variant of Hamiltonian Monte Carlo (HMC; Betancourt, 2013). HMC produces posterior samples that have lower autocorrelation than other MCMC algorithms, thus substantially reducing the number of iterations and improving efficiency. The R package only requires users to input raw data. It allows users to customize priors for all model parameters. We also provide several functions to facilitate model diagnostics and a tutorial in the package. The R package fcscoring can be installed by typing devtools::install_github("Naidantu/fcscoring") in R.

The Present Study

As the GTUM is new, we first conducted three Monte Carlo simulation studies to examine its statistical performance under both ideal and realistic conditions. Specifically, the first simulation study focused on the performance of the GTUM for pairs where there is no random block factor. In the Appendix, we also present the second simulation, devoted to FC scales with a block size of 3, where random block factors were invoked to handle local dependency, and a third small-scale simulation examining the convergence between GTUM- and MUPP-based trait scores. Then, we used two empirical datasets to showcase the feasibility and utility of the GTUM. Empirical data for the graded FC, data analysis script, simulation code, and supplementary tables are available on the Open Science Framework (OSF) at https://osf.io/x6749/?view_only=11f0d49373214f1c936c18eb82156dfc. Empirical data for the dichotomous FC scale were not made public because the FC scale TAPAS is proprietary.

Simulation Study: GTUM for Paired Comparisons

The main goal is to examine how accurately person and statement parameters of the GTUM can be recovered under ideal and realistic conditions for paired comparisons where each statement appears in a pair once (and only once) throughout the test so that the local dependence issue is nonexistent. Following previous studies that used 12 statements per factor and 2–5 factors when studying the statistical performance of the Zinnes and Griggs pairwise preference IRT model (Joo et al., 2021), the MUPP-2PL model (Morillo et al., 2016), the TIRT model (Brown & Maydeu-Olivares, 2011), the Thurstonian D-diffusion item response model (Bunji & Okada, 2020), and the Log-Linear model integrating response time and FC responses (Guo et al., 2023), throughout the simulation, we fixed the number of latent factors at 5 and the number of unique statements per trait at 12 to mimic classic measures of the Big Five personality factors (McCrae & Costa, 1989; Soto & John, 2017). So, there were 30 pairs in total. Correlations among the five factors were all set to .30 for simplicity, which also resembles the meta-analytical correlations among the Big Five personality factors after reverse-scoring neuroticism (van der Linden et al., 2010).
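The basic layout of this design (five traits correlated at .30, 12 statements per trait, and 30 pairs in which each statement appears once) can be sketched in Python as below. This is not the simulation code posted on OSF: the common-factor construction of the trait scores and the pairing scheme are assumptions chosen only because they reproduce the stated correlations and pair counts with the standard library.

```python
import random
from math import sqrt

N_TRAITS, N_PER_TRAIT, RHO = 5, 12, 0.30


def draw_traits(rng):
    """One person's five trait scores with unit variance and pairwise correlation RHO.

    Assumed construction: a shared standard normal component weighted by
    sqrt(RHO) plus an independent component weighted by sqrt(1 - RHO).
    """
    common = rng.gauss(0.0, 1.0)
    return [sqrt(RHO) * common + sqrt(1.0 - RHO) * rng.gauss(0.0, 1.0)
            for _ in range(N_TRAITS)]


def build_pairs():
    """Pair the 60 statements so each appears once and pair-mates differ in trait.

    Assumed scheme: statement s measures trait s % 5, and consecutive
    statements (0, 1), (2, 3), ... form the pairs.
    """
    traits = [s % N_TRAITS for s in range(N_TRAITS * N_PER_TRAIT)]
    return [((s, traits[s]), (s + 1, traits[s + 1]))
            for s in range(0, len(traits), 2)]


rng = random.Random(2024)
persons = [draw_traits(rng) for _ in range(300)]  # the smaller sample-size condition
pairs = build_pairs()
print(len(persons), len(pairs))
```

Each entry of `pairs` holds two (statement index, trait index) tuples. Any other pairing that uses every statement exactly once and never pairs two statements from the same trait would serve equally well.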

We also conducted two additional simulation studies to examine (1) the performance of the GTUM when applied to FC scales with a block size of three, where local dependency needs to be modeled, and (2) the convergence between trait scores obtained from the GTUM and the MUPP model. Results for triplets largely replicated the patterns reported below for pairs, and GTUM- and MUPP-based trait scores generally converged well. Due to space limitations, details regarding the two simulation designs and results are reported in the Appendix.

Simulation Design and Data Generation

Number of response options. Two levels of response options were investigated: 2 and 5. Two response options represent the traditional dichotomous FC format, and five response options represent the graded FC format (Brown & Maydeu-Olivares, 2018; Zhang et al., 2023a).

Sample size. Two levels of sample size were studied: 300 and 1,000. Although a sample size of 300 is generally considered small for most latent variable models (Zhang et al., 2023b), it is typical in organizational research where personality tests are frequently administered (Shen et al., 2011; Su et al., 2019). Therefore, it is important to examine how the GTUM performs in such nonideal but realistic conditions. A sample size of 1,000 was also included because we are interested in how the GTUM performs in large sample sizes.

Statement quality. Statement quality is indicated by the size of a statement's discrimination parameter ( $λ$ ), which is directly related to the amount of information the statement can provide about its designated latent factor. Two levels of statement quality were examined: average and high. In the average-quality conditions, $λ$ values were randomly sampled from U(0.75, 1.25), which matches the distribution of $λ$ we found in the empirical example presented later; in the high-quality conditions, $λ$ values were randomly sampled from U(1.25, 1.75). Note that $λ$ is only meaningfully defined in the positive space for unfolding models.

Proportion of mixed pairs. For an FC scale to be faking-resistant, it is critical to ensure that statements within a block are matched on social desirability. With the TIRT model, this almost always means that the factor loadings of statements within a block should have the same sign (i.e., all positive or all negative). With the GTUM, it mostly means that statements within a block should have similar location parameters. However, person parameters in the TIRT model are difficult to recover if statements within a block are of the same sign because such blocks provide limited information (Brown & Maydeu-Olivares, 2011; Frick et al., 2023). Accurate person score estimation requires some mixed pairs with positively and negatively keyed statements (Lee et al., 2022), which may compromise the faking resistance of an FC scale. It remains unknown whether matching statements within a block on location parameters leads to a comparable loss of information and, if so, how many mixed pairs are needed to make up for the loss. Therefore, we examined five proportions of mixed pairs: 0/6, 1/6, 2/6, 3/6, and 6/6. For mixed pairs, we ensured that the two statements had location parameters of opposite signs. Non-mixed pairs were matched based on the “degree of match” factor presented below.

Degree of match. We manipulated the degree of match between statements, operationalized as the difference in location parameters between two statements within a pair. In the close match conditions, the location parameter difference was .25; in the loose match conditions, the difference was .50. No perfect match was simulated because it is unlikely in the real world.

Statement extremity. Even though the GTUM is designed for FC scales that include both extreme and intermediate statements, writing enough good intermediate statements for some constructs may be challenging (Cao et al., 2015). Therefore, we also studied two levels of statement extremity: intermediate-including and intermediate-excluding. In the intermediate-including conditions, location parameters for the 12 statements of each factor were evenly spaced between −2.5 and 2.5; in the intermediate-excluding conditions, all location parameters lay within [−3, −1.5] or [1.5, 3]. Across both levels, we ensured that about half of the statements within each factor had positive location parameters and the other half had negative ones.
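The statement-level generating values described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' simulation code: the function names are hypothetical, and the even spacing of extreme locations within [−3, −1.5] and [1.5, 3] is an assumption, because the text gives only the ranges.

```python
import random


def draw_discriminations(n_statements, quality, rng):
    """Draw lambda from U(0.75, 1.25) (average) or U(1.25, 1.75) (high)."""
    low, high = (0.75, 1.25) if quality == "average" else (1.25, 1.75)
    return [rng.uniform(low, high) for _ in range(n_statements)]


def make_locations(n_statements, include_intermediate):
    """Return location parameters (delta) for the statements of one factor."""
    if n_statements % 2:
        raise ValueError("this sketch expects an even number of statements")
    if include_intermediate:
        # Evenly spaced over [-2.5, 2.5], as stated in the text.
        step = 5.0 / (n_statements - 1)
        return [-2.5 + i * step for i in range(n_statements)]
    # Assumption: evenly spaced within each extreme range.
    half = n_statements // 2
    step = 1.5 / (half - 1)
    negative = [-3.0 + i * step for i in range(half)]
    positive = [1.5 + i * step for i in range(half)]
    return negative + positive


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
lambdas = draw_discriminations(12, "high", rng)
with_intermediate = make_locations(12, include_intermediate=True)
extreme_only = make_locations(12, include_intermediate=False)
```

With 12 statements per factor, both settings yield six negative and six positive locations, in line with the half-and-half constraint described in the text.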

In total, there were $2 \times 2 \times 2 \times 5 \times 2 \times 2 = 160$ conditions. Under each condition, we generated statement utilities following Equation 10 and calculated latent utility differences. The threshold parameters in conditions with two response options were randomly generated from U(−1.5, 1.5). In conditions with five response options, the first threshold parameters of all pairs were randomly generated from N(−1.5, 0.01), the second from N(−.75, 0.01), the third from N(.75, 0.01), and the fourth from N(1.5, 0.01). Observed responses were generated following Equations 12 and 13. Generated data were preprocessed using the edstan package (Furr, 2017).
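As a rough check on the design, the fully crossed conditions and the number of mixed pairs implied by each proportion can be enumerated as follows. The dictionary keys and function names are hypothetical labels, and treating the second argument of N(·, 0.01) as a variance (standard deviation 0.1) is an assumption, since the text does not say which parameterization is meant.

```python
import random
from fractions import Fraction
from itertools import product

N_FACTORS = 5
STATEMENTS_PER_FACTOR = 12
# Each statement appears in exactly one pair, so 60 statements give 30 pairs.
N_PAIRS = N_FACTORS * STATEMENTS_PER_FACTOR // 2

DESIGN = {
    "n_options": [2, 5],
    "sample_size": [300, 1000],
    "quality": ["average", "high"],
    "mixed_proportion": [Fraction(0), Fraction(1, 6), Fraction(2, 6),
                         Fraction(3, 6), Fraction(1)],
    "location_diff": [0.25, 0.50],
    "extremity": ["intermediate-including", "intermediate-excluding"],
}

# Fully crossed design: one dict per simulation condition.
conditions = [dict(zip(DESIGN, levels)) for levels in product(*DESIGN.values())]

# Number of mixed pairs implied by each proportion of the 30 pairs.
mixed_pair_counts = [int(p * N_PAIRS) for p in DESIGN["mixed_proportion"]]


def draw_thresholds(n_options, rng, sd=0.1):
    """Draw the thresholds of one pair; sd=0.1 assumes 0.01 is a variance."""
    if n_options == 2:
        return [rng.uniform(-1.5, 1.5)]
    if n_options == 5:
        return [rng.gauss(m, sd) for m in (-1.5, -0.75, 0.75, 1.5)]
    raise ValueError("only 2 or 5 response options were simulated")
```

Under these assumptions, the proportions 0/6, 1/6, 2/6, 3/6, and 6/6 correspond to 0, 5, 10, 15, and 30 mixed pairs out of 30.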

Model Estimation

A truncated LogNormal(.2, .5) prior on $[0, 5]$ was adopted for the discrimination parameters ( $λ$ ); the threshold parameters ( $τ$ ) were assigned an N(0, 2) prior. As it is easy to tell extreme positive statements, extreme negative statements, and intermediate statements from each other, we also incorporated this information into the prior for $δ$ : specifically, we assigned a truncated N(1.5, 2) prior on [0, 5] if the true location parameter ( $δ$ ) was greater than .50, a truncated N(−1.5, 2) prior on [−5, 0] if the true location parameter was below −.50, and a truncated N(0, 2) prior on [−5, 5] if the true location parameter was within [−.5, .5]. A multivariate normal prior was assumed for person parameters ( $θ$ ). For the correlations among focal latent factors, the LKJ-correlation prior was assumed with a shape parameter of 1, equivalent to the uniform density over $5 \times 5$ correlation matrices (Lewandowski et al., 2009). We also used 1 as the starting value for the discrimination parameters, and −1.5, 0, and 1.5 as the starting values for extreme negative statements ( $δ < - .5$ ), intermediate statements ( $- .5 \leq δ \leq .5$ ), and extreme positive statements ( $δ > .5$ ), respectively. Starting values for other parameters were automatically generated at random by Stan from their prior distributions.
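The rule that assigns a prior and a starting value to each location parameter can be written as a small lookup. The sketch below is hypothetical helper code, not part of the authors' Stan program; the second prior argument is passed through exactly as written in the text, without assuming whether it is a standard deviation or a variance.

```python
def location_prior(delta):
    """Return (center, second_argument, lower, upper) of the truncated
    normal prior for a statement whose generating location is `delta`."""
    if delta > 0.5:          # extreme positive statement
        return (1.5, 2.0, 0.0, 5.0)
    if delta < -0.5:         # extreme negative statement
        return (-1.5, 2.0, -5.0, 0.0)
    return (0.0, 2.0, -5.0, 5.0)  # intermediate statement, -.5 <= delta <= .5


def location_start(delta):
    """Starting value for delta: -1.5, 0, or 1.5, i.e., the prior center."""
    return location_prior(delta)[0]


# Starting value for every discrimination parameter, as stated in the text.
DISCRIMINATION_START = 1.0
```

For example, a statement generated with $δ = 2.1$ would receive the N(1.5, 2) prior truncated to [0, 5] and a starting value of 1.5.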

Two chains were run so that we could assess model convergence based on the potential scale reduction (PSR), which compares the variance of the pooled samples across chains with the average variance of samples within each chain; a PSR close to 1 indicates convergence. In the present study, we considered a model converged if the maximum PSR across all parameters fell below 1.20. Chain length was set to 2,000, with the first 1,000 iterations discarded as warmup. One hundred replications per condition were performed.
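To make the convergence criterion concrete, the following sketch computes a PSR in the classic Gelman-Rubin form and applies the 1.20 cutoff. It is an illustration under stated assumptions: Stan reports a related split-chain statistic whose value can differ slightly, and the function names and toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Classic Gelman-Rubin PSR for one parameter.

    `chains` is a list of equally long lists of post-warmup draws.
    """
    n = len(chains[0])
    within = mean(variance(chain) for chain in chains)            # W
    between = n * variance([mean(chain) for chain in chains])     # B
    pooled = (n - 1) / n * within + between / n                   # pooled variance estimate
    return sqrt(pooled / within)


def converged(psr_by_parameter, cutoff=1.20):
    """Apply the rule used in the text: the maximum PSR must fall below the cutoff."""
    return max(psr_by_parameter.values()) < cutoff


# Toy draws: two chains that agree, and two chains stuck in different regions.
agreeing = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
disagreeing = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
psr_ok = potential_scale_reduction(agreeing)
psr_bad = potential_scale_reduction(disagreeing)
```

Chains that explore the same region give a PSR near (or slightly below) 1, whereas chains centered on different values inflate the pooled variance and push the PSR well above the cutoff.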

Accuracy Metrics

Bias, Absolute Bias (ABias), Root Mean Square Error (RMSE), and the Pearson Correlation between true and estimated parameters (PCorr) were calculated for $λ$ , $δ$ , and $τ$ . Bias, ABias, and RMSE were evaluated for latent correlations. For person parameters, we calculated Bias, ABias, and PCorr for each factor. Given the small differences across factors, we report the average results.
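The four metrics can be computed as in the sketch below, which uses the standard definitions (estimate minus true value for the error). The function name and the toy values are hypothetical; the toy estimates are deliberately shrunk toward zero to show that Bias can be zero while ABias and RMSE are not.

```python
from math import sqrt


def accuracy_metrics(true_values, estimates):
    """Return Bias, ABias, RMSE, and PCorr for paired true/estimated values."""
    n = len(true_values)
    errors = [e - t for t, e in zip(true_values, estimates)]
    bias = sum(errors) / n
    abias = sum(abs(d) for d in errors) / n
    rmse = sqrt(sum(d * d for d in errors) / n)
    mean_t = sum(true_values) / n
    mean_e = sum(estimates) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_values, estimates))
    ss_t = sum((t - mean_t) ** 2 for t in true_values)
    ss_e = sum((e - mean_e) ** 2 for e in estimates)
    pcorr = cov / sqrt(ss_t * ss_e)
    return {"Bias": bias, "ABias": abias, "RMSE": rmse, "PCorr": pcorr}


# Toy example: every estimate is shrunk toward zero by 25%.
true_delta = [-2.0, -1.0, 1.0, 2.0]
est_delta = [-1.5, -0.75, 0.75, 1.5]
metrics = accuracy_metrics(true_delta, est_delta)
```

In this toy case the signed errors cancel, so Bias is 0 while ABias is .375, and the rank order is perfectly preserved (PCorr of 1).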

Results

Model Convergence

Overall, the GTUM converged well, with an average convergence rate of 99% across all conditions (see Table S1). More specifically, convergence rates were above 90% in 156 out of 160 conditions (most were 100%). Only four challenging conditions with only extreme statements and five response options had convergence rates below 90% (83%–89%). We note that imperfect convergence rates for correctly specified models have also been reported for the MUPP model (Tu et al., 2023a), the GGUM-RANK model (Lee et al., 2019), and other non-FC IRT models (Jang & Cohen, 2020; Paek et al., 2018; Tay et al., 2011). Therefore, we believe the GTUM converged reasonably well. We reported results from converged replications.

Recovery of Person Scores ( $θ$ )

Four patterns can be observed in Table 2. First, the number of mixed pairs was positively related to the accuracy of person score recovery in a nonlinear way, such that the beneficial effect of additional mixed pairs plateaued once there were more than 10 mixed pairs. Second, five response options always outperformed two response options across all conditions (the average correlation difference based on Fisher's z transformation and back transformation between two and five response options was .23, with an average raw correlation difference of .07), particularly in the less favorable conditions. For example, the correlation difference based on Fisher's z transformation and back transformation between two and five response options can be as high as .32 (raw correlation difference = .12) in conditions with zero mixed pairs, 300 respondents, highly discriminating statements, and intermediate statements. Third, the exclusion of intermediate statements did not affect the performance of the GTUM in most conditions except for those with zero mixed pairs, suggesting that the GTUM may also serve as a good candidate scoring model for existing FC scales that are based on extreme statements only. However, in conditions with zero mixed pairs, the GTUM benefited more from a higher number of response options when there were intermediate statements than in the intermediate-excluding conditions. Specifically, the average correlations between true and estimated person scores based on Fisher's z transformation and back transformation for two and five response options were .66 and .79, respectively, when there were intermediate statements. However, when there were only extreme statements, the average correlations were .61 and .67. Fourth, highly discriminating statements improved the recovery of person scores in all conditions, again particularly in the less favorable conditions. For example, when there were intermediate statements, two response options, 300 respondents, and closely matched pairs with zero mixed ones, the correlations between true and estimated person scores were .56 and .72 for average and highly discriminating statements, respectively. In fact, even if there were no mixed pairs, we could still achieve accurate recovery (r = .83–.86) if we used high-quality statements and more response options. Larger samples slightly improved trait recovery. Whether statements within each pair were closely or loosely matched had no substantial impact. Average absolute bias can be found in the online Supplemental material (Table S2).
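The averaging and differencing of correlations via Fisher's z transformation and back transformation can be sketched as follows. The function names are hypothetical, and whether the authors averaged over exactly the cells used here is an assumption; the input values are taken from the zero-mixed-pairs rows of Table 2 for the conditions with intermediate statements.

```python
from math import atanh, tanh


def fisher_mean(correlations):
    """Average correlations on the z scale, then transform back to r."""
    z_values = [atanh(r) for r in correlations]
    return tanh(sum(z_values) / len(z_values))


def fisher_difference(r_low, r_high):
    """Difference between two correlations on the z scale, back-transformed."""
    return tanh(atanh(r_high) - atanh(r_low))


# Table 2, Intermediate + Extreme, 0/6 mixed pairs, two response options.
two_options = [.54, .58, .57, .61, .70, .74, .73, .76]
average_r = fisher_mean(two_options)

# Same block, lambda = 1.5, L-Diff = .50, SS = 300: .73 (2 options) vs. .85 (5 options).
z_based_difference = fisher_difference(.73, .85)
```

With these inputs, the averaged correlation rounds to .66 and the z-based difference rounds to .32 (against a raw difference of .12), which agrees with the values quoted in the text.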

Table 2.
Correlation Between True and Estimated Person Scores.

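| Statement Type | Match | Options | λ = 1, L-Diff = .25, SS = 300 | λ = 1, L-Diff = .25, SS = 1000 | λ = 1, L-Diff = .50, SS = 300 | λ = 1, L-Diff = .50, SS = 1000 | λ = 1.5, L-Diff = .25, SS = 300 | λ = 1.5, L-Diff = .25, SS = 1000 | λ = 1.5, L-Diff = .50, SS = 300 | λ = 1.5, L-Diff = .50, SS = 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Intermediate + Extreme | 0/6 | 2 | .54 | .58 | .57 | .61 | .70 | .74 | .73 | .76 |
| Intermediate + Extreme | 0/6 | 5 | .68 | .71 | .70 | .73 | .83 | .84 | .85 | .86 |
| Intermediate + Extreme | 1/6 | 2 | .70 | .72 | .70 | .72 | .82 | .84 | .82 | .84 |
| Intermediate + Extreme | 1/6 | 5 | .80 | .82 | .81 | .82 | .90 | .90 | .90 | .90 |
| Intermediate + Extreme | 2/6 | 2 | .74 | .76 | .74 | .76 | .84 | .85 | .84 | .85 |
| Intermediate + Extreme | 2/6 | 5 | .83 | .84 | .83 | .84 | .91 | .91 | .91 | .91 |
| Intermediate + Extreme | 3/6 | 2 | .75 | .76 | .75 | .76 | .85 | .85 | .84 | .85 |
| Intermediate + Extreme | 3/6 | 5 | .83 | .84 | .83 | .84 | .91 | .91 | .91 | .91 |
| Intermediate + Extreme | 6/6 | 2 | .75 | .77 | .75 | .77 | .84 | .85 | .84 | .85 |
| Intermediate + Extreme | 6/6 | 5 | .83 | .84 | .83 | .84 | .90 | .91 | .91 | .91 |
| Extreme only | 0/6 | 2 | .56 | .58 | .57 | .59 | .65 | .65 | .65 | .65 |
| Extreme only | 0/6 | 5 | .64 | .66 | .64 | .66 | .69 | .70 | .69 | .70 |
| Extreme only | 1/6 | 2 | .75 | .76 | .74 | .76 | .84 | .84 | .83 | .84 |
| Extreme only | 1/6 | 5 | .83 | .84 | .83 | .84 | .90 | .90 | .90 | .90 |
| Extreme only | 2/6 | 2 | .79 | .80 | .79 | .80 | .87 | .87 | .86 | .87 |
| Extreme only | 2/6 | 5 | .86 | .87 | .86 | .87 | .92 | .92 | .92 | .92 |
| Extreme only | 3/6 | 2 | .80 | .81 | .80 | .81 | .87 | .87 | .87 | .87 |
| Extreme only | 3/6 | 5 | .87 | .87 | .87 | .87 | .92 | .92 | .92 | .92 |
| Extreme only | 6/6 | 2 | .80 | .80 | .79 | .80 | .85 | .86 | .85 | .86 |
| Extreme only | 6/6 | 5 | .86 | .86 | .86 | .86 | .91 | .91 | .91 | .91 |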

Note. Intermediate refers to intermediate statements; Extreme refers to extreme statements; L-Diff = difference in location parameters; SS = sample size; Match = the proportion of mixed blocks; Options = the number of response options.

Recovery of Statement Parameters ( $λ, δ, τ$ )

In the main text, we present bias for $λ$ and report its absolute bias, RMSE, and correlations between true and estimated parameters in the online Supplemental material (Tables S3–S5). For $δ$ and $τ$ , we report absolute bias in the main text and other accuracy metrics in the online Supplemental material (Tables S6–S11). This choice was based on the observation that negative location and threshold parameters were overestimated and positive ones were underestimated; that is, they were shrunk toward zero. When averaged, these errors cancel each other out, so the average bias was always essentially zero.

Discrimination parameters ( $λ$ ). We can see from Table 3 that in most conditions, discrimination parameters were estimated with reasonable accuracy. Only in conditions with highly discriminating statements, zero mixed pairs, and no intermediate statements were the discrimination parameters underestimated. Larger samples brought bias slightly closer to zero. The number of response options and the degree of match had no substantial impact on the overall pattern. Tables S3–S5 show that more response options and larger sample sizes effectively brought absolute bias and RMSE closer to zero and improved the recovery of the rank order of discrimination parameters. Note that the rank order of the discrimination parameters was not as well recovered as that of other parameters, primarily due to range restriction.

Table 3.
Bias of Discrimination Parameters.

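| Statement Type | Match | Options | λ = 1, L-Diff = .25, SS = 300 | λ = 1, L-Diff = .25, SS = 1000 | λ = 1, L-Diff = .50, SS = 300 | λ = 1, L-Diff = .50, SS = 1000 | λ = 1.5, L-Diff = .25, SS = 300 | λ = 1.5, L-Diff = .25, SS = 1000 | λ = 1.5, L-Diff = .50, SS = 300 | λ = 1.5, L-Diff = .50, SS = 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Intermediate + Extreme | 0/6 | 2 | .10 | .02 | .10 | .03 | −.17 | −.10 | −.15 | −.09 |
| Intermediate + Extreme | 0/6 | 5 | .04 | .00 | .04 | .00 | −.11 | −.05 | −.10 | −.04 |
| Intermediate + Extreme | 1/6 | 2 | .09 | .02 | .09 | .02 | −.10 | −.04 | −.10 | −.04 |
| Intermediate + Extreme | 1/6 | 5 | .04 | .01 | .04 | .01 | −.07 | −.03 | −.07 | −.03 |
| Intermediate + Extreme | 2/6 | 2 | .09 | .02 | .09 | .02 | −.09 | −.03 | −.08 | −.03 |
| Intermediate + Extreme | 2/6 | 5 | .04 | .01 | .04 | .01 | −.06 | −.02 | −.06 | −.02 |
| Intermediate + Extreme | 3/6 | 2 | .08 | .02 | .09 | .02 | −.09 | −.04 | −.08 | −.03 |
| Intermediate + Extreme | 3/6 | 5 | .04 | .01 | .05 | .01 | −.06 | −.02 | −.05 | −.02 |
| Intermediate + Extreme | 6/6 | 2 | .09 | .03 | .09 | .03 | −.09 | −.04 | −.09 | −.04 |
| Intermediate + Extreme | 6/6 | 5 | .05 | .01 | .05 | .01 | −.06 | −.02 | −.05 | −.02 |
| Extreme only | 0/6 | 2 | .15 | .09 | .14 | .07 | −.22 | −.28 | −.23 | −.27 |
| Extreme only | 0/6 | 5 | .07 | .02 | .07 | .02 | −.25 | −.25 | −.25 | −.24 |
| Extreme only | 1/6 | 2 | .13 | .05 | .13 | .05 | −.04 | −.02 | −.04 | −.02 |
| Extreme only | 1/6 | 5 | .07 | .02 | .07 | .02 | −.04 | −.02 | −.04 | −.02 |
| Extreme only | 2/6 | 2 | .13 | .05 | .13 | .05 | .00 | .00 | −.01 | −.01 |
| Extreme only | 2/6 | 5 | .08 | .02 | .08 | .02 | −.02 | −.01 | −.02 | −.01 |
| Extreme only | 3/6 | 2 | .14 | .05 | .14 | .05 | .00 | .00 | .00 | .00 |
| Extreme only | 3/6 | 5 | .08 | .03 | .08 | .03 | −.01 | −.01 | −.01 | −.01 |
| Extreme only | 6/6 | 2 | .14 | .06 | .15 | .06 | .01 | .01 | .00 | .01 |
| Extreme only | 6/6 | 5 | .09 | .03 | .08 | .03 | .00 | .00 | .00 | .00 |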

Note. Intermediate refers to intermediate statements; Extreme refers to extreme statements; L-Diff = difference in location parameters; SS = sample size; Match = the proportion of mixed blocks; Options = the number of response options.

Location parameters ( $δ$ ). It can be easily seen from Table 4 that more response options, larger sample sizes, and the use of statements with higher discriminating power can effectively reduce absolute bias across all conditions with intermediate statements. However, when there were only extreme statements, more response options and the use of statements with higher discriminating power were of virtually no help in improving the estimation accuracy of location parameters. The number of mixed pairs had no substantial effect. RMSE displayed the same pattern as absolute bias. The rank order of location parameters was always accurately recovered in all conditions (r = .97–1.00).

Table 4.
Absolute Bias of Location Parameters.

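| Statement Type | Match | Options | λ = 1, L-Diff = .25, SS = 300 | λ = 1, L-Diff = .25, SS = 1000 | λ = 1, L-Diff = .50, SS = 300 | λ = 1, L-Diff = .50, SS = 1000 | λ = 1.5, L-Diff = .25, SS = 300 | λ = 1.5, L-Diff = .25, SS = 1000 | λ = 1.5, L-Diff = .50, SS = 300 | λ = 1.5, L-Diff = .50, SS = 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Intermediate + Extreme | 0/6 | 2 | .57 | .46 | .55 | .43 | .49 | .32 | .45 | .30 |
| Intermediate + Extreme | 0/6 | 5 | .43 | .30 | .40 | .28 | .31 | .21 | .29 | .20 |
| Intermediate + Extreme | 1/6 | 2 | .54 | .44 | .53 | .41 | .48 | .31 | .45 | .29 |
| Intermediate + Extreme | 1/6 | 5 | .40 | .28 | .38 | .27 | .30 | .21 | .30 | .20 |
| Intermediate + Extreme | 2/6 | 2 | .54 | .42 | .53 | .38 | .47 | .31 | .43 | .28 |
| Intermediate + Extreme | 2/6 | 5 | .39 | .28 | .38 | .26 | .31 | .22 | .29 | .22 |
| Intermediate + Extreme | 3/6 | 2 | .55 | .41 | .51 | .39 | .49 | .31 | .45 | .28 |
| Intermediate + Extreme | 3/6 | 5 | .40 | .28 | .36 | .26 | .32 | .22 | .29 | .22 |
| Intermediate + Extreme | 6/6 | 2 | .55 | .43 | .54 | .40 | .51 | .34 | .48 | .30 |
| Intermediate + Extreme | 6/6 | 5 | .41 | .28 | .37 | .26 | .33 | .25 | .30 | .23 |
| Extreme only | 0/6 | 2 | .38 | .40 | .43 | .42 | .38 | .48 | .41 | .46 |
| Extreme only | 0/6 | 5 | .38 | .40 | .39 | .41 | .42 | .50 | .42 | .45 |
| Extreme only | 1/6 | 2 | .38 | .43 | .42 | .44 | .42 | .52 | .43 | .48 |
| Extreme only | 1/6 | 5 | .38 | .45 | .39 | .42 | .44 | .46 | .41 | .40 |
| Extreme only | 2/6 | 2 | .38 | .45 | .42 | .46 | .43 | .51 | .43 | .49 |
| Extreme only | 2/6 | 5 | .39 | .45 | .40 | .42 | .44 | .46 | .42 | .41 |
| Extreme only | 3/6 | 2 | .39 | .45 | .42 | .45 | .42 | .51 | .43 | .48 |
| Extreme only | 3/6 | 5 | .39 | .45 | .39 | .42 | .45 | .48 | .42 | .41 |
| Extreme only | 6/6 | 2 | .37 | .42 | .42 | .44 | .39 | .46 | .40 | .46 |
| Extreme only | 6/6 | 5 | .39 | .46 | .39 | .43 | .43 | .49 | .41 | .45 |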

Note. Intermediate refers to intermediate statements; Extreme refers to extreme statements; L-Diff = difference in location parameters; SS = sample size; Match = the proportion of mixed blocks; Options = the number of response options.

Threshold parameters ( $τ$ ). Table 5 shows that more response options led to a more accurate recovery of threshold parameters across all conditions, even though more response options also mean more threshold parameters to be estimated. A larger sample size also improved estimation accuracy, particularly when there were only two response options. The effects of other manipulated factors were trivial. RMSE displayed the same pattern as absolute bias. The rank order of threshold parameters was accurately recovered in most conditions (r = .86–1.00) except for conditions with only extreme statements, two response options, and 300 respondents (r = .76–.85).

Table 5.
Absolute Bias of Threshold Parameters.

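| Statement Type | Match | Options | λ = 1, L-Diff = .25, SS = 300 | λ = 1, L-Diff = .25, SS = 1000 | λ = 1, L-Diff = .50, SS = 300 | λ = 1, L-Diff = .50, SS = 1000 | λ = 1.5, L-Diff = .25, SS = 300 | λ = 1.5, L-Diff = .25, SS = 1000 | λ = 1.5, L-Diff = .50, SS = 300 | λ = 1.5, L-Diff = .50, SS = 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Intermediate + Extreme | 0/6 | 2 | .45 | .35 | .48 | .37 | .45 | .32 | .51 | .33 |
| Intermediate + Extreme | 0/6 | 5 | .26 | .20 | .28 | .20 | .27 | .20 | .27 | .19 |
| Intermediate + Extreme | 1/6 | 2 | .42 | .32 | .45 | .35 | .44 | .31 | .49 | .32 |
| Intermediate + Extreme | 1/6 | 5 | .25 | .19 | .27 | .19 | .25 | .19 | .25 | .18 |
| Intermediate + Extreme | 2/6 | 2 | .42 | .33 | .46 | .35 | .45 | .32 | .49 | .33 |
| Intermediate + Extreme | 2/6 | 5 | .25 | .18 | .27 | .19 | .25 | .18 | .25 | .18 |
| Intermediate + Extreme | 3/6 | 2 | .43 | .33 | .46 | .36 | .44 | .33 | .50 | .33 |
| Intermediate + Extreme | 3/6 | 5 | .25 | .19 | .28 | .19 | .25 | .19 | .25 | .18 |
| Intermediate + Extreme | 6/6 | 2 | .47 | .37 | .50 | .39 | .47 | .37 | .55 | .38 |
| Intermediate + Extreme | 6/6 | 5 | .27 | .21 | .29 | .20 | .27 | .19 | .27 | .18 |
| Extreme only | 0/6 | 2 | .53 | .46 | .56 | .50 | .56 | .50 | .60 | .55 |
| Extreme only | 0/6 | 5 | .26 | .24 | .28 | .27 | .24 | .22 | .27 | .26 |
| Extreme only | 1/6 | 2 | .48 | .42 | .52 | .46 | .54 | .50 | .58 | .54 |
| Extreme only | 1/6 | 5 | .24 | .23 | .26 | .25 | .23 | .24 | .25 | .26 |
| Extreme only | 2/6 | 2 | .49 | .44 | .52 | .46 | .54 | .50 | .58 | .53 |
| Extreme only | 2/6 | 5 | .25 | .23 | .27 | .25 | .23 | .23 | .25 | .25 |
| Extreme only | 3/6 | 2 | .50 | .44 | .54 | .47 | .57 | .51 | .58 | .53 |
| Extreme only | 3/6 | 5 | .25 | .23 | .28 | .26 | .22 | .23 | .24 | .24 |
| Extreme only | 6/6 | 2 | .56 | .47 | .58 | .51 | .58 | .53 | .63 | .57 |
| Extreme only | 6/6 | 5 | .26 | .23 | .27 | .25 | .22 | .20 | .23 | .22 |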

Note. Intermediate refers to intermediate statements; Extreme refers to extreme statements; L-Diff = difference in location parameters; SS = sample size; Match = the proportion of mixed blocks; Options = the number of response options.

Recovery of Latent Correlations ( $ρ$ )

Bias is reported for the estimation of latent correlations. Absolute bias and RMSE can be found in the online Supplemental material (Tables S12–S13). As can be seen from Table 6, latent correlations were estimated with reasonable to good accuracy except for conditions with zero mixed pairs and highly discriminating statements, where latent correlations were severely underestimated. The degree of underestimation was particularly concerning when there were only extreme statements (a true correlation of .30 was estimated to be around −.10). Fortunately, when there were five mixed pairs, estimation accuracy quickly returned to an acceptable range.

Table 6.
Bias of Latent Correlations.

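| Statement Type | Match | Options | λ = 1, L-Diff = .25, SS = 300 | λ = 1, L-Diff = .25, SS = 1000 | λ = 1, L-Diff = .50, SS = 300 | λ = 1, L-Diff = .50, SS = 1000 | λ = 1.5, L-Diff = .25, SS = 300 | λ = 1.5, L-Diff = .25, SS = 1000 | λ = 1.5, L-Diff = .50, SS = 300 | λ = 1.5, L-Diff = .50, SS = 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Intermediate + Extreme | 0/6 | 2 | −.03 | −.02 | −.02 | −.02 | −.18 | −.09 | −.15 | −.08 |
| Intermediate + Extreme | 0/6 | 5 | −.04 | −.02 | −.04 | −.02 | −.10 | −.04 | −.10 | −.04 |
| Intermediate + Extreme | 1/6 | 2 | −.04 | −.02 | −.03 | −.02 | −.08 | −.03 | −.07 | −.03 |
| Intermediate + Extreme | 1/6 | 5 | −.03 | −.01 | −.03 | −.01 | −.06 | −.02 | −.06 | −.02 |
| Intermediate + Extreme | 2/6 | 2 | −.02 | −.01 | −.02 | −.01 | −.05 | −.02 | −.04 | −.02 |
| Intermediate + Extreme | 2/6 | 5 | −.02 | −.01 | −.02 | −.01 | −.04 | −.01 | −.04 | −.02 |
| Intermediate + Extreme | 3/6 | 2 | −.03 | −.01 | −.03 | −.01 | −.04 | −.02 | −.04 | −.02 |
| Intermediate + Extreme | 3/6 | 5 | −.02 | −.01 | −.02 | −.01 | −.04 | −.01 | −.04 | −.01 |
| Intermediate + Extreme | 6/6 | 2 | −.03 | −.01 | −.03 | −.01 | −.02 | −.01 | −.02 | −.01 |
| Intermediate + Extreme | 6/6 | 5 | −.02 | .00 | −.02 | .00 | −.03 | −.01 | −.03 | −.01 |
| Extreme only | 0/6 | 2 | −.11 | −.02 | −.11 | −.04 | −.42 | −.41 | −.43 | −.39 |
| Extreme only | 0/6 | 5 | −.09 | −.04 | −.09 | −.04 | −.40 | −.33 | −.40 | −.30 |
| Extreme only | 1/6 | 2 | −.06 | −.02 | −.06 | −.02 | −.10 | −.04 | −.11 | −.04 |
| Extreme only | 1/6 | 5 | −.03 | −.01 | −.03 | −.01 | −.07 | −.02 | −.07 | −.03 |
| Extreme only | 2/6 | 2 | −.03 | −.01 | −.02 | .00 | −.05 | −.02 | −.05 | −.02 |
| Extreme only | 2/6 | 5 | −.02 | .00 | −.01 | −.01 | −.04 | −.01 | −.04 | −.01 |
| Extreme only | 3/6 | 2 | −.02 | .00 | −.02 | .00 | −.03 | −.01 | −.03 | −.01 |
| Extreme only | 3/6 | 5 | −.01 | .00 | .00 | .00 | −.03 | −.01 | −.03 | −.01 |
| Extreme only | 6/6 | 2 | .01 | .01 | .01 | .00 | .00 | .00 | .00 | .00 |
| Extreme only | 6/6 | 5 | .00 | .00 | .01 | .00 | −.01 | .00 | −.01 | .00 |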

Note. Intermediate refers to intermediate statements; Extreme refers to extreme statements; L-Diff = difference in location parameters; SS = sample size; Match = the proportion of mixed blocks; Options = the number of response options.

Summary

We examined the psychometric performance of the GTUM in various conditions, and the results have several implications for future FC scale development and scoring. First, the most reassuring finding is that the GTUM performed well in most conditions, supporting its feasibility as an alternative FC scoring model. Second, we found strong evidence that five response options outperformed two response options in all aspects across all conditions, lending statistical support to the currently underused graded FC format. Third, we also provided evidence showing that the GTUM can handle FC scales with or without intermediate statements, making it more flexible than the TIRT model. Therefore, the GTUM can also be used to score existing FC scales where there are only extreme statements (as will be demonstrated in the empirical illustration below). Fourth, the number of mixed pairs had a strong positive impact on almost all aspects of the GTUM in a nonlinear way, such that the incremental effect of more mixed pairs gradually decreased and plateaued at 10 mixed pairs. This finding is significant because (1) to our knowledge, this is the first study to reveal that the tradeoff between test information and social desirability matching is not limited to the TIRT model but is instead a general issue, and (2) we provided an implementable recommendation for scale construction. Fifth, although the GTUM can handle FC scales with or without intermediate statements, we warn against applying the GTUM to FC scales with zero mixed pairs and all extreme statements, regardless of sample size and the number of response options, because neither person nor statement parameters can be accurately recovered. If zero mixed pairs are required, our results suggest the use of both extreme and intermediate statements.

Empirical Illustrations

After establishing the statistical performance of the GTUM, we illustrated its feasibility and utility in two organizationally relevant datasets in comparison with the MUPP and the TIRT models. These illustrations demonstrate how the GTUM can be flexibly applied to different types of FC scales. They also serve as a showcase for researchers and practitioners seeking to use the GTUM.

Samples and Measures

Sample 1. The first dataset contains 1,033 valid Amazon Mechanical Turk (MTurk) responses collected between the Fall of 2016 and the Spring of 2017 (Zhang et al., 2020b). The average age was 34.41 (SD = 12.24), and 60.61% of the participants were female. Participants responded to a static version of the Tailored Adaptive Personality Assessment System (TAPAS; Drasgow et al., 2012) that measures ten facets of the Big Five personality factors: Sociability, Dominance, Physical Condition, Selflessness, Achievement Orientation, Orderliness, Optimism, Even-Temperedness, Tolerance, and Intellectual Efficiency. The TAPAS was developed based on the unfolding response process and includes both intermediate and extreme statements. Statements were presented in pairs, and respondents were asked to choose the one that is “more like me.” Therefore, local dependence is not an issue. In addition to the TAPAS, respondents completed the Big Five Inventory (BFI; John et al., 1991), the Satisfaction with Life Scale (Diener et al., 1985), and the Core Self-Evaluation Scale (Judge et al., 2003) on a 5-point scale (1 = “Strongly disagree”; 2 = “Disagree”; 3 = “Neither agree nor disagree”; 4 = “Agree”; 5 = “Strongly agree”), as well as single-item measures of annual income and overall health.

Sample 2. The second dataset was collected from a group of 757 Chinese undergraduate students from diverse geolocations and school majors in the Fall of 2021 (Zhang et al., 2023a). The average age was 19.77 (SD = .96), and 59% of the participants were female. Participants responded to a Likert scale and a graded FC scale of the Big Five personality factors administered in a counterbalanced way. Both scales shared 60 identical statements from the Forced-Choice Five Factor Markers (FCFFM; Brown & Maydeu-Olivares, 2011). Both the Likert version and the FC version have been shown to be valid in Chinese-speaking populations (Zhang et al., 2022; Zhang et al., 2023a). Specifically, the original FCFFM has 20 blocks with 3 statements per block. We decomposed each block into three pairs (ABC → AB, AC, BC) and administered each pair separately. To minimize the potential carry-over effect due to shared statement content, we administered the 20 AB pairs first, followed by the 20 AC pairs, and then the 20 BC pairs. For each pair, respondents were asked to indicate the degree to which they prefer one statement over the other on a 5-point scale (1 = “A is much more like me”; 2 = “A is slightly more like me”; 3 = “A and B are equally like me”; 4 = “B is slightly more like me”; 5 = “B is much more like me”). For the Likert counterpart, respondents were asked to indicate the degree to which they agree with each statement on a 5-point scale (1 = “Strongly disagree”; 2 = “Disagree”; 3 = “Neither agree nor disagree”; 4 = “Agree”; 5 = “Strongly agree”). Responses to the Likert scale served as a criterion to examine the validity of FC scores. As the FCFFM was originally developed under the assumption of a dominance response process, it contains no intermediate statements.
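The decomposition of triplets into pairs and the administration order described for Sample 2 can be sketched as follows. The function name and the placeholder statement labels are hypothetical; only the ABC → AB, AC, BC decomposition and the AB-then-AC-then-BC ordering come from the text.

```python
from itertools import combinations


def decompose_blocks(blocks):
    """Split each triplet (A, B, C) into AB, AC, BC and order the pairs so
    that all AB pairs come first, then all AC pairs, then all BC pairs."""
    per_block = [list(combinations(block, 2)) for block in blocks]  # AB, AC, BC
    ordered = []
    for position in range(3):  # 0 = AB, 1 = AC, 2 = BC
        ordered.extend(pairs[position] for pairs in per_block)
    return ordered


# Placeholder labels for 20 blocks of 3 statements each (60 statements).
blocks = [(f"b{b}_A", f"b{b}_B", f"b{b}_C") for b in range(1, 21)]
administered_pairs = decompose_blocks(blocks)
```

With 20 triplets, this produces 60 pairs in which every statement appears twice, which is why a random block factor is needed to capture the resulting local dependence.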

Data Analysis

For the TAPAS in Sample 1, we first scored the responses by the MUPP model using statement parameters and scoring software provided by the test developer to obtain MUPP-based maximum a posteriori (MAP) scores². Then, we rescored the TAPAS with the GTUM and obtained the GTUM-based MAP scores. Thus, there were two sets of scores for each facet.

For the FCFFM in Sample 2, we first scored the responses by the TIRT model following Brown and Maydeu-Olivares (2018) using Mplus 8.5 (Muthén & Muthén, 1998–2017) to obtain TIRT-based MAP scores. Next, we rescored the FCFFM with the GTUM and obtained the GTUM-based MAP scores. Thus, there were two sets of scores for each of the Big Five factors.

Responses to the Likert scales were scored using the Graded Response Model (Samejima, 1969) and MAP estimates were obtained from the R package mirt (version 1.36.1; Chalmers, 2012). Responses to single-item measures were used as they were.

Results

TAPAS Results

Statement parameters. Statement parameters and their standard errors from the GTUM are shown in Table S34. Specifically, the discrimination parameters ranged from .20 to 2.81, with a mean of 1.01. The magnitude of location parameters ranged from .35 to 3.37, which correctly reflected the fact that intermediate statements were intentionally included in TAPAS. The threshold parameters were between −4.52 and 3.80.

Convergence between different scoring methods. As shown in the shaded diagonal of the upper panel of Table 7, the GTUM-based and MUPP-based scores converged highly (r's = .85–.96), supporting the validity of the GTUM.

Table 7.
Reliability, Convergent Validity, Facet Intercorrelations, and Criterion-Related Validity (TAPAS).

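| TAPAS | Reliability (GTUM / MUPP) | SOC (GTUM / MUPP) | DOM (GTUM / MUPP) | PHY (GTUM / MUPP) | SEL (GTUM / MUPP) | ACH (GTUM / MUPP) | ORD (GTUM / MUPP) | OPT (GTUM / MUPP) | EVT (GTUM / MUPP) | TOL (GTUM / MUPP) | IEF (GTUM / MUPP) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SOC | .79 / .85 | .93 | | | | | | | | | |
| DOM | .81 / .82 | .68 / .43 | .96 | | | | | | | | |
| PHY | .81 / .82 | .36 / .25 | .31 / .19 | .94 | | | | | | | |
| SEL | .65 / .74 | .12 / .11 | −.05 / .00 | −.17 / −.04 | .85 | | | | | | |
| ACH | .69 / .71 | .26 / .08 | .39 / .21 | .24 / .12 | .48 / .24 | .87 | | | | | |
| ORD | .80 / .85 | .22 / .13 | .17 / .08 | .41 / .26 | −.07 / −.03 | .49 / .27 | .96 | | | | |
| OPT | .76 / .85 | .66 / .35 | .38 / .22 | .45 / .27 | .14 / .09 | .39 / .21 | .45 / .26 | .91 | | | |
| EVT | .69 / .79 | .09 / .05 | −.20 / −.11 | .13 / .09 | .36 / .18 | .28 / .14 | .18 / .08 | .48 / .30 | .95 | | |
| TOL | .70 / .79 | .12 / .12 | .15 / .15 | .00 / .03 | .63 / .37 | .21 / .10 | −.18 / −.12 | −.01 / .07 | .18 / .10 | .92 | |
| IEF | .66 / .75 | .14 / −.01 | .44 / .26 | .01 / −.07 | .34 / .06 | .70 / .26 | .01 / −.03 | .20 / .10 | .11 / .05 | .31 / .16 | .87 |
| SS-E | .85 | .77 / .71 | .57 / .48 | .27 / .22 | .08 / .04 | .22 / .08 | .16 / .09 | .50 / .35 | −.01 / −.03 | .11 / .10 | .17 / .07 |
| SS-A | .79 | .24 / .23 | .00 / −.02 | .06 / .06 | .41 / .39 | .26 / .16 | .14 / .10 | .35 / .28 | .43 / .37 | .22 / .18 | .11 / .01 |
| SS-C | .81 | .21 / .15 | .17 / .13 | .26 / .20 | .17 / .10 | .56 / .48 | .54 / .47 | .40 / .34 | .23 / .16 | .01 / .00 | .27 / .17 |
| SS-N | .86 | −.40 / −.32 | −.24 / −.19 | −.29 / −.22 | −.06 / −.01 | −.28 / −.17 | −.29 / −.22 | −.65 / −.55 | −.50 / −.47 | .01 / .02 | −.20 / −.12 |
| SS-O | .81 | .11 / .10 | .17 / .17 | .07 / .08 | .24 / .19 | .25 / .16 | −.02 / −.02 | .06 / .04 | .07 / .06 | .34 / .32 | .33 / .29 |
| LS | .90 | .31 / .21 | .16 / .10 | .27 / .22 | .06 / .07 | .19 / .12 | .25 / .17 | .57 / .54 | .22 / .16 | −.03 / −.03 | .06 / .00 |
| CSE | .85 | .38 / .27 | .24 / .18 | .29 / .22 | .07 / .03 | .33 / .24 | .36 / .26 | .69 / .64 | .32 / .24 | −.06 / −.05 | .19 / .08 |
| Income | NA | −.02 / −.04 | −.02 / −.03 | .05 / .04 | .11 / .06 | .22 / .20 | .16 / .10 | .11 / .12 | .07 / .03 | .00 / .01 | .14 / .10 |
| Health | NA | −.29 / −.22 | −.21 / −.16 | −.43 / −.37 | .05 / .03 | −.13 / −.07 | −.21 / −.16 | −.37 / −.32 | −.15 / −.11 | −.05 / −.05 | −.03 / .03 |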

Note. SOC = sociability; DOM = dominance; PHY = physical condition; SEL = selflessness; ACH = achievement orientation; ORD = orderliness; OPT = optimism; EVT = even-tempered; TOL = tolerance; IEF = intellectual efficiency; N = neuroticism; E = extraversion; A = agreeableness; C = conscientiousness; O = openness; LS = life satisfaction; CSE = core self-evaluation; SS = single-statement (Likert) scale. Values on the shaded diagonal are convergent correlations between GTUM-based and MUPP-based scores.

Empirical reliability. Empirical reliability was calculated for each dimension using the formula suggested by Brown and Maydeu-Olivares (2018). As shown in the first two columns of Table 7, all facet scores displayed adequate to satisfactory reliability. MUPP-based scores were slightly more reliable than GTUM-based scores (M_MUPP = .80, M_GTUM = .74). This is likely because the MUPP model assumes that statement parameters are known without sampling variability, whereas the GTUM properly incorporates such uncertainty into the scoring procedure. Thus, the reliability of MUPP-based scores might have been overestimated.
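The two mean reliabilities can be reproduced from the facet-level values in the first two columns of Table 7. The short sketch below only averages the tabled reliabilities; it does not implement the empirical reliability formula itself, and the variable names are hypothetical.

```python
from statistics import mean

# Facet reliabilities from Table 7, in the order SOC, DOM, PHY, SEL, ACH,
# ORD, OPT, EVT, TOL, IEF.
gtum_reliability = [.79, .81, .81, .65, .69, .80, .76, .69, .70, .66]
mupp_reliability = [.85, .82, .82, .74, .71, .85, .85, .79, .79, .75]

mean_gtum = round(mean(gtum_reliability), 2)
mean_mupp = round(mean(mupp_reliability), 2)
```

The rounded means are .74 for the GTUM and .80 for the MUPP model, matching the values reported in the text.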

Correlation among facets. We note that the MUPP scoring procedure we adopted did not consider latent correlations among facets while the GTUM did. Therefore, we presented correlations among estimated factor scores for both MUPP-based scores and GTUM-based scores for a fairer comparison (see Table S36 for the latent correlations estimated by the GTUM). As shown in the off-diagonals of the upper panel of Table 7, GTUM-based scores, on average, had slightly higher correlations with one another than MUPP-based scores. This is probably because the GTUM incorporated correlations among facets when estimating person scores while the MUPP model did not make use of such information.

Correlations with external variables. As shown in the lower panel of Table 7, GTUM-based scores generally displayed slightly higher correlations with theoretically relevant external variables than MUPP-based scores. For example, GTUM-based Dominance scores had a correlation of .57 with extraversion measured by the BFI, while the corresponding correlation for MUPP-based Dominance scores was .48. Similarly, GTUM-based Sociability scores were correlated with life satisfaction at .31, while the correlation for MUPP-based scores was .21. The slight improvement in criterion-related validity of GTUM-based scores over MUPP-based scores likely comes from the relaxation of the strict assumption of parameter invariance across formats.

FCFFM Results

Statement parameters. As shown in Table S35, statement discrimination parameters ranged from .65 to 1.75 with a mean of 1.05. All location parameters were large in magnitude (4.10 to 8.66). This is also consistent with the fact that the FCFFM was developed based on the dominance assumption—only statements with extreme location parameters were included. The threshold parameters were largely in the range between −4 and 4. The standard deviation of the random block factor was .65, reflecting a moderate amount of local dependence.

Convergence between different scoring methods. As shown in the shaded diagonal of the upper panel of Table 8, it was clear that the GTUM-based and TIRT-based scores resulted in almost perfect convergence (r's = .99–1.00), strongly supporting the validity of the GTUM.

Table 8.
Reliability, Convergent Validity, Factor Intercorrelations, and Criterion-Related Validity (FCFFM).

FCFFM	Reliability		FC-E		FC-A		FC-C		FC-N		FC-O	
Model	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT
FC-E	.89	.86	.99									
FC-A	.90	.87	−.11	−.13	.99							
FC-C	.82	.76	−.15	−.19	.36	.42	.99					
FC-N	.86	.83	−.16	−.15	.11	.11	.12	.13	1.00			
FC-O	.84	.80	−.13	−.14	.30	.31	.22	.22	.06	.09	.99	
SS-E	.89		.86	.87	.31	.34	.06	.06	−.13	−.15	.27	.28
SS-A	.82		.40	.46	.76	.76	.10	.13	−.08	−.11	.21	.24
SS-C	.86		.11	.10	.15	.16	.80	.82	−.14	−.13	.08	.09
SS-N	.90		−.16	−.17	−.12	−.13	−.15	−.15	.81	.81	−.16	−.16
SS-O	.86		.31	.32	.16	.17	.06	.07	−.20	−.21	.80	.80

Note. N = neuroticism; E = extraversion; A = agreeableness; C = conscientiousness; O = openness; FC = forced-choice scale; SS = single statement (Likert) scale. The single values on the diagonal of the upper panel (shaded gray in the original table) are the convergence between GTUM-based and TIRT-based scores. SS scales have a single reliability value.

Empirical reliability. As shown in the first two columns of Table 8, GTUM scores were slightly more reliable than TIRT scores. One post-hoc explanation could be that the GTUM better represented how people respond to personality statements than the TIRT, thus resulting in more reliable scores. Regardless, these differences were generally small.

Correlation among factors. As can be seen from the upper panel of Table 8, correlations among the five factors were very similar for the GTUM-based and TIRT-based scores. This is unsurprising, given the almost perfect convergence between the two sets of scores.

Correlations with external variables. As shown in the lower panel of Table 8, the GTUM-based scores displayed practically identical correlations with externally measured Big Five factors as the TIRT-based scores.

Summary

Across two FC scales constructed under different assumptions and with different response formats, we found strong evidence that the GTUM can successfully extract reliable and valid scores from FC responses. Overall, the two empirical illustrations effectively supported the feasibility and utility of the GTUM to analyze a broader set of FC scales.

Discussion

The GTUM was proposed as a next-generation psychometric model for a variety of FC scales. The GTUM combines the advantages of the MUPP model and the TIRT model such that it can (1) accommodate the unfolding response process, (2) handle FC scales of any block size, and (3) score both dichotomous and graded responses. The GTUM can also be applied to FC scales without intermediate statements. Two simulation studies systematically showed that the GTUM generally performs well, even in conditions with only 300 respondents. The empirical illustrations supported the feasibility of the GTUM in real applications.

Implications

The findings of this study have several implications. First, we, for the first time, systematically showed that the tradeoff between psychometric performance and location parameter matching, which is likely to be closely related to social desirability matching, is not only limited to the TIRT model. The GTUM is not immune, either. Fortunately, including intermediate statements can substantially alleviate the negative impact of location parameter matching such that even in conditions with zero mixed pairs, person parameters can still be recovered with reasonable accuracy. However, if only extreme statements are included, the psychometric performance of the GTUM is similarly impacted by social desirability matching as the TIRT model (Lee et al., 2022). Another practical advantage of intermediate statements is that they are more likely to be faking-resistant because they are often more neutral than extreme statements. When designing FC scales for use in high-stakes situations, we can at least pair extreme statements with intermediate statements instead of pairing extreme positive statements with extreme negative statements, the former of which should be less fakable than the latter. Overall, even though both the GTUM and the TIRT model are impacted by social desirability matching, the GTUM suffers less than the TIRT model due to its unique capability to leverage intermediate statements. Those troubled by the tradeoff between social desirability matching and psychometric information are encouraged to consider the GTUM and intermediate statements for scale development and scoring.

Second, our findings strongly support using the graded FC format over the traditional dichotomous FC format. This finding is not surprising as numerous studies have shown that more response options always lead to higher reliability in Likert scales. However, unlike Likert scales where many polytomous IRT models have been developed to make full use of graded responses, currently, only the TIRT model can handle graded responses in FC scales. The GTUM represents an important addition to the family of FC IRT models that can leverage the advantages of graded responses (Zhang et al., 2023a). In fact, we may consider the TIRT model as a model practically subsumed under the GTUM, because the GTUM is equally capable of handling FC scales constructed under the dominance assumption. However, the TIRT model cannot handle FC scales with intermediate statements.

Practical Considerations

Given the complexity of the GTUM and its estimation, users may have some questions about using this model. To aid researchers and practitioners in applying this model to FC scale construction and scoring, we provide brief discussions on practical issues and suggestions below.

How to best construct FC scales for accurate estimation of trait scores?

According to the simulation findings, several steps can be taken to construct high-quality FC scales. First, researchers should always use high-quality statements. Our simulations demonstrated that high-quality statements consistently lead to superior psychometric performance. Therefore, the importance of crafting high-quality statements should never be underestimated. Second, incorporating both intermediate and extreme statements effectively achieves robust psychometric performance and adequate social desirability matching. Achieving this balance becomes challenging if only extreme statements are used. Third, for optimal psychometric results, it is recommended to include 5–10 mixed statement pairs. However, caution should be exercised with the use of mixed pairs as they may make FC scales vulnerable to faking. Fourth, larger block sizes can enhance reliability. For example, with a pool of 60 statements, 30 unique pairs can be formed. If researchers adopt a block size of 3, they can produce 20 unique triplets, which can further be broken down into 60 unique pairs. The latter scale provides approximately twice the information of the former. However, it is advisable to maintain block sizes below six due to the cognitive burden associated with larger blocks. Moreover, the block size should not exceed the total number of latent traits intended for measurement; incorporating multiple statements from the same trait into the same block may be psychometrically ineffective. If, for any reason, the block size has to be set at two, utilizing the graded response format is encouraged, as it has proven effective in our simulations and previous empirical studies (Brown & Maydeu-Olivares, 2018; Zhang et al., 2023a). Regardless, researchers are strongly encouraged to use the autoFC R package (Li et al., 2022) to automate the test construction process.
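The pair counts in the example above follow from simple counting. As a rough sketch, assume a pool of $S$ statements split into non-overlapping blocks of size $n$ (the symbols $S$, $n$, and $P$ are introduced here for illustration only and are not part of the model's notation). Each block yields $\binom{n}{2}$ pairwise comparisons, so the total number of pairs is

$$
P(n) = \frac{S}{n} \times \binom{n}{2} = \frac{S}{n} \times \frac{n(n-1)}{2} = \frac{S(n-1)}{2}.
$$

With $S = 60$, this gives $P(2) = 30$ and $P(3) = 60$, matching the figures in the text. Because pairs derived from the same block share statements and are therefore locally dependent, the number of pairs is only an approximate guide to the amount of information gained.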

How should I choose among the MUPP model, the GGUM-RANK model, the TIRT model, and the GTUM?

The answer is: it depends. When there are only two statements per block and intermediate statements are intentionally included, both the GTUM and the MUPP model with simultaneous estimation of person and statement parameters can be used, though the GTUM may be preferred for its parsimony; when the graded response format is adopted, and intermediate statements are included, the GTUM is the only option; when there are at least three statements per block and intermediate statements are intentionally included, both the GTUM and the GGUM-RANK model are appropriate, though the GTUM is more accessible to many researchers. Both the GTUM and the TIRT model can be used if users are sure that no intermediate statements are present, though the TIRT might be preferred in this case due to faster computation. When users are not sure whether statements are functioning as intermediate statements, the GTUM is recommended, as it allows the empirical examination of the extremity of statements. It is important to note that some seemingly extreme statements can turn out to be intermediate, and vice versa. Therefore, if time permits, it does not hurt to fit both the GTUM and the TIRT model to empirically decide which one is more appropriate.
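The decision rules in the paragraph above can be summarized as a small lookup. The following is only an illustrative sketch: the function name `recommend_models` and its arguments are hypothetical and are not part of the fcscoring package or any other software mentioned here. The first model in each returned list is the one the paragraph tentatively prefers.

```python
def recommend_models(block_size, graded, intermediate):
    """Return candidate FC IRT models, tentatively preferred model first.

    block_size   -- number of statements per block (>= 2)
    graded       -- True for graded responses, False for dichotomous
    intermediate -- True if intermediate statements are intentionally
                    included, False if the user is sure there are none,
                    None if the user is unsure
    """
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    if intermediate is None:
        # Unsure about extremity: the GTUM lets it be examined empirically.
        return ["GTUM"]
    if not intermediate:
        # Dominance-type scale: TIRT may be preferred for faster computation.
        return ["TIRT", "GTUM"]
    if graded:
        # Graded responses plus intermediate statements: GTUM only.
        return ["GTUM"]
    if block_size == 2:
        # MUPP here means simultaneous estimation of person and statement
        # parameters; the GTUM may be preferred for parsimony.
        return ["GTUM", "MUPP"]
    # Three or more statements per block, dichotomous (ranking) responses.
    return ["GTUM", "GGUM-RANK"]


print(recommend_models(block_size=3, graded=False, intermediate=True))
```

As the text notes, when time permits it does not hurt to fit both the GTUM and the TIRT model and compare them empirically, so a lookup of this kind is a starting point rather than a final answer.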

What is the sample size requirement for the GTUM?

Before discussing the tentative recommendation, we note that sample size planning is very complex, and the exact answer depends on various factors. Thus, our recommendation should not be used as a gold standard. Instead, our aim is to provide a recommendation that would lead to reasonable parameter estimates. The simulation showed that with 300 respondents, person parameters can be recovered as well as when there were 1,000 respondents; statement parameters can also be recovered with reasonable accuracy. Therefore, we consider 300 as a safe option, especially when the focus is on estimating person scores. Beyond this, we also ran some small-scale simulations and found that person scores could be recovered reasonably well even when the sample size was 200. However, statement parameters cannot be accurately estimated in that case.

How to choose priors?

Our experience with Bayesian estimation of unfolding models in general, and of the GTUM in particular, suggests that choosing appropriate priors for location parameters is very important. Specifically, it is critical to let the algorithm know which statements are extreme positive statements and which are extreme negative statements. Fortunately, this is easy to tell from the statement content. For statements that we are sure are extreme positive, we can choose a normal prior with a mean of 1.5 or 2 and a variance of 4 or 9, and constrain them to be within a reasonable positive range; for statements that we are sure are extreme negative, we can choose a normal prior with a mean of −1.5 or −2 and a variance of 4 or 9, and constrain them to be within a reasonable negative range; for statements whose directionality or extremity we are not sure about, we can choose a normal prior with a mean of 0 and a variance of 4 or 9, and constrain them to be within a reasonable range symmetrical around zero. These priors and constraints are moderately informative and reasonable. In fact, we also tested the use of noninformative normal priors with a variance of 100 for location parameters (using the same range constraints as discussed above), and the results were close to those obtained with moderately informative priors. When available, semi-informative priors obtained from the calibration of single statements could speed up model convergence. When applying the model to empirical scenarios that differ from the current settings, researchers may conduct a prior sensitivity analysis, as was done here, by comparing the parameter estimates obtained under different choices of priors. Although we recommend moderately informative priors for other parameters (as used in the simulation, because we know the approximate range within which most parameters will lie), they are less critical than the priors for location parameters, especially when the sample size is large. Therefore, users can use less informative priors for other parameters in the model if they want the data to dominate the posterior distributions.
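To make the prior recommendations above concrete, the sketch below evaluates the log-density of a range-constrained (truncated) normal prior for a location parameter. It is an illustration only: the bounds in `LOCATION_PRIORS` are hypothetical placeholders, because the text specifies only "a reasonable range", and the names `LOCATION_PRIORS`, `truncated_normal_logpdf`, and `location_log_prior` are not taken from the fcscoring package.

```python
import math

# Means and variances follow the text; the bounds are ASSUMED placeholders.
LOCATION_PRIORS = {
    "extreme_positive": {"mean": 2.0, "variance": 4.0, "lower": 0.0, "upper": 5.0},
    "extreme_negative": {"mean": -2.0, "variance": 4.0, "lower": -5.0, "upper": 0.0},
    "unsure": {"mean": 0.0, "variance": 4.0, "lower": -5.0, "upper": 5.0},
}


def _normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_logpdf(x, mean, variance, lower, upper):
    """Log-density of a normal(mean, variance) truncated to [lower, upper]."""
    if not lower <= x <= upper:
        return -math.inf
    sd = math.sqrt(variance)
    z = (x - mean) / sd
    log_kernel = -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)
    # Probability mass of the untruncated normal inside the bounds.
    mass = _normal_cdf((upper - mean) / sd) - _normal_cdf((lower - mean) / sd)
    return log_kernel - math.log(mass)


def location_log_prior(x, kind):
    """Log-prior for a location value x under one of the three statement types."""
    p = LOCATION_PRIORS[kind]
    return truncated_normal_logpdf(x, p["mean"], p["variance"], p["lower"], p["upper"])


print(location_log_prior(1.5, "extreme_positive"))
print(location_log_prior(-1.0, "extreme_positive"))  # outside the assumed range
```

Note that the text states priors in terms of variance, whereas Stan's normal distribution is parameterized by the standard deviation, so a variance of 4 corresponds to a standard deviation of 2 and a variance of 9 to a standard deviation of 3.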

How to speed up computation?

One critical issue with Bayesian approaches implemented in MCMC algorithms is that they are much slower than frequentist approaches, and the computation time increases with sample size, scale length, and the number of response options. Therefore, it is desirable to speed up computation. One potential approach is to use the variational Bayes method available through the vb function of RStan. Variational approximation is a class of analytical techniques for approximating high-dimensional integrals; the key idea is to approximate the intractable integrals with a simple tractable form and thus create a lower bound on the marginal likelihood. We can then maximize this more tractable lower bound (Jeon et al., 2017). The main advantage of variational Bayes is computational efficiency. In our small-scale experiment, with 1,000 respondents and 20 blocks of size 3, MCMC took about 10 h to finish, while variational Bayes took only about 15 min. Person scores were estimated moderately less accurately (r's = .65–.75) than those from MCMC (r's = .80–.90), and the recovery of statement parameters was not satisfactory. Therefore, if person scores are the focus and fast computation is required, variational Bayes may be a practical option. However, if accurate statement parameter estimation is desired, the current variational Bayes approach is not recommended. Note that the variational Bayes approach implemented in Stan is a beta version and thus may not perform optimally. Testing of future formal versions is encouraged.
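The lower bound mentioned above can be written out explicitly. As a generic sketch (the symbols are introduced here for illustration only: $y$ denotes the observed responses, $\theta$ collects all model parameters, and $q(\theta)$ is the simple approximating distribution), Jensen's inequality gives

$$
\log p(y) = \log \int p(y, \theta)\, d\theta \;\geq\; \mathbb{E}_{q(\theta)}\big[\log p(y, \theta)\big] - \mathbb{E}_{q(\theta)}\big[\log q(\theta)\big].
$$

The gap between the two sides is the Kullback-Leibler divergence between $q(\theta)$ and the posterior $p(\theta \mid y)$, so maximizing the right-hand side over a restricted family for $q$ trades exactness for speed. This is consistent with the faster but less accurate results reported above.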

What if my model does not converge?

Simulation results showed that model convergence may be an issue when there are at least three statements per block. The post-hoc exploration showed that relaxing the bounds on the location parameters is an effective way to facilitate convergence. This is also supported by the empirical example. Specifically, fitting the GTUM to the second dataset with the location parameters constrained to [−5, 5] resulted in nonconvergence. After relaxing the range to [−10, 10], all model parameters converged well. Users can also increase the number of MCMC chains and the number of iterations per chain, both of which are generally recommended (Jiang & Carter, 2019).

Limitations and Future Directions

Despite its strengths, the present study has the following limitations. First, our simulation suggested that the use of intermediate statements may be particularly helpful in cases where zero mixed blocks are required. Future researchers are encouraged to examine this potential opportunity with empirical data. Second, the simulation suggested that 10 mixed pairs suffice for accurate person score recovery. However, we do not know the degree to which 10 mixed pairs will induce faking and how faking may compromise the validity of scores. Empirical evidence is needed to gauge the degree of tradeoffs between faking resistance and psychometric information. Third, no model fit indices are currently available for the GTUM except general relative fit indices such as the leave-one-out cross-validation information criterion, the deviance information criterion, and posterior predictive checks. Future studies are encouraged to adapt existing IRT model fit indices (Nye et al., 2020) or develop new indices to evaluate the relative and absolute fit of the GTUM. Fourth, the current GTUM does not allow the incorporation of covariates into the estimation process. We highly encourage future research to add this functionality, as previous studies have shown that the incorporation of covariates into the estimation process can effectively improve the accuracy of person score recovery (Curran et al., 2018; Joo et al., 2022; Tu et al., 2021, 2023b). Fifth, we recommend that future researchers derive information functions for the GTUM to facilitate the combination of the GTUM with computerized adaptive testing to increase test efficiency (Stark et al., 2012). Sixth, though most conditions converged well, there were still a few conditions with suboptimal convergence rates. It remains unknown why these conditions had convergence issues. Future researchers are strongly encouraged to systematically examine potential factors associated with nonconvergence and develop more effective ways to improve model convergence. Seventh, following most of the previous papers, we assumed that the latent variables followed a normal distribution. However, this assumption may not hold for all constructs, such as dark personality. How robust GTUM estimates are against the nonnormality of latent variable distributions remains an open question. While some of the previous studies have found mixed evidence regarding the impact of nonnormality on model parameter estimates (Flora & Curran, 2004; Wang et al., 2018), no one has examined its impact on FC IRT models. Future researchers are strongly encouraged to (1) examine the impact of nonnormality on GTUM estimates, and (2) investigate the effectiveness of methods like the empirical histogram and Ramsay-curve (Woods, 2007a, 2007b) in alleviating its impact.

Conclusion

We proposed the GTUM as a more versatile IRT model for various types of FC scales by combining the advantages of the MUPP model and the TIRT model. Simulation studies and empirical illustrations supported the feasibility and utility of the GTUM. The accompanying R package fcscoring was also developed to facilitate the application of the GTUM.

Supplemental Material

Supplemental material, sj-xlsx-1-orm-10.1177_10944281231210481 for The Generalized Thurstonian Unfolding Model (GTUM): Advancing the Modeling of Forced-Choice Data by Bo Zhang, Naidan Tu, Lawrence Angrave, Susu Zhang, Tianjun Sun, Louis Tay and Jian Li in Organizational Research Methods

FC IRT Model	Authors	Block Size	Decision Model	Response Process	Dimensionality	Graded Response	Response Time
1. Multi-Unidimensional Paired Preference Model	Stark et al. (2005)	2	Andrich's Forced Endorsement Model	Unfolding	Multi	No	No
2. GGUM-RANK	Lee et al. (2019)	>=2	Andrich's Forced Endorsement Model	Unfolding	Multi	No	No
3. Zinnes-Griggs Model	Zinnes and Griggs (1974)	2	Bradley-Terry	Unfolding	Uni	No	No
4. Andrich's Squared Difference Model	Andrich (1989)	2	Bradley-Terry	Unfolding	Uni	No	No
5. Andrich's Hyperbolic Model	Andrich (1995)	2	Bradley-Terry	Unfolding	Uni	No	No
6. Zinnes-Griggs Pairwise Preference Item Response Theory Model	Joo et al. (2021)	2	Thurstone's Law of Comparative Judgement	Unfolding	Multi	No	No
7. Forced-Choice Ranking Models	Hung and Huang (2022)	>=3	Andrich's Forced Endorsement Model	Unfolding	Multi	No	No
8. Generalized Thurstonian Unfolding Model	Current study	>=2	Thurstone's Law of Comparative Judgement	Unfolding	Multi	Yes	No
9. Thurstonian Item Response Theory Model	Brown and Maydeu-Olivares (2011)	>=2	Thurstone's Law of Comparative Judgement	Dominance	Multi	Yes	No
10. Multi-Unidimensional Paired Preference-2PL	Morillo et al. (2016)	2	Andrich's Forced Endorsement Model	Dominance	Multi	No	No
11. Bayesian Random Block Item Response Theory Model	Lee and Smith (2020)	>=3	Andrich's Forced Endorsement Model	Dominance	Multi	No	No
12. Thurstonian D-diffusion Item Response Model	Bunji and Okada (2020)	2	Thurstone's Law of Comparative Judgement	Dominance	Multi	No	Yes
13. Linear Ballistic Accumulator Item Response Theory Model	Bunji and Okada (2022)	2	Thurstone's Law of Comparative Judgement	Dominance	Multi	No	Yes
14. JRT-TIRT	Guo et al. (2023)	>=2	Thurstone's Law of Comparative Judgement	Dominance	Multi	No	Yes
15. The Faking Mixture Model	Frick (2022)	>=2	Thurstone's Law of Comparative Judgement	Dominance	Multi	No	No
16. The Careless Response Mixture Model (M-TCIR)	Peng et al. (2023)	>=2	Thurstone's Law of Comparative Judgement	Dominance	Multi	No	No

Statement Type	Match	Options	λ = 1	λ = 1.5
Intermediate + Extreme	0/6	2	.54	.58	.57	.61	.70	.74	.73	.76
5	.68	.71	.70	.73	.83	.84	.85	.86
1/6	2	.70	.72	.70	.72	.82	.84	.82	.84
5	.80	.82	.81	.82	.90	.90	.90	.90
2/6	2	.74	.76	.74	.76	.84	.85	.84	.85
5	.83	.84	.83	.84	.91	.91	.91	.91
3/6	2	.75	.76	.75	.76	.85	.85	.84	.85
5	.83	.84	.83	.84	.91	.91	.91	.91
6/6	2	.75	.77	.75	.77	.84	.85	.84	.85
5	.83	.84	.83	.84	.90	.91	.91	.91
Extreme only	0/6	2	.56	.58	.57	.59	.65	.65	.65	.65
5	.64	.66	.64	.66	.69	.70	.69	.70
1/6	2	.75	.76	.74	.76	.84	.84	.83	.84
5	.83	.84	.83	.84	.90	.90	.90	.90
2/6	2	.79	.80	.79	.80	.87	.87	.86	.87
5	.86	.87	.86	.87	.92	.92	.92	.92
3/6	2	.80	.81	.80	.81	.87	.87	.87	.87
5	.87	.87	.87	.87	.92	.92	.92	.92
6/6	2	.80	.80	.79	.80	.85	.86	.85	.86
5	.86	.86	.86	.86	.91	.91	.91	.91

Statement Type	Match	Options	λ = 1	λ = 1.5
Intermediate + Extreme	0/6	2	.10	.02	.10	.03	−.17	−.10	−.15	−.09
5	.04	.00	.04	.00	−.11	−.05	−.10	−.04
1/6	2	.09	.02	.09	.02	−.10	−.04	−.10	−.04
5	.04	.01	.04	.01	−.07	−.03	−.07	−.03
2/6	2	.09	.02	.09	.02	−.09	−.03	−.08	−.03
5	.04	.01	.04	.01	−.06	−.02	−.06	−.02
3/6	2	.08	.02	.09	.02	−.09	−.04	−.08	−.03
5	.04	.01	.05	.01	−.06	−.02	−.05	−.02
6/6	2	.09	.03	.09	.03	−.09	−.04	−.09	−.04
5	.05	.01	.05	.01	−.06	−.02	−.05	−.02
Extreme only	0/6	2	.15	.09	.14	.07	−.22	−.28	−.23	−.27
5	.07	.02	.07	.02	−.25	−.25	−.25	−.24
1/6	2	.13	.05	.13	.05	−.04	−.02	−.04	−.02
5	.07	.02	.07	.02	−.04	−.02	−.04	−.02
2/6	2	.13	.05	.13	.05	.00	.00	−.01	−.01
5	.08	.02	.08	.02	−.02	−.01	−.02	−.01
3/6	2	.14	.05	.14	.05	.00	.00	.00	.00
5	.08	.03	.08	.03	−.01	−.01	−.01	−.01
6/6	2	.14	.06	.15	.06	.01	.01	.00	.01
5	.09	.03	.08	.03	.00	.00	.00	.00

Statement Type	Match	Options	λ = 1	λ = 1.5
Intermediate + Extreme	0/6	2	.57	.46	.55	.43	.49	.32	.45	.30
5	.43	.30	.40	.28	.31	.21	.29	.20
1/6	2	.54	.44	.53	.41	.48	.31	.45	.29
5	.40	.28	.38	.27	.30	.21	.30	.20
2/6	2	.54	.42	.53	.38	.47	.31	.43	.28
5	.39	.28	.38	.26	.31	.22	.29	.22
3/6	2	.55	.41	.51	.39	.49	.31	.45	.28
5	.40	.28	.36	.26	.32	.22	.29	.22
6/6	2	.55	.43	.54	.40	.51	.34	.48	.30
5	.41	.28	.37	.26	.33	.25	.30	.23
Extreme only	0/6	2	.38	.40	.43	.42	.38	.48	.41	.46
5	.38	.40	.39	.41	.42	.50	.42	.45
1/6	2	.38	.43	.42	.44	.42	.52	.43	.48
5	.38	.45	.39	.42	.44	.46	.41	.40
2/6	2	.38	.45	.42	.46	.43	.51	.43	.49
5	.39	.45	.40	.42	.44	.46	.42	.41
3/6	2	.39	.45	.42	.45	.42	.51	.43	.48
5	.39	.45	.39	.42	.45	.48	.42	.41
6/6	2	.37	.42	.42	.44	.39	.46	.40	.46
5	.39	.46	.39	.43	.43	.49	.41	.45

Statement Type	Match	Options	λ = 1	λ = 1.5
Intermediate + Extreme	0/6	2	.45	.35	.48	.37	.45	.32	.51	.33
5	.26	.20	.28	.20	.27	.20	.27	.19
1/6	2	.42	.32	.45	.35	.44	.31	.49	.32
5	.25	.19	.27	.19	.25	.19	.25	.18
2/6	2	.42	.33	.46	.35	.45	.32	.49	.33
5	.25	.18	.27	.19	.25	.18	.25	.18
3/6	2	.43	.33	.46	.36	.44	.33	.50	.33
5	.25	.19	.28	.19	.25	.19	.25	.18
6/6	2	.47	.37	.50	.39	.47	.37	.55	.38
5	.27	.21	.29	.20	.27	.19	.27	.18
Extreme only	0/6	2	.53	.46	.56	.50	.56	.50	.60	.55
5	.26	.24	.28	.27	.24	.22	.27	.26
1/6	2	.48	.42	.52	.46	.54	.50	.58	.54
5	.24	.23	.26	.25	.23	.24	.25	.26
2/6	2	.49	.44	.52	.46	.54	.50	.58	.53
5	.25	.23	.27	.25	.23	.23	.25	.25
3/6	2	.50	.44	.54	.47	.57	.51	.58	.53
5	.25	.23	.28	.26	.22	.23	.24	.24
6/6	2	.56	.47	.58	.51	.58	.53	.63	.57
5	.26	.23	.27	.25	.22	.20	.23	.22

Statement Type	Match	Options	λ = 1	λ = 1.5
Intermediate + Extreme	0/6	2	−.03	−.02	−.02	−.02	−.18	−.09	−.15	−.08
5	−.04	−.02	−.04	−.02	−.10	−.04	−.10	−.04
1/6	2	−.04	−.02	−.03	−.02	−.08	−.03	−.07	−.03
5	−.03	−.01	−.03	−.01	−.06	−.02	−.06	−.02
2/6	2	−.02	−.01	−.02	−.01	−.05	−.02	−.04	−.02
5	−.02	−.01	−.02	−.01	−.04	−.01	−.04	−.02
3/6	2	−.03	−.01	−.03	−.01	−.04	−.02	−.04	−.02
5	−.02	−.01	−.02	−.01	−.04	−.01	−.04	−.01
6/6	2	−.03	−.01	−.03	−.01	−.02	−.01	−.02	−.01
5	−.02	.00	−.02	.00	−.03	−.01	−.03	−.01
Extreme only	0/6	2	−.11	−.02	−.11	−.04	−.42	−.41	−.43	−.39
5	−.09	−.04	−.09	−.04	−.40	−.33	−.40	−.30
1/6	2	−.06	−.02	−.06	−.02	−.10	−.04	−.11	−.04
5	−.03	−.01	−.03	−.01	−.07	−.02	−.07	−.03
2/6	2	−.03	−.01	−.02	.00	−.05	−.02	−.05	−.02
5	−.02	.00	−.01	−.01	−.04	−.01	−.04	−.01
3/6	2	−.02	.00	−.02	.00	−.03	−.01	−.03	−.01
5	−.01	.00	.00	.00	−.03	−.01	−.03	−.01
6/6	2	.01	.01	.01	.00	.00	.00	.00	.00
5	.00	.00	.01	.00	−.01	.00	−.01	.00

TAPAS	Reliability	SOC	DOM	PHY	SEL	ACH	ORD	OPT	EVT	TOL	IEF
SOC	.79	.85	.93
DOM	.81	.82	.68	.43	.96
PHY	.81	.82	.36	.25	.31	.19	.94
SEL	.65	.74	.12	.11	−.05	.00	−.17	−.04	.85
ACH	.69	.71	.26	.08	.39	.21	.24	.12	.48	.24	.87
ORD	.80	.85	.22	.13	.17	.08	.41	.26	−.07	−.03	.49	.27	.96
OPT	.76	.85	.66	.35	.38	.22	.45	.27	.14	.09	.39	.21	.45	.26	.91
EVT	.69	.79	.09	.05	−.20	−.11	.13	.09	.36	.18	.28	.14	.18	.08	.48	.30	.95
TOL	.70	.79	.12	.12	.15	.15	.00	.03	.63	.37	.21	.10	−.18	−.12	−.01	.07	.18	.10	.92
IEF	.66	.75	.14	−.01	.44	.26	.01	−.07	.34	.06	.70	.26	.01	−.03	.20	.10	.11	.05	.31	.16	.87
SS-E	.85	.77	.71	.57	.48	.27	.22	.08	.04	.22	.08	.16	.09	.50	.35	−.01	−.03	.11	.10	.17	.07
SS-A	.79	.24	.23	.00	−.02	.06	.06	.41	.39	.26	.16	.14	.10	.35	.28	.43	.37	.22	.18	.11	.01
SS-C	.81	.21	.15	.17	.13	.26	.20	.17	.10	.56	.48	.54	.47	.40	.34	.23	.16	.01	.00	.27	.17
SS-N	.86	−.40	−.32	−.24	−.19	−.29	−.22	−.06	−.01	−.28	−.17	−.29	−.22	−.65	−.55	−.50	−.47	.01	.02	−.20	−.12
SS-O	.81	.11	.10	.17	.17	.07	.08	.24	.19	.25	.16	−.02	−.02	.06	.04	.07	.06	.34	.32	.33	.29
LS	.90	.31	.21	.16	.10	.27	.22	.06	.07	.19	.12	.25	.17	.57	.54	.22	.16	−.03	−.03	.06	.00
CSE	.85	.38	.27	.24	.18	.29	.22	.07	.03	.33	.24	.36	.26	.69	.64	.32	.24	−.06	−.05	.19	.08
Income	NA	−.02	−.04	−.02	−.03	.05	.04	.11	.06	.22	.20	.16	.10	.11	.12	.07	.03	.00	.01	.14	.10
Health	NA	−.29	−.22	−.21	−.16	−.43	−.37	.05	.03	−.13	−.07	−.21	−.16	−.37	−.32	−.15	−.11	−.05	−.05	−.03	.03

Appendix 1: GTUM for Triplets

The primary goal of this simulation study is to examine the performance of the GTUM for FC scales with a block size of three, where local dependence emerges. To be consistent with Study 1, we fixed the number of latent factors at 5, the number of statements per factor at 12, and the latent correlations among the factors at .30. In total, there were 20 blocks with a block size of three. After decomposing these blocks, there were 60 pairs (pseudo items), and each statement appeared in two pairs.
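The decomposition of triplets into pairs described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the statement IDs and the grouping of consecutive IDs into blocks are placeholders (in an actual multidimensional design, each triplet would combine statements from different factors), and `decompose_blocks` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def decompose_blocks(blocks):
    """Expand each block into all within-block statement pairs (pseudo items)."""
    return [pair for block in blocks for pair in combinations(block, 2)]


# 60 placeholder statement IDs grouped into 20 triplets.
blocks = [tuple(range(start, start + 3)) for start in range(0, 60, 3)]
pairs = decompose_blocks(blocks)

# Count how many pairs each statement appears in.
appearances = Counter(statement for pair in pairs for statement in pair)

print(len(blocks), len(pairs))    # 20 60
print(set(appearances.values()))  # {2}
```

Because the three pairs from one triplet share statements, their responses are not independent, which is the local dependence that the random block factor is meant to capture.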

Appendix 2. Convergence Between GTUM- and MUPP-based Trait Scores

One reviewer raised a thoughtful question about the practical equivalence between the GTUM and the MUPP model: if we fit both models to the same dataset and obtain trait scores, to what degree will the two sets of scores be correlated? Therefore, we conducted additional simulations to examine the degree to which estimated trait scores from the two models converge. Specifically, we manipulated (1) the proportion of mixed pairs, (2) statement quality, (3) degree of match, and (4) statement extremity in the same way as the focal study. Sample size, the number of response options, and scale length were fixed at 300, 2, and 30, respectively. In total, there were $5 \times 2 \times 2 \times 2 = 40$ conditions. In each condition, 100 datasets were generated according to the GTUM. Each generated dataset was fitted with both the GTUM and the MUPP model. The MUPP model was fitted using the source code from the fcirt R package (Tu et al., 2023a).

As can be seen from Table A2, GTUM- and MUPP-based trait scores were generally highly correlated. Whether intermediate statements were included or not further moderated the degree of correlation such that their convergence was higher in conditions with no intermediate statements (M_r = .94, min = .89, max = .97) compared to conditions with intermediate statements (M_r = .82, min = .72, max = .88).

Acknowledgment

We thank Dr. Fritz Drasgow for providing valuable feedback on an earlier version of the draft.

Authors’ Note

An earlier version of this paper was presented at the 38th Annual Conference of the Society for Industrial and Organizational Psychology.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Data collection for sample 2 was supported by Grant 2020YFC200300 from the National Key R&D Program of China awarded to Dr. Jian Li.

Supplemental Material

Supplemental material for this article is available online.

Author Biographies

Bo Zhang is currently an assistant professor at the School of Labor and Employment Relations and the Department of Psychology at the University of Illinois Urbana-Champaign. His research focuses on personnel selection, personality, and quantitative methods.

Naidan Tu is a PhD candidate in Industrial-Organizational Psychology at the University of South Florida. Her primary research interest lies in the application, development, and evaluation of quantitative methods in the domain of psychometrics to improve noncognitive assessment, with the ultimate goal of enhancing the effectiveness of personnel selection and other organizational decision making.

Lawrence Angrave is a teaching professor at the Department of Computer Science at the University of Illinois Urbana-Champaign. His research areas include digital accessibility, education, AI, and computing at scale.

Susu Zhang is an assistant professor of Psychology and Statistics at the University of Illinois Urbana-Champaign. Her research integrates latent variable modeling and statistical learning to advance statistical and psychometric methods in educational and psychological testing.

Tianjun Sun received her PhD in psychology from University of Illinois Urbana-Champaign and is currently an assistant professor of industrial-organizational psychology at Kansas State University. Her research primarily focuses on personnel selection, individual differences, and quantitative methods. Her work advocates for the responsible use of psychometric tools and advanced technology to improve psychological sciences and solve organizational problems.

Louis Tay is the William C. Byham professor in Industrial-Organizational Psychology at Purdue University. His research interests are in well-being, vocational interests, measurement, taxonometrics, latent class modeling, and data science. His lab website is www.wam-lab.com. He is the founder of ExpiWell, a technology startup enabling researchers to conduct better experience sampling studies.

Jian Li is a full professor at the Faculty of Psychology at Beijing Normal University. His research primarily focuses on personnel selection, individual differences, higher-order thinking, and quantitative methods. His work advocates for the responsible use of psychometric tools and advanced technology to solve organizational and educational problems.

Statement Type	Match	Options	λ = 1, L-Diff = .25, SS = 300	λ = 1, L-Diff = .25, SS = 1000	λ = 1, L-Diff = .50, SS = 300	λ = 1, L-Diff = .50, SS = 1000	λ = 1.5, L-Diff = .25, SS = 300	λ = 1.5, L-Diff = .25, SS = 1000	λ = 1.5, L-Diff = .50, SS = 300	λ = 1.5, L-Diff = .50, SS = 1000
Intermediate + Extreme	0/6	2	.54	.58	.57	.61	.70	.74	.73	.76
Intermediate + Extreme	0/6	5	.68	.71	.70	.73	.83	.84	.85	.86
Intermediate + Extreme	1/6	2	.70	.72	.70	.72	.82	.84	.82	.84
Intermediate + Extreme	1/6	5	.80	.82	.81	.82	.90	.90	.90	.90
Intermediate + Extreme	2/6	2	.74	.76	.74	.76	.84	.85	.84	.85
Intermediate + Extreme	2/6	5	.83	.84	.83	.84	.91	.91	.91	.91
Intermediate + Extreme	3/6	2	.75	.76	.75	.76	.85	.85	.84	.85
Intermediate + Extreme	3/6	5	.83	.84	.83	.84	.91	.91	.91	.91
Intermediate + Extreme	6/6	2	.75	.77	.75	.77	.84	.85	.84	.85
Intermediate + Extreme	6/6	5	.83	.84	.83	.84	.90	.91	.91	.91
Extreme only	0/6	2	.56	.58	.57	.59	.65	.65	.65	.65
Extreme only	0/6	5	.64	.66	.64	.66	.69	.70	.69	.70
Extreme only	1/6	2	.75	.76	.74	.76	.84	.84	.83	.84
Extreme only	1/6	5	.83	.84	.83	.84	.90	.90	.90	.90
Extreme only	2/6	2	.79	.80	.79	.80	.87	.87	.86	.87
Extreme only	2/6	5	.86	.87	.86	.87	.92	.92	.92	.92
Extreme only	3/6	2	.80	.81	.80	.81	.87	.87	.87	.87
Extreme only	3/6	5	.87	.87	.87	.87	.92	.92	.92	.92
Extreme only	6/6	2	.80	.80	.79	.80	.85	.86	.85	.86
Extreme only	6/6	5	.86	.86	.86	.86	.91	.91	.91	.91

TAPAS	Reliability		SOC		DOM		PHY		SEL		ACH		ORD		OPT		EVT		TOL		IEF
TAPAS	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP	GTUM	MUPP
SOC	.79	.85	.93
DOM	.81	.82	.68	.43	.96
PHY	.81	.82	.36	.25	.31	.19	.94
SEL	.65	.74	.12	.11	−.05	.00	−.17	−.04	.85
ACH	.69	.71	.26	.08	.39	.21	.24	.12	.48	.24	.87
ORD	.80	.85	.22	.13	.17	.08	.41	.26	−.07	−.03	.49	.27	.96
OPT	.76	.85	.66	.35	.38	.22	.45	.27	.14	.09	.39	.21	.45	.26	.91
EVT	.69	.79	.09	.05	−.20	−.11	.13	.09	.36	.18	.28	.14	.18	.08	.48	.30	.95
TOL	.70	.79	.12	.12	.15	.15	.00	.03	.63	.37	.21	.10	−.18	−.12	−.01	.07	.18	.10	.92
IEF	.66	.75	.14	−.01	.44	.26	.01	−.07	.34	.06	.70	.26	.01	−.03	.20	.10	.11	.05	.31	.16	.87
SS-E	.85		.77	.71	.57	.48	.27	.22	.08	.04	.22	.08	.16	.09	.50	.35	−.01	−.03	.11	.10	.17	.07
SS-A	.79		.24	.23	.00	−.02	.06	.06	.41	.39	.26	.16	.14	.10	.35	.28	.43	.37	.22	.18	.11	.01
SS-C	.81		.21	.15	.17	.13	.26	.20	.17	.10	.56	.48	.54	.47	.40	.34	.23	.16	.01	.00	.27	.17
SS-N	.86		−.40	−.32	−.24	−.19	−.29	−.22	−.06	−.01	−.28	−.17	−.29	−.22	−.65	−.55	−.50	−.47	.01	.02	−.20	−.12
SS-O	.81		.11	.10	.17	.17	.07	.08	.24	.19	.25	.16	−.02	−.02	.06	.04	.07	.06	.34	.32	.33	.29
LS	.90		.31	.21	.16	.10	.27	.22	.06	.07	.19	.12	.25	.17	.57	.54	.22	.16	−.03	−.03	.06	.00
CSE	.85		.38	.27	.24	.18	.29	.22	.07	.03	.33	.24	.36	.26	.69	.64	.32	.24	−.06	−.05	.19	.08
Income	NA		−.02	−.04	−.02	−.03	.05	.04	.11	.06	.22	.20	.16	.10	.11	.12	.07	.03	.00	.01	.14	.10
Health	NA		−.29	−.22	−.21	−.16	−.43	−.37	.05	.03	−.13	−.07	−.21	−.16	−.37	−.32	−.15	−.11	−.05	−.05	−.03	.03

FCFFM	Reliability		FC-E		FC-A		FC-C		FC-N		FC-O
FCFFM	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT	GTUM	TIRT
FC-E	.89	.86	.99
FC-A	.90	.87	−.11	−.13	.99
FC-C	.82	.76	−.15	−.19	.36	.42	.99
FC-N	.86	.83	−.16	−.15	.11	.11	.12	.13	1.00
FC-O	.84	.80	−.13	−.14	.30	.31	.22	.22	.06	.09	.99
SS-E	.89		.86	.87	.31	.34	.06	.06	−.13	−.15	.27	.28
SS-A	.82		.40	.46	.76	.76	.10	.13	−.08	−.11	.21	.24
SS-C	.86		.11	.10	.15	.16	.80	.82	−.14	−.13	.08	.09
SS-N	.90		−.16	−.17	−.12	−.13	−.15	−.15	.81	.81	−.16	−.16
SS-O	.86		.31	.32	.16	.17	.06	.07	−.20	−.21	.80	.80
