Abstract
Historically, multidimensional forced choice (MFC) measures have been criticized because conventional scoring methods can lead to ipsativity problems that render scores unsuitable for interindividual comparisons. However, with the recent advent of item response theory (IRT) scoring methods that yield normative information, MFC measures are surging in popularity and becoming important components in high-stakes evaluation settings. This article aims to add to burgeoning methodological advances in MFC measurement by focusing on statement and person parameter recovery for the GGUM-RANK (generalized graded unfolding-RANK) IRT model. A Markov chain Monte Carlo (MCMC) algorithm was developed for estimating GGUM-RANK statement and person parameters directly from MFC rank responses. Simulation studies examined how the psychometric properties of the statements composing MFC items, test length, and sample size influenced statement and person parameter estimation, and explored the benefits of measurement using MFC triplets relative to pairs. To demonstrate this methodology, an empirical validity study was then conducted using an MFC triplet personality measure. The results and implications of these studies for future research and practice are discussed.
To control response biases and rater errors, multidimensional forced choice (MFC) measures have been proposed as an alternative to Likert-type scales for noncognitive assessment (Stark, Chernyshenko, & Drasgow, 2005). MFC measures commonly present statements in blocks of two, three, or four. Within the blocks, statements representing different dimensions are matched on social desirability so the best answers are not obvious. The respondent’s task is to rank the statements in each block from most to least “like me.” Historically, MFC measures have been criticized because conventional scoring methods can lead to ipsativity problems that render scores unsuitable for interindividual comparisons (Hicks, 1970). However, over the last decade, advances in item response theory (IRT) have made it possible to derive normative information from the MFC format (e.g., Brown & Maydeu-Olivares, 2011; de la Torre, Ponsoda, Leenen, & Hontangas, 2012; Stark et al., 2005).
Stark et al. (2005) proposed the multi-unidimensional pairwise preference (MUPP) model for constructing and scoring pairwise preference responses. They estimated discrimination
Although these modeling developments have helped to advance MFC research and practice, there remain many unexplored issues concerning parameter estimation. By scaling a statement pool using a unidimensional model prior to MFC testing, Stark et al.’s method avoids the complexity of estimating statement parameters directly from forced choice responses via marginal maximum likelihood (MML) methods, which require complicated derivatives, or Markov chain Monte Carlo (MCMC) methods, which require long run-times. This approach also facilitates MFC computerized adaptive testing (CAT), as well as nonadaptive testing with many forms, because any number of MFC tests can be created and scored once a statement pool has been calibrated. On the other hand, this approach may be considered impractical if just one MFC form is needed. In such situations, it may be better to construct a single MFC form having some extra items based on expert judgment, administer the form to a selected sample of respondents, estimate statement parameters directly from the MFC responses, and cull any items identified as problematic before scoring for assessment purposes. Importantly, such an approach would allow a test developer to evaluate the actual items (i.e., pairs, triplets, or tetrads) presented to respondents. In addition, by using simulation methodology, one could systematically explore how statement and person parameter estimation depends on the combinations of statements forming the MFC blocks.
Purpose of Research
An MCMC estimation algorithm was developed and evaluated for GGUM-RANK parameters, and its usefulness was demonstrated for empirical research. Unlike previously published studies that focused exclusively on scoring (e.g., Hontangas et al., 2015; Stark et al., 2005), this research examined the recovery of GGUM-RANK statement and person parameters estimated directly from the MFC responses (i.e., a direct estimation process). Previous work indicates that direct estimation is as effective as, or even better than, a two-step estimation process for person parameter recovery (Seybert, 2013). Three studies were conducted. Study 1 examined the recovery of GGUM-RANK parameters with MFC triplet measures while manipulating sample size, test length, intrablock discrimination, and intrablock location parameter variability. Rather than using idealized distributions of statement parameters for data generation, MFC measures were constructed using statement parameters that were systematically varied across conditions likely to be encountered in practice. Study 2 compared GGUM-RANK parameter recovery for MFC pair and triplet measures of different test lengths and sample sizes. Because there is growing interest in the benefits of triplet measures relative to pairs for noncognitive assessment (e.g., Anguiano-Carrasco, MacCann, Geiger, Seybert, & Roberts, 2014; Guenole, Brown, & Cooper, 2016), it was particularly important to see how much the added complexity of triplets might improve person parameter estimation. Study 3 was an empirical example exploring the construct and criterion-related validity of an MFC triplet personality measure calibrated and scored using the new GGUM-RANK estimation method. Before these studies and their results are discussed, the MUPP and GGUM-RANK models are reviewed briefly.
The MUPP Model
Stark et al. (2005) proposed the MUPP model. The model assumes that when a respondent is presented with a pair of statements (j and k) and is asked to choose the statement that is more descriptive of him or her, the respondent considers each statement separately. The probability of preferring statement j over statement k (j > k) given his or her scores on the respective dimensions
where
Stark (2002) suggested using the dichotomous version of the GGUM (Roberts et al., 2000) for computing MUPP statement endorsement probabilities (
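To make the pairwise preference logic concrete, the following Python sketch computes a MUPP preference probability from dichotomous GGUM endorsement probabilities. It is an illustration only: the function names and example parameter values are hypothetical, and the formulas are the standard published forms of the dichotomous GGUM (Roberts et al., 2000) and the MUPP model (Stark et al., 2005), not the authors' Ox code.

```python
import math

def ggum_agree_prob(theta, alpha, delta, tau):
    """Probability of endorsing a statement under the dichotomous GGUM.

    theta: person trait score; alpha: discrimination; delta: location;
    tau: subjective response category threshold.
    """
    d = theta - delta
    num = math.exp(alpha * (d - tau)) + math.exp(alpha * (2 * d - tau))
    den = 1.0 + math.exp(alpha * 3 * d) + num
    return num / den

def mupp_prob(theta_j, theta_k, stmt_j, stmt_k):
    """P(j > k): statement j is endorsed and k is not, normalized over the
    two outcomes in which exactly one statement is endorsed."""
    pj = ggum_agree_prob(theta_j, *stmt_j)
    pk = ggum_agree_prob(theta_k, *stmt_k)
    j_wins = pj * (1.0 - pk)
    k_wins = (1.0 - pj) * pk
    return j_wins / (j_wins + k_wins)

# Hypothetical (alpha, delta, tau) values for two statements on different dimensions.
statement_j = (2.0, 0.5, -1.0)
statement_k = (1.0, -1.0, -0.8)
p_jk = mupp_prob(0.4, 1.2, statement_j, statement_k)
```

Because endorsement under the GGUM depends on the distance between the person and the statement location, a respondent located close to statement j and far from statement k is expected to prefer j.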
The GGUM-RANK Model for MFC Triplet Measures
Following Luce (1959), who proposed that the probability of a set of ranks can be viewed as a sequence of independent “most like” (PICK) decisions among a set of diminishing alternatives (M, M–1, . . ., 2), de la Torre et al. (2012) developed the RANK model for MFC rank responses. For a triplet example, the probability of the hypothetical ranking A > B > C would be given by the following sequence of PICK decisions:
where
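The sequential PICK logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the endorsement probabilities passed in are taken as given (in GGUM-RANK they would come from the dichotomous GGUM), and the code does not handle endorsement probabilities of exactly 0 or 1.

```python
from itertools import permutations

def pick_prob(p_endorse, chosen):
    """Probability that `chosen` is picked as "most like me" from the block.

    p_endorse maps each statement label to its endorsement probability.
    The chosen statement is endorsed and all others are not, normalized
    over every possible single pick.
    """
    def weight(target):
        w = 1.0
        for label, p in p_endorse.items():
            w *= p if label == target else (1.0 - p)
        return w
    return weight(chosen) / sum(weight(label) for label in p_endorse)

def rank_prob(p_endorse, ranking):
    """Probability of a full ranking as a product of PICK decisions
    among a diminishing set of alternatives."""
    remaining = dict(p_endorse)
    prob = 1.0
    for label in ranking[:-1]:        # the last statement is picked with certainty
        prob *= pick_prob(remaining, label)
        del remaining[label]
    return prob

# Hypothetical endorsement probabilities for one triplet.
block = {"A": 0.8, "B": 0.5, "C": 0.3}
p_abc = rank_prob(block, ("A", "B", "C"))
total = sum(rank_prob(block, r) for r in permutations(block))
```

For a triplet, `rank_prob` multiplies the probability of picking A from {A, B, C} by the probability of picking B from {B, C}; the probabilities of the six possible rankings sum to 1.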
Study 1
Study 1 investigated the accuracy of MCMC statement and person parameter recovery using a Metropolis-Hastings-within-Gibbs (MHWG; Tierney, 1994) algorithm developed for GGUM-RANK triplet responses in the Ox programming language (Doornik, 2009). This algorithm is available in Appendix A of the Online Supplementary Material.
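The general structure of a Metropolis-Hastings-within-Gibbs sampler can be sketched in a few lines of Python. This is a generic toy version, not the authors' Ox implementation (which is in their Appendix A): the function name, the random-walk normal proposal, the step size, and the bivariate normal target used for the demonstration are all assumptions chosen for illustration.

```python
import math
import random

def mh_within_gibbs(log_post, init, n_iter, step_sd, rng):
    """Update one parameter at a time with a random-walk Metropolis step.

    log_post: function returning the log posterior (up to a constant)
    init: starting values; step_sd: proposal standard deviation
    """
    current = list(init)
    current_lp = log_post(current)
    draws = []
    for _ in range(n_iter):
        for i in range(len(current)):
            proposal = list(current)
            proposal[i] = current[i] + rng.gauss(0.0, step_sd)
            proposal_lp = log_post(proposal)
            # The proposal is symmetric, so the acceptance probability
            # reduces to the posterior ratio (capped at 1).
            accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
            if rng.random() < accept_prob:
                current, current_lp = proposal, proposal_lp
        draws.append(list(current))
    return draws

# Toy target: two independent normal parameters with means 1 and -2.
def toy_log_post(params):
    return -0.5 * ((params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2)

draws = mh_within_gibbs(toy_log_post, [0.0, 0.0], 6000, 1.0, random.Random(0))
kept = draws[1000:]                    # discard burn-in
posterior_means = [sum(d[i] for d in kept) / len(kept) for i in range(2)]
```

In a GGUM-RANK application, `log_post` would combine the rank response likelihood with the priors, and the parameter vector would hold the statement and person parameters; the one-at-a-time update loop is the part this sketch is meant to show.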
Simulation Study Design
MFC test dimensionality was set at 10 dimensions. Four independent variables were fully crossed to produce 16 experimental conditions: (a) sample size (250, 500), (b) test length (30 triplets, 60 triplets), (c) intrablock discrimination (low = α parameters sampled randomly from a uniform distribution, U(0.75, 1.25); high = α parameters sampled randomly from U(1.75, 2.25)), and (d) intrablock location
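The fully crossed design can be enumerated with a short Python sketch. The variable names are hypothetical, and the two intrablock location SD levels are represented only by the labels "small" and "large" used in the Results, because their numeric values are not given in this text.

```python
from itertools import product

SAMPLE_SIZES = (250, 500)
TEST_LENGTHS = (30, 60)                      # number of triplets
DISCRIMINATION = {                           # bounds of the uniform distribution for alpha
    "low": (0.75, 1.25),
    "high": (1.75, 2.25),
}
LOCATION_SD = ("small", "large")             # labels only; numeric levels not stated here

conditions = [
    {
        "n": n,
        "triplets": t,
        "alpha_level": a,
        "alpha_range": DISCRIMINATION[a],
        "delta_sd": s,
    }
    for n, t, a, s in product(SAMPLE_SIZES, TEST_LENGTHS, DISCRIMINATION, LOCATION_SD)
]
```

Crossing four factors with two levels each yields 2 × 2 × 2 × 2 = 16 conditions.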
MFC Tests Constructed for This Simulation
Four 30-triplet and four 60-triplet MFC tests were created by crossing levels of intrablock discrimination and location SD. The test specifications and generating parameters are provided in the Online Supplemental Appendices B and C.
Data Generation
GGUM-RANK response data were generated using an Ox computer program (Doornik, 2009). For each respondent, a vector of 10 latent trait scores (
Prior Distributions and Initial Parameter Values
Four-parameter beta priors {1.5, 1.5, .25, 3}, {2, 2, –4, 4}, and {2, 2, –3, 1} were used for α, δ, and τ estimation, respectively. For the person parameters associated with each dimension (d), a N(0,1) prior was used. To start MCMC estimation, all
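A four-parameter beta prior can be evaluated as in the Python sketch below. It assumes that the four values in each set are ordered as (first shape, second shape, lower bound, upper bound), which is consistent with the plausible ranges of α, δ, and τ but is not stated explicitly in the text; the function and dictionary names are hypothetical.

```python
import math

def log_beta4(x, a, b, lower, upper):
    """Log density of a beta(a, b) distribution rescaled to (lower, upper)."""
    if not lower < x < upper:
        return float("-inf")                 # outside the prior's support
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (
        (a - 1.0) * math.log(x - lower)
        + (b - 1.0) * math.log(upper - x)
        - log_norm
        - (a + b - 1.0) * math.log(upper - lower)
    )

# Priors from the text, under the assumed (a, b, lower, upper) ordering.
PRIORS = {
    "alpha": (1.5, 1.5, 0.25, 3.0),
    "delta": (2.0, 2.0, -4.0, 4.0),
    "tau": (2.0, 2.0, -3.0, 1.0),
}

log_prior_alpha = log_beta4(1.0, *PRIORS["alpha"])
```

Under this reading, the priors bound α between 0.25 and 3, δ between −4 and 4, and τ between −3 and 1, so any proposed value outside those intervals receives a log prior of negative infinity and is rejected.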
Convergence Checks and Indices of Estimation Accuracy
Convergence was checked using
Results
Convergence rates were generally high, ranging from .93 to 1.00 across conditions. In the few instances where a parameter did not meet the convergence criterion, the parameter estimate was excluded from the calculation of the recovery statistics. Table 1 presents the parameter recovery results for GGUM-RANK statement parameter estimation, averaged over replications. Across all conditions, ABS ranged from .13 to .23, .11 to .27, and .16 to .21 for α, δ, and τ, respectively. The corresponding RMSEs ranged from .16 to .31, .14 to .35, and .20 to .26. The CORRs between true and estimated δ parameters ranged from .96 to .99, but CORRs were lower for τ and quite low for α in some conditions. The low CORRs between generating and estimated α parameters were due in part to the restricted range of the generating parameters, and the relatively poor recovery of τ parameters is consistent with previous simulation results using the dichotomous version of the GGUM (e.g., de la Torre, Stark, & Chernyshenko, 2006; Joo, Lee, & Stark, 2017). As expected, recovery of α and δ parameters improved with test length (60-triplet better than 30-triplet), sample size (500 better than 250), and intrablock discrimination (high better than low), although the pattern of results for τ was inconsistent. Interestingly, intrablock location SD did not seem to influence the recovery results: ABS, RMSE, PSD, and CORR values were nearly the same across the small and large intrablock location SD conditions.
Statement Parameter Recovery With MFC Triplet Tests in Study 1.
Note. MFC = multidimensional forced choice; ABS = absolute bias; RMSE = root mean square error; PSD = posterior standard deviation; CORR = correlation between true and estimated parameter.
Table 2 presents parameter recovery statistics for latent trait scores
Person Parameter Recovery With MFC Triplet Tests in Study 1.
Note. MFC = multidimensional forced choice; ABS = absolute bias; RMSE = root mean square error; PSD = posterior standard deviation; CORR = correlation between true and estimated parameter.
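The recovery indices reported in Tables 1 and 2 can be computed along the following lines. This Python sketch assumes that ABS is the mean absolute difference between estimated and generating values and that CORR is the Pearson correlation; the function name is hypothetical, and PSD is omitted because it requires the posterior draws rather than point estimates.

```python
import math

def recovery_stats(true_vals, est_vals):
    """Return ABS, RMSE, and CORR for paired true and estimated parameters."""
    n = len(true_vals)
    diffs = [e - t for t, e in zip(true_vals, est_vals)]
    abs_bias = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_vals, est_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_e = sum((e - mean_e) ** 2 for e in est_vals)
    corr = cov / math.sqrt(var_t * var_e)

    return {"ABS": abs_bias, "RMSE": rmse, "CORR": corr}

# Hypothetical example: every estimate is shifted upward by 0.5.
stats = recovery_stats([0.0, 1.0, 2.0], [0.5, 1.5, 2.5])
```

The example shows why CORR alone can be misleading: a constant shift leaves the correlation at 1 while ABS and RMSE both equal the size of the shift.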
Study 2
Study 2 was conducted to examine the benefits of parameter estimation with MFC triplets relative to pairs. Statement and person parameter recovery were assessed in a simulation study that crossed two independent variables: sample size (250, 500) and test type (30-triplet, 30-pair, 90-pair).
MFC Test Design and Analyses
From Study 1, the 30-triplet test in the high intrablock discrimination, high intrablock location SD condition was selected. The 30 triplets were decomposed into the 90 possible pairs to create a 90-pair MFC test for this study. A 30-pair MFC test was then created by selecting two statements in each of the 30 triplet blocks. The average generating parameters for each dimension in the 30-pair and 90-pair test type conditions were similar. The process for data generation, parameter estimation, and analysis was the same as in Study 1. The test specifications and generating parameters for the 30-pair and 90-pair tests are provided in the Online Supplemental materials (see Appendices D and E).
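The decomposition of triplets into pairs can be sketched in Python as follows. The statement identifiers and the function name are hypothetical; only the 30-triplet to 90-pair decomposition is taken from the text.

```python
from itertools import combinations

def triplets_to_pairs(triplets):
    """Return every within-block pair of statements, in block order."""
    return [pair for block in triplets for pair in combinations(block, 2)]

# Hypothetical statement IDs: block b contains statements 3b, 3b + 1, 3b + 2.
triplets = [(3 * b, 3 * b + 1, 3 * b + 2) for b in range(30)]
pairs = triplets_to_pairs(triplets)
```

Each triplet (A, B, C) contributes the three pairs (A, B), (A, C), and (B, C), so 30 triplets yield 90 pairs.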
Results
Overall convergence rates approached 100% within 30,000 iterations. Table 3 presents the 30-pair and 90-pair parameter recovery results. (The results for the selected 30-triplet test in Study 1 are shown again for convenience.) The main finding is that the 30-triplet measure exhibited better recovery statistics than the 90-pair measure, so there is a distinct advantage in using a shorter triplet measure over a much longer pairwise preference measure for statement calibration. In the corresponding 250 and 500 sample size conditions, the 30-triplet measure had higher CORR and lower ABS, RMSE, and PSD values.
Statement Parameter Recovery for MFC Triplet and Pair Tests in Study 2.
Note. MFC = multidimensional forced choice; ABS = absolute bias; RMSE = root mean square error; PSD = posterior standard deviation; CORR = correlation between true and estimated parameter.
Table 4 presents the person parameter (
Person Parameter Recovery for MFC Triplet and Pair Tests in Study 2.
Note. MFC = multidimensional forced choice; ABS = absolute bias; RMSE = root mean square error; PSD = posterior standard deviation; CORR = correlation between true and estimated parameter.
Study 3: Empirical Example
Studies 1 and 2 examined statement and person parameter recovery using GGUM-RANK MFC triplet and pair measures. Although these simulations provided insights into MFC test construction practices, they provided no evidence concerning the comparability of GGUM-RANK MFC and Likert-type scale scores with real examinees. To address that issue, an empirical validity investigation was conducted using MFC triplet and single-statement (SS) personality measures and relevant SS criterion variables.
Procedure
A total of 60 statements measuring the Big Five factor markers (Goldberg, 1992) were selected from the International Personality Item Pool and translated into Korean, with 12 statements measuring each of the five factors. Two personality measures were then created. The first was a 60-item SS Big Five measure that required respondents to indicate their level of agreement using a 5-point Likert-type format. The second was a 20-triplet MFC measure, in which each triplet measured three different personality dimensions. The statements forming the triplets were matched as closely as possible on social desirability, computed as the difference in the Likert-type item scores between “honest” and “fake good” administrations using a within-subjects design (N = 205). Respondents were instructed to rank the statements in each MFC triplet from 1 (most like you) to 3 (least like you). The SS and MFC personality measures were administered to 417 Korean college students, and a smaller subset of students (N = 235) also completed Korean versions of SS criterion measures that included life satisfaction (The Satisfaction With Life Scale; Diener, Emmons, Larsen, & Griffin, 1985), positive and negative feelings (Scale of Positive and Negative Experience; Diener et al., 2010), aggression (Buss-Perry Aggression Questionnaire; Bryant & Smith, 2001), and RIASEC vocational interests (Heo, 2011).
Analyses
The SS personality and SS criterion measures were scored using the conventional summative approach. To score the MFC triplet personality measure, the GGUM-RANK model was fitted to the rank responses using the Ox program developed for this research, with the specifications described for the simulations. Next, the correspondence between the MFC and SS personality scores was examined via a multitrait-multimethod (MTMM) analysis and by comparing CORRs with the criterion measures.
Results
Tables 5 and 6 present the MTMM and criterion validity findings. Importantly, the MFC and SS personality measures exhibited convergent validity, with monotrait-heteromethod CORRs ranging from .67 to .86. These CORRs are similar to those reported by Chernyshenko et al. (2009) for MUPP MFC and SS Big Five measures. Furthermore, the MFC and SS measures exhibited a similar pattern of CORRs with criterion measures. However, MFC criterion CORRs were somewhat lower, which could be due, in part, to differences in the magnitudes of the reliabilities or to reduced response consistency bias. Overall, the CORRs in Table 6 were generally consistent with previous meta-analytic findings (e.g., Barrick, Mount, & Gupta, 2003; Miller, Lynam, & Leukefeld, 2003; Steel, Schmidt, & Shultz, 2008).
Correlations of Personality Scores Based on Single-Statement and MFC Triplet Responses in Study 3.
Note. Bold values indicate monotrait-heteromethod correlations; values in the parentheses are reliability coefficients. The reliabilities of the single-statement (SS; “Likert-type”) measures were computed using coefficient alpha. The reliabilities for the MFC triplet measure were calculated using the marginal reliability equation provided by Brown and Croudace (2015). MFC = multidimensional forced choice; O = openness; C = conscientiousness; E = extraversion; A = agreeableness; N = neuroticism.
Criterion-Related Validities Based on SS and MFC Triplet Responses in Study 3.
Note. All criterion variables used the SS format. SS = single statement (“Likert-type”); MFC = multidimensional forced choice; O = openness; C = conscientiousness; E = extraversion; A = agreeableness; N = neuroticism; SWLS = Satisfaction With Life Scale; PA = positive affect; NA = negative affect; AGG = aggression; HR = Holland realistic; HI = investigative; HA = artistic; HS = social; HE = enterprising; HC = conventional.
*p < .05. **p < .01.
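For readers who wish to reproduce the SS reliabilities reported in Table 5, coefficient alpha can be computed from an item response matrix as in the Python sketch below. The function name and the example data are hypothetical, and the sketch does not cover the marginal reliability equation of Brown and Croudace (2015) used for the MFC scores.

```python
from statistics import variance

def coefficient_alpha(responses):
    """Coefficient alpha for a respondents-by-items matrix of item scores."""
    items = list(zip(*responses))            # transpose to one tuple per item
    k = len(items)
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Hypothetical data: three respondents answer three perfectly consistent items.
example = [
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
]
alpha_hat = coefficient_alpha(example)
```

With perfectly consistent items the sum of the item variances is one third of the total score variance, so alpha equals (3/2)(1 − 1/3) = 1.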
Conclusion and Discussion
The main findings and practical implications of these studies are as follows. First, larger sample size led to more accurate statement parameter estimation, but the effect on person parameter estimation was small. The results suggest that at least 250 respondents are needed for GGUM-RANK estimation with MFC triplet tests involving highly discriminating statements, and larger samples (e.g., 500) are recommended for statement parameter estimation when measures are developed for high-stakes decision making. Second, regarding test length, 30 triplets may be adequate for assessment with 10-dimension MFC measures, provided that the triplets are pretested to ensure adequate intrablock discrimination. Importantly, using short MFC triplet measures should decrease the “cognitive load” on respondents, relative to long MFC triplet measures, and in turn reduce test fatigue, careless responding, and completion time. Third, intrablock discrimination was found to be of primary importance for estimation accuracy. Researchers and practitioners are, therefore, strongly encouraged to create MFC tests comprising highly discriminating statements to ensure sufficient measurement precision. Importantly, the GGUM-RANK MCMC direct estimation process accounts for context, or potential “interactions” between statements within MFC blocks, that may influence overall triplet quality. This method will thus enable practitioners to conduct more effective MFC item analysis and facilitate parallel MFC test construction. Fourth, intrablock location variability had little to no effect on parameter recovery when generating theta CORRs were zero. In accordance with a reviewer’s suggestion, a follow-up simulation was conducted for selected conditions (30 triplets with high α and large δSD, and high α and small δSD) in Study 1 with correlated thetas (0.3). The results were just slightly better in the large δSD conditions, and the CORRs between generating thetas had little adverse effect on item and person parameter recovery (e.g., .42 vs. .47 RMSEs for person parameter recovery; see Online Appendix F for the detailed results). Fifth, MFC triplet measures outperformed MFC pair measures of similar length and intrablock discrimination in terms of estimation accuracy. The 30-triplet tests consistently yielded better discrimination and location parameter recovery than 30-pair tests, and the 30-triplet tests were nearly as good as 90-pair tests in terms of person parameter recovery. Finally, this research not only developed GGUM-RANK estimation methods but also illustrated their viability for applied use. The empirical study provided evidence of convergent and discriminant validity for an MFC measure with real participants.
This research had some limitations. Because of the very long run-times, the simulation studies considered only a limited number of replications and conditions out of all the possibilities that may be seen in MFC testing applications. Development of alternative estimation methods that can reduce computation time would be a worthy topic for future research. In addition, these simulations explored parameter recovery exclusively with 10-dimension tests, but MFC tests of higher (and lower) dimensionality are used in some applied settings (e.g., Brown & Bartram, 2009; Stark et al., 2014). Thus, additional simulation research is needed to explore the accuracy of GGUM-RANK scoring with measures involving more (and fewer) than 10 dimensions.
This research also considered just two levels of intrablock location parameter variability. Future simulations should examine a wider variety of location SD conditions and deliberately explore estimation with all positively or negatively worded statements within MFC blocks. This would complement the research by Brown and Maydeu-Olivares (2011) suggesting that MFC items should be created by mixing positively and negatively worded statements to ensure more accurate estimation with their Thurstonian IRT model. If future GGUM-RANK research finds that all positive or all negative statements can be used in MFC blocks without adversely affecting parameter estimation, then there will be potentially greater resistance to faking and related forms of response distortion.
Moreover, this research focused on single-sample statement and person parameter estimation. To facilitate applications, research is needed to develop methods for assessing GGUM-RANK model-data fit, linking parameters across subpopulations, and concurrent calibration. This line of research would provide a foundation for GGUM-RANK differential item functioning (DIF) detection that is important for answering questions about fairness in high-stakes settings, as well as score comparisons with partially overlapping forms that may be of interest in multinational assessment contexts.
In closing, there is increasing interest in the use of MFC measures for noncognitive measurement. This research expanded on previous investigations (Hontangas et al., 2015; Seybert, 2013; Stark et al., 2005) by introducing an MCMC algorithm for estimating GGUM-RANK statement and person parameters from MFC triplet responses. It is hoped that this research provides a solid foundation for field applications and a springboard for new psychometric developments.
Supplementary Material
Supplementary Material, APM-17-04-070.R2.GGUM-RANK_Online_Supplement for GGUM-RANK Statement and Person Parameter Estimation With Multidimensional Forced Choice Triplets by Philseok Lee, Seang-Hwane Joo, Stephen Stark and Oleksandr S. Chernyshenko in Applied Psychological Measurement
Footnotes
Author’s note
Oleksandr S. Chernyshenko is currently affiliated with the University of Western Australia.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplementary material is available for this article online.
