Abstract
Forced-choice (FC) measures are gaining popularity as an alternative assessment format to single-statement (SS) measures. However, a fundamental question remains to be answered: Do FC and SS instruments measure the same underlying constructs? In addition, FC measures are theorized to be more cognitively challenging, so how would this feature influence respondents’ reactions to FC measures compared to SS? We used both between- and within-subjects designs to examine the equivalence of the FC format and the SS format. As the results illustrate, FC measures scored by the multi-unidimensional pairwise preference (MUPP) model and SS measures scored with the generalized graded unfolding model (GGUM) showed strong equivalence. Specifically, both formats demonstrated similar marginal reliabilities and test-retest reliabilities, high convergent validities, good discriminant validities, and similar criterion-related validities with theoretically relevant criteria. In addition, the formats had little differential impact on respondents’ general emotional and cognitive reactions except that the FC format was perceived to be slightly more difficult and more time-saving.
The scientific study of personality, job attitudes, job engagement, vocational interests, and many other psychological attributes heavily relies on the use of self-reported measures. Since its introduction by Likert in 1932, the single-statement (SS) rating scale format has no doubt been the most widely used form of measurement (Brown & Maydeu-Olivares, 2013). SS measurement is, however, not without issues. Faking and rating scale errors, such as halo, acquiescent responding, extreme response style, and midpoint response style, lead to serious concerns (e.g., Donovan, Dwight, & Schneider, 2014; Landy & Farr, 1980; Pavlov, Maydeu-Olivares, & Fairchild, 2018). An alternative that was originally developed to overcome problems with SS measures was the forced-choice (FC) scale (Rundquist, 1946). It has the advantage of rendering some response styles impossible (e.g., midpoint/extreme response style), and some have found improved measurement with this format (Bartram, 2007; Brown, Inceoglu, & Lin, 2017; Cao & Drasgow, in press; Guenole, Brown, & Cooper, 2018).
However, psychometric models make different assumptions about response processes for FC and SS measures (Sass, Frick, Reips, & Wetzel, 2018). These theoretical differences raise a fundamental question: Are scales using FC and SS formats measuring the same construct? In addition, how respondents’ reactions to the FC format compare with their reactions to the SS format is another important but largely ignored topic. The present study is designed to examine the comparability of SS and FC measures in the context of response processes modeled by ideal point models. An abridged static version of the Tailored Adaptive Personality Assessment System (TAPAS; Drasgow et al., 2012), which measures narrow facets of personality factors and has been widely used for military selection and classification (Drasgow et al., 2012), is used to answer our research questions. To the extent that SS and FC measures of the same construct are equivalent under ideal conditions (e.g., low stakes, homogeneous sample), we can build on SS findings with a more robust measurement method and psychometric model that may provide meaningful results under much wider circumstances.
The Single-Statement Format
Likert’s (1932) SS approach avoids items reflecting intermediate trait levels and requires respondents to rate their degree of agreement with each statement on an n-point scale. Individual trait scores are then computed as the sum of all responses after reverse-coding negative items. Likert’s methods for item selection and scoring are consistent with a dominance response process which assumes that the probability of endorsing an item will monotonically increase as the trait level increases (Drasgow, Chernyshenko, & Stark, 2010). For example, when responding to the statement “I am very conscientious,” more conscientious respondents are more likely to endorse this item, thus obtaining higher scores. Most popular personality scales were constructed and scored following the dominance assumption.
Recent research suggests that ideal point models, rather than a dominance model, should be used for personality measures. These models posit that each statement has a location parameter describing its extremity that is on the same metric as the latent trait. It is assumed that a respondent is more likely to endorse an item when the statement’s location is closer to the individual’s latent trait level. The more a person’s trait level departs from the statement’s location, either higher or lower, the less likely he or she is to endorse the statement. If we plot the endorsement probability against the latent trait level, we will see an inverse U-shaped curve (Cao, Drasgow, & Cho, 2015). For instance, when responding to the statement “I am about average in regards to conscientiousness,” moderately conscientious respondents have the greatest probability of endorsement because the statement’s location matches their latent trait levels. However, those very high or low on conscientiousness are less likely to endorse this item because there is a mismatch between the statement’s location and their own standings on the latent trait continuum.
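The contrast between the two response processes can be sketched numerically. The following is a minimal illustration, not the scoring code used in this study: the dominance item is a two-parameter logistic function, and the ideal point item follows the dichotomous form of the GGUM as it is commonly presented (the expression should be checked against the original source before reuse). All parameter values (a, b, alpha, delta, tau) are hypothetical.

```python
import math

def dominance_prob(theta, a=1.5, b=0.0):
    """2PL (dominance) item: endorsement rises monotonically with theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ideal_point_prob(theta, alpha=1.5, delta=0.0, tau=-1.0):
    """Dichotomous GGUM-style item: endorsement peaks where theta equals delta."""
    d = theta - delta
    num = math.exp(alpha * (d - tau)) + math.exp(alpha * (2 * d - tau))
    den = 1.0 + math.exp(alpha * 3 * d) + num
    return num / den

# Hypothetical trait levels spanning the continuum
for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  dominance={dominance_prob(theta):.2f}  "
          f"ideal_point={ideal_point_prob(theta):.2f}")
```

Under these assumed parameters, the dominance curve increases across the whole range, whereas the ideal point curve is symmetric around the statement location and falls off on both sides.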
Evidence in support of ideal point models has been found for personality assessment (e.g., Cao et al., 2015; Chernyshenko, Stark, Drasgow, & Roberts, 2007; Stark, Chernyshenko, Drasgow, & Williams, 2006), vocational interests (Tay, Drasgow, Rounds, & Williams, 2009), job satisfaction (Carter & Dalal, 2010), emotional intelligence (Cho, Drasgow, & Cao, 2015), and attitude measures (Roberts & Laughlin, 1996). Using a dominance model when an ideal point process applies may impair diagnostic, selection, and classification decisions due to reduced measurement precision (e.g., Chernyshenko et al., 2007; Roberts, Laughlin, & Wedell, 1999; Sun, 2018; Tay et al., 2009) and hinder the detection of interaction effects (Cao, Song, & Tay, 2018; Carter, Dalal, Guan, LoPilato, & Withrow, 2017). Therefore, the current study used ideal point models for items in the SS format.
The Forced-Choice Format
In an FC item, respondents are presented with at least two single statements in a block. They are instructed to choose the statement that is “more like me,” to choose one statement that is “the most like me” and one that is “the least like me,” or to rank all statements within a block (Brown, 2016a). Recently, some other variations of the FC format have also been proposed (Brown, 2016b; Brown & Maydeu-Olivares, 2017). Statements within a block are usually balanced on social desirability. A typical example of an FC item is given as follows:
Choose the statement that is the more like you:
I keep my room tidy. I like talking to people.
These statements assess the conscientiousness and extraversion dimensions of personality. If a respondent chooses the first statement as “the more like me,” the traditional scoring approach would allocate 1 point to conscientiousness and 0 points to extraversion. Trait scores would be computed by adding up scores for all of the items assessing each construct. This scoring of FC responses, however, leads to ipsativity, which means that “the sum of the scores obtained over the attributes measured for each respondent is a constant” (Hicks, 1970, p. 169). Consequently, ipsative scoring allows only intraindividual comparisons, and between-person comparisons are technically not appropriate. As a result of the concerns about ipsativity, the popularity of the FC format appears to have declined substantially from its high point in the 1950s.
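The ipsativity problem described above can be made concrete with a small sketch. The trait names, the number of blocks, and the two response patterns below are hypothetical; the point is only that traditional scoring gives every respondent the same total.

```python
from collections import Counter

TRAITS = ("conscientiousness", "extraversion", "agreeableness")

def ipsative_scores(choices):
    """Traditional FC scoring: 1 point to the trait of each chosen statement.

    `choices` holds one trait name per block (the trait of the preferred statement).
    """
    counts = Counter(choices)
    return {trait: counts.get(trait, 0) for trait in TRAITS}

# Two hypothetical respondents answering the same six pairwise blocks
respondent_a = ["conscientiousness"] * 4 + ["extraversion"] * 2
respondent_b = ["agreeableness"] * 3 + ["extraversion"] * 3

scores_a = ipsative_scores(respondent_a)
scores_b = ipsative_scores(respondent_b)
print(scores_a, "total =", sum(scores_a.values()))
print(scores_b, "total =", sum(scores_b.values()))
```

Because both totals equal the number of blocks, a high score on one trait necessarily comes at the expense of another, which is why such scores support only intraindividual comparisons.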
Recently, however, the tide appears to have turned due to the development of several psychometric models that enable researchers to obtain normative trait estimates from FC measures. Building upon the finding that ideal point models may better describe the underlying response process for some constructs, Stark, Chernyshenko, and Drasgow (2005) developed the multi-unidimensional pairwise preference (MUPP) model. This model assumes that when a respondent is asked to indicate a preference for one of the two statements in a block, he or she first evaluates each statement separately and independently and then decides whether or not to endorse each statement. If he or she initially decides to endorse both or neither of the statements, the respondent has to reevaluate the statements until a single preference is reached. The MUPP model has solved the ipsativity problem and produces normative scores (Stark, Chernyshenko, Drasgow, & White, 2012; Stark et al., 2005; see also Brown & Maydeu-Olivares, 2011, and McCloy, Heggestad, & Reeve, 2005, for alternative approaches that also overcome ipsativity).
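The two-step process assumed by the MUPP model implies a simple expression for the pairwise preference probability: the respondent prefers statement s when he or she endorses s and rejects t, conditional on exactly one statement being endorsed. A minimal sketch follows, assuming the stand-alone endorsement probabilities are already available (e.g., from an ideal point model); the function name and the numeric inputs are hypothetical.

```python
def mupp_preference(p_s, p_t):
    """Probability of preferring statement s over statement t.

    p_s, p_t: stand-alone endorsement probabilities of the two statements,
    evaluated independently. Undefined when both are 0 or both are 1.
    """
    s_only = p_s * (1.0 - p_t)      # endorse s, reject t
    t_only = (1.0 - p_s) * p_t      # reject s, endorse t
    return s_only / (s_only + t_only)

# Hypothetical endorsement probabilities for one respondent
print(mupp_preference(0.8, 0.5))
print(mupp_preference(0.6, 0.6))
```

Outcomes in which both or neither statement is endorsed drop out of the denominator, which mirrors the reevaluation step described above.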
Psychometric Equivalence Between FC and SS
The use of the FC format seems to be on the upswing. However, psychometric models make different assumptions about response processes for FC and SS measures. According to Tourangeau and Rasinski (1988), when answering an SS measure, a respondent first interprets the statement by understanding the content of the statement and estimating the location of the statement on the trait continuum. Then he or she searches memories for relevant information to locate himself or herself along the same trait continuum. At the information integration stage, he or she calculates the distance between his or her location and the item’s location and uses the distance to form a judgment. The inverse of the distance is similar to Thurstone’s perceived utility (Thurstone, 1927), which then determines choice. When answering an FC measure, however, respondents are assumed to make relative judgments among two or more options. They first evaluate each statement separately, perhaps following the process described above. However, what determines the final choice is the perceived utility differences among alternatives (Thurstone, 1927). When the utility difference is small, this process may require deep contemplation to arrive at a finer differentiation (Meglino & Ravlin, 1998). These response process differences lead to a fundamental question: Do FC and SS scales measure the same construct?
Concerns about the construct interpretation of FC-derived scores have been long recognized (Tenopyr, 1988; Usami, Sakamoto, Naito, & Abe, 2016), but little addressed. Although simulation studies have shown that modern item response theory models can recover normative person parameters (e.g., Brown & Maydeu-Olivares, 2011; Stark et al., 2005), these models have specific assumptions about the underlying response process. If these assumptions do not hold empirically, treating FC estimates and SS estimates as equivalent may be problematic. Empirical data are needed to examine their degree of equivalence.
To ensure that a lack of convergence is not due to content differences, researchers need to compare FC and SS measures that are constructed from the same statement pool. To our knowledge, however, only a few studies have done this. Among these studies, most have examined the equivalence of dominance-model-based SS and FC formats and found mixed evidence. For example, statement factor loadings were found to differ substantially between FC and SS formats (Ackerman, Donnellan, Roberts, & Fraley, 2016; Dueber, Love, Toland, & Turner, 2019; Guenole et al., 2018; Wetzel, Roberts, Fraley, & Brown, 2016), suggesting that respondents may interpret them differently (Bürkner, Schulte, & Holling, in press). While some have reported moderate to high convergent validity (Brown & Maydeu-Olivares, 2011, 2013; Lee, Joo, & Lee, 2019; Lee, Lee, & Stark, 2018; Usami et al., 2016), others have observed very low convergent validity (Anguiano-Carrasco, MacCann, Geiger, Seybert, & Roberts, 2015; Dueber et al., 2019; Seybert, 2013). Moreover, as mentioned above, evidence has been accumulating that shows ideal point models more accurately capture the response processes underlying various psychological measures. Thus, it is largely unknown whether ideal-point-based SS and FC formats produce scores that are equivalent. To date, only one study has compared ideal-point-based FC and SS formats, and found supportive evidence of equivalence (Chernyshenko et al., 2009).
These earlier studies have limitations that need to be addressed. First, the single ideal-point-based study that has been conducted only tested format equivalence with three constructs in a homogeneous college sample. The generalizability of its results is unknown. Second, earlier studies have generally adopted a within-subjects design in which respondents finished both FC and SS measures in the same session. Such a design may artificially increase their equivalence due to single-subject response consistency error (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). Third, previous studies have not counterbalanced the order in which FC and SS were presented to respondents, which may further inflate the consistency between FC and SS responses. Fourth, little empirical evidence on the temporal stability of FC measures has been reported. As hypothesized by Meglino and Ravlin (1998), the cognitive processes underlying FC may be very intricate in that respondents need to make fine differentiations among statements to arrive at their final decision. Such fine differentiations may be much less consistent across time.
Research Questions
The first research question we address is the psychometric equivalence of FC scales and SS scales that are constructed from the same statement pool. If FC scores and SS scores are equivalent, we would expect them to display the following psychometric properties: (a) both formats have similar reliability; (b) scores on the same facet are highly correlated across formats; (c) scores of both formats show similar discriminant validity; (d) scores on the same facet show similar high correlations with a third external measure of the same construct; and (e) both formats show very similar criterion-related validity. We will examine these specific questions.
Regarding criterion-related validity, we selected four theoretically relevant criteria that have differential and robust correlations with the Big Five personality factors. The first criterion is subjective well-being (SWB). Steel, Schmidt, and Shultz (2008) meta-analytically showed that neuroticism has the largest correlation with SWB, followed by extraversion, conscientiousness, and agreeableness. The second criterion is core self-evaluation (CSE), which has been consistently found to be highly correlated with neuroticism, followed by extraversion, conscientiousness, and agreeableness (e.g., Judge, Erez, Bono, & Thoresen, 2003; Judge, Van Vianen, & De Pater, 2004). The third criterion is subjective health (SH). Many studies have found that SH is strongly related to neuroticism, extraversion, and conscientiousness (e.g., Löckenhoff, Sutin, Ferrucci, & Costa, 2008; Stephan, Demulier, & Terracciano, 2012). The fourth criterion is job satisfaction (JS). Meta-analytic evidence shows that workers low in neuroticism and high in conscientiousness, extraversion, and agreeableness are more likely to be satisfied with their jobs (Judge, Heller, & Mount, 2002). If FC scores and SS scores are psychometrically equivalent, they are expected to demonstrate similar correlations with these criteria.
Another important question concerns the comparability of respondent reactions to the FC format and SS format. Previous studies have found that applicants’ reactions to test formats and procedures can affect test validity, test motivation, adverse impact, and various other behavioral intentions (e.g., Hausknecht, Day, & Thomas, 2004; Smither, Reilly, Millsap, Pearlman, & Stoffey, 1993). Clearly, respondent reactions should be studied, and this seems particularly important for the two-alternative forced-choice format, which may require respondents to make very difficult choices (Bartram & Brown, 2004). We are particularly interested in whether respondents react differently to FC and SS in both emotional and cognitive aspects.
The Present Study
Two samples were recruited to examine the degree of equivalence between the FC and SS formats and ascertain respondents’ reactions. Sample 1 employed a between-subjects design where a group of respondents completed the FC version and another group completed the SS measure. A between-subjects design allowed us to study score equivalence and circumvent issues related to single subject response consistency. Sample 2 adopted a within-subjects design where all respondents completed both FC and SS measures, but with an interval of two days between assessments. The order of administration was counterbalanced. This within-subjects design allowed us to directly study the convergent validity between the two formats and the two-day interval should minimize memory effects without running the risk of personality change.
Method
Sample
Between-subjects sample
Two subsamples were recruited from the Amazon Mechanical Turk worker pool. The FC subsample completed TAPAS_FC (see below) and other criterion measures. The SS subsample rated the same statements that make up the forced-choice pairs using a 5-point Likert format and completed the same criterion measures. Six quality control items were embedded in the FC condition and five in the SS condition. The quality control items instructed the respondent to select a particular response option (e.g., strongly agree). In this study, we deleted data from respondents who responded incorrectly to more than one quality control item. To assess test-retest reliability, respondents were asked to complete exactly the same instruments twice with a 10-day interval. Respondents were allowed to enroll in only one sample. Independent groups t tests on the four demographic variables showed that the two samples were demographically similar (t = 0.77 ∼ 1.42, p = .16 ∼ .44). We note that the same set of analyses reported below was also performed on Time 2 data, and the results were almost identical to those of Time 1. Therefore, we report only results from Time 1 data because the sample size is larger.
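The exclusion rule described above (retain a respondent only if no more than one quality control item is answered incorrectly) can be sketched as follows. The item identifiers, instructed options, and response records are hypothetical.

```python
def passes_quality_control(responses, answer_key, max_errors=1):
    """Return True if at most `max_errors` instructed-response items were missed."""
    errors = sum(1 for item, expected in answer_key.items()
                 if responses.get(item) != expected)
    return errors <= max_errors

# Hypothetical instructed-response items and two hypothetical respondents
answer_key = {"qc1": "strongly agree", "qc2": "disagree", "qc3": "agree"}
careful = {"qc1": "strongly agree", "qc2": "disagree", "qc3": "neutral"}  # one miss
careless = {"qc1": "agree", "qc2": "agree", "qc3": "neutral"}             # three misses

kept = [r for r in (careful, careless) if passes_quality_control(r, answer_key)]
print(len(kept))
```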
Within-subjects sample
About half of this sample completed the FC format first and the SS format two days later (FS group), and the other half completed the SS format first and the FC format two days later (SF group). Six quality control items were embedded and we allowed one quality control item to be incorrect. Independent groups t tests on the four demographic variables revealed no significant demographic difference between the two groups (t = –0.92 ∼ 0.39, p = .36 ∼ .70). Also, no order effect was found for the FC and SS scores after Bonferroni correction (t = –2.22 ∼ 2.33, p = .03 ∼ .93). Thus, data from the two groups were combined for further analysis.
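To see why a smallest p value of .03 does not indicate an order effect after Bonferroni correction, consider the following sketch. The number of tests is an assumption made for illustration (20, i.e., 10 facets in each of two formats); the text does not state the exact count, and the p values other than the reported minimum and maximum are placeholders.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag a test as significant only if p < alpha / number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Hypothetical set of 20 p values; .03 and .93 match the reported range,
# the remaining 18 values are placeholders.
p_values = [0.03, 0.93] + [0.50] * 18
threshold, flags = bonferroni_significant(p_values)
print(threshold, any(flags))
```

With 20 tests the per-test threshold would be .0025, so p = .03 would not be declared significant under this assumption.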
All respondents received monetary compensation. Details about sample size and demographic composition of each subsample at each time are shown in Table 1.
Sample Demographic Information.
Note: FC = forced choice; SS = single statement; TAPAS = Tailored Adaptive Personality Assessment System. FS group = The group that completed TAPAS-FC first and TAPAS-SS two days later. SF group = The group that completed TAPAS-SS first and TAPAS-FC two days later. For cells that contain sample sizes, the first number is the raw sample size and the number in parentheses indicates the valid sample size after excluding those who failed the quality control test. Gender: male is coded as 0 and female is coded as 1. Education is measured using a 6-point scale: 1 = primary school; 2 = high school or equivalent; 3 = some college or vocational school; 4 = a bachelor’s degree or equivalent; 5 = a master’s degree or equivalent; 6 = a doctoral or professional degree. Income is measured in dollars using an 8-point scale. 1 = under 10,000; 2 = 10,000-19,000; 3 = 20,000-29,000; 4 = 30,000-39,000; 5 = 40,000-49,000; 5 = 50,000-74,999; 6 = 75,000-99,999; 7 = 100,000-150,000; 8 = over 150,000.
Measures
TAPAS
TAPAS statements were specifically developed for ideal point measurement, and therefore the statement pools for each facet include both extreme (positive and negative) and intermediate items that cover the entire range of the facet. The abridged version used in the current study included 10 facets: intellectual efficiency (IE) and tolerance (TO) for openness, achievement (AC) and order (OR) for conscientiousness, dominance (DO), sociability (SO) and physical conditioning (PC) for extraversion, selflessness (SE) for agreeableness, and even tempered (ET) and optimism (OP) for emotional stability. FC items were created by matching statements on extremity and social desirability based on preexisting ratings (Stark et al., 2014). A brief description of each facet is shown in Table 2.
Brief Descriptions of TAPAS Facets (Nye et al., 2012).
TAPAS_SS
There were 16 statements for IE, 17 items for TO, 17 items for AC, 16 items for OR, 14 items for DO, 17 items for PC, 18 items for SO, 15 items for SE, 18 items for ET, and 18 items for OP.
TAPAS_FC
Statements that made up the FC pairs were the same as those in TAPAS_SS. There were 10% unidimensional pairs and 90% multidimensional pairs.
Criterion measures
Throughout the two samples, we used the Subjective Well-Being Scale (SWBS; Diener, Emmons, Larsen, & Griffin, 1985), the Core Self-Evaluation Scale (CSES; Judge et al., 2003), and a single-item measure of SH (Pinquart, 2001). The Big Five Inventory (BFI; John & Srivastava, 1999) was used in the between-subjects sample, and the Big Five Inventory-2 (BFI-2; Soto & John, 2017) was used in the within-subjects sample as external measures of the Big Five factors. Job satisfaction was measured using the Abridged Job Descriptive Index (AJDI; Stanton et al., 2002) in the within-subjects sample.
Respondent reactions
We also measured respondents’ immediate affect and vitality level right before and right after the administration of either format of TAPAS using the 10-item version of the Positive and Negative Affect Schedule (PANAS; Thompson, 2007) and three items from the Subjective Vitality Scale (SVS; Bostic, Rubio, & Hood, 2000). Respondents were also asked to rate perceived difficulty of responding, preference, how much effort they had expended, and the degree of concentration during the process of responding.
Details about the criterion measures and respondents’ reactions can be found in Table 3. The criterion measures and respondent reactions were all scored by computing the mean after reverse coding because there were no intermediate items. Details about scoring the two formats of TAPAS can be found in the online supplemental material.
Details About Criterion Measures and Respondent Reaction Measures.
Note: SWBS = Subjective Well-Being Scale; CSES = Core Self-Evaluation Scale; SH = subjective health; BFI = Big Five Inventory; E = extraversion; A = agreeableness; C = conscientiousness; N = neuroticism; O = openness; AJDI = Abridged Job Descriptive Index; PANAS = Positive and Negative Affect Schedule; SVS = Subjective Vitality Scale. We were not able to calculate Cronbach’s α for single item measures.
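The mean-after-reverse-coding rule used for the criterion and reaction measures can be sketched as follows. This is a minimal illustration assuming a 1-to-5 response scale; the responses and the positions of the reverse-keyed items are hypothetical.

```python
def scale_score(responses, reverse_keyed, low=1, high=5):
    """Mean item score after reverse-coding the items at the given positions.

    A reverse-keyed response r is recoded as (low + high - r).
    """
    recoded = [(low + high - r) if i in reverse_keyed else r
               for i, r in enumerate(responses)]
    return sum(recoded) / len(recoded)

# Hypothetical four-item scale; items at positions 1 and 3 are reverse keyed
print(scale_score([5, 1, 4, 2], reverse_keyed={1, 3}))
```

This rule presupposes a dominance response process, which is why it suits scales without intermediate items but not the TAPAS statements.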
Response time
Qualtrics recorded response time for the whole test battery. The difference between subgroups (for the between-subjects sample) or the difference for each person between his/her response times for the two tests (for the within-subjects sample) can be used as a proxy to indicate response time differences between the FC and SS measures. This is because the only difference was the TAPAS format (FC or SS), and all other measures were exactly the same.
Results
Psychometric Equivalence
Reliability
Because it is inappropriate to calculate Cronbach’s α for the FC format and the SS format with intermediate items, we calculated item response theory model-based marginal reliability instead (Green, Bock, Humphreys, Linn, & Reckase, 1984). As reported in Table 4, there were two replicable patterns. First, the rank order of reliabilities across formats was similar in both samples (r = .78 and .79). Second, TAPAS_SS facet scores had relatively higher marginal reliability than TAPAS_FC facet scores in both samples. The average marginal reliability was .76 and .77 for TAPAS_FC in the two samples (ranging from .66 to .86), and .89 for TAPAS_SS in both samples (ranging from .79 to .94). Achievement and selflessness assessed by the FC format had reliability lower than .70 in both samples (Rel_AC = .66 and .69, Rel_SE = .67 and .67). The reliability of other facets was above .70. The same two patterns were observed for test-retest reliability: (a) the rank order of test-retest reliabilities of the 10 facets was similar across formats (r = .75); and (b) TAPAS_SS facet scores had relatively higher test-retest reliabilities than TAPAS_FC facet scores. The average was .77 (ranging from .69 to .83) for TAPAS_FC and .89 (ranging from .85 to .93) for TAPAS_SS. The achievement facet again showed the lowest reliability in the FC format (r = .69). Overall, TAPAS_FC showed marginal and test-retest reliability that were generally comparable to the TAPAS_SS.
Marginal Reliability and Test-Retest Reliability.
Note. FC = forced choice; SS = single statement; DO = dominance; SO = sociability; PH = physical conditioning; SE = selflessness; AC = achievement; OR = order; EV = even tempered; OP = optimism; IE = intellectual efficiency; TO = tolerance.
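For readers unfamiliar with marginal reliability, one common empirical form subtracts the average error variance from the variance of the trait estimates and expresses the remainder as a proportion. The sketch below uses that form; several variants exist, and the exact computation used in this study is described in the supplemental material, so this should be read as an illustration under that assumption. The trait estimates and standard errors are hypothetical.

```python
from statistics import mean, pvariance

def marginal_reliability(theta_hat, se):
    """(variance of trait estimates - mean squared SE) / variance of trait estimates.

    This is one common variant; other definitions use the prior trait variance.
    """
    error_var = mean([s ** 2 for s in se])
    trait_var = pvariance(theta_hat)
    return (trait_var - error_var) / trait_var

# Hypothetical trait estimates and their standard errors
theta_hat = [-2.0, -1.0, 0.0, 1.0, 2.0]
se = [0.6, 0.6, 0.6, 0.6, 0.6]
print(round(marginal_reliability(theta_hat, se), 2))
```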
In addition to overall reliability, we also examined conditional standard errors (SEs), which are inversely related to reliability. Specifically, we divided the trait continuum into three ranges ([–4, –1], [–1, 1], and [1, 4]) and computed the average SEs of trait estimates within each range. Results are shown in Table 5. Consistent with overall reliability, trait scores based on the SS format had slightly smaller SEs. Generally, individuals in the middle of the trait continuum (i.e., with scores between –1 and 1) were measured more reliably than those at the ends when the SS format was used. However, the FC format seemed to measure individuals equally well regardless of their standings on the trait continuum. In some cases, the FC format measured individuals at the ends even more reliably than the SS format.
Mean Standard Errors Conditioning on Different Theta Values.
Note: FC = forced choice; SS = single statement; DO = dominance; SO = sociability; PH = physical conditioning; SE = selflessness; AC = achievement motivation; OR = order; EV = even tempered; OP = optimism; IE = intellectual efficiency; TO = tolerance.
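The conditional SE analysis amounts to grouping respondents by their trait estimate and averaging the SEs within each range. A minimal sketch follows; the handling of the shared boundaries (half-open intervals, with the upper bound included only in the top range) is an assumption, and the data are hypothetical.

```python
from statistics import mean

# Ranges taken from the text; boundary handling is an assumption
BINS = {"low": (-4.0, -1.0), "middle": (-1.0, 1.0), "high": (1.0, 4.0)}

def mean_se_by_range(theta_hat, se):
    """Average SE of the trait estimates falling in each theta range."""
    result = {}
    for label, (lower, upper) in BINS.items():
        in_bin = [s for t, s in zip(theta_hat, se)
                  if lower <= t < upper or (label == "high" and t == upper)]
        result[label] = mean(in_bin) if in_bin else None
    return result

# Hypothetical estimates and SEs for six respondents
theta_hat = [-2.0, -1.5, -0.5, 0.5, 1.5, 2.5]
se = [0.50, 0.40, 0.30, 0.30, 0.40, 0.60]
print(mean_se_by_range(theta_hat, se))
```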
Convergent and discriminant validity
As shown in Table 6, the average correlation between scores on the same dimensions assessed using the different formats was .73 and ranged from .58 to .78. A closer inspection revealed that achievement and selflessness had relatively low convergent validities (r = .58 and .65, respectively); those two facets also had lower reliabilities compared to the other personality dimensions. When we corrected the raw convergent validities for unreliability using Spearman’s formula (Spearman, 1904), the average convergent validity reached .89 (ranging from .77 to .96). Overall, there appears to be strong evidence that the FC and SS versions are measuring the same underlying constructs.
Convergent and Discriminant Validity (Within-Subjects Sample).
Note: FC = forced choice; SS = single statement; DO = dominance; SO = sociability; PH = physical conditioning; SE = selflessness; AC = achievement; OR = order; EV = even tempered; OP = optimism; IE = intellectual efficiency; TO = tolerance. Shaded correlations are uncorrected convergent validities. Correlations in the upper triangle are convergent validities corrected for unreliability.
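Spearman’s correction for unreliability divides the observed correlation by the square root of the product of the two reliabilities. The sketch below illustrates the arithmetic; the numeric inputs are hypothetical and are not the values underlying Table 6, and the text does not specify which reliability estimate entered the correction.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Spearman's disattenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical observed correlation and reliabilities for one facet
print(round(correct_for_attenuation(0.60, 0.64, 0.81), 3))
```

The correction grows as either reliability falls, which is why facets with lower reliabilities gain the most from it.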
The mean absolute correlations between different facets within TAPAS_FC and TAPAS_SS were .14 (|r| = .03 ∼ .39) and .27 (|r| = .03 ∼ .57), respectively. These two sets of discriminant correlations had very similar rank orders (r = .85). Similarly, the mean absolute interfacet correlation across formats was .16 (|r| = .00 ∼ .42). In sum, positive results for discriminant validity were found.
Correlations with external measure of the Big Five factors
In the next analysis, TAPAS facet scores were correlated with scores obtained from the BFI (between-subjects sample) and the BFI-2 (within-subjects sample). Results are presented in Table 7. The shaded cells represent the convergent validities with external measures. Two patterns were apparent. First, both TAPAS_FC scores and TAPAS_SS scores showed moderate to high correlations with external measures of the same factors (M_FC.Between = .47, M_SS.Between = .56, M_FC.Within = .52, M_SS.Within = .64; the rank order correlations of the validity of the 10 facets across formats were .96 and .98 in the two samples), providing construct validity evidence for both TAPAS formats. Physical conditioning was an exception because it was only weakly correlated with extraversion scores in the between-subjects sample. Second, TAPAS_FC facet scores showed slightly lower correlations with the single-statement BFI and BFI-2 than TAPAS_SS scores. There were also several moderate correlations among different personality domains that replicated across formats and samples. For example, optimism was moderately correlated with extraversion (r = .35 ∼ .62), agreeableness (r = .31 ∼ .44), and conscientiousness (r = .35 ∼ .44). Even tempered was also moderately correlated with agreeableness (r = .47 ∼ .57). In sum, these findings also support the psychometric equivalence of TAPAS_FC and TAPAS_SS.
Correlations With External Measures of the Big Five Factors.
Note: FC = forced choice; SS = single statement; DO = dominance; SO = sociability; PH = physical conditioning; SE = selflessness; AC = achievement; OR = order; EV = even tempered; OP = optimism; IE = intellectual efficiency; TO = tolerance; E = extraversion; A = agreeableness; C = conscientiousness; N- = neuroticism (scored in the direction of emotional stability); O = openness. Shaded correlations are convergent validities.
Criterion-related validity
Correlations with criterion variables are shown in Table 8. Two similar patterns emerged. First, the criterion-related validities of TAPAS_FC and TAPAS_SS scores showed highly similar rank orders across criteria (the rank order correlations of the criterion-related validity of the 10 facets across formats ranged from .90 to .98 in the two samples). Second, TAPAS_SS scores had relatively higher criterion-related validity in most cases. These validity estimates were largely consistent with previous findings reviewed in the introduction, thus providing another piece of evidence supporting the psychometric equivalence of TAPAS_FC and TAPAS_SS. Next, we discuss each criterion separately.
Criterion-Related Validity.
Note: FC = forced choice; SS = single statement; DO = dominance; SO = sociability; PH = physical conditioning; SE = selflessness; AC = achievement; OR = order; EV = even tempered; OP = optimism; IE = intellectual efficiency; TO = tolerance; SWB = subjective well-being; CSE = core self-evaluation; SH = subjective health; JS = job satisfaction.
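The rank order correlations reported across formats compare the ordering of facet-level coefficients under FC with their ordering under SS. A minimal sketch of such a computation is given below, implemented as a Pearson correlation of average ranks (Spearman’s rho); whether the study used exactly this estimator is an assumption, and the validity coefficients shown are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def rank_order_correlation(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical criterion-related validities of five facets under each format
fc_validity = [0.10, 0.20, 0.30, 0.40, 0.50]
ss_validity = [0.25, 0.15, 0.45, 0.55, 0.65]
print(round(rank_order_correlation(fc_validity, ss_validity), 2))
```

A high rank order correlation indicates that the facets are ordered similarly in the two formats even when the SS coefficients are uniformly larger.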
Subjective well-being
Neuroticism facets had the highest correlations with SWB (M_FC = .40, M_SS = .51), followed by extraversion facets (M_FC = .15, M_SS = .26), and conscientiousness facets (M_FC = .11, M_SS = .25). The agreeableness facet and openness facets had very low correlations with SWB.
Core self-evaluation
Neuroticism facets again showed the highest correlations (M_FC = .53, M_SS = .63), followed by conscientiousness facets (M_FC = .23, M_SS = .40), and extraversion facets (M_FC = .23, M_SS = .38). The openness facet intellectual efficiency was also correlated with CSE (M_FC = .11, M_SS = .30). Selflessness and tolerance did not show substantial correlations with CSE.
Subjective health
Neuroticism facets (MFC = .27, MSS = .33) and extraversion facets (MFC = .24, MSS = .30) showed comparable correlations with SH. The rest of the facets had generally low correlations with SH.
Job satisfaction
We only report results for overall job satisfaction. Detailed results for each job satisfaction facet can be found in the online supplemental material. Neuroticism facets again had the highest correlation with JS (MFC = .35, MSS = .41), followed by the facets of extraversion (MFC = .20, MSS = .32) and achievement (MFC = .23, MSS = .37). Order, selflessness, and the two openness facets did not show consistent correlations with job satisfaction.
Respondent Reaction
As shown in Table 9, even though respondents found the FC format more difficult to answer than the SS format (d = .59) and showed a slight preference for the SS format (d = –.15), they generally did not show differential emotional or cognitive reactions. For example, respondents in both conditions had similar levels of positive affect, negative affect, and vitality before and after responding to TAPAS. They also devoted similarly high levels of effort and concentration.
Table 9. Respondent Reactions.
Note: FC = forced choice; SS = single statement. Because the between-subjects samples were independent, the correlation between their response times could not be calculated; dashes are placed in these two cells. Raw = raw response time; log = log-transformed response time. The formula described in Dunlap, Cortina, Vaslow, and Burke (1996, p. 171) was used to calculate effect sizes for dependent groups.
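For readers unfamiliar with the dependent-groups effect size cited in the note, Dunlap et al. (1996) convert the paired t statistic to d as d = t_c × sqrt(2(1 − r) / n), where r is the correlation between the two sets of scores and n is the number of pairs. The sketch below is a minimal illustration under that formula; the function name and the response times are hypothetical, and it is not the authors' code.

```python
import math
from statistics import mean, stdev

def dependent_groups_d(a, b):
    """Effect size for paired scores: d = t_c * sqrt(2 * (1 - r) / n)."""
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    # Paired-samples t statistic (mean difference over its standard error)
    t_c = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    # Pearson correlation between the two sets of scores
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)
    r = cov / (stdev(a) * stdev(b))
    return t_c * math.sqrt(2 * (1 - r) / n)

# Hypothetical response times in minutes for the same five respondents
fc_minutes = [1, 2, 3, 4, 5]
ss_minutes = [3, 2, 4, 6, 5]
d = dependent_groups_d(fc_minutes, ss_minutes)  # negative when FC is faster
```

When the two sets of scores have equal standard deviations, as in this toy example, the result equals the mean difference divided by that common standard deviation, which is why the correction keeps d on the same metric as the independent-groups case.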
Because the distribution of raw response time was skewed (skewness ranged from 1.94 to 7.8 and kurtosis ranged from 5.61 to 82.79), t test results based on both raw data and log-transformed data are reported. Whether transformed or not, both between-subject data and within-subject data indicated that responding to TAPAS_FC required significantly less time than responding to TAPAS_SS (ps < .001, d_raw = –.37 to –.25, d_log = –.48 to –.46).
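The reason for reporting log-transformed results can be seen in a small sketch: a log transform pulls in the long right tail that is typical of response times. The data below are hypothetical, and the moment-based skewness coefficient is an assumption, since the text does not say which estimator was used.

```python
import math

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical response times in minutes with one very slow respondent
raw_minutes = [10, 12, 13, 15, 18, 25, 40, 120]
log_minutes = [math.log(v) for v in raw_minutes]

skew_raw = sample_skewness(raw_minutes)
skew_log = sample_skewness(log_minutes)  # smaller than skew_raw for these data
```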
Discussion
The present study used both between- and within-subjects designs to examine the equivalence of FC and SS versions of TAPAS. As the results illustrate, substantial evidence was found for equivalence of the two formats. Specifically, both formats demonstrated similarly rank-ordered marginal reliabilities and test-retest reliabilities, high convergent validities, good discriminant validities, and similar criterion-related validities with theoretically relevant criteria. In addition, the formats had little differential impact on respondents’ general emotional and cognitive reactions except that the FC format was perceived to be a bit more difficult and more time-saving.
Psychometric Equivalence Between Formats
Both marginal reliability and temporal stability were obtained for 10 facets of TAPAS administered in FC and SS formats. All 10 facets assessed via the FC format showed moderate to high reliability. More importantly, FC marginal and test-retest reliabilities resembled those of their SS counterparts in terms of rank order despite being relatively lower. We see two possible explanations for FC’s lower reliability. First, scoring of each FC pair was dichotomous: one statement was selected as “most like me” and the other statement was not. SS scoring was polytomous in nature, which appears to offer more psychometric information. For example, respondents can differentiate whether they disagree or strongly disagree. It is therefore not surprising that the SS scale scores were more reliable. A second, more pernicious explanation is that responses to SS scales may be influenced by respondents’ focal trait level as well as other stable but irrelevant traits, such as response styles. Previous studies have found evidence that response styles are stable even across multiple years (e.g., Weijters, Geuens, & Schillewaert, 2010; Wetzel, Lüdtke, Zettler, & Böhnke, 2016). Such stable but irrelevant traits could artificially increase the reliability of SS scales. From this perspective, the seemingly higher reliability of the SS format would not necessarily mean that it is more precisely measuring the intended construct. More studies are needed to test these two explanations.
It is also interesting to see that the FC format provided roughly equal measurement precision across the entire trait continuum. Moreover, the FC format provided slightly more reliable measurement at the two ends of the trait continuum for some traits. This is relevant for personnel selection as it is often those who are high/low on certain traits that are selected/deselected. Accurate measurement of those individuals can facilitate effective decision making. A post hoc explanation is that the test information (inversely related to the standard error of measurement) of the FC format is more spread out than for the SS format due to the difference in the assumed response process. Further studies are needed to examine the generalizability of this finding.
Scores on the same facet measured by different formats were highly correlated. Compared to the FC format, the SS format also consistently demonstrated similar rank-orders but slightly higher correlations with external measures of the Big Five factors, which may be a result of common method bias (Podsakoff et al., 2003). Our findings provide evidence for the construct validity of both formats. A reviewer noted that some facets (e.g., sociability, even tempered) seemed to be slightly more equivalent than other facets (e.g., selflessness, achievement). We suspect that this may be partially due to reliability differences because scores of the less consistent facets had relatively lower reliabilities. Indeed, the discrepancies shrank after correcting for unreliability. However, we still observed some differential equivalence. We speculate that trait desirability may also play a role. Specifically, perhaps more neutral traits (e.g., sociability) can be measured by both formats equally well, thus displaying higher equivalence across formats. More desirable traits (e.g., selflessness) may have more construct irrelevant variance in the SS format (i.e., socially desirable responding), which leads to lower equivalence across formats. However, our current study design and data are insufficient to examine this explanation. Future researchers are encouraged to investigate it by manipulating assessment contexts as trait desirability may change in different contexts (Krumpal, 2013).
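The correction for unreliability mentioned above is commonly carried out with Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two reliabilities. The text does not state which formula was used, so the sketch below is an assumption, and the function name and input values are hypothetical.

```python
import math

def disattenuated_r(r_xy, rel_x, rel_y):
    """Spearman's correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed FC-SS correlation for one facet and the
# reliabilities of the FC and SS scores on that facet
corrected = disattenuated_r(0.60, 0.70, 0.85)
```

Because the divisor shrinks as reliability drops, facets with less reliable scores receive a larger upward correction, which is consistent with the observation that cross-format discrepancies between facets narrowed after correction.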
Apart from the high convergent validity, both formats also displayed good discriminant validity (i.e., correlations between facets were generally low). It is particularly interesting to see that the FC format had greater discriminant validity than the SS format (MFC = .14, MSS = .27), given that the Big Five factors were originally theorized to be orthogonal (Goldberg, 1990; Saucier, 2002). We see this finding as consistent with one of the purported advantages of the FC format: reducing the influence of response styles (Brown & Maydeu-Olivares, 2011). As discussed above, response styles introduce domain-general but irrelevant variance, and thus can artificially increase interfacet correlations. Importantly, the FC format can reduce the impact of response styles.
Although we observed a few high cross-domain correlations (e.g., optimism and extraversion), they were not a major concern for the present study because they were also well-replicated across formats and samples, which again supported the equivalence between the FC and SS formats. In fact, these high cross-domain correlations were consistent with previous findings (e.g., Duijsens & Diekstra, 1995; Milligan, 2003; Sharpe, Martin, & Roth, 2011). Both formats also showed similar correlations with relevant outcomes. These correlations are largely in line with previous studies (e.g., Judge et al., 2002; Judge et al., 2003; Steel et al., 2008; Stephan et al., 2012).
Respondent Reaction Comparability
It is not surprising that the FC measure was considered harder to complete than the SS measure. This finding is in line with what was found with think-aloud techniques (Sass et al., 2018). Despite the perceived difficulty, respondents reported the same degree of concentration and an equal (and high) level of effort when answering both formats. The two formats also elicited almost identical levels of affective states and subjective vitality after their administrations. Although respondents reported a statistically significant preference for the SS format, the effect size was small (d = –.15). A surprising finding is that respondents spent less time answering the FC measures than the SS measures, which is in direct contrast with previous findings showing that FC measurement is more time consuming (Bowen, Martin, & Hunt, 2002). One potential explanation is that previous studies asked respondents to choose “the most like me” and “the least like me” statements from a tetrad. To respond to such a tetrad, six pairwise comparisons per block are needed (one for each pair among the four statements). However, in our design, only one pairwise comparison is involved in each FC block (pair). Therefore, it is expected that a tetrad design is more time-consuming than a two-alternative design. Such a design difference may also explain why previous studies found more negative reactions to FC while we did not. More pairwise comparisons per block would induce more cognitive load, which might result in negative reactions (Fraser, Ma, Teteris, Baxter, Wright, & McLaughlin, 2012).
Limitations and Future Directions
There are several limitations to our studies that need to be addressed in future work. First, we included only low-stakes samples and do not know the generalizability of current results to high-stakes contexts where faking may occur. Faking has been found to result in score inflation (Viswesvaran & Ones, 1999) for SS formats and may even impair the construct validity of SS scales (Schmit & Ryan, 1993). As one of the main features of the FC format is its faking-resistance potential (Cao & Drasgow, in press), it would be a great contribution if we could empirically show that FC scales can prevent score inflation and retain their construct validity in high-stakes contexts. Future research should ask participants to answer FC scales in both high- and low-stakes contexts to compare their psychometric equivalence under these conditions. However, our focus was on the equivalence of the FC and SS formats. This is the most fundamental question that needs to be explicitly answered before researchers and practitioners embrace the FC format. If these formats are not equivalent even in low-stakes situations, it may be pointless to explore the performance of FC in high-stakes situations because researchers cannot be confident that they are measuring the same thing. Fortunately, we found strong evidence supporting equivalence across the two formats. Second, the test-retest intervals used in our studies were relatively short. It would be informative to administer FC to the same people at multiple times across longer intervals. In this way, we would be able to see how the test-retest reliability of FC measures changes across time and compare the change trajectory to that of SS measures. The third limitation is that total response time was used instead of response time for TAPAS. Total response time may introduce more noise and affect the validity of the current conclusion. Readers should exercise caution when interpreting this result.
Fourth, the current study exclusively relied on self-report data, which limits the generalizability of our findings. It is not known whether we would observe the same degree of equivalence if the FC and SS measures were used for third-party ratings. Future researchers are encouraged to explore this direction. Fifth, the external measures used in the present study did not differentiate between facets and the Big Five factors. Again, research is needed with more diverse criteria.
Conclusion
Based on both between-subjects and within-subjects samples with a total of approximately 2,000 participants, we demonstrated that the FC and SS formats are psychometrically equivalent in terms of reliability, convergent validity, discriminant validity, and criterion-related validity. In addition, respondents did not report substantially negative reactions to the FC format. Therefore, we encourage future research examining the performance of FC scales in high-stakes contexts.
Supplemental Material
Supplemental Material, ORM-18-0089_Supplementary_materials for Though Forced, Still Valid: Psychometric Equivalence of Forced-Choice and Single-Statement Measures by Bo Zhang, Tianjun Sun, Fritz Drasgow, Oleksandr S. Chernyshenko, Christopher D. Nye, Stephen Stark and Leonard A. White in Organizational Research Methods
Footnotes
Authors’ Note
Part of the results was presented on April 20, 2018, at the 33rd Annual Conference of the Society for Industrial and Organizational Psychology.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
