Abstract
We introduce a novel, regression-based moderation framework to model faking effects that incorporates evaluation of faking tendency as a moderator. We also consider how perceived trait desirability may be factored into the framework and provide programming code for applied researchers to utilize the method in their research. Using this framework, we revisit a well-known response format (i.e., forced-choice) to formally evaluate its ability to mitigate the effects of applicant faking as compared to the widely used Likert format. The impetus for the latter evaluation stems from the use of item response theory (IRT) modeling to yield non-ipsative scores from forced-choice measures. We found strong support for the need to incorporate moderating effects of faking tendency and desirability in predicting applicants’ responses. Also, we found that the only substantial difference across formats lies in forced-choice scores yielding a lower mean inflation at high faking values. As a result, forced-choice scores do not outperform Likert scores when selection ratios are used but may be beneficial when cutoff scores are used. Application of the moderation framework presented extends to self-reported construct measures of varied kinds.
Self-report personality inventories represent one of the standard assessment tools in many organizational settings. Their popularity comes as no surprise given their ease of use, cost-effectiveness, and substantial levels of validity in predicting a number of organizational outcomes, including academic performance (Poropat, 2009), job performance (Barrick, Mount, & Judge, 2001; Hurtz & Donovan, 2000; Schmidt & Hunter, 1998), as well as various counterproductive work behaviors (Berry, Ones, & Sackett, 2007; Salgado, 2002). Despite these positive aspects of self-report inventories, methodological work has demonstrated critical issues with response bias in scores from self-report inventories (for a review, see Paulhus & Vazire, 2007; Wetzel, Böhnke, & Brown, 2016). Response bias may be exacerbated in high-stakes situations where such instruments are used for collecting information to make decisions about who should advance in a job selection process, where applicants may be inclined to provide responses they believe might increase their chances of performing well. Intentional, goal-oriented behaviors that emerge in an interaction between applicants and an assessment context have been commonly referred to as applicant faking, given the propensity for this set of behaviors to result in inaccurate and enhanced impressions (MacCann, Ziegler, & Roberts, 2011).
Applicant faking has been a concern in self-reported personality assessment, particularly in high-stakes situations where applicants may perceive value in presenting themselves in a positive light to selection personnel. Selection specialists consider applicant faking one of the major threats to the utility of self-report personality information (e.g., Goffin & Christiansen, 2003; Griffith, Chmielowski, & Yoshita, 2007; Robie, Tuzinski, & Bly, 2006). Empirical evidence casts little doubt as to applicants’ ability to substantially elevate their scores on self-rated personality scales (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006; Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Morgeson et al., 2007; Stark, Chernyshenko, Chan, Lee, & Drasgow, 2001; Viswesvaran & Ones, 1999), calling into question the construct validity of the measures (Schmit & Ryan, 1993; Stark et al., 2001) as well as introducing negative impacts on the quality of selection decisions (Berry & Sackett, 2009; Donovan, Dwight, & Schneider, 2014; Griffith et al., 2007; Mueller-Hanson, Heggestad, & Thornton, 2003; Robie, Brown, & Beaty, 2007; Rosse, Stecher, Miller, & Levin, 1998). Despite the accumulating evidence of these detrimental effects of faking on the utility of self-report personality information, as well as reports of the substantial prevalence of faking in applied settings (Griffith & Chet, 2013; Griffith & Converse, 2011), the personality test industry shows continued growth (Psychometric Success, 2013). Reliance on self-reported data in personality research as well as other psychosocial domains more broadly makes it incumbent to develop models that can empirically quantify potential faking that can compromise integrity of observed scores and any ensuing inference from the scores. 
Numerous efforts to counter applicant faking, aimed at either preventing it (e.g., instance warnings: Dwight & Donovan, 2003; alternative response formats: Norman, 1963; implicit personality measures: Greenwald, McGhee, & Schwartz, 1998; James et al., 2005) or detecting it (e.g., impression management scales: Paulhus, 1984; bogus items: Anderson, Warner, & Spencer, 1984; overclaiming techniques: Paulhus, Harms, Bruce, & Lysy, 2003), have been rewarded with only a modest amount of success (Griffith & Chet, 2013; Ones, Dilchert, Viswesvaran, & Judge, 2007; Steffens, 2004). Advancing our understanding of the faking phenomenon and developing and improving methods for countering it thus remains an outstanding need in personnel assessment as well as in personality research more broadly.
Accordingly, the present study has two primary goals. First, we introduce a novel, regression-based moderation framework to model faking effects that incorporates evaluation of faking tendency as a moderator. We additionally consider how perceived trait desirability may be factored into the framework and provide programming code and step-by-step instructions for applied researchers to utilize the method in their research. Second, we revisit a well-known response format (i.e., forced-choice) to formally evaluate its ability to mitigate the effects of applicant faking as compared to the widely used Likert format. The impetus for the latter evaluation stems from a recently developed item response theory (IRT) model that yields non-ipsative scores from forced-choice measures. We consider both study goals within a single, experimental paradigm that mimics low-stakes and high-stakes response conditions to facilitate evaluation of specific study hypotheses.
In the following sections, we first review the forced-choice format as a method for countering applicant faking, pinpoint major problems with traditional scoring procedures of forced-choice responses, and explain how the new IRT model for forced-choice responses overcomes previous limitations. Next, we discuss the validity of assumptions underlying the most common statistical approaches for estimating faking effects (i.e., the standardized mean difference model and the approach based on the correlation between honest and applicant trait scores). We then develop a moderation model in which we account for the differences in applicants’ faking tendencies, show how the proposed model subsumes and integrates traditional approaches as special cases, and extend the framework to account for individual differences in perceived desirability of measured attributes. Finally, we conduct a simulated applicant scenario experiment, applying the proposed model to real data to: (a) estimate the effects of faking on applicant trait scores, (b) assess differences in faking effects across Likert and forced-choice response formats, and (c) evaluate efficacy of the model in selection decision quality.
The Forced-Choice Response Format
In the forced-choice response format, respondents are presented with items organized in blocks and are asked to fully or partially rank the items according to how well the items describe them. For instance, in Brown and Maydeu-Olivares’s (2011a) Forced Choice Five Factor Markers, respondents are presented with items in blocks of three, with each item reflecting a different Big Five personality dimension (Goldberg, 1990), and are asked to indicate which of the items describes them the most and which describes them the least. An example item block is presented in the following.
In contrast to a Likert response format, where all of the items could be endorsed, the forced-choice response format imparts a heavier cognitive load on applicants by requiring them to make decision trade-offs. That is, an applicant cannot simultaneously endorse all items in the block. If in addition the items within each block are deemed similarly desirable, it is hoped that applicants might have difficulties identifying the most desirable response and might hence be forced to respond honestly (Dilchert & Ones, 2011; Gordon & Stapleton, 1956).
Although it has been shown that the forced-choice item representation can reduce numerous response biases commonly associated with the Likert response format (see Wetzel et al., 2016), empirical evidence regarding the robustness of the forced-choice format to applicant faking has been mixed. Past faking research evaluating the forced-choice format has predominantly relied on experimental, instructed faking studies that focus on two estimates of faking effects: (a) standardized mean differences and (b) correlations between honest and applicant trait scores. Research to date on standardized mean differences has produced mixed findings, ranging from negligible (Longstaff & Jurgensen, 1953; Rusmore, 1956; Vasilopoulos, Cucina, Dyomina, Morewitz, & Reilly, 2006), to moderate (Christiansen, Burns, & Montgomery, 2005; Gordon & Stapleton, 1956; Jackson, Wroblewski, & Ashton, 2000), to high mean inflation of faked forced-choice trait scores (Dicken, 1959; Heggestad, Morrison, Reeve, & McCloy, 2006; Maher, 1959). Previous research conducted on correlations between honest and applicant trait scores has also reported mixed findings, ranging from low (Christiansen et al., 2005; Heggestad et al., 2006), to moderate (Longstaff & Jurgensen, 1953), to substantial correspondence between honest and applicant sets of forced-choice trait scores (Gordon & Stapleton, 1956; Rusmore, 1956). Findings have been equally heterogeneous and inconclusive regarding how the forced-choice response format performs in comparison to the Likert format. Despite considerable variation in reported effect sizes, most of the comparative studies between forced-choice and Likert format responses have reported lower standardized mean inflation of forced-choice in comparison to Likert trait scores (Bowen, Martin, & Hunt, 2002; Christiansen et al., 2005; Heggestad et al., 2006; Hirsh & Peterson, 2008; Jackson et al., 2000; Vasilopoulos et al., 2006).
However, correlations between honest and applicant trait scores have been shown to be comparable across formats (Christiansen et al., 2005; Heggestad et al., 2006). Comparisons across response formats on the quality of selection decisions have suggested that both formats perform poorly in preserving the normative ordering of applicants, especially at the top end of trait score distributions (Heggestad et al., 2006).
Traditional Scoring of Forced-Choice Items
Responses to blocks of items in a forced-choice questionnaire are simply rankings. Traditionally, the items in a block are scored using inverse ranks. For instance, in a block with two items, the preferred item (Rank 1) adds 1 point to its respective scale, and the nonpreferred item (Rank 2) adds 0 points to its respective scale. For blocks of three items, such as the one from the Forced Choice Five Factor Markers shown previously, the item with the highest ranking adds 2 points to its respective scale, the lowest ranked item adds 0 points, and the remaining item adds 1 point to its respective scale. When the number of items per block exceeds the number of response options, partial rankings are commonly obtained. For instance, if items are presented in blocks of four, we obtain a partial ranking in the sense that we know what the most and least desirable choices are from the individual’s responses, but we do not know how the remaining items within the block are ordered. In this case, all items ranked in intermediate positions add 1 point to their respective scales.
Given the scoring scheme explained previously, item scores in each block always add up to the same number, and the total test score (sum of all blocks, also the sum of all scale scores) is the same for every respondent. Data with these properties are called ipsative. Ipsative scoring distorts individual profiles (i.e., it is impossible to achieve all high or all low scale scores), construct validity (i.e., covariances between scale scores must sum to zero), criterion-related validity (i.e., validity coefficients must sum to zero), and reliability estimates (Brown & Maydeu-Olivares, 2013).
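The inverse-rank scoring described above, and the constant total it produces, can be illustrated with a minimal Python sketch. The four triplets, the trait labels, and the two response patterns are hypothetical and are not taken from the Forced Choice Five Factor Markers; the function names are likewise illustrative.

```python
def score_block(block_traits, most, least):
    """Inverse-rank points for one most/least block: 2, 1, or 0 per item."""
    points = {}
    for position, trait in enumerate(block_traits):
        if position == most:
            value = 2      # item ranked first
        elif position == least:
            value = 0      # item ranked last
        else:
            value = 1      # item(s) in intermediate positions
        points[trait] = points.get(trait, 0) + value
    return points


def scale_scores(blocks, responses):
    """Sum block points into one traditional (ipsative) score per trait."""
    totals = {}
    for traits, (most, least) in zip(blocks, responses):
        for trait, value in score_block(traits, most, least).items():
            totals[trait] = totals.get(trait, 0) + value
    return totals


# Hypothetical questionnaire: four triplets; letters are Big Five trait labels.
blocks = [("E", "A", "C"), ("N", "O", "E"), ("A", "C", "N"), ("O", "E", "A")]
# Each response is (index of "most like me", index of "least like me").
respondent_1 = [(0, 2), (2, 0), (1, 0), (1, 2)]
respondent_2 = [(2, 0), (0, 1), (2, 1), (0, 2)]

scores_1 = scale_scores(blocks, respondent_1)
scores_2 = scale_scores(blocks, respondent_2)

# Profiles differ, but every respondent's total is 4 triplets x 3 points = 12.
print(scores_1, sum(scores_1.values()))
print(scores_2, sum(scores_2.values()))
```

Because each triplet always distributes exactly 3 points, a respondent can raise one scale score only by lowering another, which is the ipsativity constraint discussed in the text.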
There has been ubiquitous concern in the personnel selection literature regarding the ipsative properties of forced-choice scores (Dilchert & Ones, 2011; Dilchert, Ones, Viswesvaran, & Deller, 2006; Heggestad et al., 2006; McCloy, Heggestad, & Reeve, 2005; Ones et al., 2007). Specifically, because ipsative scores distort applicants’ normative standings on the measured attributes, their appropriateness for making interindividual comparisons has been contested (e.g., Hicks, 1970; McCloy et al., 2005; Meade, 2004). Although it has been shown that some (scoring and design) approaches can result in partially ipsative data and somewhat mitigate the undesirable score properties (see Hicks, 1970), the resulting scores still cannot convey purely normative trait information because response dependencies are not appropriately accounted for. The latter has limited the ability to evaluate the impact of applicant faking in the forced-choice response format effectively. Fortunately, the undesirable properties of arbitrarily scored forced-choice responses can now be overcome with recently developed IRT-based scoring methods, which take into account the dependencies among responses as well as the response process to items presented in forced-choice blocks (Brown & Maydeu-Olivares, 2011b, 2013; Stark, Chernyshenko, & Drasgow, 2005, 2011; Stark et al., 2014).
IRT Scoring of Forced-Choice Questionnaires: The Thurstonian IRT Model
Ranking data can be coded using dummy variables (i.e., dummies), one for each ordered combination of items. For instance, responses to the three-item block shown earlier can be coded using three dummies: one comparing Item 1 to Item 2, a second comparing Item 1 to Item 3, and a third comparing Item 2 to Item 3. The dummies take the value of 1 if the first item in the comparison better describes the respondent than the second item (and 0 otherwise). When the most/least like me format is used in blocks of more than three items, partial rankings are obtained, and there are missing data in some of these comparisons. Such missing data can be dealt with using multiple imputation (Brown & Maydeu-Olivares, 2011b). Thurstonian modeling (Brown & Maydeu-Olivares, 2011b; Maydeu-Olivares & Böckenholt, 2005; Maydeu-Olivares & Brown, 2010; Thurstone, 1927, 1931) provides the most straightforward and general framework for modeling ranking data (including partial rankings). The Thurstonian factor model is simply an extension of the well-known normal ogive IRT model for binary data (McDonald, 1999) to items presented in blocks. According to this framework, an individual describes Item A as characterizing him or her better than Item B if the psychological value (or utility) of Item A is larger than the psychological value of Item B, in other words, if the difference of psychological values A – B is positive. The psychological values are then related to the latent traits measured by the questionnaire with a standard factor analysis model. The resulting model (a Thurstonian factor model for rankings) is a second-order factor analysis model for binary data (i.e., the dummy variables used to code the rankings). The first-order factor loading is a block diagonal matrix of fixed contrasts; for instance, (1, –1) if items are presented in pairs. 
The second-order factor loading matrix, interfactor correlations, and second-order uniquenesses are the standard outcome from a factor analysis model. Finally, the first-order uniquenesses are zero as the relationship between the psychological values and the binary outcomes is deterministic. Consider if individuals are asked to rank Items A, B, and C. If they prefer Item A to B and Item A to C, they must prefer B to C (the ranking format does not allow for intransitive responses). The Thurstonian IRT model (Brown & Maydeu-Olivares, 2011b) is a reparameterization of the Thurstonian factor model as a first-order factor model so that respondents’ scores can be straightforwardly obtained using general computer software. Following Brown and Maydeu-Olivares (2012), we use Mplus (Muthén & Muthén, 2015) to estimate this model.
It is easy to infer from the previous description that the Thurstonian factor model (and its reparameterization, the Thurstonian IRT model) provides desirable features for modeling forced-choice data: (a) It provides a plausible psychological model for responding to the items presented in this fashion, (b) accounts for the within-ranking dependencies through the use of contrast matrices, (c) accounts for the between-ranking dependencies by means of a common factor model, (d) assigns probability zero to all patterns of binary variables that correspond to intransitive responses (Maydeu-Olivares, 1999), and (e) can be estimated using software for ordinal factor analysis.
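The binary coding of rankings that feeds the Thurstonian model can be sketched in a few lines of Python. This is only an illustration of the dummy coding for the most/least format, not the estimation itself; the function name and the use of `None` to mark an unobserved comparison are assumptions made for this example.

```python
from itertools import combinations


def most_least_to_dummies(n_items, most, least):
    """Code one most/least response as pairwise binary outcomes.

    For each pair (i, k) with i < k, the outcome is 1 if item i is preferred
    to item k, 0 if item k is preferred, and None if the comparison is not
    observed (both items sit in intermediate positions).
    """
    outcomes = {}
    for i, k in combinations(range(n_items), 2):
        if i == most or k == least:
            outcomes[(i, k)] = 1
        elif k == most or i == least:
            outcomes[(i, k)] = 0
        else:
            outcomes[(i, k)] = None
    return outcomes


# Triplet: item 1 chosen as "most like me", item 0 as "least like me".
# The full ranking is recoverable, so all three comparisons are observed.
triplet = most_least_to_dummies(3, most=1, least=0)
print(triplet)

# Block of four: item 0 "most", item 3 "least". The comparison between the
# two intermediate items (1 vs. 2) is missing by design.
quad = most_least_to_dummies(4, most=0, least=3)
print(quad)
```

A triplet yields three fully observed comparisons, whereas a block of four yields six comparisons with one unobserved, which is the source of the missing data mentioned in the text.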
Estimating the Effects of Faking
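Inferences about the effects of faking on both Likert and forced-choice scores have commonly been based on mean differences and/or correlations between honest (i.e., low stakes) and applicant (i.e., high stakes) trait scores. The null hypothesis of no mean difference on a personality attribute (i.e., trait) is

$$H_0\colon \beta_0 = \mu_{\text{applicant}} - \mu_{\text{honest}} = 0,$$

where $\mu_{\text{honest}}$ and $\mu_{\text{applicant}}$ denote the two corresponding population means. For repeated measures designs, this null hypothesis can be tested using a paired samples t test. Interpretation of β0 is enhanced if honest and applicant scores are on a standardized metric. Given that units of measurement are often arbitrary in different applications, it may be more fruitful to use standard deviations as units of measurement. We standardize honest scores and applicant scores in this study using the mean and standard deviation of the honest scores:

$$X_{\text{honest}} = \frac{Y_{\text{honest}} - \bar{Y}_{\text{honest}}}{s_{\text{honest}}}, \qquad X_{\text{applicant}} = \frac{Y_{\text{applicant}} - \bar{Y}_{\text{honest}}}{s_{\text{honest}}},$$

where $Y$ denotes a raw trait score and $\bar{Y}_{\text{honest}}$ and $s_{\text{honest}}$ denote the sample mean and standard deviation of the honest scores.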
Critically, we note that the aforementioned null hypothesis may be equivalently tested via a simple linear regression model:

$$X_{\text{applicant}} = \beta_0 + X_{\text{honest}} + e, \qquad (1)$$
where Xhonest and Xapplicant denote the honest and applicant trait scores after the previous standardization, and e denotes the residual. The model defined by Equation 1 suggests that applicants increase their honest scores by a constant, β0, assuming that the measured trait is desirable for the target job. This model serves as the baseline in the subsequent model building because it simply describes the standardized mean difference between honest and applicant trait scores. The increase of applicant scores at the average honest score is expressed in honest standard deviation units. The model assumes a slope of 1 between honest and applicant scores. The tenability of this assumption can be tested using the following linear regression model:

$$X_{\text{applicant}} = \beta_0 + \beta_1 X_{\text{honest}} + e. \qquad (2)$$
Note that fitting the model defined by Equation 2 is formally equivalent to computing a correlation between honest and applicant trait scores, as the correlation is simply the standardized regression slope. Hence, this second model in our moderation approach is equivalent to the conventional correlation method to test faking effects. If the confidence interval for β1 does not include 1, the underlying assumption of the standardized mean difference model (i.e., Equation 1) is violated, and inferences regarding the effect of faking should be based on the correlational model. The correlational model aligns with previous faking literature that suggests that respondents’ honest scores should be a determinant of the magnitude of score inflation, such that score inflation will not be constant across respondents (Ellingson, 2011; McFarland & Ryan, 2000; Ziegler, Maaß, Griffith, & Gammon, 2015). Thus, Equation 2 provides a more appropriate estimation of faking effects as it subsumes Equation 1 as a special case.
Although the model defined by Equation 2 takes into account the expectation that respondents with different honest scores inflate their scores to a different extent, it is limited in that it does not consider that respondents with identical honest scores may also differ in how much they inflate their scores. Empirical evidence has demonstrated that respondents can be differentially induced to fake given varying manipulation scenarios, independent of honest trait standings (Jansen, König, Kleinmann, & Melchers, 2012; McFarland & Ryan, 2006; Mueller-Hanson et al., 2006). That is, within individuals with a common honest trait standing, there are individual differences in their tendency to fake. For instance, results in Robie et al. (2007) showed that respondents can be classified based on their motivation to fake and defined three faking classes: honest, slight fakers, and extreme fakers. Similarly, Zickar, Gibby, and Robie (2004) revealed two distinct faking patterns via an IRT analysis, slight faking and extreme faking, and Ziegler (2011) demonstrated comparable results using a cognitive interview technique. Finally, König, Merz, and Trauffer (2012) documented that respondents may differ in their tendency to fake as reflected in a range of applicants’ response strategies. To account for the potential individual differences in faking that emerge as reactions to situational demands, the model can be expanded in the following form:

$$X_{\text{applicant}} = \beta_0 + \beta_1 X_{\text{honest}} + \beta_2 F + e, \qquad (3)$$
where F denotes faking tendency. We can mean-center F to improve interpretation of the intercept β0. The model defined by Equation 3 states that the applicant score of a respondent with an average honest score and average faking tendency will increase by β0 honest standard deviation units. More generally, the model states that applicants’ scores will depend on both their honest scores and their faking tendency. The model defined by Equation 3 assumes that applicant scores will increase depending on the faking tendency regardless of respondents’ honest scores; existing faking theory, however, suggests that a multiplicative effect of applicants’ faking tendency and their honest trait standings may be more plausible in some circumstances (e.g., McFarland & Ryan, 2006; Tett & Simonet, 2011). We can model the possibility that the correspondence between honest and applicant trait scores varies as a function of faking tendency with the following moderation model:

$$X_{\text{applicant}} = \beta_0 + \beta_1 X_{\text{honest}} + \beta_2 F + \beta_3 X_{\text{honest}} F + e, \qquad (4)$$
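which can be rewritten in simple intercepts and slopes form to facilitate plotting interaction effects:

$$X_{\text{applicant}} = (\beta_0 + \beta_2 F) + (\beta_1 + \beta_3 F) X_{\text{honest}} + e, \qquad (5)$$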
where β1 + β3F in Equation 5 is an estimate of the conditional effect of faking on the applicants’ trait scores at a specific value of faking tendency, defined by F. We provide a comparison of the proposed moderation framework to conventional approaches to estimating faking effects in Table 1 for interested readers.
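A minimal sketch of fitting the moderation model and reading off simple intercepts and slopes follows. It uses a small pure-Python least-squares routine so that it runs without external packages; in practice a statistics package would be used. The data are noise-free and generated from made-up coefficients so that the fit can be verified, and the names `ols`, `simple_intercept`, and `simple_slope` are hypothetical helpers, not part of the authors' code.

```python
def ols(predictors, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot_row = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[i][k] for i in range(k)]


def simple_intercept(coefs, f):
    """beta0 + beta2 * F: expected applicant score at the average honest score."""
    return coefs[0] + coefs[2] * f


def simple_slope(coefs, f):
    """beta1 + beta3 * F: honest-applicant slope at faking tendency F."""
    return coefs[1] + coefs[3] * f


# Synthetic design: standardized honest scores crossed with 0-10 faking ratings.
honest = [x for x in (-1.5, -0.5, 0.5, 1.5) for _ in range(5)]
faking_raw = [f for _ in range(4) for f in (0, 2, 5, 8, 10)]
mean_f = sum(faking_raw) / len(faking_raw)
faking = [f - mean_f for f in faking_raw]          # mean-centered F

# Made-up generating coefficients (beta0, beta1, beta2, beta3); beta3 < 0
# means the honest-applicant relationship weakens as faking tendency rises.
TRUE = (0.8, 0.6, 0.10, -0.05)
applicant = [TRUE[0] + TRUE[1] * x + TRUE[2] * f + TRUE[3] * x * f
             for x, f in zip(honest, faking)]

product = [x * f for x, f in zip(honest, faking)]  # interaction term
coefs = ols([honest, faking, product], applicant)

for f in (-3.0, 0.0, 3.0):
    print(f, round(simple_intercept(coefs, f), 3), round(simple_slope(coefs, f), 3))
```

With a negative interaction coefficient, the simple slope shrinks and the simple intercept grows as F increases, which is the pattern that plotting Equation 5 at low, average, and high faking tendency would display.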
Table 1. Comparing Methods to Estimate Faking Effects
Application of the moderation framework presented is not limited to evaluating biases in personality assessment but rather extends to self-reported construct measures of varied kinds. We submit that the proposed framework may be of particular interest to researchers in related areas, such as socially desirable responding, impression management, self-enhancement, malingering, and so on.
Effects of Faking on Likert and Forced-Choice Scores
The presented moderation framework additionally facilitates evaluating the possibility of differential effects across response formats. Such investigations may be valuable given previous research has suggested differential effects of applicant faking due to item presentation format (e.g., Christiansen et al., 2005; Jackson et al., 2000; Stark et al., 2011). For example, in contrast to the Likert response format where applicants can relatively simply manipulate the severity of faking by choosing appropriate response categories, severity of faking in the forced-choice format depends on the extent to which applicants’ responses match the response patterns of respondents with high honest trait standings in the normative sample. The forced-choice format does not lend itself to a simple score maximization strategy as scoring parameters are unknown to job applicants (Stark et al., 2011).
Effects of Perceived Trait Desirability on Faking Tendency
The moderation model defined by Equation 4 posits that respondents with equal honest trait scores and equal faking tendency will obtain equal applicant trait scores. However, differences in applicant scores may also arise from the perceived desirability of the given measured traits for a target job position. That is, respondents with similar honest scores and faking tendency might differ in their perceptions of how desirable a measured trait is for a given position. Previous research has documented that the perception of measured traits’ desirability for a job can considerably vary across respondents (Kuncel & Borneman, 2007; Kuncel & Tellegen, 2009) and may impart an influence on the inflation of corresponding observed trait scores (Christiansen et al., 2005; Jansen et al., 2012). Moreover, Jansen and colleagues (2012) argued that the salience of perceived trait relevance activates when the motivation to fake is high. Accordingly, we might expect that the effects of perceived trait desirability on applicant trait scores will be a function of the faking tendency such that perceived trait desirability and faking tendency interact to influence applicant trait scores. We expand Equation 4 to account for this possibility:

$$X_{\text{applicant}} = \beta_0 + \beta_1 X_{\text{honest}} + \beta_2 F + \beta_3 X_{\text{honest}} F + \beta_4 D + \beta_5 D F + e, \qquad (6)$$
where D denotes the perceived desirability of a trait for a particular job target.
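To make the role of D explicit, the expanded model can be regrouped in the same way as the simple intercepts and slopes form. The derivation below is a sketch that assumes the expansion adds a main effect of D and a D × F product term to Equation 4; the coefficient labels β4 and β5 are assumed here for illustration.

```latex
\begin{aligned}
X_{\text{applicant}}
  &= \beta_0 + \beta_1 X_{\text{honest}} + \beta_2 F + \beta_3 X_{\text{honest}} F
     + \beta_4 D + \beta_5 D F + e \\
  &= (\beta_0 + \beta_2 F) + (\beta_1 + \beta_3 F)\, X_{\text{honest}}
     + (\beta_4 + \beta_5 F)\, D + e
\end{aligned}
```

Under these assumptions, the conditional effect of perceived desirability at a given level of faking tendency is β4 + β5F, so a positive β5 would indicate that desirability matters more as faking tendency increases.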
Study Hypotheses
Given the proposed moderation framework and discussion of potential differential effects of faking tendency across item response format and/or levels of perceived trait desirability, we consider several themes that emerge for evaluation. First, we contend that faking tendency will moderate the relationship between scores in honest (i.e., low stakes) and applicant (i.e., high stakes) conditions regardless of response format. We further surmise that the forced-choice format may be able to mitigate the influence of faking tendency such that one would observe a higher correspondence between honest and applicant scores in the forced-choice format relative to Likert. By extension, we also consider that selection decisions may be more accurate and discerning with forced-choice data. Finally, we believe that faking tendency may moderate the influence of perceived trait desirability in both formats. We formalize these suppositions for testing in the following hypotheses:
Hypothesis 1: Faking tendency will moderate the relationship between honest trait scores and applicant trait scores obtained from both Likert and forced-choice response formats. The relationship between the two sets of scores will decrease as faking tendency increases.
Hypothesis 1 will be tested via a model building approach and ultimately by examining the significance and the sign of β3 in the model underlying Equation 4.
Hypothesis 2 focuses on the potential utility of the forced-choice item format in mitigating faking effects. Testing the hypothesis makes use of the simple intercepts and slopes model presented in Equation 5 and is comprised of two parts:
Hypothesis 2a: At high faking tendency, the relationship between honest and applicant forced-choice trait scores will be significantly stronger than the relationship between honest and applicant Likert trait scores.
Hypothesis 2b: At high faking tendency, the expected value of applicant trait scores at the average level of honest trait scores will be significantly higher in the Likert format than the forced-choice format.
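We will examine Hypothesis 2a by comparing the simple slopes (i.e., β1 + β3F) and Hypothesis 2b by comparing the simple intercepts (i.e., β0 + β2F) across response formats at high faking tendency.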
Predicated on findings from Hypotheses 2a and 2b, we will assess the accuracy of selection decisions across item response formats relying on the two commonly used personality selection systems, selection ratios, and cut scores (Berry & Sackett, 2009). This evaluation is formalized in Hypothesis 3, which is comprised of two parts:
Hypothesis 3a: At low selection ratios, selection decisions based on the applicant forced-choice trait scores will be significantly more accurate than those based on applicant Likert trait scores.
Hypothesis 3b: At high cutoff scores, the increase in proportion of selected respondents based on the applicant forced-choice trait scores will be lower than based on the applicant Likert trait scores.
If the forced-choice format indeed presents difficulty for extreme fakers to obtain high trait scores, then we would expect that Likert trait scores should show a higher score inflation relative to forced-choice trait scores. Following the same argument, we expect that the forced-choice format will be less sensitive to the effect of faking on respondents’ rank ordering such that the quality of selection decisions will differ across formats as well under low selection ratios.
Based on previous literature suggesting that a trait’s perceived desirability for a target job may interact with faking tendency to influence observed trait scores, we also put forth:
Hypothesis 4: Faking tendency will moderate the effect of perceived trait desirability on both Likert and forced-choice applicant trait scores. The effect of perceived desirability will increase as faking tendency increases.
Method
Participants
One hundred and eighty undergraduate psychology students from a large public university in Spain participated in the study. Participants voluntarily completed the online experiment in exchange for a detailed feedback report on their performance (i.e., a personalized personality profile) and a chance to win a $25 gift certificate for the local university bookstore. Seventy-nine percent of the respondents were women, and age in the sample ranged from 18 to 46 (M = 21.5, SD = 5.67). Nine participants did not report their age.
Procedure
The study was administered in two parts via the Qualtrics survey platform. The first part (i.e., Time 1) defined the honest condition; the second part (i.e., Time 2) defined the applicant condition. After providing informed consent, respondents completed a background questionnaire with basic demographic information. A Big Five personality measure was then administered in both forced-choice and Likert response formats, in exchange for a detailed personality assessment report. The sequence of the presentation of the two formats was counterbalanced to control for potential order effects. Respondents were instructed to respond to the measures as honestly as possible and informed that their responses would be kept confidential. The applicant condition (i.e., Time 2) was administered three weeks after the honest condition (i.e., Time 1). Before responding to the Big Five personality measure in Time 2, respondents were led through a guided scenario exercise. We asked respondents to imagine that they were applying for a particular job opening. We specified a particular job target along with a general job description for the position in order to collect job-specific perceived desirability information. We told respondents to imagine that they were applying for a job opening for a human resource assistant (HRA) position and to imagine that the selection decision was made based on the performance on the personality test that followed. We then provided a detailed job description for the position. Following these instructions, participants completed the same Big Five measure as in Time 1 in both Likert and forced-choice test formats. Presentation of the measures was counterbalanced to mitigate possible order effects. We provided no explicit faking instruction to respondents at Time 2. After completing the Big Five personality measure under this simulated applicant scenario, participants completed a self-report faking measure. 
Participants were explicitly urged to try to respond as honestly and accurately as possible to the self-report faking measure. In the final section of the experiment, respondents were asked to rate the desirability of all 60 personality statements in relation to the specified target job. After completing the questionnaire, participants were debriefed on the purpose of the study and thanked for their participation.
Measures
Personality
We utilized the Forced Choice Five Factor Markers (Brown & Maydeu-Olivares, 2011a) and its Likert version. This 60-item measure contains 12 indicators of each Big Five personality dimension, 8 of which are positively keyed and 4 negatively keyed. In the Likert response format, respondents are asked to rate each item on a 5-point scale, ranging from 1 (very accurate) to 5 (very inaccurate). In the forced-choice response format, items were combined into triplets with each of the items being an indicator of a different Big Five trait. Respondents are asked to indicate one item that describes them the most and one that describes them the least.
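The most/least choice within a triplet fixes a full ranking of its three items, and that ranking can be expressed as three binary pairwise outcomes, the kind of dummy coding commonly used as input to Thurstonian IRT models. A minimal sketch of that coding follows; the item labels and the helper name `code_triplet` are hypothetical and do not come from the instrument itself.

```python
from itertools import combinations

def code_triplet(items, most, least):
    """Turn a most/least choice in a three-item block into pairwise binary outcomes."""
    if len(items) != 3 or most == least or most not in items or least not in items:
        raise ValueError("need three items and two different choices from the block")
    # With three items, picking the most and the least fixes the full ranking.
    middle = next(i for i in items if i not in (most, least))
    rank = {most: 0, middle: 1, least: 2}
    # One outcome per pair: 1 if the first item is preferred to the second, else 0.
    return {(a, b): int(rank[a] < rank[b]) for a, b in combinations(items, 2)}

outcomes = code_triplet(("A", "B", "C"), most="B", least="A")
print(outcomes)  # {('A', 'B'): 0, ('A', 'C'): 0, ('B', 'C'): 1}
```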
Faking tendency
The following instruction was used: “Please answer the following question as honestly and as accurately as possible. To what extent did a desire to get the position of Human Resource Assistant lead you to distort your answers?” The response scale ranged from 0 to 10 (0 = I did not distort my answers. I responded honestly; 10 = I distorted my answers a lot).
Perceived trait desirability
Respondents were asked to rate each of the 60 items on their desirability for the target job position (HRA). Each statement was assessed on a 5-point rating scale ranging from 1 (very undesirable) to 5 (very desirable), and responses were coded from −2 to +2, respectively. We summed responses to the 12 items measuring each trait to create the perceived trait desirability measure. Negatively keyed items were reverse-coded prior to computation. Internal consistency reliability estimates (Cronbach’s alpha) for the perceived trait desirability scores obtained in this fashion ranged from .77 for Extraversion to .87 for Conscientiousness.
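The scoring steps described above (recode, reverse-code, sum) can be sketched as follows. The function name `desirability_score` and the way negatively keyed items are flagged (a set of positions) are assumptions made for illustration only.

```python
def desirability_score(ratings, negatively_keyed):
    """Sum 12 item ratings (1-5) into a trait desirability score from -24 to +24."""
    if len(ratings) != 12:
        raise ValueError("expected ratings for the 12 items of one trait")
    total = 0
    for idx, rating in enumerate(ratings):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be on the 1-5 scale")
        coded = rating - 3          # 1..5 becomes -2..+2
        if idx in negatively_keyed:
            coded = -coded          # reverse-code negatively keyed items
        total += coded
    return total

# Eight positively keyed items rated 5 and four negatively keyed items rated 1
# give the maximum possible score.
print(desirability_score([5] * 8 + [1] * 4, negatively_keyed={8, 9, 10, 11}))  # 24
```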
Statistical Analyses
As described previously, we standardized scores from both response formats and assessment conditions to express them in standard deviation units of Likert honest scores. Furthermore, we transformed perceived trait desirability scores for each trait into percent of maximum possible scores (POMP; P. Cohen, Cohen, Aiken, & West, 1999) with the following formula:
\[ \mathrm{POMP} = \frac{\text{observed} - \text{minimum}}{\text{maximum} - \text{minimum}} \times 10 \]
where observed = the observed score for an individual, minimum = the minimum possible score on the scale, and maximum = the maximum possible score on the scale. The formula rescales the scores to a 0-to-10 scale. This transformation does not influence the significance tests and was made simply to ease interpretation of the effects of perceived trait desirability. We mean-centered faking tendency and desirability variables in regression models to reduce nonessential multicollinearity. All regression models were estimated by maximum likelihood using Mplus 7.4 (Muthén & Muthén, 2015). We describe in the Appendix how we specified the regression equations to be estimated for both formats simultaneously with our within-subjects design.
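As an illustration of the rescaling, the sketch below applies the POMP transformation on a 0-to-10 scale. It assumes that the summed desirability score for a trait can range from −24 to +24 (12 items coded −2 to +2); the function name `pomp` and its `scale` argument are hypothetical.

```python
def pomp(observed, minimum, maximum, scale=10):
    """Rescale a score to a 0-to-`scale` range (percent of maximum possible)."""
    if maximum <= minimum:
        raise ValueError("maximum must exceed minimum")
    return (observed - minimum) / (maximum - minimum) * scale

# A neutral summed score of 0 lands at the midpoint of the rescaled range.
print(pomp(0, -24, 24))    # 5.0
print(pomp(24, -24, 24))   # 10.0
```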
Results
Latent Trait Scores Estimation
We fitted the multidimensional version of the normal ogive graded response model (Samejima, 1997) to the Likert item responses from the honest condition for all five traits simultaneously in Mplus 7.4 (Muthén & Muthén, 2015) using unweighted least squares (ULS) estimation from polychoric correlations. The latent factors were allowed to correlate freely. Latent trait Likert scores were estimated with the maximum a posteriori (MAP) method. The Likert model fit was relatively poor, with χ² = 2,562.86, df = 1,700 (p < .001), root mean square error of approximation (RMSEA) = .053. We fitted forced-choice responses from the honest condition to the five-dimensional Thurstonian IRT model for forced-choice data, also in Mplus, following the procedure by Brown and Maydeu-Olivares (2012). Model parameters were again estimated with ULS, from thresholds and tetrachoric correlations obtained after dummy coding the responses as explained earlier. The latent factors were allowed to correlate freely. One item uniqueness per block was fixed for identification purposes. Latent trait forced-choice scores were estimated using the MAP method. The forced-choice model fit was better than that of the Likert model, with χ² = 1,851.02, df = 1,640 (p < .001), RMSEA = .027. The model parameters of the Likert and forced-choice models in the honest condition were used for obtaining Likert and forced-choice latent trait scores in the applicant condition. We assumed measurement invariance across experimental conditions to ensure comparability of scores. This assumption could not be tested empirically in our design because the number of estimated parameters (with 120 items per format) exceeded the number of observations in our sample. Reliability estimates of the MAP scores were computed as described in Brown and Croudace (2015). In our sample, the estimates ranged from .81 to .89 for Likert scores and .71 to .84 for forced-choice scores.
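The reported RMSEA values can be checked against the chi-square statistics with the conventional single-group formula. This is a sketch under two assumptions: that N = 180 (the sample size given in the table notes) and that the N − 1 variant of the formula applies; the software may use a slightly different variant.

```python
import math

def rmsea(chi_square, df, n):
    """Conventional point estimate of RMSEA from a chi-square test of model fit."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

print(round(rmsea(2562.86, 1700, 180), 3))  # 0.053 (Likert model)
print(round(rmsea(1851.02, 1640, 180), 3))  # 0.027 (forced-choice model)
```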
Descriptive Statistics and Correlations
Table 2 contains means, standard deviations, and correlations of the raw scores for all study variables. Cohen’s d estimates for the standardized mean difference between applicant and honest scores for all constructs are also provided. To facilitate ease of presentation and comparisons across traits, Neuroticism trait scores were reverse-coded to represent Emotional Stability. The observed effect sizes aligned with those of previous within-subject instructed faking studies (see Viswesvaran & Ones, 1999). All applicant trait scores increased more when the Likert format was used; median effect sizes were .87 and .62 for the Likert and forced-choice formats, respectively. Agreeableness forced-choice scores behaved unexpectedly in our sample: the effect size was negative (–.14). In the honest condition, Agreeableness indicators were most frequently ranked first (or last, if negatively keyed), and the frequency with which Agreeableness was ranked first (or last) in the applicant condition decreased, likely due to the interplay of the perceived trait desirabilities judged against the target job.
Table 2. Descriptive Statistics and Correlations.
Note: N = 180. Cohen’s d was computed as d = (M applicant – M honest) / SD honest. Statistically significant correlations at α = .05 are boldfaced.
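The effect size in the table note can be computed directly from paired honest and applicant scores. The sketch below uses made-up scores and assumes the sample (n − 1) standard deviation of the honest scores; the function name `cohens_d` is illustrative.

```python
import statistics

def cohens_d(honest, applicant):
    """Mean difference (applicant minus honest) in honest-condition SD units."""
    return (statistics.mean(applicant) - statistics.mean(honest)) / statistics.stdev(honest)

honest = [1, 2, 3, 4, 5]        # hypothetical honest scores
applicant = [2, 3, 4, 5, 6]     # hypothetical applicant scores, each one point higher
print(round(cohens_d(honest, applicant), 3))  # 0.632
```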
Same-trait correlations between conditions (honest and applicant) were significant and substantial in both response formats (see Table 2), and comparable across formats. The median correlation between honest and applicant Likert trait scores was .48 and between honest and applicant forced-choice trait scores was .49. Table 2 shows that correlations between faking tendency and applicant trait scores were generally higher for Likert scores as compared to forced-choice: The median correlation between faking tendency and applicant trait scores was .43 and .23 for the Likert and forced-choice formats, respectively. For Agreeableness forced-choice scores, the relationship between applicant trait scores and faking tendency was statistically significant and negative (r = –.24). As discussed previously, we may conjecture that the negative correlation between Agreeableness forced-choice applicant scores and faking possibly occurred due to the decrease of the high frequency of preference for Agreeableness trait indicators under the applicant manipulation. Finally, in Table 2 we note that the relationship between applicant trait scores and perceived trait desirability is lower for the forced-choice format across all constructs: The median correlation between applicant trait scores and perceived trait desirability was .45 and .33 for the Likert and forced-choice formats, respectively.
Evaluating the Moderation Model (Hypothesis 1)
The data transformation described previously enabled us to simultaneously evaluate the intercept-only model defined by Equation 1 as well as the main effect model defined by Equation 2 in a single model (termed Model 1 in Table 3).
Table 3. Results of Linear Regression Analyses.
Baseline analysis: Intercept-only model
To obtain a baseline model for the subsequent analyses, we fit the model described in Equation 2 to the data after suitably transforming scores. Given the data transformation, the estimated intercepts for the Likert response format in Model 1 represent the standardized mean differences between honest and applicant responses. The intercepts for the forced-choice response format in Model 1 represent the mean score inflation of this difference, expressed in terms of Likert honest SDs. In both cases, they are equal to the Cohen’s d statistics reported in Table 2. All the intercepts in Model 1 (i.e., Cohen’s ds) were statistically significant. Differences between intercepts across response formats ranged from .23 (Openness) to .55 (Agreeableness) and were significant for all traits.
Main-effect model: Honest score as predictor of applicant trait score
Continuing the review of Model 1 in Table 3, we evaluated honest trait scores as the sole predictor of corresponding applicant trait scores. Regression slopes in the model comparing honest and applicant conditions were all significant (median slopes were .54 and .46 for Likert and forced-choice, respectively). There were no significant differences in slopes between formats, implying that the main effects of the honest condition on applicant scores were comparable across Likert and forced-choice.
Modeling the effects of faking tendency: Faking as a covariate
To assess the model represented in Equation 3, we regressed applicant trait scores on honest trait scores and faking tendency. Results of this model are presented as Model 2 in Table 3. We can readily observe that the ΔR² was overall substantial, suggesting that Model 2 represents a marked improvement over Model 1 for all measured traits and response formats (median ΔR² = .25 for Likert, median ΔR² = .10 for forced-choice models).
Modeling the effects of faking tendency: Faking as a moderator
To evaluate the moderation model defined by Equation 4, we regressed applicant trait scores on honest scores, faking tendency, and the interaction term between the two variables. Results of this model are presented as Model 3 in Table 3. We can observe that the interaction term was significant for all measured traits and across response formats, providing support for Hypothesis 1. The increase in R² over that of Model 2 could be observed for all measured traits and across formats (median ΔR² = .04 for the Likert format, median ΔR² = .07 for the forced-choice format).
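The step from the covariate model to the moderation model can be illustrated with a small ordinary least squares sketch. Note the assumptions: the study estimated its models by maximum likelihood in Mplus for both formats simultaneously, whereas this sketch swaps in plain OLS for a single format; the function names and the data are hypothetical.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (fine for a few predictors)."""
    k = len(rows[0])
    # Build the augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def r_squared(rows, y, coefs):
    """Proportion of variance in y explained by the fitted linear model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * x for b, x in zip(coefs, r))) ** 2 for r, yi in zip(rows, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def moderation_fit(honest, faking, applicant):
    """Return moderation-model coefficients and the R-squared gain over the covariate model."""
    mean_f = sum(faking) / len(faking)
    centered = [f - mean_f for f in faking]              # mean-center the moderator
    covariate = [[1.0, h, f] for h, f in zip(honest, centered)]
    moderated = [[1.0, h, f, h * f] for h, f in zip(honest, centered)]
    r2_cov = r_squared(covariate, applicant, ols(covariate, applicant))
    coefs = ols(moderated, applicant)                    # intercept, honest, faking, product
    return coefs, r_squared(moderated, applicant, coefs) - r2_cov

# Hypothetical, noise-free data generated with a negative Honest x Faking interaction.
honest = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5, -1.0, 0.8]
faking = [0, 2, 4, 6, 8, 10, 9, 1]
applicant = [0.6 + 0.5 * h + 0.1 * (f - 5) - 0.05 * h * (f - 5) for h, f in zip(honest, faking)]
coefs, delta_r2 = moderation_fit(honest, faking, applicant)
print([round(b, 3) for b in coefs], round(delta_r2, 3))
```

A negative product coefficient means that the honest-score slope flattens as faking tendency rises, which is the pattern the simple slopes analysis probes.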
Probing the Simple Slope Model (Hypothesis 2)
Simple slopes comparison across response format: Hypothesis 2a
Using the simple slopes model in Equation 5, we graphically displayed the interaction effect of Honest Scores × Faking Tendency for both response formats (see Figures 1 and 2 for Likert and forced choice, respectively). These figures overlay simple slopes of conditional effects on top of bivariate scatterplots for each of the five traits at four faking levels: no faking (F = 0), low faking (F = 2), high faking (F = 8), and maximal faking (F = 10). We include the estimated regression line corresponding to the main effect in Model 1 as reference. Heteroscedasticity of the data is readily apparent in both figures, demonstrating the need for the moderation model in the data. Specifically, as honest trait scores decreased, variability of applicant trait scores increased.

Figure 1. Bivariate scatterplot of Likert applicant versus Likert honest trait scores with main effect (in black) and conditional effects at honest, low, high, and maximal faking (from bottom to top, in gray). Dotted line represents the reference line of equal honest and applicant trait scores.

Figure 2. Bivariate scatterplot of forced-choice applicant versus forced-choice honest trait scores with main effect (in black) and conditional effects at honest, low, high, and maximal faking (from bottom to top, in gray). Dotted line represents the reference line of equal honest and applicant trait scores.
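The conditional regression lines drawn at each faking level follow from the moderation coefficients: at a given faking level, the simple intercept and simple slope are the intercept and honest-score slope adjusted by the centered faking value. A minimal sketch follows; the coefficient values and the faking mean are hypothetical and are not estimates from the study.

```python
def simple_effects(coefs, faking_level, faking_mean):
    """Simple intercept and simple slope of honest scores at a given faking level."""
    b0, b1, b2, b3 = coefs                 # intercept, honest, faking, honest x faking
    f_c = faking_level - faking_mean       # faking was mean-centered in the model
    return b0 + b2 * f_c, b1 + b3 * f_c

coefs = (0.8, 0.5, 0.15, -0.06)            # hypothetical coefficients
for level in (0, 2, 8, 10):                # the four faking levels probed in the text
    intercept, slope = simple_effects(coefs, level, faking_mean=4.0)
    print(f"F = {level:>2}: intercept = {intercept:.2f}, slope = {slope:.2f}")
```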
Figure 3 displays conditional effect plots of the simple slopes at low (F = 2) and high (F = 8) faking, with the Likert and forced-choice formats shown together for each trait. The figure illustrates that at low faking (F = 2), the relationship between honest and applicant scores was substantial and significant for all five traits across response formats. Simple slopes did not significantly differ between Likert and forced-choice measures, except for Stability (diffFC-L = –.21, p = .008). At high faking (F = 8), the relationship between honest and applicant scores was statistically significant for all traits in the Likert format except for Conscientiousness (bL = –.03, p = .800). Associated effect sizes were low, however. For the forced-choice format, the relationship between honest and applicant scores was only significant for Stability (bFC = .17, p = .025). To formally assess Hypothesis 2a, we tested the difference between simple slopes for each trait across the two response formats at high faking (F = 8). None of the differences between simple slopes was statistically significant at the α = .05 significance level; the highest difference was obtained for Openness (diffFC-L = –.27, p = .089). We also evaluated the difference between simple slopes across response formats at the maximal level of faking (F = 10); simple slopes were negligible for all traits and response formats. None of the differences between simple slopes reached the α = .05 significance level. Accordingly, results did not provide clear support for Hypothesis 2a.

Figure 3. Interaction effects between faking tendency (F) and honest trait scores on applicant trait scores at low (F = 2) and high (F = 8) faking.
Simple intercepts comparison across response format: Hypothesis 2b
Using the simple slopes model in Equation 5, we evaluated differences in the simple intercepts of the moderation model to assess potential mean differences across response formats at specific levels of faking. As Figure 3 illustrates, intercepts at high faking (F = 8) appear smaller for the forced-choice models than for the Likert models. At low faking (F = 2), differences between simple intercepts were low (median absolute difference of .20 in Likert honest SDs) but statistically significant for Stability (diffFC-L = –.23, p = .003), Extraversion (diffFC-L = –.20, p = .006), and Agreeableness (diffFC-L = –.32, p = .001). Differences between simple intercepts at high faking (F = 8) were notable (median absolute difference of .85) and significant for all traits: Stability (diffFC-L = –.85, p < .001), Extraversion (diffFC-L = –.94, p < .001), Openness (diffFC-L = –.63, p < .001), Agreeableness (diffFC-L = –1.25, p < .001), and Conscientiousness (diffFC-L = –.74, p < .001). At maximal faking (F = 10), differences between simple intercepts were even larger, ranging from –.82 (Openness) to –1.57 (Agreeableness), and with a median absolute difference of 1 Likert honest SD. These results strongly supported Hypothesis 2b, with large effect sizes (J. Cohen, 1992) observed for differences in simple intercepts across response formats at high faking. In practical terms, respondents with high faking tendency on average scored roughly 1 Likert honest SD lower on the forced-choice format than on the Likert format, demonstrating the utility of forced-choice in helping to mitigate faking effects in self-reported data.
Evaluating the Accuracy of Selection Decisions (Hypothesis 3)
Given the partial support provided for Hypothesis 2, by way of results provided in the simple intercepts comparison, we empirically investigated potential differences in the accuracy of selection decisions made across response formats.
Forced choice and selection accuracy: Hypothesis 3a
To assess Hypothesis 3a, we tested the differences in the proportion of accurate selection choices between the formats for each trait at 10%, 20%, and 50% selection ratios. The selection ratio analysis in Table 4 shows improved selection decision accuracy of the forced-choice format relative to Likert, especially at the lowest (10%) selection ratio. None of the differences in the percentage of accurate selection decisions between formats was statistically significant, however. In addition, the quality of hiring decisions appeared to deteriorate as the selection ratio decreased. Overall, these findings did not provide support for Hypothesis 3a.
Table 4. Selection Decisions Based on Cutoff Scores and on Selection Ratios.
Note: N = 180. S = Emotional Stability; E = Extraversion; O = Openness; A = Agreeableness; C = Conscientiousness; Ph = proportion accepted in honest condition; Pa = proportion accepted in applicant condition; Difference = Pa – Ph; PoL = proportion selected in honest also selected in applicant condition for Likert; PoFC = proportion selected in honest also selected in applicant condition for forced-choice; Ps = proportion of accurate selection choices. Statistically significant differences at α = .05 are boldfaced.
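One way to operationalize selection accuracy under a selection ratio is sketched below. It assumes that an accurate choice means a respondent in the top fraction on applicant scores who is also in the top fraction on honest scores; the exact definition of Ps used in Table 4 may differ, and the function names and data are hypothetical.

```python
def top_fraction(scores, ratio):
    """Indices of the respondents in the top `ratio` share of scores."""
    k = max(1, round(len(scores) * ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def selection_accuracy(honest, applicant, ratio):
    """Share of respondents selected on applicant scores who are also selected on honest scores."""
    chosen_honest = top_fraction(honest, ratio)
    chosen_applicant = top_fraction(applicant, ratio)
    return len(chosen_honest & chosen_applicant) / len(chosen_applicant)

honest = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # hypothetical honest scores
applicant = [1, 2, 3, 4, 5, 6, 7, 10, 8, 9]    # hypothetical applicant scores
print(selection_accuracy(honest, applicant, ratio=0.2))  # 0.5
```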
Forced choice and selection discernment: Hypothesis 3b
To assess Hypothesis 3b, we computed proportions of respondents that would be selected based on their honest and applicant trait scores in each format given cutoff scores of 0.5 SD, 1 SD, and 1.5 SD above the honest mean score. Table 4 provides the proportion increase of selected respondents based on their applicant trait scores from both response formats. Results indicate that this increase in proportion was markedly lower when applicant forced-choice scores were used. Moreover, as the cutoff score increased, the percentage increase of selected respondents based on their forced-choice applicant scores decreased; this pattern was less pronounced for the Likert format. The median proportion increase at the 0.5 SD cutoff score was .29 for both response formats; .31 and .15 for the Likert and forced-choice formats at the 1 SD cutoff score, respectively; and .18 and .02 for the Likert and forced-choice formats at the 1.5 SD cutoff score, respectively. We tested the significance of the proportion increase using McNemar’s test for correlated proportions (McNemar, 1947). The increase in proportion was significant for all traits and cutoffs when Likert scores were used. Using forced-choice scores, the increase in proportion was not significant for Stability, Extraversion, and Agreeableness at the 1.5 SD cutoff; Extraversion and Agreeableness at the 1 SD cutoff; and Agreeableness at the 0.5 SD cutoff. Overall, these results provided substantial support for Hypothesis 3b.
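The cutoff analysis can be sketched as a comparison of two correlated proportions. The example below uses the chi-square form of McNemar's test without continuity correction, which is an assumption, since the text does not say which variant was used. Scores are taken to be in honest SD units with an honest mean of zero, so the cutoff can be passed directly; the function name and data are hypothetical.

```python
import math

def mcnemar_cutoff(honest, applicant, cutoff):
    """Proportions selected in each condition and McNemar's test of their difference."""
    sel_h = [x >= cutoff for x in honest]
    sel_a = [x >= cutoff for x in applicant]
    n = len(honest)
    gained = sum(a and not h for h, a in zip(sel_h, sel_a))   # selected only as applicant
    lost = sum(h and not a for h, a in zip(sel_h, sel_a))     # selected only when honest
    p_h, p_a = sum(sel_h) / n, sum(sel_a) / n
    if gained + lost == 0:
        return p_h, p_a, 0.0, 1.0
    stat = (gained - lost) ** 2 / (gained + lost)
    p_value = math.erfc(math.sqrt(stat / 2))                  # chi-square tail area, 1 df
    return p_h, p_a, stat, p_value

honest = [-1.0, -0.5, 0.0, 0.2, 0.4, 0.6, 1.2, 1.6]           # hypothetical scores
applicant = [-0.2, 0.6, 0.7, 1.1, 0.3, 0.9, 1.5, 1.4]
p_h, p_a, stat, p_value = mcnemar_cutoff(honest, applicant, cutoff=0.5)
print(p_h, p_a, stat, round(p_value, 3))  # 0.375 0.75 3.0 0.083
```

Only the discordant pairs (respondents selected in one condition but not the other) enter the test statistic, which is why the test suits within-subject comparisons of this kind.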
To test to what extent selection based on applicant scores would preserve respondents that would be selected based on their honest scores, we computed the proportions of respondents that would be selected based on their honest scores for each cutoff value and trait and response format (PoL and PoFC in Table 4). As we can observe in Table 4, even though the proportions of selected respondents based on their forced-choice scores were similar between conditions, respondents selected differed substantially. The largest discrepancies were observed for the highest cutoff. Selections based on Likert applicant scores were better at preserving respondents that would be selected based on their honest scores and were significantly better than forced-choice for Stability at the 1.5 SD cutoff; Stability, Openness, and Agreeableness at 1 SD cutoff; and Extraversion and Agreeableness at 0.5 SD cutoff.
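The preservation proportions (PoL and PoFC) follow the definition in the note to Table 4: among respondents selected on honest scores, the share still selected on applicant scores. A minimal sketch with hypothetical data and a hypothetical function name:

```python
def preserved_proportion(honest, applicant, cutoff):
    """Share of respondents selected on honest scores who are also selected on applicant scores."""
    selected_honest = [i for i, x in enumerate(honest) if x >= cutoff]
    if not selected_honest:
        raise ValueError("no respondent meets the cutoff on honest scores")
    kept = sum(applicant[i] >= cutoff for i in selected_honest)
    return kept / len(selected_honest)

honest = [0.2, 0.6, 1.2, 1.6]       # hypothetical scores in honest SD units
applicant = [0.9, 0.4, 1.5, 1.4]
print(round(preserved_proportion(honest, applicant, cutoff=0.5), 3))  # 0.667
```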
Assessing if Faking Tendency Moderates Perceived Trait Desirability (Hypothesis 4)
To test Hypothesis 4, we estimated the model underlying Equation 6 and evaluated the value added by incorporating the interaction of Faking Tendency × Perceived Trait Desirability relative to the basic moderation model defined by Equation 5. Results for the interaction model incorporating perceived trait desirability are presented as Model 4 in Table 3. The increase in R² for the model over that of Model 3 was overall substantial, suggesting that Model 4 represents a marked improvement over Model 3 for all measured traits and across response formats (median ΔR² = .09 for the Likert format, median ΔR² = .08 for the forced-choice format). We also see in Table 3 that the main effect of desirability was significant for all traits in both response formats. The Desirability × Faking interaction was statistically significant for all traits in the Likert format but not for Stability (DFC = –.01, p = .592) and Conscientiousness (DFC = .02, p = .163) in the forced-choice format. These findings fully support Hypothesis 4 for the Likert response format but only partially support Hypothesis 4 for the forced-choice format.
Figure 4 displays conditional effect plots of perceived trait desirability on applicant trait scores. The plots hold honest trait scores at their means, consider no faking (F = 0) and maximal faking (F = 10) for each trait, and assess results for both Likert and forced-choice formats. It should be noted that the observed perceived desirability scores in these plots do not cover the full potential scale range (point of neutral desirability was at POMP = 5) as all the traits were perceived as desirable for the target job by the vast majority of respondents.

Figure 4. Interaction effects between faking tendency (F) and perceived trait desirability on applicant trait scores at no faking (F = 0) and at maximal (F = 10) faking.
Figure 4 illustrates that the simple slopes for Model 4 at no faking (F = 0) were negligible, except for Stability in both formats (bL = .24, p = .003; bFC = .26, p < .001) and Conscientiousness (bL = .17, p = .021) in the Likert format. None of the differences between simple slopes across formats at no faking was statistically significant at the α = .05 significance level. At maximal faking (F = 10), the simple slopes were substantial and significant for all traits. The only differences between simple slopes across the two formats at maximal faking were for Stability (diffFC-L = –.44, p = .014) and Conscientiousness (diffFC-L = –.36, p = .029; see Table 3).
The simple slopes for the Likert models were consistent across traits, ranging from .63 to .68 (with a median simple slope of .64). Simple slopes for the forced-choice models ranged from .19 to .47 (with a median simple slope of .39). Based on the median simple slope, we can infer that for a respondent with an average honest trait standing, a 10% increase in perceived trait desirability leads to a .64 SD increase in the Likert applicant trait score and to a .39 SD increase in the forced-choice applicant trait score (expressed in Likert honest SDs).
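The interpretation above rests on the conditional effect of desirability at a fixed faking level. The sketch below writes this out under an assumed form of Model 4, in which desirability D and a D × F product are added to the moderation model with F mean-centered. The symbols b_0 through b_5 and this exact specification are assumptions, since Equation 6 is not reproduced in this section.

```latex
\begin{aligned}
A &= b_0 + b_1 H + b_2 F + b_3 HF + b_4 D + b_5 DF + e, \\
\frac{\partial A}{\partial D} &= b_4 + b_5 F .
\end{aligned}
```

Because D is expressed on the 0-to-10 POMP scale, one unit of D equals 10% of the possible scale range. A simple slope of .64 at maximal faking therefore corresponds to a .64 SD change in the applicant score A per 10% increase in perceived desirability.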
Discussion
The present study examined if and to what extent a moderation framework to estimate faking effects was appropriate for a Big Five personality measure and considered whether forced-choice scores could mitigate effects of applicant faking relative to a Likert format counterpart. As previously indicated, the application of the moderation framework presented is not limited to personality assessment but rather extends to self-reported construct measures of varied kinds.
Effects of Faking Estimation
We showed that the main assumption underlying the conventional standardized mean difference model to estimate applicant faking (i.e., that faking effects are uniform across respondents) was violated, rendering the model misspecified and demonstrating bias in our sample. Previous research (e.g., McFarland & Ryan, 2000; Ziegler et al., 2015) suggests that such an assumption can be violated either because situational demands interact with measured personality traits to elicit faking (e.g., less conscientious respondents may be more inclined to fake) or because of ceiling effects, such that the scale maximum prevents respondents with high trait standings from faking as much as respondents with low trait standings.
We also showed that the conventional correlation model to estimate applicant faking, based on the correspondence between honest and applicant trait scores, fails to take into account the notion that faking effects may interact with respondents’ honest trait standings. The main assumption underlying this correspondence model (i.e., that respondents will be similarly induced to fake by the assessment context) was also violated in our sample. As we have shown, the variation in the faking tendency in our sample created considerable heteroscedasticity in the data, which was not adequately handled by the correspondence model. Building on theoretical approaches arguing for the multiplicative effect of applicants’ true trait standings and their tendency to fake (e.g., McFarland & Ryan, 2000, 2006; Tett & Simonet, 2011), we argued that the relationship between honest and applicant trait scores was moderated by the faking tendency and demonstrated evidence for this position. Furthermore, we argued that the proposed moderation model could be meaningfully extended by incorporating respondents’ perceived desirability of the measured traits for a target job and demonstrated that perceived trait desirability contributed incrementally to the prediction of applicant trait scores.
Overall, the main contribution of the present research to the estimation of faking effects is the demonstration that faking research must take into account the interplay of at least three critical factors in model estimation (i.e., respondents’ honest trait scores, faking tendency, and perceived desirability of the measured attributes) and needs to consider that the interplay of these factors is likely to be multiplicative. Omission of these considerations may yield research of limited value. Additionally, the current study contributes to the faking literature by offering a regression-based moderation framework that subsumes the two approaches commonly utilized for the estimation of faking effects (i.e., the standardized mean difference and correlational/correspondence models) as special cases. We contend that the correlational/correspondence model for estimating faking effects is only appropriate in the absence of variation in faking tendency. We further assert that the standardized mean difference model is appropriate only when the correspondence between honest and applicant trait scores approaches a perfect relationship. This assertion, and the viability of both the standardized mean difference and correlation models more generally, can be assessed using the model building strategy illustrated in Table 3. We provide syntax and step-by-step instructions describing the model building strategy for testing faking effects in the supplemental materials to this manuscript (available in the online version of the journal).
Effects of Faking on Forced-Choice and Likert Scores
At high faking values, both Likert and forced-choice applicant trait scores accounted for only a trivial portion of the honest trait variance. This finding held across response formats and all measured traits. Interestingly, the majority of instructed faking studies to date, either assessing Likert (e.g., Caldwell-Andrews, Baer, & Berry, 2000; Ellingson, Sackett, & Hough, 1999; McFarland & Ryan, 2006; Mueller-Hanson et al., 2003), forced-choice (e.g., Gordon & Stapleton, 1956; Longstaff & Jurgensen, 1953), or both response formats (e.g., Christiansen et al., 2005; Heggestad et al., 2006), have documented notable associations between honest and applicant trait scores. Findings from our study suggest that the effects of faking reported in previous work may be underestimated due to not accounting for variability in faking tendency. Further, the common notion that obtained applicant scores in instructed faking research represent a maximum degree (or an “upper bound”) to which personality measures can be faked is also called into question (e.g., Dilchert et al., 2006; Ellingson, Smith, & Sackett, 2001; Ones et al., 2007; Ones, Reiss, & Viswesvaran, 1996; Smith & Ellingson, 2002).
Contrary to our expectation, the estimated simple slopes at high levels of faking were negligible for all measured traits and comparable between the two formats. Our hypothesis that forced-choice applicant scores would yield better selection decision quality at low selection ratios was also not supported. These findings suggest that the forced-choice format is not impervious to faking and performs similarly to the Likert format in preserving respondents’ normative trait standings. On the other hand, our hypothesis regarding lower mean inflation of forced-choice scores was supported. Results suggest that the source of the lower mean inflation may lie in the difficulty presented by the forced-choice format in obtaining high trait scores on the test. Findings related to selection decision analysis based on cutoff scores suggest that the latter may not be the only reason for lower mean inflation of forced-choice scores, however. Specifically, comparable proportions of selected respondents based on forced-choice scores were associated with relatively poor selection overlap between assessment conditions, especially at the highest cutoffs. Taken together, these results suggest that lower mean inflation does not necessarily represent sufficient evidence to form valid conclusions regarding the effectiveness of a faking mitigation strategy.
We might argue for at least two plausible explanations for our failure to find support for the benefit of the forced-choice format to reduce the effects of applicant faking. First, the forced-choice trait scores in our study had lower reliabilities than those of the corresponding Likert trait scores, given that the forced-choice scales provided less information about the measured traits than polytomous rating scales consisting of the same statements (Brown & Maydeu-Olivares, 2011b, 2013, 2017). Consequently, parameter estimates in the regression models for the forced-choice format may have been more severely attenuated than those for the Likert format. In addition, the lower forced-choice reliabilities could explain poorer overlap in selection decisions, especially at high cutoffs. The second explanation requires considering the nature of faking when responding to forced-choice items. Unlike faking in the Likert format, where each indicator/trait can be manipulated individually, we suspect that manipulating forced-choice responses on one trait inevitably entails manipulating responses on other traits in the measure, as indicators of one trait are paired with indicators of all other traits being measured. Therefore, the format not only presented a difficulty for fakers to obtain high trait scores but possibly also led respondents with certain personality configurations (e.g., high Stability, low Conscientiousness) into deteriorating their otherwise high honest standings on some traits. In other words, the format possibly penalized some respondents that engaged in response manipulation. Overall, based on our data, we cannot distinguish between these two explanations, which are not mutually exclusive. Future research on this critical point seems necessary.
Effects of Perceived Trait Desirability
In the present study, we demonstrated that perceived trait desirability incrementally contributes to the prediction of applicant trait scores in both Likert and forced-choice response formats. Moreover, the role of perceived trait desirability in manipulating responses appeared particularly pronounced under high faking tendency. This multiplicative effect of the perceived trait desirability was supported fully for the Likert response format. As the tendency to fake increases, observed responses become increasingly based on perceived trait desirability (e.g., Shoss & Strube, 2011; Vasilopoulos, Reilly, & Leaman, 2000). We observed partial support for the multiplicative effects of the perceived desirability in the forced-choice format. This finding is challenging to explain given that research on cognitive processes behind faking the forced-choice format has been limited at best. We may, however, reasonably assume that desirability-driven responding in the forced-choice format was constrained by the perceived desirabilities of all traits in our measure, and we suspect that this interplay of perceived trait desirability was responsible for the absence of stronger and more consistent interaction effects in our study. Although the main focus of the present work was not on the perceived trait desirability of the measured traits, our findings clearly demonstrate the importance of perceived trait desirability in faking across both response formats and call for further exploration of the role of perceived trait desirability in the cognitive process underlying the manipulation of forced-choice test responses.
Limitations and Directions for Future Research
There are several limitations of the present study and several future research directions that warrant attention. To begin, we acknowledge the frequently voiced concerns regarding the lack of realism and relevance of instructed faking studies relying on student respondents (e.g., Ellingson et al., 2001; Ellingson & McFarland, 2011; Griffith et al., 2007). In applied settings with more experienced applicants and where the stakes are high, applicant faking might be subtler and more difficult to detect. The reliance on a stronger, instructed faking setup in the current study was critical in facilitating our ability to investigate the novel faking estimation paradigm and provide preliminary empirical findings. In addition, it enabled us to more meaningfully relate our findings to the existing research that has traditionally relied on experimental designs. Nevertheless, there is a clear need for future research to validate our findings utilizing additional samples and job targets in real applicant settings. It is important to note that although it is possible to retrieve both honest and applicant scores from actual applicants or incumbents (e.g., Donovan et al., 2014; Griffith et al., 2007), within-subject designs are commonly more demanding and more costly to execute in the field. In addition, issues of range restriction and honesty of actual applicants/incumbents in applied settings would be of greater concern than in the laboratory. Related to the latter, a particular difficulty in the field would be measuring faking effectively, as self-report measures in such settings clearly do not represent a feasible measurement strategy. Taking into account serious concerns regarding the validity of social desirability scales as measures of faking (see, e.g., Burns & Christiansen, 2011), a way to circumvent this obstacle could be to use alternative proxies, such as the overclaiming technique or bogus statements (e.g., Anderson et al., 1984; Paulhus et al., 2003).
Another important limitation of the present study is related to the measurement of our variables. First, setting Likert and forced-choice scores on the same metric, as we have done in our analysis, assumes that measurement invariance holds across formats. Given the number of items involved, measurement invariance could not be assessed because doing so would involve fitting a model with more parameters than respondents in our sample. With this in mind, and taking into account the differences in reliabilities of scores across response formats, future studies are advised to more firmly establish psychometric equivalence of the applied measures to optimize investigation of the forced-choice format as a faking mitigation strategy. Second, because our faking tendency measure was operationalized with a single item, its psychometric properties could not be assessed. Although we believe that this shortcoming did not substantially affect the validity of our conclusions, future studies are advised to utilize multi-item faking measures and/or incorporate multiple faking measures into research designs.
A final limitation of this study is that we did not use a forced-choice instrument that controls for desirability (i.e., with items within blocks matched on a desirability index). Previous research findings on faking in the forced-choice response format, however, lead us to believe that such control would not deter respondents from attempting to manipulate responses. In addition, previous research evaluating desirability matching methods has not only demonstrated the limited effectiveness of such methods (e.g., Corah et al., 1958; Goldman, 1964; Krug, 1958; Messick, 1967; Wiggins, 1966) but has also shown that the forced-choice response format facilitates even finer desirability discriminations between the statements (Feldman & Corah, 1960) and may possibly even enhance faking. Although we suspect that the benefits of the format could have been revealed on a desirability-equated forced-choice instrument, based on the current state of knowledge, it is not clear what mechanism would be responsible for such added benefit. A natural and critical next step for future research would be to assess whether desirability-equated forced-choice instruments, in comparison to Likert instruments, yield a higher number of applicants who choose to respond honestly. To our knowledge, no empirical study to date has directly assessed the preventive feature of the forced-choice format.
The central purpose of the present work was to introduce a novel estimation paradigm to quantify faking effects. A compelling line for future research would involve performing a statistical simulation study to determine how the model is affected by different parameters under different scenarios as well as evaluate how selection decisions are affected by different values of parameters in the model given by Equation 6. A variety of factors may be of interest to manipulate in such a study, including (a) correlations among honest scores, faking tendency, and perceived desirability; (b) values of the various intercepts and slopes in the model; (c) residual variance; and (d) effect size estimates of honest scores, faking tendency levels, and perceived desirability scores. The present study provides reference values to establish plausible values for such a simulation.
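A minimal sketch of what one cell of such a simulation might look like is given below. It is not the model in Equation 6: the regression form (honest score, faking tendency, and perceived desirability as predictors, with faking tendency moderating the other two), the coefficient values, the equal predictor correlations, and the function names `simulate`, `fit_ols`, and `selection_overlap` are all assumptions chosen for illustration. The sketch generates applicant scores, recovers the coefficients by ordinary least squares, and computes the overlap between selection decisions based on honest versus applicant scores at a given selection ratio.

```python
import numpy as np

def simulate(n=2000, corr=0.2, coefs=(0.0, 0.6, 0.4, -0.2, 0.2, 0.3),
             resid_sd=0.5, seed=1):
    """Generate applicant scores from an assumed moderated regression model."""
    rng = np.random.default_rng(seed)
    # Factor (a): a common correlation among honest score, faking tendency,
    # and perceived desirability (hypothetical value).
    cov = np.full((3, 3), corr)
    np.fill_diagonal(cov, 1.0)
    honest, faking, desir = rng.multivariate_normal(np.zeros(3), cov, size=n).T
    # Design matrix: intercept, main effects, and the two faking interactions.
    X = np.column_stack([np.ones(n), honest, faking, honest * faking,
                         desir, faking * desir])
    # Factors (b) and (c): intercept/slope values and residual variance.
    applicant = X @ np.asarray(coefs) + rng.normal(0.0, resid_sd, size=n)
    return X, honest, applicant

def fit_ols(X, y):
    """Ordinary least squares estimates of the regression coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def selection_overlap(honest, applicant, selection_ratio=0.2):
    """Share of those selected on honest scores who are also selected on applicant scores."""
    k = int(round(selection_ratio * len(honest)))
    top_honest = set(np.argsort(honest)[-k:])
    top_applicant = set(np.argsort(applicant)[-k:])
    return len(top_honest & top_applicant) / k

X, honest, applicant = simulate()
estimates = fit_ols(X, applicant)
overlap = selection_overlap(honest, applicant)
print(np.round(estimates, 2), round(overlap, 2))
```

A full study would wrap these calls in a loop over a grid of conditions (e.g., varying `corr`, `coefs`, and `resid_sd`) with many replications per cell, summarizing bias in the estimates and the selection overlap for each condition.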
Another purpose of the study was to examine the effectiveness of the forced-choice format versus the mainstream Likert format in mitigating faking effects. Despite its appealing design features, the forced-choice response format has not gained popularity among selection specialists, mainly due to the lack of procedures for estimating applicants’ normative trait standings. Recent developments in IRT modeling of forced-choice data have overcome this problem, so the comparison was timely. Our results suggest that with IRT scoring, forced-choice tests are not inferior to tests composed of Likert items. Further, when selection is performed using cutoff scores, they may outperform Likert-based tests. Though our results were not clear-cut in this regard, this was likely due to imprecision of measurement. Even though our instruments met current standards of reliability (i.e., estimates mostly in the .80s), standard errors (SEs) of individual scores were on average .5. Smaller SEs are needed if serious research is to be performed on selection decisions. This will require the use of longer tests and necessitate measuring fewer attributes to meet time constraints. The superior face validity of forced-choice over the Likert format in high-stakes selection procedures, coupled with recent developments in IRT modeling of forced-choice scores, produces evidence for the utility of forced-choice formats and argues for their increased use. Moreover, the new IRT models put forth for estimating forced-choice scores may help shed light on the cognitive mechanisms individuals use when choosing among items, particularly in high-stakes situations.
In closing, our research has revealed that individual differences in faking tendencies cannot be ignored. These individual differences are likely to be situational to a large extent in practice. However, the presence of faking tendencies in experimental settings such as the one described here leads us to believe that some of the individual differences may instead be driven by trait-like characteristics. If so, it is of paramount interest to understand and predict them.
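As a rough illustration of why smaller SEs imply longer tests, the classical test theory relations below link reliability, the standard error of measurement, and test length. This is a sketch under assumptions not made in the study: scores are taken to have unit variance, a reliability of .80 stands in for estimates "mostly in the .80s," the target SE of .30 is purely hypothetical, and the SEs reported above come from IRT scoring, so the correspondence is only approximate.

```latex
\begin{align*}
\mathrm{SE} &= \sigma_X \sqrt{1 - \rho_{XX'}} = 1 \times \sqrt{1 - .80} \approx .45 \\
\rho^{*} &= 1 - \left(\frac{\mathrm{SE}^{*}}{\sigma_X}\right)^{2} = 1 - .30^{2} = .91 \\
k &= \frac{\rho^{*}\,(1 - \rho_{XX'})}{\rho_{XX'}\,(1 - \rho^{*})} = \frac{.91 \times .20}{.80 \times .09} \approx 2.5
\end{align*}
```

Under these assumptions, the Spearman-Brown relation in the last line suggests that a test roughly two and a half times as long would be needed to bring the SE from about .45 down to .30.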
Supplemental Material
Supplemental Material, Supplementary_materials_Pavlov_Maydeu_Fairchild for Effects of Applicant Faking on Forced-Choice and Likert Scores by Goran Pavlov, Alberto Maydeu-Olivares and Amanda J. Fairchild in Organizational Research Methods
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
