Abstract
This study investigated the efficacy of the lz person fit statistic for detecting aberrant responding with unidimensional pairwise preference (UPP) measures, constructed and scored based on the Zinnes–Griggs item response theory (IRT) model, which has been used for a variety of recent noncognitive testing applications. Because UPP measures are used to collect both “self-” and “other” reports, the capability of lz to detect two of the most common and potentially detrimental response sets, namely fake good and random responding, was explored. The effectiveness of lz was studied using empirical and theoretical critical values for classification, along with test length, test information, the type of statement parameters, and the percentage of items answered aberrantly (20%, 50%, 100%). It was found that lz was ineffective in detecting fake good responding, with power approaching zero in the 100% aberrance conditions. However, lz was highly effective in detecting random responding, with power approaching 1.0 in long-test, high information conditions, and there was no diminution in efficacy when using marginal maximum likelihood estimates of statement parameters in place of the true values. Although using empirical critical values for classification provided slightly higher power and more accurate Type I error rates, theoretical critical values, corresponding to a standard normal distribution, provided nearly as good results.
In the fields of psychology and education, there are long histories of research on noncognitive constructs, such as personality, vocational interests, self-efficacy, and values. Measures administered in research settings for developmental and diagnostic purposes were shown early on to predict important outcomes and those successes raised intriguing possibilities about the use of noncognitive tests in the workplace. One notable line of research involved the behaviorally anchored rating scales (BARS; Smith & Kendall, 1963), which were originally designed to reduce the leniency, severity, central tendency, and halo errors often associated with Likert-type rating scales.
In 2001, Borman et al. proposed a “next-generation” version of BARS, called Computerized Adaptive Rating Scales (CARS; Borman et al., 2001), which integrated research on observer ratings, forced choice assessment, and modern psychometric theory. Specifically, Borman et al. assessed contextual (i.e., citizenship) performance (Borman & Motowidlo, 1993) using computerized adaptive unidimensional pairwise preference (UPP) measures composed of pairs of statements that represented different levels of employee effectiveness. A rater’s task was to choose the statement in each pair that better characterized the behavior of the ratee. By making repeated pairwise preference judgments across items chosen dynamically via computerized adaptive testing (CAT) principles (Stark & Chernyshenko, 2011; Stark & Drasgow, 1998), measurement error was reduced relative to BARS and Likert-type graphical rating scales (Borman et al., 2001).
Since the Borman et al. (2001) study, the suitability of UPP measures has been explored for other organizational applications. For example, Borman and colleagues implemented adaptive UPP measurement in the Navy Computerized Adaptive Personality Scales assessment (NCAPS; Houston, Borman, Farmer, & Bearden, 2005), which was designed to accommodate large volume testing applications. In addition, Chernyshenko, Stark, and Williams (2009) described how to construct nonadaptive UPP measures of person-organization fit using a relatively small pool of statements in the context of a university research study.
Although evidence suggests that UPP scales mitigate the errors that occur when raters evaluate other rating targets (Borman et al., 2001), research was needed to examine their resistance to response biases associated with self-report data, such as socially desirable (fake good) and careless or random responding. A simulation study examining the power and Type I error to detect fake good and random responding using the model-based standardized log likelihood statistic, known as
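The model chosen to represent normal UPP responding was the Zinnes and Griggs (ZG; 1974) ideal point IRT model. The ZG model assumes that when a rater is presented with a pair of statements describing different levels of, for example, effectiveness, conscientiousness, or autonomy, the rater carefully considers the statements and chooses the one in each pair that better describes the ratee. Formally, if s and t represent the first and second statements in a performance appraisal item, the probability of choosing or preferring statement s to statement t is given by
$$P_{st}(\theta) = 1 - \Phi(a_{st}) - \Phi(b_{st}) + 2\,\Phi(a_{st})\,\Phi(b_{st}), \qquad (1)$$
with
$$a_{st} = (2\theta - \mu_s - \mu_t)/\sqrt{3}$$
and
$$b_{st} = \mu_s - \mu_t,$$
where $\theta$ is the latent trait score of the person being assessed, $\mu_s$ and $\mu_t$ are the location parameters of statements s and t, and $\Phi$ is the standard normal cumulative distribution function.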
In contrast to normal responding, aberrant responding, such as faking good and random responding, presumes a different psychological process. With UPP assessments, fake good responding implies a rater chooses the more positive or socially desirable statement in a UPP item, regardless of whether it accurately depicts the ratee. In work settings, fake good responding can occur when job applicants want to increase their scores to get hired, when raters want to give positive impressions of well-liked coworkers in 360° appraisals, and when supervisors want to enhance their own reputations for employee development by manipulating the ratings of subordinates under their tutelage. Alternatively, random responding might occur when busy employees are surveyed too frequently without compensation, when respondents do not understand the context or meaning of questionnaire items, or when supervisors have many subordinates to evaluate and are familiar with only a few. Finally, random responding might also occur when respondents answer an Internet-based self-report survey anonymously (Meade & Craig, 2012).
Adopting the lz Person Fit Index for ZG UPP Model
In general, person fit statistics examine either residuals (i.e., differences between observed and expected response patterns) or the likelihood of response patterns assuming a formal model of item responding (Nering & Meijer, 1998). IRT methods generally use the latter. The likelihood of a response pattern is calculated using item and person parameter estimates for a designated item response model, and aberrant or atypical patterns are signaled by low likelihoods (large negative values in the log metric). The advantage of IRT methods is that, unlike classical test theory methods, they readily permit the assessment of overall model-data fit.
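One of the most widely used and researched IRT-based person fit statistics is the standardized log likelihood statistic $l_z$. For a response pattern on an n-item UPP test, the log likelihood is
$$l_0 = \sum_{i=1}^{n} \left\{ u_i \log P_i(\hat{\theta}) + (1 - u_i) \log\left[1 - P_i(\hat{\theta})\right] \right\},$$
where $u_i = 1$ if statement s is preferred to statement t in item i and $u_i = 0$ otherwise, and $P_i(\hat{\theta})$ is the ZG probability of preferring s to t in item i, evaluated at the estimated trait score $\hat{\theta}$. The approximate expected value of $l_0$ is
$$E(l_0) = \sum_{i=1}^{n} \left\{ P_i(\hat{\theta}) \log P_i(\hat{\theta}) + \left[1 - P_i(\hat{\theta})\right] \log\left[1 - P_i(\hat{\theta})\right] \right\}.$$
The approximate variance is
$$\mathrm{Var}(l_0) = \sum_{i=1}^{n} P_i(\hat{\theta}) \left[1 - P_i(\hat{\theta})\right] \left\{ \log \frac{P_i(\hat{\theta})}{1 - P_i(\hat{\theta})} \right\}^2.$$
Finally, the approximately standardized person fit statistic is
$$l_z = \frac{l_0 - E(l_0)}{\sqrt{\mathrm{Var}(l_0)}}.$$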
The standardization step is important because it eliminates the dependence of the resulting person fit statistic on test length and θ, which was a concern with the
The use of
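As a rough illustration of how the statistic can be computed for pairwise preference items, the sketch below assumes the commonly cited form of the ZG response probability and the standard definition of the standardized log likelihood. The function names, the statement locations, and the trait score are hypothetical and chosen only for the example; in practice the probabilities would be evaluated at an estimated trait score.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal CDF


def zg_prob(theta, mu_s, mu_t):
    """P(prefer statement s over t), assuming the commonly cited ZG form."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return 1.0 - _PHI(a) - _PHI(b) + 2.0 * _PHI(a) * _PHI(b)


def lz(responses, probs):
    """Standardized log likelihood for dichotomous responses (1 = chose s).

    Undefined (division by zero) if every probability equals exactly 0.5.
    """
    l0 = expected = variance = 0.0
    for u, p in zip(responses, probs):
        p = min(max(p, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        q = 1.0 - p
        l0 += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (l0 - expected) / math.sqrt(variance)


# Hypothetical (mu_s, mu_t) statement locations and trait score.
items = [(-1.5, 1.0), (1.2, -1.0), (0.0, 2.0), (2.0, -0.5)]
theta = 0.5
probs = [zg_prob(theta, s, t) for s, t in items]

consistent = [1 if p > 0.5 else 0 for p in probs]  # always the likelier choice
inconsistent = [1 - u for u in consistent]         # always the unlikelier choice
print(lz(consistent, probs), lz(inconsistent, probs))
```

A pattern that always selects the more probable statement yields a positive value, whereas the reversed pattern yields a negative one, which is why only the lower tail is used for flagging aberrance.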
Factors Influencing lz Efficacy
One of the most widely studied issues associated with
To address this limitation, some authors have explored the use of empirical critical values as alternatives to those based on normality assumptions (Stark, Chernyshenko, & Drasgow, 2012; see also Nering, 1997). Essentially, one must simulate large numbers of normal response patterns based on actual exam or scale characteristics, compute
Past studies involving applications of
A second factor affecting the power of person fit statistics is the proportion of items answered aberrantly. Several studies involving dominance IRT models have shown that higher proportions of aberrant responding are associated with higher detection rates (Drasgow et al., 1987; Levine & Rubin, 1979).
Test composition has also been found to influence detection rates. First, given the same type and relative proportions of aberrant responding, detection rates are consistently higher with longer tests (Emons, Sijtsma, & Meijer, 2004; Nering & Meijer, 1998), perhaps because trait scores are more accurately estimated and there are more opportunities to observe inconsistencies with model predictions. Second, higher power and lower Type I error are typically observed with tests having more discriminating items (Emons et al., 2004; Meijer, 1997; Meijer et al., 1994) and more variation in item extremity (Reise, 1995). This makes intuitive sense because higher discrimination leads to higher test information and, thus, more accurate trait estimation, whereas variation in extremity highlights inconsistencies between predicted and observed responses given one’s estimated trait score.
Finally, some recent studies have examined the effects of parameter estimation error on the power and Type I error of person fit indexes. In accordance with the statistical principle of consistency, large samples are always desirable for item/statement parameter estimation. The more these parameter estimates differ from their true values, the more error there will be in the estimated trait scores and, thus, the lower the power to detect aberrance. Fortunately, person fit research with single-statement measures has shown only small detrimental effects for parameter estimation error on power (Hendrawan, Glas, & Meijer, 2005). However, research is needed to see whether this finding generalizes to ZG-based UPP measures calibrated via marginal maximum likelihood (MML) estimation (Stark & Drasgow, 2002).
Method
Study Design
This research investigated the power and Type I error rates for
Test Characteristics
In preparation for this simulation, four tests were created to satisfy the test information and test length considerations mentioned above. First, in accordance with the recommendation by Stark and Drasgow (2002), a 10-item high information UPP test was assembled by pairing statement parameters that differed by about 2.5 units along different parts of the trait continuum. The result was a test information function that had an amplitude of approximately 5 near
Data Generation and lz Analyses
Power and Type I error rates for
1. One thousand trait scores (thetas) were obtained by sampling from a standard normal distribution.
2. “Normal” responses to each item of each test were simulated by computing the probability of preferring statement s to statement t in item i given a simulee’s true trait score (see Equation 1) and comparing the value to a random uniform number.
3. Three sets of statement parameters for each of the four tests were used to investigate the effects of MML estimation error on
4. Three sets of
5. Lower one-tailed empirical critical values for “operational”
6. New response data reflecting varying degrees of aberrance (0%, 20%, 50%, 100%) were generated under an operational testing scenario. For the 0% (no aberrance or normal) conditions, 1,000 new trait scores were sampled from a standard normal distribution and used to generate UPP responses to each of the four tests using the TRUE statement parameters. For the 20% and 50% aberrance conditions, item responses from those same data sets were randomly designated for replacement with fake good or random responses. In the 100% condition, all of the responses were replaced with fake good or random responses. Random responses were generated by sampling a random number from a uniform distribution and comparing it to 0.5. If the result exceeded 0.5, the response was scored as 1; otherwise, the response was scored as 0. Fake good responses were simulated by adding 1.5 to a simulee’s trait score when computing
7. As in Step 4, three sets of
8. Type I error was computed for each of the critical values by calculating the proportion of response patterns in the 0% conditions that were misclassified as aberrant. Power was computed in the 20%, 50%, and 100% conditions by computing the proportion of response patterns that were correctly identified as aberrant.
9. Steps 1 through 8 were repeated until 100 replications were performed.
10. Power and Type I error results were tabulated, and ANOVA was used to test for the statistical significance of main effects and interactions involving up to three variables.
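The core of this procedure can be sketched in a few lines. The example below is a simplified illustration, not the study's code: it assumes the commonly cited ZG probability form, uses a single hypothetical 20-item test with invented statement locations, scores simulees with a crude grid-search maximum likelihood estimate, runs one replication with the true statement parameters only, and regenerates aberrant responses directly rather than replacing responses in an existing data set. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf
GRID = [g / 10.0 for g in range(-35, 36)]  # coarse theta grid for ML scoring


def zg_prob(theta, mu_s, mu_t):
    """P(prefer s over t), assuming the commonly cited ZG form."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    p = 1.0 - PHI(a) - PHI(b) + 2.0 * PHI(a) * PHI(b)
    return min(max(p, 1e-6), 1.0 - 1e-6)  # keep logs finite


def build_table(items):
    """Precompute (log P, log Q) lists for every grid point."""
    table = []
    for th in GRID:
        ps = [zg_prob(th, s, t) for s, t in items]
        table.append(([math.log(p) for p in ps], [math.log(1.0 - p) for p in ps]))
    return table


def lz(resp, table):
    """lz evaluated at a grid-search maximum likelihood trait estimate."""
    def loglik(row):
        logp, logq = row
        return sum(lp if u else lq for u, lp, lq in zip(resp, logp, logq))

    logp, logq = max(table, key=loglik)
    l0 = loglik((logp, logq))
    expected = sum(math.exp(lp) * lp + math.exp(lq) * lq for lp, lq in zip(logp, logq))
    variance = sum(math.exp(lp + lq) * (lp - lq) ** 2 for lp, lq in zip(logp, logq))
    return (l0 - expected) / math.sqrt(variance)


def simulate(theta, items, rng, kind="normal", pct=0.0):
    """One response pattern; a fraction `pct` of items is answered aberrantly."""
    aberrant = set(rng.sample(range(len(items)), round(pct * len(items))))
    resp = []
    for i, (s, t) in enumerate(items):
        if i in aberrant and kind == "random":
            resp.append(1 if rng.random() > 0.5 else 0)  # coin flip
        else:
            # Fake good: trait score shifted upward by 1.5 on aberrant items.
            shift = 1.5 if (i in aberrant and kind == "fake") else 0.0
            resp.append(1 if rng.random() < zg_prob(theta + shift, s, t) else 0)
    return resp


def run(n=500, alpha=0.05, seed=1):
    rng = random.Random(seed)
    # Hypothetical test: statement locations are invented for illustration.
    items = [(rng.uniform(-2.5, 2.5), rng.uniform(-2.5, 2.5)) for _ in range(20)]
    table = build_table(items)

    def sample(kind, pct):
        return [lz(simulate(rng.gauss(0.0, 1.0), items, rng, kind, pct), table)
                for _ in range(n)]

    # Empirical lower-tail critical value from a separate set of normal simulees.
    crit = sorted(sample("normal", 0.0))[int(alpha * n)]

    def flag_rate(values):
        return sum(v < crit for v in values) / len(values)

    return {"type1": flag_rate(sample("normal", 0.0)),
            "power_random_100": flag_rate(sample("random", 1.0)),
            "power_fake_100": flag_rate(sample("fake", 1.0))}


if __name__ == "__main__":
    print(run())
```

Looping `run` over seeds, test lengths, aberrance percentages, and estimated statement parameters would approximate the full design; the exact rates depend on the hypothetical statement locations used here.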
Results
Tables 1 through 3 show the average Type I error and power rates across the 100 replications in each simulation condition. In particular, Table 1 presents detailed results for Type I error under conditions of test length (10 and 20 items), test information (medium and high information), type of statement parameters (TRUE, MML1000, MML500), and type of critical values (empirical, theoretical).
Table 1. Type I Error Rates for Empirically and Theoretically Driven Critical Values.
Table 2. Power to Detect Random Responding With Empirically and Theoretically Driven Critical Values.
Table 3. Power to Detect Faking Good With Empirically and Theoretically Driven Critical Values.
As can be seen, the Type I error rates for the empirical critical values matched perfectly with the respective nominal alpha levels. Specifically, Type I errors of .01, .05, .10, and .20 were found for the nominal alphas of .01, .05, .10, and .20, respectively, regardless of test length, test information, and the type of statement parameters. In contrast, with the exception of the .01 nominal alpha level, the theoretical critical values resulted in consistently lower than expected Type I errors and the negative bias increased as the nominal alpha increased from .05 to .20. Importantly, however, there were no marked differences in Type I error as a function of the type of statement parameters, test information, or test length.
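For reference, the theoretical critical values can be obtained directly from the standard normal quantile function. The sketch below assumes lower one-tailed tests at the four nominal alpha levels, which gives cutoffs of roughly -2.326, -1.645, -1.282, and -0.842; the function names are hypothetical.

```python
from statistics import NormalDist


def theoretical_critical_values(alphas=(0.01, 0.05, 0.10, 0.20)):
    """Lower one-tailed standard normal cutoffs for each nominal alpha."""
    std_normal = NormalDist()
    return {alpha: std_normal.inv_cdf(alpha) for alpha in alphas}


def flag_aberrant(lz_value, alpha=0.05):
    """Flag a response pattern whose lz falls below the theoretical cutoff."""
    return lz_value < NormalDist().inv_cdf(alpha)


for alpha, cutoff in theoretical_critical_values().items():
    print(f"alpha = {alpha:.2f}: flag if lz < {cutoff:.3f}")
```

An observed Type I error below the nominal alpha at these cutoffs means that the lower tail of the lz distribution for normal responders holds less mass than the standard normal tail at that point.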
Table 2 presents the power results for random response detection using empirical and theoretical
Table 3 shows the power results for detecting faking good, which was operationalized as a consistent upward shift in trait scores on items that were designated as aberrant. As can be seen in the table, power to detect faking was poor in every case. In what were optimal conditions for detecting random responding (20 items, high information, 100% aberrance), power for detecting faking was only .16 with empirical critical values and a nominal alpha of .20, and the results were even worse with stricter alphas. Neither test length nor test information had a beneficial effect on power, nor did the use of true statement parameters or empirical critical values. The only interesting finding is that power was lowest, as expected, in the respective 100% aberrance conditions due to the inability to distinguish an “across-the-board” faker from a truly high-trait responder.
To buttress the interpretation of the power results in Tables 1 through 3 and address the specific hypotheses that were proposed above, an ANOVA and planned comparisons were conducted. Table 4 shows the ANOVA results for main effects and interactions that accounted for at least 1% of the variance in power. All of the factors manipulated were statistically significant (p < .05), with the largest effect observed for the type of aberrance. Power was markedly higher for detecting random responding than for fake good responding (p < .0001;
Table 4. Main Effects and Interactions for Independent Variables on Power Rates.
Note. All effects shown were significant at p < .05. Only interaction effects that accounted for at least 1% of the variance in power are included.
Discussion
The primary goal of this study was to investigate the efficacy of the
In short,
Finding that faking good is difficult to detect was not surprising. If a respondent fakes on a large proportion of items, there will be few apparent inconsistencies in the response pattern, making it difficult to distinguish a spuriously high from a truly high-trait score. Similarly, if just a small percentage of items are faked, the likelihood of the response pattern would be very similar to that of a normal responder, which would also reduce hit rates. These results are consistent with the optimal appropriateness measurement findings and conclusions of Zickar and Drasgow (1996), who examined fake good response detection with Likert-type personality scales in an experiment involving coached and ad-lib faking conditions, and thus confirm existing evidence that faking is difficult to detect.
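The first point can be made explicit under the operationalization of faking used in the simulation, namely a constant upward shift of 1.5 in the trait score used to generate responses. The symbol $\theta^{*}$ is introduced here only for illustration. When every item is faked, each response is generated with probability

$$P_{st}(\theta + 1.5) = P_{st}(\theta^{*}), \qquad \theta^{*} = \theta + 1.5,$$

so the complete response pattern has exactly the same distribution as that of an honest respondent whose true trait score is $\theta^{*}$. Because the person fit statistic is evaluated at the estimated trait score, which simply tracks $\theta^{*}$, its distribution for an across-the-board faker matches that of a normal responder at $\theta^{*}$, and detection rates cannot be expected to exceed the Type I error rate for such a responder.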
Another interesting and important finding of the study was that using MML statement parameter estimates based on samples of 500 yielded power and Type I error rates that were nearly identical to those for the true parameter values. This is consistent with the findings of small effects in research involving single-statement IRT models (Hendrawan et al., 2005). It is also good news for practitioners because the true parameters are never known and, for obvious reasons, pretest samples of 500 and smaller are preferred. In the future, it would be interesting to explore whether subject matter expert (SME) ratings of statement location would be as effective for
Finally, although previous
It is also important to note that significant
Limitations and Future Research
This study has some limitations that might be explored in future investigations. First, the data structures were very clean; that is, normal data were generated according to the ZG model and aberrant responses were generated based on the operationalizations of random responding and faking. However, in the real world, some degree of model-data misfit is always present, so it would be interesting to explore the power to detect random responding when model assumptions are violated. It might also be beneficial to compare
Another limitation of this study is that it focused on one index for detecting aberrance. In future work, it would be interesting to compare the performance of ZG-based
Finally, no real data were examined in this study, so researchers are encouraged to conduct investigations aimed at explaining aberrance using auxiliary information, such as response latencies, conscientiousness scores, and perhaps level of education. In addition, researchers might explore the possibility of using
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
