Most person-fit statistics require long tests to reliably detect aberrant item-score vectors and are not readily applicable to noncognitive measures that consist of multiple short subscales. The authors propose combining subscale person-fit information to detect aberrant item-score vectors on noncognitive multiscale measures. They used a simulation study and three empirical personality and psychopathology test datasets to assess five multiscale person-fit methods based on the $l_z$ person-fit statistic with respect to (a) identifying aberrant item-score vectors, (b) improving accuracy of research results, and (c) understanding causes of aberrant responding. Simulated data analysis showed that the person-fit methods had good detection rates for substantially misfitting item-score vectors. Real-data person-fit analyses identified 4% to 17% misfitting item-score vectors. Removal of these vectors improved model fit and test-score validity only slightly. The person-fit methods helped to understand causes of aberrant responding after controlling for response style on the explanatory variables. More real-data analyses are needed to demonstrate the usefulness of multiscale person-fit methods for noncognitive multiscale measures.
Lack of motivation, misunderstanding of questions, untraitedness, stylistic responding, or social desirability may cause aberrant response behavior on self-report measures of typical performance (Ferrando, 2012; Tellegen, 1988). Consequently, trait measurement may be confounded and individual decision-making may be adversely affected, for example, in personnel selection (Christiansen, Goffin, Johnston, & Rothstein, 1994) and clinical treatment planning (Piedmont, McCrae, Riemann, & Angleitner, 2000). Also, psychometric properties derived from contaminated data may be invalid (Meijer, 1997; Woods, 2006).
Item response theory (IRT)–based person-fit analysis (PFA; Meijer & Sijtsma, 2001) can be used to detect aberrant item-score vectors. PFA has its roots in cognitive and educational measurement (Levine & Drasgow, 1982), but PFA may also be used in noncognitive or typical performance measurement (e.g., Emons, 2008; Ferrando, 2010, 2012; Reise, 1995; Reise & Flannery, 1996). Inventories for noncognitive measurement often assess a general trait by means of several short, unidimensional subscales containing no more than 15 items. However, person-fit (PF) statistics lack power to detect misfit on scales containing fewer than 20 items (Emons, 2008; Reise, 1995; Reise & Flannery, 1996). Hence, one needs PF statistics that combine the statistical information from the multiple subscales to gain power and allow drawing conclusions about the general trait. This study proposes and investigates multiscale PF statistics.
The core PF statistic used was statistic $l_z$ (Drasgow, Levine, & McLaughlin, 1987). Statistic $l_z$ is the standardized log-likelihood of an item-score vector given the estimated, unidimensional IRT model. Five $l_z$-based PF methods that combine person-fit information from multiple short scales into one PF statistic were compared with respect to the degree to which they (a) identified aberrant item-score vectors, (b) improved the accuracy of research results, and (c) enabled understanding the causes of aberrant responding. To address the first issue, a simulation study was used to determine the multiscale PF methods’ Type I error rate and detection rate. To address the other two issues, the multiscale PF methods were used to analyze panel data from the International Personality Item Pool 50-item questionnaire (IPIP-50; Goldberg et al., 2006) and clinical data from the Brief Symptom Inventory (BSI; Derogatis, 1993). Based on the results, the authors discuss the usefulness of the $l_z$-based multiscale PF methods for noncognitive assessment.
Multiscale PFA
Statistic $l_z$ for Polytomous Items
Noncognitive measurement commonly uses rating scale items; hence, statistic $l_z$ for polytomous items, denoted by $l_z^p$, was used (Drasgow, Levine, & Williams, 1985). Emons (2008) found that among several polytomous-item PF statistics, statistic $l_z^p$ had a higher detection rate. The authors of this study defined $l_z^p$ under the graded response model (GRM; Samejima, 1997). Larger negative $l_z^p$ values indicate a higher degree of misfit. See online Appendix A for the equations of the GRM and the $l_z^p$ statistic (available at http://apm.sagepub.com/supplemental).
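The computation of the polytomous $l_z$ statistic can be illustrated with a minimal sketch. It assumes the usual standardization (observed log-likelihood minus its expectation, divided by its standard deviation) and a GRM with logistic cumulative response functions; the function names and the item parameters in the usage lines are hypothetical and are not taken from Appendix A.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities P(X = k | theta) for one GRM item.

    a: discrimination; b: increasing thresholds (length K - 1 for K categories).
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k) for k = 1..K-1, padded with 1 and 0.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

def lz_poly(scores, theta, a, b):
    """Standardized log-likelihood of one polytomous item-score vector."""
    l0 = expected = variance = 0.0
    for x, a_j, b_j in zip(scores, a, b):
        p = grm_category_probs(theta, a_j, b_j)
        logp = np.log(p)
        l0 += logp[x]                      # observed log-likelihood
        e_j = np.sum(p * logp)             # expected log-likelihood of item j
        expected += e_j
        variance += np.sum(p * logp ** 2) - e_j ** 2
    return (l0 - expected) / np.sqrt(variance)

# Hypothetical six-item subscale with five response categories (0..4).
a = [1.5] * 6
b = [[-1.5, -0.5, 0.5, 1.5]] * 6
lz_consistent = lz_poly([2, 2, 2, 2, 2, 2], theta=0.0, a=a, b=b)
lz_aberrant = lz_poly([0, 4, 0, 4, 0, 4], theta=0.0, a=a, b=b)
```

In this toy example the alternating extreme scores yield a negative value and the modal scores a positive value, in line with the interpretation that larger negative values indicate more misfit.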
Multiscale PF Approaches
Five multiscale PF methods based on statistic $l_z^p$ were assessed. Methods 2 to 5 were constructed for multiscale measures and assume unidimensionality for each subscale. Methods 4 and 5 were designed to address limitations of Methods 1 to 3.
Approach 1: The unidimensional approach
If a general, higher order trait drives the subscale traits producing positively correlating subscale scores, one may compute statistic
here denoted
for the complete set of items collected from the subscales (Conrad et al., 2010). This unidimensional approach identifies item-score patterns that are invalid measures of the general higher order trait but is useless if the subscales measure distinct uncorrelated traits (e.g., see the IPIP-50). Statistic
uses many items; hence, it is powerful and may have high detection rate.
Approach 2: Subscale analysis
Statistic
is computed for each subscale separately (e.g., Emons, 2008; Reise & Waller, 1993), and an item-score vector is classified as aberrant if at least one of the
shows evidence of person misfit. The subscale-analysis approach is denoted by
This approach suffers from low power, and requires control of the Type I error rate if used to obtain a conclusion about fit or misfit on the complete multiscale measure.
Approach 3: Multiscale extension
Based on the multitest extension of statistic
for dichotomous item scores (Drasgow, Levine, & McLaughlin, 1991), statistic
is the sum of the
values of S unidimensional subscales indexed s =1, …, S, such that
Statistic
collects PF information from multiple subscales into one measure but allows misfit on one scale to be compensated by fit on another scale. The method best identifies misfit that is consistent across subscales, for example, when a respondent lacks motivation or concentration throughout the whole test, but may lack power for identifying subscale-specific misfit (Conijn, Dolan, & Vorst, 2007; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999).
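Pooling person-fit information across subscales can be sketched as follows. The sketch assumes the form of the multitest statistic described by Drasgow, Levine, and McLaughlin (1991), in which the unstandardized log-likelihoods, their expectations, and their variances are summed over subscales before standardizing; whether the equation in this article matches that form exactly is an assumption, and the function name and input layout are illustrative.

```python
import math

def lz_multiscale(components):
    """Pool subscale person-fit information into one standardized statistic.

    components: iterable of (l0, expected, variance) tuples, one per subscale,
    where l0 is the observed log-likelihood on that subscale.
    """
    components = list(components)
    numerator = sum(l0 - expected for l0, expected, _ in components)
    denominator = math.sqrt(sum(variance for _, _, variance in components))
    return numerator / denominator

# Two hypothetical subscales: misfit on the first, slightly better than
# expected fit on the second.
pooled = lz_multiscale([(-10.0, -8.0, 4.0), (-5.0, -6.0, 5.0)])
```

Because the deviations are summed before standardizing, a negative deviation on one subscale can be offset by a positive deviation on another, which is the compensation described above.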
Approaches 4 and 5: Combining
and
Method
combines subscale
values and statistic
and it is expected that
improves detection rates for persons consistently showing misfit across subscales compared with separate-subscale analyses. In contrast to statistic
statistic
uses subscale-specific information separately; thus, statistic
may also improve detection rates for persons that show misfit on a few but not on all subscales.
For statistic
for all possible subsets out of a total of S subscales, the
(or
) values are computed. For example, for
,
(or
) is computed for seven subsets (
,
,
), (
,
), (
,
), (
,
), and (
), (
), and (
). If at least one of the resulting statistics suggests significant misfit, the item-score vector is classified as misfitting. A variation of
is to only use the
statistic based on all subscales and the
for the single subscales, and denote the method
This means that for
an item-score vector is classified as misfitting if
and $l_z^p$ for at least one of the subsets (
,
,
), (
); (
), or (
) are significant.
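The subsets involved in the two combined methods can be enumerated with a short sketch. The function names are hypothetical; the first function lists every nonempty subset of the S subscales (seven subsets for S = 3), and the second lists only the full set plus the single subscales used in the restricted variant.

```python
from itertools import combinations

def all_subscale_subsets(n_subscales):
    """All nonempty subsets of subscales 1..S, largest subsets first."""
    index = range(1, n_subscales + 1)
    return [subset
            for size in range(n_subscales, 0, -1)
            for subset in combinations(index, size)]

def restricted_subsets(n_subscales):
    """The full set of subscales followed by each single subscale."""
    full = tuple(range(1, n_subscales + 1))
    return [full] + [(s,) for s in full]

subsets_full = all_subscale_subsets(3)
subsets_restricted = restricted_subsets(3)
```

The number of statistics grows as 2^S − 1 in the first case and as S + 1 in the second, which is one reason the multiple-testing control discussed in the next section matters.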
Common Issues for
-Multiscale Methods
Provided the IRT model is consistent with the item-score vector and true θ values are available, statistic $l_z^p$ is standard normally distributed (Drasgow et al., 1985), but Nering (1995) showed that the use of an estimate of θ invalidates standard normality. In this study, a parametric bootstrap procedure (De la Torre & Deng, 2008) was used to compute
and
and the p values. To prevent inflated Type I error rates for statistics
and
the false discovery rate (FDR) was controlled instead of the more traditional and less powerful family-wise error rate (e.g., Bonferroni correction). To this end, the Benjamini and Hochberg (BH) procedure (Benjamini & Hochberg, 1995) was used.
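The Benjamini and Hochberg step-up procedure can be sketched as follows for the set of p values obtained for one item-score vector. The function name is illustrative, and the FDR level of .05 used as the default is an assumption made for the example rather than a value taken from the article.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up thresholds q * i / m for the i-th smallest p value.
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        reject[order[:k + 1]] = True       # reject all hypotheses up to that rank
    return reject

# Hypothetical p values for the statistics computed on one item-score vector.
flags = benjamini_hochberg([0.001, 0.02, 0.03, 0.5])
vector_is_misfitting = bool(flags.any())
```

An item-score vector would then be classified as misfitting if at least one of its statistics is rejected.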
Study 1: Simulation Study
Research Questions
In this project, a simulation study was used to investigate whether the
-multiscale methods are useful for detecting aberrant item-score vectors. More specifically, the following questions were addressed:
Do empirical Type I error rates adhere to the nominal Type I error rates?
What are the detection rates for realistic test length and realistic item properties?
Method
Data generation
The GRM was used to simulate, for 10,000 simulees, item scores with five ordered values, 0, …, 4, for a multiscale measure with five subscales. Item parameter estimates (online Appendix B, available at http://apm.sagepub.com/supplemental) were based on the empirical IPIP-50 data (Study 2, Method). The latent trait θ values followed a multivariate standard normal distribution.
Lack of motivation or concentration, misunderstanding of questions, or low traitedness may cause random errors, known in the context of PFA as random responding. Person misfit was therefore simulated using a response probability equal to .2 for each response option. Extreme response style (ERS) or agreement bias may result in more typical aberrant item-score vectors than random error and hence restrict the generalization of results. Moreover, Emons (2008) found that statistic $l_z^p$ detects random responding twice as well as ERS. Hence, it was decided not to include response styles and to limit attention to random responding.
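The data generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the item parameters are hypothetical placeholders rather than the Appendix B estimates, the GRM uses logistic cumulative response functions, and all function names are illustrative.

```python
import numpy as np

def simulate_grm(theta, a, b, rng):
    """Simulate GRM item scores.

    theta: (n,) trait values; a: (J,) discriminations;
    b: (J, K - 1) increasing thresholds. Returns (n, J) scores in 0..K-1.
    """
    theta = np.asarray(theta, dtype=float)[:, None, None]
    a = np.asarray(a, dtype=float)[None, :, None]
    b = np.asarray(b, dtype=float)[None, :, :]
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))     # P(X >= k), k = 1..K-1
    u = rng.random((p_star.shape[0], p_star.shape[1], 1))
    # One uniform draw per person-item pair; the score counts exceeded steps.
    return (u < p_star).sum(axis=2)

def simulate_multiscale(n, a, b, rho, rng):
    """Simulate S subscales whose latent traits correlate rho."""
    n_subscales = len(a)
    cov = np.full((n_subscales, n_subscales), rho)
    np.fill_diagonal(cov, 1.0)
    thetas = rng.multivariate_normal(np.zeros(n_subscales), cov, size=n)
    return np.hstack([simulate_grm(thetas[:, s], a[s], b[s], rng)
                      for s in range(n_subscales)])

def add_random_responding(scores, item_idx, n_categories, rng):
    """Replace selected item scores by uniform draws (probability 1/K each)."""
    out = np.array(scores, copy=True)
    out[item_idx] = rng.integers(0, n_categories, size=len(item_idx))
    return out

rng = np.random.default_rng(0)
a = np.full((5, 6), 1.5)                                # hypothetical values
b = np.tile([-1.5, -0.5, 0.5, 1.5], (5, 6, 1))          # hypothetical values
data = simulate_multiscale(200, a, b, rho=0.6, rng=rng)
misfitting = add_random_responding(data[0], np.arange(6), 5, rng)
```

With five response categories, the uniform draw gives each option a probability of .2, as in the simulated random responding.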
Simulated data including misfitting item-score vectors were used to estimate the item and person parameter values used to compute the PF statistics. For parameter estimation, MULTILOG 7 (Thissen, Chen, & Bock, 2003) was used. The
of item-score vectors that contain only 0s or 4s are uninformative of person fit and were excluded from the analyses. To classify item-score vectors as misfitting, significance testing was one-tailed,
For methods
and
an item-score vector was classified as misfitting if at least one of the statistics
and
was significant using the BH procedure to control the FDR at level
Design characteristics
A cross-factorial design with four factors and 60 conditions was defined: (a) percentage of misfitting item-score vectors was 10% or 30%; (b) the latent traits corresponding to the five subscales correlated .4, .6, or .8; (c) number of items per subscale was 6 (30 items in total) or 12 (60 items in total); and (d) methods
and
For these conditions, the data followed a simple factor structure. In each cell, 50 replications were realized.
Real noncognitive data do not always show a simple factor structure. Hence, in a separate design, data consistent with the bifactor model (Reise, Morizot, & Hays, 2007) were generated. Each item loaded on a general trait factor, and items of three subscales loaded on one of three subscale-specific trait factors. Item discrimination parameters for the specific and general factors were in proportion of 5:6 and were chosen such that correlations between subscale scores and coefficient alphas equaled those generated under the simple structure condition and θs correlating .6. Having the same data-correlational structure for the bifactor model and the simple structure model enabled the authors to assess effects the bifactor structure had on PF statistic performance. The factors subscale length and PF method were combined with a fixed misfit of 10% to generate 10 bifactor conditions. In each cell, 50 data replications were simulated.
Random responding was implemented as “global misfit” or “subscale misfit,” and the percentage of random item scores in an item-score vector was varied. Global misfit was realized by randomly selecting 20%, 40%, 60%, or 80% of the items from all subscales. Subscale misfit was realized by first nominating subscales and then, for nominated subscales, randomly selecting 50% or 100% of the item scores, either from one subscale or from two subscales. Each of the eight kinds of misfitting item-score vectors was equally represented in the data.
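The selection of items that receive random scores under the two misfit types can be sketched as follows. Two points are assumptions: global misfit is taken to sample items from the pooled item set without stratifying by subscale, and the affected subscales are taken to be nominated at random. Function names are illustrative.

```python
import numpy as np

def global_misfit_items(n_subscales, items_per_subscale, proportion, rng):
    """Indices of randomly chosen items across the whole measure."""
    n_items = n_subscales * items_per_subscale
    k = int(round(proportion * n_items))
    return np.sort(rng.choice(n_items, size=k, replace=False))

def subscale_misfit_items(n_subscales, items_per_subscale, n_affected,
                          proportion, rng):
    """Indices of randomly chosen items within randomly nominated subscales."""
    nominated = rng.choice(n_subscales, size=n_affected, replace=False)
    k = int(round(proportion * items_per_subscale))
    idx = []
    for s in nominated:
        start = s * items_per_subscale      # first item of subscale s
        idx.extend(start + rng.choice(items_per_subscale, size=k, replace=False))
    return np.sort(np.array(idx))

rng = np.random.default_rng(1)
# 40% random item scores spread over a 60-item measure.
items_global = global_misfit_items(5, 12, 0.40, rng)
# 50% random item scores on each of two nominated 12-item subscales.
items_subscale = subscale_misfit_items(5, 12, 2, 0.50, rng)
```

Crossing the four global percentages with the four subscale variants (one or two subscales, 50% or 100% of their items) gives the eight kinds of misfitting item-score vectors mentioned above.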
Results
For the simple structure conditions, Table 1 shows the Type I error rate (rows corresponding to “no misfit”) and the detection rate for methods
and
For
the results for all three θ-correlation levels are reported. Variation in θ-correlation had little effect on performance of other methods; hence, only results for θ-correlation of .6 are discussed.
Note. Means were based on 50 replications; standard errors were <.01; due to missing
on average 1.2% and 0.1% item-score vectors were excluded from the analyses for the 30-item condition and the 60-item condition, respectively.
Research Question 1: Adherence to nominal Type I error
For both the simple structure conditions (Table 1) and the bifactor conditions (not tabulated), empirical Type I error ranged from .01 to .05 in the 10% misfit condition. Type I error ranged from .00 to .01 in the 30% misfit condition. Low Type I error was due to random item scores that caused biased item discrimination (
) estimates; on average,
were 0.13 and 0.32 units too low in the 10% misfit and 30% misfit conditions, respectively. As a result, the
values were too high, and too few item-score vectors were classified as misfitting. Using true item parameters to compute
Type I errors on average were .05, .05, .03, and .04 for
and
respectively. For
using
and
values estimated in a dataset without person misfit resulted in Type I errors between .05 and .08, with higher values for a lower θ-correlation.
Research Question 2: Detection rates
All methods showed detection rates between .73 and 1.00 for the 60-item condition if at least 40% of the item scores were random, and for the 30-item condition if at least 60% of the item scores were random. All methods failed to have good detection rates in the other conditions. Detection rates decreased as percentage of misfitting item-score vectors increased, and increased as number of items and percentage of random item scores increased. Contrary to expectation,
was rather insensitive to size of θ-correlation. Detection rate of PF statistics differed little between simple structure data and the bifactor data. The largest difference was .026 higher in bifactor data (condition:
30 items, θ-correlation .60).
For global misfit, method
had the highest detection rate. For 60 items and 30 items, including 10% misfit and at least 40% random item scores,
had a detection rate between .84 and 1.00. For 30 items including 30% misfit, detection rates of
were only good when at least 60% of the item scores were random. For subscale misfit, method
had the highest detection rates. In the 60-item condition, detection rates (range: .77 to .99) of
were good for item-score vectors with 100% random scores on one or two subscales. Across all conditions, method
had the best detection rate, almost as good as
for global misfit and
for subscale misfit.
Conclusion From Study 1
Detection rates of different methods strongly depend on misfit type. Compared with the other methods, methods
and
had higher detection rates for global misfit, and methods
and
had higher detection rates for subscale misfit. As expected, method
had relatively high detection rates for both subscale and global misfit. If one does not have articulated expectations of the manifestation of misfit, method
is a safe choice in terms of power.
Study 2: Real-Data Applications
Research Questions
The authors investigated whether
-multiscale methods can be used to correct bias in research results due to person misfit, and to understand the causes of aberrant responding. Specifically,
Research Question 1: Does removal of misfitting item-score vectors identified by the
-multiscale methods improve the fit of confirmatory factor analysis (CFA) models and provide more convincing evidence of discriminant and convergent validity?
Research Question 2: Does statistic
relate to explanatory variables for aberrant responding?
These questions were addressed using empirical data collected by means of three multiscale measures with short subscales: the IPIP-50 (Goldberg et al., 2006), the BSI (Derogatis, 1993), and the BSI-18 (Derogatis, 2001). As the authors only had access to relevant explanatory variables for the IPIP-50 data, they addressed the second question only for the IPIP-50 data.
Method
Participants
The IPIP-50 data came from the Longitudinal Internet Studies for the Social sciences (LISS) panel and were collected by CentERdata (Tilburg University, the Netherlands) in 2008. The panel completed a survey that included the IPIP-50 and several other personality, mood, and attitude scales. The study sample consisted of 6,791 participants (45.4% male). The BSI data were collected in a sample of 1,270 clinical outpatients (38.6% male) who completed the BSI at intake at four sites of a Dutch public mental health care institution.
Measures
The Dutch IPIP-50 (Hendriks, Hofstee, & De Raad, 1999) consists of five 10-item subscales, each measuring one of the Big-Five personality factors: extraversion, agreeableness, conscientiousness, neuroticism, and intellect. All items have a five-point rating scale. The Dutch BSI (De Beurs, 2004) consists of 53 five-point rating scale items, 49 of which are divided across nine subscales that include 4 to 7 items and measure different symptoms of psychopathology. The BSI short version, the BSI-18 (Derogatis, 2001), consists of three 6-item subscales measuring somatization, depression, and anxiety. In practice, for the BSI and BSI-18, subscale scores and a total score referred to as the global severity index are used.
Statistical analyses
Methods
and
were used for PFA, but method
was not used for the IPIP-50 analysis because the inventory measures five distinct traits. The methods assume a fitting GRM; a misfitting GRM may confound person misfit results. Therefore, prior to the PFA, GRM fit to subscale data was assessed. For the IPIP-50, fit results for the one-factor model suggested multidimensionality and local dependence for all subscales except Agreeableness. For the BSI and the BSI-18, fit was sufficient for all subscales except the BSI-18 Depression subscale.
To decide how to deal with the IPIP-50 model misfit, subscale PFA was used for the complete IPIP and for a shortened IPIP composed of subscales of 7 to 10 items that the GRM fitted well. The comparison of the correlations between subscale $l_z^p$ values resulting from the two analyses shed light on the usefulness of the PFA results; higher correlations were assumed to indicate more valid $l_z^p$ values because a confounding of model misfit and person misfit would lower the correlations. On average, the correlations between subscale $l_z^p$ values were .04 lower (range: –.01 to .10) for the shorter IPIP, suggesting that removing items sacrificed relevant PF information (Woods, 2006). Therefore, items were not excluded from the PFA.
Results for IPIP-50
The percentages of persons the five
-multiscale methods detected ranged from 15.6% (
) to 17.3% (
). Methods
and
primarily detected the same persons; hence, for Research Question 1, the results for only methods
and
are discussed.
Research Question 1
The authors determined whether removal of misfitting item-score vectors improved the fit of the theoretical five-factor model (Hendriks et al., 1999) and affected the correlations between the five subscales. Low correlations may be considered evidence supporting discriminant validity for each of the IPIP-50 subscales (John & Srivastava, 1999).
The nonlinear CFA model for categorical data (Muthén & Muthén, 2007) was fitted to the IPIP-50 data. A value for the root mean square error of approximation (RMSEA) of .08 or less suggests acceptable model fit, but appropriate cut-off values also depend on sample size, model size, and model specifications (Kenny, Kaniskan, & McCoach, 2011). Values for the Tucker–Lewis index (TLI) and the Comparative Fit index (CFI) in excess of .95 suggest good model fit (Hu & Bentler, 1999), but these indices also depend on sample size (e.g., Bollen, 1990). Hence, to keep sample size constant, the following procedure was conducted. First, the model-fit indices for the original data were determined. Second, the model-fit indices were determined for the original data in which persons classified as misfitting were replaced by a random sample of persons not classified as misfitting. The second step was repeated 10 times using different random samples, and the mean model-fit indices were compared with the model-fit indices for the original data. Table 2 shows the values of RMSEA, TLI, and CFI for the total sample, and the mean values for the samples excluding misfit using either
or
Note. For the IPIP-50, n = 6,786; for the BSI, n = 1,268; for the BSI-18, n = 1,258. Due to item-score vectors including only 0s or 4s, < 1% persons were excluded from each data set. IPIP-50 = International Personality Item Pool 50-item questionnaire; RMSEA = root mean square error of approximation; TLI = Tucker–Lewis index; CFI = Comparative Fit index; BSI = Brief Symptom Inventory.
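The replacement procedure used to keep sample size constant can be sketched as follows. The sketch assumes that replacement persons are drawn with replacement from the persons not classified as misfitting, which the article does not specify, and `fit_index` is a hypothetical callable standing in for a fit index (e.g., RMSEA) returned by CFA software.

```python
import numpy as np

def replace_misfitting(data, misfit, rng):
    """Replace rows flagged as misfitting by randomly drawn fitting rows."""
    data = np.asarray(data)
    misfit = np.asarray(misfit, dtype=bool)
    fitting_rows = np.flatnonzero(~misfit)
    # Assumption: donors are sampled with replacement among fitting persons.
    donors = rng.choice(fitting_rows, size=int(misfit.sum()), replace=True)
    out = data.copy()
    out[misfit] = data[donors]
    return out

def mean_fit_after_replacement(data, misfit, fit_index, n_rep=10, seed=0):
    """Average a fit index over n_rep datasets with misfit replaced."""
    rng = np.random.default_rng(seed)
    values = [fit_index(replace_misfitting(data, misfit, rng))
              for _ in range(n_rep)]
    return float(np.mean(values))
```

The mean returned by `mean_fit_after_replacement` would then be compared with the same index computed on the original data, so that differences are not attributable to a change in sample size.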
Removing misfitting item-score vectors improved model fit only slightly. Exclusion based on
decreased RMSEA the most (.006), and removal based on
increased TLI (.032) and CFI (.018) the most. In the total sample, the correlations between the IPIP-50 scales ranged from –.27 to .34, and increased after misfitting item-score vectors had been removed. The absolute differences were the largest using
and ranged from .01 to .03. Hence, removal of misfitting item-score vectors weakened evidence of discriminant validity of the IPIP-50 subscales.
Research Question 2: Explaining person misfit
A multiple regression analysis was used to study whether
relates to explanatory variables (online Appendix C provides a description of the scales used as explanatory variables, available at http://apm.sagepub.com/supplemental). Table 3 gives the authors’ expectation of the effects of the explanatory variables.
Note.“+”/“−” indicates a positive/negative effect of the explanatory variable on
. n = 6,250. IPIP-50 = International Personality Item Pool 50-item questionnaire.* p < 0.10, ** p < 0.05, *** p < 0.01 (one tailed).
The multiple regression model explained 6% of the variance. Table 3 shows the correlations between
and the explanatory variables, and the regression coefficients (Model 1). Except for gender, regression coefficients and correlations had the same sign. Gender, education level, need for cognition, and neuroticism had significant effects in the expected direction. Other significant effects ran counter to expectation. Persons scoring higher on survey attitude, survey understanding, survey involvement, agreeableness, conscientiousness, and intellect showed poorer person fit. Effects were small, which is a consistent finding in explanatory PFA (Conijn, Emons, Van Assen, Pedersen, & Sijtsma, 2013).
Unexpected results may be due to a confounding effect of response styles. Response styles that relate to item content or item wording may lead to spuriously high or low scores on explanatory variables but they may also produce low
values. To study the effect of response style on the relationships between person fit and explanatory variables, measures for social desirability bias, agreement bias, and ERS were added to the model. ERS was quantified by the total number of most extreme scores, 0 and 4.
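The ERS measure described above, a count of the most extreme scores, can be computed as in the following sketch. The function name is illustrative, and which items enter the count is left to the caller because the article does not specify it here.

```python
import numpy as np

def ers_score(scores, low=0, high=4):
    """Number of most extreme item scores per respondent.

    scores: one item-score vector (1-D) or a persons-by-items array (2-D).
    """
    scores = np.asarray(scores)
    return np.sum((scores == low) | (scores == high), axis=-1)

# Hypothetical item-score vectors on a 0..4 rating scale.
counts = ers_score([[0, 4, 2, 1, 4],
                    [2, 3, 1, 2, 2]])
```

The resulting count can be entered as an additional predictor in the regression model alongside the other response-style measures.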
Results suggest that ERS probably confounded the effects survey attitude, survey understanding, survey involvement, agreeableness, conscientiousness, and intellect had on person fit. Table 3 (Model 2) shows the results for a regression model including ERS, which produced a 46% explained-variance increase. The effect of ERS on person fit was negative. After having accounted for ERS, all but one of the explanatory variables that initially had an unexpected effect on
now had an effect in the expected direction. The exception was the effect of survey involvement, which remained negative. Effect of education level changed from significant to not significant, which also ran counter to the authors’ expectations. To conclude, ERS related to social desirability (correlation .92) produced biased explanatory-variable scores and interfered with explanatory PFA.
Results for BSI and BSI-18
For the BSI, the percentage of detected persons ran from 10.8% (
) to 15.4% (
), and for the BSI-18, the percentage ran from 3.8% (
) to 7.3% (
). The authors discuss the results for Research Question 1 for methods
and
As with the IPIP-50, they do not report results for
and
.
Research Question 1
To evaluate the improvement of model fit by excluding misfitting item-score vectors, the theoretical nine-factor and three-factor models were first fitted to the BSI and the BSI-18 data, respectively (Derogatis, 2001). Overlap between subscale traits in both cases rendered the covariance matrix of the latent factors not positive definite. Hence, the models could not be used, but for the BSI, a second-order factor model was used instead of the nine-factor model (Hoe & Brekke, 2009), and for the BSI-18, a one-factor model (Meijer, De Vries, & Van Bruggen, 2011) was used. The authors also studied whether removing misfitting item-score vectors changed the correlations of the BSI and BSI-18 with the symptom distress subscale of the Dutch version of the Outcome Questionnaire-45 (OQ-45; De Jong et al., 2007; Lambert et al., 2001). These correlations provide support that the BSI and BSI-18 have convergent validity (De Jong et al., 2007).
Table 2 shows the effect of removing misfitting item-score vectors on model-fit indices for the BSI and the BSI-18. Results were similar for both inventories. Removing misfitting item-score vectors based on
had the largest effects, for both scales causing RMSEA to decrease by .01. TLI increased by .017, and contrary to the total sample result, TLI suggested good model fit. CFI increased by .076 for the BSI and by .044 for the BSI-18. In the total sample, both the BSI and the BSI-18 correlated .76 with the OQ-45 symptom distress subscale. The correlation did not change after removing misfitting vectors using either
or
Conclusion From Study 2
Model fit improved modestly when misfitting item-score vectors were removed from the data. Exclusion of person misfit hardly affected correlations supporting either discriminant or convergent validity or had an effect contrary to theoretical expectations. Statistic
explained aberrant response behavior in the IPIP-50 data after accounting for ERS related to socially desirable responding. Low power, meaning many misfitting item-score vectors go undetected, may partly explain the small effect excluding person misfit has on model fit and validity measures. To investigate possible low power, the detection rates of the
-multiscale methods were determined, given the properties of the IPIP-50, the BSI, and the BSI-18 data, by conducting simulations similar to those of Study 1. For the IPIP-50, the PF methods had good power for detecting substantial misfit, but for the BSI and the BSI-18, good detection rates were found only for item-score vectors with at least 60% random item scores (BSI) and item-score vectors with 100% aberrant item scores (BSI-18). As item-score vectors including more than 60% aberrant item scores may be rare in real data, low power is expected for the BSI and BSI-18.
Meijer (2003) recommended choosing a more liberal α level so as to increase power in PFA. Using
for the IPIP-50,
was used, and for the BSI and the BSI-18,
was used to remove misfitting item-score vectors. Little additional effect of removal on model fit and validity measures was found relative to
An explanation other than low power is that the type of person misfit the
-based methods detected hardly affected the item-score correlation matrix used as the input for CFA or correlations supporting validity. Statistic
has relatively low power for detecting misfitting item-score vectors due to a systematic response style, for example, agreement bias or ERS (e.g., Emons, 2008). Statistic
has reasonable power for detecting random error, but correcting data for random error may affect correlations only little (Drasgow & Kang, 1984).
General Discussion
The simulation study showed that
-multiscale methods have good power for long tests (≥ 60 items), and for medium-length tests (30 items) provided the data include little person misfit. Combining information from different subscales increases power considerably. For example, Emons (2008) found that for a 12-item subscale with 50% random item scores,
had a detection rate of .50. In contrast, the authors of this study found that for five 12-item scales with 40% random item scores in all subscales, the best-performing multiscale method,
had a detection rate of .96. If only one subscale included 50% random item scores,
had a detection rate of .47. Hence, using
-multiscale methods when misfit is consistent across subscales may produce a substantial gain in power, whereas the use hardly reduces power when misfit is subscale specific. The authors advise using multiscale PF statistics when a test consists of several short subscales, and choosing a method dependent on whether misfit is expected to be a local or a global phenomenon.
The
statistic performed relatively well when unidimensionality was violated, suggesting that PFA is robust against violations of GRM assumptions. This finding contradicts previous suggestions. Woods (2008) separated balanced scales with poor model fit into item subsets containing only positively worded or negatively worded items, and Emons (2008) suggested resorting to less powerful nonparametric PF statistics. However, if PFA is robust against model violations, following their suggestions may weaken PFA results. More research is needed on this topic.
Most PFA research on Type I error rates and detection rates used item parameters obtained in data without person misfit (e.g., Emons, 2008; Reise, 1995), but in noncognitive test data, lack of self-interest or faking good or bad is a realistic cause of aberrant responding, rendering misfitting item-score vectors and biased item parameter estimates unavoidable. Based on the findings of this study, the authors speculate that previous studies overestimated the performance of PFA in real-life settings.
The real-data applications suggest that a multiple regression model for statistic
was useful for exploring the causes of aberrant responding as they were predicted in the literature. An important finding was that one needs to account for response styles when the explanatory variables can be affected by aberrant response behavior. Furthermore, the results suggest that the usefulness of the
-multiscale methods for correcting bias in research results may be limited.
Item-score vectors classified as misfitting are not necessarily influential with respect to misfit of the normative model (also see Reise & Widaman, 1999) and may have minor effects on criterion-related validity (also see Meijer, 1997; Schmitt, Cortina, & Whitney, 1993; Schmitt et al., 1999). Consistent with outlier analysis, it may therefore be useful in PFA to distinguish misfitting or “outlying” observations and influential observations (Barnett & Lewis, 1994, pp. 9, 317).
Real-data analysis in PF research is a useful addition to simulation studies: The simulation study provided information about PF statistics' performance, and the real-data analyses suggested that the methods may not detect misfit that negatively affects model fit or distorts validity indices. Methods’ detection rates were sufficient, so that future studies may demonstrate the usefulness of the
-multiscale methods for improving individual decision-making. PF methods may also detect substantively interesting person misfit that requires follow-up analysis for individual respondents (Emons, Sijtsma, & Meijer, 2005; Ferrando, 2012). The authors conclude that more real-data studies are needed to demonstrate the usefulness of the
-multiscale methods for noncognitive measurement.