Abstract
This study examined whether cutoffs in fit indices suggested for traditional formats with maximum likelihood estimators can be utilized to assess model fit and to test measurement invariance when a multiple group confirmatory factor analysis was employed for the Thurstonian item response theory (IRT) model. Regarding the performance of the evaluation criteria, detection of measurement non-invariance and Type I error rates were examined. The impact of measurement non-invariance on estimated scores in the Thurstonian IRT model was also examined through accuracy and efficiency in score estimation. The fit indices used for the evaluation of model fit performed well. Among six cutoffs for changes in model fit indices, only ΔCFI > .01 and ΔNCI > .02 detected metric non-invariance when the medium magnitude of non-invariance occurred and none of the cutoffs performed well to detect scalar non-invariance. Based on the generated sampling distributions of fit index differences, this study suggested ΔCFI > .001 and ΔNCI > .004 for scalar non-invariance and ΔCFI > .007 for metric non-invariance. Considering Type I error rate control and detection rates of measurement non-invariance, ΔCFI was recommended for measurement non-invariance tests for forced-choice format data. Challenges in measurement non-invariance tests in the Thurstonian IRT model were discussed along with the direction for future research to enhance the utility of forced-choice formats in test development for cross-cultural and international settings.
Due to innovative advancements in psychometric modeling based on item response theory (IRT) that allows for interpersonal comparisons (e.g., Brown & Maydeu-Olivares, 2011b; Stark et al., 2005), forced-choice formats have been increasingly applied in educational and personnel selection settings (e.g., Anguiano-Carrasco et al., 2015; Dueber et al., 2019; Guenole et al., 2018; Organisation for Economic Co-operation and Development [OECD], 2014; Usami et al., 2016). Along with the popular use of forced-choice formats in assessment construction, validation studies for the use of forced-choice formats demonstrated that estimated scores from the IRT-based approaches represented target traits well and reduced response distortion that may exist in single-stimulus formats by comparing scores between single-stimulus (e.g., Likert-type-scale items) and forced-choice formats (e.g., Anguiano-Carrasco et al., 2015; Guenole et al., 2018; Usami et al., 2016). However, only a few studies (e.g., Bartram, 2013a, 2013b) have examined whether the measurement of constructs through forced-choice formats works equivalently across various subgroups of respondents such as those with different genders, racial/ethnic backgrounds, and from different countries. Except for Bartram’s studies using a forced-choice format version of the Occupational Personality Questionnaire (OPQ32i; SHL Group, 2006), no published studies appearing in academic databases (e.g., PsycINFO and ERIC) have investigated the fairness aspect of validity in the use of scores from a forced-choice assessment across heterogeneous groups of respondents. In other words, the equivalence of psychometric properties in measurement (measurement invariance; Millsap, 2011) for forced-choice formats has not been explored in depth.
Although Bartram (2013a, 2013b) examined construct and scalar equivalence in the OPQ32i across different countries, the employed psychometric approaches were slightly different from commonly used approaches such as multiple group confirmatory factor analyses (CFAs) and multiple indicators multiple causes (MIMIC) models. Different from the approach using model fit indices with which most applied researchers are familiar, Bartram employed correlational analyses for scalar invariance of the forced-choice format assessment and explored whether group-level differences in mean values and standard deviations of trait scores were related to scores from other scales measuring group-level effects. In addition, a multilevel modeling approach was used to examine the proportion of between-country variance in forced-choice format score variance.
Regarding a psychometric approach to test measurement invariance in the Thurstonian IRT model, Brown and Maydeu-Olivares (2018) mentioned that a multiple group CFA can be employed to test equivalence of psychometric properties in measurement including loadings and thresholds across subgroups of respondents. However, with respect to evaluation measures to determine measurement invariance in forced-choice formats, there is a clear lack of research into whether the evaluation criteria established for single-stimulus formats would perform well when the Thurstonian IRT model is fit to responses from forced-choice formats. In addition, considering that the first step of a multiple group CFA is the examination of model fit, a question may arise in terms of selecting evaluation criteria to assess model fit because the commonly used evaluation criteria such as comparative fit index (CFI) no smaller than .95 and root mean square error of approximation (RMSEA) no larger than .06 from Hu and Bentler (1999) were established with maximum likelihood estimators (MLEs), whereas the Thurstonian IRT model uses limited information methods. Although there have been studies demonstrating that model fit evaluation criteria established for multivariate normal data with MLE do not work for the limited information estimation methods (e.g., Nye & Drasgow, 2011), most empirical studies employing forced-choice formats (e.g., Anguiano-Carrasco et al., 2015; Brown & Maydeu-Olivares, 2011b; Guenole et al., 2018; Lee et al., 2018) relied on such rules of thumb for the evaluation of model fit.
The purpose of this simulation study is to investigate whether cutoffs in fit indices suggested for traditional formats with estimation methods such as MLE can be utilized to test measurement invariance when a multiple group CFA was employed for the Thurstonian model. Using the multiple group CFA, the evaluation of measurement invariance in this study was based on a holistic approach through fit indices. Regarding the performance of evaluation criteria in fit indices, detection rates of measurement non-invariance and Type I error rates were examined. In addition, the impact of measurement non-invariance on estimated scores in the Thurstonian IRT model was examined through the accuracy and efficiency in score estimation. Based on the findings, this study aimed to provide information about the selection of evaluation criteria when a multiple group CFA was used for the detection of measurement non-invariance.
This study describes the conceptual framework of the Thurstonian IRT model and evaluation of measurement non-invariance in multiple group CFAs. Two simulation studies were conducted to examine six fit indices and suggest new cutoffs. Finally, challenges in measurement non-invariance tests in the Thurstonian IRT model were discussed.
Thurstonian IRT Model
By applying Thurstonian factor models, Brown and Maydeu-Olivares (2011b) developed the Thurstonian IRT model. Equation 1 shows the probability of selecting statement i over statement k within a block of statements:
The probabilistic function of the binary outcome in a pairwise comparison is a two-dimensional normal ogive IRT model where
The Thurstonian IRT model estimates item and trait (person) parameters from forced-choice item responses from a unidimensional or multidimensional scale. As in Equation 1, the relationship between an item and a trait is assumed to be based on a dominance model (linear relationship). According to Thurstone’s law of comparative judgment, a person j prefers item i to k when the utility for item i (
The local dependence occurring in an item block composed of more than two items is accounted for by modeling a covariance structure in the Thurstonian IRT model. Assume that an item block is composed of three items i, k, and q and that a respondent ranked the three items as 1, 2, and 3, respectively, with 1 representing “the most important” (the highest utility) and 3 representing “the least important” (the lowest utility). In this example, the ranks can be coded as {i, k} = 1, {i, q} = 1, and {k, q} = 1 through three pairwise comparisons.
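The coding of ranks into binary pairwise outcomes described above can be sketched in a few lines of Python. This is only an illustration of the coding rule; the function name `ranks_to_pairwise` and the dictionary layout are assumptions made for this example and do not come from the original study.

```python
from itertools import combinations


def ranks_to_pairwise(ranks):
    """Convert within-block ranks (1 = most preferred) into binary outcomes.

    `ranks` maps an item label to its rank. For each pair (a, b), taken in
    the order the items are listed, the outcome is 1 when a is ranked above
    b (a smaller rank number) and 0 otherwise.
    """
    items = list(ranks)
    return {(a, b): int(ranks[a] < ranks[b]) for a, b in combinations(items, 2)}


# The example from the text: items i, k, q ranked 1, 2, 3.
block = {"i": 1, "k": 2, "q": 3}
print(ranks_to_pairwise(block))
# {('i', 'k'): 1, ('i', 'q'): 1, ('k', 'q'): 1}
```

A block of three items always yields three pairwise outcomes, so 20 such blocks produce the 60 binary variables analyzed later in the study.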
The choices in pairwise comparisons involving the same item such as the pairs {i, k} and {i, q} are not independent after controlling for respondent’s standings on the traits measured by the items. To account for local dependence, variance shared between the two pairwise comparisons is incorporated into the Thurstonian IRT model based on the mathematical derivation
The Thurstonian IRT model is estimated using limited information methods such as unweighted least squares or diagonally weighted least squares. If Mplus (Muthén & Muthén, 1998–2017) is used, the equivalent estimator is unweighted least squares with mean- and variance-adjusted test statistics (ULSMV; Muthén & Muthén, 1998–2017). When item blocks are composed of three or more items, a correction to the degrees of freedom is required due to redundancies among thresholds and tetrachoric correlations (Maydeu-Olivares, 1999). The redundancy in each block is computed as
Detection of Measurement Non-Invariance: Multiple Group CFA
Multiple group CFAs have become the most common method to examine measurement invariance, compared with MIMIC models (Meade & Lautenschlager, 2004). The terminology used for different types of measurement invariance is as follows: configural invariance tests the same factor structure, metric invariance tests equality in factor loadings, and scalar invariance tests equality of intercepts. Metric invariance subsumes configural invariance, and scalar invariance includes both other types. To conduct a subsequent test such as scalar invariance, configural and then metric invariance should first be established (Vandenberg, 2002), and these tests are typically performed using chi-square difference tests. A more detailed description of measurement invariance tests is provided in the online appendix.
However, as chi-square difference tests have been criticized due to sensitivity to sample size (Bentler & Bonett, 1980; Brannick, 1995; Meade & Lautenschlager, 2004), the use of alternative fit indices has been suggested because such indices are less sensitive to sample size and perform better in detecting measurement non-invariance than chi-square difference tests (Chen, 2007; Cheung & Rensvold, 2002; Meade et al., 2008). In addition, the DIFTEST option in Mplus does not currently support the adjustment needed for the Thurstonian IRT model, making it challenging for applied researchers and practitioners to adopt chi-square difference tests. In contrast, the use of changes in fit indices is relatively accessible because these fit indices simply need to be recomputed using corrected degrees of freedom.
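As a rough illustration of recomputing fit indices with corrected degrees of freedom, the following Python sketch may help. Several points are assumptions rather than details taken from this article: the per-block redundancy count n(n − 1)(n − 2)/6 follows the formula usually attributed to Maydeu-Olivares (1999) and Brown and Maydeu-Olivares (2011b); the redundancies are assumed to be subtracted once per group; and the RMSEA, CFI, and NCI expressions are common textbook forms (software may use N instead of N − 1 or apply a multiple group scaling factor). All function names and the numbers in the usage lines are hypothetical.

```python
import math


def block_redundancies(block_size):
    # Assumed formula: a block of n items has n(n-1)(n-2)/6 redundancies.
    n = block_size
    return n * (n - 1) * (n - 2) // 6


def corrected_df(reported_df, block_sizes, n_groups=1):
    # Assumption: every group contains the same blocks, so the redundancies
    # are subtracted once for each group.
    per_group = sum(block_redundancies(n) for n in block_sizes)
    return reported_df - n_groups * per_group


def rmsea(chi2, df, n):
    # Textbook single-formula RMSEA; n is the total sample size.
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2, df, chi2_base, df_base):
    # Comparative fit index relative to a baseline model.
    num = max(chi2 - df, 0.0)
    den = max(chi2 - df, chi2_base - df_base, 0.0)
    return 1.0 if den == 0 else 1.0 - num / den


def nci(chi2, df, n):
    # McDonald's noncentrality index (one common definition).
    return math.exp(-0.5 * (chi2 - df) / (n - 1))


# Hypothetical output values for a two-group model with 20 triplet blocks.
df = corrected_df(reported_df=3300, block_sizes=[3] * 20, n_groups=2)
print(df)  # 3260
print(round(rmsea(3500.0, df, 1000), 4), round(nci(3500.0, df, 1000), 4))
```

With blocks of three items the correction is one degree of freedom per block per group, which is why the reported degrees of freedom drop by 40 in this hypothetical example.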
Evaluation Criteria for Measurement Invariance: Changes in Fit Indices
Cheung and Rensvold (2002) and Meade et al. (2008) suggested the use of absolute changes (Δ) in alternative fit indices to test measurement invariance. Fit indices used for measurement invariance tests in the literature are CFI (Bentler, 1990), gamma hat (
Chen (2007) also suggested the use of ΔCFI ≤ .010 and ΔRMSEA < .015 as evidence of measurement invariance when sample sizes are larger than 300, the ratio between groups is equal, and the pattern of measurement non-invariance is nonuniform. When sample sizes are smaller than 300 with an unequal ratio between groups and the pattern of measurement non-invariance is uniform, Chen suggested the cutoff criteria ΔCFI < .005 and ΔRMSEA < .010. In international settings, the Teaching and Learning International Survey (TALIS) operated by the OECD (2014) adopted ΔCFI < .02 and ΔRMSEA < .03 as the evaluation criteria for metric invariance and ΔCFI < .01 and ΔRMSEA < .010 for scalar invariance (Rutkowski & Svetina, 2017). In addition, Rutkowski and Svetina stated that the criteria adopted in TALIS were established specifically for cases where the number of groups was large and sample sizes varied widely across groups.
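The decision rule behind such cutoffs can be sketched as follows. The code simply compares absolute changes in fit indices between two nested invariance models against a set of cutoffs; the function names and the fit values are hypothetical and serve only to show the mechanics, using the ΔCFI and ΔRMSEA values that Chen (2007) suggested for larger samples.

```python
def fit_changes(less_constrained, more_constrained):
    """Absolute change in each fit index between two nested models."""
    return {
        name: abs(more_constrained[name] - less_constrained[name])
        for name in less_constrained
    }


def flags_noninvariance(changes, cutoffs):
    """Flag an index when its absolute change exceeds the cutoff."""
    return {name: changes[name] > cutoffs[name] for name in cutoffs}


# Hypothetical fit values for a configural and a metric invariance model.
configural = {"CFI": 0.972, "RMSEA": 0.031}
metric = {"CFI": 0.958, "RMSEA": 0.036}

# Cutoffs suggested by Chen (2007) for sample sizes larger than 300.
chen_cutoffs = {"CFI": 0.010, "RMSEA": 0.015}

changes = fit_changes(configural, metric)
print(flags_noninvariance(changes, chen_cutoffs))
# {'CFI': True, 'RMSEA': False}
```

In this made-up case the two indices disagree, which is the kind of situation that motivates evaluating each cutoff separately for Type I error and detection rates.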
Although using alternative fit indices to determine measurement invariance seems more advantageous compared with chi-square difference tests due to less sensitivity to sample size and relative ease in accessibility, there seems to be no evidence that the same cutoffs would work well for forced-choice formats. Also, for the evaluation of model fit, which is necessary to determine configural invariance, it should be noted that adopting commonly used cutoffs which were established with the use of MLE (CFI ≥ .95, RMSEA ≤ .06) may not be appropriate for forced-choice formats because limited information estimators are employed. Thus, Study 1 investigated (a) the performance of the established criteria for the evaluation of model fit and (b) the performance of the existing cutoffs for the changes in fit indices to determine measurement non-invariance when a Thurstonian IRT model was employed for forced-choice format response data. As a follow-up, Study 2 was conducted for the recommendation of better cutoffs to improve the detection of measurement non-invariance.
Study 1
Method
Data generation
Data generation was performed in R (R Core Team, 2018) based on 20 blocks of RANK format forced-choice items. Blocks were composed of three items, and each item measured one of the five personality traits. The response data were generated through three pairwise comparisons, such as {item i, item k}, {item i, item q}, and {item k, item q}, based on Equation 1, resulting in 60 pairwise comparisons. Parameter values (factor loadings, thresholds, and error variances) and the correlation coefficients for the five traits used for data generation were from Brown and Maydeu-Olivares (2018; Table 1A in the online supplement). Random error was incorporated into each response by comparing each computed probability from Equation 1 to a unique random value from a uniform distribution [0, 1]. Afterward, comparisons were coded with a binary value of 0 if they were less than the random value or 1 if they were greater (see footnote 2).
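The response-generation step can be illustrated with a small Python sketch (the study itself used R). Because Equation 1 is not reproduced in this text, the sketch assumes the normal ogive form commonly given for the Thurstonian IRT model by Brown and Maydeu-Olivares (2011b), with threshold `gamma`, loadings `lam_i` and `lam_k`, and unique variances `psi2_i` and `psi2_k`. The parameter values are hypothetical placeholders, not the values from Brown and Maydeu-Olivares (2018).

```python
import math
import random


def normal_cdf(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pair_probability(eta_a, eta_b, lam_i, lam_k, gamma, psi2_i, psi2_k):
    # Assumed form of Equation 1: probability of preferring item i
    # (measuring trait a) over item k (measuring trait b).
    z = (-gamma + lam_i * eta_a - lam_k * eta_b) / math.sqrt(psi2_i + psi2_k)
    return normal_cdf(z)


def simulate_response(prob, rng):
    # Coded 1 when the probability exceeds the uniform draw, 0 otherwise,
    # following the rule described in the text.
    return int(prob > rng.random())


rng = random.Random(2018)  # arbitrary seed for reproducibility
p = pair_probability(eta_a=0.5, eta_b=-0.2, lam_i=0.8, lam_k=0.7,
                     gamma=0.1, psi2_i=0.36, psi2_k=0.51)
responses = [simulate_response(p, rng) for _ in range(10000)]
print(round(p, 3), sum(responses) / len(responses))
```

Over many draws the proportion of 1s should approach the computed probability, which is a quick sanity check on the generation rule.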
Manipulated factors
A total of five factors were manipulated in this study: (a) types of non-invariance, (b) magnitudes of measurement non-invariance, (c) numbers of items manipulated, (d) directions of non-invariance, and (e) characteristics of manipulated items/item pairs. Table 1 denotes the condition names used for the combinations of the manipulated factors.
Table 1. Measurement Non-Invariance Conditions.
Among the 60 pairwise comparisons, the five and 10 lowest or highest factor loadings (weakest or strongest loadings in absolute value) were changed by ±0.3 and ±0.6 in the focal group for the manipulation of small and medium magnitudes of metric non-invariance. For small and medium magnitudes of scalar non-invariance, the five and 10 lowest or highest thresholds were changed by ±0.25 and ±0.5 in the focal group. The magnitudes for metric and scalar non-invariance coincide with those employed in previous studies (e.g., Lee et al., 2017; Oshima et al., 1997) where differential item functioning (DIF) was investigated based on the multidimensional IRT framework under a similar test length setting; Lee et al. also included a condition with 12 items per factor. By adapting effect size measures from Meade (2010) to the ipsative data, the small loading manipulation resulted in average expected score standardized differences equivalent to Cohen’s (1988) d = .3, and the medium manipulation was equivalent to d = .5. For thresholds, small and medium manipulations corresponded to around d = .2 and d = .4, respectively. Because thresholds involve two items (an item pair; see Equation 1), the manipulation of one item pair’s threshold affected another item pair sharing the manipulated item. In this simulation study, it was assumed that scalar non-invariance occurred due to one of the two items from an item pair being either more difficult or easier to endorse. For example, scalar non-invariance occurred due to an increase or decrease in the threshold of Item 1 from the Item 1 and Item 2 pair. Thus, the threshold of the pair composed of Items 1 and 3 is also affected by the change occurring in Item 1. As a result, five threshold manipulations resulted in changes of up to 10 thresholds of item pairs. Of the two items in a pair, the item showing the relatively stronger loading (better discrimination) was chosen for the manipulation of scalar non-invariance to intensify the effects of measurement non-invariance.
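The statement that five threshold manipulations change up to 10 pairwise thresholds follows from the block structure, as the short Python sketch below shows. The item numbering and the helper name `affected_pairs` are hypothetical; the only assumption is that every pair within a block containing a manipulated item has its threshold affected.

```python
from itertools import combinations


def affected_pairs(blocks, manipulated_items):
    """List the pairwise comparisons that involve a manipulated item."""
    manipulated = set(manipulated_items)
    return [
        pair
        for block in blocks
        for pair in combinations(block, 2)
        if manipulated & set(pair)
    ]


# Five hypothetical blocks of three items each.
blocks = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]

# One manipulated item per block: each item sits in two pairs.
print(len(affected_pairs(blocks, [1, 4, 7, 10, 13])))  # 10

# Two manipulated items in the same block share a pair, so fewer are affected.
print(len(affected_pairs(blocks, [1, 2, 7, 10, 13])))  # 9
```

The second call shows why the count is an upper bound: manipulated items that share a block overlap in the pairs they affect.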
As each item pair involves two items, the number of pairwise comparisons manipulated for measurement non-invariance corresponds to the case where 10 and 20 single-stimulus items out of 60 behave differentially across subgroups. The manipulated proportions of items for non-invariance (17% and 33%) were enough to cover the proportion (25%) considered in Meade et al. (2008). The sample size of 500 for each subgroup was considered for this study because most of the simulation and empirical studies on the Thurstonian IRT model employed sample sizes close to 500 or larger (Brown & Maydeu-Olivares, 2010, 2012, 2013; Guenole et al., 2018), although Maydeu-Olivares and Brown (2010) stated 200 as the minimum sample size. To examine Type I error rates and detection of measurement non-invariance, changes in fit indices across the three types of measurement invariance models (configural, metric, and scalar invariance) were used as a criterion for the evaluation of measurement non-invariance. Three evaluation cutoffs in absolute changes (ΔCFI > .01,
In Study 1, a total of 16 manipulated measurement non-invariance conditions for two measurement non-invariance types were examined based on the six evaluation cutoffs. A stepwise multiple group CFA approach was employed to test measurement non-invariance for 100 data sets from each condition. In Study 2, the null sampling distributions of fit index differences were generated for the recommendation of cutoff values based on 1,000 data sets where measurement invariance held. The sampling distributions were employed to determine cutoff values which correspond to critical values for rejecting the null hypothesis of measurement invariance at α = .05. This procedure was based on that from Chen (2007) and Cheung and Rensvold (2002). Mplus (Muthén & Muthén, 1998–2017) and MplusAutomation (Hallquist & Wiley, 2018) were used for the analyses of the data sets in each condition.
Dependent measures
Type I error rates for the measurement invariance conditions (null conditions), configural invariance for the metric and scalar non-invariance conditions, and metric invariance for the scalar non-invariance conditions were examined based on model fit; CFI < .95 and RMSEA > .06 were considered as poor fit. If the configural invariance models and the metric invariance models where only thresholds were manipulated (PHT, NHT, PLT, and NLT conditions, see Table 1 for condition names) showed poor fit, it was counted as Type I error. Type I error was also examined based on the changes in fit indices between the configural and metric invariance models under PHT, NHT, PLT, and NLT conditions because these conditions were only manipulated for scalar non-invariance and as such should not demonstrate metric non-invariance. For CFI,
Corrected degrees of freedom due to redundancy were used for the computation of the changes in CFI,
Results
Convergence and model fit
As a first step, model fit and nonconvergence rates were examined. CFI ≥ .95 and RMSEA ≤ .06 were used to examine whether the models of configural and metric invariance fit data well; configural invariance was tested for all conditions and metric invariance was tested only for scalar non-invariance conditions (PHT, NHT, PLT, NLT). Most models converged well, aside from the NLL conditions (6%–10% nonconvergence). Table 2A in the online supplement and the online appendix provide additional details about nonconvergence rates.
Type I error rate
For configural invariance, Type I error rates are equal to the rates of models with poor fit across conditions. As mentioned above, as the RMSEA cutoff did not show any non-acceptable fit for both sample size conditions, the Type I error rates for configural and metric invariance were reported based on the CFI cutoff value. Type I error rates were below .05 in all conditions (Table 3A in the online supplement).
When changes in relative fit indices were used, the largest Type I error rates were found with ΔCFI > .002, ranging from .62 to .81. The criterion
Table 2. Type I Error Rates.
Note. See Table 1 for the condition names. Type I error rates in PHT, NHT, PLT, and NLT conditions were the proportion of replications falsely detected for metric non-invariance. No Type I errors were found in ΔRMSEA conditions. Values below .05 were bold faced. CFI = comparative fit index; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i = five items were manipulated for non-invariance; 10i = 10 items were manipulated for non-invariance; Null_MI = metric invariance; Null_SI = scalar invariance.
Detection of measurement non-invariance
Regarding the best performance in the detection of measurement non-invariance, ΔCFI > .01 for metric non-invariance performed best based on the results from Type I error rates and the proportion of correct non-invariance detection (Tables 2 and 3). Also, ΔNCI > .02 performs as a modest criterion for the detection of metric non-invariance with the simultaneous consideration of Type I error rate control. Although ΔCFI > .002 and
Table 3. Proportions of Correct Non-Invariance Detection.
Note. See Table 1 for the condition names. Proportions in PHL, NHL, PLL, and NLL conditions were related to correct detection of metric non-invariance. Proportions in PHT, NHT, PLT, and NLT conditions were related to correct detection of scalar non-invariance. No detection was found in ΔRMSEA conditions. Values above .8 were bold faced. CFI = comparative fit index; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i = five items were manipulated for non-invariance; 10i = 10 items were manipulated for non-invariance.
Overall, ΔCFI > .01 and ΔNCI > .02 were found to perform relatively well only when a larger number of items exhibited a relatively larger magnitude of metric non-invariance. Based on the findings with the established cutoffs, any scalar non-invariance seems to be undetectable. Thus, determining new cutoff values for scalar non-invariance was imperative.
Study 2
Method
To suggest cutoff values to detect measurement non-invariance while controlling for Type I error rates at the nominal level, sampling distributions of fit index differences were generated from 1,000 data sets where measurement invariance held. For the fit index differences, only CFI and NCI were considered due to their performance in Study 1. Based on the procedure from Chen (2007) and Cheung and Rensvold (2002), cutoffs were determined using the concept of critical values in the sampling distribution for rejecting the null hypothesis of measurement invariance at α = .05. That is, the proposed cutoffs correspond to the 95th percentiles of the ΔCFI and ΔNCI distributions.
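The percentile-based procedure can be sketched in Python as follows. The sketch uses a nearest-rank definition of the 95th percentile, which is one of several common conventions and is an assumption here, and it draws placeholder "null" differences from an arbitrary distribution purely for illustration. It does not reproduce the ΔCFI or ΔNCI distributions from the study, and the function names are hypothetical.

```python
import math
import random


def percentile_cutoff(null_differences, alpha=0.05):
    """Empirical (1 - alpha) quantile using the nearest-rank method."""
    ordered = sorted(null_differences)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round((1.0 - alpha) * len(ordered), 9))
    return ordered[rank - 1]


def rejection_rate(differences, cutoff):
    """Proportion of replications whose difference exceeds the cutoff."""
    return sum(d > cutoff for d in differences) / len(differences)


# Placeholder null distribution: 1,000 made-up absolute fit index changes.
rng = random.Random(1)
null_diffs = [abs(rng.gauss(0.0, 0.003)) for _ in range(1000)]

cutoff = percentile_cutoff(null_diffs)
print(rejection_rate(null_diffs, cutoff) <= 0.05)  # True
```

Applying `rejection_rate` to replications from a null condition gives an empirical Type I error rate, while applying it to replications from a non-invariance condition gives a detection rate.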
After determining the cutoffs for the detection of metric and scalar non-invariance, it was examined whether the recommended cutoffs control Type I error rates at the nominal level and exhibit greater detection rates of measurement non-invariance than existing cutoffs, especially considering scalar non-invariance.
Results
Recommendation of cutoffs
Based on the sampling distributions, this study proposed the cutoffs of ΔCFI > .007 for metric non-invariance and ΔCFI > .001 for scalar non-invariance. Regarding ΔNCI, the same cutoff as that from Cheung and Rensvold (2002), ΔNCI > .02, was found for metric non-invariance. For scalar non-invariance, this study recommended ΔNCI > .004. Compared with the existing cutoffs, the results showed that detecting scalar non-invariance in forced-choice formats needs smaller cutoffs than detecting metric non-invariance. The findings are aligned with what Rutkowski and Svetina (2017) suggested; the cutoffs for scalar non-invariance were smaller than those for metric non-invariance (e.g., ΔCFI < .02 for metric non-invariance and ΔCFI < .01 for scalar non-invariance).
Type I error rates and detection of measurement non-invariance
As seen in Table 4, the suggested cutoffs control Type I error rates at the nominal level. ΔCFI controls Type I error rates better than ΔNCI. In terms of measurement non-invariance detection, the suggested cutoffs detected both metric and scalar non-invariance better than the existing cutoffs (ΔCFI > .01, ΔNCI > .02) for scalar invariance, especially when medium non-invariance was exhibited for greater numbers of items (Table 5). Also, the new cutoff ΔCFI > .007 demonstrated higher detection rates for metric non-invariance than ΔCFI > .01.
Table 4. Type I Error Rates for Recommended Cutoffs.
Note. See Table 1 for the condition names. Values below .05 are bold faced. CFI = comparative fit index; MI = metric invariance; SI = scalar invariance; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i = five items were manipulated for non-invariance; 10i = 10 items were manipulated for non-invariance; Null_MI = metric invariance; Null_SI = scalar invariance.
Table 5. Proportions of Correct Non-Invariance Detection for Recommended Cutoffs.
Note. See Table 1 for the condition names. Values above .8 are bold faced. DIF = differential item functioning; CFI = comparative fit index; MI = metric invariance; SI = scalar invariance; NCI = noncentrality index; S = small magnitude of non-invariance; M = medium magnitude of non-invariance; 5i = five items were manipulated for non-invariance; 10i = 10 items were manipulated for non-invariance.
To examine factors affecting Type I error rates and the detection of measurement non-invariance, t-tests and analyses of variance (ANOVAs) were conducted. Overall, ΔCFI performed significantly better than ΔNCI, the negative direction of measurement non-invariance led to higher Type I error rates, and a higher magnitude of non-invariance led to greater detection rates (see the online appendix).
Bias and RMSE
The impact of measurement non-invariance on estimated scores was examined with absolute bias, bias, and RMSE values (Table 4A in the online supplement). The amounts of absolute bias across the measurement non-invariance conditions ranged from 0.34 to 0.46, and bias ranged from −0.025 to 0.043. ANOVAs were conducted to investigate factors affecting these values. Overall, failure to detect metric non-invariance was more detrimental than that of scalar non-invariance, and negative direction of non-invariance was more detrimental to accuracy of estimation than the positive direction (see the online appendix).
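For readers unfamiliar with these recovery measures, a minimal Python sketch is given below. It uses the conventional definitions (mean signed error, mean absolute error, and root mean squared error between estimated and true trait scores); the exact definitions used in the study are given in its online appendix, so the formulas, the function name, and the example scores here should be read as assumptions.

```python
import math


def score_recovery(true_scores, estimated_scores):
    """Bias, absolute bias, and RMSE of estimated scores against true scores."""
    errors = [est - true for true, est in zip(true_scores, estimated_scores)]
    n = len(errors)
    return {
        "bias": sum(errors) / n,
        "absolute_bias": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }


# Hypothetical true and estimated trait scores for four respondents.
true_scores = [0.0, 1.0, -1.0, 0.5]
estimated_scores = [0.5, 0.5, -0.5, 0.0]
print(score_recovery(true_scores, estimated_scores))
# {'bias': 0.0, 'absolute_bias': 0.5, 'rmse': 0.5}
```

The example shows how signed bias can average out to zero while absolute bias and RMSE remain sizable, which is consistent with the pattern of small bias and larger absolute bias reported above.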
Discussion
Regarding the performance of the established criteria for the evaluation of model fit, it was found that RMSEA ≤ .06 did not show any poor model fit across all conditions, and small proportions of models were assessed as fitting poorly with the use of CFI ≥ .95, especially under configural invariance. With respect to the performance of the existing cutoffs for the changes in fit indices to determine measurement non-invariance, ΔCFI > .01 and ΔNCI > .02 performed better for the detection of metric non-invariance than the other three cutoffs. However, as ΔCFI > .01 and ΔNCI > .02 performed poorly for the detection of scalar non-invariance, providing more applicable cutoffs was crucial. This study suggested ΔCFI > .001 and ΔNCI > .004 as the cutoffs for scalar non-invariance and ΔCFI > .007 was also provided for the detection of metric non-invariance. Based on the performance related to Type I error rate control and measurement non-invariance detection, it was concluded that ΔCFI > .007 for metric non-invariance and ΔCFI > .001 for scalar non-invariance were recommended for the cutoffs for measurement non-invariance tests when the Thurstonian IRT model was fit for forced-choice format data.
Regarding the impact of failure in the detection of non-invariance, the average amount of bias may not appear consequential; however, when considering decisions at individual levels (e.g., selection or admission), failure in the detection of non-invariance can potentially jeopardize test fairness, especially in multicultural settings where items related to certain personality traits are more or less favored than other items due to cultural backgrounds. For example, when ranking statements such as “I waste my time” (Item 1), “I get irritated easily” (Item 2), and “I talk to a lot of people at parties” (Item 3), an individual from a culture where implicit stigma is attached to showing negative affect may provide the ranks “maybe like me,” “least like me,” and “most like me” for the three statements. As a result, Item 3 becomes easier and Item 2 becomes more difficult to endorse, potentially causing both items to become less discriminating due to the similar response patterns from a majority of respondents from the same cultural background. Then, the negative non-invariance may lead to a positive bias in estimated trait scores, resulting in higher trait scores when non-invariance is ignored, compared with the scores of an individual from a background where showing irritation does not have any cultural connotation.
This study employed a sample size of 500 per group with an equal ratio. Initially, a sample size of 200 per group was included; however, nonconvergence and poor fit were often detected under the 200 sample size condition, with 20% nonconvergence under configural invariance. Related to sample size, the findings from Lin and Brown (2017) showed that measurement non-invariance did not affect the estimation of scores, but attention should be paid to the sample sizes in Lin and Brown: 62,639 and 22,610 participants for the quad and triad formats of a forced-choice assessment composed of 104 item blocks, respectively. These findings may not be applicable in most research settings, where the numbers of item blocks and respondents are more likely to be small. For example, the sample size was 420 and the number of item blocks was 20 in Guenole et al. (2018), and Anguiano-Carrasco et al. (2015) had a sample size of 283 with eight blocks composed of three items each. In the illustrated cases, failure in the detection of measurement non-invariance due to small sample sizes can potentially threaten test fairness. Thus, the authors would recommend that future research investigate cutoffs under various sample size conditions.
In addition, the stepwise procedures in a multiple group CFA present methodological challenges in identifying items contributing to measurement non-invariance. Because measurement invariance tests employ a holistic approach that uses changes in model fit indices, the results from the tests do not provide much information besides the output from model modification indices (MODINDICES) offered by Mplus (Muthén & Muthén, 1998–2017). For assessment developers, the analysis output may not have practical usefulness when non-invariance is detected; all they can do based on the MODINDICES output is free the constraints imposed on thresholds or loadings that the output flags, compare changes in fit indices, and repeat this process until the values are smaller than the criterion value set for MODINDICES. Compared with various DIF detection methods, including those for forced-choice formats in the generalized graded unfolding model (Roberts et al., 2000), which focus more on item-level information, the multiple group CFA stepwise procedures do not seem practically useful, especially for piloting forced-choice items for assessment development. Therefore, in-depth studies are called for to investigate cutoffs for the evaluation of measurement non-invariance in forced-choice formats along with methods to identify items contributing to non-invariance.
The cutoff values suggested in this study were based on only the RANK forced-choice format. Considering that the RANK format offers more information than other forced-choice formats, the use of MOLE or PICK formats may lead to greater challenges in the detection of measurement non-invariance as the occurrence of measurement non-invariance may be on the pairwise comparisons related to missing responses. This problem would be exacerbated for the PICK format due to more limited information compared with RANK or MOLE.
The recommended cutoffs that can be used for forced-choice format invariance tests will be useful information to reduce potential threats to test fairness. However, it should be noted that the recommendations from this study were based on the assumption that there are no mean differences in trait scores between the reference and focal groups. That is, the context of the current research was based on the conditions where personality traits between a reference and focal group may not necessarily be expected to differ (e.g., gender), but different endorsement occurs due to different interpretations of items or different response styles toward items or item blocks. However, as in the literature, cultural differences may affect five-factor personality scores. For example, extroversion scores were found to be lower in Asian cultures than European and American cultures, whereas agreeableness scores were higher in most Asian and African cultures than countries from Western cultures (Allik & McCrae, 2004; Hofstede & McCrae, 2004). Therefore, the authors recommend that future research include different trait levels among subgroups of respondents along with various measurement non-invariance conditions, different types of forced-choice formats, and different sample sizes for the generalization of the findings, especially for cross-cultural research.
Supplemental Material
Supplemental material, supplemental_material for Fit Indices for Measurement Invariance Tests in the Thurstonian IRT Model by HyeSun Lee and Weldon Z. Smith in Applied Psychological Measurement
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
