Abstract
Growing interest in measurement invariance (MI)/equivalence for nonachievement measures across heterogeneous populations in an international context also demands an examination of current criteria for establishing scale score comparability. Namely, in international large-scale assessments (ILSAs) operationally, establishing MI using multiple-groups analyses typically relies on the criteria proposed by research that was limited in scope to few groups and relatively small sample sizes. Recent studies that examined situations resembling those of ILSAs found mixed results with respect to the evaluation criteria for testing MI. The current article extended this line of research by first illustrating the current practices in applied research using empirical examples, and second, by evaluating the performance of several fit measures in two simulation studies when the data were assumed to be ordered categorical and multidimensional in nature. Our simulation study results suggested that in some cases, typical or newly proposed recommendations were not suitable in large-group, varied sample size, multidimensional contexts. Thus, we call for a new measure that more appropriately accounts for complex data characteristics on assessments such as ILSAs.
In cross-cultural psychology, it is commonly of interest to measure and compare populations on some theoretical construct of interest. As an example from educational psychology, international large-scale assessments (ILSAs), such as the Programme for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS), serve as a fruitful basis from which to derive measures of affective and motivational domains, including self-concept, attitudes, and feelings about learning and school (e.g., Marsh et al., 2015; Ozel, Caglak, & Erdogan, 2013; Segeritz & Pant, 2013). In a similar vein, international surveys such as the Teaching and Learning International Survey (TALIS) seek to measure and compare teachers’ attitudes, perceptions, and experiences related to education. Examples of studying other psychological constructs abound, including social axioms (e.g., Bou Malham & Saucier, 2014), physical self-perception (e.g., Hagger, Biddle, Chow, Stambulova, & Kavussanu, 2003), cognitive emotional regulation (e.g., Megreya, Latzman, Al-Attiyah, & Alrashidi, 2016), and identity processing styles during cultural transition (e.g., Szabo, Ward, & Fletcher, 2016).
Regardless of the context, scores that represent the underlying constructs of interest on international assessments and surveys are often summarized in terms of model-based scale scores (Organisation for Economic Co-operation and Development [OECD], 2010; Olson, Martin, & Mullis, 2008). An important precursor to making meaningful comparisons across groups on scale scores involves measurement equivalence or measurement invariance (MI). Namely, this criterion states that a construct ought to be understood and measured equivalently across groups (Meredith, 1993). In practice, researchers generally adopt one of two approaches when examining score comparability: multiple-groups confirmatory factor analysis (MG-CFA; Jöreskog, 1971) or differential item functioning (DIF)-type approaches to establishing the equivalence (for an overview of such methods, see Millsap & Everson, 1993; Penfield & Lam, 2000; Potenza & Dorans, 1995). 1
In the current study, we emphasize the use of MG-CFA to study score comparability as it aligns with operational practice in TALIS and PISA. Our purpose is to provide some methodological guidance for cross-cultural psychologists, particularly those who wish to examine comparability of scores across multiple groups or subpopulations. To provide further motivation for our perspective, we note several empirical studies published in the Journal of Cross-Cultural Psychology, principally with respect to establishing measurement or cultural equivalence across several countries (e.g., Bou Malham & Saucier, 2014; Cieciuch, Davidov, Vecchione, Beierlein, & Schwartz, 2014; He, Buchholz, & Klieme, 2017; Olatunji et al., 2009). We use two empirical examples to demonstrate that currently used criteria for determining MI in large numbers of populations might not be well suited in this context. Furthermore, we offer three idiosyncratic features of equivalence studies that might require revisions to current criteria.
Typical examples of MI methods are well established in the literature, in particular when only two groups are compared (e.g., Bollen, 1989; Millsap, 2011). However, in cross-cultural settings, it is often of interest to investigate larger numbers of groups from four (Cieciuch et al., 2014) to 20 or more groups (e.g., Bou Malham & Saucier, 2014; OECD, 2014b; Spini, 2003). The complexity of multiple-group comparisons in this setting is compounded by multiple cultural, language, and geographical differences and combinations of those differences. As such, understanding the performance of typically used methods for investigating invariance beyond two groups is important.
Current operational practice in establishing MI within international assessments assumes that the observed indicators are continuous and normally distributed (OECD, 2014a, 2014b). This is particularly problematic given that most of the measured variables on such surveys are ordinal in nature (e.g., Likert-type items with four or five response options), producing possibly incorrect inferences (Lubke & Muthén, 2004). Studies in psychology journals often make the same assumption, even though observed variables are often ordinal at best (Cieciuch et al., 2014; Lucas et al., 2008; Olatunji et al., 2009). To that end, recent research examined typical MI approaches in settings where groups are large in number and found that assuming continuous distributions for ordinal variables is problematic in these settings (Desa, 2014; Rutkowski & Svetina, 2017).
The abovementioned studies assumed the underlying model to be unidimensional, which is operationally consistent in many studies; however, examples where multidimensional scales are compared across many countries exist (Lucas et al., 2008; OECD, 2014b). Importantly, many psychological constructs are multidimensional in nature (e.g., the Big Five) and should be modeled as such. Thus, we view the current study as an extension of previous (unidimensional) research in the international context, and we examine the degree to which typical fit measures are suitable for MI in international studies when comparability of scores is desirable in a setting with multidimensional constructs and many populations. To that end, we supplement two empirical examples with simulation studies.
The current study adds to the MI literature and extends previous work by considering the practical situation of multidimensional constructs that are compared across large numbers of groups with large within-group sample sizes. Although this research shares commonalities with, for example, Rutkowski and Svetina (2017), the increased use of MG-CFA for multidimensional constructs in this setting necessitates a dedicated investigation. Furthermore, in the current study, we examined MI within the context of studies that have smaller sample sizes across larger numbers of groups but with greater multidimensionality, so as to illustrate different psychological constructs.
Background
In general, the process of assembling evidence of MI involves a number of hierarchical (nested) tests, which we briefly describe for continuous variables. Typically, the first test is a test of same form or configural invariance. In other words, the test of configural invariance assumes that the number of latent variables and the patterns of factor loadings and intercepts remain the same across populations. The second test, of metric invariance, assumes equality of factor loadings across the groups. Support for metric invariance alone does not permit comparison of scale score means across cultures. To compare scale scores between cultures meaningfully, the more restrictive level of invariance, scalar invariance, should be met. This third test further imposes equality on the intercepts or thresholds. 2 Scalar invariance is difficult to achieve in ILSAs as such assessments typically involve large numbers of groups that are often heterogeneous with respect to language, geography, and other cultural aspects.
Although establishing MI in the case of categorical observed variables is well recognized (e.g., Millsap, 2011; Muthén & Asparouhov, 2002; Muthén & Christoffersson, 1981), it has not typically been applied in settings with a large number of groups and large sample sizes. For tests of equality, an adjusted chi-square difference test is used, with degrees of freedom equal to the number of additional constraints. In addition, based on established recommendations in the literature (see “Method” section), changes in the comparative fit index (ΔCFI; Bentler, 1990) and changes in the root mean square error of approximation (ΔRMSEA; Steiger & Lind, 1980) are commonly used to further support testing for MI in slopes and thresholds.
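The statement that the degrees of freedom equal the number of additional constraints can be illustrated with a simple count for the equal-slopes test. The sketch below is an illustration, not the authors' procedure: it assumes a simple-structure model in which one loading per factor is fixed for identification in every group, and the function name `slope_invariance_df` is hypothetical. Under that assumption, the count agrees with the df reported in the table notes for Study 1 (36 and 76 for six items; 54 and 114 for eight items).

```python
def slope_invariance_df(n_items: int, n_factors: int, n_groups: int) -> int:
    """Count equality constraints added when slopes are held equal across groups.

    Assumption (not stated in the text): one loading per factor is fixed for
    identification, leaving n_items - n_factors free slopes in each group.
    """
    free_slopes_per_group = n_items - n_factors
    # Each group beyond the first contributes one constraint per free slope.
    return free_slopes_per_group * (n_groups - 1)


# Two-factor scales with six or eight items, compared across 10 or 20 groups
for n_items in (6, 8):
    for n_groups in (10, 20):
        print(n_items, n_groups, slope_invariance_df(n_items, 2, n_groups))
```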
As a means of demonstrating issues that arise when categorical MG-CFA models are applied in a large-group, multidimensional setting, we began with empirical examples. The first example was based on data from TALIS 2013, which surveyed educators across a number of different areas related to teachers and their environment. Specifically, within each participating country, TALIS surveys 200 schools and 20 teachers within each school for each level of education (primary, lower secondary, and upper secondary). Principals of the sampled schools are also surveyed. For illustration in this study, we selected two scales, principal satisfaction and teacher cooperation, as examples of multidimensional scales used in TALIS to serve as the basis for our Study 1 (more details about the scales and data follow).
Study 2 was included as an extension of Study 1 where we expanded on the current design to align with other types of psychological data. We used data collected from the revised Conformity to Masculine Norms Inventory (Parent & Moradi, 2009) as the basis for building a reasonable design in Study 2. The revised version of this questionnaire consists of 46 questions with varied numbers of items associated with each dimension, including winning (six items), emotional control (six items), risk taking (five items), power over women (four items), violence (six items), playboy (four items), self-reliance (five items), primacy of work (four items), and heterosexual self-presentation (six items; Hsu & Iwamoto, 2014). Each item is constructed as a 4-point Likert-type statement where options range from strongly disagree to strongly agree. An example of an item associated with the “winning” dimension states, “In general, I will do anything to win.” This questionnaire has been shown to be stable in terms of its nine-factor structure in the literature (e.g., Hsu & Iwamoto, 2014; Parent & Moradi, 2009).
The remainder of the article is organized as follows. First, we report on the empirical example of MG-CFA for the international assessment data, including the evaluation criteria used to establish MI across principal job satisfaction and teacher cooperation scales and the results associated with these two scales. Next, we describe the method and study design for the two simulation studies. We then report results based on the two simulation studies, separately. Finally, we discuss our findings in light of the existing literature, and we provide recommendations to researchers who engage in establishing MI in cross-cultural contexts.
Empirical Example
The TALIS 2013 principal job satisfaction scale contained seven items, with four items related to Subscale 1 (i.e., satisfaction with current work environment) and three items related to Subscale 2 (i.e., satisfaction with profession). An example of an item related to satisfaction with profession is as follows: “If I could decide again, I would still choose this job/position.” All items on these scales were measured on a 4-point scale ranging from 1 (strongly disagree) to 4 (strongly agree). The teacher cooperation scale contained eight items, with four items associated with Subscale 1 (i.e., exchange and coordination for teaching) and the remaining four items associated with Subscale 2 (i.e., professional collaboration). All items on these scales ask teachers how often they engage in certain activities and were measured on a 6-point scale, with response categories ranging from 1 (never) to 6 (once a week or more). An example of an item related to the teacher cooperation scale is as follows: “Exchange teaching materials with colleagues.” For exact wording on the remaining items, the reader is directed to Tables 10.23 (p. 186) and 10.68 (p. 248) in OECD (2014b).
Evaluation Criteria
To examine the performance of MG-CFA fit measures in this context, we modified operational procedures (OECD, 2014b) to allow for categorical indicators and followed a standard process of hierarchical, nested tests of invariance, as described previously. Overall model fit was considered good if the chi-square test was not statistically significant, and the RMSEAs were not larger than .050. Although we report CFI and Tucker–Lewis index (TLI), these indices are not recommended for evaluating overall model fit in settings with large numbers of groups and large within-group sample sizes (Rutkowski & Svetina, 2017), as neither measure could detect even gross departures from invariance. Next, we tested the relative fit of nested models, first for equal slopes (i.e., we assumed that slopes for items were equal across the studied groups) followed by a further restriction of equal thresholds (i.e., we assumed that slopes and thresholds associated with particular items were equal across the groups). To examine the plausibility of invariant slopes and, subsequently, invariant thresholds, we used the chi-square difference test, ΔCFI, and ΔRMSEA. Given that this study follows previous work by Chen (2007; for large sample sizes), Cheung and Rensvold (2002), and Rutkowski and Svetina (2014, 2017), and is in line with operational OECD procedures, we considered the following criteria as evidence of reasonable MI: (a) for slopes, changes in CFI greater than or equal to −.004 and changes in RMSEA less than or equal to .050, and (b) for slopes and thresholds, changes in CFI greater than or equal to −.004 and changes in RMSEA less than or equal to .010. While we use the above-stated criteria found in the literature, we acknowledge that the behavior of these indices is unknown in the multidimensional context of the current study, as the recommendations to evaluate measurement equivalence have largely been studied within a unidimensional context only.
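The relative-fit decision rule can be written as a short check. The following is a minimal sketch rather than operational code: the function name `supports_invariance` and the example ΔCFI and ΔRMSEA values are hypothetical, and only the equal-slopes cutoffs (−.004 and .050) are hard-coded from the criteria above. The same function can be called with the stricter cutoffs for the slopes-and-thresholds step.

```python
def supports_invariance(delta_cfi: float, delta_rmsea: float,
                        cfi_cutoff: float, rmsea_cutoff: float) -> bool:
    """Return True when both change-in-fit indices fall within their cutoffs."""
    return delta_cfi >= cfi_cutoff and delta_rmsea <= rmsea_cutoff


# Cutoffs for the equal-slopes step, as stated in the evaluation criteria
SLOPE_CFI_CUTOFF = -0.004
SLOPE_RMSEA_CUTOFF = 0.050

# Hypothetical change-in-fit values, for illustration only
print(supports_invariance(-0.002, 0.010, SLOPE_CFI_CUTOFF, SLOPE_RMSEA_CUTOFF))  # True
print(supports_invariance(-0.009, 0.065, SLOPE_CFI_CUTOFF, SLOPE_RMSEA_CUTOFF))  # False
```

Requiring both indices to pass is a design choice made for this sketch; the text reports ΔCFI and ΔRMSEA separately, and the two can disagree.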
Empirical Example Results
The results of fitting the categorical MG-CFA models to the 31 countries’ data from TALIS are summarized below (complete tabular results can be located in the supplementary material S_T1). 3 According to the overall fit of the models fit to the teacher cooperation and principal job satisfaction scales, we found mixed evidence for even configural invariance: chi-square tests of fit were statistically significant, and RMSEAs were well above the recommended .05 threshold for both scales (TLI and CFI measures ranged between .902 and .967). In addition, for both scales, the overall model fit deteriorated as indicated by the chi-square and RMSEA measures as further restrictions were placed on the model parameters (TLI and CFI values decreased, suggesting poorer model fit, in particular for the teacher cooperation scale). In sum, we found reasonable evidence that these scales did not meet the preliminary hypothesis of the same factor structure in all countries analyzed. As expected, further restrictions manifested as increased evidence of model-data misfit.
For completeness, we also briefly discuss the incremental fit findings. On both scales, the chi-square difference tests were statistically significant for tests of metric and threshold invariance. We arrived at the same findings based on ΔCFI; however, according to the ΔRMSEA, the findings varied across scale. For the teacher cooperation scale, we found evidence that a model of equal slopes fit no worse, while further restrictions on the thresholds yielded meaningful deterioration in model fit. In contrast, the ΔRMSEA suggested that increased restrictions on the principal satisfaction scale produced no further misfit when either slopes or slopes and thresholds were constrained to equality. From both of these results, we found reasonable evidence that even the least restrictive models did not fit the data for either scale in all countries analyzed. This provided some motivation and guidance for the simulation described in Simulation Study 1.
Simulation Study 1: Method
Our simulation study design choices were motivated largely by what can be observed in practice with typical international assessments such as those described in our empirical example. In what follows, we offer a rationale for the choices made in the simulation study design, including manipulated factors (and respective levels) of the simulation study and the process of the selection/modification of item and person parameters (see Data Generation for Study 1). Finally, we provide the plan of analysis to guide our results of the simulation study.
Manipulated Factors for Study 1
Manipulated factors included number of groups (10 or 20), length of scale (six or eight items), nature of noninvariance (slopes, thresholds, or both), percent of groups affected by noninvariance (40% or 60%), and number of noninvariant items (one or two per factor).
Number of groups
Our choice of 10 and 20 groups was based on the desire to adhere reasonably to operational contexts (OECD, 2010, 2013) with large numbers of groups (e.g., more than two or three), while also keeping the scope of the study manageable. 4
Scale length
The number of items is generally low in nonachievement surveys, with only a few items representing a construct (OECD, 2010b). For example, the multidimensional scales in TALIS 2013 consisted of three to four items per subscale (OECD, 2014b). In the current study, we varied the scale length to be either three or four items per subscale, making the length of the full scale either six or eight items.
Percent of groups affected by noninvariance
We simulated conditions where either 40% or 60% of groups were modeled as being affected by noninvariance, a scenario not studied in previous research. This means that in 10-group conditions, four or six groups were modeled as having some level of noninvariance, whereas in the 20-group conditions, eight or 12 groups were modeled as containing noninvariance in slopes, thresholds, or both. Noninvariance was modeled such that one half of the affected groups had higher values of the parameters than the baseline conditions, whereas the other half had lower values (see more details in the Data Generation for Study 1 below).
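The allocation of affected groups is simple arithmetic and can be sketched as follows. The helper name `split_affected_groups` is hypothetical; the sketch assumes only what the paragraph states, namely that the affected groups are split evenly into a higher-parameter half and a lower-parameter half.

```python
def split_affected_groups(n_groups: int, pct_affected: int) -> tuple[int, int, int]:
    """Return (affected, higher, lower) group counts for one condition."""
    affected = n_groups * pct_affected // 100
    higher = affected // 2          # groups with parameters above baseline
    lower = affected - higher       # groups with parameters below baseline
    return affected, higher, lower


for n_groups in (10, 20):
    for pct in (40, 60):
        print(n_groups, pct, split_affected_groups(n_groups, pct))
```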
Source of noninvariance
Noninvariance was simulated in slopes only, thresholds only, or in both slopes and thresholds. For all items, we assumed that the number of thresholds was four, implying five response options.
Number of noninvariant items
In the six-item conditions, one item per subscale was simulated with cross-group differences in the model parameters. In the eight-item conditions, one or two items per subscale were modeled similarly with model parameter differences.
Sample sizes in the simulation study varied from 750 to 6,000 per group and were assigned randomly to each group to avoid confounding sample size with either the latent variable means (variances) or the nature of invariance. These sample sizes represent empirical data structures in the international studies of interest, including TALIS, PISA, and TIMSS, where group sample sizes range from hundreds to thousands.
Data Generation for Study 1
To obtain population parameters to use for our simulation, we selected one scale from our empirical example (i.e., the Principal Job Satisfaction scale, which included two subscales: satisfaction with current work environment and satisfaction with profession), to which we fit two-dimensional ordered-categorical MG-CFA models separately for each of 30 participating countries and subnational entities. 5 We combined the empirical results by calculating the mean and variances of each of the parameter values across countries and subscales to provide values from which to draw for our simulation. This allows us to connect our simulation study to what is observed in empirical settings.
As noted in Table 1, simulation parameters were assumed to be normally distributed, except for the distance between thresholds and latent variable variance parameters, which we assumed to be uniformly distributed. 6 To simulate our model parameters for two-dimensional multiple-group models, we took random draws from each of these parameter distributions for the appropriate number of groups, slopes, thresholds, residuals, and latent variable distributions. To simulate thresholds, we used the low threshold as the first threshold for an item and then took a random draw from the distance between thresholds, which we added to create the next thresholds. Beyond the simulation parameters in Table 1, the latent variables were modeled as being correlated at a .303 level, which was slightly higher than correlations reported in the technical report (OECD, 2014b) but was lower than the empirical approximations we obtained by fitting categorical models to the data. To generate items noninvariant in slopes and/or thresholds, one standard deviation was added to or subtracted from the baseline values. For example, in invariant (baseline) conditions, the slope for Item 1 was 1.334. To simulate noninvariance in this item’s slope (which is related to Factor 1), we added or subtracted a value of one standard deviation (.177). This means that the noninvariant value for Item 1’s slope equaled 1.511 for groups with higher noninvariance, and 1.157 for groups with lower noninvariance (see Table 1, under Factor 1, SD for slopes). Similarly, when we simulated noninvariance in slopes of items associated with Factor 2, we added/subtracted .675 from the baseline values. Threshold noninvariance was simulated in the same way, using corresponding values of standard deviations for thresholds of items associated with Factors 1 and 2 (.209 and .217, respectively).
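The two generation steps described above, shifting a baseline parameter by one standard deviation and building ordered thresholds from a low threshold plus uniform distances, can be sketched in a few lines. This is an illustration under stated assumptions, not the Mplus setup used in the study: the function names are hypothetical, a fresh uniform draw is assumed for each successive distance, and the low threshold and uniform bounds in the example are made-up values rather than the Table 1 parameters. Only the Item 1 slope (1.334) and its standard deviation (.177) come from the text.

```python
import random


def noninvariant_values(baseline: float, sd: float) -> tuple[float, float]:
    """Shift a baseline parameter one SD up and one SD down."""
    return round(baseline + sd, 3), round(baseline - sd, 3)


def build_thresholds(low: float, n_thresholds: int,
                     min_gap: float, max_gap: float,
                     rng: random.Random) -> list[float]:
    """Start at the low threshold and add uniform distances to order the rest.

    Assumption: each distance is a fresh draw from Uniform(min_gap, max_gap).
    """
    thresholds = [low]
    for _ in range(n_thresholds - 1):
        thresholds.append(thresholds[-1] + rng.uniform(min_gap, max_gap))
    return thresholds


# Item 1 slope from the text: baseline 1.334, SD .177
print(noninvariant_values(1.334, 0.177))  # (1.511, 1.157)

# Hypothetical low threshold and distance bounds, for illustration only
rng = random.Random(2013)
example_thresholds = build_thresholds(low=-1.5, n_thresholds=4,
                                      min_gap=0.5, max_gap=1.0, rng=rng)
print(example_thresholds)
```

Because every distance is positive, the thresholds are strictly increasing by construction, which an ordered-categorical model requires.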
Table 1. Parameters for Simulation.
The fully crossed design in Study 1 yielded 40 conditions, with 14 conditions related to six-item scales (Two group sizes [10 or 20] × Three sources of noninvariance [slopes, thresholds, or both] × Two affected groups percentages [40% or 60%] plus two baseline conditions) and 26 conditions related to eight-item scales (Two group sizes × Three sources of noninvariance × Two affected groups percentages × Two numbers of noninvariant items [one or two per subscale] plus two baseline conditions).
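The condition counts can be checked by enumerating the design. The sketch below is a bookkeeping aid with hypothetical variable names; it crosses the factor levels listed above and appends the two fully invariant baseline conditions for each scale length.

```python
from itertools import product

group_sizes = (10, 20)
sources = ("slopes", "thresholds", "both")
pct_affected = (40, 60)
n_noninvariant_items = (1, 2)  # varied only in the eight-item conditions

# Six-item scales: groups x source x percent affected, plus two baselines
six_item = list(product((6,), group_sizes, sources, pct_affected))
six_item += [(6, g, "none", 0) for g in group_sizes]

# Eight-item scales: additionally crossed with number of noninvariant items
eight_item = list(product((8,), group_sizes, sources, pct_affected,
                          n_noninvariant_items))
eight_item += [(8, g, "none", 0, 0) for g in group_sizes]

print(len(six_item), len(eight_item), len(six_item) + len(eight_item))  # 14 26 40
```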
Analysis for Study 1
Data were simulated using Mplus 7.2 (Muthén & Muthén, 1998), assuming a two-dimensional, ordered-categorical model with parameters as described above. We followed Millsap and Yun-Tein (2004) for model identification and to set the scale of the latent variable. In contrast to currently used operational procedures, we then analyzed the data assuming that the observed variables were ordered categorical, which recognized the actual distribution of the observed variables and avoided documented problems that stem from fitting a normal factor model to nonnormal observed variables (Lubke & Muthén, 2004). We used evaluation criteria in the simulation study as described above. Each condition was replicated 500 times, and for all results, we reported the average fit statistics and indices across the 500 replications for each condition and invariance test.
Study 1: Results
Prior to focusing on results for relative model fit in Study 1, we provide a brief summary of unexpected findings for the overall results. Complete overall tabular and graphical results can be found in the online supplemental material (S_T2-S_T6 and S_F1-S_F4).
Across simulated conditions, we encountered several instances of nonconvergent models. Initially, in 15 conditions (all involving eight-item scales with either slopes or slopes-and-thresholds noninvariance), more than 5% of replications failed to converge. The main problem with nonconvergence appeared to be related to the absence of any responses in the last category for at least one simulated group for Variable 5. To correct for this problem, as per operational practices (OECD, 2014b), we collapsed the adjacent categories across all countries in the affected conditions. Upon reanalyzing the data, all of the affected conditions had a convergence rate of 95% or more, and as such, the results were summarized as averages across the admissible replications within each condition.
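The check-and-collapse step can be sketched for a single item as follows. This is an illustration only: the function names, the 1-to-5 category coding, and the toy responses are assumptions, and the sketch merges the top category into the one below it, which is one way to collapse adjacent categories across all groups.

```python
from collections import Counter


def has_empty_category(groups: dict[str, list[int]], categories: range) -> bool:
    """Return True if any group has no responses in some category."""
    for responses in groups.values():
        counts = Counter(responses)
        if any(counts[c] == 0 for c in categories):
            return True
    return False


def collapse_top_category(groups: dict[str, list[int]], top: int) -> dict[str, list[int]]:
    """Merge the top category into the one below it in every group."""
    return {name: [min(r, top - 1) for r in responses]
            for name, responses in groups.items()}


# Toy responses to one item; group B never uses the last category (5)
item = {"A": [1, 2, 3, 4, 5], "B": [1, 2, 3, 4, 4]}
print(has_empty_category(item, range(1, 6)))       # True

collapsed = collapse_top_category(item, top=5)
print(has_empty_category(collapsed, range(1, 5)))  # False
```

Collapsing in every group, not only the affected one, keeps the number of thresholds identical across groups so that equality constraints remain well defined.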
Overall Fit Results
Six-item conditions
We note only one unexpected finding regarding the chi-square test: In the 20-group conditions with fully invariant data, the average value of the test statistic was statistically significant, suggesting a mismatch between the model and data where there is none (see Table S_T2 in the supplementary material, columns 4 and 5 with subheading Same form invariance). There was no apparent relationship between the average chi-square value and the proportion of groups that exhibited noninvariance (e.g., 40% compared with 60% for a given condition).
The RMSEA indicated good model-data fit for both fully invariant conditions and for all conditions with noninvariant thresholds, and it was above the .05 cutoff for all conditions involving noninvariant slopes (for complete overall results, consult supplementary material Figures S_F1 through S_F3). That the average RMSEA values were all less than .05 in the four conditions in which thresholds were varied offers some evidence of poor performance for this measure. It is also worth noting that for several conditions correctly identified as not meeting the assumption of equal slopes and thresholds, the average RMSEA values were less than .08. In particular, for both the 10- and 20-group/40% of groups affected conditions with slope noninvariance and slopes-and-thresholds noninvariance, the average values ranged between .064 and .071. CFI and TLI performed quite poorly across all conditions, suggesting an acceptable model fit regardless of the level or nature of the noninvariance.
Taken together, overall results for the six-item conditions provide evidence that the RMSEA, CFI, and TLI are not particularly effective at identifying misfitting models. Furthermore, the chi-square test is overly sensitive in many conditions, finding ill-fit where none actually exists. This is particularly true under an assumption of same form and equal slopes.
Eight-item conditions
We note that chi-square tests of same form were significant in conditions involving noninvariance in slopes, regardless of the number of variant items, percent of affected groups, or the number of groups (see columns 5 and 6 in Table S_T3, under subheading Same form invariance). These results are opposed to what we would expect in this context. Similar performance was noted when further constraints were imposed.
When testing for slope invariance, the RMSEA suggested good model fit across most conditions, regardless of the source or degree of noninvariance (RMSEAs were .002 and .002, respectively; see supplementary files under Figures S_F1 through S_F3). Across conditions, the RMSEAs were larger than .05 in only three conditions; two of the three were in the 20-group/60% of groups affected by noninvariance conditions (noninvariance in slopes of one item and in slopes and thresholds of one item), while the other was where one slope differed in the 10-group/60% of groups condition. CFI and TLI were all high across the studied conditions (ranging from .990 to 1.000), suggesting good model fit regardless of the source or degree of noninvariance.
Testing for slopes-and-thresholds invariance produced RMSEA values that were the lowest among the studied conditions and ranged from .019 to .027. In conditions with slopes-and-thresholds noninvariance, RMSEAs ranged from .038 to .050. Results for CFI and TLI were consistent with the conclusions based on RMSEAs; between the two, the TLI yielded higher values across all noninvariant conditions, suggesting even better model fit than the CFI. We note, however, that across all studied conditions, based on the .05 cutoff for good model fit for RMSEAs and the .95 cutoff for the CFI and TLI, the fit indices suggested that all models fit well, regardless of the level or nature of noninvariance introduced.
Relative Fit Results
Six-item conditions
The results in Table 2 give some insight into the performance of the chi-square difference test in this context (unexpected results in the table were marked as bold and italicized). In particular, for both tests of equal slopes and equal slopes and thresholds, the chi-square difference test is exceptionally good at identifying untenable equality constraints. To that end, data simulated to have equal slopes were identified as such by the chi-square difference test, including the fully invariant and threshold-only noninvariant conditions, with one exception: both conditions with 60% of groups having a noninvariant threshold were identified as violating the assumption of equal slopes. This statistic correctly identified incrementally poorer fitting models (e.g., equal slopes models fit to data with noninvariant slopes). In addition, for both relative tests of equal slopes and equal slopes and thresholds, the chi-square difference test statistics were larger in the 20-group conditions. The difference statistics were also consistently larger for conditions where 60% of groups had noninvariant item parameters.
Table 2. Relative Fit Results for Six-Item Conditions.
Note. Under Level heading, 0 and 1 correspond to number of noninvariant items; N = none; T = thresholds; S = slopes; S&T = slopes-and-thresholds. Nj = number of groups; df for 10 and 20 groups χ2 difference test of slope invariance were 36 and 76, respectively; df for slopes-and-thresholds invariance were 144 and 304, respectively. Results in bold italic are outside of expected ranges of values, suggesting that cutoffs used to evaluate relative model fit did not suggest poor model fit when indeed noninvariance was modeled in either slopes and/or thresholds. Underscored results indicate that although a bad fitting model was overlooked at current criteria, there is little consequence as this would have been identified in the previous step, prompting a conclusion of noninvariant models. (+) indicates standard deviations <.001. (.) indicates standard deviations equal to zero. RMSEA = root mean square error of approximation; CFI = comparative fit index.
Under an assumption of equal slopes and using the criterion ΔRMSEA ≤ .05, all well-fitting models were retained, including the equal slopes models fit to the fully invariant and threshold-only noninvariant data. Under these conditions, the ΔRMSEA values were all considerably smaller than .05 and ranged from −.001 to .004. Furthermore, all misspecified models were identified as such, including all conditions simulated to have slope noninvariance. In these conditions, ΔRMSEA exceeded .05, ranging from .065 to .096. Under an assumption of equal slopes and thresholds and using the criterion ΔRMSEA ≤ .01, data simulated to have threshold-only noninvariance were identified as such by the ΔRMSEA; all other conditions produced negative values on this measure. 7
Regarding ΔCFI performance, we found that under an assumption of equal slopes, conditions with fully invariant data and threshold-only noninvariant data produced average values of .000, correctly indicating good model-data consistency. Furthermore, all models fit to slope noninvariant data produced ΔCFI values greater in magnitude than −.004, leading to correctly rejecting the plausibility of equal slopes in these conditions. In addition, under an assumption of equal slopes and thresholds, the models fit to fully invariant data produced average values of .000, again providing evidence of commensurate models. Importantly, however, equal slopes-and-thresholds models fit to data with simulated threshold noninvariance produced ΔCFI values greater than −.004, providing incorrect evidence that these models fit the data well. In all of the remaining conditions, ΔCFI values were outside of the accepted cutoff. Taken together, these findings suggest that the ΔRMSEA and the chi-square difference test are generally reliable at identifying well-fitting models; however, some problems with currently (and most previously) recommended cutoffs for the ΔCFI exist. Furthermore, the chi-square difference test is overly sensitive in some conditions.
Eight-item conditions
Table 3 shows the results for relative model fit across all conditions for the eight-item cases. In both scenarios, the chi-square difference tests were nonsignificant only for the fully invariant conditions. As expected, the chi-square difference tests were the largest in conditions where slope noninvariance was modeled and where larger numbers of groups were considered. Furthermore, differences in chi-square statistics were larger for those conditions where 60% of groups were affected by noninvariance in comparison with their 40% counterparts.
Relative Fit Results for Eight-Item Conditions.
Note. Under Level heading, 0, 1, and 2 correspond to number of noninvariant items; N = none; T = thresholds; S = slopes; S&T = slopes-and-thresholds. Nj = number of groups; df for 10 and 20 groups χ2 difference test of slope invariance were 54 and 114, respectively. df for slope and thresholds invariance were 198 and 418, respectively. Results in bold italic are outside of expected ranges of values, suggesting that cutoffs used to evaluate relative model fit did not suggest poor model fit when indeed noninvariance was modeled in either slopes and/or thresholds. Underscored results indicate that although a bad fitting model was overlooked at current criteria, there is little consequence as this would have been identified in the previous step, prompting a conclusion of noninvariant models. (+) indicates standard deviations <.001. (.) indicates standard deviations equal to zero. RMSEA = root mean square error of approximation; CFI = comparative fit index.
The relative fit indices results, as evaluated by ΔRMSEA and ΔCFI, suggested varying degrees of performance when testing for equality of slopes and equality of slopes and thresholds. Using the above-specified criteria of acceptable model fit, we noted that ΔRMSEA did not perform well across most of the studied conditions in identifying misfitting models (unexpected results in the table were marked as bold and italicized). In only two conditions did the ΔRMSEA correctly suggest model misfit, with reported values of .051 and .052, respectively. In the remaining conditions, ΔRMSEA for those simulated as misfitting passed the cutoff for reasonable model fit, with values ranging from .000 to .047. In other words, for these conditions, results supported (statistical) equality of slopes, when indeed there was noninvariance simulated in either slopes or slopes and thresholds. The performance of ΔCFI yielded somewhat better results for testing invariance of slopes. Across most of the studied conditions, ΔCFI values ranged from −.007 to .000, some of which were larger than the suggested −.004 cutoff. The ΔCFI incorrectly supported slope equality in five conditions, including in conditions that simulated noninvariance for 40% of groups with two noninvariant items per dimension for slopes or slopes-and-thresholds as well as a condition in which 40% of the 10 groups were affected by noninvariance with one noninvariant slope per dimension. For testing slope equality, it seems that the previously suggested cutoff values may be appropriate for the ΔCFI to some extent but not necessarily for the ΔRMSEA.
Somewhat opposite results were found for testing the assumption of slopes-and-thresholds invariance. Focusing on conditions where threshold noninvariance was simulated, all ΔRMSEA values were larger than the .010 cutoff value (ranging from .011 to .014), except for one condition in which 40% of the 20 groups were affected by noninvariance in thresholds for one item per dimension (ΔRMSEA = .009). The ΔCFI, in contrast, was largely unsuccessful in identifying poor model fit across the same conditions. For testing the assumption of slopes-and-thresholds invariance, previously suggested cutoff values seem appropriate for the ΔRMSEA but not necessarily for the ΔCFI. We elaborate on these findings more in our discussion.
Simulation Study 2: Method
Our design choices for the second study were motivated largely by what can be observed in practice with typical psychological assessments, such as the inventory for masculine norms. Although Study 2 can be seen as an extension of Study 1, it differs in several design features. The main differences in design include a larger number of factors, imbalanced influence across factors for noninvariant items, sample sizes, and overall scale length. The available empirical data on masculine norms (N = 237) were too few to support tests of MI across groups and allowed only a straightforward single-group nine-factor CFA model; however, as with Study 1, we used empirical analysis to guide our selection of generating parameters in simulation.
Design and Data Generation for Study 2
Two of the three manipulated factors in Study 2 mimicked those found in Study 1, namely, the nature of noninvariance (slopes, thresholds, or both) and percent of groups affected by noninvariance (40% or 60%). An additional manipulated factor, number of factors affected by one noninvariant item, included two levels (all five factors or three factors only) and was included to investigate situations that account for an uneven impact of problematic items.
In Study 2, we simulated five factors and slightly longer scales (five items per factor) to investigate MI. We simulated items that had four categories, and we examined equivalence across 10 groups only. Sample sizes across these groups were smaller than in Study 1, and ranged from 443 to 973, with a mean (SD) of 694 (155).
As mentioned above, we based our selection of simulation parameters on empirical results from a nine-factor ordered-categorical CFA model fit to masculinity norms data. Given that these analyses were based on one group only, our approach of selecting simulation parameters differed slightly from that used in Study 1; however, the goals for both empirical analyses were the same—to provide some sensible starting values for data generation.
In Study 2, we pooled parameter estimates across the nine factors and randomly drew values with replacement for loadings, thresholds, residual variances, and correlations among the latent factors. Specifically, generating parameter values for baseline loadings ranged from .729 to 1.474, threshold values ranged from −1.128 to 1.753, residual variances ranged from .108 to .544, and correlations among dimensions ranged from −.410 to .372. Latent variances were drawn for each group and factor from a uniform distribution (U[0.90, 2.70]) and ranged from .901 to 2.603, while latent means were sampled from a uniform distribution (U[−0.20, 0.20]) and ranged from −.197 to .197. As was the case in Study 1, to generate items with noninvariant slopes and/or thresholds, one standard deviation was added to (or subtracted from) the baseline values.
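The sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' generating code: the function names are hypothetical, the pooled loading values are placeholders that merely span the reported range, and the standard deviation used for the noninvariance shift is assumed to be that of the pooled estimates, which this section does not specify.

```python
import random
import statistics


def draw_group_parameters(pooled_loadings, n_groups=10, n_factors=5,
                          items_per_factor=5, seed=0):
    """Draw baseline loadings plus group-by-factor latent variances and means."""
    rng = random.Random(seed)
    n_items = n_factors * items_per_factor
    # Baseline loadings: sampled with replacement from the pooled estimates.
    baseline = rng.choices(pooled_loadings, k=n_items)
    # Latent variances ~ U[0.90, 2.70] and means ~ U[-0.20, 0.20], per group and factor.
    variances = [[rng.uniform(0.90, 2.70) for _ in range(n_factors)]
                 for _ in range(n_groups)]
    means = [[rng.uniform(-0.20, 0.20) for _ in range(n_factors)]
             for _ in range(n_groups)]
    return baseline, variances, means


def make_noninvariant(baseline, item_index, pooled_values, sign=1):
    """Shift one item's parameter by one SD (ASSUMED: SD of the pooled values)."""
    shifted = list(baseline)
    shifted[item_index] += sign * statistics.stdev(pooled_values)
    return shifted


# Placeholder pool: the endpoints match the reported range, the middle values are invented.
pool = [0.729, 0.95, 1.10, 1.474]
baseline, variances, means = draw_group_parameters(pool)
shifted = make_noninvariant(baseline, item_index=0, pooled_values=pool, sign=-1)
```

The same pattern would apply to thresholds, residual variances, and factor correlations, each drawn from its own pooled set of estimates.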
For each simulated condition, the population model was assumed to be a five-factor model. The fully crossed design yielded 13 conditions (three sources of noninvariance [slopes, thresholds, or both] × two affected group percentages [40% or 60%] × two balances of noninvariant items [five or three factors], plus one baseline condition). As with Study 1, each condition was replicated 500 times. For all results, we report the average fit statistics and indices across the replications for each condition and invariance test.
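The condition count follows directly from crossing the three manipulated factors and adding the baseline. A short sketch of that enumeration is given below; the variable names and the dictionary layout are illustrative only.

```python
from itertools import product

sources = ["slopes", "thresholds", "slopes_and_thresholds"]
pct_groups_affected = [40, 60]
factors_affected = [5, 3]

# One fully invariant baseline condition, followed by the 3 x 2 x 2 crossed conditions.
conditions = [{"source": "none", "pct_groups": 0, "factors_affected": 0}]
conditions += [
    {"source": s, "pct_groups": p, "factors_affected": f}
    for s, p, f in product(sources, pct_groups_affected, factors_affected)
]

n_groups = 10
n_replications = 500
# Number of groups carrying noninvariance at each percentage level.
groups_affected = {p: n_groups * p // 100 for p in pct_groups_affected}

print(len(conditions))   # 13
print(groups_affected)   # {40: 4, 60: 6}
```

With 10 groups, the 40% and 60% levels correspond to 4 and 6 affected groups, respectively.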
Study 2: Results
Overall Fit Results
In the interest of space, we focus on unexpected results, with respect to fit indices, and only note that the chi-square test performed as expected in identifying misspecified models and retaining correctly specified models (complete tabulated results are provided in supplementary material S_T4, while overall fit indices results are graphically included in S_F4).
With respect to the RMSEA results, we noted two interesting patterns. First, when only three factors were affected by noninvariant items, the RMSEAs were lower than when all five factors contained problematic items. Second, RMSEAs were lower in conditions with 40% of affected groups than in conditions with 60% of the groups being affected by noninvariance. Although the RMSEA values were larger in conditions where slopes and slopes-and-thresholds were modeled as noninvariant (which we would expect), suggesting poorer model fit, they were all still below the .05 cutoff value, suggesting (incorrectly) good model fit.
Across all studied conditions testing for same form, slopes, and slopes-and-thresholds invariance, CFI and TLI values were very high, essentially approaching 1 (ranging from .997 to 1.00 for both indices), suggesting that the CFI and the TLI correctly supported the hypothesis of same form, yet they failed to identify the poor-fitting models across all remaining conditions.
Relative Fit Results
The relative fit results for the five-factor conditions are presented in Table 4. Panels (a) and (b) represent results based on the five and three factors affected by noninvariance, respectively. Although values of the chi-square difference test and ΔRMSEA were somewhat higher and other fit indices were slightly lower when only three of the five factors included problematic items, overall conclusions about performance were largely the same. As noted in Panel (a) of Table 4, in testing for relative fit for equality of slopes and in testing for equality of slopes and thresholds, the chi-square difference tests were nonsignificant only for the fully invariant condition. The chi-square difference test values were the largest in conditions where slopes-and-thresholds noninvariance was modeled and where 60% of groups were affected by the noninvariance.
Relative Fit Results for Five-Factor Conditions.
Note. Under Level heading, a is a baseline condition; 0 (none) noninvariance; S&T = slopes-and-thresholds; Nj = number of groups; df for 10 groups χ2 difference test of slope and slope-and-thresholds invariance were 90 and 225, respectively. Results in bold italic are outside of expected ranges of values, suggesting that cutoffs used to evaluate relative model fit did not suggest poor model fit when indeed noninvariance was modeled in either slopes and/or thresholds. Underscored results indicate that although a bad fitting model was overlooked at current criteria, there is little consequence as this would have been identified in the previous step, prompting a conclusion of noninvariant models. (+) indicates standard deviations <.001. (.) indicates standard deviations equal to zero. RMSEA = root mean square error of approximation; CFI = comparative fit index.
When testing the assumption of slopes-and-thresholds invariance, somewhat opposite results were found. Focusing on five-factor conditions where threshold noninvariance was simulated, all ΔRMSEA values were between .019 and .023. In comparable conditions for three factors with noninvariant items, these values were slightly lower at .016 and .019, respectively. Conditions with simulated slopes and/or slopes-and-thresholds noninvariance yielded ΔRMSEA values of less than .010 across all conditions except for one. This exception was found in the condition where all five factors contained noninvariant items and 60% of groups were affected in both slopes and thresholds. In this case, the value of ΔRMSEA was right at the cutoff point of .010. Based on the results from the previous step (when testing the assumption of slopes invariance), we assume that those conditions would not support adequate model fit in testing slope invariance, which would then deem testing for slopes-and-thresholds invariance irrelevant. The ΔCFI was also unsuccessful in identifying poor model fit across the same conditions and ranged from −.003 to −.001, except for the baseline (no noninvariance) condition, where it was .000.
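The reasoning that a failed slope-invariance step makes the slopes-and-thresholds step irrelevant amounts to a stop rule in a hierarchical sequence of tests. A minimal sketch of that rule follows; the function name and the pass/fail inputs are hypothetical, and how each step is judged (chi-square difference test or a relative fit index) is left to the analyst.

```python
def hierarchical_test(step_results):
    """Walk invariance steps from least to most constrained.

    step_results: ordered list of (step_name, passed) pairs.
    Returns (highest supported step, first failed step); later steps are
    not interpreted once a step fails.
    """
    supported = None
    for name, passed in step_results:
        if not passed:
            return supported, name
        supported = name
    return supported, None


# Illustrative sequence: slope invariance fails, so the final step is ignored
# even though its relative fit index would have passed the cutoff.
outcome = hierarchical_test([
    ("same form", True),
    ("slopes", False),
    ("slopes-and-thresholds", True),
])
print(outcome)  # ('same form', 'slopes')
```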
Discussion
A critical precursor to comparing means on latent variables across cultures is that the measures are invariant across all compared populations. Evidence for invariance typically relies on hierarchical tests conducted in a multiple-groups CFA context with usual fit measures including overall and chi-square difference tests as well as overall and incremental changes in fit indices. In general, these measures have been validated in settings with few groups (usually two) and smaller sample sizes (e.g., Cheung & Rensvold, 2002; French & Finch, 2006). However, in international comparative surveys such as TALIS and TIMSS, the number of groups is considerably larger than two and the within-groups sample sizes are larger than in supporting studies. In addition, a recent, small body of research has shown that in these settings, typically used measures are not well suited when evaluated against traditional cutoffs (Rutkowski & Svetina, 2014). Our studies extended the conversation by considering the performance of several fit measures when data were simulated and modeled as multidimensional, ordered-categorical indicators, an area that has not been explored extensively to our knowledge. We summarize the main conclusions based on the results of the two simulation studies next. However, as we argue below, based on the current as well as previous studies, measures typically used in establishing MI do not work well, warranting development of improved measures.
In Study 1, across all considered measures, findings were very mixed. For example, in both the six- and eight-item conditions, the overall and chi-square difference tests tended to be overly sensitive, particularly in the 20-group conditions. That is, the chi-square-based tests identified poor-fitting models, overall and hierarchically, where they did not exist. Similarly, the RMSEA (both overall and as an incremental or relative index) did not perform reliably across all or most studied conditions. Finally, both the CFI and TLI performed exceptionally poorly overall, identifying no ill-fitting models in any studied condition. The incremental CFI was also found to perform poorly, as it too failed to identify several models that were not commensurate with the data. These findings are largely in contrast to recent research that found reasonable performance of these measures, particularly when some adjustments to typical criteria are made (Rutkowski & Svetina, 2014, 2017). One finding, however, was consistent across these studies: the CFI and TLI are poor measures of overall fit. In other words, both fit indices did a poor job of identifying misfit in overall model fit. In Study 2, the chi-square statistics performed well in terms of identifying the misfitting models, but the fit indices were too liberal (i.e., accepted models that should have been rejected given current recommendations). In examining relative model fit, results were similar to those in Study 1, suggesting mixed outcomes in terms of the performance of both chi-square difference tests and relative fit indices. Given that the current results are somewhat different from previous findings under similar conditions, we advise very cautious use of any of the studied measures in multidimensional contexts. We recognize that modified recommendations are more appropriate for multidimensional contexts than the current standards, which have been largely constrained to single constructs. However, as with previous research, modified recommendations are limited to the current studied conditions. We further elaborate on the limitations next.
The settings we considered were necessarily limited, and other contexts (e.g., more groups) might produce different results. We also only considered typical operational settings where the data are ordered categorical and where, at least originally, items had an equal number of categories. In addition, association among the factors was constant, and an equal number of items was associated with the respective factors. In addition to addressing these limitations of considered settings, future work may also examine the performance of fit indices when equal sample sizes are present across groups or whether an alternative approach incorporating some sort of “weights” to balance contributions of each group (as is often done in analysis of data from large-scale assessments) would yield better results. Despite these limitations, the current work provides further insight into the performance of measures typically used to evaluate MI. In addition, we offer revised guidelines for typically used measures, including the overall chi-square test, the RMSEA, the CFI, and the TLI. Similarly, we recommend some changes to the ΔRMSEA and the ΔCFI.
In addition to our recommendations and limitations, we wish to elaborate further on several important points. First, we recognize that with any recommendation or cutoff value, there is some level of subjectivity when it comes to model fit evaluation. In addition, adjustment of the fit indices or changes to cutoff scores without some theoretical basis may undermine the importance of MI practice and research. Furthermore, we acknowledge that missing values in some categories present a real issue in applied research. Currently in practice, when missing observations are found in some categories, the typical procedure is to collapse the affected, adjacent categories. It is, however, unclear what the impact of this practice is, and thus, we believe that further research into understanding the impacts of collapsing categories is worthwhile, particularly in the multiple-groups setting. Finally, the current study intended to assemble a body of evidence in favor of or against the measures and cutoff values in operational settings when many groups are evaluated with respect to MI and, in particular, when multidimensional constructs are examined.
Our study helps to illustrate that current recommendations, with (or without) slight cutoff adjustment values, are not suitable in cross-cultural MI research. Specifically, our findings align with previous literature and show that traditional measures do not work well in the context of cross-cultural research where large numbers of groups or cultures are considered and/or where multidimensional constructs are studied. In other words, across the existing literature (e.g., Chen, 2007; Cheung & Rensvold, 2002; Rutkowski & Svetina, 2014, 2017) we found that the farther we get from the original contexts in which measures were developed and tested (relatively few groups, moderately small sample sizes, single constructs), the worse the measures perform. Continuing to adjust currently used fit indices to various designs does not seem to be a feasible approach moving forward when establishing measurement equivalence in cross-cultural research. Thus, we believe that our findings show a clear need for development of new measures that are designed to account for data characteristics found in the complex cross-cultural analyses that are growing in popularity. In the meantime, we recommend that cross-cultural researchers who deal with large numbers of groups and with large sample sizes consider examining cultural equivalence with subsets of data. For example, researchers could choose subsets of countries that are more homogeneous in terms of linguistic, geographical, or cultural factors. As a final point, we emphasize the importance of considering previous work in a given field as well as the theory guiding the development and refinement of any measurement model rather than relying exclusively on rule-of-thumb guidelines.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplementary Material
Supplementary material is available for this article online.
