Abstract
Popular measurement invariance testing procedures for latent constructs evaluated by multiple indicators in distinct populations are revisited and discussed. A frequently used test of factor loading invariance is shown to possess serious limitations that in general preclude it from accomplishing its goal of ascertaining this invariance. A process of mean intercept invariance evaluation is subsequently examined, and it is indicated that within this framework no statistical test of the group identity of these intercepts is available. Rather than pursuing these popular and widely used invariance testing procedures, it is recommended that empirical studies of constructs in multiple populations rely in general on alternative means of measurement invariance examination and ensure that the invariance conditions are included in models aimed at investigating group differences and similarities in latent means, variances, and interrelationships. The discussion is illustrated using data from a cognitive intervention study.
Keywords
Behavioral, social, and educational scientists are often involved in developing and using instruments aimed at measuring latent constructs. Frequently, these instruments consist of multiple components or indicators, and are designed to evaluate the constructs in two or more distinct populations. A question that naturally arises in such circumstances is whether an instrument of concern—such as a psychometric scale, test, inventory, subscale, testlet, self-report, survey, questionnaire and so on—measures similarly the same latent dimension(s) in all populations. This condition has been oftentimes referred to as measurement invariance (MI). Over the past three decades, MI has attracted a great deal of interest among methodological and substantive researchers concerned with studying constructs in multiple populations. As a consequence, a substantial body of literature documenting these activities and resulting methodological advances on studying MI and related issues has been developed (e.g., Meredith, 1993; Millsap, 2005, 2011; Millsap & Olivera-Aguilar, in press).
Even a cursory perusal of part of this literature will suggest that latent variable modeling (LVM) provides an essential tool for evaluating group differences on constructs of research interest (e.g., Muthén, 2002). According to a widely held view, in order to be in a position to study similarities and differences on the same constructs in all populations of interest (at times also referred to as “groups” below), factor loading and mean intercept invariance is necessary, if the same number of factors is tapped into by the same indicators in all groups. The latter condition is frequently called configural invariance (CI; Meredith & Horn, 2001) and will play an important role later in the article.
After initial latent structure examination within each group and ascertaining CI, LVM has been nearly routinely used for evaluating group equality in loadings and intercepts in empirical educational and behavioral research. This LVM method has been available for a number of years and enjoys widespread popularity in these and related disciplines. Characteristic features of the approach are (a) the use of nested multiple-group models for testing factor loading invariance and (b) when the latter constraint is judged plausible, fitting subsequently an overall multiple-group model that assumes both loading and intercept identity across groups (see, e.g., Cheung & Rensvold, 2002).
This widely used modeling procedure, for convenience referred to as the traditional or conventional procedure for studying MI in the remainder of the article, has in general important limitations, however, which to date do not seem to have received sufficient attention. In particular, the procedure does not include a formal test of invariance in the mean intercepts, whereas their group identity is frequently of special substantive relevance due to important features of the construct measurement process being reflected in these location parameters (e.g., Reise, Widaman, & Pugh, 1993). Specifically, in a typical application of this popular procedure, after loading invariance is deemed plausible the group equality constraint of intercepts is added to that of factor loadings, whereas the factor means are freed in all but one of the groups. Tenability of the resulting model is oftentimes interpreted as supportive of mean intercept invariance, yet this need not be considered a test of invariance of the intercepts in their own right because the model contains additional constraints that typically affect its overall fit indexes as well. Moreover, this conventional approach is usually based on unitary factor loading or variance constraints in all groups for model identification. However, not all these restrictions are essential for identification, whereas they have important consequences and may in fact contribute to misspecification of initial and subsequent models used within the conventional approach.
The present article reconsiders the traditional LVM procedure for studying MI and aims to contribute to the critical literature on MI examination using LVM (e.g., Millsap, 2001, 2005, 2011; Millsap & Cham, in press; Woods, 2009; Yoon & Millsap, 2007; see also Steiger, 2002). The remainder of the article is concerned with important limitations of popular applications of this procedure, which in general preclude it from providing strict statistical tests of group invariance in factor loadings as well as in mean intercepts. For these reasons, it is argued that instead of being concerned in general with testing these parameters’ invariance using the traditional procedure, it would be recommendable to use alternative approaches to studying MI and in affirmative cases refocus one’s efforts on ensuring that these restrictions are in fact implemented when examining latent group similarities and differences in means, interindividual differences, and interrelationship indexes using LVM. The discussion is illustrated with data from a cognitive intervention study of older adults.
Background, Notation, and Assumptions
This article assumes that a set of observed (approximately) continuous measures is given, denoted
where
From Equation (1), it follows that the implied covariance matrix for the measures
where
where
referred to as factorial structure invariance (or factorial invariance for short), and
called intercept invariance below, give two necessary conditions for being able to study population differences on the factors
When both series of restrictions (4) and (5) hold across the G populations in question, one can argue that the same units and origins of measurement are used for all groups in the process of evaluating the factors in
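For orientation, the relations labeled (1) through (5) in this section can be sketched in standard multigroup CFA notation as follows. This is a reconstruction under common LVM conventions, not a verbatim copy of the article's displays: the symbols $\boldsymbol{\alpha}_g$ (mean intercepts), $\boldsymbol{\Lambda}_g$ (factor loadings), $\boldsymbol{\Phi}_g$ (factor covariance matrix), $\boldsymbol{\Theta}_g$ (residual covariance matrix), and $\boldsymbol{\kappa}_g$ (factor means) are assumed names, with $g = 1, \ldots, G$ indexing the populations.

```latex
\begin{align}
\mathbf{y}_g &= \boldsymbol{\alpha}_g + \boldsymbol{\Lambda}_g \mathbf{f}_g + \mathbf{e}_g, \tag{1}\\
\boldsymbol{\Sigma}_g &= \boldsymbol{\Lambda}_g \boldsymbol{\Phi}_g \boldsymbol{\Lambda}_g' + \boldsymbol{\Theta}_g, \tag{2}\\
\boldsymbol{\mu}_g &= \boldsymbol{\alpha}_g + \boldsymbol{\Lambda}_g \boldsymbol{\kappa}_g, \tag{3}\\
\boldsymbol{\Lambda}_1 &= \boldsymbol{\Lambda}_2 = \cdots = \boldsymbol{\Lambda}_G, \tag{4}\\
\boldsymbol{\alpha}_1 &= \boldsymbol{\alpha}_2 = \cdots = \boldsymbol{\alpha}_G. \tag{5}
\end{align}
```

Read this way, (2) and (3) are the covariance and mean structures implied by the CFA model (1), (4) is factorial invariance, and (5) is intercept invariance.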
Traditional LVM Procedure for Examining Measurement Invariance and Its Limitations
As elaborated in detail in the literature (e.g., Cheung & Rensvold, 2002, and references therein), the conventional LVM procedure for studying MI, and specifically conditions (4) and (5), consists of fitting a series of appropriate versions of the multigroup model (1) to the data from the G populations under consideration. To this end, after initial factor structure examination in each group and ascertaining CI, the G-group model (1) is fitted first without any group parameter equality and subsequently with restriction (4) or its implementable version (see below). For simplicity of reference, we call the former Model 1 (unconstrained model) and the latter Model 2 (constrained model) in the rest of this article. Under normality of
If restriction (4) is found to be plausible (i.e., if the fit of Model 2 is not significantly worse than that of Model 1), as a next step within the traditional LVM procedure for studying MI one usually examines the intercept invariance (5). To this end, in Model 2 this intercept group equality is introduced, and the resulting restricted model fitted to the mean structure of the data (raw data). We refer to this model version with (4) and (5) as Model 3. Frequently, lack of fit of Model 3 is interpreted in empirical research as indicating a violation of (5) and suggestive of lack of MI for the studied set of measures
Although this conventional LVM procedure for MI evaluation has been widely used in the behavioral, social, and educational disciplines over the past two decades or so, its routine applications are associated with important limitations that do not seem to have received sufficient attention in the literature. These limitations are discussed in turn next.
Lack of a Statistical Test for Intercept Invariance
One of the limitations of the above traditional LVM method for studying MI is the lack of a formal test of group identity in mean intercepts. Specifically, when examining intercept invariance with Model 3, the pertinent restriction (5) is introduced while at the same time the means of the factors in
As a result of this lack of nesting between Models 3 and 2, no formal statistical test of the intercept invariance (5) is possible in general by comparing their fit statistics within the traditional procedure. Similarly, although one could compare their information criteria-based fit indexes—such as Akaike information criterion (AIC), Bayesian information criterion (BIC), and related measures—that comparison would still not represent a statistical test of the relevant constraint (5). Alternatively, if instead of comparing the fit statistics of Models 2 and 3, a researcher were to examine only those of Model 3, he or she would still not be conducting in general a formal test of the intercept invariance restriction (5). The reason is that the fit statistics of Model 3 absorb—and thus reflect—not only the deviations from perfect fit that are due to Equation (5) but also those resulting from all other constraints effective in Model 3 (e.g., Bollen, 1989). To summarize the discussion in this subsection, the traditional procedure for studying MI does not include a statistical test for mean intercept invariance.
Lack of a Complete Test of Loading Invariance
Another limitation of conventional LVM applications for testing MI results from the way identification is routinely achieved in Models 1, 2, and 3 in empirical research. Specifically, a factor loading is typically set equal to 1 for the same indicator (reference variable) per factor in all groups, or each factor variance is set at 1 in all groups (or a nonredundant combination of both types of unit setting is introduced). Although these restrictions ensure identification of the underlying CFA model (1) in each of the G groups (considered separately), they have important consequences, which are discussed next. These consequences do not seem to have been clarified to date sufficiently in the literature (cf. Woods, 2009; Yoon & Millsap, 2007).
To highlight these consequences, suppose that in all groups one sets at 1 the loading of the same indicator, say $y_m$, of the $j$th factor, $f_j$; that is, in popular LVM notation $\lambda_{mj,g} = 1$ is set ($g = 1, \ldots, G$). Then testing of factor loading invariance, as done traditionally and mentioned above, does not really address the question of whether the loading $\lambda_{mj}$ itself is the same in all groups (unless of course prior, nonstatistical knowledge is available that $\lambda_{mj}$ is indeed the same in all $G$ groups; $1 \le j \le q$, $1 \le m \le s_j$, $2 \le s_j \le k$, with $s_j$ being the number of indicators of the factor $f_j$; see also Woods, 2009). The reason is that the group equality of the loading $\lambda_{mj}$ is stipulated to begin with, that is, in order for this factorial invariance procedure to commence. This group parameter equality is essentially an a priori assumption that is being imposed in the multigroup model (1), whereas this loading’s equality is in fact not tested within routine applications of the traditional approach to MI testing. This is because by assuming $\lambda_{mj,g} = 1$ for all $g$, one in effect introduces in each group an artificial metric of measuring the $j$th factor $f_j$, to achieve model identification in each group ($j = 1, \ldots, q$; see Appendix A for formalization of this argument).
An artificial metric would also be introduced if instead a different (nonzero) value for $\lambda_{mj}$ were to be selected. This particular metric is instrumentally involved when one tests in these conventional LVM applications whether all remaining factor loadings (other than $\lambda_{mj}$) are identical across groups (cf. Steiger, 2002; Yoon & Millsap, 2007). This issue is not resolved by subsequently freeing $\lambda_{mj}$ across groups and setting at 1 another loading in all groups, since by so doing one again introduces an artificial (yet different) metric (cf. Woods, 2009; see also Appendix A). Specifically, a test in that metric that the rescaled version of the original loading $\lambda_{mj}$ is the same across groups does not really respond to the pertinent part of the actual question of interest. This question is unconditional to begin with, namely, whether the loading $\lambda_{mj}$ itself (not any rescaled version of it) is identical across groups, that is, irrespective of any artificial metric that is effectively involved traditionally in such a comparison.
Therefore, the fixing at 1 of the same loading for each factor in all groups and subsequently testing if the remaining factor loadings are the same only represents a conditional test of those remaining loadings’ invariance in the metric introduced by this unit fixing, rather than of the invariance of all $s_j$ original loadings in model (1) for the factor $f_j$ that was to be tested in the first instance (see also Appendix A; $j = 1, \ldots, q$). Various aspects of this important limitation have received considerable attention in the recent literature on MI (e.g., Cheung & Rensvold, 1999; Little, 2000; Meade & Lautenschlager, 2004; Millsap, 2011; Rensvold & Cheung, 2001; Vandenberg & Lance, 2000; Woods, 2009; Yoon & Millsap, 2007). However, to our knowledge no complete resolution of it has been found for the general case and made widely available.
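The conditional nature of the reference-variable test can be illustrated numerically. The sketch below is not part of the original analysis: the function names and all loading values are hypothetical, and it works at the population level for a single factor with three indicators. It re-expresses each group's loadings in the metric in which the reference loading equals 1 and shows two things: groups that differ only in the reference loading appear to differ in the remaining loadings, and groups whose loadings all differ proportionally appear fully invariant.

```python
import math


def rescale_to_reference(loadings, factor_variance, ref=0):
    """Re-express one factor's loadings in the metric where loadings[ref] == 1.

    Dividing every loading by c = loadings[ref] and multiplying the factor
    variance by c**2 leaves the factor-implied covariances unchanged.
    """
    c = loadings[ref]
    if c == 0:
        raise ValueError("reference loading must be nonzero")
    return [lam / c for lam in loadings], factor_variance * c ** 2


def factor_implied_cov(loadings, factor_variance):
    """Covariances lambda_i * lambda_j * phi implied by one factor (residuals omitted)."""
    return [[li * lj * factor_variance for lj in loadings] for li in loadings]


# Hypothetical population values, chosen for illustration only.
# Case A: the groups differ ONLY in the reference loading (first entry).
case_a_g1, case_a_var1 = rescale_to_reference([0.5, 0.8, 0.6], 1.0)
case_a_g2, case_a_var2 = rescale_to_reference([1.0, 0.8, 0.6], 1.0)

# Case B: EVERY loading differs across groups, but proportionally.
case_b_g1, case_b_var1 = rescale_to_reference([0.5, 0.4, 0.3], 1.0)
case_b_g2, case_b_var2 = rescale_to_reference([1.0, 0.8, 0.6], 1.0)

print("Case A, rescaled loadings:", case_a_g1, "vs", case_a_g2)
print("Case B, rescaled loadings:", case_b_g1, "vs", case_b_g2)
print("Case B, rescaled factor variances:", case_b_var1, "vs", case_b_var2)
```

In Case A the nonreference loadings are identical in the original metric yet differ after unit fixing; in Case B no original loading is invariant, yet the rescaled loadings coincide and the difference is absorbed by the factor variance. Both outcomes are consistent with the point that the traditional test is conditional on the reference loading being equal across groups.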
This lack of a complete test of loading invariance is not resolved by freeing all factor loadings in all groups but fixing instead at 1 all q-factor variances in the G groups. The reason is that this alternative constraint may in fact introduce model misspecification(s). This is because in general the degree of individual differences on the factors in
Alternative Model Identification
An alternative way to achieve model identification using fewer constraints has been proposed, but it still does not provide per se a statistical test of all loadings’ group invariance, that is, of restriction (4) (see Millsap, 2005, 2011; Yoon & Millsap, 2007; cf. Reise et al., 1993). This approach (a) fixes at 1 only the variances in just one of the groups, (b) constrains for equality across the G groups all loadings of all factors considered (i.e., does not fix at 1 any loading in any group), and (c) fits under normality this G-group model to the covariance structure of
Although it uses no unnecessary constraints to achieve identification, beyond a single group factor variance setting and factorial invariance (i.e., restriction [4]), we find that as an overall means of data description Model 4 similarly has an important limitation. Specifically, Model 4 does not in general provide a possibility to test for this loading invariance, that is, condition (4). In particular, releasing the factorial invariance constraint in Model 4 (to obtain a model in which it is nested, say) renders that relaxed version of Model 4 unidentified, so that no test statistic is then available for actually testing hypothesis (4), which was of interest in the first instance (see also Millsap, 2011, for this model).
Source of Limitations
The source of the discussed limitations of typical applications of the traditional LVM procedure for MI examination lies in the entangling of pertinent parameters of the underlying model defined by Equation (1) (see also the following discussion). As is well known, for each factor its latent variance and one loading are confounded, as are the factor mean and the mean intercept of an indicator (see Equations 2 and 3). For this reason, these two types of parameters are not identifiable (without further restrictions). This is because, as widely appreciated, latent variables are not observed and measured. For this reason, they do not have a “natural” metric underlying them—namely, a “natural” origin and unit of measurement—unlike observed variables that have such metrics.
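The two kinds of confounding can be made explicit with a short derivation for a single indicator of one factor. The notation is assumed for this sketch rather than taken from the article: $y = \alpha + \lambda f + e$, with factor variance $\phi$ and factor mean $\kappa$, and an arbitrary change of unit and origin $f^{*} = c f + d$ with $c \neq 0$, so that $\phi^{*} = c^{2}\phi$ and $\kappa^{*} = c\kappa + d$.

```latex
\begin{align*}
y &= \alpha + \lambda f + e
   = \Bigl(\alpha - \tfrac{\lambda d}{c}\Bigr) + \tfrac{\lambda}{c}\, f^{*} + e
   = \alpha^{*} + \lambda^{*} f^{*} + e,\\[4pt]
\lambda^{*2}\,\phi^{*} &= \tfrac{\lambda^{2}}{c^{2}}\; c^{2}\phi = \lambda^{2}\phi,\\[4pt]
\alpha^{*} + \lambda^{*}\kappa^{*} &= \alpha - \tfrac{\lambda d}{c} + \tfrac{\lambda}{c}\,(c\kappa + d) = \alpha + \lambda\kappa .
\end{align*}
```

The starred and unstarred parameter sets therefore imply the same observed variances, covariances, and means: the loading and the factor variance are entangled multiplicatively through $c$, and the intercept and the factor mean additively through $d$.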
Model identification can be achieved, as is well known, by fixing a loading or variance per factor to an essentially arbitrary constant (other than zero). Although this parameter fixing resolves the identification issue (up to a sign reversal, as treated in this article; e.g., Millsap, 2001), that fixing is not innocuous. Rather, it has certain consequences that need to be taken into account in subsequent analyses and result interpretations. More specifically, the fixing of a parameter introduces an artificial metric that depends on (a) the particular constant to which the parameter is fixed (e.g., Steiger, 2002) and (b) the metric underlying the pertinent indicator when a factor loading is fixed (see Appendixes A and C). All group parameter comparisons that ensue after such identification restrictions are introduced are therefore conditional on these restrictions, and on the implied equality of the involved parameters holding in the population to begin with. In particular, in the multiple-group setting, any identification restriction additionally implies the assumption that the parameters constrained to the same constant (e.g., 1) in different groups are in fact equal to one another across the populations under consideration (e.g., Woods, 2009; Yoon & Millsap, 2007). This equality and assumption need to be accounted for when interpreting results of subsequent constraint testing. The consequences of particular model identification resolutions in part lead to the above discussed limitations of routine applications of the traditional LVM procedure for MI examination.
We stress that model identification only represents a way of accomplishing unique estimation of the model parameters (that remain free under the particular choice of constraints to accomplish identification; cf. Millsap, 2001). That is, model identification per se does not ensure resolution of the above limitations of the traditional LVM approach to studying MI. The reason is that a particular set of parameter constraints (unit fixing and/or related) for model identification only limits the number of model parameters to one that can be uniquely estimated. At the same time, these constraints do so at the expense of changing the meaning and interpretation of some parameters (for a further discussion, see, e.g., Steiger, 2002; see also Appendix C). This is the “price” paid for achieving unique estimation of the model parameters within CFA models, which are otherwise entangled and not identifiable. This consequence of a particular approach to model identification is what underlies the above lack, in general, of a complete test of loading invariance in MI studies with multiple populations.
From a philosophy of science viewpoint, at the bottom of these limitations lies the fact that latent variable models are fundamentally based on unobserved variables (latent variables, dimensions, factors, constructs). The presence of these unobserved variables in the models is tantamount to that of missing information, including, in particular, missing unit of measurement and scale origin for the constructs. A parameter constraint to achieve model identification is a post factum decision, that is, after-the-fact with regard to that missing information (after the latter was “lost”). This decision is in addition made in a largely arbitrary way (unless of course prior knowledge is available that the constraint holds in the population/populations under consideration, to begin with). For this reason, such a parameter constraint cannot compensate for that missing information, in particular, for the missing unit and origin of a pertinent latent variable measurement scale.
A Proposed Refocus of Popular Measurement Invariance Examination Efforts in Empirical Social, Behavioral, and Educational Research
The discussed limitations of the traditional LVM procedure for studying MI, in particular in its routine applications in the behavioral and social disciplines, have important consequences that we find have not received sufficient emphasis in the literature. These limitations imply that in general no complete (i.e., unconditional) test of loading invariance or a test of mean intercept invariance is actually carried out with these LVM applications (unless prior nonstatistical knowledge is available that the loadings of a chosen reference variable per factor are the same in all groups, or that all factor variances are the same in all groups, in which case complete/unconditional factorial invariance testing can be carried out as traditionally done; cf. Woods, 2009). What these LVM applications achieve in the general case, is at best only a partial (i.e., conditional) examination of restrictions (4) and (5), that is, of factorial and intercept invariance, in the sense explicated in the preceding section.
For these reasons, we submit that empirical scientists in the social and behavioral disciplines would be generally better off at present if they used alternative means of MI examination, and especially if they refocused their efforts on MI-related activities involved in multipopulation investigations when using popular LVM procedures. Specifically, it is in our view essential that the conditions of cross-group factorial and intercept invariance, (4) and (5), be implemented, when plausible, in a multigroup model fitted to data from studies that aim at factor mean, variance, and interrelationship comparisons across the populations under investigation. In those cases, the following strategy for accomplishing these aims has much to recommend it in the general case (see also Millsap, 2011).
In its first step, one examines CI of the set of factor indicators
Examining Group Differences and Similarities in Fluid Intelligence Constructs
For the purposes of this section, we use data from a cognitive intervention study carried out by Baltes, Dittmann-Kohli, and Kliegl (1986), who were concerned with examining plasticity in fluid intelligence of elderly adults. Fluid intelligence is considered to comprise intellectual subabilities involved in solving abstract problems that one is confronted with for the first time and whose solution does not require any particular knowledge expected to be acquired through a process of upbringing, socialization, and acculturation (e.g., Horn, 1982). For the present aims, we use data from n = 271 older adults on several intelligence tests. Two of them were specifically developed for the goals of the Baltes et al. (1986) study, the so-called ADEPT Induction and ADEPT Figural Relations tests that were designed to tap into the fluid abilities Inductive Reasoning and Figural Relations, respectively. Three of the remaining tests of concern here were (a) a well-established measure of Inductive Reasoning, namely, Thurstone’s Standard Induction test, (b) a Culture-fair test, and (c) Raven’s Advanced Progressive Matrices test. The last two measures used below were indicators of the fluid ability Perceptual Speed. The original study comprised two groups, an experimental and a control group. Subjects in the experimental group (sample size n1 = 177) participated in a cognitive training that targeted the test relevant abilities, Inductive Reasoning and Figural Relations. The remaining elderly adults (sample size n2 = 94) comprised a control group. Further details on the study and measures used can be found in the original publication (Baltes et al., 1986).
In the rest of this section, we will be interested in examining group differences and similarities in means and variances of the three factors evaluated by this set of seven intelligence tests. Specifically, given their nature, the ADEPT Induction and Thurstone’s Standard Induction tests can be hypothesized to be tapping into the fluid intelligence construct Inductive Reasoning, whereas the ADEPT Figural Relations, Raven’s Matrices, and Culture-fair tests can be considered measures of the construct Figural Relations (cf. Baltes et al., 1986). Similarly, the two perceptual speed tests can be treated as measures of the fluid ability Perceptual Speed. Therefore, we postulate in each of the two groups under consideration the CFA model (1) for k = 7 tests, denoted y1 through y7, with q = 3 factors. These constructs are the fluid abilities Inductive Reasoning, Figural Relations, and Perceptual Speed, respectively. Thereby, the first and third of these factors have two indicators each, whereas the second factor has three indicators.
Following the strategy outlined in the preceding section, for this set of seven intelligence measures we commence with fitting the described three-factor model in the control group and then in the experimental group (see Equation 1 with s1 = s3 = 2, s2 = 3, k = 7, and q = 3). Since in either of these groups some deviations from normality were found on observed measures, we use thereby the robust maximum likelihood estimation method (e.g., Muthén & Muthén, 2010). The resulting fit indexes of these two within-group models are as follows. In the control group, this CFA model is associated with a chi-square (χ²) = 15.145, degrees of freedom (df) = 11, p = .176, and root mean square error of approximation (RMSEA) = .063, with a 90% confidence interval [0, .134]. For our purposes in this section, these findings can be interpreted as indicating that the fitted model is a reasonable approximation to the data. Similarly, in the experimental group this CFA model is found to be associated with the fit indexes χ² = 21.863, df = 11, p = .026, and RMSEA = .075 [.025, .120]. These results can also be interpreted as indicating a model that is a reasonable approximation to the analyzed data here. We conclude that the hypothesis of CI is plausible in the two groups.
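As a plausibility check on the reported fit indexes, the RMSEA point estimates can be recomputed from the chi-square values, degrees of freedom, and sample sizes given in the text. The sketch below assumes the common point-estimate formula sqrt(G) * sqrt(max(χ² − df, 0) / (df · N)), with N the total sample size and G the number of groups; this is an assumption about how the software computes the index, not something stated in the article, and the function name `rmsea` is hypothetical.

```python
import math


def rmsea(chi_square, df, n_total, n_groups=1):
    """RMSEA point estimate under the assumed formula
    sqrt(G) * sqrt(max(chi_square - df, 0) / (df * N)).
    """
    if df <= 0 or n_total <= 0 or n_groups <= 0:
        raise ValueError("df, n_total and n_groups must be positive")
    excess = max(chi_square - df, 0.0) / (df * n_total)
    return math.sqrt(n_groups) * math.sqrt(excess)


# Values reported in the text for the two within-group CFA models.
control_rmsea = rmsea(15.145, 11, 94)        # control group, n2 = 94
experimental_rmsea = rmsea(21.863, 11, 177)  # experimental group, n1 = 177

print(f"control: {control_rmsea:.3f}, experimental: {experimental_rmsea:.3f}")
```

Under this assumed formula both values round to the reported .063 and .075. The `n_groups` argument applies the multiple-group correction factor when a single model is fitted to several groups simultaneously.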
According to the strategy indicated in the preceding section, in the next step we fit the two-group Model 4. As mentioned earlier, both factor loading and intercept invariance conditions (4) and (5) are imposed in this model, whereas no factor loading is fixed at 1 while the variances of the three factors are set equal to 1 only in the control group (and free in the experimental group; see Appendix B for model identification and Appendix D for source code with the popular LVM software Mplus; Muthén & Muthén, 2010; see also Millsap, 2011). Model 4 is similarly associated with fit indexes that can be considered here indicative of its plausibility: χ² = 53.254, df = 30, p = .006, and RMSEA = .076 [.041, .108] (e.g., Millsap & Cham, in press; Millsap & Olivera-Aguilar, in press). Table 1 presents the parameter estimates, standard errors, and pertinent test statistics obtained with this model.
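The reported df = 30 for Model 4 can be checked by counting sample moments and free parameters. The sketch below rests on assumptions that go beyond what the text states explicitly: each indicator loads on exactly one factor, residual variances and factor covariances are free in every group, and factor means are fixed in the first (control) group and free in the others. The function names are hypothetical.

```python
def sample_moments(k, groups):
    """Number of observed means, variances, and covariances across all groups."""
    return groups * (k + k * (k + 1) // 2)


def model4_free_parameters(k, q, groups):
    """Free parameters under the assumed Model 4 specification (see lead-in)."""
    loadings = k                                   # one per indicator, equal across groups
    intercepts = k                                 # equal across groups
    residual_variances = groups * k                # free in every group
    factor_variances = (groups - 1) * q            # fixed at 1 in the first group only
    factor_covariances = groups * q * (q - 1) // 2 # free in every group
    factor_means = (groups - 1) * q                # fixed in the first group only
    return (loadings + intercepts + residual_variances
            + factor_variances + factor_covariances + factor_means)


# k = 7 tests, q = 3 factors, G = 2 groups, as in the illustration.
moments = sample_moments(7, 2)
parameters = model4_free_parameters(7, 3, 2)
model4_df = moments - parameters

print(moments, parameters, model4_df)
```

With these assumptions the count gives 70 moments and 40 free parameters, hence 30 degrees of freedom, which agrees with the value reported for Model 4.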
Table 1. Parameter Estimates, Standard Errors, and p Values Associated With Model 4 (a)
Note. ADEPT_I = ADEPT induction test; ADEPT_FR = ADEPT figural relations test; TH_ST_I = Thurstone’s standard induction test; C_FAIR = Culture fair test; RAVEN = Raven’s advanced progressive matrices test; PS_1, PS_2 = 1st and 2nd perceptual tests; SE = standard error.
(a) In Mplus output format (Muthén & Muthén, 2010). See Appendix D for source code needed for model fitting.
As can be seen from Table 1, none of the factor means in the experimental group is significant, so there appear to be no marked group differences in mean performance on the Inductive Reasoning, Figural Relations, and Perceptual Speed fluid abilities. Similarly, the variances section indicates no considerable group differences in the degree of interindividual differences on these three factors of concern. This can be most easily seen by working out the 95% confidence intervals for these variances in the experimental group, that is, by subtracting and adding 1.96 times their standard errors. Each of the three resulting intervals covers the constant 1, which represents the variance of each of the three factors in the control group, and hence one could conclude that the groups do not differ considerably in the extent of individual differences on the studied constructs. By way of summary, these findings of at most weak group differences in factor means and variances suggest a substantial degree of similarity in the two groups at the assessment occasion used (cf. Baltes et al., 1986), with no significant training effect as far as the studied fluid abilities’ means and extent of individual differences are concerned.
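The interval check described above (estimate minus and plus 1.96 standard errors, then asking whether the interval covers 1) can be sketched as follows. The function names are hypothetical, and the numerical values are placeholders chosen for illustration; they are not the estimates from Table 1.

```python
def wald_interval(estimate, standard_error, z=1.96):
    """Approximate 95% confidence interval: estimate -/+ z * standard error."""
    if standard_error < 0:
        raise ValueError("standard_error must be nonnegative")
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


def covers(interval, value):
    """True if value lies inside the closed interval."""
    lower, upper = interval
    return lower <= value <= upper


# Placeholder values for one factor variance in the experimental group
# (hypothetical, not taken from the study).
example_interval = wald_interval(1.20, 0.25)
print(example_interval, covers(example_interval, 1.0))
```

An interval that covers 1 is consistent with the experimental-group variance not differing from the control-group variance, which is fixed at 1 in Model 4. Symmetric intervals of this kind are only an approximation for variance parameters, particularly in small samples.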
Conclusion
This article revisited a popular LVM procedure for studying MI in empirical behavioral, social, and educational research. Although this approach is currently widely used in these disciplines, it was indicated in the article that at present (a) it does not include in general a complete (unconditional) test of group invariance in the factor loadings and (b) it does not provide a statistical test of mean intercept invariance. These limitations stem from the underlying multiplicative entangling of factor loading and factor variance parameters on one hand, and in the additive confounding of factor mean and mean intercept parameters on the other. Although appropriate parameter restrictions render the underlying CFA models identified, these constraints have important consequences in their own right that are eventually the source of the limitations of this procedure.
Based on these limitations, the article argued that presently it would be recommendable in general for empirical social, behavioral, and educational researchers to examine alternatively MI and in the affirmative case ensure that these invariance conditions are implemented in latent variable models aimed at aiding their research in factor mean, variances, and interrelationship indexes’ differences and similarities in multiple populations under investigation. The traditional LVM approach to MI testing can still be used for testing (a) factorial invariance restriction (4) when such prior (nonstatistical) knowledge is available that is tantamount to group identity of the factor loadings for chosen reference variables, or group identity of factor variances in the studied populations, and (b) group identity in mean intercepts (constraint [5]) when prior knowledge is available that amounts to group identity in the intercepts of a reference indicator per factor (cf. Millsap, 2011; Millsap & Olivera-Aguilar, in press; Vandenberg & Lance, 2000; Woods, 2009; Yoon & Millsap, 2007).
Footnotes
Appendix A
Appendix B
Appendix C
Appendix D
Acknowledgements
We are indebted to R. E. Millsap, B. O. Muthén, L. K. Muthén, and K.-H. Yuan for valuable discussions on measurement invariance and related issues. Thanks are also due to two anonymous reviewers as well as to P. B. Baltes, F. Dittmann-Kohli, and R. Kliegl for permission to use—for illustration purposes—data from their project “Aging and Fluid Intelligence.”
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
