Measurement Invariance for Latent Constructs in Multiple Populations

Abstract

Popular measurement invariance testing procedures for latent constructs evaluated by multiple indicators in distinct populations are revisited and discussed. A frequently used test of factor loading invariance is shown to possess serious limitations that in general preclude it from accomplishing its goal of ascertaining this invariance. A process of mean intercept invariance evaluation is subsequently examined, and it is indicated that within this framework there is no statistical test available for group identity in them. Rather than pursuing these popular and widely used invariance testing procedures, it is recommended that empirical studies on constructs in multiple populations be concerned in general with alternative measurement invariance examination and ensuring the inclusion of their invariance conditions in models aimed at investigating group differences and similarities in latent means, variances, and interrelationships. The discussion is illustrated using data from a cognitive intervention study.

Keywords

factor loading, latent variable modeling, mean intercept, measurement invariance, model identification

Behavioral, social, and educational scientists are often involved in developing and using instruments aimed at measuring latent constructs. Frequently, these instruments consist of multiple components or indicators, and are designed to evaluate the constructs in two or more distinct populations. A question that naturally arises in such circumstances is whether an instrument of concern—such as a psychometric scale, test, inventory, subscale, testlet, self-report, survey, questionnaire and so on—measures similarly the same latent dimension(s) in all populations. This condition has been oftentimes referred to as measurement invariance (MI). Over the past three decades, MI has attracted a great deal of interest among methodological and substantive researchers concerned with studying constructs in multiple populations. As a consequence, a substantial body of literature documenting these activities and resulting methodological advances on studying MI and related issues has been developed (e.g., Meredith, 1993; Millsap, 2005, 2011; Millsap & Olivera-Aguilar, in press).

Even a cursory perusal of part of this literature will suggest that latent variable modeling (LVM) provides an essential tool for evaluating group differences on constructs of research interest (e.g., Muthén, 2002). According to a widely held view, in order to be in a position to study similarities and differences on the same constructs in all populations of interest (at times also referred to as “groups” below), factor loading and mean intercept invariance is necessary, if the same number of factors is tapped into by the same indicators in all groups. The latter condition is frequently called configural invariance (CI; Meredith & Horn, 2001) and will play an important role later in the article.

After initial latent structure examination within each group and ascertaining CI, LVM has been nearly routinely used for evaluating group equality in loadings and intercepts in empirical educational and behavioral research. This LVM method has been available for a number of years and enjoys widespread popularity in these and related disciplines. Characteristic features of the approach are (a) the use of nested multiple-group models for testing factor loading invariance and (b) when the latter constraint is judged plausible, fitting subsequently an overall multiple-group model that assumes both loading and intercept identity across groups (see, e.g., Cheung & Rensvold, 2002).

This widely used modeling procedure, for convenience referred to as the traditional or conventional procedure for studying MI in the remainder of the article, has in general important limitations, however, which to date do not seem to have received sufficient attention. In particular, the procedure does not include a formal test of invariance in the mean intercepts, whereas their group identity is frequently of special substantive relevance due to important features of the construct measurement process being reflected in these location parameters (e.g., Reise, Widaman, & Pugh, 1993). Specifically, in a typical application of this popular procedure, after loading invariance is deemed plausible the group equality constraint of intercepts is added to that of factor loadings, whereas the factor means are freed in all but one of the groups. Tenability of the resulting model is oftentimes interpreted as supportive of mean intercept invariance, yet this need not be considered a test of invariance of the intercepts in their own right because the model contains additional constraints that typically affect its overall fit indexes as well. Moreover, this conventional approach is usually based on unitary factor loading or variance constraints in all groups for model identification. However, not all these restrictions are essential for identification, whereas they have important consequences and may in fact contribute to misspecification of initial and subsequent models used within the conventional approach.

The present article reconsiders the traditional LVM procedure for studying MI and aims to contribute to the critical literature on MI examination using LVM (e.g., Millsap, 2001, 2005, 2011; Millsap & Cham, in press; Woods, 2009; Yoon & Millsap, 2007; see also Steiger, 2002). The remainder of the article is concerned with important limitations of popular applications of this procedure, which in general preclude it from providing strict statistical tests of group invariance in factor loadings as well as in mean intercepts. For these reasons, it is argued that instead of being concerned in general with testing these parameters’ invariance using the traditional procedure, it would be recommendable to use alternative approaches to studying MI and in affirmative cases refocus one’s efforts on ensuring that these restrictions are in fact implemented when examining latent group similarities and differences in means, interindividual differences, and interrelationship indexes using LVM. The discussion is illustrated with data from a cognitive intervention study of older adults.

Background, Notation, and Assumptions

This article assumes that a set of observed (approximately) continuous measures is given, denoted $\underline{y}'$ = (y₁, y₂, …, y_k), which may represent subscale scores, individual measures in a test-battery, testlets in a test, or in general items of a social, behavioral, or educational measuring instrument (k > 1; although an extension to the case of categorical items in $\underline{y}$ can be accomplished along similar lines, it is not pursued in this article.) We also assume that $\underline{y}$ fulfills the CI condition with regard to G distinct populations under consideration, from which large samples are made available for study (G > 1; in this article, underlining denotes vector and prime denotes transposition). That is, in each of the G groups the same number of q factors (constructs) are evaluated by the same measures in $\underline{y}$ (1 ≤ q < k). In other words, the following confirmatory factor analysis (CFA) model is postulated to relate constructs to their indicators in the gth population (cf. Millsap, 2011):

$${\underline{y}}_{g} = {\underline{α}}_{g} + Λ_{g} {\underline{f}}_{g} + {\underline{e}}_{g}, \tag{1}$$

where ${\underline{y}}_{g}$ is the vector of k observed variables in the gth group, ${\underline{f}}_{g}$ is the vector of q factors in it, $Λ_{g}$ is the k × q matrix of factor loadings in that population, ${\underline{α}}_{g}$ is the k × 1 vector of mean intercepts, and ${\underline{e}}_{g}$ is the k × 1 vector of error (residual) terms in the gth group, with zero means and assumed uncorrelated with ${\underline{f}}_{g}$ (g = 1, …, G; see also Meredith & Horn, 2001). To avoid trivial cases, we also assume that (a) each factor in $\underline{f}$ is measured by at least two indicators (i.e., each column of $Λ_{g}$ has at least two nonzero elements) and (b) the overall, G-group model (1) is identified through appropriate restrictions (see below; g = 1, …, G).

From Equation (1), it follows that the implied covariance matrix for the measures $\underline{y}$ in each population, which is denoted $Σ({\underline{θ}}_{g})$, is structured as follows (cf. Millsap, 2011):

$$Σ({\underline{θ}}_{g}) = Λ_{g} Φ_{g} Λ'_{g} + Ψ_{g}, \tag{2}$$

where ${\underline{θ}}_{g}$ designates the vector of model parameters while $Φ_{g}$ and $Ψ_{g}$ are, respectively, the q × q and k × k covariance matrices of the factors and residual terms that need not be diagonal (assuming identification of the overall model; g = 1, …, G). Similarly, the implied mean vector in each population, denoted $μ({\underline{θ}}_{g})$, is structured as

$$μ({\underline{θ}}_{g}) = {\underline{α}}_{g} + Λ_{g} {\underline{υ}}_{g}, \tag{3}$$

where ${\underline{υ}}_{g}$ symbolizes the q × 1 vector of factor means (g = 1, …, G; see below for model identification). The intercept parameters ${\underline{α}}_{g}$ in Equations (1) and (3) represent location parameters reflecting the origins of scales of measurement for the pertinent indicators; at the same time, the factor loadings in $Λ_{g}$ are scale parameters reflecting the units underlying these scales (g = 1, …, G). With this in mind, the following two identities of ${\underline{α}}_{g}$ and $Λ_{g}$ across groups:

$$Λ_{1} = Λ_{2} = \dots = Λ_{G}, \tag{4}$$

referred to as factorial structure invariance (or factorial invariance for short), and

$${\underline{α}}_{1} = {\underline{α}}_{2} = \dots = {\underline{α}}_{G}, \tag{5}$$

called intercept invariance below, give two necessary conditions for being able to study population differences on the factors $\underline{f}$ using the measures in $\underline{y}$ . Using the G-group model (1), population comparison of latent means, variances and interrelationship indexes, which is frequently of special substantive interest, need not be strictly justifiable unless Equations (4) and (5) are imposed (cf. Byrne, Shavelson, & Muthén, 1989).

When both series of restrictions (4) and (5) hold across the G populations in question, one can argue that the same units and origins of measurement are used for all groups in the process of evaluating the factors in $\underline{f}$ . If the same factors $\underline{f}$ are measured in all populations, then, as discussed in the literature, comparison across groups of the means on any element of $\underline{f}$ could answer the question of whether there are group differences on that latent dimension. In addition, group comparison of the corresponding elements of the factor covariance matrix could answer the query about group similarity in the individual differences on the factors in $\underline{f}$ , as well as in their interrelationship indexes. Therefore, implementing the factorial and intercept invariance conditions (4) and (5) for a set of factors evaluated by multiple indicators in two or more populations can be seen as setting a solid ground for pursuing in a meaningful way answers to these and related questions involving the factors $\underline{f}$ and characteristics of their distributions (e.g., Meredith, 1993; see also Lubke, Dolan, Kelderman, & Mellenbergh, 2003; Millsap, 2011).
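As a small numerical illustration of how Equations (2) and (3) map model parameters onto observable moments, the following Python sketch computes the implied covariance matrix and mean vector for a single group. All parameter values (three indicators, one factor, a diagonal residual covariance matrix) are hypothetical and chosen only so that the arithmetic is easy to follow; the helper names `matmul` and `implied_moments` are ours.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def implied_moments(alpha, lam, phi, psi, upsilon):
    """Return (mu, sigma) for one group, following Equations (3) and (2)."""
    lam_t = [list(col) for col in zip(*lam)]          # transpose of Lambda
    common = matmul(matmul(lam, phi), lam_t)          # Lambda Phi Lambda'
    k = len(alpha)
    sigma = [[common[i][j] + psi[i][j] for j in range(k)] for i in range(k)]
    mu = [alpha[i] + sum(lam[i][j] * upsilon[j] for j in range(len(upsilon)))
          for i in range(k)]
    return mu, sigma


# Hypothetical values (not taken from the article): k = 3 indicators, q = 1 factor.
alpha = [1.0, 2.0, 3.0]                  # mean intercepts
lam = [[1.0], [0.5], [2.0]]              # k x q factor loading matrix
phi = [[4.0]]                            # q x q factor covariance matrix
psi = [[1.0, 0.0, 0.0],                  # k x k residual covariance matrix
       [0.0, 1.0, 0.0],                  # (assumed diagonal here)
       [0.0, 0.0, 1.0]]
upsilon = [0.5]                          # factor mean

mu, sigma = implied_moments(alpha, lam, phi, psi, upsilon)
print(mu)     # implied mean vector
print(sigma)  # implied covariance matrix
```

In a multiple-group analysis, the same computation would be repeated with group-specific parameter values, and restrictions (4) and (5) amount to passing the same `lam` and `alpha` for every group.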

Traditional LVM Procedure for Examining Measurement Invariance and Its Limitations

As elaborated in detail in the literature (e.g., Cheung & Rensvold, 2002, and references therein), the conventional LVM procedure for studying MI, and specifically conditions (4) and (5), consists of fitting a series of appropriate versions of the multigroup model (1) to the data from the G populations under consideration. To this end, after initial factor structure examination in each group and ascertaining CI, the G-group model (1) is fitted first without any group parameter equality and subsequently with restriction (4) or its implementable version (see below). For simplicity of reference, we call the former Model 1 (unconstrained model) and the latter Model 2 (constrained model) in the rest of this article. Under normality of $\underline{y}$ , the well-known chi-square difference test (e.g., Bollen, 1989) is then used to test factorial structure invariance.¹ Under lack of normality for $\underline{y}$ , which does not result from piling of cases at an end of an item or instrument scale, highly discrete items or nonnegligible clustering effect, a corrected chi-square difference test is used for this purpose instead after using robust maximum likelihood; thereby, both Model 1 and Model 2 are fitted using the raw data (e.g., Muthén & Muthén, 2010). A nonsignificant (corrected) chi-square difference test for Models 1 and 2 is usually interpreted in the literature as supportive of the hypothesis of factorial invariance, that is, plausibility of restriction (4). Conversely, a significant (corrected) chi-square difference test is considered indicative of the lack of factorial invariance (4). The latter finding typically leads to a declaration that the set of measures in $\underline{y}$ do not exhibit MI. 
This result may be followed by analyses for locating MI violation(s); we note that such further analyses will not be of concern in this article (for details, see Byrne et al., 1989; Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000; see also Woods, 2009; Yoon & Millsap, 2007).

If restriction (4) is found to be plausible (i.e., if the fit of Model 2 is not significantly worse than that of Model 1), as a next step within the traditional LVM procedure for studying MI one usually examines the intercept invariance (5). To this end, in Model 2 this intercept group equality is introduced, and the resulting restricted model fitted to the mean structure of the data (raw data). We refer to this model version with (4) and (5) as Model 3. Frequently, lack of fit of Model 3 is interpreted in empirical research as indicating a violation of (5) and suggestive of lack of MI for the studied set of measures $\underline{y}$ . Subsequent analyses for locating the violation of this assumption can also be undertaken then (cf. Byrne et al., 1989; see above). Conversely, tenability of Model 3 is often interpreted in the empirical research literature as supportive of mean intercept invariance for the observed variables $\underline{y}$ . Various comparisons of factor means, variances, covariances, and correlations are then frequently carried out across the multiple populations involved.

Although this conventional LVM procedure for MI evaluation has been widely used in the behavioral, social, and educational disciplines over the past two decades or so, its routine applications are associated with important limitations that do not seem to have received sufficient attention in the literature. These limitations are discussed in turn next.

Lack of a Statistical Test for Intercept Invariance

One of the limitations of the above traditional LVM method for studying MI is the lack of a formal test of group identity in mean intercepts. Specifically, when examining intercept invariance with Model 3, the pertinent restriction (5) is introduced while at the same time the means of the factors in $\underline{f}$ are freed in all but one of the G groups (e.g., Cheung & Rensvold, 1999, 2002; Vandenberg & Lance, 2000).² By virtue of introducing these q(G − 1) parameters, however, Model 3 is not nested in Model 2, because in the latter only the series of constraints (4) is imposed (while all factor means are fixed at zero; see also Note 2). That is, relative to Model 2, in Model 3 one (a) restricts additional parameters (viz., those involved in Equation 5) and simultaneously (b) frees other parameters that are fixed at zero in Model 2, namely, the q(G − 1) factor means. (This argument holds regardless of whether or not restriction (4) is imposed in these two models; that is, Model 3 is not nested in Model 2 irrespective of whether one includes restriction (4) in them.)

As a result of this lack of nesting between Models 3 and 2, no formal statistical test of the intercept invariance (5) is possible in general by comparing their fit statistics within the traditional procedure. Similarly, although one could compare their information criteria-based fit indexes—such as Akaike information criterion (AIC), Bayesian information criterion (BIC), and related measures—that comparison would still not represent a statistical test of the relevant constraint (5). Alternatively, if instead of comparing the fit statistics of Models 2 and 3, a researcher were to examine only those of Model 3, he or she would still not be conducting in general a formal test of the intercept invariance restriction (5). The reason is that the fit statistics of Model 3 absorb—and thus reflect—not only the deviations from perfect fit that are due to Equation (5) but also those resulting from all other constraints effective in Model 3 (e.g., Bollen, 1989). To summarize the discussion in this subsection, the traditional procedure for studying MI does not include a statistical test for mean intercept invariance.

Lack of a Complete Test of Loading Invariance

Another limitation of conventional LVM applications for testing MI results from the way identification is routinely achieved in Models 1, 2, and 3 in empirical research. Specifically, a factor loading is typically set equal to 1 for the same indicator (reference variable) per factor in all groups, or each factor variance is set at 1 in all groups (or a nonredundant combination of both types of unit setting is introduced). Although these restrictions ensure identification of the underlying CFA model (1) in each of the G groups (considered separately), they have important consequences, which are discussed next. These consequences do not seem to have been clarified to date sufficiently in the literature (cf. Woods, 2009; Yoon & Millsap, 2007).

To highlight these consequences, suppose that in all groups one sets at 1 the loading of the same indicator, say y_m, of the jth factor, f_j; that is, in popular LVM notation λ_mj,g = 1 is set (g = 1, …, G). Then testing of factor loading invariance, as done traditionally and mentioned above, does not really address the question of whether the loading λ_mj itself is the same in all groups (unless of course prior, nonstatistical knowledge is available that λ_mj is indeed the same in all G groups; 1 ≤ j ≤ q, 1 ≤ m ≤ s_j, 2 ≤ s_j ≤ k, with s_j being the number of indicators of the factor f_j; see also Woods, 2009). The reason is that the group equality of the loading λ_mj is stipulated to begin with, that is, in order for this factorial invariance procedure to commence. This group parameter equality is essentially an a priori assumption that is being imposed in the multigroup model (1), whereas this loading’s equality is in fact not tested within routine applications of the traditional approach to MI testing. This is because by assuming λ_mj,g = 1 for all g, one in effect introduces in each group an artificial metric of measuring the jth factor f_j, to achieve model identification in each group (j = 1, …, q; see Appendix A for formalization of this argument).

An artificial metric would also be introduced if instead a different (nonzero) value for λ_mj were to be selected. This particular metric is instrumentally involved when one tests in these conventional LVM applications whether all remaining factor loadings (other than λ_mj) are identical across groups (cf. Steiger, 2002; Yoon & Millsap, 2007). This issue is not resolved by freeing subsequently λ_mj across groups and setting at 1 another loading in all groups, since by so doing one again introduces an artificial (yet different) metric (cf. Woods, 2009; see also Appendix A). Specifically, a test in that metric that the rescaled version of the original loading λ_mj is the same across groups is not really responding to the pertinent part of the actual question of interest. This question is unconditional to begin with, namely, whether the loading λ_mj itself (not any rescaled version of it) is identical across groups, that is, irrespective of any artificial metric that is effectively involved traditionally in such a comparison.

Therefore, the fixing at 1 of the same loading for each factor in all groups and subsequently testing if the remaining factor loadings are the same only represents a conditional test of those remaining loadings’ invariance in the metric introduced by this unit fixing, rather than of the invariance of all s_j original loadings in model (1) for the factor f_j that was to be tested in the first instance (see also Appendix A; j = 1, …, q). Various aspects of this important limitation have received considerable attention in the recent literature on MI (e.g., Cheung & Rensvold, 1999; Little, 2000; Meade & Lautenschlager, 2004; Millsap, 2011; Rensvold & Cheung, 2001; Vandenberg & Lance, 2000; Woods, 2009; Yoon & Millsap, 2007). However, to our knowledge no complete resolution of it has been found for the general case and made widely available.
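The conditional nature of this test can be illustrated with a single-factor sketch in Python. The population values below are hypothetical, and the function names are ours; the code simply rescales each group's parameters into the metric in which the first (reference) loading equals 1 and confirms that the implied covariance matrix is unchanged by that rescaling.

```python
def implied_cov(loadings, factor_var, resid_vars):
    """Implied covariance matrix of a one-factor model (Equation 2 with q = 1)."""
    k = len(loadings)
    return [[loadings[i] * loadings[j] * factor_var
             + (resid_vars[i] if i == j else 0.0)
             for j in range(k)] for i in range(k)]


def to_marker_metric(loadings, factor_var, marker=0):
    """Rescale so that the marker loading is 1; the covariance matrix is unchanged."""
    c = loadings[marker]
    return [lam / c for lam in loadings], factor_var * c * c


resid = [1.0, 1.0, 1.0]  # hypothetical residual variances, same in both groups

# Scenario A (hypothetical): every loading in group 2 is twice that in group 1.
g1_load, g1_var = [0.5, 0.25, 0.75], 4.0
g2_load, g2_var = [1.0, 0.5, 1.5], 2.0
m1_load, m1_var = to_marker_metric(g1_load, g1_var)
m2_load, m2_var = to_marker_metric(g2_load, g2_var)
print(m1_load, m2_load)  # rescaled loadings coincide across groups

# The rescaled parameters reproduce the same covariance matrices as the originals.
same_fit_g1 = implied_cov(g1_load, g1_var, resid) == implied_cov(m1_load, m1_var, resid)
same_fit_g2 = implied_cov(g2_load, g2_var, resid) == implied_cov(m2_load, m2_var, resid)

# Scenario B (hypothetical): only the reference loading differs across groups.
h1_load, _ = to_marker_metric([0.5, 0.5, 0.5], 1.0)
h2_load, _ = to_marker_metric([1.0, 0.5, 0.5], 1.0)
print(h1_load, h2_load)  # the remaining rescaled loadings now differ
```

In Scenario A the rescaled loadings are identical although no original loading is, and in Scenario B the rescaled nonreference loadings differ although the original ones are equal. In both cases the comparison is informative about the original loadings only if the reference loading is in fact the same in the two groups.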

This lack of a complete test of loading invariance is not resolved by freeing all factor loadings in all groups but fixing instead at 1 all q-factor variances in the G groups. The reason is that this alternative constraint may in fact introduce model misspecification(s). This is because in general the degree of individual differences on the factors in $\underline{f}$ (latent variances) need not be the same across the groups, yet they are forced to be so as a consequence of this unit variance constraint. To summarize the discussion in this subsection, the traditional procedure for studying MI does not achieve complete (unconditional) testing of factorial invariance, that is, constraint (4).
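A companion sketch, again with hypothetical values and our own function name, suggests why the unit variance restriction is not innocuous either. When the latent variances truly differ across groups, expressing each group's model in the metric with unit factor variance multiplies the loadings by the group-specific latent standard deviation, so loadings that are equal in the original metric no longer appear equal.

```python
import math


def to_unit_variance_metric(loadings, factor_var):
    """Rescale a one-factor model so that the factor variance equals 1."""
    scale = math.sqrt(factor_var)
    return [lam * scale for lam in loadings]


# Hypothetical population values: identical loadings, different latent variances.
loadings = [0.5, 0.5, 0.5]
std_group1 = to_unit_variance_metric(loadings, 4.0)  # latent variance 4 in group 1
std_group2 = to_unit_variance_metric(loadings, 1.0)  # latent variance 1 in group 2

print(std_group1)
print(std_group2)
```

Under these assumed values, a model that combines unit factor variances in both groups with equal loadings across groups would be misspecified even though the original loadings are invariant.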

Alternative Model Identification

An alternative way to achieve model identification using fewer constraints has been proposed, but it still does not provide per se a statistical test of all loadings’ group invariance, that is, of restriction (4) (see Millsap, 2005, 2011; Yoon & Millsap, 2007; cf. Reise et al., 1993). This approach (a) fixes at 1 only the variances in just one of the groups, (b) constrains for equality across the G groups all loadings of all factors considered (i.e., does not fix at 1 any loading in any group), and (c) fits under normality this G-group model to the covariance structure of $\underline{y}$ , or to the mean structure of $\underline{y}$ (with zero factor means and free mean intercepts in all groups; see Note 1). Appendix B shows that the resulting restricted model—referred to as Model 4 in the remainder—is identified. (The present authors are not aware of a previous, formal demonstration of identification of Model 4, but they do not consider the developments in Appendix B a contribution of this article.) We will also use Model 4 as a basis of a suggested strategy and for illustrative purposes in a later section.

Although using no unnecessary constraints to achieve identification, beyond a single group factor variance setting and factorial invariance (i.e., restriction [4]), we find that as an overall means of data description Model 4 similarly has an important limitation. Specifically, as such a means Model 4 does not provide a possibility to test in general for this loading invariance, that is, condition (4). In particular, releasing the factorial invariance constraint in Model 4 (to obtain a model in which Model 4 is nested, say) renders that relaxed version of Model 4 unidentified, with no test statistic available then for actually testing hypothesis (4) of interest in the first instance (see also Millsap, 2011, for this model).

Source of Limitations

The source of the discussed limitations of typical applications of the traditional LVM procedure for MI examination lies in the entangling of pertinent parameters of the underlying model defined by Equation (1) (see also the following discussion). As is well known, for each factor its latent variance and one loading are confounded, as are the factor mean and the mean intercept of an indicator (see Equations 2 and 3). For this reason, these two types of parameters are not identifiable (without further restrictions). This is because, as widely appreciated, latent variables are not observed and measured. For this reason, they do not have a “natural” metric underlying them—namely, a “natural” origin and unit of measurement—unlike observed variables that have such metrics.
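The confounding just described can be written out explicitly. The following derivation is a sketch under model (1); the nonsingular q × q diagonal matrix C and the q × 1 vector d are arbitrary quantities introduced here for illustration and do not appear elsewhere in the article.

```latex
% C: any nonsingular q x q diagonal matrix; d: any q x 1 vector (both illustrative).
\begin{align*}
\underline{y}_g
  &= \underline{\alpha}_g + \Lambda_g \underline{f}_g + \underline{e}_g \\
  &= (\underline{\alpha}_g + \Lambda_g \underline{d})
     + (\Lambda_g C)\, C^{-1}(\underline{f}_g - \underline{d}) + \underline{e}_g .
\end{align*}
% Transformed parameters implied by the rescaled and shifted factors:
\begin{align*}
\Lambda_g^{*} &= \Lambda_g C, &
\Phi_g^{*} &= C^{-1} \Phi_g C^{-1}, \\
\underline{\alpha}_g^{*} &= \underline{\alpha}_g + \Lambda_g \underline{d}, &
\underline{\upsilon}_g^{*} &= C^{-1}(\underline{\upsilon}_g - \underline{d}).
\end{align*}
% Both parameter sets imply the same moments in Equations (2) and (3):
\begin{align*}
\Lambda_g^{*} \Phi_g^{*} \Lambda_g^{*\prime} + \Psi_g
  &= \Lambda_g \Phi_g \Lambda_g' + \Psi_g, \\
\underline{\alpha}_g^{*} + \Lambda_g^{*} \underline{\upsilon}_g^{*}
  &= \underline{\alpha}_g + \Lambda_g \underline{\upsilon}_g .
\end{align*}
```

Because C and d can be chosen separately in each group, the data alone cannot distinguish the original parameters from the transformed ones, which is why restrictions on loadings or variances, and on factor means or intercepts, are needed before any parameter can be estimated uniquely.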

Model identification can be achieved, as is well known, by fixing a loading or variance per factor to an essentially arbitrary constant (other than zero). Although this parameter fixing resolves the identification issue (up to a sign reversal, as treated in this article; e.g., Millsap, 2001), that fixing is not innocuous. Rather, it has certain consequences that need to be taken into account in subsequent analyses and result interpretations. More specifically, the fixing of a parameter introduces an artificial metric that, however, depends on (a) the particular constant to which the parameter is fixed (e.g., Steiger, 2002) and (b) the metric underlying the pertinent indicator when a factor loading is fixed (see Appendixes A and C). All group parameter comparisons that ensue after such identification restrictions are introduced are therefore in actual fact conditional on these restrictions and implied equality of involved parameters holding in the population to begin with. In particular, in the multiple-group setting, any identification restriction additionally implies the assumption that the parameters constrained to the same constant (e.g., 1) in different groups are in fact equal to one another across the populations under consideration (e.g., Woods, 2009; Yoon & Millsap, 2007). This equality and assumption need to be accounted for when interpreting results of subsequent constraint testing. The consequences of particular model identification resolutions in part lead to the above discussed limitations of routine applications of the traditional LVM procedure for MI examination.

We stress that model identification only represents a way of accomplishing unique estimation of the model parameters (that remain free under the particular choice of constraints to accomplish identification; cf. Millsap, 2001). That is, model identification per se does not ensure resolution of the above limitations of the traditional LVM approach to studying MI. The reason is that a particular set of parameter constraints (unit fixing and/or related) for model identification only limits the number of model parameters to an amount that can be uniquely estimated. At the same time, these constraints do this at the expense of changing the meaning and interpretation of some parameters (for a further discussion, see, e.g., Steiger, 2002; see also Appendix C). This is the “price” paid for achieving unique estimation of the model parameters within CFA models, which are otherwise entangled and not identifiable. This consequence of a particular approach to model identification is what underlies the above lack, in general, of a complete test of loading invariance in MI studies with multiple populations.

From a philosophy of science viewpoint, at the bottom of these limitations lies the fact that latent variable models are fundamentally based on unobserved variables (latent variables, dimensions, factors, constructs). The presence of these unobserved variables in the models is tantamount to that of missing information, including, in particular, missing unit of measurement and scale origin for the constructs. A parameter constraint to achieve model identification is a post factum decision, that is, after-the-fact with regard to that missing information (after the latter was “lost”). This decision is in addition made in a largely arbitrary way (unless of course prior knowledge is available that the constraint holds in the population/populations under consideration, to begin with). For this reason, such a parameter constraint cannot compensate for that missing information, in particular, for the missing unit and origin of a pertinent latent variable measurement scale.

A Proposed Refocus of Popular Measurement Invariance Examination Efforts in Empirical Social, Behavioral, and Educational Research

The discussed limitations of the traditional LVM procedure for studying MI, in particular in its routine applications in the behavioral and social disciplines, have important consequences that we find have not received sufficient emphasis in the literature. These limitations imply that in general no complete (i.e., unconditional) test of loading invariance or a test of mean intercept invariance is actually carried out with these LVM applications (unless prior nonstatistical knowledge is available that the loadings of a chosen reference variable per factor are the same in all groups, or that all factor variances are the same in all groups, in which case complete/unconditional factorial invariance testing can be carried out as traditionally done; cf. Woods, 2009). What these LVM applications achieve in the general case, is at best only a partial (i.e., conditional) examination of restrictions (4) and (5), that is, of factorial and intercept invariance, in the sense explicated in the preceding section.

For these reasons, we submit that empirical scientists in the social and behavioral disciplines would be generally better off at present if they used alternative means of MI examination, and especially if they refocused their efforts on MI-related activities involved in multipopulation investigations when using popular LVM procedures. Specifically, it is in our view essential that the conditions of cross-group factorial and intercept invariance (4) and (5) be implemented, when plausible, in a multigroup model fitted to data from such studies that aim at factor mean, variance, and interrelationship comparisons across the populations under investigation. In those cases, the following strategy for accomplishing the latter aims has much to recommend it in the general case (see also Millsap, 2011).

In its first step, one examines CI of the set of factor indicators $\underline{y}$ within each population of interest in a given study. If CI is found to hold (i.e., to be plausible) in all populations, one fits then Model 4 with the added constraint of intercept invariance (5), zero factor means in one group, and otherwise free factor means. (If CI is found not to be plausible, we submit that a researcher may well need to revisit the set of factor indicators and possibly modify it in order to enhance the likelihood of CI holding, keeping in mind potential changes in the study’s external validity and generalizability of ensuing findings; e.g., Yoon & Millsap, 2007.) Assuming this model is associated with tenable fit indexes, one examines with it group differences and similarities in factor means, variances, and correlations, depending on the specific research questions pursued in the study, as discussed in detail in the literature (e.g., Cheung & Rensvold, 2002). We illustrate this strategy next using data from a cognitive intervention study of older adults.

Examining Group Differences and Similarities in Fluid Intelligence Constructs

For the purposes of this section, we use data from a cognitive intervention study carried out by Baltes, Dittmann-Kohli, and Kliegl (1986), who were concerned with examining plasticity in fluid intelligence of elderly adults. Fluid intelligence is considered to comprise intellectual subabilities involved in solving abstract problems that one is confronted with for the first time and that do not require for their solution any particular knowledge expected to be acquired through a process of upbringing, socialization, and acculturation (e.g., Horn, 1982). For the present aims, we use data from n = 271 older adults on several intelligence tests. Two of them were specifically developed for the goals of the Baltes et al. (1986) study, the so-called ADEPT Induction and ADEPT Figural Relations tests that were designed to tap into the fluid abilities Inductive Reasoning and Figural Relations, respectively. Three of the remaining tests of concern here were (a) a well-established measure of Inductive Reasoning, namely, Thurstone’s Standard Induction test, (b) a Culture-fair test, and (c) Raven’s Advanced Progressive Matrices test. The last two measures used below were indicators of the fluid ability Perceptual Speed. The original study comprised two groups, an experimental and a control group. Subjects in the experimental group (sample size n₁ = 177) participated in a cognitive training that targeted the test relevant abilities, Inductive Reasoning and Figural Relations. The remaining elderly adults (sample size n₂ = 94) comprised a control group. Further details on the study and measures used can be found in the original publication (Baltes et al., 1986).

In the rest of this section, we will be interested in examining group differences and similarities in means and variances of the three factors evaluated by this set of seven intelligence tests. Specifically, given their nature, the ADEPT Induction and Thurstone’s Standard Induction tests can be hypothesized to be tapping into the fluid intelligence construct Inductive Reasoning, whereas the ADEPT Figural Relations, Raven’s Matrices, and Culture-fair tests can be considered measures of the construct Figural Relations (cf. Baltes et al., 1986). Similarly, the two perceptual speed tests can be treated as measures of the fluid ability Perceptual Speed. Therefore, we postulate in each of the two groups under consideration the CFA model (1) for k = 7 tests, denoted y₁ through y₇, with q = 3 factors. These constructs are the fluid abilities Inductive Reasoning, Figural Relations, and Perceptual Speed, respectively. Thereby, the first and third of these factors have two indicators each, whereas the second factor has three indicators.

Following the strategy outlined in the preceding section, for this set of seven intelligence measures we commence with fitting the described three-factor model in the control group and then in the experimental group (see Equation 1 with s₁ = s₃ = 2, s₂ = 3, k = 7, and q = 3). Because some deviations from normality were found on the observed measures in each of these groups, we use the robust maximum likelihood estimation method (e.g., Muthén & Muthén, 2010). The resulting fit indexes of these two within-group models are as follows. In the control group, this CFA model is associated with a chi-square (χ²) = 15.145, degrees of freedom (df) = 11, p = .176, and root mean square error of approximation (RMSEA) = .063, with a 90% confidence interval [0, .134]. For our purposes in this section, these findings can be interpreted as indicating that the fitted model is a reasonable approximation to the data. Similarly, in the experimental group this CFA model is associated with the fit indexes χ² = 21.863, df = 11, p = .026, and RMSEA = .075 [.025, .120]. These results can also be interpreted as indicating a model that is a reasonable approximation to the analyzed data here. We conclude that the hypothesis of CI is plausible in the two groups.
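As a rough check on the reported point estimates, the RMSEA values can be recomputed from each χ² value, its degrees of freedom, and the group sample size. The sketch below assumes the common formula RMSEA = sqrt(max(χ² − df, 0) / (df · n)); whether the software divides by n or by n − 1 is an assumption made here, and the function name `rmsea` is purely illustrative. The confidence limits are not reproduced, because they require the noncentral chi-square distribution.

```python
import math


def rmsea(chi_square: float, df: int, n: int) -> float:
    """Point estimate of RMSEA for a single-group model.

    Assumes the variant that divides by n rather than n - 1; this choice
    is an assumption of this sketch, not something stated in the article.
    """
    return math.sqrt(max(chi_square - df, 0.0) / (df * n))


# Fit statistics and sample sizes as reported in the text.
control = rmsea(chi_square=15.145, df=11, n=94)
experimental = rmsea(chi_square=21.863, df=11, n=177)

print(f"control group:      RMSEA = {control:.3f}")
print(f"experimental group: RMSEA = {experimental:.3f}")
```

Under the stated assumption, both recomputed values agree to three decimals with the RMSEA estimates quoted above.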

According to the strategy indicated in the preceding section, in the next step we fit the two-group Model 4. As mentioned earlier, both the factor loading and the intercept invariance conditions (4) and (5) are imposed in this model. No factor loading is fixed at 1; instead, the variances of the three factors are set equal to 1 in the control group only and are free in the experimental group (see Appendix B for model identification and Appendix D for source code with the popular LVM software Mplus; Muthén & Muthén, 2010; see also Millsap, 2011). Model 4 is similarly associated with fit indexes that can be considered here indicative of its plausibility: χ² = 53.254, df = 30, p = .006, and RMSEA = .076 [.041, .108] (e.g., Millsap & Cham, in press; Millsap & Olivera-Aguilar, in press). Table 1 presents the parameter estimates, standard errors, and pertinent test statistics obtained with this model.
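The degrees of freedom of Model 4 can be verified by counting sample moments and free parameters. The sketch below is a bookkeeping aid under two assumptions: the parameter tally follows the model description given here and the free and fixed entries of Table 1, and the two-group RMSEA uses a sqrt(G) multiplier with the total sample size in the denominator. The helper name `multigroup_rmsea` is hypothetical.

```python
import math

K = 7        # observed indicators per group
GROUPS = 2   # control and experimental

# Per group: k means plus k(k + 1)/2 variances and covariances.
sample_moments = GROUPS * (K + K * (K + 1) // 2)

# Free parameters of Model 4, tallied from the model description.
free_parameters = {
    "factor loadings (equal across groups)": 7,
    "intercepts (equal across groups)": 7,
    "residual variances (both groups)": 2 * 7,
    "factor covariances (both groups)": 2 * 3,
    "factor variances (experimental group only)": 3,
    "factor means (experimental group only)": 3,
}
df = sample_moments - sum(free_parameters.values())


def multigroup_rmsea(chi_square: float, df: int, n_total: int, groups: int) -> float:
    """RMSEA with a sqrt(groups) multiplier (an assumed multi-group convention)."""
    return math.sqrt(groups) * math.sqrt(max(chi_square - df, 0.0) / (df * n_total))


value = multigroup_rmsea(53.254, df, n_total=94 + 177, groups=GROUPS)
print(f"df = {df}, RMSEA = {value:.3f}")
```

The count reproduces df = 30, and under the assumed convention the recomputed RMSEA agrees to three decimals with the reported value.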

Table 1.

Parameter Estimates, Standard Errors, and p Values Associated With Model 4^a

Parameter	Estimate	SE	t-Value	p Value
CONTROL Group
IR BY
ADEPT_I	12.266	0.661	18.553	0.000
TH_ST_I	13.954	0.806	17.321	0.000
FR BY
ADEPT_FR	5.751	0.551	10.443	0.000
C_FAIR	8.345	0.799	10.446	0.000
RAVEN	2.990	0.276	10.824	0.000
PS BY
PS_1	3.737	0.521	7.175	0.000
PS_2	5.111	0.705	7.246	0.000
FR WITH
IR	0.849	0.041	20.681	0.000
PS WITH
IR	0.573	0.096	5.988	0.000
FR	0.737	0.076	9.650	0.000
Means
IR	0.000
FR	0.000
PS	0.000
Intercepts
ADEPT_I	31.137	1.305	23.869	0.000
ADEPT_FR	28.254	0.630	44.824	0.000
TH_ST_I	33.899	1.481	22.882	0.000
C_FAIR	40.582	0.938	43.250	0.000
RAVEN	7.905	0.344	22.976	0.000
PS_1	20.005	0.497	40.253	0.000
PS_2	32.698	0.642	50.946	0.000
Variances
IR	1.000
FR	1.000
PS	1.000
Residual Variances
ADEPT_I	21.655	5.148	4.207	0.000
ADEPT_FR	6.146	1.435	4.284	0.000
TH_ST_I	9.537	5.646	1.689	0.091
C_FAIR	18.665	3.666	5.091	0.000
RAVEN	5.361	0.774	6.929	0.000
PS_1	15.307	3.379	4.530	0.000
PS_2	16.443	4.623	3.557	0.000
EXPERIMENTAL Group
IR BY
ADEPT_I	12.266	0.661	18.553	0.000
TH_ST_I	13.954	0.806	17.321	0.000
FR BY
ADEPT_FR	5.751	0.551	10.443	0.000
C_FAIR	8.345	0.799	10.446	0.000
RAVEN	2.990	0.276	10.824	0.000
PS BY
PS_1	3.737	0.521	7.175	0.000
PS_2	5.111	0.705	7.246	0.000
FR WITH
IR	0.859	0.141	6.101	0.000
PS WITH
IR	0.631	0.131	4.801	0.000
FR	0.683	0.160	4.282	0.000
Means
IR	0.236	0.129	1.837	0.066
FR	0.168	0.130	1.292	0.196
PS	0.071	0.150	0.473	0.636
Intercepts
ADEPT_I	31.137	1.305	23.869	0.000
ADEPT_FR	28.254	0.630	44.824	0.000
TH_ST_I	33.899	1.481	22.882	0.000
C_FAIR	40.582	0.938	43.250	0.000
RAVEN	7.905	0.344	22.976	0.000
PS_1	20.005	0.497	40.253	0.000
PS_2	32.698	0.642	50.946	0.000
Variances
IR	0.912	0.132	6.924	0.000
FR	0.983	0.211	4.651	0.000
PS	0.984	0.280	3.513	0.000
Residual Variances
ADEPT_I	23.597	3.647	6.470	0.000
ADEPT_FR	10.703	1.360	7.871	0.000
TH_ST_I	8.583	3.638	2.359	0.018
C_FAIR	7.083	1.903	3.722	0.000
RAVEN	4.541	0.597	7.604	0.000
PS_1	8.798	1.965	4.478	0.000
PS_2	22.713	3.716	6.112	0.000

Note. ADEPT_I = ADEPT induction test; ADEPT_FR = ADEPT figural relations test; TH_ST_I = Thurstone’s standard induction test; C_FAIR = Culture fair test; RAVEN = Raven’s advanced progressive matrices test; PS_1, PS_2 = 1st and 2nd perceptual tests; SE = standard error.

^a In Mplus output format (Muthén & Muthén, 2010). See Appendix D for source code needed for model fitting.

As can be seen from Table 1, none of the factor means in the experimental group differs significantly from zero, the value at which the corresponding means are fixed in the control group. Hence, there appear to be no marked group differences in mean performance on the Inductive Reasoning, Figural Relations, and Perceptual Speed fluid abilities. Similarly, the variances section indicates no considerable group differences in the degree of interindividual differences on these three factors of concern. This can be most easily seen by working out the 95% confidence intervals for these variances in the experimental group, subtracting and adding 1.96 times their standard errors. Each of the three resulting intervals covers the constant 1, which represents the variance of each of the three factors in the control group, and hence one could conclude that there are no considerable group differences in the extent of individual differences on the studied constructs. By way of summary, these findings of at most weak group differences in factor means and variances suggest a substantial degree of similarity in the two groups at the assessment occasion used (cf. Baltes et al., 1986), with no significant training effect as far as the studied fluid abilities’ means and extent of individual differences are concerned.
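The interval arithmetic described above can be sketched as follows, using the rounded experimental-group estimates and standard errors from Table 1. The function and variable names are illustrative. Because rounded table entries are used, the recomputed p values for the factor means may differ slightly in the third decimal from those in Table 1.

```python
import math

# Experimental-group (estimate, standard error) pairs copied from Table 1.
factor_variances = {"IR": (0.912, 0.132), "FR": (0.983, 0.211), "PS": (0.984, 0.280)}
factor_means = {"IR": (0.236, 0.129), "FR": (0.168, 0.130), "PS": (0.071, 0.150)}


def wald_interval(estimate: float, se: float, z: float = 1.96) -> tuple:
    """Symmetric large-sample interval: estimate minus/plus z times se."""
    return estimate - z * se, estimate + z * se


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p value for estimate / se under a standard normal reference."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))


variance_intervals = {
    name: wald_interval(est, se) for name, (est, se) in factor_variances.items()
}
# Does each interval contain 1, the fixed control-group factor variance?
covers_one = {name: low <= 1.0 <= high for name, (low, high) in variance_intervals.items()}
mean_p_values = {name: two_sided_p(est, se) for name, (est, se) in factor_means.items()}

for name in factor_variances:
    low, high = variance_intervals[name]
    print(
        f"{name}: variance CI [{low:.3f}, {high:.3f}], "
        f"covers 1: {covers_one[name]}, mean p = {mean_p_values[name]:.3f}"
    )
```

All three variance intervals contain 1, and all three factor mean p values exceed .05, in line with the interpretation given in the text.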

Conclusion

This article revisited a popular LVM procedure for studying MI in empirical behavioral, social, and educational research. Although this approach is currently widely used in these disciplines, it was indicated in the article that at present (a) it does not include in general a complete (unconditional) test of group invariance in the factor loadings and (b) it does not provide a statistical test of mean intercept invariance. These limitations stem from the underlying multiplicative entangling of factor loading and factor variance parameters on the one hand, and from the additive confounding of factor mean and mean intercept parameters on the other. Although appropriate parameter restrictions render the underlying CFA models identified, these constraints have important consequences in their own right that are eventually the source of the limitations of this procedure.

Based on these limitations, the article argued that it is presently advisable in general for empirical social, behavioral, and educational researchers to examine MI by alternative means and, in the affirmative case, to ensure that the invariance conditions are implemented in latent variable models used to study differences and similarities in factor means, variances, and interrelationship indexes across the populations under investigation. The traditional LVM approach to MI testing can still be used for testing (a) the factorial invariance restriction (4) when prior (nonstatistical) knowledge is available that is tantamount to group identity of the factor loadings for chosen reference variables, or to group identity of factor variances in the studied populations, and (b) group identity in mean intercepts (constraint [5]) when prior knowledge is available that amounts to group identity in the intercepts of a reference indicator per factor (cf. Millsap, 2011; Millsap & Olivera-Aguilar, in press; Vandenberg & Lance, 2000; Woods, 2009; Yoon & Millsap, 2007).

Appendix A

Appendix B

Appendix C

Appendix D

Acknowledgements

We are indebted to R. E. Millsap, B. O. Muthén, L. K. Muthén, and K.-H. Yuan for valuable discussions on measurement invariance and related issues. Thanks are also due to two anonymous reviewers as well as to P. B. Baltes, F. Dittmann-Kohli, and R. Kliegl for permission to use—for illustration purposes—data from their project “Aging and Fluid Intelligence.”

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Baltes, P. B., Dittmann-Kohli, F., & Kliegl, R. (1986). Reserve capacity of the elderly in aging-sensitive tasks of fluid intelligence: Replication and extension. Psychology and Aging, 1, 172-177.

Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.

Byrne, B. M., Shavelson, R. J., & Muthén (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.

Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25, 1-27.

Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233-255.

Horn, J. L. (1982). The aging of human abilities. In B. B. Wolman (Ed.), Handbook of developmental psychology (pp. 847-870). New York, NY: McGraw-Hill.

Little, T. D. (2000). On the compatibility of constructs in cross-cultural research: A critique of Cheung and Rensvold. Journal of Cross-Cultural Psychology, 31, 213-219.

Lubke, G. H., Dolan, C. V., Kelderman, & Mellenbergh, G. J. (2003). Weak measurement invariance with respect to unmeasured variables: An implication of strict factorial invariance. British Journal of Mathematical and Statistical Psychology, 56, 231-248.

Meade, A. W., & Lautenschlager, G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11, 60-72.

Meredith (1993). Measurement invariance, factor analysis, and factorial invariance. Psychometrika, 58, 525-543.

Meredith, & Horn, J. L. (2001). The role of factorial invariance in modeling growth and change. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 201-240). Washington, DC: American Psychological Association.

Millsap, R. E. (2001). When trivial constraints are not trivial: The choice of uniqueness constraints in confirmatory factor analysis models. Structural Equation Modeling, 8, 1-17.

Millsap, R. E. (2005). Four unresolved problems in studies of factorial invariance. In Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics (pp. 153-171). New York, NY: Routledge.

Millsap, R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge.

Millsap, R. E., & Cham (in press). Investigating factorial invariance. In Laursen, T. D. Little, & N. A. Card (Eds.), Handbook of developmental research methods. New York, NY: Guilford Press.

Millsap, R. E., & Olivera-Aguilar (in press). Investigating measurement invariance using confirmatory factor analysis. In Hoyle, Kaplan, G. A. Marcoulides, & West (Eds.), Handbook of structural equation modeling. New York, NY: Guilford Press.

Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81-117.

Muthén, L. K., & Muthén, B. O. (2010). Mplus user’s guide. Los Angeles, CA: Muthén & Muthén.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.

Rensvold, R. B., & Cheung, G. W. (2001). Testing for metric invariance using structural equation models: Solving the standardization problem. In C. A. Schriesheim & L. L. Neider (Eds.), Research in Management: Vol. 1. Equivalence in management (pp. 21-50). Greenwich, CT: Information Age.

Steenkamp, J. E. M., & Baumgartner (1998). Assessing measurement invariance in cross national consumer research. Journal of Consumer Research, 25, 78-90.

Steiger, J. H. (2002). When constraints interact: A caution about reference variables, identification constraints, and scale dependencies in structural equation models. Psychological Methods, 7, 210-227.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations. Organizational Research Methods, 3, 4-70.

Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33, 42-57.

Yoon, & Millsap, R. E. (2007). Detecting violations of factorial invariance using data-based specification searches: A Monte Carlo study. Structural Equation Modeling, 14, 435-463.