Multiple Group Analysis in Multilevel Data Across Within-Level Groups: A Comparison of Multilevel Factor Mixture Modeling and Multilevel Multiple-Indicators Multiple-Causes Modeling

Abstract

The purpose of this two-part study is to evaluate methods for multiple group analysis when the comparison group is at the within level with multilevel data, using a multilevel factor mixture model (ML FMM) and a multilevel multiple-indicators multiple-causes (ML MIMIC) model. The performance of these methods was evaluated integrally by a series of procedures testing weak and strong invariance models and the latent group mean differences testing after holding for factorial invariance. Two Monte Carlo simulation studies were conducted under the following conditions: number of clusters, cluster size, and the design type in groups. A multilevel one-factor confirmatory factor analysis (CFA) model as a research model in Study 1 was investigated to compare the results under different conditions with those of previous studies. A multilevel two-factor CFA model as a research model in Study 2 was evaluated by fitting alternative models that can be applied when the model is complicated. The results indicated that the two methods were reasonable in multilevel multiple groups analysis across within-level groups. However, pros and cons were found between the two methods. In the multilevel one-factor CFA model, ML MIMIC model was slightly better when the sample size is small. In the multilevel complex model, two alternative models of ML FMM were recommended because the weak invariance testing of ML MIMIC was considerably time-consuming. Finally, it was shown that information criteria, which are criteria for determining whether factorial invariance is established, need to be applied differently according to the sample size conditions. Guidelines for this situation are provided.

Keywords

multilevel multiple group analysis factorial invariance latent means comparison multilevel factor mixture modeling multilevel mimic modeling

Introduction

Multiple group analysis is used to test whether the parameters of interest are statistically different across comparison groups. In the social and behavioral sciences, a number of studies address the methodology of multiple group analysis (e.g., Cheung & Rensvold, 1998; Kim et al., 2017; Morin et al., 2016; Vandenberg & Lance, 2000). Recently, with the popularity of multilevel data (e.g., when individuals are nested within organizations, when students are nested within classes or schools, etc.), several issues regarding multiple group analysis in multilevel structural equation modeling (MSEM) are worthy of attention. A group is distinguished from clusters and refers to the memberships to be compared.

When multiple group analysis is studied with multilevel data, comparing groups can be either at within level (e.g., male students vs. female students) or at between level (e.g., public schools vs. private schools). If the grouping variable is at the between level, all the individuals in each cluster belong only to a single group, whereas if the grouping variable is at the within level, each cluster may have several groups (Asparouhov & Muthén, 2012b; Kim & Cao, 2015; Kim et al., 2015; Ryu, 2014). For example, when comparing gender differences using data of students nested within schools, there are males and females in coeducational schools. Between-level grouping analysis is relatively easier than within-level grouping analysis because it can be solved in a standard way as in single-level multiple group analysis. The standard procedure of a multiple group approach is that a set of data is first split by groups and the same model is specified in each group (i.e., configural invariance). Then the models are simultaneously estimated with the equality constraints (i.e., factor loading and intercept invariance). It should be noted that an important assumption of the multiple group analysis is an independent assumption where individuals are randomly extracted from the population. Between-level grouping variable analysis can be regarded as satisfying this assumption. However, within-level grouping analysis raises a complex problem because within-level groups are generally not independent. As these gender groups share a common variable at the between level (e.g., school environment, teacher’s teaching quality), the data cannot support the independence assumption and also cannot be completely separated. Hence, a key issue with using multiple group analysis across within-level groups is a question of how to consider the dependency at the between level. Despite the increasing number of studies on multilevel data, the methodology of multilevel multiple group analysis has not been sufficiently studied, and group comparison studies have not been actively conducted in the social sciences such as education and psychology.

To address this problem, a few recent studies proposed some approaches and investigated the performance of each method: (1) multilevel factor mixture model (ML FMM) as a multiple group approach (Asparouhov & Muthén, 2012b) and (2) multilevel multiple-indicators multiple-causes (ML MIMIC) model as a single group approach (Kim et al., 2015; Woods & Grimm, 2011).

This study builds on recent findings that ML FMM and ML MIMIC are appropriate for within-level factorial invariance testing and latent group mean comparison (Kim & Cao, 2015; Kim et al., 2015). By expanding on the contributions of previous studies, this study differs in the following aspects. First, given that factor invariance testing is a prerequisite for multiple group analysis, integrated guidelines for factor invariance testing procedures and latent group mean difference testing should be suggested. Previous studies carry limitations in that they are evaluating the performance of methods for two types of testing (i.e., factorial invariance testing, latent group means difference testing) in separate studies. Although researchers may be interested only in the establishment of factorial invariance, generally it is accompanied by factorial invariance testing and the comparison of latent means. Moreover, it is insufficient to conduct only the latent mean comparison, without testing factorial invariance. Second, since it is a methodology aimed at comparing two groups, it is necessary to pay particular attention to whether performance is sensitive to unbalanced sample sizes between the two groups. However, previous studies investigating the performance of both methods have considered various sample sizes of a number of clusters and cluster size as a condition but have not considered whether the two groups were balanced. To fill this gap, this study compared the performances of two methods by focusing on sample size conditions including design type (balanced or unbalanced) as well as the number of clusters and cluster size. Last, for complex models with two or more factors or with three or more comparison groups, the performance of the two methods needs to be considered. The complex model in this study refers to cases where the number of factors increases (i.e., multilevel multiple-factor model) or the number of groups increases (i.e., comparisons of more than two groups). The basic models of ML FMM and ML MIMIC, which were applied in a multilevel one-factor confirmatory factor analysis (CFA) model, are not applicable to complex models. The reason is technically related to the estimation method. To estimate group-specific random effects at the between level, the estimation of ML FMM uses numerical integration and the dimensions are required for as many G groups per factor (i.e., number of dimensions = number of factors × number of groups; Asparouhov & Muthén, 2012b). Hence, when conducting multiple group analysis using ML FMM, a multilevel one-factor CFA model and a multilevel two-factor CFA model require two and four dimensions, respectively. However, analysis of a complex model is likely to lead to estimation problems. In fact, it is impossible to estimate two more dimensions in current software programs, including Mplus. Besides ML FMM, ML MIMIC also has difficulty in estimation in the case of a complex model. One of the disadvantages of the MIMIC approach is that it is not easy to implement weak invariance testing because it also requires numerical integration due to latent interaction terms (i.e., interaction between a factor and grouping variable). Hence when the model has more than one factor, it needs as many latent interaction terms as the number of factors in weak invariance testing, which requires a considerable amount of time. Therefore, to analyze a complex model, alternative methods are needed. To seek a solution, this study addresses modeling issues for a complex model in multiple group analysis and reviewed alternative models proposed by Asparouhov and Muthén (2012b). Alternative models in this study are more restricted models in which group specific random effects are set proportionally or equalized between two groups (see “Theoretical Background” section for more details). By evaluating the performances of the alternative models, this study offers suggestions in terms of to what extent the complex model can be generalized.

The purpose of this study is to provide practical guidelines for which method is appropriate, ML FMM and ML MIMIC, when multilevel multiple group analysis is conducted with a within-level grouping variable. To achieve this objective, this study explicated the factorial invariance testing procedures of two methods (ML FMM and ML MIMIC) and elucidated a model specification about a complex model (i.e., multiple-factor model or more than two groups analysis). Two simulation studies were conducted to investigate the efficacy and performances of the two methods: a multilevel one-factor CFA model in Study 1 and a multilevel two-factor CFA model as a complex model in Study 2. Study 1 confirmed that the same result was obtained from the previous study under different sample size conditions including unbalanced design types. In Study 2, alternative methods were suggested, and the performance of those was examined. In addition, this study also tried to generalize alternative models to the more complex models (e.g., three-factor model or three groups comparison). Last, this study provides a practical guideline for applied researchers.

The current study will address the following research questions:

Research Question 1: When multiple group analysis is conducted with a multilevel one-factor CFA model across within-level groups, how are the performances of ML FMIM and ML MIMIC according to data conditions (i.e., number of clusters, cluster size, and group balance)? (Study 1)

Research Question 2: When multiple group analysis is conducted with a multilevel two-factor CFA model across within-level groups, how can each method be applied? How are the performances of alternative models according to data conditions (i.e., number of clusters, cluster size, and group balance)? (Study 2)

Theoretical Background

Multiple Group Analysis

The establishment of factorial invariance allows for the interpretation of a latent construct in the same manner between groups; thus, factorial invariance testing is a prerequisite to compare observed scores or latent factor means across groups (Mellenbergh, 1989; Meredith, 1993; Meredith & Millsap, 1992; Millsap, 1997; Reise et al., 1993). Factorial invariance testing generally requires four sequential procedures: configural, weak, strong, and strict invariance. Configural invariance refers to having equivalent patterns (i.e., zero and nonzero patterns) of factor structure across groups (i.e., the same number of factors and indicators across groups). Weak invariance (also referred to as metric or factor loading invariance) requires equal factor loading between groups in addition to configural invariance. Strong invariance (also referred to as scalar or intercept invariance) holds invariant measurement of intercepts in addition to weak invariance between groups. Strict invariance indicates invariant measurement of residual variance in addition to equal factor loading and intercepts. According to the literature on factorial invariance, it is known that a strong invariance model should be established for a statistically reasonable group mean comparison (Jak et al., 2014; Widaman & Reise, 1997). That is, strict invariance testing is not considered essential when testing latent mean differences between groups. The establishments of weak and strong invariance are particularly important in factorial invariance testing. Accordingly, the establishment of configural invariance was assumed in this study, and only the performances of weak and strong invariance testing were assessed. Also, this study did not examine the performance of strict invariance testing.

Multilevel Factor Mixture Model for Known Classes

Factor mixture modeling (FMM) is a combined model of traditional CFA and latent class analysis that incorporates latent continuous variables (i.e., factors) and latent categorical variables (i.e., latent classes; Asparouhov & Muthén, 2008; Lubke & Muthén, 2005). In general, FMM is used as an analytical method to identify heterogeneous populations in the data and to derive latent classes. In a multiple group analysis framework, FMM also can be employed for factorial invariance testing and latent mean difference testing across groups in which observed groups are treated as known classes.

In ML FMM across within-level groups, the group effects are estimated as random effects at the between level. ML FMM for within-level groups is specified as (Asparouhov & Muthén, 2012b; Kim & Cao, 2015)

[y_{ij} | C_{ij} = g] = [τ_{g .} + Λ_{B .} η_{Bj .} + ε_{Bj .}] + [Λ_{wg} η_{wijg} + ε_{wijg}],

(1a)

ε_{Wijg} ~ N (0, Θ_{ε Wg}),

(1b)

ε_{Bg .} ~ N (0, Θ_{ε B .}) .

(1c)

where a subscript $g$ refers to the within-level group membership for individuals $i$ in the cluster $j$ , $y_{ij}$ is a vector of observed scores for individual $i$ in the cluster $j$ , $Λ_{wg}$ is the factor loading matrix at the within level for group $g$ , $η_{wijg}$ and $ε_{wijg}$ are the vector of factor scores and the vector of residuals at the within level for individual $i$ in the cluster $j$ and group $g$ , respectively. Also, $τ_{g}$ is a vector of intercepts at the between level for group $g$ , $Λ_{B}$ is the loading matrix at the between level, $η_{Bj}$ and $ε_{Bj}$ are the vector of factor scores and the vector of residuals at the between level for cluster $j$ , respectively. It should be noted that the intercept ( $τ_{g}$ ) of observed scores ( $y_{ij}$ ) exists at between level only because an observed score is composed of a group mean and its deviation from the group mean (Heck & Thomas, 2009). Also, the dot indicates random effects across clusters and within-level groups in the model. The dependency between within-level groups (e.g., males and females in the same school) is presented as $ρ$ in the multiple group analysis approach.

COV (e_{1}, e_{2}) = ρ .

(2)

In ML FMM, group-specific random effects are specified by the latent variable (i.e., e₁ for males and e₂ for females), and these random effects are correlated, representing the dependency. For each group, e₁ and e₂ are specified to have zero residual variance and perfectly related to the corresponding between-level factors. The variance of e₁, e₂, and the correlation (or covariance) between e₁ and e₂ (called intraclass group correlation coefficient [IGCC] in Kim and Cao’s, 2015, article) are estimated freely in the model.

Importantly, ML FMM in multiple group analysis uses numerical integration for model estimation. Hence, the estimation problem may occur when the model is complex as in the case of a CFA model with more than one factor (i.e., multidimensional constructs) or more than two groups (e.g., comparison of school levels: elementary, middle, and high school) due to an increased dimension in numerical integration. Accordingly, Asparouhov and Muthén (2012b) suggested six alternative models since a parsimonious model that is more restricted but easier to estimate may be appropriate depending on the data. Among these models, there are two alternative models that are suitable for the testing of factorial invariance and latent mean difference. Other models can easily be applied to the estimation of the latent mean differences, but it is almost impossible to test for factorial invariance (for more details, see Asparouhov & Muthén, 2012b).

The first alternative model (referred to as the ML FMM A1 model hereinafter in this study) is a slightly restricted model and nested in the ML FMM described above. While general ML FMM in multiple group analysis has separate group-specific random effects between groups, ML FMM A1 has different group-specific random effects but assumes to have a perfect correlation between these effects (i.e., the correlation = 1), indicating that the group-specific random effects have a proportional relationship. The variance of random effects is fixed to 1 and different group-specific variances across groups are represented by the freely estimated factor loadings for the random effect. Since it is expected to have highly correlated cluster effects in many educational and psychological applications, this assumption may be reasonable (Asparouhov & Muthén, 2012b; Ryu, 2014; Ryu & Mehta, 2017).

The second alternative model (referred to as the ML FMM A2 model hereinafter in this study) is a more restricted model than ML FMM A1 and also nested in the ML FMM A1. ML FMM A2 is assumed to have a common set of between levels across groups. That is, ML FMM A2 has an equal variance of cluster-specific effects (i.e., the variance is equally constrained) in addition to the correlation of 1 between group-specific effects; thus, the group effect is estimated as a fixed effect. (i.e., exactly the same between-level model). In this model, there is no need for a random effects variable for the between model.

Factorial Invariance Testing Procedure

In a multilevel multiple group approach, factorial invariance should be established not only across groups but also across the within and between levels (Jak, 2018; Jak & Jorgensen, 2017; Ryu & Mehta, 2017). In Ryu and Mehta’s (2017) article, this was called metric invariance across levels. Although multilevel modeling has flexibility in that the model can be specified differently at each level, it requires the same factor structure across the levels in multiple group analysis (Kim et al., 2012; Ryu & Mehta, 2017). With metric invariance across levels, the total variance of a factor can be divided into within and between levels (Zyphur et al., 2008). In addition, latent factors can represent the same construct conceptually at both levels (e.g., self-efficacy of students at the within level represent average self-efficacy of a school at the between level), which enables them to be labeled in the same way. With this outcome, a latent group mean occurring within level can be estimated at a between level (Asparouhov & Muthén, 2012b).

With metric invariance across levels, Equation (1a) is simplified to

[y_{ij} | C_{ij} = g] = τ_{g .} + Λ_{g} (η_{Bj .} + η_{wijg}) + ε_{Bj .} + ε_{wijg} .

(3)

Now the procedure of factorial invariance for ML FMM is similar to multiple group analysis of single-level data. The holding for weak invariance was assessed by model comparison testing between a factor loading equally constrained model across the groups (i.e., weak invariance model, $Λ_{1}$ = $Λ_{2}$ in Equation 2) and the unconstrained model (i.e., configural invariance model). In same way, holding for strong invariance was assessed by model comparison testing between intercepts of the observed variables equally constrained model across the groups (i.e., intercept invariance model, $τ_{1} = τ_{2}$ in Equation 2) and the unconstrained model (i.e., weak invariance model). As in MSEM, the mean parameters (i.e., latent factor means, measurement intercepts) is zero at the within level, and these are specified at the between level. After the establishment of factorial invariance, latent mean difference across groups is represented by the latent means of a focal group ( $α_{W (g 2)}$ ) by constraining the latent means of a reference group ( $α_{W (g 1)}$ ) as 0 at the between level.

Strengths and Weaknesses

ML FMM is considered the most general and flexible model in that the dependency between group effects can freely be estimated depending on the modeling. The dependency across groups in the same cluster can be represented by separate sets of group-specific random effects at the between level with the freely estimated as their correlation. Or it can be estimated as a constrained model to reduce the number of parameters. There can be two models in which the correlation is assumed to be 1. One is the group-specific random effects¹ having proportional effects across groups (referred to as ML FMM A1 in this study); the other is the same cluster effects across groups (referred to as ML FMM A2 in this study). Although it is not often seen, the correlation between groups can be assumed to be zero.

Obviously, high power and properly controlled Type I errors are expected in the group-specific random effects model with the freely estimated as their correlation. However, estimation of this model involves a burden on large sample size. In small sample size data, there will not be enough power to establish statistical significance for the group-specific effects. In addition, when the group-specific random effects and IGCC were incorrectly constrained differently from the data, a high Type I error rate and low statistical power resulted (Kim & Cao, 2015).

On the other hand, multiple group analysis in the ML FMM is not possible when the factor structure across levels is different because the mean parameters (i.e., latent factor means, intercepts) are specified at the between level under the assumption that the latent factors have equal meanings at different levels (i.e., metric invariance across levels).

Multilevel Multiple-Indicators Multiple-Causes Model

MIMIC modeling is generally used to test factorial invariance and group mean differences in latent variables (Jöreskog & Goldberger, 1975; Muthén, 1989). Unlike the multiple group analysis approach, MIMIC employs a dummy-coded grouping variable as a covariate to test factorial invariance. The MIMIC model can be extended to an ML MIMIC modeling in hierarchically structured data. When there are groups at the within level, a grouping covariate ( $X_{Wij}$ ) is specified at the within level. An ML MIMIC model with within-level groups is expressed as

y_{ij} = [τ_{Bj} + Λ_{B} η_{Bj} + ε_{Bj}] + [Λ_{W} η_{Wij} + ε_{Wij}],

(4a)

η_{Wij} = Γ_{W} X_{Wij} + ξ_{Wij},

(4b)

ξ_{Wij} ~ N (0, ψ_{W}),

(4c)

where $y_{ij}$ is a vector of observed scores for individual $i$ in cluster $j$ , $Λ_{W}$ and $Λ_{B}$ are a matrix of loadings that relate within- and between-level factor scores to the observed variables, respectively. $η_{Wij}$ and $η_{Bj}$ are vectors of within- and between-level factor scores, respectively. A regression coefficient ( $Γ_{W}$ ) of a grouping covariate ( $X$ ) on the corresponding latent factor ( $η$ ) represents the group effects indicating the difference in latent means at the within level.

Factorial Invariance Testing Procedures

In the MIMIC modeling, sequential invariance testing (i.e., configural, weak, and strong invariance testing) is examined with the significance of the corresponding regression coefficient (Mellenbergh, 1989). Equations (5a) to (5c) specify the configural invariance model, which may differ in intercepts and factor loading between groups.

y_{ij} = [τ_{Bj} + Λ_{B} η_{Bj} + ε_{Bj}] + [Λ_{W} η_{Wij} + β_{y} X_{Wij} + ω_{η y} η_{Wij} X_{Wij} + ε_{Wij}],

(5a)

ε_{Wij} ~ N (0, Θ_{ε W}),

(5b)

ε_{Bj} ~ N (0, Θ_{ε B .}) .

(5c)

Specifically, the grouping variable ( $X_{Wij}$ ) predicts each observed variable, which allows the groups to have different intercepts for these variables. Thus, if the regression coefficient ( $β_{y}$ ) is significant, it means that the intercept of the corresponding observed variable is noninvariant between groups, and strong invariance (also called uniform bias in the MIMIC approach) is not established. To test factor loading invariance in ML MIMIC, there needs to be an interaction term between the latent variable ( $η_{Wij}$ ) and the grouping variable ( $X_{Wij}$ ). For example, suppose that there is a latent variable of motivation and the grouping variable is gender. The interaction term ( $η_{Wij} X_{Wij}$ ) is produced by the multiplication of gender $(X_{Wij}$ ) variables and factors of motivation ( $η_{Wij}$ ). The observed variables are regressed on the interaction term, which allows for group differences in loadings. In the same manner as for intercept invariance testing, if the regression coefficient ( $ω_{η y}$ ) is significant, this means that the factor loading of corresponding observed variables is noninvariant between groups, and weak invariance (also called nonuniform bias in the MIMIC approach) is not established. When factorial invariance is established, the model equation for testing the latent mean difference is simplified as Equation (4a).

Strengths and Weaknesses

ML MIMIC has a smaller number of parameters to be estimated than ML FMM because it uses the whole sample to estimate the group difference instead of dividing the data by groups, which ensures adequate power for relatively small samples (Woods, 2009). Another benefit to multiple group analysis using this model is that the model can easily incorporate additional variables (e.g., socioeconomic status, disposition, motivation) that can potentially explain group differences. Despite its advantages, ML MIMIC has a few downsides. Although the factor variance and residual variance of observed variables may be different across groups, MIMIC modeling assumes homogeneity regardless of the number of groups (i.e., stringent assumption). In addition, when multiple group analysis is conducted with within-level groups, group-specific effects cannot be estimated as a random effect, which does not allow for the estimation of the correlation (or covariation) between within-level groups. Importantly, weak invariance testing entails a technical and time-consuming burden. As described above, weak invariance testing in MIMIC involves an interaction term with factors. Unlike interaction terms between observed variables, interaction terms with factors require numerical integration for analysis. As the number of factors increases, the dimension of numerical integration increases, making it almost impossible to estimate. Therefore, the weak invariance test in the two-factor model in Study 2 was conducted by fitting the model of each interaction term with a factor instead of creating two interaction terms with factors at one time.

Previous Simulation Studies

To evaluate the performance of methods of multilevel multiple group analysis when the grouping variable is at a within level, there have been several recently conducted simulation studies. A summary of prior simulation studies is presented in Table 1.² Kim et al. (2012) utilized the design-based multiple group multilevel CFA (i.e., single-level approach) to account for the dependency at the between level across the within-level groups. This method takes into account the hierarchical data structure by adjusting standard errors of the parameter estimates using the Type = COMPLEX routine in Mplus. Although this method showed desirable performance in terms of an admissible solution rate, statistical power, and Type I error, it raises questions about group-specific effects across within-level groups not considered in the data generation. That is, an ideal and special situation was assumed.

Table 1.

Summary of Previous Simulation Studies.

Study	Method	Model	Location of difference (magnitude)	Model selection criteria	Condition
Kim et al. (2012)	Design-based multiple group multilevel CFA (single-level approach) vs. multiple group ordinary CFA	Multilevel one-factor CFA (eight indicators)	Factor loading (Δ0.5 large size)	SB-LRT	• CN: 30, 50, 80, 160 • CS: 10, 20 • ICC: .09, .20, .33
Asparouhov & Muthén (2012b)	The most general ML FMM and five alternative models were presented	Multilevel one variable	None	LRT	None
Kim et al. (2015)	ML FMM, ML MIMIC	Multilevel one-factor CFA (eight indicators)	Factor loading or intercept (Δ0.25, Δ0.5, small and large sizes, respectively)	SB-LRT, AIC, BIC, aBIC	• CN: 60, 100, 160 • CS: 10, 20 • ICC: .09, .20, .33
Kim & Cao (2015)	ML FMM, ML MIMIC	Multilevel one-factor CFA (six indicators)	Factor mean (Δ0.3 medium size)	None	• CN: 20, 25, 30, 50, 100 • CS: 6, 10, 20 • Heterogeneity: homogeneous factor and residual variances, heterogeneous factor variances, heterogeneous residual variances • IGCC: 0, .3, .6
Ryu (2015)	Multiple group MSEM with Level 1 group membership	Multilevel one-factor CFA (four indicators)	Intercept (Δ0.3) or factor mean (Δ0.4)	LRT	• Sample size: balanced, three unbalanced levels

Note. CFA = confirmatory factor analysis; ML FMM = multilevel factor mixture model; ML MIMIC = multilevel multiple-indicators multiple-causes; MSEM = multilevel structural equation modeling; CN = number of clusters; CS = cluster size; ICC = intraclass correlation; IGCC = intraclass group correlation coefficient; LRT = likelihood ratio test; SB-LRT= Satorra–Bentler scaled likelihood ratio test; AIC = Akaike information criterion; BIC = Bayesian information criterion; aBIC = sample-size adjusted BIC.

Asparouhov and Muthén (2012b) presented the Mplus web note for multiple group multilevel analysis. They proposed the ML FMM, which was the most flexible model to estimate group-specific random effects (referred to as model H1 in their article) and suggested five alternative models that were gradually constrained from model H1. Also, a small-scale simulation was conducted to present how the log-likelihood values for the six different models differ from the true model (i.e., model H1). Since they focused on estimating the mean difference between groups, there were models that were not practically able to perform factorial invariance testing among these proposed alternative models.

To examine the performance of factorial invariance, Kim et al. (2015) investigated the adequacy of the ML FMM and ML MIMIC. Additionally, the performance of model selection criteria (Satorra–Bentler scaled likelihood ratio test [SB-LRT], Akaike information criterion [AIC], Bayesian information criterion [BIC], and sample-size adjusted BIC [aBIC]) for determining factorial invariance for ML FMM or in detecting item noninvariance for ML MIMIC was examined. Their study reported that the two approaches were optimal although ML FMM was less optimal when the sample size was as small as 600. However, it should be noted that restricted ML FMM that ignored the group-specific random effects at the between level was investigated.

In addition, Kim and Cao (2015) compared the performance of the two methods (ML FMM and ML MIMIC) for the group mean differences of the latent variable with factorial invariance assumed. The population model was the most general ML FMM, but the MIMIC showed comparable performance to the ML FMM by calculating the average value although the factor and residual variance between the two groups were set differently (i.e., heterogeneity). Additionally, they examined the performances of several different models of ML FMM (i.e., IGCC freely estimated, fixed at zero, fixed at one, and without group-specific random effects) and ML MIMIC with a focus on the specification of the IGCC in another study. In their study, the performance of the model that freely estimates the IGCC was the best, and the model with the IGCC fixed at one had a high power, but Type I errors were substantially inflated regardless of the IGCC conditions. On the contrary, the model without group-specific random effects properly controlled Type I errors, but the power was unacceptably very low. Rather, the model with IGCC fixed at zero indicated a slightly better performance in terms of both evaluation criteria.

Ryu (2014, 2015) suggested multiple group MSEM with Level 1 group membership (referred to as MG1-MSEM in her article) as a multiple group approach. To consider the dependence within the clusters properly, MG1-MSEM first decomposes the variance–covariance matrix into each level components before separating the data by groups. Then, a single-level CFA is set up with an identical between-level model for each group irrespective of their group membership by assuming that the correlation between within-level groups is perfect. MG1-MSEM is available only in Muthén’s maximum likelihood estimation (MUML; Muthén, 1994) and should be manually set up as many 2 *k groups single-level CFA models, where k equals the number of groups. Although MG1-MSEM can be easily applied to more complicated models such as a path model, it is less flexible than ML FMM in that MG1-MSEM only allows a common set of between-level random effects (i.e., correlation = 1) for groups (Ryu & Mehta, 2017).

Method

Simulation Design

To evaluate the performance of ML FMM and ML MIMIC for multiple group analysis in a multilevel CFA model across within-level groups, two Monte Carlo simulations were conducted. As a research model, a multilevel CFA model with one factor (Figure 1) and two factors (Figure 2) were examined in the first and second study, respectively. The model of Study 1 is a multilevel one-factor CFA model similar to previous studies.

Figure 1.

A multilevel confirmatory factor analysis population model of Study 1.

Figure 2.

A multilevel confirmatory factor analysis population model of Study 2.

The simulation design is represented in Table 2. The generation model was ML FMM, which is most general and flexible because the group-specific random effects and their correlation are estimated. Four models were generated: two factorial noninvariance models depending on the location of noninvariance (i.e., factor loading or intercept) and two factorial invariance models depending on whether there is a mean difference in the latent variable. The power for the establishment of factorial invariance was calculated by fitting the constrained model (i.e., an unspecified model) and unconstrained model (true model) using the data of the factorial noninvariance model. The power for estimating the latent group mean parameter was calculated using the factorial invariance model with latent group means differences. In addition, a factorial invariance model with zero mean difference was simulated to assess Type I errors.

Table 2.

Simulation Design.

Generation model (ML FMM)		Fitted model (ML FMM and ML MIMIC in Study 1; ML FMM A1, A2, and ML MIMIC in Study 2)	Analysis of results
Factorial noninvariance model	Factor loading noninvariance model	• Configural invariance model (true model) • Weak invariance model (misspecified model)	Power for establishment of factorial invariance (or detecting noninvariance model)
	Intercept noninvariance model	• Weak invariance model (true model)• Strong invariance model (misspecified model)
Factorial invariance model	Latent group mean differences model	• Strong invariance model (true model)	Power, parameter bias, SE, MSE, and 95% coverage
	Zero latent group mean differences	• Weak invariance model (misspecified model)• Strong invariance model (true model)	Type I error for establishment of factorial invariance and Type I error for the parameter estimation

Note. ML FMM = multilevel factor mixture model; ML MIMIC = multilevel multiple-indicators multiple-causes; ML FMM A1 = first alternative model; ML FMM A2 = second alternative model; SE = standard error; MSE = mean squared error.

Simulation Design Factors

Simulation design factors were considered in the two studies: number of clusters (CN), cluster size (CS) and design types in groups (DT). Three different numbers of clusters were simulated (i.e., 30, 50, and 100 per within-level group). Prior simulation studies on multilevel modeling indicated that CN was an important factor in the estimation of the model (Hox & Maas, 2001; Maas & Hox, 2005). In addition, studies on factorial invariance testing with multilevel data reported that CN was positively associated with the power of detecting the noninvariance items and slightly related to the inflation of Type I errors (Kim et al., 2015). The difference of the performance of ML FMM and ML MIMIC was expected to be evident when the CN was small because the two methods have a different number of free parameters that are related to model complexity as reviewed before. In addition, although 100 was considered a minimum number of clusters to obtain stable results (Hox et al., 2010), as in practice, the number of clusters was often much smaller than 100 except for large-scale assessment projects. The cluster sizes of 15 and 30 were chosen based on previous multilevel simulation studies (Hox, 1998; Kim & Cao, 2015; Kim et al., 2012; Kim et al., 2015). The combination of the CN and CS created total sample sizes that ranged from 450 to 3,000. CS was split for each group depending on the design type in the groups because grouping variables occurred at the within level. The groups were designed as balanced or two unbalanced types. The ratio between the two groups was set to 1:1 in a balanced type but 1:2 or 1:3 in the unbalanced types. According to previous studies, sample size in a group that is as small as 20% of the total sample was considered to be at the extreme but statistically estimable sample size (Henson et al., 2007; Tofighi & Enders, 2008). For example, when the cluster size was 15, it was divided into five in the focal group and 10 in the reference group on average in a 1:2 unbalanced type. This means that not divided by an integer of the same size for all replication, on average, two groups were generated at a ratio of 1:2 across 500 replications. In the same way, for these conditions of cluster sizes of 15 or 30 and unbalanced groups of 1:2 or 1:3, on average, 500 replications were generated that met these conditions. The larger sample size served as the reference group.

Data Generation and Analysis

Data Generation

Two-level data with within-level groups (e.g., males and females) were generated. For both within and between levels (i.e., identical factor structure), a single factor with eight indicators and two factors with four indicators were constructed in Study 1 and Study 2, respectively. Two of the eight indicators ( $Y_{w 3}$ and $Y_{w 7}$ , where $Y_{wj}$ means the jth variable at the within level) were not invariant across within-level groups, which indicates about 25% contamination. It is known that a contamination rate of less than 50% is acceptable because most of the indicators are likely to be invariant (Kim et al., 2012; Kim & Willson, 2014). The population parameters for data generation were determined based on previous studies about multilevel measurement invariance (Kim & Cao, 2015; Kim et al., 2015; Ryu, 2015).

In the noninvariance model, two indicators (i.e., y₃ and y₇) were set differently. For factor loading noninvariance, all the factor loadings of the first group (i.e., reference group) were .80 except for reference item $λ_{W 1 (1)}$ , whereas for the second group (i.e., focal group), the factor loadings were generated with identical factor loadings but $λ_{W 3 (2)}$ and $λ_{W 7 (2)}$ were .60. A .2 difference in factor loading is considered medium for noninvariance (Kim & Willson, 2014; Kim & Yoon, 2011; Stark et al., 2006). For intercept noninvariance, intercepts at the between level were set differently. Again, the intercept of within-level indicators was estimated at the between level in the multilevel data. Specifically, between-level intercepts of all indicators were zero in the reference group, whereas two intercepts ( $τ_{W 3 (2)}$ and $τ_{W 7 (2)}$ ) of the focal group were set to 0.3. The 0.3 difference in intercepts is considered medium for noninvariance in multilevel data (French & Finch, 2008; Kim et al., 2012). In addition, the latent mean was set at a difference of 0.3; this size was considered medium in the previous study (Finch & French, 2011; Maas & Hox, 2005).

To create population heterogeneity, the factor variances across group membership were set differently (e.g., when the group is a balanced design, 1.0 for Group 1 and 1.5 for Group 2 at the within level and 0.1 for Group 1 and 0.2 for Group 2 at the between level). ICC was set to around 0.1. This size is considered small in previous simulation studies (e.g., Hox & Maas, 2001; Jak & Oort, 2015; Kim et al., 2012; Kim et al., 2015; Maas & Hox, 2005) but most commonly seen in educational and psychological studies. The IGCC was set to 0.8. In more general cases, group-specific random effects are expected to be highly correlated in education and psychological contexts (Asparouhov & Muthén, 2012b). According to Ryu and Mehta’s (2017) study, the IGCC of a math self-efficacy latent variable for males and females in PISA 2003 data was .988, which was close to 1. With such high correlation in the educational and psychological context, .8 was considered an appropriate value.

Eighteen conditions (3 CN × 2 CS × 3 DT) per four models (i.e., two noninvariance and two invariance models) that resulted in 72 models were involved in the data generation. For each condition, 500 replications are generated. Models were estimated with maximum likelihood with robust standard errors (MLR), which is known to be robust for nonnormality and nonindependence observations. Data generation and all subsequent analysis were conducted with Mplus 8.3 (Muthén & Muthén, 2015).

Fitted Model

Each replication was analyzed both using ML FMM and ML MIMIC in Study 1 and three times using ML FMM A1, ML FMM A2, and ML MIMIC in Study 2. Two alternative models (i.e., ML FMM A1, ML FMM A2), which were more restricted models, were evaluated due to the convergence problem caused by the increased dimension of numerical integration. In addition, ML MIMIC for a two-factor CFA model entails a different model specification with ML MIMIC for a one-factor CFA model. Importantly, in the testing of the weak invariance model, the interaction term between the factor and the grouping variable should be generated, which means that the number of interaction terms increases by the number of factors. The interaction term between the factor and the observation variable is also required to be estimated using numerical integration, which takes a considerable amount of time to converge. Therefore, the weak invariance testing of ML MIMIC in Study 2 detects noninvariance indicators separately for each factor.

Evaluation Criteria

First, to evaluate the adequacy of the two methods (ML FMM and ML MIMIC), this study investigated the convergence rates. Convergence rates refer to the replications in which the model estimation is computed successfully. On the other hand, nonconvergence means that the solution is not mathematically produced due to a negative variance or a singular matrix.

Second, power and Type I errors were assessed in factorial invariance testing using model selection criteria. Power is defined as the proportion of replications in which the level of noninvariance is correctly detected. That is, power is the proportion at which the configural model (i.e., model with unconstrained factor loading) is selected when the factor loading is noninvariant across the groups. At the same time, power is the proportion at which the weak model (i.e., model with unconstrained intercepts) is selected when the intercept is noninvariant across the groups. In contrast to power, a Type I error is computed as the proportion of replications in which an invariance model is falsely rejected at $α = . 05$ when factorial invariance holds in the true population. To assess power and Type I errors, a model comparison was conducted using model selection criteria. The criteria included a likelihood ratio test (LRT) and three different information criteria (ICs). When MLR was used for model estimation, it is suggested that the Satorra–Bentler scaled likelihood ratio test (SB-LRT) is more appropriate than the existing LRT (Satorra & Bentler, 1994, 2001). SB-LRT³ uses a correction factor for chi-squares to calculate chi-square differences between the two nested models (i.e., constrained and relaxed model). Under the null hypothesis of no statistical difference between the models, when the null hypothesis is rejected, the relaxed model is selected, whereas the constrained model is favored otherwise. In addition to the adjusted chi-squares difference testing, AIC (Akaike, 1987), BIC (Schwarz, 1978) and aBIC (Sclove, 1987) were examined as measures of model comparison. The smaller the statistics of the ICs, the better the model is considered. These ICs are generally used for nonnested model comparison (e.g., mixture model; Nylund et al., 2007).

In latent mean difference testing, power, Type I errors, bias, mean squared error (MSE), and 95% coverage were assessed. Power here refers to the proportion of replication in which the latent mean difference is detected when the difference truly exists in the population. On the other hand, the Type I error here is computed as the proportion of replications in which zero mean difference between groups was falsely rejected at $α = . 05$ . The accuracy and precision of the estimated latent group mean difference were evaluated through parameter bias, standard error (SE), MSE, and 95% coverage. Parameter bias was computed as the deviation of the parameter estimate ( $\hat{θ_{r}}$ ) from the population parameter ( $θ$ ) on average across 500 replications. The absolute value of parameter bias should not exceed 0.1, which is normally used (Muthén & Muthén, 2002). SEs of these parameter estimates were also collected from each condition and averaged across replication. MSE is defined as the sum of the square of the bias of parameter estimates and variance. The smaller the values, the more accurate and reliable the estimation is. The coverage rate of a 95% confidence interval means the proportion of replications for which the 95% confidence interval includes the true parameter value. Coverage indicates how well the parameters and SEs are estimated. A range of .91 to .98 is considered a satisfactory standard of coverage (Muthén & Muthén, 2002).

Results: Study 1

Convergence Rates

For ML FMM, all replications converged perfectly under many conditions and the lowest rate was 92.8%. However, the convergence rates for ML MIMIC were relatively low from 74.4% to 98.6% and especially for the two unbalanced types. Overall, there were no notable differences in the convergence rate depending on the levels of the factorial invariance testing model (i.e., weak and strong invariance model testing) and across conditions (number of clusters, cluster size, and design types). Even under the worst condition, assumed as a small sample (CN = 30, CS = 15, 1:3 unbalanced group size), both methods showed a high convergence rate (92.8% for ML FMM and 79.8% for ML MIMIC in the weak invariance testing model), indicating high usability in the applied study.

Type I Errors and Power of Factorial Invariance Testing Using Model Selection Criteria

Table 3 present the results of Type I errors and power of factorial invariance testing for ML FMM and ML MIMIC with a multilevel one-factor CFA model using model selection criteria (i.e., SB-LRT, AIC, BIC, and aBIC). Also, the performances by model selection criteria are presented in Figure 3 for specific conditions.

Table 3.

Type I error and Power of Factorial Invariance Testing in Study 1.

											Power
			Type I error								Factor loading noninvariance (weak)								Intercept noninvariance (strong)
Condition			SB		AIC		BIC		aBIC		SB		AIC		BIC		aBIC		SB		AIC		BIC		aBIC
CN	CS	DT	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2	M1	M2
30	15	B	.51	.49	.05	.05	.00	.00	.01	.01	.99	.92	.88	.87	.05	.07	.61	.62	1.00	.99	1.00	1.00	.63	.64	.99	.99
30	15	Unb1	.52	.47	.06	.17	.00	.00	.00	.04	.97	.90	.74	.83	.01	.12	.42	.67	1.00	.99	1.00	1.00	.34	.70	.98	1.00
30	15	Unb2	.53	.49	.05	.24	.00	.00	.01	.05	.94	.87	.64	.79	.00	.11	.31	.64	1.00	.99	.98	1.00	.16	.61	.89	.99
30	30	B	.52	.46	.05	.06	.00	.00	.00	.00	1.00	.95	1.00	.93	.29	.33	.88	.84	1.00	.98	1.00	1.00	1.00	1.00	1.00	1.00
30	30	Unb1	.53	.46	.05	.17	.00	.00	.00	.01	1.00	.95	.97	.92	.09	.42	.73	.85	1.00	1.00	1.00	1.00	.93	.99	1.00	1.00
30	30	Unb2	.52	.45	.04	.21	.00	.00	.00	.02	1.00	.96	.92	.90	.03	.36	.49	.80	1.00	.99	1.00	1.00	.77	.97	1.00	1.00
50	15	B	.50	.46	.07	.07	.00	.00	.00	.00	1.00	.99	.98	.98	.19	.26	.84	.88	1.00	.99	1.00	1.00	.96	.96	1.00	1.00
50	15	Unb1	.52	.53	.06	.16	.00	.00	.00	.01	.99	.98	.95	.98	.07	.34	.61	.88	1.00	.99	1.00	1.00	.81	.97	1.00	1.00
50	15	Unb2	.55	.52	.05	.25	.00	.00	.00	.02	.99	.98	.85	.98	.02	.33	.44	.84	1.00	.99	1.00	1.00	.54	.92	.98	1.00
50	30	B	.48	.48	.07	.07	.00	.00	.00	.00	1.00	1.00	1.00	1.00	.75	.81	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
50	30	Unb1	.51	.52	.05	.15	.00	.00	.00	.01	1.00	.98	1.00	1.00	.42	.83	.93	.99	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00
50	30	Unb2	.48	.48	.04	.20	.00	.00	.00	.01	1.00	.97	.99	1.00	.17	.76	.80	.98	1.00	.99	1.00	1.00	.99	1.00	1.00	1.00
100	15	B	.47	.46	.06	.07	.00	.00	.00	.00	1.00	1.00	1.00	1.00	.76	.82	1.00	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00
100	15	Unb1	.51	.53	.06	.16	.00	.00	.00	.01	1.00	1.00	1.00	1.00	.44	.83	.93	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	15	Unb2	.51	.51	.07	.24	.00	.00	.00	.01	1.00	.99	1.00	1.00	.22	.75	.79	.99	1.00	.99	1.00	1.00	.99	1.00	1.00	1.00
100	30	B	.53	.50	.05	.06	.00	.00	.00	.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00
100	30	Unb1	.53	.53	.06	.18	.00	.00	.00	.01	1.00	.99	1.00	1.00	.98	1.00	1.00	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00
100	30	Unb2	.48	.48	.05	.25	.00	.00	.00	.01	1.00	1.00	1.00	1.00	.81	.99	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00

Note. M1 = multilevel factor mixture model with one factor; M2 = multilevel multiple-indicators multiple-causes with one factor; CN = number of clusters; CS = cluster size; DT = design type; B = balanced design; Unb1 = 1:2 unbalanced design; Unb2 = 1:3 unbalanced design; SB = Satorra–Bentler scaled likelihood ratio test; AIC = Akaike information criterion; BIC = Bayesian information criterion; aBIC = sample-size adjusted BIC.

Figure 3.

Type I error and power of factorial invariance testing of AIC, BIC, and aBIC in Study 1.

Overall, three characteristics were observed across all conditions. First, the performances of the weak invariance testing were poorer than those of the intercept invariance testing across all model selection criteria. This indicates that the performance of both methods depends largely on the power of weak invariance testing estimating parameters more freely. Second, the powers of ML FMM were more sensitive to unbalanced sample size conditions than ML MIMIC except for large sample size conditions (i.e., CN = 100, CS = 30). This is more evident in Figure 3, indicating that it is appropriate to consider the use of ML MIMIC rather than ML FMM if there are sample differences between the two groups. Third, Type I errors showed consistent results regardless of the conditions, whereas power was positively related to CN and CS. Also, as expected, even with the same sample size, the power was higher in the balanced design condition than the unbalanced design type.

Specifically, power and Type I errors varied considerably depending on the model selection criteria. SB-LRT had high power (ranging from .87 to 1.00) in weak and strong invariance testing whereas the Type I error was unacceptably high (ranging from .45 to .53) in both methods. These results indicate that SB-LRT is not appropriate as a model selection criterion for multiple group analysis.

Regarding the AIC, both methods had satisfactory powers in the weak and strong invariance model testing except for the worst condition (i.e., CN = 30, CS = 15 and unbalanced design type). However, ML FMM had reasonably controlled Type I errors (from .04 to .07), whereas the Type I errors of the two unbalanced conditions in ML MIMIC notably increased (from .15 to .25).

BIC was found to be more affected by sample size than the other ICs. In the weak invariance model testing, satisfactory power was seen in the case of the large sample size with at least CN = 100 and CS = 30 for ML FMM and CN = 50 and CS = 30 for ML MIMIC. Also, the power of BIC was considerably low in the 1:3 unbalanced design type in both methods. The power of strong invariance testing was less affected by sample size than weak testing but required at least CN = 50 and CS = 15 for sample size for reasonable power in both methods. On the other hand, BIC had overcontrolled Type I errors to zero in all conditions for both methods.

aBIC was more robust to sample size than BIC but exhibited a fairly similar pattern. In strong invariance testing, both methods secured excellent power in all conditions (.89-1.0), but weak testing required at least CN = 50 and CS = 30 in FMM, and CN = 30 and CS = 30 in MIMIC. Like BICs, there was a tendency to lose power under unbalanced design conditions. Also, aBIC overcontrolled Type I errors less than .05 in most conditions for both methods.

Compared with the results of previous studies (Kim & Cao, 2015) that simulated the performance of multilevel factorial invariance testing across within-level groups for both methods, the results of this study are mostly supported. Specifically, the results that SB-LRT was not recommended as an excessively inflated Type I error and that BIC and aBIC were recommended for large sample sizes were replicated. However, there was also a contradictory result with this study. AIC in both methods was recommended because of its high power and properly controlled Type I errors except for a certain condition (i.e., unbalanced types using ML MIMIC) in this study, whereas AIC in ML FMM yielded an unacceptably inflated Type I error in the prior study. In their study, Type I errors of AIC were positively related to cluster size, but no such relationship was observed in this study. Therefore, further investigation in this regard is needed.

Power, Type I Errors, Bias, Mean Squared Error, and 95% Coverage of Latent Mean Difference Testing

Table 4 summarizes Type I errors, power, relative bias, MSE, and 95% coverage of parameter estimation in Study 1. The performances of ML FMM and ML MIMIC were comparable in all evaluation criteria. Type I errors of ML FMM and MLMIMIC were properly controlled for regardless of small sample size and unbalanced design types. However, as expected, the power of both methods was very low for small sample size such as less than the combination of CN = 50 and CS = 15 (ranging from .51 to .72 for ML FMM and .48 to .71 for ML MIMIC), and it was increased to a reasonable level as the sample size increased. Also, the relative bias and 95% coverage rates of both methods were within the standard value of .1 and range of .91 to .97, respectively in all conditions. SE and MSE were not large in both methods and coverage rates of 95% CI for all conditions were within the appropriate range in both methods.

Table 4.

Type I Error, Power, Relative Bias, Standard Error, Mean Squared Error, and 95% Coverage Rate of Latent Mean in Study 1.

Model			ML FMM						ML MIMIC
CN	CS	DT	Error	Power	RB	SE	MSE	CV	Error	Power	RB	SE	MSE	CV
30	15	B	.06	.51	.04	.15	.02	.95	.06	.49	.04	.15	.02	.94
30	15	Unb1	.07	.50	.02	.15	.02	.94	.06	.49	.01	.15	.02	.94
30	15	Unb2	.07	.46	.03	.15	.03	.94	.07	.48	.02	.16	.03	.92
30	30	B	.05	.69	.02	.12	.01	.95	.05	.70	.02	.12	.01	.95
30	30	Unb1	.06	.69	.02	.12	.01	.94	.07	.67	.02	.13	.02	.94
30	30	Unb2	.06	.65	.02	.13	.02	.94	.06	.65	.03	.13	.02	.94
50	15	B	.06	.72	.04	.11	.01	.94	.06	.71	.04	.12	.01	.94
50	15	Unb1	.05	.71	.02	.11	.01	.95	.05	.71	.02	.12	.01	.95
50	15	Unb2	.05	.69	.02	.12	.01	.95	.04	.68	.02	.12	.01	.96
50	30	B	.06	.91	.01	.09	.01	.95	.06	.90	.01	.10	.01	.95
50	30	Unb1	.06	.88	.01	.10	.01	.94	.06	.88	.01	.10	.01	.94
50	30	Unb2	.07	.86	.02	.10	.01	.93	.07	.85	.02	.10	.01	.93
100	15	B	.07	.95	.01	.08	.01	.93	.07	.94	.01	.08	.01	.93
100	15	Unb1	.05	.95	.00	.08	.01	.95	.05	.94	.00	.09	.01	.95
100	15	Unb2	.05	.94	.00	.08	.01	.95	.05	.93	.00	.09	.01	.95
100	30	B	.04	1.00	.00	.07	.00	.97	.03	.99	.01	.07	.00	.96
100	30	Unb1	.04	.99	.01	.07	.00	.95	.05	.99	.01	.07	.00	.96
100	30	Unb2	.05	.99	.00	.07	.00	.94	.05	.99	.00	.08	.00	.95

Note. ML FMM = multilevel factor mixture model; ML MIMIC = multilevel multiple-indicators multiple-causes; CN = number of clusters; CS = cluster size; DT = design type; B = balanced design; Unb1 = 1:2 unbalanced design; Unb2 = 1:3 unbalanced design; Error = Type I error; RB = relative bias; SE = standard error; MSE = mean squared error; CV = coverage rate of 95% confidence interval.

Results: Study 2

Convergence Rates

Similar patterns of convergence rates with Study 1 emerged. The convergence rates of two alternative ML FMM models were not different from each other and reached 1.0 or close to 1.0. ML MIMIC had a rather lower convergence rate (from 75.6% to 91.2%) than the ML FMM models. As with the results of Study 1, there were no significant differences across the factorial invariance testing models and simulation conditions. These results indicate that both methods can be easily applied in a modified form to complex models.

Power and Type I Errors of Factorial Invariance Testing Using Model Selection Criteria

Table 5 presents the results of Type I errors and power of factorial invariance testing for ML FMM A1, ML FMM A2, and ML MIMIC with multilevel two-factor CFA model using model selection criteria (i.e., SB-LRT, AIC, BIC, and aBIC), respectively. The results of Study 2 were considerably similar to those of Study 1. The performances of ML FMM A1 and ML FMM A2 resulted in almost the same results, and the differences with ML MIMIC were negligible.

Table 5.

Type I Error and Power of Factorial Invariance Testing in Study 2.

															Power
			Type I Error												Factor loading noninvariance (weak)												Intercept noninvariance (strong)
Condition			SB			AIC			BIC			aBIC			SB			AIC			BIC			aBIC			SB			AIC			BIC			aBIC
CN	CS	DT	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3	A1	A2	M3
30	15	B	.56	.56	.55	.07	.07	.07	.00	.00	.00	.01	.01	.01	.98	.99	.88	.89	.91	.79	.05	.05	.15	.66	.68	.61	1.00	1.00	.99	1.00	1.00	1.00	.78	.78	.80	1.00	1.00	1.00
30	15	Unb1	.53	.53	.50	.08	.08	.18	.00	.00	.00	.00	.00	.05	.98	.97	.86	.76	.77	.78	.03	.03	.24	.50	.49	.65	1.00	1.00	1.00	1.00	1.00	1.00	.56	.56	.83	.98	.98	1.00
30	15	Unb2	.54	.54	.50	.05	.05	.24	.00	.00	.00	.00	.00	.06	.94	.95	.85	.67	.66	.76	.01	.01	.23	.33	.31	.61	1.00	1.00	.99	.99	.99	1.00	.32	.32	.79	.94	.95	.99
30	30	B	.51	.52	.50	.06	.06	.06	.00	.00	.00	.00	.00	.00	1.00	1.00	.96	1.00	1.00	.91	.45	.47	.46	.96	.96	.84	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
30	30	Unb1	.50	.53	.50	.06	.06	.15	.00	.00	.00	.00	.00	.02	1.00	1.00	.94	.99	.99	.91	.18	.17	.49	.83	.82	.82	1.00	1.00	1.00	1.00	1.00	1.00	.99	.98	1.00	1.00	1.00	1.00
30	30	Unb2	.56	.56	.53	.09	.09	.26	.00	.00	.00	.00	.00	.03	.99	1.00	.93	.94	.94	.86	.07	.07	.44	.62	.60	.75	1.00	1.00	1.00	1.00	1.00	1.00	.89	.89	1.00	1.00	1.00	1.00
50	15	B	.55	.54	.53	.06	.06	.07	.00	.00	.00	.00	.00	.00	1.00	1.00	.96	.98	.98	.95	.26	.25	.39	.88	.88	.82	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
50	15	Unb1	.55	.55	.53	.06	.06	.17	.00	.00	.00	.00	.00	.03	1.00	.99	.95	.95	.95	.95	.10	.10	.43	.67	.67	.80	1.00	1.00	.99	1.00	1.00	1.00	.93	.93	1.00	1.00	1.00	1.00
50	15	Unb2	.52	.52	.53	.07	.07	.25	.00	.00	.00	.00	.00	.04	1.00	.99	.93	.87	.87	.94	.03	.02	.41	.48	.47	.77	1.00	1.00	.98	1.00	1.00	1.00	.75	.75	.98	.99	.99	1.00
50	30	B	.49	.49	.52	.07	.07	.07	.00	.00	.00	.00	.00	.00	1.00	1.00	.97	1.00	1.00	1.00	.89	.89	.81	1.00	1.00	.98	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
50	30	Unb1	.52	.53	.52	.06	.06	.16	.00	.00	.00	.00	.00	.00	1.00	1.00	.96	1.00	1.00	.99	.59	.57	.81	.99	.98	.96	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
50	30	Unb2	.53	.53	.52	.06	.06	.24	.00	.00	.00	.00	.00	.01	1.00	1.00	.96	1.00	1.00	.97	.32	.28	.73	.89	.86	.93	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	15	B	.54	.54	.50	.07	.07	.08	.00	.00	.00	.00	.00	.00	1.00	1.00	.98	1.00	1.00	1.00	.88	.88	.82	.99	.99	.98	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	15	Unb1	.50	.50	.48	.06	.06	.15	.00	.00	.00	.00	.00	.01	1.00	1.00	.97	1.00	1.00	.99	.58	.57	.83	.97	.97	.96	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	15	Unb2	.50	.50	.49	.07	.07	.27	.00	.00	.00	.00	.00	.02	1.00	1.00	.98	1.00	1.00	1.00	.25	.22	.77	.89	.87	.97	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	30	B	.54	.54	.51	.07	.07	.08	.00	.00	.00	.00	.00	.01	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	30	Unb1	.53	.53	.53	.06	.06	.16	.00	.00	.00	.00	.00	.00	1.00	1.00	.99	1.00	1.00	1.00	.99	.99	.97	1.00	1.00	.99	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
100	30	Unb2	.53	.53	.52	.08	.08	.27	.00	.00	.00	.00	.00	.01	1.00	1.00	.99	1.00	1.00	1.00	.89	.88	.95	1.00	1.00	.98	1.00	1.00	.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00

Note. ML FMM = multilevel factor mixture model; ML MIMIC = multilevel multiple-indicators multiple-causes; A1 = ML FMM alternative Model 1 with two factors; A2 = ML FMM alternative Model 2 with two factors; M3 = ML MIMIC with two factors; CN = number of clusters; CS = cluster size; DT = design type; B = balanced design; Unb1 = 1:2 unbalanced design; Unb2 = 1:3 unbalanced design; SB = Satorra–Bentler scaled likelihood ratio test; AIC = Akaike information criterion; BIC = Bayesian information criterion; aBIC = sample-size adjusted BIC.

Generally, as in Study 1, the power of weak invariance model testing was poorer than that of the intercept invariance model testing. Alternative models of ML FMM tended to be more sensitive to uneven sample size conditions than ML MIMIC. In addition, Type I errors and power produced concordant results irrespective of sample size conditions.

Specifically, Type I errors of SB-LRT were unacceptably large regardless of the condition in all methods. Even though the power is excellent, it is not appropriate to consider the criteria for factor invariance testing. On the other hand, those of AIC, BIC, and aBIC were properly controlled for in all conditions at a nominal level except for AIC in the unbalanced condition in ML MIMIC. Accordingly, it can be considered a model selection criterion depending on sample size conditions in which power above a standard value is secured.

Except for a small sample size (CN = 30, CS = 15), the power of AIC was acceptable, whereas BIC and aBIC required a fairly large sample to have power above the standard of .8. BIC required a sample size of 1,500 (CN = 50, CS = 30) or higher for both methods, and 3,000 (CN = 100, CS = 30) if the sample size is unbalanced. aBIC was less demanding than BIC, and both methods required a sample size of 900 (CN = 30, CS = 30) or higher, and 1,500 (CN = 500, CS = 30) when the sample size between the comparison groups is unbalanced.

Power, Type I Errors, Bias, Mean Squared Error, and 95% Coverage of Latent Mean Difference Testing

Table 6 summarizes Type I errors, power, relative bias, MSE, and 95% coverage of two latent mean parameter estimations. As in Study 1, the performance of two ML FMM alternative models and ML MIMIC in all evaluation criteria were comparable. Type I error rates were properly controlled and relative bias was below .05 in all conditions for both methods. Similar to Study 1, the power of both methods was low for a small sample size (i.e., less than CN = 50 and CS = 30), but as the sample size increased, the power also improved. But the only difference from Study 1 is that when the unbalanced degree of the sample size of the two comparison groups was not as large as 1:3, the condition of the combination of CN = 50 and CS = 15 also had a power of .8 or more.

Table 6.

Type I Error, Power, Relative Bias, Standard Error, Mean Squared Error, and 95% Coverage Rate of First Latent Mean in Study 2.

Model			ML FMM A1						ML FMM A2						ML MIMIC
CN	CS	DT	Error	Power	RB	SE	MSE	CV	Error	Power	RB	SE	MSE	CV	Error	Power	RB	SE	MSE	CV
30	15	B	.06	.63	.02	.14	.02	.94	.06	.64	.02	.13	.02	.94	.07	.64	.03	.13	.02	.93
30	15	Unb1	.06	.60	.02	.14	.02	.94	.06	.59	.01	.14	.02	.94	.06	.60	.02	.14	.02	.94
30	15	Unb2	.06	.55	.00	.15	.02	.94	.06	.55	.00	.14	.02	.95	.06	.54	.00	.15	.02	.94
30	30	B	.06	.77	.01	.11	.01	.94	.05	.77	.00	.11	.01	.95	.05	.76	.00	.11	.01	.94
30	30	Unb1	.05	.76	.01	.11	.01	.95	.05	.76	.01	.11	.01	.96	.05	.75	.02	.11	.01	.95
30	30	Unb2	.05	.72	.01	.12	.01	.95	.05	.72	.01	.12	.01	.95	.05	.71	.01	.12	.01	.95
50	15	B	.05	.84	.01	.11	.01	.95	.05	.84	.01	.10	.01	.95	.05	.83	.01	.10	.01	.95
50	15	Unb1	.04	.81	.02	.11	.01	.96	.05	.80	.01	.11	.01	.95	.05	.81	.02	.11	.01	.95
50	15	Unb2	.06	.76	.01	.11	.01	.94	.06	.75	.01	.11	.01	.94	.06	.73	.01	.12	.01	.95
50	30	B	.05	.94	.01	.08	.01	.95	.06	.94	.01	.08	.01	.95	.05	.93	.02	.09	.01	.95
50	30	Unb1	.05	.92	.01	.09	.01	.95	.04	.92	.01	.09	.01	.96	.04	.91	.03	.09	.01	.96
50	30	Unb2	.03	.91	.02	.09	.01	.97	.04	.90	.02	.09	.01	.96	.03	.90	.03	.09	.01	.97
100	15	B	.05	.98	.00	.07	.01	.95	.04	.98	.00	.07	.01	.96	.05	.98	.01	.07	.01	.96
100	15	Unb1	.06	.97	.00	.08	.01	.94	.06	.97	.00	.08	.01	.94	.05	.97	.00	.08	.01	.95
100	15	Unb2	.06	.96	.00	.08	.01	.95	.06	.96	.00	.08	.01	.95	.05	.96	.00	.08	.01	.95
100	30	B	.06	1.00	.00	.06	.00	.94	.06	1.00	.00	.06	.00	.94	.06	1.00	.00	.06	.00	.93
100	30	Unb1	.06	1.00	.00	.06	.00	.94	.06	1.00	.00	.06	.00	.94	.06	1.00	.00	.06	.00	.93
100	30	Unb2	.07	1.00	.01	.06	.00	.93	.07	1.00	.01	.06	.00	.94	.06	1.00	.01	.06	.00	.94

Discussion and Conclusion

Factorial invariance testing in multilevel data across within-level groups can be challenging due to the complexity of the model in relation to considering dependencies at the between level. This study addressed modeling and analysis issues in multilevel multiple group analysis when the grouping variable exists at the within level. Two methods have been considered to be suitable: ML FMM and ML MIMIC modeling. To investigate the performances of the proposed models regarding a set of factor invariance testing (i.e., weak and strong invariance testing) and subsequent latent group mean difference testing, two Monte Carlo studies were conducted. In Study 1, the performances of ML FMM and ML MIMIC were examined for multilevel one-factor CFA model. In Study 2, alternative models of ML FMM that are more restricted models (i.e., ML FMM A1 and A2) were suggested and their performance was compared with ML MIMIC for a multilevel two-factor CFA model. It is because when a research model is more complex (i.e., more than one factors or more than two groups), it is technically impossible to estimate all random group effects as in ML FMM.

The performance of the multiple group analysis on these methods was evaluated on two aspects. First, the performance of the factorial invariance testing was evaluated by setting noninvariance to either factor loadings or intercepts (i.e., weak and strong invariance model testing, respectively) for the two indicators. Second, to evaluate the performance of the latent group mean differences testing, a factorial invariance model in which there is a difference in the mean of the latent variable across within-level groups was simulated. In addition, factorial invariance model with a zero latent mean difference was also generated to assess the Type I error rates.

The major findings of the current study are summarized below to provide guidelines for applied researchers.

First, as expected based on previous studies, ML MIMIC had higher power than ML FMM for the small sample size condition. ML FMM needed more sample sizes than the ML MIMIC in order to obtain more than .8 power based on three ICs (i.e., AIC, BIC, and aBIC) in Study 1. Specifically, AIC using ML MIMIC had sufficient power regardless of the degree of unbalanced design under the small sample sizes condition of CN = 30 and CS = 15, whereas ML FMM’s power was secured only under balanced conditions. In BIC and aBIC, ML FMM required total sample sizes of 3,000 (CN = 100, CS = 30) and 1,500 (CN = 50, CS=30), respectively, whereas ML MIMIC was found to have power exceeding the standard value even under the relatively small sample size conditions of 1,500 (CN = 50, CS = 30) and 900 (CN = 30, CS = 30), respectively. These results are in line with previous studies (Kim & Cao, 2015; Woods, 2009) as the advantage of ML MIMIC for small sample sizes in multiple group analysis across within-level groups using multilevel data.

Second, as a result of examining the performance of SB-LRT, AIC, BIC, and aBIC as a criterion for detecting noninvariance, the performances of strong invariance testing were remarkably better than those of weak invariance testing across two studies. This result is not surprising in that the strong invariance testing is a more parsimonious model that constrains intercepts as well as factor loadings, implying that it is necessary to keep in mind that the conditions of weak invariance testing to reach the proper power in multiple group analysis are important. For both tests (i.e., weak and strong invariance testing), SB-LRT is not recommended due to unacceptably large Type I error rates. AIC showed the best performance among the three ICs criteria under small sample size condition. However, when the sample is an unbalanced type, AIC in the ML MIMIC results was found to be inappropriate for use as a model selection criterion due to unacceptably large Type I errors. BIC and aBIC generally functioned well under sufficiently large sample sizes conditions. To have adequate power in the multilevel one-factor model (Study 1), BIC required total sample of 3,000 (CN = 100, CS = 30) for ML FMM and 1,500 (CN = 50, CS = 30) for ML MIMIC. Also, aBIC required a smaller sample size than this, which was 1,500 (CN = 50, CS = 30) for ML FM and 900 (CN = 30, CS = 30) for ML MIMIC. Similar patterns were found in Study 2, and a proper power was reached even under smaller sample size conditions because of the more restrictive model than in Study 1. That is, when the comparison group size was balanced, in both methods, an appropriate power was reached when the total sample sizes were 1,500 (CN = 50, CS = 30) in BIC and 900 (CN = 30, CS = 30) in aBIC.

Third, to perform a series of factorial invariance testing and latent mean difference testing sequentially, the number of clusters is recommended to be at least 50 and ideally more than 100 in multilevel multiple group analysis across within-level groups. When the number of clusters is less than 50, the power of parameter estimation for the latent mean has not reached the standard value, and among the ICs used as a model selection criterion in factor invariance testing, only AIC can be used in a limited manner depending on the other conditions. This result was consistent with studies of sample sizes in MSEM (Hox et al., 2010; Hox & Maas, 2001; Maas & Hox, 2005). Besides the number of clusters, cluster size also contributes to increasing the total sample size, so it should be considered in the analysis, but it was confirmed that the number of clusters played a more important role in multilevel analysis. Comparing the two conditions of CN = 30, CS = 30 and CN = 50, CS = 15 in this study, the former condition is larger as the total sample sizes is 900 and 750, but the performance of the latter condition with a larger number of clusters was better across many conditions. In addition, in a comparison of the two conditions of CN = 50, CS = 30 and CN = 100, CS = 15 with the same total sample size, it was confirmed that the latter condition with a larger number of clusters has better performance than the former condition with a larger cluster size.

Fourth, as expected, if the sample size between groups is uneven, more samples should be obtained by considering the degree of unbalance. Under the minimum sample size conditions of CN = 30, CS = 15, the AIC of the two methods (i.e., ML FMM and ML MIMIC) secured power only in a balanced design but had low power under unbalanced conditions. However, if the sample sizes were large enough (CN = 100, CS = 30 in all three ICs), the uneven conditions were not an issue. It is not possible to conclude that the effects of the unbalanced design types are considerably critical given that the magnitude of the unbalanced set by the conditions in this study is not small at 2:1 and 3:1, but it is clear that researchers should consider it.

To sum up, there is a difference between the two methods in that ML MIMIC has the advantage of testing factorial invariance with small sample sizes. Whereas, the estimation is more difficult than ML FMM in that weak invariance testing takes longer to estimate by including interaction terms with latent variables, and in the case of multifactor models, it is necessary to test one by one for each factor. Nevertheless, the differences were not remarkable; therefore, both methods can be suitably used for multilevel multiple group analysis across within-level groups when researchers understand the performance patterns of the two methods and utilize the pros and cons.

Differentiation from previous studies and the implications of this study are as follows. First, this study compared two methods of ML FMM and ML MIMIC regarding sequential procedures (i.e., weak and strong invariance testing and latent means difference testing after holding for factorial invariance) in multiple group analysis and the performance in each procedure was evaluated integrally. Second, the current study confirmed by simulation that the design type in groups (i.e., balanced or unbalanced sample between the groups) is an important factor for the results of multiple group analysis. Third, given the fact that it was difficult to apply the standard analysis method (used in Study 1) to a complex model (e.g., multiple-factor or three or more groups comparison model), alternative models of ML FMM were presented and the performances were assessed. The major findings of Study 2 are expected to be generalizable to a more complex model. For example, ML FMM A1 or A2 can be applicable to a multilevel three-factor model or many groups analysis model.

The limitation of this study and future studies should be noted. First, in order to reflect the reality in the population model, heterogeneity (i.e., heterogeneous factor variance and residual variance) was assumed across within-level groups, and different factor loadings across within and between levels were set despite the cross-level factor loading constraint imposed on the analysis model. However, since they were not considered conditions (i.e., degree of heterogeneous variance and residual variance, degree of factor loading differences), overgeneralizations should be avoided. For example, when there are considerable differences in factor loading between levels, much deliberation is necessary for the application of these findings of the current study. Second, this study examined a model with an equal factor structure across levels (i.e., the same number of factors at both levels). When the model has unequal factor structures across levels, it is not possible to perform a factorial invariance testing in a multiple group approach that splits data by groups and imposes necessary constraints on each group. Not only is cross-level factor loading invariance required, but also intercept and latent group means are estimated at between level in multilevel multiple group analysis. In this case, a MIMIC approach that includes the group as a covariate can be more appropriate. In the MIMIC approach, all factorial invariance testing procedures are performed at the within level, in which the grouping variable exists, and latent group mean testing is also conducted at the corresponding level. Therefore, future studies will need to examine the performance of methods using ML MIMIC to models with unequal factor structures across levels.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Sookyoung Son

Notes

References

Akaike

(1987). Factor analysis and AIC. Psychometrika, 52, 317-322. https://doi.org/10.1007/BF02294359

Asparouhov

Muthén

(2008). Multilevel mixture models. In Hancock

G. R.

Samuelsen

K. M.

(Eds.), Advances in latent variable mixture models (pp. 27-51). Information Age.

Asparouhov

Muthén

(2012a, April 17). Computing the strictly positive Satorra–Bentler chi-square test in Mplus (Mplus Web Notes No. 12). http://statmodel.com/examples/webnotes/SB5.pdf

Asparouhov

Muthén

(2012b, November 15). Multiple group multilevel analysis (Mplus Web Notes no. 16). http://statmodel.com/examples/webnotes/webnote16.pdf

Cheung

G. W.

Rensvold

R. B.

(1998). Cross-cultural comparisons using non-invariant measurement items. Applied Behavioral Science Review, 6(1), 93-110. https://doi.org/10.1016/S1068-8595(99)80006-3

Finch

W. H.

French

B. F.

(2011). Estimation of MIMIC model parameters with multilevel data. Structural Equation Modeling: A Multidisciplinary Journal, 18(2), 229-252. https://doi.org/10.1080/10705511.2011.557338

French

B. F.

Finch

W. H.

(2008). Multigroup confirmatory factor analysis: Locating the invariant referent sets. Structural Equation Modeling, 15(1), 96-113. https://doi.org/10.1080/10705510701758349

Heck

R. H.

Thomas

S. L.

(2009). An introduction to multilevel modeling techniques (2nd ed.). Routledge.

Henson

J. M.

Reise

S. P.

Kim

K. H.

(2007). Detecting mixtures from structural model differences using latent variable mixture modeling: A comparison of relative model fit statistics. Structural Equation Modeling, 14(2), 202-226. https://doi.org/10.1080/10705510709336744

10.

Hox

J. J.

(1998). Multilevel modeling: When and why. In Balderjahn

Mathar

Schader

(Eds.), Classification, data analysis, and data highways (pp. 147-154). Springer Verlag. https://doi.org/10.1007/978-3-642-72087-1_17

11.

Hox

J. J.

Maas

C. J.

(2001). The accuracy of multilevel structural equation modeling with pseudobalanced groups and small samples. Structural Equation Modeling, 8(2), 157-174. https://doi.org/10.1207/S15328007SEM0802_1

12.

Hox

J. J.

Maas

C. J.

Brinkhuis

M. J.

(2010). The effect of estimation method and sample size in multilevel structural equation modeling. Statistica Neerlandica, 64(2), 157-170. https://doi.org/10.1111/j.1467-9574.2009.00445.x

13.

Jak

(2018). Cross-level invariance in multilevel factor models. Structural Equation Modeling: A Multidisciplinary Journal, 26(4), 607-622. https://doi.org/10.1080/10705511.2018.1534205

14.

Jak

Jorgensen

T. D.

(2017). Relating measurement invariance, cross-level invariance, and multilevel reliability. Frontiers in Psychology, 8, Article 1640. https://doi.org/10.3389/fpsyg.2017.01640

15.

Jak

Oort

F. J.

(2015). On the power of the test for cluster bias. British Journal of Mathematical and Statistical Psychology, 68(3), 434-455. https://doi.org/10.1111/bmsp.12053

16.

Jak

Oort

F. J.

Dolan

C. V.

(2014). Measurement bias in multilevel data. Structural Equation Modeling: A Multidisciplinary Journal, 21(1), 31-39. https://doi.org/10.1080/10705511.2014.856694

17.

Jöreskog

K. G.

Goldberger

A. S.

(1975). Estimation of a model with multiple indicators and multiple causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631-639. https://doi.org/10.2307/2285946

18.

Kim

E. S.

Cao

(2015). Testing group mean differences of latent variables in multilevel data using multiple-group multilevel CFA and multilevel MIMIC modeling. Multivariate Behavioral Research, 50(4), 436-456. https://doi.org/10.1080/00273171.2015.1021447

19.

Kim

E. S.

Cao

Wang

Nguyen

D. T.

(2017). Measurement invariance testing with many groups: A comparison of five approaches. Structural Equation Modeling: A Multidisciplinary Journal, 24(4), 524-544. https://doi.org/10.1080/10705511.2017.1304822

20.

Kim

E. S.

Kwok

Yoon

(2012). Testing factorial invariance in multilevel data: A Monte Carlo study. Structural Equation Modeling: A Multidisciplinary Journal, 19(2), 250-267. https://doi.org/10.1080/10705511.2012.659623

21.

Kim

E. S.

Willson

V. L.

(2014). Measurement Invariance across groups in latent growth modeling. Structural Equation Modeling: A Multidisciplinary Journal, 21(3), 408-424. https://doi.org/10.1080/10705511.2014.915374

22.

Kim

E. S.

Yoon

(2011). Testing measurement invariance: a comparison of multiple-group categorical CFA and IRT. Structural Equation Modeling: A Multidisciplinary Journal, 18(2), 212-228. https://doi.org/10.1080/10705511.2011.557337

23.

Kim

E. S.

Yoon

Lee

(2012). Testing measurement invariance using MIMIC. Educational and Psychological Measurement, 72(3), 469-492. https://doi.org/10.1177/0013164411427395

24.

Kim

E. S.

Yoon

Wen

Luo

Kwok

O. M.

(2015). Within level group factorial invariance with multilevel data: Multilevel factor mixture and multilevel mimic models. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 603-616. https://doi.org/10.1080/10705511.2014.938217

25.

Lubke

G. H.

Muthén

(2005). Investigating population heterogeneity with factor mixture models. Psychological Methods, 10(1), 21-39. https://doi.org/10.1037/1082-989X.10.1.21

26.

Maas

C. J.

Hox

J. J.

(2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86-92. https://doi.org/10.1027/1614-2241.1.3.86

27.

Mellenbergh

G. J.

(1989). Item bias and item response theory. International Journal of Educational Statistics, 13, 127-143. https://doi.org/10.1016/0883-0355(89)90002-5

28.

Meredith

(1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543. https://doi.org/10.1007/BF02294825

29.

Meredith

Millsap

R. E.

(1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57, 289-311. https://doi.org/10.1007/BF02294510

30.

Millsap

R. E.

(1997). Invariance in measurement and prediction: Their relationship in the single-factor case. Psychological Methods, 2(3), 248-260. https://doi.org/10.1037/1082-989X.2.3.248

31.

Morin

A. J.

Meyer

J. P.

Creusier

Biétry

(2016). Multiple-group analysis of similarity in latent profile solutions. Organizational Research Methods, 19(2), 231-254. https://doi.org/10.1177/1094428115621148

32.

Muthén

B. O.

(1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557-585. https://doi.org/10.1007/BF02296397

33.

Muthén

B. O.

(1994). Multilevel covariance structure analysis. Sociological Methods & Research, 22, 376-398. https://doi.org/10.1177/0049124194022003006

34.

Muthén

B. O.

Muthén

L. K.

(2015). Mplus 7.4 [Computer software]. Muthén & Muthén.

35.

Muthén

L. K.

Muthén

B. O.

(2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 599-620. https://doi.org/10.1177/0049124194022003006

36.

Nylund

K. L.

Asparouhov

Muthén

B. O.

(2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535-569. https://doi.org/10.1080/10705510701575396

37.

Reise

S. P.

Widaman

K. F.

Pugh

R. H.

(1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114(3), 552-566. https://doi.org/10.1037/0033-2909.114.3.552

38.

Ryu

(2014). Factorial invariance in multilevel confirmatory factor analysis. British Journal of Mathematical and Statistical Psychology, 67(1), 172-194. https://doi.org/10.1111/bmsp.12014

39.

Ryu

(2015). Multiple group analysis in multilevel structural equation model across level 1 groups. Multivariate behavioral research, 50(3), 300-315. https://doi.org/10.1080/00273171.2014.1003769

40.

Ryu

Mehta

(2017). Multilevel factorial invariance in n-level structural equation modeling (nSEM). Structural Equation Modeling: A Multidisciplinary Journal, 24(6), 936-959. https://doi.org/10.1080/10705511.2017.1324311

41.

Satorra

Bentler

P. M.

(1994). Corrections to test statistics and standard errors in covariance structure analysis. In von Eye

Clogg

C. C.

(Eds.), Latent variables analysis: Applications for developmental research (pp. 399-419). Sage.

42.

Satorra

Bentler

P. M.

(2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507-514. https://doi.org/10.1007/BF02296192

43.

Satorra

Bentler

P. M.

(2010). Ensuring positiveness of the scaled difference chi-square test statistic. Psychometrika, 75, 243-248. https://doi.org/10.1007/s11336-009-9135-y

44.

Schwarz

(1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464.https://doi.org/10.1214/aos/1176344136

45.

Sclove

S. L.

(1987). Application of model-selection criteria to some problems in multivariate analysis. Psychometrika, 52, 333-343. https://doi.org/10.1007/BF02294360

46.

Stark

Chernyshenko

O. S.

Drasgow

(2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292-1306. https://doi.org/10.1037/0021-9010.91.6.1292

47.

Tofighi

Enders

C. K.

(2008). Identifying the correct number of classes in a growth mixture model. In Hancock

G. R.

Samuelsen

K. M.

(Eds.), Advances in latent variable mixture models (pp. 317-341). Information Age.

48.

Vandenberg

R. J.

Lance

C. E.

(2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70. https://doi.org/10.1177/109442810031002

49.

Widaman

K. F.

Reise

(1997). Exploring the measurement invariance of psychological instruments: applications in the substance use domain. In Bryant

K. J.

Windle

West

S. G.

(Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281-324). American Psychological Association. https://doi.org/10.1037/10222-009

50.

Woods

C. M.

(2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44(1), 1-27. https://doi.org/10.1080/00273170802620121

51.

Woods

C. M.

Grimm

K. J.

(2011). Testing for nonuniform differential item functioning with multiple indicator multiple cause models. Applied Psychological Measurement, 35(5), 339-361. https://doi.org/10.1177/0146621611405984

52.

Zyphur

M. J.

Kaplan

S. A.

Christian

M. S.

(2008). Assumptions of cross-level measurement and structural invariance in the analysis of multilevel data: Problems and solutions. Group Dynamics: Theory, Research, and Practice, 12(2), 127-140. https://doi.org/10.1037/1089-2699.12.2.127