Abstract
Growth mixture modeling has gained much attention in applied and methodological social science research recently, but the selection of the number of latent classes for such models remains a challenging issue, especially when the assumption of proper model specification is violated. The current simulation study compared the performance of a linear growth mixture model (GMM) for determining the correct number of latent classes against a completely unrestricted multivariate normal mixture model. Results revealed that model convergence is a serious problem that has been underestimated by previous GMM studies. Based on two ways of dealing with model nonconvergence, the performance of the two types of mixture models and a number of model fit indices in class identification are examined and discussed. This article provides suggestions to practitioners who want to use GMM for their research.
Growth mixture models (GMMs), which facilitate the exploration and identification of unobserved group-based growth curves in longitudinal data, have been used to investigate developmental processes within and beyond the social and behavioral sciences, including alcohol use in college (e.g., Greenbaum, Del Boca, Darkes, Wang, & Goldman, 2005), depression patterns (e.g., Stoolmiller, Kim, & Capaldi, 2005), reading skills from kindergarten to fifth grade (e.g., Douglas & Liu, 2009), medication effects (e.g., Muthén, Brown, Hunter, Cook, & Leuchter, 2011), and criminal behavior (Kreuter & Muthén, 2008a). However, as Bauer (2007) and Bauer and Curran (2003, 2004) have pointed out, the enumeration of latent classes remains a challenging issue for a variety of reasons. First, the likelihood surface may be more complex than for single-class growth or structural models, leading to local maxima during optimization and in turn over- or underextraction of growth classes; for this reason the use of multiple random starting values across a wide range of parameter space has been recommended (Hipp & Bauer, 2006). Second, a lack of within-class normality could lead to an overextraction of classes and might require the use of second-order GMM (Grimm & Ram, 2009), nonparametric GMM (Kreuter & Muthén, 2008b; Muthén & Asparouhov, 2008), or skew-normal mixture models (Azzalini, 1985, 2005; Chang, 2005). Third, having data that originated from a complex sampling scheme may lead to overextraction, which may or may not be ameliorated through design-based or model-based strategies (Chen, Kwok, Luo, & Willson, 2010; Hamilton, 2009). Fourth, having data that are missing not at random could lead to underextraction of classes, for which pattern mixture models and/or probability weighting methods may be necessary (Bauer, 2007).
Perhaps even more fundamental than all of the above, however, is the assumption that one has the correct within-class model in the first place. As Bauer and Curran (2004) observed, when the within-class structural model is overly restricted relative to the true model, additional latent classes might emerge in an attempt to capture unaccommodated variance–covariance of the repeated measures. They illustrated this point by fitting a more restricted latent class growth model to data generated from a single-class latent curve model. To address this class overextraction problem, a two-step modeling process has been suggested (Bauer & Curran, 2004; Yung, 1997), the viability of which remains to be assessed. In the first step, unrestricted (saturated) multivariate normal mixture models (UMMs) with different numbers of latent classes would be estimated and compared according to select fit indices. The second step, then, assuming the number of latent classes is correctly identified in the first step, would be to fit the hypothesized within-class GMMs to assess which, if any, adequately capture the within-class mean and covariance structures underlying the data. To further explicate the rationale of this two-step procedure, the key differences between the UMM and GMM are noted below.
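For the UMM, the probability density function is

$$f(\mathbf{y}_i) = \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

in which $\pi_k$ is the mixing proportion of the kth class (with $\sum_{k=1}^{K} \pi_k = 1$), $\phi$ is the multivariate normal density, and $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the class-specific mean vector and covariance matrix, both left unstructured in the UMM.
Note further that for GMMs, the model-implied mean vector for the kth class is structured as

$$\boldsymbol{\mu}_k = \boldsymbol{\Lambda} \boldsymbol{\alpha}_k,$$

where $\boldsymbol{\Lambda}$ is the matrix of fixed time scores,
and two linearly related growth factors intercept and slope define the factor mean vector

$$\boldsymbol{\alpha}_k = (\alpha_{Ik}, \alpha_{Sk})'.$$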
Unfortunately, as Bauer and Curran (2004) also noted, the fewer restrictions one places on a mixture model, the more taxing the number of parameters becomes on model fitting. Consider, for example, a linear GMM with four time points, heterogeneous residual variances, and no constraints across classes. For GMMs with one, two, and three classes, the numbers of parameters requiring estimation would be 9, 19, and 29, respectively; by comparison, UMMs would correspondingly estimate 14, 29, and 44 parameters. With more time points, the disparities become even more dramatic: for example, for a seven-time-point model with one, two, and three classes, the numbers of parameters for GMMs versus UMMs would be 12 versus 35, 25 versus 71, and 38 versus 107, respectively. As a consequence, we expect that the opposite of overextraction of latent trajectory classes can occur as well. More specifically, fitting a model with an overly flexible within-class specification relative to the true model can result in the adoption of too few classes. The researcher’s challenge, then, is to try to find a practical balance between the potential costs of misspecification (e.g., spurious latent classes, parameter bias) and the costs of overparameterization (e.g., convergence problems and underextraction of latent trajectories).
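The parameter counts quoted above can be reproduced with a short calculation. The sketch below assumes that each class of the linear GMM contributes 2 growth factor means, 3 factor variances and covariances, and one residual variance per time point, that each class of the UMM contributes a full mean vector and covariance matrix, and that K classes add K − 1 mixing proportions. The function names are illustrative only.

```python
def gmm_param_count(n_timepoints: int, n_classes: int) -> int:
    """Free parameters in a linear GMM with no cross-class constraints."""
    factor_means = 2                 # intercept and slope means
    factor_covs = 3                  # two variances and one covariance
    residual_vars = n_timepoints     # heterogeneous residual variances
    per_class = factor_means + factor_covs + residual_vars
    return n_classes * per_class + (n_classes - 1)  # plus mixing proportions


def umm_param_count(n_timepoints: int, n_classes: int) -> int:
    """Free parameters in an unrestricted multivariate normal mixture."""
    means = n_timepoints
    covs = n_timepoints * (n_timepoints + 1) // 2   # full covariance matrix
    return n_classes * (means + covs) + (n_classes - 1)


for t in (4, 7):
    for k in (1, 2, 3):
        print(t, k, gmm_param_count(t, k), umm_param_count(t, k))
```

Under these assumptions the loop reproduces every count given in the text for the four- and seven-time-point models.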
When making comparisons among models differing in structure and/or number of classes, many sources of evidence for model selection exist, although no universal standard has emerged. Only a few simulation studies have examined the relative efficiency of approaches for class enumeration in GMMs. Work by Tofighi and Enders (2008), which examined a limited number of GMM conditions, recommended the sample-size-adjusted Bayesian information criterion (SABIC; Sclove, 1987) and the Lo–Mendell–Rubin (LMR; Lo, Mendell, & Rubin, 2001) test for determining the number of classes for GMMs over the BIC, AIC, CAIC, and multivariate skewness and kurtosis tests. Nylund, Asparouhov, and Muthén (2007), on the other hand, found that the bootstrap likelihood ratio test (BLRT; McLachlan, 1987) outperformed chi-square and LMR likelihood ratio tests, and that the consistent AIC (CAIC; Bozdogan, 1987) performed better than the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), and the SABIC in GMM. Henson, Reise, and Kim (2007) also recommended using the SABIC in a structural equation mixture model measuring the relation between two latent factors, but they found that no indices performed well when sample sizes were below 500. Tolvanen (2007) investigated the functionality of GMMs with a limited total sample size, with results suggesting that, among the AIC, BIC, SABIC, LMR, and BLRT, the BIC was most useful when N < 500 while the SABIC performed best when N ≥ 500. Most recently, Peugh and Fan (2012) examined both homogeneous (only one growth trajectory) and heterogeneous (three different growth trajectories) growth patterns underlying longitudinal data with two sample size conditions. They used a broad range of indices and found that entropy-based indices outperformed other information criteria (ICs) and likelihood ratio tests in identifying the homogeneous growth curve; however, when identifying the three heterogeneous growth patterns, none of the indices performed well.
The only exception was that when the three growth curves differed in both growth intercepts and slopes and the sample size was as large as 3,000, Hannan and Quinn’s information criterion (HQ; Hannan & Quinn, 1979), the sample-size-adjusted HQ, and the BLRT achieved an acceptable accuracy rate of 90%. Given these somewhat varied findings, in part due to the different models and conditions investigated, the current study includes a wide array of diagnostics, including ICs, likelihood ratio tests, and classification statistics.
The goal of the current study, then, is to compare the performance of a linear GMM to that of a UMM in terms of their ability to identify the correct number of latent classes under various experimental conditions. The manipulated modeling conditions, model selection diagnostics, and outcome measures used in the current simulation study are detailed below.
Method
Described below are the three conditions that were manipulated to simulate data, as well as the details of the models fitted to each data set, the diagnostics for model selection, and the outcome measures used to gauge the performance of the diagnostics with the linear GMM and UMM specifications.
Data Generation
All sample data were simulated using SAS IML from variations on a two-class GMM with four equally spaced time points, as depicted in Figure 1. Because preliminary analyses led us to suspect that convergence rates for overextracted GMMs were overestimated in Mplus, and because Nylund et al. (2007) is one of the few simulation studies that reported convergence rates for GMM, the parameter values for our data generation model were chosen to subsume those of Nylund et al. (2007) for purposes of replicating their findings, as shown in Table 1. As Tolvanen (2007) suggested, the Mahalanobis distance (MD) should be larger than 2 for the two true latent classes to be identifiable, assuming the sample size is relatively large. The class separation in the current study was created along both the intercept and slope factors, equaling 3.2 MD units in the latent growth factor space (3.07 squared units in the measured variable space). Unlike Peugh and Fan (2012), we did not consider classes sharing the same intercept or the same slope in our population model, for two reasons. First, it is rare to find a longitudinal measure on which all groups start from the same level. Second, if groups share the same slope but differ in intercept, identifying heterogeneity may not be so important because one growth curve with a large intercept variance may fit the data as well as three growth curves with small intercept variances. In this population model, 45% to 66% of the observed variable variance is explained by linear growth. For the simulation study, three conditions were manipulated and crossed completely in a 4 × 2 × 2 design, as described below. Because 100 replications have been used in previous GMM simulation studies (e.g., Nylund et al., 2007), 100 replications were attempted for each of the 16 cells in the current study, yielding 1,600 generated data sets in total.
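Class separation of this kind can be quantified as a Mahalanobis distance between the class-specific growth factor mean vectors. The sketch below assumes a within-class factor covariance matrix shared by both classes, and it uses made-up means and (co)variances rather than the Table 1 values; the function name is likewise illustrative.

```python
import math


def mahalanobis_2d(mean_a, mean_b, cov):
    """Mahalanobis distance between two 2-dimensional mean vectors.

    cov is a 2x2 covariance matrix ((s11, s12), (s21, s22)) assumed to be
    common to both classes (an assumption of this sketch).
    """
    d1 = mean_a[0] - mean_b[0]
    d2 = mean_a[1] - mean_b[1]
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # d' * inverse(cov) * d, written out for the 2x2 case.
    squared = (d1 * d1 * s22 - d1 * d2 * (s12 + s21) + d2 * d2 * s11) / det
    return math.sqrt(squared)


# Hypothetical (intercept, slope) means and covariance, not the article's.
md = mahalanobis_2d((2.0, 0.5), (1.0, 0.0), ((0.25, 0.0), (0.0, 0.04)))
print(round(md, 2))
```

With these illustrative inputs the distance is about 3.2, comfortably above the threshold of 2 suggested by Tolvanen (2007).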

Figure 1. Population GMM used for data generation.
Table 1. Population Growth Mixture Model Parameters.
Model Propriety
To examine whether the UMM specification outperforms the GMM in terms of accuracy of class enumeration, both properly and improperly specified population GMMs were used to generate sample data. To create improperly specified models, a quadratic term was added to the first latent class in the population linear GMM, as illustrated in Figure 1 with dashed lines. The parameters for this quadratic component were chosen to be deliberately small in magnitude (see Table 1), so that this subtle nonlinearity would not be detected by visual inspection of a spaghetti plot of sample data. As such, it is highly possible that this growth pattern would incorrectly be assumed to be linear during estimation. Moreover, a UMM is still technically correct because it does not assume a linear growth function, whereas the linear GMM is not correct in this case.
Sample Size
Total sample size (i.e., across classes) was varied to be N = 400, 700, 1,000, and 2,000, following a careful review of substantive GMM applications by Tofighi and Enders (2008), which deemed these sizes as covering “a range of values between approximately the 25th and 75th percentiles of the sample size distribution” (p. 325) in the published GMM studies.
Mixing Proportions
In addition to the 75% versus 25% mixing proportion used in Nylund et al. (2007), we added a balanced 50% versus 50% condition.
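The fully crossed design and the data-generating process can be sketched as follows. This is a minimal illustration in Python rather than the SAS IML code used in the study: all class parameter values are hypothetical (they are not the Table 1 values), the assignment of the 75% proportion to the first class is an assumption, and the quadratic component is treated as a fixed mean shift in the first class only.

```python
import itertools
import math
import random

MODEL_PROPRIETY = ("linear", "quadratic")
SAMPLE_SIZES = (400, 700, 1000, 2000)
MIXING_PROPORTIONS = ((0.75, 0.25), (0.50, 0.50))
REPLICATIONS_PER_CELL = 100
TIME_SCORES = (0.0, 1.0, 2.0, 3.0)   # four equally spaced occasions

# Hypothetical values for illustration only (NOT the article's Table 1).
CLASS_PARAMS = (
    {"int_mean": 2.0, "slope_mean": 0.5, "int_sd": 0.5, "slope_sd": 0.2,
     "corr": 0.0, "quad": 0.05},
    {"int_mean": 1.0, "slope_mean": 0.0, "int_sd": 0.5, "slope_sd": 0.2,
     "corr": 0.0, "quad": 0.0},
)
RESIDUAL_SD = 0.4

DESIGN_CELLS = list(
    itertools.product(MODEL_PROPRIETY, SAMPLE_SIZES, MIXING_PROPORTIONS)
)


def simulate_case(params, propriety, rng):
    """Draw one person's four repeated measures from one class."""
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    intercept = params["int_mean"] + params["int_sd"] * z1
    # Correlated slope via a hand-written 2x2 Cholesky factor.
    slope = params["slope_mean"] + params["slope_sd"] * (
        params["corr"] * z1 + math.sqrt(1 - params["corr"] ** 2) * z2
    )
    quad = params["quad"] if propriety == "quadratic" else 0.0
    return [
        intercept + slope * t + quad * t * t + rng.gauss(0, RESIDUAL_SD)
        for t in TIME_SCORES
    ]


def simulate_dataset(n, mixing, propriety, seed):
    """Return a list of (true_class, [y1, y2, y3, y4]) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        k = rng.choices(range(len(mixing)), weights=mixing)[0]
        rows.append((k, simulate_case(CLASS_PARAMS[k], propriety, rng)))
    return rows


print(len(DESIGN_CELLS), len(DESIGN_CELLS) * REPLICATIONS_PER_CELL)
```

Enumerating the cells this way confirms the 16 conditions and 1,600 data sets described above; each call to `simulate_dataset` would produce one replication for one cell.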
Model Estimation
Two types of mixture models (i.e., linear GMM and UMM), each with one, two, and three latent classes, were used to analyze the 1,600 two-class data sets in Mplus Version 6 (Muthén & Muthén, 2008). Estimation was carried out using maximum likelihood via an expectation maximization algorithm, employing the default convergence criterion of .001 for the complete-data log likelihood derivative (note that changing the convergence criterion to a more liberal .01 did not facilitate convergence in any of the nonconvergent cases examined in this study). For each of the two types of mixture models, an underextracted (1-class) model, a proper (2-class) model, and an overextracted (3-class) model were evaluated. All parameters were allowed to be class-specific; that is, no cross-class model constraints were imposed for any model. Finally, multiple sets of random starting values were implemented in Mplus to avoid any irregularities on the likelihood surface and to differentiate local maxima from the global optimum for estimation of mixture models (e.g., McLachlan & Peel, 2000; Muthén & Muthén, 2001). Specifically, the Mplus setting used was “STARTS=100 10; STITERATIONS=20;” indicating that 100 sets of starting values were generated in the initial stage and 10 sets were used in final stage optimizations, with 20 iterations allowed in the initial stage. The population parameter values were not used as starting values because we wanted to explore the full likelihood space without a strong assumption about the true values, which is a more realistic situation for researchers.
Model Selection
Three classes of indices were used for model selection. First, the ICs used in this study included AIC, CAIC, sample-size adjusted CAIC (SACAIC; see Tofighi & Enders, 2008), BIC (Schwarz, 1978), sample-size adjusted BIC (SABIC; Sclove, 1987), Draper’s BIC (DBIC; Draper, 1995), HQ, and Hurvich and Tsai’s AIC (HT-AIC; Hurvich & Tsai, 1989), which have been investigated for determining the number of latent classes in latent class analysis models (e.g., Yang & Yang, 2007). All of these indices are computed from the deviance, the number of estimated model parameters p, and the sample size N; they differ in how the two adjustment (penalty) factors p and N are combined (Peugh & Fan, 2012). According to their computational formulas and our specific simulated conditions, we would expect that in UMMs the AIC and HT-AIC should have similarly poor performance, while the BIC and CAIC may perform worse than the SABIC, SACAIC, DBIC, and HQ because they put more penalty weight on the number of parameters.
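To make the role of p and N concrete, the following sketch computes the eight ICs from a log likelihood. The formulas are commonly cited forms of each index; the article does not print its formulas, so the exact variants (in particular for HT-AIC and SACAIC) should be read as assumptions, as should the function name.

```python
import math


def information_criteria(loglik: float, p: int, n: int) -> dict:
    """Common forms of the ICs; smaller values indicate the preferred model.

    loglik: maximized log likelihood, p: number of free parameters,
    n: total sample size.
    """
    deviance = -2.0 * loglik
    n_star = (n + 2) / 24.0          # sample-size adjustment used by SABIC
    return {
        "AIC": deviance + 2 * p,
        "HT-AIC": deviance + 2 * p + 2 * p * (p + 1) / (n - p - 1),
        "BIC": deviance + p * math.log(n),
        "SABIC": deviance + p * math.log(n_star),
        "CAIC": deviance + p * (math.log(n) + 1),
        "SACAIC": deviance + p * (math.log(n_star) + 1),
        "DBIC": deviance + p * (math.log(n) - math.log(2 * math.pi)),
        "HQ": deviance + 2 * p * math.log(math.log(n)),
    }


# Illustrative inputs only: a model with 10 parameters fit to N = 400.
ic = information_criteria(loglik=-1000.0, p=10, n=400)
for name, value in sorted(ic.items(), key=lambda item: item[1]):
    print(f"{name:7s} {value:9.2f}")
```

Under these forms, the per-parameter penalty at N = 400 is largest for the CAIC and BIC and smallest for the AIC, which is the ordering behind the expectation stated above.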
Second, three likelihood ratio tests were investigated, using α = .05 for all: the LMR test (Lo et al., 2001), the Vuong–Lo–Mendell–Rubin test (VLMR; Lo et al., 2001), and the bootstrap likelihood ratio test (BLRT; McLachlan, 1987). Compared to ICs, likelihood ratio tests are often more demanding because the statistics require following certain asymptotic distributions, or empirically constructing the necessary distributions, in order to obtain an accurate probabilistic statement (e.g., p value) to assist in model selection.
Third, classification-based statistics were used as well, which inform model selection on the basis of the accuracy of classifying cases. Although such statistics cannot be used as absolute fit indices, given that some mixture models necessarily have overlapping components, they can be used to compare the relative performance of models. Four classification-based statistics were investigated in this study: entropy (Ramaswamy, DeSarbo, Reibstein, & Robinson, 1993), the normalized entropy criterion (NEC; Celeux & Soromenho, 1996), the integrated classification likelihood (ICL-BIC; McLachlan & Peel, 2000), and the classification likelihood information criterion (CLC; McLachlan & Peel, 2000).
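These classification-based statistics are all functions of the posterior class probabilities. The sketch below uses commonly cited forms: the entropy of classification EN(K), the relative entropy of Ramaswamy et al. (1993), CLC = −2LL + 2EN(K), ICL-BIC = BIC + 2EN(K), and NEC = EN(K) / (LL_K − LL_1). The article does not print its formulas, so these forms and the function names are assumptions.

```python
import math


def classification_entropy(posteriors):
    """EN(K) = -sum_i sum_k p_ik * ln(p_ik), treating 0 * ln(0) as 0."""
    return -sum(p * math.log(p) for row in posteriors for p in row if p > 0)


def relative_entropy(posteriors):
    """Entropy statistic in [0, 1]; values near 1 mean clear classification."""
    n, k = len(posteriors), len(posteriors[0])
    if k < 2:
        raise ValueError("relative entropy is undefined for a 1-class model")
    return 1.0 - classification_entropy(posteriors) / (n * math.log(k))


def clc(loglik, posteriors):
    return -2.0 * loglik + 2.0 * classification_entropy(posteriors)


def icl_bic(loglik, p, posteriors):
    n = len(posteriors)
    return -2.0 * loglik + p * math.log(n) + 2.0 * classification_entropy(posteriors)


def nec(loglik_k, loglik_1, posteriors):
    return classification_entropy(posteriors) / (loglik_k - loglik_1)


# Toy posterior probabilities for two cases and two classes.
print(relative_entropy([[1.0, 0.0], [0.0, 1.0]]))   # perfectly separated
print(relative_entropy([[0.5, 0.5], [0.5, 0.5]]))   # no separation
```

The two toy inputs mark the extremes of the relative entropy scale: perfectly separated cases versus cases that are equally likely to belong to either class.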
Outcome Measures and Ways of Summarizing Results
The primary interest of the current study is the percentage of accurate class enumeration, so this information was summarized in different ways to address the research questions. First, this measure was averaged over all 16 conditions to generally assess the relative efficiency of types of mixture models and various fit indices. Second, this measure was analyzed in terms of the three manipulated factors to examine their main and interaction effects on class enumeration.
Because model convergence can present a challenge for finite mixture models, especially when the number of latent classes is unknown (e.g., Chen, 1995), two issues related to convergence should be stated explicitly in order to summarize results clearly and appropriately.
First, only replications with proper solutions were counted as converged results. Any replication with an offending estimate, such as a Heywood case (i.e., negative error variance), was considered nonconverged even though model fit indices and estimated parameters might be reported in the Mplus standard output and Mplus compact output files. This definition is consistent with Nylund et al.’s (2007) view of an admissible solution but differs from that of Tolvanen (2007). In the latter study, negative error variance was considered a normal variation of sampling because its occurrence in correctly specified two-class GMMs was found to correspond to smaller sample sizes and MDs; the author thus chose to include these replications in the aggregation of results. In the current study, however, model misspecification is the dominant factor affecting nonconvergence given sufficiently large sample size and MD (N > 1,000, MD > 3). Hence, replications with offending estimates were not included; as will be seen, rates of convergence to proper solutions can differ markedly from those including inadmissible estimates, with the latter yielding an often severely distorted view of models’ success.
Second, having determined what constitutes a nonconvergent case, studies of mixture models must decide how to handle such cases, including those with negative variances. Nonconvergent replications are sometimes left unaddressed (e.g., Peugh & Fan, 2012), or they are taken as evidence supporting a model with fewer classes (e.g., Nylund et al., 2007; Tofighi & Enders, 2008); thus, for example, a nonconverged 3-class model could be viewed as evidence supporting the selection of a 2-class model (assuming that the 1-class model would not be selected over the 2-class model). As Pastor and Gagné (2013) recently argued, however, failure to converge properly should not be used to choose among candidate models because it can occur even when the correct model is fit to the data (as will be seen within some 2-class models in this study). In addition, in principle, the failure to converge can occur because too few classes are being forced on the data rather than too many. In the current study, results are reported in two ways. The first (and somewhat involved) approach is to apply 4- and 5-class models to the data sets for which the 3-class model did not converge; if and only if none of the 3-, 4-, and 5-class models reaches convergence, the nonconverged replication is included as evidence supporting a 2-class model (if the 1-class model is not superior). Of course, this more involved approach is not without limitations: models with 6 or more classes could, in theory, still fit, and some of the 3- or higher-class models might have converged had a different or more efficient algorithm been available. Thus, the second approach is to discard all nonconvergent replications because they are not valid results. Reporting results separately in this way might, in fact, reveal more complete information than previous GMM research, since nonconvergence is such a prevalent issue.
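The two reporting rules can be expressed as a small decision function. This is only a sketch of the logic as described above: the status labels, the function name, and the return codes are hypothetical, and the case in which a 4- or 5-class model does converge is returned as "unresolved" because the text does not say how such replications were tallied.

```python
PROPER = "proper"      # converged with admissible estimates
IMPROPER = "improper"  # converged with an offending estimate (e.g., Heywood case)
FAILED = "failed"      # did not converge


def classify_replication(status_by_k, two_beats_one, approach):
    """Decide how one replication enters the class-enumeration summary.

    status_by_k: dict mapping number of classes to PROPER/IMPROPER/FAILED.
    two_beats_one: True if the index under study prefers 2 classes over 1.
    approach: "refit" (fit 4 and 5 classes first) or "discard".
    """
    if approach not in ("refit", "discard"):
        raise ValueError("approach must be 'refit' or 'discard'")
    if status_by_k[3] == PROPER:
        return "compare_1_2_3"       # indices are compared as usual
    if approach == "discard":
        return "discard"             # second approach: drop the replication
    # First approach: improper solutions count as nonconverged, so only a
    # proper 4- or 5-class solution blocks the "evidence for 2" reading.
    if any(status_by_k.get(k) == PROPER for k in (4, 5)):
        return "unresolved"
    return "supports_2" if two_beats_one else "supports_1"


example = {1: PROPER, 2: PROPER, 3: IMPROPER, 4: FAILED, 5: FAILED}
print(classify_replication(example, two_beats_one=True, approach="refit"))
print(classify_replication(example, two_beats_one=True, approach="discard"))
```

The same replication is thus counted toward the 2-class model under the first rule and dropped under the second, which is why the two sets of percentages reported below can differ sharply.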
Results
To start, as shown in Table 2, all 1-class (underextracted) models converged properly, while 2-class (properly specified) models were nearly perfect as well (convergence rate exceeding 97% across all cells). Only a few replications under the smallest sample size condition of N = 400 for 2-class mixture models failed to reach convergence. When sample size was increased to N = 700 or above, all replications for the 2-class model converged properly. As expected for the 3-class (overextracted) models, however, nonconvergence occurred much more often, given that they are misspecified models (e.g., Nylund et al., 2007). Between the two types of overspecified 3-class mixture models, the more restricted linear GMM had an average convergence rate of 4% (based on only properly converged replications, the values outside the parentheses), whereas the convergence rate of the UMM reached 70%. Clearly, overspecification of latent classes has a more serious effect on the GMM, as evidenced by the dramatic drop in convergence rates from almost 100% to below 5%. Comparing the values outside with those inside parentheses, the rates of properly converged replications in the linear GMM are markedly lower than those including improper solutions (specifically, Heywood cases). Surprisingly, negative variances were not found in 3-class UMMs even though they are also overspecified mixture models. In other words, overspecified UMMs either reached convergence properly or failed to converge, but improper solutions did not occur. This outcome indicates that researchers may be able to avoid inadmissible solutions using UMMs in Mplus, but still need to be cautious when interpreting the program’s GMM output because improper parameter estimates may occur if the number of latent classes in the GMM is overspecified.
This is a bit concerning as one would expect the program to be able to consistently find a maximum value for an objective function and generate a reasonable set of estimates within the parameter space; whether this indicates a problem with the program itself is unknown.
Table 2. Number of Convergent Replications for Two Types of Mixture Models With 1-, 2-, and 3-Class Models Across 16 Conditions.
Note. N is the total sample size; π is the mixing proportion for two latent classes; numbers outside parentheses represent the number of properly converged replications (i.e., excluding inadmissible solutions), whereas numbers within parentheses also include inadmissible solutions. We selected 4 of the 16 conditions (1, 5, 9, and 13) and double-checked the results with 1,000 replications; similar convergence results were found for the 3-class GMM.
As mentioned previously, convergence problems are typically expected more often with more flexible/overparameterized models, such as the UMM in this case. The estimated results from the 2-class models may be consistent with this general rule; nonconvergence occurred in two manipulated conditions of the more flexible 2-class UMM but in only one condition of the 2-class GMM. However, our results for the 3-class models run counter to this rule; the overspecified 3-class GMM has a lower convergence rate than the 3-class UMM once inadmissible solutions are excluded. If this is reflective of a software problem, the true convergence rates might be somewhere between the overly optimistic results including inadmissible solutions and the overly pessimistic results excluding them.
Comparing General Performance of GMM and UMM
Model selection results for GMMs and UMMs across all 16 cells are summarized in Table 3 using the two ways described above: one includes properly converged replications only (outside parentheses), while the other counts improperly converged replications as evidence supporting 2-class models (within parentheses) when the 4- and 5-class models failed as well. As the performance of the LMR and VLMR is almost identical, only the LMR is presented in the table and considered in the following analysis. For each index, the model (UMM or linear GMM) having the higher percentage of selecting the correct 2-class specification is highlighted in bold. In addition, all accuracy rates for 2-class models at or above 95% are underscored. As seen in the table, the higher percentages of correct class enumeration tend to be associated with the linear GMM. Considering the percentages within parentheses, which are based on both the properly and improperly converged replications, the linear GMM appears to perform best in terms of all examined model fit indices, with a satisfactory rate of accuracy (i.e., above 95%). However, if improperly convergent replications are excluded, the percentages outside parentheses can be much lower (e.g., AIC, SABIC, HT-AIC, Entropy, LMR LRT [1-class vs. 2-class], BLRT [1-class vs. 2-class]). Clearly, nonconvergence and how to deal with it play a vital role in determining the number of classes in the linear GMM. Because the linear GMM has a very high nonconvergence rate, different methods of treating nonconvergent cases might lead to remarkably different results for this model. Therefore, the optimistic model selection results for the linear GMM in previous studies might be partly caused by the large proportion of improper convergence arising in situations where more classes are fit in the GMM than truly exist.
Table 3. Percentage of 1-, 2-, and 3-Class Models Selected by Each Index Across All 16 Conditions.
Note. The higher percentage between UMM and GMM is in bold and other percentages over 95 are underlined. Percentages outside parentheses are calculated only from properly converged replications while percentages within parentheses are from all the replications, including both properly converged cases and nonconverged ones. LMR1v2 and BLRT1v2 results come from converged replications only since almost all the 1- and 2-class models converged properly.
That said, although an unrestricted mixture model assumes no specific within-class relations among variables, the UMM is not always superior on average, even though the SACAIC, DBIC, LMR LRT (1-class vs. 2-class), and BLRT (1-class vs. 2-class) function as well in the UMM (i.e., accuracy rates over 95%) as they do in the linear GMM. Inconsistent with the hypothesis that some previous studies proposed (e.g., Bauer & Curran, 2004), this result indicates that a completely unrestricted model might not work well in class determination according to some indices, possibly due to its overparameterization (i.e., too many parameters need to be estimated from the data). As we expected, underextraction of latent classes did occur in the less restricted UMM, as most indices selected the 1-class model more often in the UMM than they did in the GMM.
Comparing the Performance of Model Fit Indices
AIC, HT-AIC, and all four classification-based statistics exhibited very limited utility as they tend to overestimate the number of classes for all the mixture models across all the cell conditions examined. They seem to achieve high rates of accuracy only in linear GMM when nonconvergent replications are included as evidence supporting 2-class models. But this outcome might solely result from the high rates of nonconvergence because the accuracy rates are usually lower than 20% when excluding nonconvergent cases. Thus, these indices are not suggested for class determination, and this finding is consistent with previous studies (e.g., Henson et al., 2007). For this reason, only entropy is retained in Table 3 and in the following results as a representative classification measure.
LMR and BLRT are almost perfectly accurate when testing a 2-class model versus a 1-class model across all the mixture models under different conditions, but less so when testing the 2-class model against the 3-class model. For example, BLRT2v3 has an unacceptable Type I error rate of .56 in the UMM.
CAIC and BIC have very similar patterns. Both tend to underestimate the number of latent classes across both types of mixture models, and both perform better in the more restricted linear GMM. Generally speaking, BIC has a higher rate of accuracy than CAIC across types of models. Given that CAIC and BIC have the largest penalty terms for model complexity (i.e., the number of parameters) among all the indices, it makes sense that both of them more often select the simpler 1- or 2-class models over 3-class ones. This result is consistent with other studies (e.g., Hurvich & Tsai, 1989; Nylund et al., 2007). When nonconvergent replications are excluded, the rates of accuracy are slightly lower than those including nonconvergent ones within parentheses in Table 3. BIC shows very stable and good performance in the linear GMM, no matter how we deal with nonconvergence; however, it performs worse in the UMM, which is understandable given that BIC tends to favor a simpler model.
SACAIC and DBIC tend to slightly overestimate the number of latent classes in the UMM and linear GMM on average. DBIC performs best across both model types and both ways of dealing with nonconvergent replications, yielding a satisfactory rate of accuracy. SABIC and HQ exhibit very similar patterns in Table 3. Both work better in the UMM if nonconvergent replications are discarded and both work better in the linear GMM if nonconvergent cases are included as evidence for choosing 2-class models. Still, they tend to overestimate the number of latent classes in the linear GMM and UMM.
Factors Influencing Class Enumeration
As shown in Table 4, a factorial ANOVA was conducted to examine the influence of the three design factors on the performance of model fit indices for class enumeration. A statistically significant effect (p < .05) of the factors on model fit indices is marked with an asterisk, while a practically significant effect (partial eta squared > 0.1) is bolded. As observed in Table 4, sample size is the most critical factor: most indices are affected by sample size, but not so much by mixing proportion, model specification, or their interaction effects. The main effects of the three factors are not statistically significant for AIC, SACAIC, HQ, HT-AIC, Entropy, LMR2v3, and BLRT2v3 under the various conditions examined in this study. Only DBIC, BLRT1v2, and LMR1v2 exhibit a statistically and practically significant main effect of sample size and interaction effect of sample size and model specification.
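For reference, partial eta squared for an effect in a factorial ANOVA is conventionally defined as follows; the sum-of-squares symbols are standard notation rather than notation taken from this article.

$$\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}$$

Under this definition, the 0.1 threshold flags effects whose sum of squares exceeds 10% of the combined effect and error sums of squares.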
Table 4. ANOVA Results With Partial Eta Squared for the Designed Factors’ Effects on Model Fit Indices in Selecting the True Model Across Types of Models and Conditions.
Note. The partial eta squared is bold if it is larger than 0.1, and signified with an asterisk if it is significant at the .05 level. Values outside parentheses are calculated only from properly converged replications while those within parentheses are from all the replications, including both properly converged cases and nonconverged ones.
In sum, following from the analyses in the previous section, DBIC is recommended for class determination because of its relatively stable and good performance across the conditions manipulated in this study. More details about the influence of specific design factors are provided below.
Sample Size
Percentages of correct selection under the four sample size conditions are summarized in Tables 5 and 6. The 3-class UMM has its lowest convergence rate, 42%, at the smallest sample size of N = 400, and the rate rises to around 70% to 90% at sample sizes of N = 700 or above. Convergence rates of the 3-class linear GMM are consistently low, about 4% to 5% across sample size conditions. Two conclusions could be drawn from this observation. First, similar to the previous finding in Table 3, convergence rates decrease as a model becomes more parameterized (i.e., less restricted). Second, although model misspecification might be the dominant factor accounting for the high nonconvergence rates, a sufficient sample size (e.g., 1,000 for the current GMM with four repeated measures) might lessen the difficulty of convergence to some extent, although further increases are no longer helpful once the sample size reaches 1,000 in our model. This is particularly true in the UMM, since it requires a relatively larger sample to estimate many more parameters than does the linear GMM. In the UMM, the convergence rate increased considerably, from 42% to 78%, when sample size went up from N = 400 to N = 700. The convergence rates of the linear GMM remain consistent across all sample size conditions.
Percentage of 1-, 2-, and 3-Class Models Selected by Each Index Across Four Conditions With Sample Size of 400 and 700 Separately.
Note. The higher percentage between UMM and GMM is in bold and other percentages over 95 are underlined. Percentages outside parentheses are calculated only from properly converged replications while percentages within parentheses are from all the replications. LMR1v2 and BLRT1v2 results come from converged replications only since almost all the 1- and 2-class models converged properly. The same rules are applied to Tables 6, 7, and 8.
Percentage of 1-, 2-, and 3-Class Models Selected by Each Index Across Four Conditions With Sample Size of 1,000 and 2,000 Separately.
Inspection of Tables 5 and 6 shows, not surprisingly, that increasing sample size helps many fit indices, such as CAIC, SACAIC, BIC, SABIC, and DBIC, identify the number of latent classes for mixture models more accurately. However, for AIC, HT-AIC, HQ, and Entropy, a large sample size does not seem to help in class identification. LMR1v2 and BLRT1v2 achieve perfect selection rates once sample size reaches N = 700 or above, no matter which type of mixture model is used. But sample size does not show a positive effect on these two likelihood ratio tests when testing 2-class versus 3-class models.
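The different sample size behavior of these indices follows from how their penalties depend on N. The sketch below computes several of them from a model's log-likelihood using their commonly cited textbook forms; these formulas are assumptions here and may differ in detail from the definitions used in this study. HT-AIC, SACAIC, Entropy, and the likelihood ratio tests are omitted. The log-likelihood values in the example are invented for illustration only.

```python
import math


def information_criteria(loglik: float, n_params: int, n: int) -> dict:
    """Return several information criteria (lower is better).

    Assumed textbook forms, with p free parameters and sample size n:
      AIC   = -2LL + 2p
      CAIC  = -2LL + p(ln n + 1)
      BIC   = -2LL + p ln n
      SABIC = -2LL + p ln((n + 2) / 24)
      DBIC  = -2LL + p(ln n - ln 2*pi)
      HQ    = -2LL + 2p ln(ln n)
    """
    deviance = -2.0 * loglik
    p = n_params
    return {
        "AIC": deviance + 2 * p,
        "CAIC": deviance + p * (math.log(n) + 1),
        "BIC": deviance + p * math.log(n),
        "SABIC": deviance + p * math.log((n + 2) / 24),
        "DBIC": deviance + p * (math.log(n) - math.log(2 * math.pi)),
        "HQ": deviance + 2 * p * math.log(math.log(n)),
    }


def select_n_classes(fits: dict, n: int) -> dict:
    """For each index, pick the class count with the smallest value.

    fits maps a class count k to a (loglik, n_params) pair.
    """
    table = {k: information_criteria(ll, p, n) for k, (ll, p) in fits.items()}
    names = list(next(iter(table.values())).keys())
    return {name: min(table, key=lambda k: table[k][name]) for name in names}


if __name__ == "__main__":
    # Hypothetical log-likelihoods for 1-, 2-, and 3-class models at N = 700.
    example_fits = {1: (-5200.0, 9), 2: (-5100.0, 19), 3: (-5095.0, 29)}
    print(select_n_classes(example_fits, n=700))
```

Because the AIC penalty does not grow with N while the BIC-type penalties do, the two families can respond differently as sample size increases.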
The UMM, being completely unrestricted, performs comparably to or even better than the linear GMM in class identification in terms of most useful fit indices as sample size increases. Even though overparameterization prevents the UMM from functioning well under limited sample sizes, the UMM can outperform the linear GMM once the sample size is sufficiently large (e.g., N > 1,000). When N = 400, all the model-fit indices except LMR2v3 perform better in a linear GMM, which is observed irrespective of how the results are summarized.
As evidenced by Figure 2, CAIC, BIC, SABIC, and LMR1v2 exhibit a statistically and practically significant sample size effect. The graph indicates that as sample size increases, all mixture models perform better in class determination. When sample size reaches N = 2,000, the performance of both types of models becomes comparable, as shown by the solid horizontal line. Also, as seen in Figure 2A and B, irrespective of whether nonconvergent replications are counted, the UMM is much more sensitive to sample size than the linear GMM, as evidenced by its varying rates of accuracy across sample sizes; meanwhile, the linear GMM performs comparably as long as sample size is N ≥ 700. When sample size is only N = 400, with the exception of the SABIC, the other three indices tend to perform better in the less complex linear GMM.

(A) Indices with significant interaction effects (sample size × model type) based on results excluding nonconvergent replications. (B) Indices with significant interaction effects (sample size × model type) based on results including nonconvergent replications.
Mixing Proportions
The percentage summary in Table 7 shows virtually identical patterns for equal and unequal class proportions, indicating that neither condition is overwhelmingly better than the other, no matter how the nonconvergent replications are handled. The ANOVA results in Table 4 also indicate that varying this factor from 50/50 to 75/25 does not make any appreciable difference for most fit indices. In both mixing proportion conditions, SACAIC and DBIC in the UMM, BIC and DBIC in the linear GMM, and the two likelihood ratio tests for testing 1- versus 2-class models perform well, with satisfactory accuracy rates. Again, the percentages within parentheses, which incorporate the nonconvergent replications, reflect somewhat inflated accuracy rates of the model fit indices compared with the values outside parentheses.
Percentage of 1-, 2-, and 3-Class Models Selected by Each Index Across Eight Conditions With Equal and Unequal Group Sizes Separately.
Within-Class Model Specification
The percentage results for the properly and improperly specified within-class models in Table 8 show similar patterns under both methods of dealing with nonconvergence. Although some indices have slightly higher rates of accuracy in the properly specified model than in the improperly specified model, this difference is not significant according to the ANOVA. This is probably due to the very subtle nonlinear component introduced to the majority class in the population model. Interestingly, some indices (e.g., CAIC, SACAIC, BIC, SABIC, HQ, Entropy, and BLRT2v3) perform better in the GMM with an improperly specified within-class structure. In both model specification conditions, again, SACAIC and DBIC in the UMM, BIC and DBIC in the GMM, and the two likelihood ratio tests for testing 1- against 2-class models perform well, with satisfactory rates of accuracy.
Percentage of 1-, 2-, and 3-Class Models Selected by Each Index Across Eight Conditions With Proper and Improper Model Specification Separately.
Only BLRT2v3 exhibits a significant interaction between model specification and mixture model type, as Figure 3 shows. Comparing Figure 3A and B, we can observe that in the UMM, BLRT2v3 performs slightly better under conditions with proper model specification, whichever way nonconvergent replications are handled. But this is not the case in the linear GMM. If only convergent replications are counted, as Figure 3A shows, BLRT2v3 performs better in the linear GMM under conditions with improperly specified models.

BLRT2v3 with significant interaction effects (model specification × model type) based on results excluding (A; on the left) or including (B; on the right) nonconvergent replications.
Discussion and Recommendations
The primary purpose of the current study was to evaluate the performance of unrestricted multivariate normal mixture models (UMMs) against linear GMMs in correctly identifying the number of latent classes in growth mixture modeling. To start, however, our results indicated that overextraction of latent classes with the correct within-class specification led to high rates of model nonconvergence. Although this problem has been pointed out previously (e.g., Nylund et al., 2007), how to define and deal with nonconvergence is not clear and has not been consistently addressed in previous GMM studies. Therefore, before discussing our results as they relate directly to our focal research concern, it is necessary to clarify the nonconvergence issue.
Recall that Tolvanen (2007) advocated that negative variances should not be interpreted as indicating a misspecified model; because their occurrence was associated with small sample size and small class separation, he viewed such replications as mere sampling variation that should be counted as proper in assessing GMM results. However, Tolvanen’s study examined the convergence rate only for a true 2-class GMM, not a misspecified 3-class model, and he used population values as starting values in estimation, which is often unrealistic in practice. Nylund et al. (2007) and Tofighi and Enders (2008), who used multiple sets of starting values in model estimation, considered nonconvergence to be caused by overspecification of the number of latent classes in 3-class models and thus used failed replications as evidence supporting 2-class models.
In the current study, our convergence results conflict with those of the Nylund et al. (2007) study, in which the convergence rate for the linear GMM was reported to be at least 95% across study conditions. Assuming different starting-value settings have only a minor effect on convergence rates (as long as there are sufficient sets of starting values used in both studies to explore the likelihood surface, which seems likely), two other possible reasons might account for this apparent discrepancy. First, in this study all parameters were freely estimated across classes, whereas in the Nylund et al. work error variances were constrained to be equal across latent classes (i.e., residuals for each time point are held constant across the first, second, and third latent classes). However, even after we reran our models with their constraints, the average convergence rates were only slightly above 20% under the condition of N = 1,000, still far less than 95%. Clearly, the constraint is not sufficient to explain the large convergence discrepancy across studies. Second, if the replications with improper solutions such as negative variances are treated as valid, the convergence rate goes up to 90% in the same N = 1,000 condition. This would explain why Nylund et al. obtained high convergence rates for misspecified 3-class models. It also suggests that Mplus simulation users should be cautious when summarizing Mplus compact output, because the abbreviated results do not alert the researcher to the presence of a negative variance in a given replication under certain model conditions. As was seen earlier, rates of convergence to proper solutions can differ markedly from those that include offending estimates, the latter offering an often severely inflated view of the models’ success. This distinction represents an important clarification of previous GMM studies and has practical implications for future GMM research whenever Mplus is used.
Therefore, in the current study, only proper solutions were considered as converged results, which means that both failed replications and replications with negative variance were classified as nonconvergent results. Considering all the possible reasons for nonconvergence, and to provide a full picture of performance of types of mixture models and model fit indices, nonconverged replications were treated in two different ways in this study: they were either viewed as evidence for supporting models with fewer classes or discarded completely. We believe the latter to be a better strategy for methodological researchers studying GMMs.
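The two treatments of nonconverged replications described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually coded for the study: the `Fit` record, the status labels, and the rule that a nonconverged model simply drops out of the candidate set (so that selection falls back to models with fewer classes) are all hypothetical, as are the criterion values in the example.

```python
from dataclasses import dataclass


@dataclass
class Fit:
    n_classes: int
    status: str       # "proper", "improper" (e.g., negative variance), or "failed"
    criterion: float  # value of an information criterion; lower is better


def is_converged(fit: Fit) -> bool:
    """Only proper solutions count as converged."""
    return fit.status == "proper"


def pick(fits: list):
    """Choose the class count with the lowest criterion among proper solutions."""
    proper = [f for f in fits if is_converged(f)]
    if not proper:
        return None
    return min(proper, key=lambda f: f.criterion).n_classes


def accuracy(replications: list, true_k: int, keep_nonconverged: bool) -> float:
    """Proportion of replications in which the true class count is selected.

    keep_nonconverged=False: discard any replication with a nonconverged model.
    keep_nonconverged=True: keep it and select among the proper solutions only.
    """
    hits = total = 0
    for fits in replications:
        all_proper = all(is_converged(f) for f in fits)
        if not all_proper and not keep_nonconverged:
            continue
        total += 1
        hits += int(pick(fits) == true_k)
    return hits / total if total else float("nan")


# Four hypothetical replications with a true 2-class population.
example_replications = [
    [Fit(1, "proper", 100.0), Fit(2, "proper", 90.0), Fit(3, "proper", 95.0)],
    [Fit(1, "proper", 100.0), Fit(2, "proper", 92.0), Fit(3, "improper", 88.0)],
    [Fit(1, "proper", 100.0), Fit(2, "proper", 97.0), Fit(3, "proper", 93.0)],
    [Fit(1, "proper", 100.0), Fit(2, "proper", 91.0), Fit(3, "failed", float("nan"))],
]

if __name__ == "__main__":
    print(accuracy(example_replications, true_k=2, keep_nonconverged=False))
    print(accuracy(example_replications, true_k=2, keep_nonconverged=True))
```

In this toy example, counting nonconverged 3-class replications as support for the 2-class model raises the apparent accuracy, which mirrors the inflation seen in the parenthesized percentages in the tables.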
Based on our results, we tried to find evidence to support an alternative modeling strategy for assessing the number of latent classes for the linear GMM. Unfortunately, although a completely unrestricted mixture model sounds theoretically compelling, the UMM did not outperform the linear GMM under the conditions investigated herein. This is reminiscent of the bias–variance trade-off (or “bias–variance dilemma”; Geman, Bienenstock, & Doursat, 1992; Rice, Lumley, & Szpiro, 2008). In practice, whenever an incorrect restriction is imposed, fewer parameters are required and some degree of bias is induced. As long as researchers can find a balance point so that this restriction is close to the truth, the bias induced will be small while the reduction in variance will be substantial. In reality, the choice between restricted and unrestricted model estimation depends on the researcher’s degree of confidence in those restrictions. How to decide this trade-off is an empirical question, highly related to sample size (A’Hearn & Komlos, 2003). More flexible models, like the UMM, can lower the chance of bias caused by model misspecification, but they require a much larger sample to detect the heterogeneity underlying the data.
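The trade-off referred to above is usually summarized by the standard decomposition of an estimator's mean squared error. The notation below is generic, with \hat{\theta} denoting an estimator of a parameter \theta, and is not taken from the article.

```latex
\operatorname{MSE}(\hat{\theta})
  = E\big[(\hat{\theta} - \theta)^2\big]
  = \big(E[\hat{\theta}] - \theta\big)^2 + \operatorname{Var}(\hat{\theta})
```

A restriction that is nearly correct adds little to the squared-bias term while reducing the variance term, which is why a restricted model can outperform an unrestricted one at moderate sample sizes.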
A practical suggestion arising from this study is that practitioners can feel comfortable using the UMM to determine the number of latent classes if they have a large sample at hand. Researchers who have only a limited sample, however, ought to think about which part of the within-class model structure is uncertain and thus could be loosened, based on either existing theory or prior experience. When such a restriction is released, the chance of bias caused by model misspecification is reduced and the number of latent classes can be identified more accurately. For example, if the nature of the growth curve is not certain, meaning it could be linear or nonlinear, the UMM is superior to the linear GMM in determining the number of classes, provided the sample size is sufficiently large.
Once researchers decide on the mixture model to use for class identification, model fit indices are suggested for class enumeration. As stated earlier, only a few GMM simulation studies have compared the performance of the indices in this regard. As Tolvanen (2007) summarized, one can identify two heterogeneous linear growth curves, conditional on an MD of at least 2 between latent classes and a sufficiently large sample size. Among these studies, only Peugh and Fan (2012) included all the IC statistics, entropy-penalty indices, and likelihood ratio tests; other studies included only a few statistical indices. Unfortunately, Peugh and Fan used only conditions in which latent classes were separated by 2 SD or less. The fact that all the indices functioned badly in class enumeration is very likely due to this relatively small class separation. In this sense, our work can also shed some light on the performance of statistical indices in class enumeration when there is an appreciable difference between the latent classes. BIC and DBIC are suggested in the linear GMM setting; DBIC is recommended for class identification across the types of mixture models and the various conditions examined here. DBIC performs almost comparably with BIC in the linear GMM and clearly better than BIC in the UMM. When testing whether population heterogeneity exists, regardless of whether the researcher chooses the UMM or the linear GMM as the starting mixture model, both the LMR and BLRT can be used with great confidence because both have enough power to detect heterogeneity. Because they are so powerful, however, both likelihood ratio tests tend to commit Type I errors by selecting models with spurious additional latent classes. This finding is consistent with the results of Nylund et al. (2007) and Peugh and Fan (2012).
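To illustrate what a class separation of "an MD of at least 2" means, the sketch below computes a Mahalanobis distance between two class mean vectors. It assumes that MD denotes the Mahalanobis distance and, for simplicity, that the two classes share a 2 × 2 covariance matrix of the growth factors (intercept and slope); the function name and the numbers are hypothetical and do not reproduce the population values of this study.

```python
import math


def mahalanobis_2d(mean_a, mean_b, cov) -> float:
    """Mahalanobis distance between two 2-D mean vectors.

    cov is a shared 2x2 covariance matrix given as ((s11, s12), (s21, s22)).
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d1 = mean_a[0] - mean_b[0]
    d2 = mean_a[1] - mean_b[1]
    # Quadratic form d' S^{-1} d, using the closed-form inverse of a 2x2 matrix.
    quad = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(quad)


if __name__ == "__main__":
    # Hypothetical intercept and slope means for two classes.
    class_1 = (0.0, 0.0)
    class_2 = (2.0, 0.0)
    shared_cov = ((4.0, 0.0), (0.0, 1.0))
    print(mahalanobis_2d(class_1, class_2, shared_cov))
```

The distance scales the mean difference by the within-class variability, so the same raw difference in means yields a smaller MD when the growth factors vary more within classes.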
Three design factors were investigated in terms of their effect across types of models and on the performance of model fit indices in class identification. Undoubtedly, sample size is the most influential factor. The UMM seems more sensitive to increases in sample size than the linear GMM. For example, N = 700 is large enough for the simplest linear GMM but not for the LPM. When sample size increases, most useful indices, such as CAIC, SACAIC, BIC, SABIC, DBIC, LMR, and BLRT, tend to perform better, although their sample size requirements differ somewhat.
Within-class model specification did not seem to be a significant factor in the current study. This result differs from previous studies (e.g., Bauer & Curran, 2004), probably because of the very subtle nonlinear component introduced into the misspecified model condition. More extreme nonlinear components or other types of misspecified within-class models could be investigated in future work.
The mixing proportion, in our study, does not have a significant effect on class identification. This also differs from Tofighi and Enders’ (2008) results, in which varying the mixing percentage caused different accuracy rates of class enumeration. More specifically, their model with an extremely small proportion of 7% exhibited an unacceptable rate of incorrect class identification. Two reasons might explain this difference. First, the unbalanced mixing proportions in the current work are not extremely small; the smaller proportion is 25% of the total, so the smallest group size is 100 (25% of N = 400), which might be large enough to be identified as a separate group, thus leading to accurate class identification. Second, Tofighi and Enders’ results are based on two different sets of mixing proportions with other factors held constant, whereas the results in the current study come from a full-factorial design in which both the marginal effect of the mixing proportion and its interaction effects were examined. According to Tueller and Lubke (2010), BIC and SABIC perform worse in selecting the true model in small sample size conditions. However, their competing models differ in within-class model structure, not in the number of latent classes as in our case. We would expect the difference between the balanced and unbalanced designs to become clear if the minority class is extremely small. More research is required to locate the cut point in mixing percentage at which the accuracy of class enumeration begins to differ. Considering this result in conjunction with Tofighi and Enders’ (2008) work, this cut point might lie somewhere between 7% and 25%, under the conditions that we have examined.
In sum, then, the current study tried to elucidate a promising idea proposed by Bauer and Curran (2004) to solve a practical problem in GMM. Although the rationale underlying this idea looks tenable, there is no universal way to implement it in practice, which means practitioners cannot simply apply a UMM to their data to avoid the possible bias brought by within-class model misspecification unless their sample size is adequately large. But they could make use of this idea based on their prior information or knowledge. That is, researchers could decide on theoretical or practical grounds either to start from the UMM and fix some components of this model, or to start with the linear GMM and loosen some restrictions. More research into the refinement of these suggestions is necessary and would be expected to help the increasing numbers of researchers employing GMMs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
