Abstract
Many issues of interest to counseling psychologists involve questions regarding how individuals change over time. Although counseling psychologists often examine average levels of change, statistical methods can also identify patterns of change over time by empirically grouping together individuals with similar patterns of change (e.g., group-based trajectory modeling and latent growth mixture modeling). The purpose of this article is to provide an overview of these methods for counseling psychologists. We discuss the conceptual frameworks and assumptions of average-level and person-centered techniques such as group-based trajectory modeling and latent growth mixture modeling. We provide a nontechnical guide for conducting these analyses using data from a study of psychotherapy outcomes in a sample of mental health center clients (N = 1,050). We discuss caveats associated with these methods, including the potential for overinterpreting nongeneralizable results. Last, we suggest best practices for reporting and interpreting results.
The assessment of the efficacy of psychotherapy is most often based on analyses of the mean level of change for the entire sample. And, on average, there is strong evidence that psychotherapy is effective (Westen, Novotny, & Thompson-Brenner, 2004). However, although most people may improve over the course of therapy, some people benefit more than others, and some may even be harmed (Lilienfeld, 2007). In addition, some individuals may benefit early on and others may take longer to make therapeutic gains (Heppner, Kivlighan, & Wampold, 2008). Identifying patterns of responses to psychotherapy that vary from the expected course is helpful for understanding the psychological processes underlying successful and unsuccessful interventions (Bauer, 2011). However, given the typical focus on average responses, different patterns of responses across individuals often are not studied. Examining these patterns can help answer a classic question: “What treatment, by whom, is most effective for this individual with that specific problem, and under which set of circumstances?” (Paul, 1967, p. 111).
We believe that psychology could benefit from additional research into treatment efficacy, whether treatments work for most people on average, and the extent to which individuals differ in their responses to these treatments. Thus, this research area will be used to illustrate how person-centered statistical methods, specifically, growth curve modeling (GCM), group-based trajectory modeling (GBTM; Nagin, 2005), and latent growth mixture modeling (LGMM; B. Muthén, 2004), can provide a complementary perspective through which to view these change processes. Both GBTM and LGMM are data-driven processes that create groups of individuals who have similar patterns of change over time. For example, these analyses may identify subgroups of individuals who respond very well to treatment and subgroups who do not respond well. Further analyses can then explore predictors of membership in these subgroups (e.g., characteristics of individuals associated with responding well to treatment). (The technical differences between these methods will be described later.) When used appropriately, these analyses can provide a useful method for grouping individuals to facilitate understanding of individual differences in change processes.
Information about patterns of symptom course and treatment response can help counseling psychologists tailor treatments and intervene more effectively. For example, in one study (Elliott, Biddle, Hawthorne, Forbes, & Creamer, 2005), multiple patterns of treatment responses to a 12-week intervention for posttraumatic stress disorder (PTSD) in combat veterans were identified using LGMM. Responses to the intervention were assessed at 3, 9, and 21 months posttreatment. The treatment was cognitive behavioral in orientation and delivered in a group format. Three classes of treatment responses best fit the data in this study. The largest class of individuals (n = 1,380; 62%) had initially high PTSD symptom levels and showed a sharp improvement in symptoms 3 months posttreatment with continuous, although less dramatic, improvement over time. A second group (n = 752; 34%) had moderate PTSD symptoms initially that showed small and steady improvement over time. A third small class (n = 87; 4%) had relatively lower levels of PTSD symptoms that worsened during treatment and then returned to baseline 21 months posttreatment. Elliott et al. (2005) suggested that the CBT-based group therapy for PTSD they examined may be less effective for veterans with less severe PTSD. They speculated that this may be due to an iatrogenic effect of being in a group with more impaired veterans. In addition, because the symptom trajectories of groups that did improve were relatively flat after the initial posttreatment follow-up assessment, Elliott et al. suggested that an additional “booster session” may increase the long-term effectiveness of the intervention. If we can identify individuals who may be at greater risk of responding minimally or poorly to treatment, we can be mindful of treatment progress for those individuals or assign them to a different treatment. 
Likewise, if we can identify who responds very well to an intervention, we can do targeted follow-ups and learn more about what made the intervention effective for those individuals.
By statistically identifying patterns of responses to treatment, GBTM and LGMM analyses can provide information to answer the question, “What works or does not work for whom?” A handful of studies have used GBTM or LGMM to examine psychotherapy outcome trajectories (e.g., Cuijpers, van Lier, van Straten, & Donker, 2005; Elliott et al., 2005; Hayes, Hope, & Heimberg, 2008; Lutz et al., 2014; Szapocznik et al., 2004). These studies have found different trajectories of change throughout the course of therapy. However, the interpretation of this growing body of psychotherapy outcome literature is hampered by the lack of standardized reporting procedures for data analysis plans or results. Because there are no consistent analytic or reporting practices, results cannot easily be compared across studies. This lack of consistency is a theme that we will return to throughout the article, particularly when comparing and contrasting findings from GBTM and LGMM.
GBTM and LGMM analytic methods have been fruitfully used in counseling psychology and other domains of psychological research as well. In counseling psychology, GBTM has been used to assess patterns of adjustment over time among international students (Hirai, Frazier, & Syed, 2015). Many articles using GBTM and LGMM methods have been published in developmental psychology, including studies identifying developmental trajectories of conduct disorder and antisocial behavior (e.g., Leadbeater & Hoglund, 2009; Nagin & Tremblay, 1999), predictors of early education performance (e.g., Spilt, Hughes, Wu, & Kwok, 2012), and personality development among youth and adolescents (e.g., De Haan, Deković, den Akker, Stoltz, & Prinzie, 2013). Vocational psychologists have used LGMM to study trajectories of career development (Hirschi, 2011), employment (Huang, Evans, Hara, Weiss, & Hser, 2011), and career success (Zwaan, ter Bogt, & Raaijmakers, 2010).
Nonetheless, the use of GBTM and LGMM to draw strong theoretical conclusions is not without controversy. In the field of trauma psychology, GBTM and LGMM have fundamentally changed our conception of psychological responses to trauma from a primarily one-trajectory recovery model (i.e., initial distress that decreases linearly over time) to a more nuanced depiction of four prototypical trajectories, with a resilient trajectory (i.e., low or no posttrauma distress that quickly abates) being the most common (Bonanno, 2004; Bonanno et al., 2012). Indeed, the predominance of resilient trajectories has been replicated many times in diverse samples. However, a recent debate in the trauma field regarding the interpretation (and potential reification) of LGMM results has questioned whether these four prototypical responses to trauma are universal or are the result of statistical artifacts and researchers’ model specifications (see Galatzer-Levy & Bonanno, 2016; Infurna & Luthar, 2016a, 2016b). Similar criticisms of GBTM and LGMM (in particular, that researchers are likely to find patterns that confirm their expectations) have been made regarding alcohol use research (Sher, Jackson, & Steinley, 2011). For example, these studies have been criticized for consistently finding “low use,” “increasing use,” “decreasing use,” and “high use” groups regardless of the sample or time frame of the longitudinal study. The effect of model specification on GBTM and LGMM results and the importance of replicability are highlighted throughout this article.
Both GBTM and LGMM can be useful additions to counseling researchers’ toolboxes, with some reservations. Results from these models should not take the place of analyses that examine average levels of effectiveness but can be helpful by providing additional information about the effectiveness of counseling interventions. Indeed, as we describe, determining the average sample effect is an early step in GBTM and LGMM procedures. GBTM or LGMM can also be combined with qualitative data in mixed-methods research designs (Creswell & Plano-Clark, 2010). For instance, average-level analyses could be used to identify average levels of treatment effectiveness in a sample, person-centered methods could tease apart patterns in the data, and qualitative data (e.g., about reactions to a treatment) could be analyzed to better understand lived differences between groups of people and provide ecological validity for disaggregating the sample. Analysis of qualitative data from treatment responders could identify what components of the intervention were most effective, and data from treatment nonresponders could be instructive in understanding what kinds of interventions these individuals might instead need.
The purpose of this article is to provide counseling psychology researchers and practitioners an introduction to these “person-centered” methods for identifying groups of individuals with different patterns of change. This introduction is meant to stimulate critical consumption of research using these methods, to suggest standard reporting practices for these methods, and to provide consumers and reviewers of research with a basic understanding of how to interpret and judge the quality of studies that rely on them. We believe this introduction is particularly timely because these methods are proliferating in closely related fields and the interpretation of the results can be challenging given the complexity of the models. In this article, we first discuss an “average-level” modeling technique (e.g., analysis of variance [ANOVA]) and then discuss “person-centered” methods for examining heterogeneity in growth trajectories (GCM, GBTM, and LGMM), along with their assumptions and limitations. We then provide a nontechnical guide for conducting GBTM and LGMM, including issues related to study design, analysis, and interpretation, using data from a study of psychotherapy outcomes in a sample of community mental health center clients (N = 1,050).
Conceptual Frameworks: Average-Level Versus Person-Centered Approaches
Researchers can ask two fundamental questions using both average-level and person-centered methods with a longitudinal data set: How do individuals change over time? What variables predict change over time?
Average-Level Techniques
Average-level techniques are the oldest and most commonly used methods for assessing change in counseling psychology research. These methods are used when the focus is on group-level effects in a sample (e.g., whether an intervention group changed more than a control group in a treatment outcome study). In these methods, the emphasis is on the “fixed effects,” that is, the mean of the sample. These models treat clients’ change in the focal outcome as a fixed effect and examine the average trajectory of change, but do not examine variability in trajectories among clients. Although very useful, an analysis of average levels of change also can be misleading. Individuals’ patterns of change are not captured well in methods that focus on aggregate or group-level change because the between-subject variation is treated as error. If, for example, half of the group improved and half deteriorated, the average would suggest no change over time within the group rather than two distinct and opposite patterns of change. Person-centered approaches, discussed next, can be used to complement average-level approaches and provide additional information about individual differences.
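The masking effect described above can be made concrete with a small numerical sketch. The scores below are invented for illustration and do not come from any study; they are arranged so that half of a hypothetical group improves and half deteriorates by the same amount.

```python
from statistics import mean

# Hypothetical distress scores (higher = more distress) for six clients
# at two time points. All values are invented for illustration.
pre = [80, 82, 78, 80, 81, 79]
post = [60, 62, 58, 100, 101, 99]  # first three improve, last three worsen

# Change score for each client (negative = less distress at follow-up)
change = [after - before for before, after in zip(pre, post)]
average_change = mean(change)

# Split the change scores by direction to expose the two opposite patterns
improved = [c for c in change if c < 0]
deteriorated = [c for c in change if c > 0]

print("average change:", average_change)
print("mean change among improvers:", mean(improved))
print("mean change among deteriorators:", mean(deteriorated))
```

In this constructed case the average change is zero even though no individual client stayed the same, which is the situation in which an average-level analysis alone would be misleading.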
Person-Centered Approaches
In contrast to average-level techniques that focus on the fixed effects, person-centered techniques such as GCM, GBTM, and LGMM analyze the similarities and differences between individuals rather than treatment groups. Person-centered techniques emphasize “random effects,” that is, the variability around the sample mean. These techniques incorporate individual variability into the analyses rather than treating it as error. Person-centered techniques are appropriate when questions involve identification of different developmental courses or trajectories of response to treatment. Once a researcher has decided to model client change using a person-level approach (e.g., treating change in the outcome as a random effect), a choice is made about whether to treat the trajectories of change as falling along a continuum (as in GCM) or to treat trajectories of change as being grouped into response categories (as in GBTM and LGMM).
We next describe GBTM and LGMM in more detail, and provide a step-by-step guide for conducting GBTM and LGMM. Because the article is meant to be a layperson’s guide, the mathematical explanations are kept to a minimum. Instead, we focus on choosing and setting up, testing, and interpreting the results of the analyses (for more technical information, see B. Muthén, 2004; Nagin, 2005; Nagin & Tremblay, 2001; Wang & Wang, 2012).
Overview of GCM, GBTM, and LGMM
Prior to discussing GBTM and LGMM, we discuss GCM because it is a first step in conducting GBTM and LGMM. GCM is also known as latent growth curve modeling or latent trajectory modeling, and is closely related to multilevel (hierarchical) linear modeling. We assume that readers have some familiarity with GCM and discuss GCM as a foundation for learning about GBTM and LGMM. Readers who wish to review GCM, to ensure that they can accurately understand GBTM and LGMM, are encouraged to consult Duncan, Duncan, and Strycker (2009). Readers can also consult Roseborough, McLeod, and Bradshaw (2012) for more detailed discussion and application of longitudinal multilevel modeling (a close cousin of GCM) to examine psychotherapy outcomes, using the same dataset of psychotherapy outcomes that we use in the current study.
In GCM, the individual growth curves (patterns of change over time) in a sample are estimated based on a set of growth parameters (i.e., intercept and slopes). These growth parameters are specified for each individual in the dataset and then summarized into one set of growth parameters (Weinfurt, 2000). Thus, the modeling captures both the average growth parameters for all individuals in the model and information about individual-level growth. Significant variability around the growth parameters (intercepts or slopes) in GCM suggests an opportunity to examine heterogeneity in these parameters. In addition, predictors can be added to GCM models to predict the variance around the growth factors. In other words, the degree of variability from the group trajectory is identified and predictors of that variability can be included in the model (Duncan et al., 2009; Weinfurt, 2000). However, GCM estimates only one set of growth parameters for the sample, and many people may not follow that average-level trajectory (Laurenceau, Hayes, & Feldman, 2007). Although GCM estimates individual-level trajectories and their variance from the average-level trajectory, it does not classify individual trajectories into empirically derived subgroups for further analyses. Because many questions of interest to counseling psychologists involve typical and atypical patterns of change over time, methods such as GBTM and LGMM were developed to empirically identify groups of individuals with similar response patterns.
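The idea of individual growth parameters that are then summarized can be sketched with a simplified two-stage procedure: fit a straight line to each person's scores, then take the mean and variance of the resulting intercepts and slopes. This is only an approximation for illustration. An actual GCM estimates these quantities simultaneously in a multilevel or structural equation framework, and the client scores and the helper name `fit_line` below are hypothetical.

```python
from statistics import mean, variance

def fit_line(times, scores):
    """Ordinary least squares intercept and slope for one person's scores."""
    t_bar, y_bar = mean(times), mean(scores)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, scores))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

# Five assessments coded 0-4 so that the intercept is the score at T1
times = [0, 1, 2, 3, 4]

# Invented scores for four hypothetical clients
clients = {
    "A": [90, 80, 70, 60, 50],
    "B": [70, 68, 66, 64, 62],
    "C": [60, 60, 60, 60, 60],
    "D": [50, 55, 60, 65, 70],
}

# Stage 1: one (intercept, slope) pair per person
params = {cid: fit_line(times, y) for cid, y in clients.items()}
intercepts = [p[0] for p in params.values()]
slopes = [p[1] for p in params.values()]

# Stage 2: summarize into average growth parameters and their variability
print("mean intercept:", mean(intercepts), "variance:", variance(intercepts))
print("mean slope:", mean(slopes), "variance:", variance(slopes))
```

The nonzero variance of the slopes in this toy example corresponds to the kind of variability around the growth parameters that motivates examining heterogeneity further.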
The difference between GCM on the one hand, and GBTM and LGMM on the other, is the person-centered latent categorical grouping variable in the latter analyses that classifies each individual in the sample into one of k classes, using information about response patterns over time. The conceptual origins of LGMM are based in GCM: Each class in LGMM is, in essence, a unique growth curve (B. Muthén, 2004; Nagin & Odgers, 2010). (GBTM is not an extension of GCM because the variance of the growth factors is fixed to zero, as will be explained in greater detail.) As illustrated in Figure 1, in GBTM and LGMM, models are estimated with latent growth factors (i.e., intercepts and slopes), for which observed variables are the indicators. For example, in the treatment outcome example described later, measures of functioning at five time points (e.g., initial interview [T1] and four follow-up assessment time points [T2-T5]) are the indicators of the latent growth factors. The “class” variable is a latent categorical variable that is estimated by probabilistically grouping individuals who exhibit similar starting points (intercepts) and patterns of change (slopes) together. GBTM and LGMM allow researchers to estimate growth parameters for individual growth curves and examine subgroups of similar response patterns in the sample.

Figure 1. Group-based trajectory model (GBTM) and latent growth mixture model (LGMM) estimating growth parameters (i.e., intercept, slope) with an outcome variable (i.e., OQ-45 total scores) measured at five time points.
As in GCM, growth factors are the building blocks of GBTM and LGMM and define the structure of the model. In a simple one-class linear model (i.e., GCM), there are two growth factors: the average beginning point for the individual trajectories (intercept) and the average rate of change over time (slope). Like GCM, in one-class GBTM and LGMM models there is a single set of growth terms that describe the entire sample. In GBTM and LGMM models that have more than one class (k number of classes), there will be k sets of growth terms in the model. In other words, a three-class linear model will contain a set of growth terms for each class, resulting in six growth terms (intercept and slope for each of the three classes). These growth terms can vary across classes, although it is possible that the classes will differ on one growth term (e.g., intercept) and not the other (e.g., slope). Thus, some classes may start at different places but display similar patterns of change over time or vice versa.
It is also possible to specify GBTM and LGMM models with nonlinear growth terms, such as quadratic (e.g., U-shaped) growth models. Nonlinear models are tested by including higher-order polynomial growth factors. When modeling quadratic growth, the model would include three growth factors: intercept, linear slope term, and quadratic slope term. A three-class quadratic model will contain nine growth terms (intercept, linear slope, and quadratic slope for each class). The intercept still represents the average initial value in the class. The linear slope, however, changes meaning when there are higher-order growth terms (e.g., quadratic) in the model. In such models, the linear slope corresponds to the instantaneous rate of change at the point of intercept (e.g., increasing, decreasing, or no change from the intercept). It conveys the direction of the change and the initial rate of change through its sign and magnitude, respectively. In contrast, the quadratic slope term serves as the indicator of change in the rate of change over time. Higher absolute magnitudes of the quadratic term indicate greater curvature in the change pattern. Higher-order growth models beyond quadratic are also technically possible provided there is sufficient information in the data set for model identification. In a longitudinal context, the absolute minimum number of time points for a growth model with j growth terms is j+1 (e.g., for quadratic models, which contain three growth terms, the minimum number of required time points is four). That said, model estimation is optimized when the minimum threshold is exceeded.
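The counting rules and the interpretation of the linear and quadratic terms described above can be checked with a few lines of arithmetic. The function names below are hypothetical helpers introduced for this sketch, and the growth coefficients in the example are invented.

```python
def growth_terms_per_class(polynomial_order):
    """Intercept plus one slope term per polynomial degree (1 = linear)."""
    return polynomial_order + 1

def total_growth_terms(n_classes, polynomial_order):
    """Each class has its own full set of growth terms."""
    return n_classes * growth_terms_per_class(polynomial_order)

def minimum_time_points(polynomial_order):
    """Absolute minimum is one more time point than growth terms (j + 1)."""
    return growth_terms_per_class(polynomial_order) + 1

def rate_of_change(linear, quadratic, t):
    """Derivative of intercept + linear*t + quadratic*t**2 at time t."""
    return linear + 2 * quadratic * t

print(total_growth_terms(3, 1))   # three-class linear model
print(total_growth_terms(3, 2))   # three-class quadratic model
print(minimum_time_points(2))     # quadratic model

# Invented coefficients: steep early improvement that levels off over time
linear, quadratic = -8.0, 1.0
for t in range(5):
    print(t, rate_of_change(linear, quadratic, t))
```

At t = 0 the rate of change equals the linear term alone, which is why the linear slope in a quadratic model is read as the instantaneous rate of change at the intercept, while the quadratic term governs how quickly that rate itself changes.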
Although both GBTM and LGMM are longitudinal analytic methods that classify people into groups based on their starting points and patterns of change over time, there are important philosophical and statistical differences between them. One difference concerns assumptions about the ways in which the “classes” differ from one another. GBTM assumes that “classes” are subgroups of individuals who are taxonomically similar in their responses on the outcome variable (Nagin & Odgers, 2010). Thus, GBTM does not assume that these groups represent distinct subpopulations of individuals within the sample. Using treatment outcome research as an example, in GBTM, individuals who respond well to treatment would be conceptualized as more similar to each other than individuals who do not respond well to treatment, although these would not be considered two distinct subpopulations of individuals. Rather, the grouping provides a pragmatic way to carve up the distribution for further study. The conceptual foundation for LGMM, on the other hand, does assume that the groups represent distinct subpopulations with different patterns of responding (B. Muthén, 2004). This assumption may limit the appropriateness of LGMM because few theories in psychology posit distinctly different “subpopulations” of individuals (Nagin, 2005). This philosophical distinction between GBTM and LGMM often is not recognized in the literature.
The statistical difference in the ways in which individual variability (i.e., between-class and within-class variability) is conceptualized and modeled in GBTM and LGMM corresponds to the philosophical difference between GBTM and LGMM. Between-class variability refers to the differences among individuals in different classes; for example, between-class variability refers to differences in slopes and/or intercepts between groups of responders to treatment. Between-class variability is often the focus of research using GBTM and LGMM. Within-class variability refers to the differences among individuals in the same class; for example, in our treatment outcome study within-class variability describes differences between people in the same group of responders. GBTM models the mean level of responding within each class, but does not model individuals’ deviation from the class mean to which they were assigned. In statistical terms, GBTM fixes within-class variability around each group’s starting value (intercept) and trajectory (slope) to zero; individuals within each group are assumed to start at the same value (zero variance around the intercept) and exhibit the same general pattern of response over time (zero variance around the slope). Thus, the focus of the analysis is on between-class differences in intercepts and slopes (i.e., growth factors). In contrast, LGMM estimates individual variability within classes, providing information about how closely individuals within a class resemble the mean. Each class is a subpopulation with its own distribution. In statistical terms, LGMM estimates within-class variance around the growth factors, such that individuals within classes are assumed to vary in their starting points and patterns of change over time. This distinction leads to important differences in interpretation of results. Using GBTM, researchers can discuss differences between classes, but not differences within classes. 
Using LGMM, researchers can discuss differences between and within classes.
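For readers who find notation helpful, the statistical contrast can be sketched for a linear model as follows. The symbols are introduced here for illustration and are not taken from the primary sources: y_{it} is the outcome for person i at time t, c_i is that person's latent class, \alpha_k is the vector of mean growth factors for class k, and \Psi_k is the within-class covariance matrix of the growth factors. The normality assumption shown for LGMM is the conventional one.

```latex
\begin{aligned}
y_{it} &= \eta_{0i} + \eta_{1i}\, t + \varepsilon_{it}, \\
(\eta_{0i}, \eta_{1i})^{\top} \mid (c_i = k) &\sim \mathcal{N}\!\left(\boldsymbol{\alpha}_k, \boldsymbol{\Psi}_k\right)
  && \text{(LGMM: } \boldsymbol{\Psi}_k \text{ estimated)}, \\
(\eta_{0i}, \eta_{1i})^{\top} \mid (c_i = k) &= \boldsymbol{\alpha}_k
  && \text{(GBTM: } \boldsymbol{\Psi}_k = \mathbf{0}\text{)}.
\end{aligned}
```

Under the GBTM restriction, every member of class k shares the same intercept and slope, so any remaining differences among class members are absorbed by the residual term rather than modeled as within-class variability.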
Statistical Analyses
Now that we have described GBTM and LGMM methods in general terms, we will demonstrate how the results from GCM, GBTM, and LGMM differ from one another when used to analyze the same data. We will demonstrate how to run GBTM and LGMM models and how to interpret the results. This extended example will illustrate the trajectories that emerge with increasingly complex models and the different conclusions that can be drawn from them. We will discuss how to decide whether to use GBTM or LGMM and the limitations of these methods.
The goal of this example is to help counseling psychologists critically consume research using these techniques and to provide guidelines for conducting GBTM and LGMM analyses for those with appropriate questions and data. Before attempting to run these models, however, researchers must consult primary texts (e.g., B. Muthén, 2004; Nagin & Odgers, 2010; Wang & Wang, 2012) to become familiar with the technical aspects of these models and the statistical software packages they are using, including the default settings within those programs. Jung and Wickrama (2008) created a helpful “troubleshooting” guide to running GBTM and LGMM models and discuss some common model specification errors and how to handle them.
Current Study
The example analyses are illustrated using data from a longitudinal study examining the effectiveness of psychodynamically informed psychotherapy in an outpatient community mental health clinic (N = 1,050; Roseborough et al., 2012). In this naturalistic study, clients at the community mental health clinic completed the Outcome Questionnaire–45.2 (OQ-45.2; Lambert et al., 2004) at 3-month intervals over the course of 4 years. We modeled outcomes in the first year only because significant attrition after T5 (due to completing therapy and leaving the clinic) suggested that the sample characteristics dramatically changed after this point (Roseborough et al., 2012). The OQ-45.2 is a 45-item questionnaire that assesses symptom distress, interpersonal relations, and social role functioning and provides a global functioning score. Items are scored on a 5-point scale with responses ranging from 0 (never) to 4 (almost always). Lower scores indicate less distress and higher scores indicate more severe distress; clinical caseness is met by a score of 63 or higher (Lambert et al., 2004). In addition to examining differences in trajectories, use of psychiatric medication was included as a predictor (i.e., covariate) in the GBTM and LGMM models. To assess replicability, the sample was randomly divided into two subsamples (ns = 539, 511). This step is important because a common criticism of GBTM and LGMM studies is that results are sample- and researcher-dependent and have questionable generalizability (Infurna & Luthar, 2016a, 2016b). Results from both subsamples are compared and discussed.
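A random split of this kind could be produced as sketched below. This is an illustration rather than the procedure used for the present analyses: the seed value and the use of consecutive integers as client identifiers are assumptions, and only the subsample sizes are taken from the text.

```python
import random

# Hypothetical client identifiers 1..1050 (the real identifiers are unknown)
client_ids = list(range(1, 1051))

# A fixed seed (arbitrary choice) makes the split reproducible
rng = random.Random(2016)
rng.shuffle(client_ids)

# Cut the shuffled list into two non-overlapping subsamples
subsample_a = client_ids[:539]
subsample_b = client_ids[539:]

print(len(subsample_a), len(subsample_b))
```

Recording the seed alongside the analysis plan allows other researchers to reconstruct exactly which cases fell into each subsample.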
Examining Heterogeneity in Patterns of Change
Before setting up, running, and interpreting group-based models (e.g., GBTM, LGMM), researchers must first consider how they are conceptualizing individual differences in their sample and model. Individual differences within a given sample (e.g., in response to treatment) should be theoretically expected or established by prior empirical studies (von Eye & Bergman, 2003). The presence of variability among individuals also must be verified using GCM. It must be noted that the presence of variability can be due to sources other than underlying heterogeneity, such as measurement artifacts (see Bauer, 2007, for more detail). Thus, interpretation of the results must be grounded in prior theory and research to prevent nonparsimonious classification into subgroups.
Next, we discuss how to select the best fitting model using fit statistics and conceptual considerations. Then, we demonstrate how to run and interpret GBTM and LGMM, including the use of predictors. Last, we discuss if and when GBTM or LGMM are appropriate analytic approaches and the limitations of these models.
The best fitting models should explain the data in clear, simple, and useful ways. Both statistical and conceptual issues need to be considered when choosing the best fitting model. Statistical indices are discussed first. The Bayesian information criterion, bootstrapped likelihood ratio test, and entropy are three commonly reported fit statistics. They do not provide absolute benchmarks for model “accuracy” or “exactness” but compare relative fit among researcher-specified models.
The Bayesian information criterion (BIC; Kass & Raftery, 1993) is the fit index most often reported in GBTM and LGMM studies. The BIC is calculated from a model’s log likelihood, with a penalty for the number of estimated parameters, and can be used to compare nonnested models (Nylund, Asparouhov, & Muthén, 2007). The sample-size-adjusted BIC (ssBIC) modifies the sample size term in that penalty, and simulation studies have found it to be the most accurate information criterion for small samples (e.g., N < 500; Henson, Reise, & Kim, 2007). Because our subsample sizes are greater than 500, we report the BIC. A lower BIC (or ssBIC) indicates better fit. A difference of 10 or more BIC points between a k-1 and k-class model suggests that the additional class meaningfully improves model fit (Raftery, 1995). In other words, if a four-class model has a BIC at least 10 points lower than a three-class model, the four-class model would fit better according to this criterion.
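The comparison rule can be illustrated with the standard formulas, BIC = -2 log L + p ln(N), and the sample-size-adjusted version in which N is replaced by (N + 2) / 24. The log likelihoods and parameter counts below are invented for illustration and are not results from the present study; only the subsample size of 539 is taken from the text, and the function names are hypothetical.

```python
import math

def bic(log_likelihood, n_parameters, n):
    """Bayesian information criterion: -2 log L + p * ln(N)."""
    return -2 * log_likelihood + n_parameters * math.log(n)

def ssbic(log_likelihood, n_parameters, n):
    """Sample-size-adjusted BIC: N is replaced by (N + 2) / 24."""
    return -2 * log_likelihood + n_parameters * math.log((n + 2) / 24)

def prefers_extra_class(bic_k_minus_1, bic_k, threshold=10):
    """True if the k-class BIC is lower by at least `threshold` points."""
    return bic_k_minus_1 - bic_k >= threshold

n = 539  # size of one subsample

# Invented log likelihoods and parameter counts for two candidate models
bic_three = bic(log_likelihood=-11250.0, n_parameters=10, n=n)
bic_four = bic(log_likelihood=-11230.0, n_parameters=13, n=n)

print("three-class BIC:", round(bic_three, 1))
print("four-class BIC:", round(bic_four, 1))
print("four-class preferred:", prefers_extra_class(bic_three, bic_four))
```

Because the penalty grows with the number of parameters, an additional class lowers the BIC only when the gain in log likelihood outweighs the cost of the extra parameters.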
Similarly, the bootstrapped likelihood ratio test (BLRT) evaluates whether a model with a researcher-specified number of classes fits significantly better than a model with one fewer class, and it provides a p value for that comparison (Nylund et al., 2007). The BLRT has been recommended as a better fit statistic than the BIC (Jung & Wickrama, 2008; Nylund et al., 2007) and is considered the best index for determining whether an additional class is statistically meaningful (Nylund et al., 2007). However, because the BLRT takes longer than the BIC to run, it may be useful to begin initial model testing using the BIC and then calculate the BLRT once the choice has narrowed to a final set of models (Jung & Wickrama, 2008).
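The logic of the final step of a bootstrapped likelihood ratio test can be sketched as follows. The computationally expensive part, simulating data sets from the k-1 class model and refitting both models to each one, is not shown; the bootstrap statistics below are invented stand-ins for that output, and the function names are hypothetical. The p value uses one common convention that adds 1 to the numerator and denominator so that it can never be exactly zero; software packages may use a different convention.

```python
def likelihood_ratio(loglik_k_minus_1, loglik_k):
    """Likelihood ratio statistic comparing a k-class to a (k-1)-class model."""
    return 2 * (loglik_k - loglik_k_minus_1)

def blrt_p_value(observed_lr, bootstrap_lrs):
    """Share of bootstrap statistics at least as large as the observed one."""
    exceed = sum(1 for lr in bootstrap_lrs if lr >= observed_lr)
    return (exceed + 1) / (len(bootstrap_lrs) + 1)

# Invented log likelihoods for the three- and four-class models
observed = likelihood_ratio(loglik_k_minus_1=-11250.0, loglik_k=-11230.0)

# Invented statistics; in practice each comes from refitting both models
# to a data set simulated under the (k-1)-class model
bootstrap_lrs = [3.1, 5.4, 2.2, 8.9, 4.0, 6.7, 1.5, 7.3, 2.8, 5.0]

p_value = blrt_p_value(observed, bootstrap_lrs)
print("observed LR:", observed, "bootstrap p:", p_value)
```

The need to refit two mixture models for every bootstrap draw is what makes the BLRT slow relative to the BIC.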
Entropy indexes the classification accuracy of the classes, essentially describing the likelihood that an individual was classified into the correct class. Each individual has some nonzero posterior probability of being classified into each class of a given model. The probability that each individual was classified into the “most likely” or “best fitting” class is indexed in a matrix titled “average latent class probabilities for most likely latent class membership by latent class” (see Appendices A and B for examples of Mplus output of GBTM and LGMM models and explanation of output, available online at tcp.sagepub.com/supplemental). Values close to 1 on the diagonal of the matrix suggest that individuals were classified into their “most likely class”; high values on the off-diagonal of the matrix suggest that individuals were classified not into their “most likely class” but into another one. Entropy summarizes this matrix. Entropy values range from 0 to 1, with higher values indicating more accurate classification of individuals into classes. Entropy values greater than .70 generally indicate acceptable classification (Wang & Wang, 2012).
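Both quantities can be computed directly from the posterior class probabilities. The sketch below uses the relative entropy formula commonly reported by mixture modeling software, 1 minus the sum of -p ln(p) across all persons and classes divided by N ln(K). The posterior probabilities and function names are invented for illustration.

```python
import math

def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; higher means cleaner classification."""
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # treat 0 * ln(0) as 0
                total -= p * math.log(p)
    return 1 - total / (n * math.log(k))

def average_class_probabilities(posteriors):
    """Row c: mean posterior probabilities of people assigned to class c."""
    k = len(posteriors[0])
    sums = [[0.0] * k for _ in range(k)]
    counts = [0] * k
    for row in posteriors:
        assigned = row.index(max(row))  # most likely class
        counts[assigned] += 1
        for j, p in enumerate(row):
            sums[assigned][j] += p
    return [
        [s / counts[c] if counts[c] else float("nan") for s in sums[c]]
        for c in range(k)
    ]

# Invented posterior probabilities for five people and two classes
posteriors = [
    [0.95, 0.05],
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.30, 0.70],
]

print("entropy:", round(relative_entropy(posteriors), 3))
for row in average_class_probabilities(posteriors):
    print([round(p, 3) for p in row])
```

Diagonal values of the resulting matrix that fall well below 1 point to the class whose members are least cleanly separated from the others.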
Finally, an earlier guide to running GBTM and LGMM models suggested that each class should comprise at least 1% of the total sample (Jung & Wickrama, 2008); however, the usefulness of this rule depends on the sample size. For example, in a large sample, 1% may describe hundreds of people and thus be a meaningful class. In a smaller sample, 1% of the sample may be too small to define a class. Generally, small samples will have less power to detect the significance of slopes within trajectory classes (L. Muthén & Muthén, 2002).
Because these fit indices are calculated using different formulas, it is not unusual for them to disagree with each other. For example, a four-class model with a lower BIC than a three-class model may suggest that the four-class model fits better, but a nonsignificant BLRT comparing the three- and four-class models would suggest that the fourth class does not improve model fit over the three-class model. Statistical experts have suggested that the BLRT should be considered more reliable than the BIC (Nylund et al., 2007). Thus, in this example, the three-class model would be preferred over the four-class model despite the lower BIC for the four-class model. To give another example, a four-class model may have a lower BIC than a three-class model and a significant BLRT value comparing the three- and four-class models (both suggesting that the four-class model is preferable), but the four-class model may have lower entropy. This suggests that even though a four-class model fits better than the three-class model, individuals are not well classified into their most likely class. In this case, the Average Latent Class Probabilities for Most Likely Latent Class Membership by Latent Class matrix should be examined (see Appendices A and B for examples). One class may be lowering the entropy (i.e., the model fits less well for that class). Conflicting fit indices can arise when running either GBTM or LGMM. If an LGMM is being conducted, it may be useful to examine whether one class has more variability than others. When there is disagreement among fit indices, it is important to look at all the statistical indicators, previous literature, and meaningfulness of differences between models.
Parsimony, prior theory, and the ability to predict class membership should also be considered in selecting final models (Bauer & Curran, 2003b; B. Muthén, 2003, 2004). First, good scientific models are parsimonious: Models that require the fewest number of parameters generally are best (Kuhn, 1977). In the context of examining heterogeneity using growth models, the principle of parsimony suggests that, when deciding between two equally good fitting models, the model with fewer classes and parameters (e.g., linear models over quadratic) should be chosen. Second, the choice of GBTM models ideally should be done in the context of prior theory, but if that does not exist, the explanatory value of the models should be used to guide model selection decisions. Because both GBTM and LGMM are data-driven procedures, researchers employing these methods must take particular caution against reifying and overinterpreting results (for thorough discussions of this issue, see Bauer & Curran, 2003a, 2003b; Cudeck & Henly, 2003; B. Muthén, 2003; Rindskopf, 2003). As with all analyses, and particularly with GBTM and LGMM, the replicability of results needs to be discussed explicitly. Finally, the meaningfulness of GBTM and LGMM models also depends on whether we can predict who is in each class (see predictors section) or whether class membership predicts important outcomes (Bauer & Curran, 2003b; B. Muthén, 2004; Nagin, 2005). 
Although there is no foolproof way to distinguish noninformative from substantially meaningful classes, our confidence in the model selection process can be strengthened to the extent that (a) models are parsimonious; (b) response classes replicate across samples from the same or similar populations, as suggested by Nagin (2005); (c) response classes are predicted by variables in theoretically meaningful ways (i.e., theory-derived predictions about what characteristics predispose patients to a given pattern of response are supported); and (d) response classes predict distal outcomes consistent with theory.
In the context of GBTM and LGMM, power analyses have a number of different aspects: the power to identify subgroups in a sample, the power to identify significant effects (e.g., significant slopes), and the power to identify significant predictors of those subgroups. Although power analyses are typically the first step of a data analysis plan, there is little discussion of power analyses in the GBTM and LGMM literature. Among the GBTM and LGMM articles reviewed in this article, most made no mention of any type of power analysis. When power analyses were discussed, it was most often as a study limitation insofar as smaller sample sizes likely resulted in lower power to detect effects (e.g., Hirai et al., 2015).
There do not appear to be well-established guidelines for determining appropriate sample sizes or procedures for assessing power. A number of simulation studies have found that LGMM analyses are likely underpowered for all but the largest data sets (e.g., N > 500), particularly due to increasing model complexity when covariates are included in the models (Hensen et al., 2007; Tofighi & Enders, 2006). In addition, GBTM and LGMM analyses may be underpowered because these methods divide a sample into subgroups, and then use the resulting nominal grouping variable to predict continuous variables (Bauer & Curran, 2003a). Skewness and kurtosis of the underlying distributions of the data may contribute to low power, and B. Muthén (2003) described procedures for testing these assumptions. However, the observed variables, predictors, and latent variables (e.g., classes) are not assumed to be normally distributed in the mixture modeling framework. The data can be skewed because nonnormal distributions may be due to multiple latent distributions of individuals being mixed together (B. Muthén, 2004). All things being equal, power will be increased when studies include greater numbers of data points (e.g., more power in a seven-wave than a three-wave study). Greater conceptual development and guidance around the meaning and testing of power in the GBTM and LGMM field is needed.
Running GBTM and LGMMs
We now report how we ran GBTM and describe and interpret our results. We will then repeat these steps using LGMM. GBTM was conducted in three steps: first, assessing for significant variability in intercepts or slopes (e.g., linear or quadratic) using a GCM to justify conducting GBTM (or LGMM); second, assessing the fit of linear and quadratic GBTM models with two through six classes; and, third, testing whether covariates significantly predicted class membership. In the current study, we tested unconditional (i.e., without the covariate) and conditional (i.e., with the covariate) linear and quadratic GBTM models and reported the fit statistics for all models in both the first and second halves of the sample in Table 1. In our example, we describe the results of the conditional GBTM model. As can be seen in Table 1, although the BIC values differed between the unconditional and conditional models (because of the additional parameters in the conditional models), the entropy, BLRT, and size of the smallest class were very similar between the unconditional and conditional models. When we present the LGMM models, we will report only the models that include covariates as predictors of growth parameters and class membership. As will be discussed in greater detail when describing the role of covariates in LGMM analyses, models with covariates are usually considered the most appropriate models to interpret (B. Muthén, 2004). GBTM models are initially modeled without covariates because all variability in growth parameters is fixed to zero and thus covariates do not affect growth parameter results. LGMM models are modeled with covariates because covariates predict growth parameters and thus modify parameter results; LGMM models without appropriate covariates predicting growth parameters may be misspecified (B. Muthén, 2004).
Table 1. Model Fit Indices of Two- Through Six-Class Unconditional and Conditional GBTM Models of Functional Impairment From First and Second Halves of the Sample
Note. Best fitting conditional models are in bold. GBTM = group-based trajectory modeling; Uncon. = unconditional models (i.e., without covariate); Con. = conditional models (i.e., with covariate); BIC = Bayesian information criterion; BLRT = bootstrapped likelihood ratio test p value; n = size of smallest class.
p < .001.
Analyses were run using Mplus, a latent variable software package (see p. 1 of Appendix A for GBTM annotated syntax and p. 1 of Appendix B for LGMM annotated syntax; L. Muthén & Muthén, 1998-2015). Appendix A contains the syntax used to obtain the best fitting three-class GBTM described in the example, using the first half of the sample. Appendix B contains LGMM syntax used to obtain the two-class model, which was not the best fitting model; this syntax was included as an example of how to run an LGMM that has more than one class. These models may also be run in other programs (e.g., PROC TRAJ [SAS Institute, 2004], “lcmm” package for R [Proust-Lima, Philipps, Diakite, & Liquet, 2016]). Mplus relies on full information maximum likelihood with robust standard errors to estimate model parameters with missing data, assuming the data are missing completely at random (MCAR) or missing at random (MAR; L. Muthén & Muthén, 1998-2015). Three patterns of missing data are known (Allison, 2001). First, missing values that do not depend on the observed or unobserved (i.e., missing) variables are called MCAR. Second, missing values that depend on observed variables, including predictors, but not unobserved variables, are called MAR. Third, missing values can depend on unobserved variables, which is called missing not at random (for more information on missing data patterns, see Schlomer, Bauman, & Card, 2010). Readers are encouraged to consult the Mplus User’s Guide (L. Muthén & Muthén, 1998-2015) for more information. Missing data analyses were conducted in SPSS (IBM Corp, 2012) using Little’s MCAR test. A significant MCAR test would mean that the data were not MCAR. Little’s MCAR test was significant, χ²(9) = 22.628, p < .05. This finding is not surprising, given that improvement in functioning is associated with treatment dropout (i.e., completion) and thus attrition.
In addition, in longitudinal data, missingness at an earlier time point is associated with missingness at a later time point.
Growth curve model
The first step in conducting GBTM (or LGMM) is running a one-class model in which the intercept, linear slope, and quadratic slope (if included) variances are not constrained to zero. The one-class model is equivalent to running a GCM and illustrates the mean level of responding in the sample. The one-class model tests whether there is significant variance around the growth parameters (intercept, linear slope, quadratic slope) to justify disaggregating the sample.
We ran linear and quadratic growth curve models to assess mean change in functioning over time and to assess whether there was enough variability in responses to treatment to warrant conducting the GBTM and LGMM analyses. Analyses were conducted in the first random half of the data and then rerun in the second random half of the data.
In the first half of the sample, the linear GCM suggested that, on average, individuals reported clinically significant functional impairment at the start of therapy (M = 73.39) that declined significantly over time (linear slope B = −2.54, SE = 0.34), with scores of M = 68.30 at 6 months and M = 63.21 at 1 year. The effect size of the change was calculated by subtracting the functional impairment mean score at the 12-month assessment from the initial interview mean score and dividing by the standard deviation of the subsample at the initial interview (SD = 24.54). A small to moderate effect was found (d = −0.41). Significant variance was found around the intercept and linear slope (ps < .05), suggesting that many individuals were reporting initial symptoms and patterns of change that diverged from the “average” response. This provides theoretical and statistical justification for further examining heterogeneity in patterns of response to the intervention. The quadratic GCM was not chosen because it fit the data slightly worse than the linear model; the quadratic GCM BIC value (BIC = 16550.05) was less than one point larger than that of the linear GCM (BIC = 16549.77), which suggested the additional parameters did not add meaningful explanatory value. We reran the GCM including psychiatric medication as a covariate predicting growth parameters, which is essentially a one-class LGMM. The results did not change (e.g., linear model [BIC = 17048.60] had a smaller BIC than the quadratic model [BIC = 17055.02]).
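The effect size calculation described above can be sketched in a few lines. The function name is hypothetical, and the difference is signed here so that a decline in impairment yields a negative value, which matches the sign of the d values reported in the text; this is an illustration of the arithmetic, not the authors' own code.

```python
def standardized_change(mean_start, mean_end, sd_start):
    """Change from first to last assessment, in baseline standard deviation units."""
    return (mean_end - mean_start) / sd_start

# Means and baseline SD reported for the first half of the sample
d_first_half = standardized_change(73.39, 63.21, 24.54)
print(round(d_first_half, 2))  # reported in the text as d = -0.41
```

The same function applied to the means reported for the second half of the sample (74.07 at the initial interview, 65.43 at 12 months, SD = 23.47) gives the corresponding value for that subsample.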
In the second half of the sample, the results were somewhat different. The quadratic BIC (BIC = 15474.72) was 15.94 points smaller than the BIC for the linear model (BIC = 15490.66), indicating better fit for the quadratic model. In addition, significant variance was found around the quadratic growth parameter in addition to the intercept and linear growth parameters. When psychiatric medication was included as a covariate predicting growth parameters, the quadratic GCM’s BIC (BIC = 15932.79) was 10 points smaller than the linear GCM (BIC = 15942.99). Thus, in the second half of the sample, the quadratic GCM appeared to fit better than the linear GCM. In the second half of the sample, the estimated means at the initial interview (M = 74.07), 6-month assessment (M = 66.11), and 12-month assessment (M = 65.43) were very similar to those in the first half of the sample, and the effect size for change in symptoms again was small to moderate (d = −0.37; calculated with SD = 23.47).
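The linear versus quadratic comparisons in both halves of the sample can be laid out side by side as below. This is a minimal sketch that assumes the working rule applied in this example, namely that a BIC improvement of roughly 10 points or more is treated as meaningful; the function names and the threshold parameter are illustrative, and the BIC values are those reported in the text.

```python
def bic_improvement(bic_simpler, bic_complex):
    """Positive values mean the more complex model has the lower (better) BIC."""
    return bic_simpler - bic_complex

def prefer_complex(bic_simpler, bic_complex, threshold=10.0):
    """True when the more complex model improves BIC by at least `threshold` points."""
    return bic_improvement(bic_simpler, bic_complex) >= threshold

# (label, linear GCM BIC, quadratic GCM BIC) as reported in the text
comparisons = [
    ("First half, unconditional", 16549.77, 16550.05),
    ("First half, with covariate", 17048.60, 17055.02),
    ("Second half, unconditional", 15490.66, 15474.72),
    ("Second half, with covariate", 15942.99, 15932.79),
]

for label, linear_bic, quadratic_bic in comparisons:
    gain = round(bic_improvement(linear_bic, quadratic_bic), 2)
    choice = "quadratic" if prefer_complex(linear_bic, quadratic_bic) else "linear"
    print(f"{label}: BIC improvement = {gain}, retain {choice}")
```

A fixed cutoff is only a convenience; as the surrounding discussion stresses, BIC differences should be weighed alongside parsimony, theory, and the other fit indices.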
Two- through six-class GBTMs
Because there was significant variability in the intercept and slope growth parameters, we used GBTM and LGMM to assess whether groups of individuals followed similar patterns of change over time. Specifically, two- through six-class linear and quadratic group-based trajectory models were run, testing to see what number of classes best fit the data. In both GBTM and LGMM, the number of classes to explore is decided by the researchers and patterns of trajectories are determined statistically from the data. Because these are data-driven processes, results must be interpreted in the context of previous literature. We tested up to six classes to encounter and thus illustrate some estimation problems that researchers may experience. No formal guidelines exist about the number of classes that should be tested. Some researchers run increasingly complex models until they reach estimation errors. Often, the reasons for testing a certain number of classes are not discussed in articles that employ GBTM and LGMM, although they should be. In addition, we could not identify clear guidelines about the steps to use in identifying the best fitting model: that is, whether researchers ought to first identify the best unconditional model (i.e., without covariates) and then build up more complex models by adding covariates, or whether covariates should be included in the process of identifying the best fitting model. The process used to identify the best fitting model should be reported transparently in GBTM articles. Results from GBTM analyses using the first half of the sample are discussed first, followed by analyses run using the second half of the sample. Differences in results between the two halves of the sample will then be highlighted. 
Although we report the fit statistics for all the linear and quadratic models to illustrate the similarities and differences between these models’ results, typically only one set of fit statistics is reported, either the linear or quadratic models, depending on the best fitting model.
In the first half of the sample, the three-class linear model was identified as the best model (see pp. 4-5 and 10 of Appendix A for annotated output with fit statistics for the three-class conditional linear model). We first identified the three-class unconditional GBTM as the best fitting model, and then reran the model including psychiatric medication as a covariate predicting class membership. The class-specific growth parameters (e.g., intercepts, slopes) were identical between the unconditional and conditional models; thus, we reported only the conditional models in the current example. As shown in Table 1, the four-class model had the lowest BIC value and entropy greater than .70. However, the four-class model was not chosen because the additional class tended to disaggregate the “moderate distress therapy responders” class into a “moderately high distress therapy responders” class and a “moderate distress therapy responders” class with very similar intercepts and slopes. Thus, we decided that the fourth class did not identify a unique pattern in the data and the more parsimonious three-class model was retained. The quadratic GBTM models were not chosen because their BIC values were less than 10 points smaller than the linear model BIC values, and thus did not appear to meaningfully improve model fit.
The three trajectories in the best fitting model in the first half of the sample are illustrated in Figure 2. We calculated the effect size for change in functional impairment for each class by subtracting the 12-month mean functional impairment score in the class from the initial interview mean functional impairment score for the class, and then dividing the difference by the standard deviation of the subsample at the initial interview. These three classes suggested that 38% of the sample (i.e., “low distress therapy responders,” n = 207; Class 3) reported initial functional impairment within the range for a general community sample (Maruish, 2004), and experienced relief over the course of therapy with a moderate effect size (d = −0.47). An additional 44% of the sample (i.e., “moderate distress therapy responders,” n = 234; Class 2) reported a level of impairment consistent with the norm for community mental health clinics and improved with a moderate effect over the course of therapy (d = −0.56). Finally, 18% of individuals (“high distress therapy nonresponders,” n = 98; Class 1) reported functional impairment at the inpatient unit norm at the beginning of treatment and did not recover after 12 months in therapy (d = −0.10).

Figure 2. Best fitting three-class conditional linear group-based trajectory model of total OQ-45 scores from initial interview to 12-month assessment in the first half of the sample
In the second half of the sample, the three-class linear model also was chosen as the best fitting model (see Table 1). As in the first half of the sample, the class-specific growth parameter estimates were identical between the unconditional and conditional models. Similar to results in the first half of the sample, the four-class GBTM model had a smaller BIC but the fourth class appeared to disaggregate the “moderate distress therapy responders” class into two nonunique classes. In addition, the entropy of the four-class model was lower, suggesting that individuals were not as well classified into the disaggregated classes. The best fitting model in the second half of the sample is illustrated in Figure 3. Approximately half of the sample (“moderate distress therapy responders,” n = 282, 55%; Class 2) reported initial functional impairment in the general range of outpatient clinic norms, and demonstrated change with a moderate effect over the course of therapy (d = −0.42). Approximately equal portions of the subsample were classified into “high distress therapy responders” (n = 113, 22%; Class 1) and “low distress therapy responders” (n = 116, 23%; Class 3). Participants in both Class 1 (d = −0.54) and Class 3 (d = −0.33) demonstrated change with small to moderate effect sizes. As with the first half of the sample, the quadratic GBTMs were not chosen because quadratic model BICs were all less than 10 points smaller than the linear GBTM BICs and thus the additional growth parameter did not appear to improve model fit.

Figure 3. Best fitting three-class conditional linear group-based trajectory model of total OQ-45 scores from initial interview to 12-month assessment in the second half of the sample
It is interesting to note that a case could be made for choosing the three-class quadratic models in both subsamples. Although the quadratic models did not improve model fit (vis-à-vis BIC) and were less parsimonious (because of the additional parameters), the “high distress therapy nonresponders” class was replicated in both the first and second halves of the sample. One could argue that the importance of finding a group of people who did not respond to therapy would justify the choice of the three-class quadratic model over the more parsimonious three-class linear model. These findings highlight the role of researcher judgment and transparency in the model selection process.
The GBTM results painted a more granular picture of intervention effectiveness than the GCM results. However, there were some important differences in the GBTM results between the two subsamples. Relatively fewer participants were classified into the “low distress therapy responders” class and some were classified into a “high distress therapy nonresponders” class in the first half of the sample. In the first half of the sample, GBTM suggested that treatment was effective for a majority of clients, but a substantial minority of highly distressed clients did not improve over the course of therapy. In the second half of the sample, the same patterns of therapy responders were identified (e.g., high, moderate, and low distress). However, unlike the first half of the sample in which the high distress group did not report significant change, in the second half of the sample, all three groups demonstrated small to moderate levels of change. Looking just at results from the first half of the sample, a useful next step could be conducting follow-up analyses of the apparent therapy nonresponders to explore predictors of membership in that group (i.e., for whom therapy was not effective). However, this next step would not have been indicated if only the results from the second half of the sample were examined, which calls into question the robustness of the findings.
Predicting class membership using covariates in GBTM
After identifying the three-class linear model as the best fit, predictors of those classes were assessed. In GBTM, covariates (i.e., predictors) predict the likelihood of an individual’s membership in a given class compared to another class that is chosen as the reference. Again, unlike in LGMM (which will be discussed in the next section), in GBTM predictors can predict only class membership and not the intercepts or slopes because the variance of these growth factors is fixed to zero in GBTM. In other words, within classes, there is no variance to predict.
In the current study, covariate analyses were conducted in Mplus using multinomial logistic regression. Technically, the multinomial regression computes the likelihood of each individual’s classification into each class using the posterior probabilities and does not force individuals into one specific group (Jung & Wickrama, 2008). Follow-up analyses can also be done by saving the most likely class memberships and exporting them to another statistical package such as SPSS (IBM Corp, 2012). Results may differ across these two methods (e.g., analyses in which class membership is based on posterior probabilities vs. likely class membership). Most likely class membership should not be used in the context of low entropy (<.70) because the class assignments are not precise. In addition, in SPSS, those with missing data on the predictor would be excluded (due to listwise deletion), whereas Mplus has the capacity to estimate missing data on the covariate.
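To see why the two approaches can give different answers, the following sketch contrasts class sizes based on most likely class membership with class sizes based on the posterior probabilities themselves. The posterior probabilities and function names are hypothetical and chosen only to illustrate the distinction; this is not how Mplus or SPSS is invoked.

```python
def class_counts_modal(posteriors):
    """Count people per class after forcing each person into the most likely class."""
    k = len(posteriors[0])
    counts = [0] * k
    for row in posteriors:
        counts[row.index(max(row))] += 1
    return counts

def class_counts_weighted(posteriors):
    """Expected class sizes: each person contributes fractionally to every class."""
    k = len(posteriors[0])
    return [sum(row[j] for row in posteriors) for j in range(k)]

# Hypothetical posteriors with poorly separated classes (low entropy)
posteriors = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.30, 0.70],
]

print(class_counts_modal(posteriors))
print([round(c, 2) for c in class_counts_weighted(posteriors)])
```

When posterior probabilities are close to 0 or 1 (high entropy), the two sets of counts converge, which is why most likely class membership is more defensible in that situation.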
In these analyses, the “low distress therapy responders” class (Class 3 in both the first and second halves of the sample) was chosen as the reference because it is conceptually relevant to understanding what distinguishes better functioning therapy clients from more distressed therapy clients. Thus, these analyses can help us understand individuals who varied from a more resilient trajectory.
In this example, we used only one predictor for illustrative purposes although more could be included. Multinomial logistic regression tested whether use of psychiatric medication (binary variable, “yes” or “no”) at T1 (initial interview) predicted class membership (see p. 8 of Appendix A annotated output for example). In the first half of the sample, psychiatric medication significantly predicted membership in the “moderate distress therapy responders” class (Class 2; B = 1.20, p < .001) and the “high distress therapy nonresponders” class (Class 1; B = 1.42, p < .001) versus the “low distress therapy responders” class, such that use of psychiatric medication increased the odds of being in the “moderate distress therapy responders” class by about three times and the “high distress therapy nonresponders” class by about four times relative to the “low distress therapy responders” class. Both the “moderate distress therapy responders” and “high distress therapy nonresponders” groups started out with more functional impairment. This provides additional information that the “low distress therapy responders” class was doing better across multiple indices of functioning. These results were largely also found in the second half of the sample: Psychiatric medication significantly predicted membership in both the “moderate distress therapy responders” (Class 2; B = 1.20, p < .001) and the “high distress therapy responders” (Class 1; B = 2.12, p < .001) classes.
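The translation from multinomial logit coefficients to odds ratios is a simple exponentiation, sketched below with the coefficients reported above. The function name and dictionary labels are illustrative; the reference category in each case is the “low distress therapy responders” class.

```python
import math

def odds_ratio(logit_coefficient):
    """Convert a (multinomial) logistic regression coefficient to an odds ratio."""
    return math.exp(logit_coefficient)

# Coefficients for psychiatric medication reported in the text
coefficients = {
    "First half, Class 2 vs. Class 3": 1.20,
    "First half, Class 1 vs. Class 3": 1.42,
    "Second half, Class 2 vs. Class 3": 1.20,
    "Second half, Class 1 vs. Class 3": 2.12,
}

for label, b in coefficients.items():
    print(f"{label}: B = {b:.2f}, odds ratio = {odds_ratio(b):.2f}")
```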
Running LGMM
LGMM requires researchers to make considerably more decisions about model specification regarding growth factors and covariates because LGMM estimates differences between individuals within classes. This is done by freeing the variances of the intercept and slope growth factors. Researchers must decide which variance parameters to free for each group. Using Mplus to run LGMM, the default is to free the within-class variances for intercepts and slopes from zero but to set the variances to be equal across classes (L. Muthén & Muthén, 1998-2015). In other words, individuals within a class can vary from each other but the amount of variability is the same across classes. Researchers can further loosen these restrictions by allowing the variances of the growth parameters to vary across classes. Prior research or theory about the ways that individuals are believed to vary from one another are helpful guides for this decision-making process (B. Muthén, 2004), although this information is rarely available. As a caution, because researchers have free rein over which growth factors are fixed and/or freed, model specification can quickly become complex, complicated, and confusing to interpret.
There are also computational challenges associated with LGMM because of model complexity. LGMM estimates many more parameters than GBTM because, as just mentioned, the variances of the growth parameters (e.g., intercept, slopes) can be freed. In much of the published research using LGMM, it is difficult to identify which growth parameters were fixed for estimation purposes or freed to vary. When the model specification is not clearly stated, LGMM results are difficult to interpret. For instance, without this information, it is difficult to know whether individuals within each class could vary in their starting points or change over time, or in which classes variability was allowed. These decisions should be reported clearly in the data analysis section of articles using LGMM to facilitate the interpretation and replication of results. Although there are no published guidelines for deciding which growth parameters to fix and/or free, a commonly reported practice (e.g., Lutz et al., 2014) is to fix the highest order growth factors (e.g., quadratic slope in a quadratic model) first and then rerun the model. If the model continues to encounter estimation errors, lower order growth factors can be fixed, one at a time. Remember, if the variances of all growth factors are fixed at zero, the analysis is a GBTM, not an LGMM.
Using covariates in LGMM
In LGMM, covariates can be used to predict class membership (as in GBTM) as well as intercepts and slopes (B. Muthén, 2004). Unlike in GBTM, covariates in LGMM can be specified to predict growth factors because the variances of these parameters can be estimated. Theory or previous research about how a predictor is expected to impact the intercept (starting point) or slope (change over time) should guide these decisions. According to B. Muthén (2003, 2004), in LGMM, theoretically important predictors should be modeled to predict class membership and all growth factors unless there is a well-justified explanation for their exclusion. For example, in a four-class LGMM model with one covariate, the researcher should specify that the covariate predicts the intercept and slope growth parameters and class membership, or provide justification for why a different combination of factors was predicted. If covariates are excluded from predicting growth factors, the model may be misspecified and uninterpretable (B. Muthén, 2003). As a reminder, predictors can predict only growth factors with variances that are estimated (i.e., not set to zero).
In the current example, psychiatric medication was entered as a predictor of the intercept, linear slope, and quadratic slope in the LGMM analyses; in other words, use of psychiatric medication was specified to predict the starting level of functional impairment (i.e., intercept), the initial rate of change (i.e., linear slope), and the change in the rate of change (i.e., quadratic slope) among individuals within each class. Interpretations of LGMM models that include predictors must correspond to how the predictors were specified in the model. For instance, if covariates were specified only to predict intercepts, conclusions about predictors must be limited to the average starting value for each class and not to change over time.
Two- through six-class LGMM
We tested two- through six-class linear and quadratic LGMMs of functional impairment from the initial interview (T1) to the 12-month therapy session (T5) assessment points. The within-class variances of the intercept, linear slope, and quadratic slope growth factors were not fixed to zero and thus were estimated. The variances of growth factors were constrained to be equal across the classes, which is the Mplus default. Some researchers may need to fix growth factors to limit the number of parameters estimated. This is usually a pragmatic choice to facilitate model testing because estimation errors are encountered if models are too complex. (If running LGMM in Mplus, estimation errors are clearly indicated as a “warning,” which appears above the “Model Fit Information.” See Appendix B of LGMM output on p. 4 for an example of a “Model Fit Information” section. As a reminder, Appendix B illustrates the output of the two-class LGMM, which was not described in the current example. It is provided as an example of LGMM output for a solution with more than one class.) If estimation errors are encountered, model results should not be interpreted.
Estimation problems may arise when a latent residual variance matrix is not positive definite, which means that the weighted variances sum to a negative value. This violates the assumption that all real-world values have a nonnegative variance; these noncomputable matrices have many causes (see Wothke, 1993, for a full discussion). Another estimation problem involves finding local maxima, which suggests that the model solution may not be replicated in the same data set. Local maxima are an artifact of maximum likelihood estimation procedures and suggest that the best log likelihood was not reached and therefore the best fitting parameters were not found (Myung, 2003). This error may be addressed in two ways: (a) by increasing the number of random sets of starting values in model estimation or (b) by rerunning the model using the best log likelihood data (see Jung & Wickrama, 2008, for specific instructions for Mplus syntax). If the estimates are replicated, the solution may, in fact, be stable.
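For readers who want a concrete sense of what “not positive definite” means, the sketch below checks a small symmetric covariance matrix by attempting a Cholesky factorization, which succeeds only for positive definite matrices. This is a generic numerical check offered as an illustration; it is an assumption for teaching purposes and not a description of how Mplus detects the problem. The matrices are hypothetical.

```python
import math

def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0:  # zero or negative implied variance
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

# Hypothetical intercept/slope covariance matrices
admissible = [[100.0, 5.0], [5.0, 4.0]]     # implied correlation of .25
inadmissible = [[100.0, 25.0], [25.0, 4.0]]  # implied correlation of 1.25

print(is_positive_definite(admissible))
print(is_positive_definite(inadmissible))
```

The second matrix has positive variances on the diagonal but implies an intercept-slope correlation greater than 1, which is one of the ways an estimated matrix can turn out to be inadmissible.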
Psychiatric medication was included as a predictor of class membership and initial functional impairment (i.e., intercept), the initial rate of change (i.e., linear slope), and change in the rate of change over time (i.e., quadratic slope) in all classes. The variability in growth parameters was constrained to be equal across classes, which is the Mplus default setting. The regressions of the covariate (i.e., medication) on intercept and slopes within classes were also constrained to be equal, per Mplus default setting. These restrictions can be relaxed as well. The two- through four-class models ran successfully with no errors (see Table 1); the five- and six-class models encountered estimation errors indicating that the models were not identified.
In the first half of the sample, the one-class linear LGMM (which is the same as the GCM model with the covariate in the model) emerged as the best fitting model (see Table 2 for LGMM results). Remember, in LGMM models covariates are specified as part of the growth model to correctly estimate growth parameters. The one-class linear LGMM that had psychiatric medication predicting the intercept and linear slope had the lowest BIC value (BIC = 17048.60). The regression of psychiatric medication on the intercept was significant (B = 11.72, p < .001); however, the regression of psychiatric medication on the slope was not (B = 0.79, p = .38). This indicated that psychiatric medication was associated with more functional disability at T1, but was not associated with change over the course of treatment.
Model Fit Indices of Two- Through Six-Class LGMM Models of Functional Impairment From First and Second Halves of the Sample
Note. LGMM results are for conditional models (i.e., with covariates). LGMM = latent growth mixture modeling; BIC = Bayesian information criterion; BLRT = bootstrapped likelihood ratio test p value; n = size of smallest class.
p < .001.
In the second half of the sample, the one-class quadratic LGMM (which was the same as the conditional quadratic GCM discussed in the section on GCM) was chosen as the best fitting model (BIC = 15932.79). This model was chosen after examining statistical fit indices and interpreting the results in the context of theory and common sense. The statistical fit indices indicated that the two-class quadratic model had a BIC that was lower by more than 10 points and a higher entropy value (see Table 2). However, only six participants (1%) were classified into the second class. In addition, Class 2 did not seem face valid but rather appeared to be a statistical artifact: Class 2 had a parabolic shape that started below the community norm, spiked sharply to above psychiatric inpatient norms, and then dropped back to the intercept level of distress at the fifth time point (see Figure 4). The results of the regression of psychiatric medication on growth parameters (i.e., intercept, linear slope, quadratic slope) in the second half of the sample were consistent with the first half: The regression of psychiatric medication on the intercept was significant (B = 13.00, p < .001); however, the regressions of psychiatric medication on the linear slope (B = −0.26, p = .92) and quadratic slope (B = −0.21, p = .71) were not.

Figure 4. Two-class conditional quadratic latent growth mixture model in the second half of the sample.
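The model selection reasoning described above, in which a BIC improvement of more than 10 points is weighed against a very small class, can be sketched in code. This is an illustrative sketch only. The BIC function uses the standard definition (−2 × log likelihood + number of parameters × ln N), whereas the function names, the 5% minimum class proportion, and the numbers in the example are hypothetical and are not values from this study.

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def screen_extra_class(bic_fewer, bic_more, smallest_class_n, n_obs,
                       min_bic_gain=10.0, min_class_prop=0.05):
    """Compare a model with fewer classes to one with more classes.

    min_bic_gain follows the 10-point difference mentioned in the text;
    min_class_prop (5%) is an assumed cutoff used only for illustration.
    """
    bic_gain = bic_fewer - bic_more
    smallest_prop = smallest_class_n / n_obs
    return {
        "bic_gain": bic_gain,
        "bic_favors_more_classes": bic_gain > min_bic_gain,
        "smallest_class_prop": smallest_prop,
        "smallest_class_too_small": smallest_prop < min_class_prop,
    }

# Hypothetical BIC values; a smallest class of 6 people out of an assumed 525.
result = screen_extra_class(bic_fewer=1000.0, bic_more=985.0,
                            smallest_class_n=6, n_obs=525)
print(result)
```

When both flags are true, as in this hypothetical case, the fit statistics and the class sizes point in different directions, and the decision rests on theory and on visual inspection of the trajectories rather than on the BIC alone.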
Comparing GBTM and LGMM Results
A comparison of the GBTM and LGMM models’ results illustrates different ways of grouping patterns of responding to treatment. In the GBTM models, a majority of individuals were grouped into classes that responded to therapy and differed based on their initial level of functional impairment. These two classes (i.e., “low distress therapy responders” and “moderate distress therapy responders”) benefited from therapy about the same amount as the whole group on average (based on the GCM results). Psychiatric medication was predictive of being in a more distressed class, which was consistent with previous research using this sample (Roseborough et al., 2012). In the first half of the sample, the GBTM model identified a small group of individuals who did not respond to therapy and who were not identified using the GCM approach; however, in the second half of the sample, the analogous group (i.e., high distress) did respond to therapy.
The LGMM models, which assume that distinct and unique subpopulations underlie a sample, did not fit the data well. Although the entropies for the LGMMs with two or more classes were very high, and the BIC for the two-class LGMM was smaller than the BIC for the one-class LGMM, visual inspection of the two-class LGMM suggested that this was not likely an accurate depiction of patterns in the sample (see Figure 4 for an example). The GCM, which is akin to a one-class LGMM, fit the data better than LGMMs with more than one class. This suggests that the current sample was not well characterized as unique and different subpopulations, but rather as one population with significant individual variation around the average response.
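To see why a very high entropy value does not by itself indicate a meaningful solution, it can help to compute the statistic directly. The sketch below uses the relative entropy index as it is commonly defined for mixture models, 1 − Σ_i Σ_k (−p_ik ln p_ik) / (N ln K), where p_ik is the posterior probability that person i belongs to class k. The function name and the posterior probabilities in the example are hypothetical.

```python
import numpy as np

def relative_entropy(posteriors):
    """Relative entropy (0 to 1) from an N x K matrix of posterior class probabilities."""
    p = np.asarray(posteriors, dtype=float)
    n_people, n_classes = p.shape
    if n_classes < 2:
        raise ValueError("Entropy is defined only for models with two or more classes.")
    safe_p = np.where(p > 0, p, 1.0)  # treat 0 * ln(0) as 0
    uncertainty = -(p * np.log(safe_p)).sum()
    return 1.0 - uncertainty / (n_people * np.log(n_classes))

# Hypothetical solution: 99 people assigned to Class 1 and 1 person to Class 2,
# each with a confident posterior probability.
posteriors = np.vstack([np.tile([0.99, 0.01], (99, 1)), [[0.02, 0.98]]])
print(round(relative_entropy(posteriors), 3))
```

A solution in which nearly everyone is confidently placed in one class and a handful of people in another yields a high entropy value, which is one reason entropy is read alongside class sizes and plots of the trajectories.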
The differences between GBTM and LGMM are both conceptual and statistical. In terms of conceptual differences, it was not surprising that GBTM and LGMM models found different results because they make fundamentally different claims about reality. GBTM is a method for classifying a sample into subgroups without assuming that unique subpopulations with different distributions underlie the sample. However, LGMM assumes that there are unique and different subpopulations underlying a sample that can be identified using this method. The LGMM results from both the first and second halves of the sample suggested that, even though significant variability was found around the intercept and slope of the GCM, our sample was best characterized as being composed of one population and not multiple unique subpopulations. The results from the GBTM analyses suggested that the sample could be classified into three groups that differed in terms of initial distress but had similar patterns of change over time. Additional replication studies are needed to determine whether a group of high distress therapy nonresponders is present in other samples.
When considering the statistical assumptions of GBTM and LGMM, the differences in the GBTM and LGMM models were also not surprising. First, these models specify growth factor variances in very different ways. As a reminder, GBTM fixes growth factor variances to zero within classes, and LGMM can model the variance around growth factors both within and between classes. Because of this difference, we would not expect to find identical GBTM and LGMM results unless there was actually no within-class variance in the growth factors in the sample to model, in which case the GBTM and LGMM models would be functionally identical.
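The difference in how the two approaches treat growth factor variances can be written out for a linear model. The notation below is an assumed, generic shorthand and is not taken from the analyses reported here: y_it is functional impairment for person i at time t, a_t is the time score, x_i is the covariate (e.g., psychiatric medication), c_i is latent class membership, and Ψ_k is the within-class covariance matrix of the growth factors.

```latex
\begin{aligned}
y_{it} \mid (c_i = k) &= \eta_{0i} + \eta_{1i}\, a_t + \varepsilon_{it}, \\
\eta_{0i} \mid (c_i = k) &= \alpha_{0k} + \gamma_{0k}\, x_i + \zeta_{0i}, \\
\eta_{1i} \mid (c_i = k) &= \alpha_{1k} + \gamma_{1k}\, x_i + \zeta_{1i}, \\
(\zeta_{0i}, \zeta_{1i})^{\top} \mid (c_i = k) &\sim N(\mathbf{0}, \boldsymbol{\Psi}_k).
\end{aligned}

% GBTM: no within-class variation in the growth factors
\boldsymbol{\Psi}_k = \mathbf{0} \quad \text{for all } k

% LGMM: within-class variation is estimated, here with equality constraints across classes
\boldsymbol{\Psi}_k = \boldsymbol{\Psi}, \qquad \gamma_{0k} = \gamma_0, \qquad \gamma_{1k} = \gamma_1
```

Under this notation, a GBTM is the special case in which every person in a class follows the class mean trajectory exactly (apart from residual error), and an LGMM with a single class (K = 1) reduces to the conditional GCM.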
Choosing between GBTM and LGMM
GBTM and LGMM have their strengths and limitations relative to each other and to other modeling techniques. So when should researchers choose GBTM or LGMM? One consideration is the research question being asked. GBTM is useful for research questions concerning taxometric or diagnostic issues, rather than questions concerning distinct subsamples. For example, GBTM could be used to identify groups of people who reported clinically significant change during treatment. Individuals who respond to treatment are not conceptualized as a different population of individuals from those who do not. Instead, GBTM is used to pragmatically subcategorize a continuum of responses for closer examination. If prior theory and research suggest that distinct latent subpopulations exist, LGMM may be a better conceptual choice (although these situations are likely rare). For example, LGMM has been used to identify groups of students with different school success trajectories (B. Muthén, 2004). In this context, the group of high school dropouts was considered a different subpopulation from the group of high school graduates.
There also are practical considerations. Some researchers may intend to use LGMM to capitalize on its greater complexity, but need to use GBTM for estimation purposes (e.g., Lutz et al., 2014). Because GBTM fixes within-class variances to zero, these models estimate fewer parameters and so can be computed faster and with fewer errors. In addition, GBTM results may be easier to interpret than LGMM results because the models are less complex and more transparent. Thus, GBTM is often the more practical choice. An advantage of LGMM is that, because within-class variation is estimated, researchers have another metric by which to determine goodness of model fit. Larger variances for some classes suggest that a given model fits better for some subsamples than for others.
Limitations of GBTM and LGMM
There are times when neither GBTM nor LGMM is appropriate, and we discuss several of them. First, if research questions do not concern differences between patterns of people in the data set, GBTM and LGMM would not be appropriate. In this case, using GBTM or LGMM to disaggregate the sample would result in a misspecified model. Of course, correct model specification is important for all modeling techniques (Tomarken & Waller, 2005), but idiosyncratic data patterns and groups are particularly vulnerable to reification in GBTM and LGMM because these procedures are data-driven (Bauer & Curran, 2003b). Second, even if the research question concerns differences between patterns of people, GBTM or LGMM may not be useful or may not run when there is insufficient variability in a sample (Feldman, Masyn, & Conger, 2009). Third, because a continuous distribution is being categorized, information about individuals near the “border” of categories will be lost (Bauer, 2011). In addition, because the distribution is grouped into classes, some of the classes may be small, thus reducing power to detect differences between classes and the significance of the slope coefficients. As with other choices in research design and analysis, consideration and justification of choice of modeling techniques should be explicitly discussed. Finally, there are also practical issues in running GBTM and LGMM that may limit their use: Sample sizes need to be larger to run more complex models, and study designs must include more than pre- and posttreatment assessment time points. Including greater numbers of waves of data increases study power and precision of results. Because these models are data-driven, replication of results is needed to assess the generalizability of patterns of change, which also requires large or multiple samples.
Discussion
We have demonstrated how GBTM and LGMM can provide information about groups of individuals who may follow different patterns of change over time and also how GBTM and LGMM can provide confusing or seemingly contradictory information. In our example, GCM identified that, on average, individuals who participated in 12 months of psychotherapy improved over time, with a small to moderate effect size. In the first half of the sample, GBTM analyses identified a subgroup of participants who did not experience improvement in functioning over the course of 12 months of therapy; however, in the second half of the sample all groups appeared to improve over the course of therapy. Using GBTM, we found that psychiatric medication was associated with greater functional impairment at the beginning of therapy. LGMM analyses found that the current sample was best characterized as a single population with significant individual variation in responses to psychotherapy and was not well characterized as multiple distinct subpopulations.
There were significant differences between the first and second halves of the sample across all modeling techniques that highlight the need for replication. A linear GCM best fit the data in the first half of the sample, whereas a quadratic GCM best fit the data in the second half of the sample. A three-class linear GBTM was the best fitting model in both the first and second halves of the data; however, the “high distress therapy nonresponders” class in the first half of the sample was a “high distress therapy responders” class in the second half of the sample. When running LGMM models, the one-class (i.e., conditional GCM) model was the best fitting model, although as noted previously, a linear model was found for the first half and a quadratic model for the second half of the data. Our study and a recent study by Infurna and Luthar (2016a, 2016b) demonstrated the need for replication of GBTM and LGMM results; however, likely due to sample size requirements and the challenges of data collection, very few GBTM or LGMM studies include replications.
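For researchers who wish to build this kind of internal replication into their own analyses, a random split of the sample can be produced with a few lines of code. The sketch below is one simple way to do so and is not necessarily how the halves were created in this study; the function name, the participant identifiers, and the seed are hypothetical.

```python
import numpy as np

def split_into_random_halves(participant_ids, seed):
    """Shuffle participant IDs with a fixed seed and return two non-overlapping halves."""
    ids = np.asarray(participant_ids)
    rng = np.random.default_rng(seed)  # fixed seed so the split can be reproduced
    shuffled = rng.permutation(ids)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Hypothetical IDs 1 through 1,050, matching the overall sample size.
ids = np.arange(1, 1051)
first_half, second_half = split_into_random_halves(ids, seed=2024)
print(len(first_half), len(second_half))
```

Each model is then estimated separately in each half, and the number of classes, the shapes of the trajectories, and the covariate effects are compared across halves. Reporting the seed alongside the analysis syntax allows others to reproduce the split.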
We evaluated our results against the criteria outlined previously to distinguish informative from noninformative models. In terms of parsimony (Criterion 1), in the GBTM analyses we chose the more parsimonious three-class linear models over the three-class quadratic models. However, as noted, an argument could be made for selecting the quadratic models because they identified an important therapy nonresponders class. In the LGMM analyses, we chose the more parsimonious one-class models over the two-class models because the two-class models were not informative. The GBTM and LGMM results broadly replicated across the two random halves of the sample, but some differences were found, as described in the previous paragraph (Criterion 2). However, the GBTM and LGMM results did not replicate each other, nor would we necessarily expect them to. We found that response classes were generally predicted by a theoretically relevant variable (e.g., use of psychiatric medication) in the expected direction (Criterion 3). Finally, given that posttreatment follow-up data were not gathered from participants, we were not able to test whether response classes predicted theoretically expected distal outcomes (Criterion 4).
Our illustration of GBTM and LGMM using a study of responses to therapy had several limitations. First, because data about which therapist saw each client were not available, therapist effects were not modeled. In mental health clinics, therapists often see more than one client, and thus the unique effect of each therapist can be explored. Data about therapists were not collected here due to therapists’ concerns about potential negative evaluation of their work; this illustrates a real-world challenge when conducting therapy outcome research in a naturalistic setting. However, if therapist effects are not properly accounted for in models, results could be inexact (Bauer, 2007). Thus, it is usually necessary to model therapist effects (using multilevel modeling) when examining response classes in psychotherapy outcome data. Adelson and Owen (2012) provided a helpful discussion of handling therapist effects in multilevel modeling (e.g., GCM). When using GBTM and LGMM, clients can be nested within therapists by extending the GBTM and LGMM models to include a multilevel framework (see B. Muthén, 2004, for an example using educational data). Second, we included only one predictor in our models for illustrative purposes, although more predictors often will be included. Finally, although the meaningfulness of models should be examined against distal predictors or outcomes (such as psychiatric hospitalization in our study of psychotherapy clients), we did not have the data to demonstrate these analyses (see Sterba, Prinstein, & Cox, 2007, for an example).
The correct use of appropriate modeling techniques in counseling psychology research is necessary so that individuals’ experiences are accurately understood and effective treatments are developed. Person-centered GBTM and LGMM techniques can provide a useful approach for analyzing individual differences in patterns of change over time. For instance, as illustrated here, treatment outcome studies that examine individuals’ responses to therapy could use GBTM and LGMM to ask how people respond to treatments, who responds in particular ways, and what predicts how people respond. If typical and atypical patterns of responses to treatment are identified and replicated across samples, treatments could then be tailored to individuals based on their predicted response to treatment, thereby improving the effectiveness of counseling.
For a meaningful body of literature using GBTM and LGMM techniques to develop, consistent reporting standards need to be adopted across studies. We recommend that researchers using GBTM and LGMM methods report (a) the goodness of fit indices for unconditional and conditional GCMs as a first step (i.e., models with and without covariates); (b) exactly which growth parameters were freed and fixed in their models; (c) when running GBTMs, how the best fitting model was derived (e.g., via examining unconditional or conditional models); (d) when running LGMM, whether growth parameters were constrained to be equal or freed to vary across classes and whether within-class variance was fixed or freed to vary; and (e) the BIC, BLRT, and entropy fit statistics. In addition, we strongly concur with Nagin and Odgers’s (2010) recommendation to make study programming syntax or script publicly available to facilitate replication studies. We believe that adopting consistent reporting standards will facilitate the comparison of psychotherapy process and outcome research across samples and settings.
This review of the theory, application, and interpretation of GBTM and LGMM is meant to benefit both researchers and practitioners. The article describes a methodology for analyzing change over time that enables counseling psychology researchers to critically evaluate studies using these methods and to add these methods to their toolboxes. For practitioners, the substantive point of interest is that individuals who present with symptom trajectories or exhibit courses of treatment different from the clinical “average” should not be considered atypical or unexpected. Clinicians could join with researchers to develop studies that incorporate an examination of individual differences in patterns of change during treatment with follow-up interviews to explore those differences. Across many psychology disciplines, GBTM and LGMM methods are becoming more common in the peer-reviewed mental health literature. Both researchers and practitioners need to be familiar with these methods to critically produce and consume relevant research.
Footnotes
Authors’ Note
Sheila Frankfurt is now at the VISN 17 Center of Excellence for Research on Returning War Veterans, and Kyoung Rae Jung is now at Salisbury University, MD. The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the U.S. government.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Department of Veterans Affairs Office of Academic Affiliations Advanced Fellowship Program in Mental Illness Research and Treatment, the Central Texas Veterans Health Care System, and the VISN 17 Center of Excellence for Research on Returning War Veterans.
