Abstract
This article presents a latent class model for multilevel data to identify latent subgroups and estimate heterogeneous treatment effects. Unlike sequential approaches that partition data first and then estimate average treatment effects (ATEs) within classes, we employ a Bayesian procedure to jointly estimate mixing probability, selection, and outcome models so that misclassification does not obstruct estimation of treatment effects. Simulation demonstrates that the proposed method finds the correct number of latent classes, estimates class-specific treatment effects well, and provides proper posterior standard deviations and credible intervals of ATEs. We apply this method to Trends in International Mathematics and Science Study data to investigate the effects of private science lessons on achievement scores and then find two latent classes, one with zero ATE and the other with positive ATE.
Keywords
Introduction
The Discrete Conceptualization of Treatment Effect Heterogeneity
The variation of treatment effects is an important consideration for planning and evaluating a treatment. There exists an increasing interest in studying and examining treatment effect heterogeneity (Imai & Strauss, 2011; Künzel et al., 2019; Nie & Wager, 2021; Pearl, 2017; Wendling et al., 2018; Xie et al., 2012). Treatment can be effective or ineffective depending on factors, such as psychological or health status. For example, a certain intervention may be helpful for individuals at moderate risk but unsuccessful or even harmful for others at severe risk. Although researchers wish to control for such variables, it is often impossible or impractical to collect all possible baseline or pretreatment covariates that are to be jointly associated with the latent profiles. Machine learning-based treatment effect estimators are increasingly prevalent for causal inference (Athey & Imbens, 2016; Hill, 2011; Imai & Ratkovic, 2013; Wager & Athey, 2018), but these methods do not account for the differences due to unmeasured or unobservable variables. Moreover, estimating fine-grained treatment effects moderated by all possible combinations of the observed covariates is generally unfeasible due to the curse of dimensionality. It can also suffer from the lack of statistical power to detect distinct subgroup-specific effects and may overfit the data and have low predictive accuracy for new samples (Loh & Kim, 2020).
Identifying latent subgroups and estimating different subgroup treatment effects help us investigate the important “what works for whom” question. For conceptual and practical reasons, researchers in the social, behavioral, and health sciences have investigated how the causal effect of a treatment on an outcome differs among individuals in qualitatively different latent subgroups (Kim et al., 2019; Shahn & Madigan, 2017; Spilt et al., 2013; Willke et al., 2012). Latent subgroups are often referred to as “latent classes” but are also called latent components or segmentations. Individuals’ probabilities of belonging to each class have been estimated using discrete latent variable approaches, such as finite mixture models, latent class (regression) models, latent profile analysis, and model-based clustering approaches (Clogg, 1995; McLachlan et al., 2019; Vermunt & Magidson, 2002).
Latent Class Models in Causal Inference
In the context of causal inference, heterogeneous treatment effects can be consequences of different joint covariate distributions across latent classes, different selection processes to choose or be assigned to a treatment, different outcome mechanisms, or any combinations of these. As an example of latent classes in causal inference, suppose a state offered a mathematics summer camp for high school students, and the effectiveness of the program was evaluated by the gain scores of the camp participants’ pre- and postassessment scores. Assume that the state-sponsored camp was offered to a limited number of students from each high school in the state, and some schools selected their high achieving students to enhance their growth, while other schools provided the opportunity to their low performing students to remediate their deficiencies in mathematics. In this scenario, the selection process is positively related to student ability for some students (Class 1) but negatively related to others (Class 2). The selection process can be known (manifest classes) or unknown (latent classes) to the organizer. If the summer camp covered advanced topics, the effect of the treatment (attending the summer camp) would be positive for Class 1 but close to zero for Class 2. On the other hand, if the camp reviewed basic content, the treatment effect will be close to zero for Class 1 but positive for Class 2.
In this example, it is likely that the joint covariate distributions are different between the two classes, but some important variables may not be collected (e.g., family socioeconomic status, parent education) or are unobservable (students’ motivation and self-efficacy). Furthermore, the propensity of being treated is positively correlated with the pretest for students in Class 1 but negatively correlated with the same variable (pretest) for students in Class 2. For the valid evaluation of the treatment effects, we need a statistical method that handles unobserved covariate differences and distinct selection processes, such as latent class models.
Latent class selection models can detect different associations between predictors and outcome by estimating a separate set of regression coefficients for each class, as shown in Kim and Steiner (2015), Kim et al. (2015), and Suk et al. (2021). In the real data analysis section of this article, we also found that substantially different selection and outcome models between two classes, including some significant coefficients in one class, have opposite signs in the other class.
The continuous versus discrete conceptualization of heterogeneity is important in choosing proper methodology for estimating treatment effects. To demonstrate the strength of latent class models in identifying different latent subgroups, Suk et al. (2021) constructed a hypothetical two-class latent structure, where students belong to one of the two latent classes where the average treatment effect (ATE) is two in the first class and zero in the second class. Suk and colleagues fit latent class models first and then used causal forests (Wager & Athey, 2018) within latent classes to estimate individual treatment effects. They also used causal forests without latent class models. The results (Suk et al., 2021, p. 325, figure 1) show that the individual treatment effects estimated by causal forests alone formed a smooth asymmetrical distribution with one mode, and causal forests masked the treatment effects of the two latent classes. In contrast, fitting latent class models first revealed a mixture of two distributions and their respective treatment effects.
As a discrete latent variable method, the distribution for latent classes is not predetermined but approximated by the estimated number of latent classes and class sizes based on a particular data set. Therefore, the nodes and weights of a latent class distribution do not depend on a parametric probability density distribution, in contrast to latent trait models and other continuous latent variable models that assume a parametric (often normal) distribution. Due to its flexibility in terms of its nodes and weights, latent class models can accommodate both continuous and discrete distributions, asymmetric distributions, highly skewed distributions, and multimodal distributions. Latent class modeling is often considered as a semiparametric approach (Heinen, 1996; Lindsay et al., 1991). We extend the standard latent class model for single-level data to clustered data in this article and refer to our method as latent class multilevel models.
Treatment Effects in Clustered Data
To formalize the treatment effects of interest, we use the Neyman–Rubin causal model (Neyman & Iwaszkiewicz, 1935; Rubin, 1974) with its potential outcome notation (Holland, 1986; Rosenbaum & Rubin, 1983; Rubin, 1974, 1978) and extension to multilevel settings (Hong & Raudenbush, 2006). Assume that there are N individuals nested within M clusters. Let
under the stable unit treatment value assumption (SUTVA; Rubin, 1986), where the potential outcomes of each individual are not affected by others’ treatment assignments, and there is only a single version of treatment. For more details on SUTVA for multilevel settings, see Hong and Raudenbush (2006, 2013).
As the two potential outcomes
and the probability of each individual being assigned to a particular treatment given a set of observed covariates is strictly between 0 and 1
where
Subgroup-Specific Treatment Effects
This article considers ATEs of unobserved subgroups or latent classes, where each subgroup may have different ATEs from other subgroups. Let
When CATE is defined as conditional ATE on observed covariates
There is a major difference between the conditional ATEs estimated by machine learning methods and class-specific treatment effects estimated by latent class modeling. Whereas popular machine learning methods, such as causal forests (Wager & Athey, 2018), Bayesian additive regression trees (Zeldow et al., 2019) and targeted maximum likelihood estimation (Luque-Fernandez et al., 2018) can be used to describe the variability of treatment effects based on observed characteristics of the target population, latent class models can be implemented to better understand the underlying processes of selection or outcome mechanisms due to unobserved heterogeneity that may not be explained by observed variables (Kim & Steiner, 2015). Another important difference between machine learning and latent class methods for heterogeneous treatment effects is that heterogeneity is conceptualized as gradual variation by the former, while latent classes represent heterogeneity as a finite number of distinctive and often qualitatively different subpopulations. The quantitative versus qualitative nature of machine learning versus latent class approaches was illustrated in Suk et al. (2021) with hypothetical two-class data, where the true mixture distribution was approximated as a skewed unimodal distribution for the individual conditional ATE estimates using causal forests, indicating that the machine learning method did not identify the two distinct classes.
Comparisons With Prior Work
A mixture modeling method by Kim and Steiner (2015) provides an efficacious tool for investigating unobserved heterogeneity in causal effects with observational clustered data and has shown promising results in both simulations studies and applications with real data (Kim et al., 2017; Suk et al., 2021). However, there is one important caveat regarding these findings: When mixture selection models are fit to identify latent classes, classification may not be perfect and potential misclassification can be carried over to the subsequent outcome models.
To mitigate the effects of potential misclassification, this study presents a simultaneous procedure for mixing probability models, selection models, and outcome models, such that all the information in the data and all the three models can be used jointly for optimal classification of individuals and model estimation. Perhaps most importantly, misclassification is not carried over from one stage of analysis to another. We argue that this is a significant improvement in the use of latent classes, both theoretically and practically. As another benefit of the simultaneous procedure, our method does not regard latent class membership estimated from the selection model as fixed in the outcome model and provides proper estimates of variation for class-specific treatment effects by accounting for classification uncertainty. We achieved these advances by implementing a Bayesian estimation procedure where the likelihood function is defined as a product of three parts representing a mixing probability model, a selection model, and an outcome model.
Another important development distinguishing our method from other studies using latent classes for selection or outcome models is that, whereas previous work conducted classification at the cluster level (i.e., assigning clusters into one of the latent classes), the current study classifies units at the lowest level in the data. We emphasize that classification at the lowest level in multilevel data allows more flexible forms of heterogeneity and provides richer information about unobserved heterogeneity. As an example, consider a common multilevel data structure where students are nested within schools, and it is of interest to evaluate the effects of an optional online program on student learning outcomes in a distance learning environment. One can anticipate that the effects of a program vary substantially across students depending on students’ characteristics and their environments, such as technical equipment, parent and teacher support, quality of the Internet, and distractions at home, as well as students’ attributes and preference toward the particular program and distance learning itself. It would be beneficial for students, parents, teachers, educational directors, and policymakers to understand to whom and under what circumstances a program is successful. Classification at the individual level can identify a group of students who receive meaningful benefits from a treatment as well as help understand their selection processes to the treatment.
From a methodological point of view, classification at the lowest level allows the investigation of heterogeneity with weaker restrictions than classification at the cluster level. Continuing to use the distance learning program example, if classification is closely related to home environment and students’ characteristics, students within the same schools will often be divided into different latent classes if the schools consist of diverse students. If classification is largely determined by regions, district support, or school resources, class separation will occur mostly at the school level, and results will be similar to the cluster-level classification. Importantly, classification at the lowest level allows investigation of these two scenarios and scenarios in-between (e.g., some schools belong to one class, while other schools consist of students in multiple classes) without assuming particular patterns of classification at any level, in contrast to those methods that do not allow multiple class memberships within clusters. Therefore, we believe that the current research substantially contributes to the literature in the field by accounting for classification uncertainty directly in the estimation of treatment effects and also classifying units at the lowest level in the data.
Bayesian Joint Procedure for Classification and Estimation
Model Specification
Suppose there are M clusters and nj
individuals within cluster j, and each individual comes from one of K latent classes labeled as
Let
That is, the probability density of each individual is a mixture of K components. Each component can be decomposed into three parts: the outcome model for
Selection model
This component models how the propensity score, the probability that the individual is assigned to the treatment group, is related to covariates given the latent class. Assuming
where
are the random effects of cluster j for all K latent classes. For each cluster, by stacking random effects across all latent classes together and specifying a multivariate normal prior, we allow random effects of different latent classes to be correlated. This is reasonable because the correlation captures cluster-level effects across latent classes. For example, we may expect that individuals within the same cluster all tend to have relatively high or low probabilities to be treated regardless of their latent classes, which implies that all the K random intercepts (corresponding to the K latent classes respectively) should be positively correlated.
Outcome model
This component models how the outcome is related to covariates when the treatment indicator and the latent class are given, so it can just be a commonly used regression model for multilevel data. Assuming
where
are the random effects of cluster j for all K latent classes. Here, we again allow random effects from one latent class to be correlated with those from another within each cluster. The structure of this component is very similar to that of the selection model.
Mixing probability
This component models the probability that the individual belongs to each latent class given only its covariates. That is, the covariates of individuals from different latent classes may follow different distributions, so that the covariates alone give us information about latent class membership. Assuming
where
where
are the random effects of cluster j for all K latent classes. Equation 4 allows different clusters to have different decision boundaries. A relevant practical issue about this extension is that unlike models (1) and (2), the outcome of (3) and (4) (i.e.,
Putting all the components above together, the log-likelihood function of parameter
With parameter estimates
As in many model-based classification methods, our approach estimates which class each individual belongs to probabilistically, so we classify some individuals into a particular class more confidently than others. However, unlike most methods where the class membership is determined by one model (e.g., multinomial logit or cluster analysis), the proposed method fits the selection model, outcome model, and mixing probability jointly and simultaneously, and the three components collectively contribute to the estimation of class membership and possibly heterogeneous treatment effects.
Bayesian Estimation
Our goal is to estimate the latent class model with log-likelihood function (5). A Bayesian estimation procedure is applied here, given the complex structure of the model. First, we standardize all covariates in
These priors are mostly weakly informative because we hardly see very large standardized coefficients in practice. For specific research questions and data sets, we may want to use even weaker priors, although they may slow down the estimation procedure and make estimates less stable. We also recommend using more informative priors according to the specific research question. After the estimation procedure is done, the posterior mean of each parameter is used as the point estimate, the posterior standard deviation of each parameter measures the accuracy of the estimation, and specific quantiles of posterior samples can be used as bounds of credible intervals.
Akaike information criterion and Bayesian information criterion are widely used by frequentists for model comparison, but they are computed under the maximum likelihood framework (Vrieze, 2012). For Bayesian estimation, the leave-one-out cross-validation information criterion (LOOIC) can be used to select the best model (Vehtari et al., 2017). In this article, we fit the model with different Ks, compute their LOOIC, and test the differences between their expected predictive accuracy to decide the number of latent classes. More specifically, we consider two ways for selecting the number of latent classes K for the Bayesian method. The “minimum” way is to simply ignore the uncertainty in estimating LOOIC and choose the one with the smallest LOOIC. The other approach is a “sequential” process relying on testing the difference between expected log pointwise predictive densities (ELPDs) using
There are some additional computational issues. First, in this study, we use Stan, which performs Hamiltonian Monte Carlo for Bayesian estimation (Gelman et al., 2015). We run four independent chains simultaneously. Second, the model is not identified because of label switching, that is, class labels can be arbitrarily permuted, and as a result, the four chains typically do not mix well. A common way to avoid this is to add a constraint to distinguish between different latent classes, such as
In short, the model we propose uses three pieces of information to do classification: different distributions of covariates, different selection models, and different outcome models. The model here is very general because individuals from different latent classes are allowed to have not only different ATEs but also different coefficients, both fixed and random. Correspondingly, each cluster can have different random effects for different latent classes. Note that the formulation can be simplified or complicated depending on our assumptions. For example, we may assume that individuals from different latent classes actually follow the same outcome model except that they have different ATEs or, as described above, the mixing probability model also has a hierarchical structure. The corresponding estimation procedures are similar.
Doubly Robust Estimation for Model Checking
For each latent class k,
be the two potential outcomes for class k, and let
Then, the weighted average
is the doubly robust estimator of the ATE of latent class k, where
Still, we propose using the posterior distribution of
Other Estimation Approaches
We have already shown how to estimate the model (5) using Bayesian methods. Another popular way for estimating mixture models is the expectation–maximization (EM) algorithm. It is an iterative procedure, where during each iteration, we estimate the models based on the current class membership and then reassign individuals to new classes with the highest likelihood. More specifically, this procedure is as follows: Randomly assign each individual to one of the K latent classes. Estimate the mixing probability model based on the current class membership. Within each latent class, estimate its selection model and outcome model. For each individual, compute their log-likelihood function for the K latent classes respectively. This log-likelihood function is the sum of the log-likelihood from the mixing probability model, the selection model, and the outcome model. Assign each individual to the latent class with the highest log-likelihood. If the total number of iterations exceeds our preset limit or the class membership stays unchanged, then stop and choose the iteration with the highest total log-likelihood as the output. Otherwise, go back to step 2.
The EM algorithm essentially estimates the same models as the Bayesian procedure except that we do not explicitly let random effects of different latent classes be correlated because models of each latent class are estimated separately. As will be shown in the simulation later, the EM algorithm is much less stable than the Bayesian method when the sample size is small due to the presence of singularities under the maximum likelihood framework (Bishop, 2006, pp. 433–434), while priors of the Bayesian method help restrict parameters from being too off. Another disadvantage of the EM algorithm is that obtaining standard errors and confidence intervals for parameters is not straightforward. Bootstrap is applicable, but it can be very computationally intensive.
For comparison, we also consider a sequential estimation procedure where we simply apply the k-means algorithm to covariates for classifying individuals and then estimate parameters based on the class membership. The procedure is similar to the EM algorithm described above except that in step 3, the log-likelihood function is replaced by the negative of the Euclidean distance between the individual and the mean of each latent class. This approach is fast and easy, but its classification only takes mixing probability into account while all information within the outcome model and the selection model is discarded. Therefore, we expect that it has good performance only when the distributions of covariates of different latent classes are so different that k-means alone separates latent classes well. Moreover, this sequential approach does not provide standard errors or confidence intervals for parameters directly, so bootstrap is needed.
Simulation Study
There are three important questions regarding our proposed method: How well does this method classify individuals into their latent classes? How accurate are the ATE estimates? Can this approach find the correct number of latent classes? These questions will be answered through simulation studies. We consider three different scenarios here, where for each case, there are N individuals from M clusters, and random effects consist of random intercepts only.
Data Generating Process and Estimation
The data generating process for the simulation study is as follows. First, each individual is assigned to one of the three classes according to
where
Second, generate Level 2 covariates by
and generate random effects by
Third, assign individuals to clusters by the multinomial logit model
and then, the covariate for individual n who is the ith individual from cluster
Since the probability that each individual belongs to each cluster depends on the interaction between individual-level and cluster-level covariates, correlation is created between these two kinds of variables. Fourth, generate treatment indicator
where
where
We include all the five covariates (intercept, x
1, x
2, z
1, and z
2) as fixed effects in all the three models—mixing probability, selection model, and outcome model—and random intercepts are included in the latter two models. Since the covariance matrices of
Three different numbers of latent classes are considered: When
Summary of Cluster Sizes of Simulation Study
Note. K = number of latent classes; N = number of Level 1 units; M = number of Level 2 units.
Each data set is estimated using all the three methods: the Bayesian method, the EM algorithm, and the sequential procedure based on k-means. For the Bayesian estimation, we run
Results
Table 2 shows the summary statistics (means and standard deviations) of point estimates
Summary of Average Treatment Effect Point Estimates (
Note. K = number of latent classes; N = number of Level 1 units; M = number of Level 2 units;
The Bayesian method gives good results according to Table 2: The mean of
Table 3 shows the mean correct classification rates across replications. Again, the Bayesian method shows the best performance by giving the highest mean correct classification rates. The EM algorithm is almost as good as the Bayesian method when
Mean Correct Classification Rates Across Replications of Simulation Study
Note. K = number of latent classes; N = number of Level 1 units; M = number of Level 2 units.
Table 4 shows the results of the two ways for selecting the number of latent classes K for the Bayesian method. K is never underestimated:
Frequency Table of Estimated Number of Latent Classes Across Replications of Simulation Study
Note. N = number of Level 1 units; M = number of Level 2 units; K = true number of latent classes; k = estimated number of latent classes.
As additional model checking, Table 5 shows whether the point estimate
Percentages That Misfit Is Detected Across Replications of Simulation Study
Note. N = number of Level 1 units; M = number of Level 2 units; K = true number of latent classes; k = estimated number of latent classes.
The simulation study has shown that our proposed method is a good way for classifying observations and estimating ATEs: The correct classification rate is high, the ATE estimates are accurate and have the correct coverage, and in almost all cases, the correct number of latent classes is found. The performance of this method generally improves when the sample size increases.
Application With Trends in International Mathematics and Science Study (TIMSS) Data
Data and Variables
We use the 2015 TIMSS data of eighth graders in Korea to demonstrate our proposed approach. Briefly speaking, TIMSS is a large-scale educational assessment that investigates the progress of students’ performance in mathematics and science. Since 1995, TIMSS has been collected by the International Association for the Evaluation of Educational Achievement, and it has been conducted for Grades 4 and 8 every 4 years in more than 40 countries. The data are collected by a two-stage stratified cluster sampling, where schools are first selected based on school demographic variables and then at least one classroom is randomly selected within each school (Martin et al., 2016).
Our treatment variable is whether a student received a private science lesson or not. The outcome of interest is the first plausible value of the science achievement. For simplicity, we do not incorporate multiple plausible values of student science achievements or sampling weights in our data analysis. A set of covariates include six student-level covariates (student’s gender, father’s highest education level, the number of books at home, student's home study supports, student’s confidence in science, and student’s perceived value of science) and six school-level covariates (school’s gender type, the percentage of economically disadvantaged students, school location, science instruction affected by resource shortage, school’s emphasis on academic success, and school discipline problems). Table 6 summarizes all the covariates used in the analysis. Our analysis suggests nonlinearity of the variable on student’s confidence in science, so its quadratic term is included in the models as well. Note that some of the covariates, such as student’s confidence in science and perceived value of science, may be post-treatment confounders. If it is the case, then both controlling them and omitting them will lead to bias in the outcome regression model, so more complex treatment is needed (Moerkerke et al., 2015). We do not dive deeper into the rationality of variable selection shown in Table 6. In practice, researchers should rely on subject matter theory to determine which covariates to use and the time that each variable is measured is useful for judging whether each covariate is a pretreatment or a post-treatment confounder. After removing observations with missing values, our final sample consists of
All Covariates Used in the TIMSS Analysis
Note.
Procedure
As a demonstration of how to use our proposed method in practice, our data analysis procedure is as follows: For each different postulated number of latent classes, fit the proposed model using the Bayesian method described above. We suggest running several independent chains, discarding the first part of each chain as warmup iterations, and keeping only the chain with the smallest LOOIC. Then, check whether the chain chosen converges. One popular criterion is to make sure that the maximum Use the sequential approach to decide the number of latent classes: if For the best K, check whether all posterior means Summarize the results: For each parameter, the posterior mean is the point estimate, the
It is worth noting that we report
In this TIMSS analysis, both selection and outcome models include random intercepts just like the simulation study, and mixing probability models both with and without random intercepts are considered. We fit the model using the Bayesian procedure with
Results
Table 7 shows the model comparison results that help decide the number of latent classes and whether random intercepts are needed in the mixing probability model. Consider random effects in the mixing probability model first. The mixing probability model vanishes when
Model Comparison (Number of Latent Classes and Whether Mixing Probability Includes Random Intercepts [RIs]) for TIMSS Analysis
Note. K = number of latent classes; LOOIC = leave-one-out cross-validation information criterion; ELPD = expected log pointwise predictive density; TIMSS = Trends in International Mathematics and Science Study.
Tables 8 and 9 show the summary of ATE estimates and class memberships, respectively. Although we have already decided that
Summary of Posterior Distributions of Average Treatment Effects and Doubly Robust Estimates of TIMSS Analysis
Note. Note that
Number of Students in Each Latent Class of TIMSS Analysis
Note. K = number of latent classes; TIMSS = Trends in International Mathematics and Science Study.
As discussed in the simulation study, doubly robust estimates
Table 10 shows some descriptive statistics of the two latent classes. Both latent classes have approximately half of the students. Class 1 has almost zero ATE, suggesting that on average, the treatment has little effect on these students’ science scores; the ATE estimate of Class 2 is around 12 and is significant, although this effect is still not large practically, given that the standard deviations of the science score in the whole sample and the subsample of Latent Class 2 are 76.15 and 64.58, respectively. Class 1 has a higher percentage of treated students than Class 2, which looks largely ineffective. It would be better if more students in Class 2 were treated while fewer students in Class 1 were treated. In addition, Class 1 has much higher mean science scores than Class 2, and the private science lesson is far from able to fill this gap. This might be related to the ceiling effect: It is more difficult for students already good at science to further increase their scores, so they are more likely to be in Class 1, which has almost zero ATE.
Descriptive Statistics of the Two Latent Classes of TIMSS Analysis
Note. TIMSS = Trends in International Mathematics and Science Study; nk
= number of individuals in latent class k;
Table 11 shows the posterior means and standard deviations of fixed effects parameters for the two latent class solution. The two classes have very different selection models and outcome models, so it is reasonable to divide students into different latent classes with different model parameters. Generally, student-level variables are important in all the models: These variables are related to class membership and affect both the probability of taking private science lessons and the science score. School-level variables affect students’ science scores, but there is little evidence that they are also related to the class membership or whether students take private science lessons.
Posterior Means and Standard Deviations of Model Parameters for
Note. Boldface values indicate that the
The mixing probability model suggests that male students, those having more books at home, and those with more confidence or higher perceived value in science are more likely to be in Class 1. This agrees with the mean outcomes shown in Table 10: On average, these students tend to be good at science even without private lessons, so more science lessons are not very helpful. They may consider spending time in other ways. The descriptive statistics in Table 10 and the estimates of the mixing probability model in Table 11 together help us understand how students in the sample are classified. When new students who are not in the sample want to decide whether to take private science lessons, however, they only have access to their current covariates, so only the mixing probability model is applicable. By plugging in their covariates to the model, they can estimate the probability that they belong to each class, which helps make the decision. In addition, if future outcomes can be roughly predicted, then the descriptive statistics in Table 10 help estimate which latent class they might be in.
It is tricky to interpret the selection models because some covariates might be post-treatment confounders, so that it is hard to tell the direction of causality between the treatment indicator and the covariates. A carefully designed empirical study should take subject matter theory into account and pay more attention to the relationship among variables. Whether students in Class 1 take private science lessons seems only related to their perceived value of science but uncorrelated with any of the other covariates. For students in Class 2, more books at home, more confidence in science, and higher perceived value of science are all associated with higher probability of being treated. Students from schools whose science instruction suffers from resource shortage tend to take private science lessons, which makes much sense: Private lessons serve as a supplement to insufficient school teaching.
The outcome models of the two latent classes show some things in common. Students with more books at home or from schools in urban areas tend to have higher science scores. Students from schools with more economically disadvantaged students or schools with both boys and girls have lower science scores on average, although these covariates are significant for Class 1 only. For students in Class 1, their confidence in science and fathers’ education levels are positively associated with science scores; those having their own room and internet connection have lower science scores on average, which suggests that these conditions might be a distraction. For students in Class 2, their perceived values in science are positively associated with science scores; interestingly, the quadratic function of confidence in the outcome model suggests that more confidence leads to higher science scores for very few unconfident students but leads to lower scores for others.
In brief, we start from the descriptive statistics and the mixing probability model to relate covariates, treatment indicators, and outcomes to latent class membership and then interpret the selection model and the outcome model of each class. In this TIMSS analysis, we find that some variables have quite different coefficient estimates (different significance, signs, or magnitudes) in the selection or the outcome model across the two latent classes. Researchers should pay special attention to such heterogeneity.
Finally, the point estimates of Level 1 error term standard deviation and covariance matrices of random effects are
The outcome model and the selection model show similar patterns on random effects: The two latent classes’ random intercepts are moderately positively correlated, and Class 2 has greater variation than Class 1. Class 2 also has greater variation in the error term. We conclude that students in Class 2 are more heterogeneous than those in Class 1.
Discussion
In this article, we propose a latent class model for estimating heterogenous treatment effects under the multilevel setting. The model relies on three pieces of information to classify Level 1 units into latent classes: different distributions of covariates, different selection models, and different outcome models across latent classes. Instead of proceeding with each piece of information sequentially, where errors from one model will be directly carried over to the next model and gradually accumulate, we adopt a Bayesian joint procedure that takes all the information into account simultaneously to estimate this model. The Bayesian approach also gives additional information like credible intervals for the parameters and LOOIC for model comparison. The simulation study results suggest that the Bayesian procedure gives better estimates than the EM algorithm, which is based on the same model, and LOOIC provides sufficient information for determining the correct number of latent classes. For model checking, we also compute the doubly robust ATE estimates to examine possible functional form misspecification and underspecification of the number of latent classes. Finally, we apply the procedure to a real-world data set to demonstrate how to use our proposed method and interpret the results. The empirical example clearly shows the necessity of using our latent class model instead of simply ignoring the heterogeneity among individuals.
A main advantage of our approach is that a large amount of information about the estimation, including the posterior standard deviations of parameters and the posterior probability that each individual belongs to each latent class, is provided by the estimation procedure automatically. With the three pieces of the estimated model, mixing probabilities, selection processes, and outcomes, researchers can easily interpret the results and relate them to existing subject matter theory. The posterior probability of each individual shows the uncertainty in classification and can be helpful for deciding the optimal treatment for each individual through benefit–cost analysis.
It is also worth noting that our proposed method is very flexible. In this article, we try to model the joint distribution of parameters conditional on the covariates and the cluster membership of Level 1 units. We only try to model the selection processes and the outcomes, while how the covariates are generated and how Level 1 units are assigned to Level 2 clusters are taken as given. It is straightforward to extend the model, so that these two processes can be incorporated if one has good models for them. Although we only consider subgroup-specific ATEs in this study, it is also possible to estimate treatment effects conditional on observed covariates by simply including interaction terms between the treatment indicator and covariates in the outcome models. The idea of subgroup-specific models for conditional treatment effects can give researchers even more detailed information on heterogeneous treatment effects.
There are some possible future directions that may help improve our current method. First, we found that the Bayesian procedure has a hard time converging in a few cases when we specify a too large number of latent classes. Nonconvergence itself might be an indicator of overfit, but it needs to be verified. Second, how to better use the doubly robust estimator to detect possible model misspecification is still not clear. In this article, we use the
Supplemental Material
Supplemental Material, sj-pdf-1-jeb-10.3102_10769986221115446 - Estimating Heterogeneous Treatment Effects Within Latent Class Multilevel Models: A Bayesian Approach
Supplemental Material, sj-pdf-1-jeb-10.3102_10769986221115446 for Estimating Heterogeneous Treatment Effects Within Latent Class Multilevel Models: A Bayesian Approach by Weicong Lyu, Jee-Seon Kim and Youmi Suk in Journal of Educational and Behavioral Statistics
Footnotes
Authors’ Note
This research was performed using the compute resources and assistance of the UW-Madison Center For High Throughput Computing (CHTC) in the Department of Computer Sciences. The CHTC is supported by UW-Madison, the Advanced Computing Initiative, the Wisconsin Alumni Research Foundation, the Wisconsin Institutes for Discovery, and the National Science Foundation, and is an active member of the OSG Consortium, which is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research and/or authorship of this article.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
