Abstract
This article reviews and comments on three major expansions of propensity score methods in recent decades. First, how to use generalized propensity scores to tackle multi-categorical or continuous treatment variables is shown in procedures of propensity score regression adjustment and propensity score weighting. Second, the counterfactual framework of causal inference in the analysis of mediation mechanisms is reviewed and the decomposition of the causal relationship between variables into causal direct effects and causal indirect effects is illustrated. Third, the heterogeneous treatment effect across the distribution of propensity score values is discussed in the framework of the stratification-multilevel model. For each methodological breakthrough, this article comments on potential issues which deserve serious attention in the practical application of these methods.
Over the past decades, empirical research on causal relationships using propensity score methods has been increasing in the social sciences (Angrist and Pischke, 2009, 2014; Caliendo and Kopeinig, 2008; Harding, 2003; Imai and van Dyk, 2004; Imbens and Rubin, 2015; Imbens and Wooldridge, 2009; Morgan, 2001; Morgan and Harding, 2006; Morgan and Winship, 2007; Normand et al., 2001; Sekhon, 2011; Smith, 1997; Sobel, 1995, 1996, 2000; Winship and Morgan, 1999; Winship and Sobel, 2004). One of the great promises of propensity scores is that through procedures like matching or weighting, scholars can construct a dataset in which treated and untreated cases, for specific propensity score values, have similar odds of receiving a treatment. As a result, an observational dataset is transformed into a quasi-experimental dataset, and scholars can then conduct causal inference in the counterfactual framework (see Rubin, 1997).
In recent years, methodological developments have remarkably expanded the scope of research issues that can be addressed by propensity score methods. In this article, we review and comment on how to use propensity score methods in the research scenarios of multi-categorical treatment, causal mediation, and causal effect heterogeneity.
First, multi-categorical treatment is fairly common in social scientific research, but it cannot be appropriately handled by the conventional propensity score methods, which mainly focus on a binary treatment. Therefore, how to expand from a binary treatment to a multi-categorical or even continuous treatment is of great practical significance. The generalized propensity score method proposed by Imbens (2000) fills this gap, but has yet to make it into the hands of practical social scientific researchers. Second, a causal chain usually involves mediating mechanisms between the treatment and the outcome, so one recent development of propensity score methods is to introduce the counterfactual framework into the mediation analysis (Imai et al., 2010, 2011). A causal mediation test differs from a routine mediation test by introducing the sequential unconfoundedness assumption, which will be reviewed in detail in this article. Third, most existing studies using propensity score methods aim to estimate the average treatment effect (ATE) or the average treatment effect for the treated (ATT). However, the average effect conceals heterogeneous responses of individuals to the treatment. In light of this limitation, recent studies have applied propensity score methods to investigate causal effect heterogeneity (e.g., Brand and Xie, 2010; Crump et al., 2008; Heckman et al., 2006). We review an easy-to-use method in the framework of a multilevel model.
This article proceeds with a brief introduction of the counterfactual framework in causal inference. Then, we review and comment on recent methodological developments in the social sciences about applying propensity score methods in research scenarios of (1) multi-categorical treatment, (2) causal mediation, and (3) causal effect heterogeneity. Concluding remarks and suggestions for empirical researchers are presented at the end.
Background: Propensity score methods in the counterfactual framework
The most widely referenced statistical framework for making causal inference is the counterfactual theory, which is the theoretical foundation for propensity score methods (Angrist and Pischke, 2009, 2014; Imbens and Rubin, 2015; Morgan and Winship, 2007; Rubin, 1997; Rubin and Thomas, 1996). In this framework, ‘fact’ refers to what a researcher is able to observe when a case receives a certain treatment level (e.g., being treated), while ‘counter fact’, or counterfactual, is the counterpart outcome for the same case when this case receives another treatment level (e.g., in the control group). For instance, suppose we want to know whether a new teaching strategy can improve students’ test scores. A group of students are randomly assigned into either the new teaching strategy group (treatment) or the old teaching strategy group (control). For those in the new teaching strategy group, the post-treatment test scores are the observed fact. The counterfactual, in this case, refers to their test scores if they had been assigned into the control group. Similarly, for those students who are assigned into the old teaching strategy group, their test scores are the observed fact while the test scores they would have had if assigned into the treatment group are the counterfactual. The causal effect is then defined as the difference between the observed fact and the unobserved counterfactual, which can be expressed as follows:
$$T = \pi\,[E(Y_1 \mid W = 1) - E(Y_0 \mid W = 1)] + (1 - \pi)\,[E(Y_1 \mid W = 0) - E(Y_0 \mid W = 0)] \qquad (1)$$
In this equation, T represents the causal effect, π is the percentage of treated individuals and 1– π is the percentage of individuals in the control group. Y1 and Y0 are the outcome values (e.g., the test score), respectively, for the treated and controlled individuals, and W is the assignment mechanism where 1 = treatment group and 0 = control group. From equation (1), we can see that E (Y1 | W = 1) and E (Y0 | W = 0) are observed facts while E (Y1 | W = 0) and E (Y0 | W=1) are counterfactuals. Thus, the causal effect T is the weighted sum of (1) the difference between the observed fact and the counterfactual for individuals in the treatment group, that is, E (Y1 | W = 1) – E (Y0 | W = 1); and (2) the difference between the observed fact and the counterfactual for individuals in the control group, that is, E (Y1 | W = 0) – E (Y0 | W = 0). The weights are the percentages of individuals in the treatment and control groups, that is, π and 1– π. 1
One dilemma in equation (1) is that we cannot directly observe counterfactuals (Holland, 1986), i.e., we cannot know what the test scores would have been if the same group of students who were treated had been placed into the control group, and vice versa. What we can observe are only E (Y1 | W = 1) and E (Y0 | W = 0). In order to make causal inference, we have to set some assumptions, and the most important one is the unconfoundedness condition, 2 which can be expressed as:
$$(Y_1, Y_0) \perp W \qquad (2)$$
The unconfoundedness assumption refers to the independence between outcome values Y1 and Y0 and the assignment mechanism W. In particular, individuals, no matter how they can possibly be assigned, will always have outcome value Y1 as long as they are assigned to the treatment group. Similarly, the value of individuals’ outcome is always Y0 as long as they are in the control group even if they could possibly be assigned to the treatment group. The unconfoundedness assumption can be satisfied in a random experiment where individuals are randomly assigned into either the treatment or the control group. That is because the assignment mechanism W (1 or 0) in a random experiment is a random variable, which is de facto independent from Y1 and Y0. In observational studies, however, the unconfoundedness assumption is often violated. For example, the likelihood of attending college has much to do with a person’s family background, such as the educational level of parents. At the same time, family background may influence a college graduate’s probability of getting a well-paid position. In this case, whether to attend college (W) would be correlated with the labor market outcome (Y) via the confounding variable of family background. Thus, in order to satisfy the unconfoundedness assumption in observational research, scholars have to control for confounding variables like family background in this example.
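To make equation (1) and the role of unconfoundedness concrete, the following minimal sketch uses a small table of hypothetical potential outcomes. The eight students, their scores, and the function names are illustrative assumptions, not data from any study. Because both Y1 and Y0 are written down for every student (which is never possible in practice), the causal effect described in equation (1) can be computed directly and compared with the naive difference that uses only the observed facts E (Y1 | W = 1) and E (Y0 | W = 0).

```python
# Hypothetical potential outcomes for eight students (illustrative numbers only).
# Each tuple is (W, Y1, Y0): assignment, score if treated, score if untreated.
# Stronger students are the ones who end up treated, so W is confounded.
students = [
    (1, 90, 80), (1, 88, 78), (1, 92, 82), (1, 86, 76),
    (0, 70, 60), (0, 72, 62), (0, 68, 58), (0, 74, 64),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def causal_effect(data):
    """Weighted sum of the treated-group and control-group effects, as in equation (1)."""
    treated = [s for s in data if s[0] == 1]
    control = [s for s in data if s[0] == 0]
    pi = len(treated) / len(data)  # share of treated individuals
    effect_treated = mean(y1 for _, y1, _ in treated) - mean(y0 for _, _, y0 in treated)
    effect_control = mean(y1 for _, y1, _ in control) - mean(y0 for _, _, y0 in control)
    return pi * effect_treated + (1 - pi) * effect_control


def naive_difference(data):
    """E(Y1 | W = 1) - E(Y0 | W = 0): uses observed facts only."""
    observed_treated = mean(y1 for w, y1, _ in data if w == 1)
    observed_control = mean(y0 for w, _, y0 in data if w == 0)
    return observed_treated - observed_control


true_effect = causal_effect(students)
naive = naive_difference(students)
print(true_effect, naive)
```

In this constructed table the causal effect is 10 points for every student, while the naive comparison yields 28 points, because assignment is related to the students' baseline scores. The gap between the two numbers is the bias that arises when the unconfoundedness assumption fails.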
3
Denoting all possible confounding variables with a matrix
What should be pointed out is that the dimension of
Inputting (4) into (1), we have
Now, as we can see, equation (5) can be estimated using only observed facts.
The plausibility of the unconfoundedness assumption, although not directly testable, can still be assessed through certain statistical analyses. As suggested by Imbens and Rubin (2015: Chs 12 and 20), two methods are available in the prior literature. One approach is to estimate the treatment effect on an unaffected outcome, such as the lagged outcome (as proposed by Heckman and Hotz, 1989), and the other approach is to use multiple control groups (as proposed by Rosenbaum, 1997). In both approaches, if the treatment effect that is known to be zero is estimated to be nonzero, we may call into question the plausibility of the unconfoundedness assumption.
To recapitulate, the rationale of propensity score methods is to create a quasi-experimental environment by controlling for the propensity score P, which is a proxy for a series of confounding covariates
Handling multi-categorical treatments with generalized propensity scores
In traditional propensity score methods, the treatment variable is usually binary (the assignment mechanism W = 1 or 0), but if the treatment variable is multi-categorical or continuous, how can we estimate the average treatment effect? 4 Suppose the treatment variable involves m categories with m > 2. Following equation (5), the average treatment effect between treatment level j and treatment level k should take the following form:
Although this equation is a straightforward extension of formula (5), its estimation is more complicated relative to the binary treatment situation. Hirano and Imbens (2001) proposed two methods of estimating Tjk. One is propensity score regression adjustment and the other is propensity score weighting, both of which are based on generalized propensity scores (Hirano and Imbens, 2001; Imbens, 2000; Imai and van Dyk, 2004). 5
Generalized propensity scores refer to the predicted probabilities of receiving different levels of treatment when the treatment variable involves multiple categories (Imbens, 2000). Depending on whether the treatment levels are ordered, we can use either multinomial logistic regression or ordered logistic regression to predict generalized propensity scores. For example, suppose we want to examine the causal effect of educational attainment (e.g., coded as 1 = elementary school; 2 = junior high school; 3 = senior high school; and 4 = college) on earnings. We can use ordered logistic regression to predict the respective probabilities of attaining elementary school, junior high school, senior high school, and college, given a series of covariates. In this way, each individual would have four predicted probabilities representing the respective likelihood of receiving one of the four treatment levels of education. Using P(t), where t = 1, 2, 3, and 4, to denote these predicted probabilities, P(t) is the generalized propensity score. With P(t), we then estimate Tjk using either propensity score regression adjustment or propensity score weighting (Hirano and Imbens, 2001).
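As an illustration of how the four predicted probabilities arise from an ordered logistic model, the sketch below assumes the standard proportional-odds form, in which P(T ≤ t) is the logistic function of a cutpoint minus a linear predictor. The cutpoints and linear predictor values are hypothetical; in an actual analysis they would be estimated from the data by fitting the model on the confounding covariates.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(eta, cutpoints):
    """Return [P(1), ..., P(m)] for one individual under a proportional-odds model.

    eta is the individual's linear predictor (covariates times coefficients);
    cutpoints are the m - 1 increasing thresholds, so P(T <= t) = logistic(c_t - eta).
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)  # P(T = t) = P(T <= t) - P(T <= t - 1)
        previous = cum
    return probs


# Hypothetical thresholds separating the four educational levels.
cutpoints = [-1.0, 0.5, 2.0]

# Two hypothetical individuals: a less and a more advantaged family background.
gps_low = ordered_logit_probs(-0.5, cutpoints)
gps_high = ordered_logit_probs(1.5, cutpoints)

for label, gps in (("low", gps_low), ("high", gps_high)):
    print(label, [round(p, 3) for p in gps])
```

Each individual receives one probability per treatment level, and the four probabilities sum to one. The individual with the larger linear predictor has a higher predicted probability of reaching the college level.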
Propensity score regression adjustment
Propensity score regression adjustment uses the estimated generalized propensity score as a predictor of outcome Y and fits the following model:
This is an ordinary regression model where α (t), β (t) and ε (t) are the intercept, regression coefficient, and random error, respectively, at treatment level t. Based on this model, we are enabled to obtain estimated coefficients
For instance, the predicted wages for the four educational levels are respectively:
Then, the average treatment effect between categories j and k is:
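The regression adjustment procedure described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the six individuals, their outcomes, and their generalized propensity scores are made up, and the simple linear form Yi(t) = α(t) + β(t)*P(t) + ε(t) is used. For each treatment level t, the model is fitted among those who actually received level t, the fitted line is used to predict the outcome of every individual at his or her own P(t), and the predictions are averaged.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def adjusted_mean(units, level):
    """Average predicted outcome at a treatment level, over all individuals."""
    received = [u for u in units if u["level"] == level]
    intercept, slope = fit_line(
        [u["gps"][level] for u in received],
        [u["y"] for u in received],
    )
    predictions = [intercept + slope * u["gps"][level] for u in units]
    return sum(predictions) / len(predictions)


# Hypothetical data: received level, outcome, and generalized propensity scores.
units = [
    {"level": 1, "y": 16.0, "gps": {1: 0.6, 2: 0.3, 3: 0.1}},
    {"level": 1, "y": 15.0, "gps": {1: 0.5, 2: 0.3, 3: 0.2}},
    {"level": 2, "y": 26.0, "gps": {1: 0.2, 2: 0.6, 3: 0.2}},
    {"level": 2, "y": 25.0, "gps": {1: 0.3, 2: 0.5, 3: 0.2}},
    {"level": 3, "y": 36.0, "gps": {1: 0.1, 2: 0.3, 3: 0.6}},
    {"level": 3, "y": 35.0, "gps": {1: 0.2, 2: 0.3, 3: 0.5}},
]

adjusted = {t: adjusted_mean(units, t) for t in (1, 2, 3)}
effect_3_vs_1 = adjusted[3] - adjusted[1]
print(adjusted, effect_3_vs_1)
```

The difference between two adjusted means is the estimated average treatment effect between the corresponding treatment levels. With only two individuals per level the fitted lines are exact, so the example shows the mechanics only and says nothing about sampling variability.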
Propensity score weighting
Propensity score weighting takes a different approach relative to propensity score regression adjustment. In particular, propensity score weighting can be treated as an application of the Horvitz–Thompson estimator (Horvitz and Thompson, 1952). The Horvitz–Thompson estimator has been widely used in unequal probability sampling. In order to make unbiased inference, every individual is weighted by the inverse of their probability of inclusion in the sample (denoted with πi). For instance, the population mean based on the Horvitz–Thompson estimator is $\mu_{\text{Horvitz–Thompson}} = \frac{1}{N}\sum_{i \in s} \frac{y_i}{\pi_i}$, where s denotes the sample and N the population size.
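Similarly, individuals have different probabilities of receiving a specific treatment level t. Thus, we can perform a weighting operation in the Horvitz–Thompson sense. Generalized propensity scores, referring to the probabilities of receiving different levels of treatment, resemble the inclusion probabilities in the Horvitz–Thompson estimator. Thus, the estimate of the population mean of Y for treatment level t would be $\mu_y(t) = E[Y_i(t)] = E\left[\frac{Y_i \, I(W_i = t)}{P_i(t)}\right]$, where I(Wi = t) equals 1 if individual i receives treatment level t and 0 otherwise, and Pi(t) is that individual's generalized propensity score for level t.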
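The weighting logic can be illustrated with a small sketch in which each individual who received level t is weighted by the inverse of his or her generalized propensity score for that level. The eight individuals, their outcomes, and their scores are hypothetical and constructed so that the true mean of the potential outcome Y(t) is known (12.5, 22.5, and 32.5 for levels 1, 2, and 3); the scores are treated as known rather than estimated.

```python
# Hypothetical generalized propensity scores for two covariate profiles.
GPS_A = {1: 0.5, 2: 0.25, 3: 0.25}   # profile A mostly receives level 1
GPS_B = {1: 0.25, 2: 0.25, 3: 0.5}   # profile B mostly receives level 3

# Each record is (received level, observed outcome, generalized propensity scores).
# Outcomes follow Y(t) = 10 * t for profile A and Y(t) = 10 * t + 5 for profile B.
units = [
    (1, 10.0, GPS_A), (1, 10.0, GPS_A), (2, 20.0, GPS_A), (3, 30.0, GPS_A),
    (1, 15.0, GPS_B), (2, 25.0, GPS_B), (3, 35.0, GPS_B), (3, 35.0, GPS_B),
]


def ipw_mean(data, level):
    """Inverse-probability-weighted estimate of the mean of Y(level)."""
    weighted = [y / gps[level] for received, y, gps in data if received == level]
    return sum(weighted) / len(data)


def naive_mean(data, level):
    """Unweighted mean outcome among those who received the level."""
    outcomes = [y for received, y, _ in data if received == level]
    return sum(outcomes) / len(outcomes)


for level in (1, 2, 3):
    print(level, naive_mean(units, level), ipw_mean(units, level))

ipw_effect_3_vs_1 = ipw_mean(units, 3) - ipw_mean(units, 1)
naive_effect_3_vs_1 = naive_mean(units, 3) - naive_mean(units, 1)
```

In this constructed example the weighted estimates equal 12.5, 22.5, and 32.5, so the weighted contrast between levels 3 and 1 is 20, whereas the unweighted contrast is larger because profile B is over-represented at level 3. In applications, the scores are estimated, and a normalized version of the weights is often used instead; that variant is not shown here.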
Weighting has been widely used in the social sciences. In addition to tackling estimation bias, propensity score weighting can be statistically efficient. For instance, Hirano et al. (2003) proposed a nonparametric series logit estimate of the propensity score, and proved that weighting observations by the inverse of this propensity score could reach the semiparametric efficiency bound developed by Hahn (1998). 6 However, weighting on the true propensity score and weighting on the estimated propensity score are different, and it has been verified by a series of studies that efficiency can be gained from using the estimated rather than the true propensity scores as long as the estimation function of propensity score is sufficiently flexible (Rosenbaum, 1987; Rubin and Thomas, 1996). This is also the case for propensity score matching, as noted in Abadie and Imbens (2006) and Frolich (2007).
Comments on practical issues
Although propensity score regression adjustment and propensity score weighting take different approaches, studies have shown that results based on these two methods are fairly consistent (Feng et al., 2011). However, there are some practical issues that are noteworthy.
First of all, the estimation of generalized propensity scores for the multi-categorical treatment is more complicated relative to the binary case. In particular, if we are interested in a binary treatment variable (e.g., attending college or not), the transition between treatment levels in causal inference is only from 0 (less-educated) to 1 (college-educated). In the case of generalized propensity scores, however, such transition is multiple. In our example of four educational levels, the relevant transitions are from 1 (elementary school) to 2 (junior high school), 1 to 3 (senior high school), and 1 to 4 (college), assuming the reference level is elementary school. Since the same set of confounding covariates
Second, because propensity score regression adjustment is a parametric model, it is subject to the restrictions of model forms (Flores-Lagunes et al., 2010). In the discussion above, we only specify a simple linear model Yi(t) = α(t) + β(t)*P(t) + ε(t) in the interest of illustration, but the relationship between generalized propensity scores P(t) and the outcome Yi(t) can be more complicated. For example, there may be a nonlinear relationship that calls for second- or higher-order terms of P(t). Hirano and Imbens (2001) realized this restriction and argued that the second-order term of P(t) and the interaction term between P(t) and treatment level t should always be added to model (7), such that:
Although this model takes into account more complicated terms, we are still not sure whether this model is the best one to estimate Yi(t). This concern leads us to the classic problem of model uncertainty, a theme that has been widely examined by statisticians (Hoeting et al., 1999; Raftery, 1995; Zigler and Dominici, 2014), sociologists (An, 2010; Young, 2009), economists (Cohen-Cole et al., 2009; Durlauf et al., 2012; Sala-i-Martin, 1997), political scientists (Ho et al., 2007), and psychologists (Kaplan and Chen, 2012, 2014). 7 Generally speaking, there are two types of model uncertainty in equation (11). One is about the variability of P(t). Clearly, P(t) is a predicted value that is subject to the distributional variation in the coefficients of the multinomial model. The other type of model uncertainty is about the model form, i.e., the order of the polynomial terms. These two types of model uncertainty suggest that model (11) is more appropriately treated as only one of many possible model forms (i.e., a model space) fitted on one of many possible estimates of P(t). To solve the first model uncertainty problem, a practical researcher should consider a Bayesian approach by incorporating the prior distributions of multinomial model coefficients, as illustrated by An (2010) and Kaplan and Chen (2012). The second type of model uncertainty can be tackled by model averaging techniques, where the model space is explicitly derived and the coefficients are averaged over all possible models, weighted by the posterior model probabilities (e.g., Raftery, 1995; Zigler and Dominici, 2014).
Third, as shown by Hirano and Imbens (2001), generalized propensity scores can be used to analyze continuous treatment variables. For a continuous treatment variable, we cannot calculate discrete predicted probabilities. Instead, we estimate the probability density for a range of treatment values. If the conditional probability density function of the treatment is f, the generalized propensity score for a continuous treatment variable would be fT|X(t|x). Usually, we assume a normal density so that Ti | Xi ~ N(f(x), σ²). Then the generalized propensity score is the normal density evaluated at treatment value t, $P(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(t - f(x))^2}{2\sigma^2}\right]$, where π here is the mathematical constant rather than the share of treated individuals.
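The continuous case can be sketched in a few lines. The example assumes a normal conditional distribution of the treatment given the covariates; the linear conditional mean, its coefficients, and the value of σ are hypothetical placeholders for quantities that would be estimated, for instance by regressing the treatment on the covariates.

```python
import math


def conditional_mean(x, intercept=1.0, slope=0.5):
    """Hypothetical conditional mean of the treatment given one covariate x."""
    return intercept + slope * x


def continuous_gps(t, x, sigma=2.0):
    """Normal density of treatment value t given covariate x."""
    mean = conditional_mean(x)
    exponent = -((t - mean) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sigma ** 2)


# For x = 4 the conditional mean of the treatment is 3.
gps_at_mean = continuous_gps(3.0, 4.0)
gps_off_mean = continuous_gps(5.0, 4.0)
print(round(gps_at_mean, 4), round(gps_off_mean, 4))
```

The score is highest for treatment values close to what the covariates predict and falls off symmetrically as the treatment value moves away from the conditional mean.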
Finally, like propensity scores for the binary treatment, generalized propensity scores require the satisfaction of the unconfoundedness assumption, which can be expressed as E[Y | W = i, P(t)] = E[Y | W = j, P(t)], similar to that for the binary treatment situation.
Performing causal mediation in observational studies
In most previous works, propensity score methods are used to examine the direct causal effect of a treatment variable on an outcome. As a result, the mediating mechanisms are understudied. Investigating only the direct causal relationship between treatment and outcome oversimplifies a causal chain and ignores mediating variables that may be substantively of interest. In this context, there have been some noteworthy attempts in the past decades to introduce the counterfactual framework into mediation tests. 8 For instance, many psychological studies applied principal stratification (Frangakis and Rubin, 2002) to mediation analysis (e.g., Hill et al., 2003; Jo and Stuart, 2009). However, one important condition of this approach is that the treatment should be randomly assigned (Jo and Stuart, 2009: 2863), which can rarely be satisfied in sociological studies. 9 Here, we review another related but distinct framework of the counterfactual mediation test proposed by political scientist Imai and his colleagues (Imai et al., 2010, 2011). According to their research, several assumptions need to be met in order to make causal inference in an observational mediation test.
Sequential unconfoundedness assumption
Imai and colleagues specified the version of unconfoundedness assumption for performing a causal mediation test (Imai et al., 2010, 2011). In particular, they argue that a causal chain from W to Y via mediating variable M can be established if the assumption of sequential unconfoundedness is satisfied, which is
In assumptions (12) and (13), W is the assignment mechanism which can be valued 1 or 0;
10
Performing the causal mediation test
When the sequential unconfoundedness assumption is satisfied, we can perform the causal mediation test. According to Imai et al. (2011), the overall treatment effect can be decomposed into two components. The first component is the direct treatment effect. For individual i, this is defined as
$$Y_i[1, M_i(W)] - Y_i[0, M_i(W)] \qquad (14)$$
This treatment effect refers to the change in outcome Y when W shifts from 0 to 1 while the mediation variable is held fixed at Mi(W). For instance, when W is 1, the direct treatment effect is Yi[1,Mi(1)] – Yi[0,Mi(1)], and when W = 0, we have Yi[1,Mi(0)] – Yi[0,Mi(0)]. Yi[1,Mi(W)] and Yi[0,Mi(W)] cannot be simultaneously observed.
The second component is the causal mediation effect, which, at the individual level, can be written as:
$$Y_i[W, M_i(1)] - Y_i[W, M_i(0)] \qquad (15)$$
The causal mediation effect refers to the change in outcome Y that results from the change in the mediation variable M from Mi(0) to Mi(1), while the treatment status W is held fixed. Mi(1) and Mi(0), again, are not simultaneously observable.
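The two components can be illustrated with a toy model in which all potential outcomes are known by construction. The structural equations, coefficients, and three individuals below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the estimation procedure of Imai and colleagues, which must work from observed data under the sequential unconfoundedness assumption.

```python
def mediator(w, baseline):
    """Hypothetical potential mediator M_i(w)."""
    return baseline + 2.0 * w


def outcome(w, m, noise):
    """Hypothetical potential outcome Y_i(w, m), with a treatment-mediator interaction."""
    return 1.0 + 3.0 * w + 0.5 * m + 0.25 * w * m + noise


# (baseline mediator level, individual noise) for three hypothetical individuals.
individuals = [(0.0, -1.0), (2.0, 0.0), (4.0, 1.0)]


def direct_effect(person, w):
    """Y_i[1, M_i(w)] - Y_i[0, M_i(w)]: treatment changes, mediator held at M_i(w)."""
    baseline, noise = person
    m = mediator(w, baseline)
    return outcome(1, m, noise) - outcome(0, m, noise)


def mediation_effect(person, w):
    """Y_i[w, M_i(1)] - Y_i[w, M_i(0)]: mediator changes, treatment held at w."""
    baseline, noise = person
    return (outcome(w, mediator(1, baseline), noise)
            - outcome(w, mediator(0, baseline), noise))


def total_effect(person):
    baseline, noise = person
    return (outcome(1, mediator(1, baseline), noise)
            - outcome(0, mediator(0, baseline), noise))


def average(values):
    values = list(values)
    return sum(values) / len(values)


ade = {w: average(direct_effect(p, w) for p in individuals) for w in (0, 1)}
acme = {w: average(mediation_effect(p, w) for p in individuals) for w in (0, 1)}
total = average(total_effect(p) for p in individuals)
print(ade, acme, total)
```

Here the average direct effects are 3.5 (W = 0) and 4.0 (W = 1), the average mediation effects are 1.0 (W = 0) and 1.5 (W = 1), and the total effect of 5.0 equals the direct effect at one treatment status plus the mediation effect at the other. The interaction term is what makes the two effects depend on W; without it, each effect would take a single value.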
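The expected values of equations (14) and (15) are the average direct effect (ADE) and the average causal mediation effect (ACME), which are:
$$\text{ADE}(W) = E\{Y_i[1, M_i(W)] - Y_i[0, M_i(W)]\}$$
$$\text{ACME}(W) = E\{Y_i[W, M_i(1)] - Y_i[W, M_i(0)]\}$$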
Then, how can we use the confounding covariates
Comments on practical issues
Although the causal mediation test proposed by Imai and colleagues expands the traditional propensity score method, there are still some limitations in its practical application when multiple mediation variables are available. To illustrate, suppose we want to explore the effect of higher education attainment on happiness via income and occupation, as shown in Figure 1.

Figure 1. The mediating effects of income and occupation between higher education and happiness.
In the case described in Figure 1, the causal mediation effect can be unidentifiable. That is because whenever we examine the mediation effect of income, we cannot get rid of the effect of occupation, and vice versa. In other words, we cannot distinguish the chain D–C (higher education promotes income, which further contributes to happiness) from the chain E–B–C (a college graduate gets a lucrative occupation with high pay, which further promotes happiness). This problem can be more severe if the confounding mediation variable (e.g., occupation in this example) is unobservable, so that researchers cannot even recognize such a confounding effect in mediation. Consequently, whenever there are multiple and mutually correlated mediation variables, the conclusion based on the causal mediation test should be treated with caution.
Investigating causal heterogeneity in the stratification-multilevel model
The heterogeneous treatment effect
Previous propensity score methods focus on the ATE. One implicit assumption in the examination of ATEs is that individuals, regardless of their personal characteristics, respond to the treatment in a homogeneous fashion. In the past decades, however, many studies have shown that the assumption of homogeneity is not sufficient to capture the complexity of a causal relationship (e.g., Brand and Halaby, 2006; Crump et al., 2008; Dale and Krueger, 2002), so recent research has begun to pay more attention to the heterogeneity of the treatment effect, for which propensity score methods are powerful tools.
What should be noted first is that some econometric methodologies have explored the heterogeneous treatment effect. For example, Heckman proposed the marginal treatment effect (MTE) model to study how the treatment effect changes across the distribution of propensity score values (Heckman et al., 2006). Applications of Heckman’s MTE model have been found in economic and medical research (Basu et al., 2007; Carneiro et al., 2011). Besides Heckman’s model, Li and Racine (2004, 2007) developed the nonparametric local linear kernel method. Based on this method, individual-level heterogeneity in the treatment effect can be estimated (e.g., Henderson et al., 2006; Zhu, 2011). Lastly, Crump et al. (2008) developed a nonparametric approach of detecting zero and constant treatment effect for all subpopulations defined by the covariate, based on the series estimator for the regression function between Y and confounding variables
The rationale of the stratification-multilevel model
The stratification-multilevel model, in principle, follows the general multilevel modeling rationale where individuals are nested within higher level institutions. The research objective of a multilevel model is to see the cross-institutional variation in the individual-level treatment effect. One classic example in this light is students nested within different types of schools. Suppose we want to know the treatment effect of a new teaching strategy on students’ test scores. We can estimate this treatment effect for each type of school and examine whether such an effect differs from one school type to another. If a significant cross-school-type variation is detected, we may say that the treatment effect of the new teaching strategy on the test score is heterogeneous across school types.
For example, suppose we have three types of schools according to the school size (1 = small; 2 = medium; 3 = large). Within each type of school, we estimate the treatment effect of the new teaching strategy and denote this effect with a dot in Figure 2. 11 Since we have three types of schools, three dots are marked out. Looking at Figure 2, we conclude that the treatment effect of the new teaching strategy declines with the school size, as noted by the OLS line. The larger the school, the smaller the treatment effect would be. In this way, we show how the treatment effect may vary across school types, that is, the treatment effect heterogeneity.

Figure 2. An example of the multilevel model.
Brand, Xie, and colleagues’ approach in probing treatment heterogeneity follows the same line of thought. In a study of the heterogeneity in the returns to a college degree (Brand and Xie, 2010), they first estimated the propensity score of attending college and grouped individuals into strata where people within the same stratum have close propensity score values (that is, they have similar odds of attending college). Here, strata play the same role as school types in Figure 2. Then, Brand and Xie ordered these strata according to the strata-wise propensity score values to make sure the odds of receiving higher education monotonically increase or decrease across strata (like the ordering of schools according to school size in Figure 2). Within each stratum, Brand and colleagues used weighted OLS to estimate the treatment effect (i.e., the returns to a college degree), and then the trend of the stratum-based treatment effects across strata was drawn using a weighted linear regression model. This second-level linear regression line (like the OLS line in Figure 2) demonstrates whether the returns to a college degree are heterogeneous and vary with the likelihood of attending college.
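The two-level logic described above can be sketched with a deliberately simplified example. The twelve records, the two cutpoints, and the function names are hypothetical. Two simplifications relative to the published approach should also be noted: the within-stratum effect is a plain difference in means rather than a weighted OLS estimate, and the second-level trend is an unweighted regression of the stratum effects on the stratum rank.

```python
from bisect import bisect_right


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def stratum_effects(records, cutpoints):
    """Level 1: difference in mean outcome (treated minus control) per stratum."""
    groups = {}
    for pscore, treated, y in records:
        stratum = bisect_right(cutpoints, pscore) + 1  # strata ranked by propensity score
        groups.setdefault(stratum, {0: [], 1: []})[treated].append(y)
    effects = {}
    for stratum, by_arm in sorted(groups.items()):
        mean_treated = sum(by_arm[1]) / len(by_arm[1])
        mean_control = sum(by_arm[0]) / len(by_arm[0])
        effects[stratum] = mean_treated - mean_control
    return effects


# Hypothetical records: (estimated propensity score, treatment indicator, outcome).
records = [
    (0.10, 1, 14.0), (0.20, 1, 16.0), (0.15, 0, 9.0), (0.25, 0, 11.0),
    (0.40, 1, 18.0), (0.55, 1, 20.0), (0.45, 0, 15.0), (0.60, 0, 17.0),
    (0.70, 1, 22.0), (0.90, 1, 24.0), (0.75, 0, 21.0), (0.85, 0, 23.0),
]
cutpoints = [1 / 3, 2 / 3]  # hypothetical boundaries defining three strata

effects = stratum_effects(records, cutpoints)

# Level 2: linear trend of the stratum-specific effects across stratum ranks.
ranks = sorted(effects)
intercept, slope = fit_line(ranks, [effects[r] for r in ranks])
print(effects, intercept, slope)
```

In this constructed example the stratum effects are 5, 3, and 1, so the second-level slope is -2: the treatment effect declines as the propensity to be treated rises. The second-level line is fitted on only three points, one per stratum.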
Comments on practical issues
Brand, Xie, and colleagues’ stratification-multilevel model is easy to implement in practice and the multilevel modeling rationale discussed above is also straightforward to comprehend, which makes this method widely used in many empirical studies (Brand, 2010; Brand and Davis, 2011). However, there are still several practical issues deserving special attention.
First of all, the stratification-multilevel model is sensitive to the procedure of predicting propensity scores. Adding or deleting one covariate in the logistic or probit model predicting propensity scores may change the propensity score values for most or even all individuals, and further alter the grouping of them into different strata. Again, this is a model uncertainty problem, and it cannot be solved through conventional model goodness of fit criteria, such as AIC or BIC. For instance, two models may have similar goodness of fit (e.g., two models can have close pseudo-R2 values), but one model adds a nonsignificant covariate and the other one does not. Although this added nonsignificant covariate does not contribute to the overall goodness of model fit, it will definitely change the predicted propensity score values for every individual since the propensity score method requires that confounding covariates, irrespective of their statistical significance, should be balanced between the treatment and control groups. Thus, adding a nonsignificant covariate to the model to predict the propensity scores might bring about substantial changes to the final pattern of the second-level trend (e.g., the downward lines shown in Figure 2 may become flat or even increasing).
Second, besides the concerns of model uncertainty, the stratification-multilevel model is sensitive to the number of strata. In propensity score matching, a rule of thumb is to construct five strata according to the quintiles of the estimated propensity score. This rule was attributed to Cochran (1968) and supported by subsequent research, such as Austin (2008). According to Rosenbaum and Rubin (1984), separating individuals into five strata based on the propensity score was enough to remove 90% of the bias that could be removed by individual matching on all covariates. Recently, Rubin (2010) suggested a grouping of five to 10 strata, but with no justification as to how this number should be determined. In Brand and Xie (2010), six strata were constructed.
If the research focus is only on removing estimation bias, the conventional suggestion of five to 10 strata should be fine. However, what is at issue here is that the number of strata also determines the statistical power of the second-level model (e.g., the OLS line in Figure 2) in the stratification-multilevel analysis. For a five-strata analysis, for instance, the second-level OLS model is fitted only based on five data cases, which is reasonably subject to the problem of low statistical power. In this case, the second-level OLS model is not robust to even a slight change (e.g., moving upward or downward) in the data cases.
It seems that the best way to solve this low statistical power problem for the second-level OLS model is to increase the number of strata. This, however, may not be a feasible solution. For one thing, individuals are grouped together in the same stratum on the condition that their propensity score values are close to each other, so the optimal number of strata is determined more by the data itself than by the researcher. For another, even if a researcher predetermines the number of strata, there is a tradeoff between the number of individuals in each stratum and the total number of strata. For a given sample size, the decrease in the number of respondents in each stratum as the number of strata goes up might call into question the statistical power of the analysis conducted within each stratum.
It is necessary to clarify that we are not arguing that the propensity score cannot be partitioned finely. With a correct propensity score estimation model, we can always classify individuals into several discernible strata and correct potential estimation bias. The concern here, however, is that there could be an insufficient number of strata to guarantee the statistical power of a higher-level regression analysis. This problem itself does not relate directly to propensity score estimation.
For practical researchers, one solution to this problem might be to increase the overall sample size to guarantee both the within-stratum and cross-strata statistical power. Another strategy might be to give up multilevel modeling and set the analytical focus simply on the individual level. In this regard, several alternative methodologies have been proposed. For instance, in a recently published article, Xie et al. (2012) developed two nonparametric methods to investigate heterogeneous treatment effect. Specifically, the matching-smoothing method calculates the observed differences in outcome between the matched individuals and sees how such differences change across the spectrum of propensity score values using nonparametric techniques, such as local polynomial regression or lowess smoothing. The smoothing-differencing method, in contrast, first nonparametrically regresses the outcome on the values of propensity score respectively for the treated and control cases, and then computes ‘the difference in the nonparametric regression line between the treated and the untreated at different levels of the propensity score’ (Xie et al., 2012: 326). Evidently, these two nonparametric methods no longer follow a multilevel model approach, but instead focus on the pattern on the individual level. Another possible individual-level approach is to apply nonparametric generalized kernel estimation proposed by Li and Racine (2007) to calculate the individual-level treatment effect. Then, the relationship between propensity score values and the nonparametric individual-level treatment effect can be examined with routine parametric or nonparametric regression models (e.g., Hu and Hibel, 2015).
To recapitulate, Brand, Xie, and colleagues developed an easy-to-use stratification-multilevel model to examine the heterogeneous treatment effects across the spectrum of propensity score values. Although this method is straightforward and easy to implement, it could be sensitive to model uncertainty in propensity score estimation, as well as to the statistical power issues associated with the number of strata. In this regard, a series of individual-level analysis procedures could be possible alternatives for practical researchers.
A last comment on the study of treatment effect heterogeneity concerns heterogeneity in the actual covariates. Indeed, people from different propensity score strata have differential odds of receiving a treatment, which implies that they must differ from each other in one or several actual confounding covariates. In this light, it is tempting to explore treatment effect heterogeneity by directly examining the heterogeneity of the actual covariates. This approach, although it provides interpretable results and informs a researcher where one individual substantially differs from another (e.g., gender, race), suffers from the challenge of dimensionality. For example, people who have different likelihoods of attending college might differ from one another in race (African American versus White), gender (female versus male), and living region (urban versus rural), to name a few. In this case, the differential odds of attending college are determined by a set of covariates rather than a single one. This is a fairly common situation in sociological studies where many confounding covariates come into play. In such a research scenario, propensity score methods become desirable, since the propensity score serves precisely to reduce the dimensionality of the covariates.
A note on sample illustrations
So far, we have reviewed and commented on three major breakthroughs in propensity score methods in recent years. However, the space restriction of this article does not allow us to illustrate the detailed steps of each reviewed method with a separate empirical analysis. As an alternative, in this section we point to some worked examples in the prior literature for practical researchers’ reference. Specifically, Hirano and Imbens (2004) presented an application of generalized propensity score adjustment to the well-examined Imbens–Rubin–Sacerdote lottery data (Imbens et al., 2001), with a focus on the effect of the prize amount on subsequent labor earnings. An application of generalized propensity score weighting to estimate the average treatment effect is provided by Hu and Vargas (2015), who were interested in the differential returns to college majors, locations, and prestige. As for causal mediation analysis, Imai et al. (2011) showed how the extent of anxiety may mediate between media messages about immigrants and attitudes toward immigration. Moreover, interested readers may find in this study a comparison between causal mediation analysis and the traditional product-of-coefficients approach (MacKinnon et al., 2007). Lastly, an illustration of Brand and Xie’s method for probing treatment effect heterogeneity is their research on the differential returns to higher education (Brand and Xie, 2010). Some other applications can be found in empirical studies such as Schafer et al. (2013) and Hu and Hibel (2014).
Concluding remarks
Over the past decades, propensity score methods have developed markedly. This article reviews how to use propensity score methods in the research scenarios of multi-categorical treatment, causal mediation analysis, and causal effect heterogeneity. All of these methods provide powerful analytical tools to social scientists in different disciplines. At the same time, the practical implementation of these methods involves many potential issues, and misleading conclusions might be drawn if these issues are not well addressed. To summarize the discussions above, the most important task is the estimation of propensity score values. For quantitative research based on survey data, however, this is somewhat challenging.
First of all, it is difficult to obtain measures of all of the relevant confounding covariates when predicting propensity scores. One widely cited example in this regard is the frequent absence of measures of individuals’ ability in ordinary social surveys (e.g., Hout, 2012). Even if some proxy measures of ability, such as IQ score, are accessible, it is still debatable whether or not such measures are sufficient or valid. Besides, some other critical variables might simply be immeasurable in general social surveys. In light of the restrictions on the availability of confounding covariates, sensitivity analysis, we argue, should always be performed to assess whether the findings would survive the omission of important covariates. Otherwise, research conclusions might not be robust to potential omitted variable problems.
A second limitation to the practical application of propensity score methods concerns the nature of the treatment variable. In order to apply propensity score methods, critical covariates which predict the transition between treatment levels should be specified, such as parental education when predicting children’s odds of attending college. It can be difficult to pin down the covariates if the transition between treatment levels is relatively vague. That is why most empirical research based on propensity score methods is focused on objective treatments, such as the transition from high school to college or from a nonparticipant to a participant in a specific project, while few studies have used propensity scores to examine treatments involving subjective attitudes or feelings. Compared with objective treatments, changes in subjective attitudes or feelings are subtle, and critical variables which can predict such changes well are difficult to pin down. For instance, it is difficult to identify the confounders predicting the transition from ‘relatively happy’ to ‘very happy’.
Third, the parametric approach to predicting propensity scores, as we argued above, is limited. It is always necessary, in this regard, to consider whether or not second-order terms or interaction terms should be taken into account. These terms can influence not only the goodness of model fit, but also the conclusions based on propensity score matching, because the second-order terms or the interaction terms have to be balanced between the treatment and the control groups as well in the matching process. Besides, it is also necessary to consider carefully whether certain covariates should be included, as even adding a nonsignificant covariate may change the final result. In this light, nonparametric techniques can be promising as an alternative to the parametric estimation of the propensity score (e.g., Hill, 2012).
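The point about interaction terms can be made concrete with a small simulated sketch in Python. Everything here is an assumption for illustration: the two binary covariates, the coefficients of the data-generating logit, the Newton-Raphson logistic fit, and the use of inverse probability weighting (rather than matching) to check balance through a standardized mean difference (SMD).

```python
import numpy as np


def fit_logit(X, t, n_iter=25):
    """Fitted probabilities from a logistic regression (Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve((X.T * W) @ X, X.T @ (t - p))
    return 1.0 / (1.0 + np.exp(-X @ beta))


def weighted_smd(z, t, pscore):
    """Standardized mean difference of z after inverse probability weighting."""
    w = np.where(t == 1, 1.0 / pscore, 1.0 / (1.0 - pscore))
    mean_t = np.average(z[t == 1], weights=w[t == 1])
    mean_c = np.average(z[t == 0], weights=w[t == 0])
    # Scale by the pooled unweighted standard deviation of z.
    pooled_sd = np.sqrt((z[t == 1].var() + z[t == 0].var()) / 2.0)
    return (mean_t - mean_c) / pooled_sd


# Simulated data (assumption): treatment depends on an x1 * x2 interaction.
rng = np.random.default_rng(0)
n = 8000
x1 = rng.integers(0, 2, n).astype(float)   # hypothetical binary covariate
x2 = rng.integers(0, 2, n).astype(float)   # hypothetical binary covariate
true_logit = -0.5 + 0.3 * x1 + 0.3 * x2 + 1.2 * x1 * x2
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

ones = np.ones(n)
X_main = np.column_stack([ones, x1, x2])            # main effects only
X_full = np.column_stack([ones, x1, x2, x1 * x2])   # adds the interaction

ps_main = fit_logit(X_main, t)
ps_full = fit_logit(X_full, t)

# Balance on the interaction term under each propensity score model.
smd_main = weighted_smd(x1 * x2, t, ps_main)
smd_full = weighted_smd(x1 * x2, t, ps_full)
print(f"SMD of x1*x2, main-effects model: {smd_main:.3f}")
print(f"SMD of x1*x2, model with interaction: {smd_full:.3f}")
```

In this setup the model with the interaction is saturated, so the weighted groups are balanced on the interaction term by construction, whereas the main-effects model leaves residual imbalance on it. With continuous covariates or matching estimators the contrast would be less clean, but the same check applies.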
In summary, by reviewing and commenting on the major methodological developments of propensity score methods in recent decades, this article shows the broad application prospects of propensity score methods in social scientific research, as well as the potential issues which deserve serious attention in the practical implementation process. We hope that both will be informative and helpful for empirical researchers.
Footnotes
Acknowledgements
We are grateful to the Editor of Current Sociology and anonymous reviewers for their insightful comments and suggestions.
Funding
The first author gratefully acknowledges the general support from the research fund of the School of Social Development and Public Policy at Fudan University, the Innovation Program of Shanghai Municipal Education Commission [15ZS001] and the Major Project of the National Natural Science Foundation of China [71490733].
