Abstract
Propensity scores (PS) have been studied for many years, mostly in the aspect of confounder matching in the control and treatment groups. This work is devoted to the problem of estimation of the causal impact of the treatment versus control data in observational studies, and it is based on the simulation of thousands of scenarios and the measurement of the causal outcome. The generated treatment effect was added in simulation to the outcome, then it was retrieved using the PS and regression estimations, and the results were compared with the original known in the simulation treatment values. It is shown that only rarely the propensity score can successfully solve the causality problem, and the regressions often outperform the PS estimations. The results support the old philosophical critique of the counterfactual theory of causation from a statistical point of view.
Introduction
Problems of causal statistical inference have been considered in multiple works over many years. Modern approaches include Rubin’s potential-outcome and propensity score (PS) model, also known as Neyman-Rubin’s model (Rubin, 2006; Sekhon, 2009; Imbens & Rubin, 2015); Pearl’s structural causal model with its special diagrams and operators (Pearl, 2009; Pearl, 2015); merging structural models with machine learning Peters et al. 2017); Bayesian causal inference and networks (Dagum, Luby, 1993; Zigler, Dominici, 2014; Baldi, Shahbaba, 2020); and various other techniques (Anderson, Vastag, 2004; Dawid et al. 2016; Cardenas et al. 2017; Lipovetsky, Mandel, 2015; Lipovetsky, Mandel, 2015 a; Lipovetsky, 2016). Extensive reviews on PS theory with applications in medicine and biology, education and social studies, and many other areas can be found in tons of works (Rosenbaum, 2002; Stürmer et al. 2006; Imai et al. 2008; Guo, Fraser, 2009; Austin, 2011; Lane et al. 2012; Beal, Kupzyk, 2014; Williamson, Forbes, 2014; Leite, 2016; Bai, Clark, 2018). There are also studies with simulated experiments on various PS techniques properties (King, Nielsen, 2019; Ling et al. 2020; Zagar et al. 2017; Zagar et al. 2022; Zhang, 2021; Lipkovich et al. 2022). “Potential scores” search words in Google on 1/8/24 yielded 0.8m references, while “potential outcomes ” (PO) – 6.6m (!). Such huge popularity demonstrates that the whole approach is firmly accommodated into statistical practice, despite some rare warnings that it doesn’t work well in many situations (King, Nielsen, 2019).
The current paper is based on the simulation of thousands of scenarios in specially designed data, with the PS and regression modeling of the causal outcome. The main purpose of the experiment was to answer the question: do propensity scores estimate the real effect of treatment better than other techniques in different situations? The treatment effect was generated and used in the additive model via simulation of the outcome variable, then the latent treatment values were retrieved by the PS and regression estimations, and the results were compared with the originally generated, so known to us, treatment values. It is worthwhile to mention here, that in reality the actual mechanism that creates both observable and potential outcomes is one thing, but the data and mechanism used for estimation are something different. In simulation, however, we assume that they are identical at least in a very simple aspect: the same variables used in outcomes generation have been used for their estimation. Otherwise, it would create uncontrolled bias, which by itself is an interesting topic (which is beyond the scope of this work): if, say, doctors used one set of variables for the treatment, but only part of it is observed – how it affects the results of the PO or any other approach?
It is shown that the common opinion that the PS approach can successfully solve the causality problem is supported only in a small fraction of cases, and it was investigated in which situations the PS approach works or not. The counterfactual approach to causality may have originated from some of Hume’s remarks and developed from a philosophical standpoint most soundly a long time ago (Lewis, 1973), but its drawbacks and troubling aspects have been criticized by philosophers and methodologists (Menzies, 2019; Ingthorson, 2021), and their opinion is supported by our results. The paper shows that the counterfactual theory fails not only theoretically, but empirically as well. Its findings could be used by both philosophers of science and statisticians as an invitation to use other methods of causal estimations which are based on solely observable, not on imaginary data. The main (not all) results of the study were described in the preprint (Mandel, Lipovetsky, 2022).
The work is structured as follows: Section 2 describes the methodology and design of the simulation, Section 3 presents the numerical results, and Section 4 discusses the general issues and summarizes the findings. Appendix considers some formal features of the modeling techniques.
Simulation methodology and design
The final goal of the potential outcome theory is to organize observational data in such a way, that it will be as close as possible to data from the randomized experiment – a “golden standard” of science. If successful, it allows estimating the actual strength of the treatment, thus answering the causality question. The motivating example, explaining the need for that is the following.
Let’s say, one has data about the effect of the medicine given by doctors to their patients. Some patients got the medicine, and some – did not. The question is – how does medicine (treatment) really work? The naïve answer is – to take average results for those treated and non-treated, and subtract them from each other. This method, in fact, is not too naïve – it will work under two, but very strong conditions: treatments were given completely random and patients are more or less the same by all important factors, including health. This is the basis for planning of the experiment. In an observational study, which is the subject of this paper, both requirements are violated: doctors (intuitively or not) will give medicine to patients with worse conditions, which are, in turn, determined by such factors as age, gender, and so on. Those variables are called confounding and, technically, correlated with both – the outcome (the older people would die more often, regardless of medicine) and the treatment – doctors will give it definitely more often to older people. It creates a problem – how to eliminate the effect of confounding variables, to make the comparison of “treated – non-treated” closer to one in a pure randomized experiment? PO approach claims to solve this problem. Technically, it could be briefly described as follows (notations are from the recent book (Cunningham, 2021), although many other sources give about the same things).
Let us assume a binary variable
Observable or “actual” outcomes,
The unit’s observable outcome is a function of its potential outcomes determined according to the switching equation:
where
It is clear that this equation cannot be directly solved, because only one out of two outcomes is observable at each object. But if one calculates somehow the population means, their difference could provide the estimand of a general causal effect. Usually, two statistics are used for that purpose: the average treatment effect, ATE
and average treatment effect for the treatment group, ATT
We used both, but primarily it was ATT (assumed in the text below unless otherwise specified).
How to estimate those quantities, where one outcome is always unknown? The general logic of PO (for ATT) is of four steps:
Divide objects into two groups, treated (with an observable after-treatment outcome) and non-treated; Calculate somehow the closeness of each object in treated groups with all objects in the non-treated one by confounding variables Select the closest objects (one or several) from the untreated group to each treated object; Equalize the outcomes of those non-treated objects to
If, say, data has just one confounding variable – the problem is very straightforward, direct sorting by this variable would solve it. If, however, there are many of them – more complicated methods should be applied: one can calculate the absolute, Euclidean, or Mahalanobis distance in multidimensional space (King, Nielsen, 2019); stratify objects by some criteria; use data fusion techniques (Lipovetsky, 2013) and so on. But the most popular method by far is to calculate so-called propensity scores (PS): the probability of a unit being assigned to a particular treatment given a set of observed covariates, proposed in (Rosenbaum, Rubin, 1983). The procedure of obtaining the causal effect of treatment then has two parts: estimation of the PS and matching the units based on this (now one-dimensional) quantity; it looks as follows.
A binary treatment variable For each unit in the treated group, one finds the “best match” in the untreated group so that the scores in both groups are close. This procedure is called matching. The most popular ways to do matching are the nearest neighbor (NN) and the caliper. In the NN approach, each unit from the treated group finds the closest unit in the untreated group; in the caliper approach, not all differences between PS in groups are considered small but only those within a certain interval (caliper). Typically, a value for the caliper is 0.25 of the standard deviation of the PS. After matching, the average level of the actual outcomes in the treated group is compared with the average level of outcomes of the matched units. The difference is interpreted as the average causal effect of the treatment for treated grout, ATT.
To test the effectiveness of the PS approach we had to make a plausible mechanism of the data generation, apply the PS and other techniques to the generated data, and compare the results to the known treatment. In our simulation, the five variables are considered: two confounded numerical variables
The data sets for simulations are generated as follows. Two numerical predictor variables (confounders)
The basic level of the outcome before treatment for all units is defined as
where the intercept is fixed, e.g.,
The assignment variable
For
where the basic level
with additional random noise
Variables used for simulation
Resuming, the variable
Estimation for the theoretical model, besides PS approach, was performed by the observed variables using the ordinary least squares (OLS) linear regression of
We also experimented with intercept terms (like xd), but it didn’t improve the model, just made it more cumbersome. For that reason, we do not show it here. In fact, the regression could be specified (or misspecified) in multiple ways; we deliberately use the simplest possible version to avoid any problems of this kind.
When a model (4) is built and the fitted values are found, the causal effect in the PO style can be evaluated as the difference of the averaged estimated values for
The second method of estimation employs the random coefficients regression (RCR), first considered in (Hildreth, Houck, 1968). In the yield analysis version of it (Demidenko, Mandel, 2005) was shown how to estimate the random coefficients for each unit analytically. In (Lipovetsky, 2007) it was demonstrated that the proposed in (Demidenko, Mandel, 2005) solution makes a complete decomposition of the predicted
Values of all variables used for simulation are presented in Table 1. Coefficients of influence of
As one can see, the simulation mechanism covers a very wide area of possible scenarios, from almost deterministic to almost completely stochastic data. The final purpose of the simulation was to estimate the causal effect of treatment, as well as to understand how the quality of the evaluation is expressed by its different characteristics (i.e. which aspect of data affects the quality the most or the least). Propensity scores matching techniques included the nearest neighbor (NN) and calipers with several levels measured in fractions of the standard deviation of PS (row 8). Those values for the caliper were selected based on typical recommendations in the literature, which gravitate to 0.25. Two regression techniques, ordinary least squares (OLS) and random coefficients regression (RCR), were employed. The sample size of the data for each run consists of 1000 cases, with the number of random simulations for each parameter’s combination varied from 10 to 50.
The way of the data simulation may seem too complicated, but it reflects our intention to simulate a critical aspect of data – namely, that individual reactions to medicine (or any other treatment) with regard to influential
We do not emulate many possible complex causal mechanisms (different networking relations between objects, or hierarchical structures), i.e., we follow the ignorability assumption (Rosenbaum, Rubin, 1983), which postulates that objects are independent of each other. We assume just a simple linear dependence with non-random and random coefficients. Of course, this setting requires more parameters, than the OLS regression used in earlier simulation studies (Zagar et al., 2022).
Several characteristics describing the results of the simulation are as follows. The average outcome in the two groups
Statistics of overlapping percent by each confounder (or PS themselves) in histograms before and after matching is recommended (Austin, 2011) to check that there are similar
The difference between mean values of the estimated and actual treatment PS can be defined by the
where
where
Besides statistical criteria, we use a more feasible measure of the relative error (RE), in a percent:
Taking into account the design complexity, we ran multiple simulations to catch different aspects of the problem; the results of the numerical experiments are as follows.
The general results
Overall performance by different estimation techniques
Overall performance by different estimation techniques
Table 2 presents the main results obtained in simulations of 30,000 scenarios when all parameters were varied randomly in their intervals (listed in Table 1).
By
Statistics in the last two rows of Table 2 show how often the RCR outperforms the other models. By
Different methods of PS techniques give practically identical results, therefore, the discussions about preferences of different methods within PS paradigm are not very important. Particularly, the caliper level 0.25 usually is not better than others, and the NN approach typically just slightly outperforms other variants.
As mentioned before, two sources of randomness determine the outcome in simulation: variation of the coefficients (1) affecting
When std(
We also observed the following effects with the
Different indicators for three levels of randomness in the data
Different indicators for three levels of randomness in the data
Table 3 summarizes the findings for combinations of two sources of randomness: the variability of random coefficients and the level of the residual error. The first numerical column shows the indicators when both levels are very low, the second – when both are intermediate, and the third – on a maximum level. Comparison with PS was done by the NN approach because the calipers do not improve the results.
The coefficient
The RCR demonstrates relative stability in regard to OLS by the error rate (yet performing better with low randomness), but decisively outperforms PS in all categories, especially in one with the high randomness (77%). Resuming, in high randomness, PS performs worse than the OLS and RCR by both
Correlations of
The discordance indicator
The obtained results can be considered within a general framework of the causal theory where they get a new meaning. Here we list some theoretical and practical aspects of that theory, which could help to explain our results.
First, the whole PO approach is the strongest manifestation of the so-called counterfactual logic of causation when one assumes that non-existing but “potential” values may be used in explaining what has really happened. The originators of this approach never had hidden the counterfactual paradigm at its roots. For example, the founding paper (Rosenbaum, Rubin, 1983) starts as follows: “Inferences about the effects of treatments involve speculations about the effect one treatment would have had on a unit which, in fact, received some other treatment”. This is classical counterfactual speculation, and the authors noted (ibid., p. 41) that “Since each unit receives only one treatment, either
It was rooted in some of Hume’s remarks and philosophically founded by D. Lewis (1973). Later philosophical developments were summarized by P. Menzies (2019) and critically discussed by many philosophers of science (J. Woodward, J. Schaffer, N. Cartwright, among others). All that just shows, how unconvincing the whole theory is. Indeed, it was demonstrated many times that the presumption that something that might happen in one of the “possible worlds” (Lewis’ terminology) is “the cause” of something that has uniquely happened in reality is a stretch that contradicts the whole knowledge and scientific methodology of the last four hundred years. As was stressed in (Mandel, 2015, 2017 and in the references therein), it is a version of an “alternative history” that had been categorically rejected by historians, but, surprisingly, resurfaced in modern statistics. The paper by T. VanderWeele (2016), for example, which was a response to the earlier critique, was primarily devoted to the defense of the PO approach in a statistical (as opposed to pure philosophical) setting, but it didn’t obtain its goal, in our view; the fundamental objections to the counterfactual logic to be used for science, not for fiction, remain unchanged. It is impossible to go here into all arguments against and for counterfactuals, let’s just state, that any assumptions about something nonexistent cannot replace the analysis of what has really happened. No results in science have been achieved using counterfactual logic – only based on understanding the real causal mechanisms of the events.
Let’s consider just one inherent contradiction of the approach. Someone wants to learn the causal effect of gender on some medicine, i.e., “gender” is a treatment. Within the PO paradigm, she has to assume, that, say, “male’s potential outcomes” would be derived from the female’s real outcomes (we apologize to the followers of the modern concepts that gender is “fluid”, just because it is irrelevant in this case). But this very assumption (the switch from male to female) requires so many other assumptions to be held (that gender does not affect height, weight, strength, etc.), that the obtained results become unreliable and rather absurd from the very beginning. The speculations that the “male-female” example is kind of “extreme”, that nobody would use gender as a treatment, because genders are not “exchangeable”, unlike, say, patients in clinical trials, do not hold any scrutiny. An attempt to draw the line distinguishing “good and bad”, “workable and not” counterfactual variables would create even more troubles that cannot be discussed here. A comprehensive critique of the counterfactual theories from a philosophical standpoint can be found, for example, in the recent book (Ingthorson, 2021), and we just bring one point from there: before counterfactuals may somehow explain the “cause”, this cause should be already known, not the vice versa.
Second, more specific trouble with the PS approach: its whole idea is to balance two parts of the sample, treated and controlled, in such a way that they contain more or less the same values of the confounding variables
It creates quite a paradoxical situation that PS either is not needed or not working. Logistic regression, used in PS, by definition, tries to maximize the good of fitness, i.e., obtains the best prediction possible. But the better fit – the more distinctive, in general (besides very special cases),
The third problem, related to the second, is that propensity scores in each group are usually very small. If there are high probabilities of belonging to one group for one set of objects and low – for another, then matching looks easy and logical (it may be with very strong confounders – but then the troubles of point two above will take place). But if those probabilities are about 0.1–0.2 for each group, the matching for whatever criteria would be quite a chaotic exercise – what explains, in many cases, why PS so often yields to OLS or RCR.
Fourth, the whole idea of matching is based on the assumption that if the objects in the treatment group are similar by the
The most important conclusion from the experiment is that in a wide variety of situations, PS results could be easily outperformed by regular regressions or by models with random coefficients. Actually, the PS results are always worse than the ones of regressions by the level of errors and often by
To clarify it, let us consider a simple example. There are two clinical trials, 1,000 people in each with equal distribution of old and young people where in the first the treatment (say, vaccination) is performed randomly among old and young people with a probability 0.5 (the assumption of equality of distributions it is not actually required; it was made just for the sake of calculation simplicity;). In the second study, having the same budget, doctors intuitively assign the same treatment to old patients two times more often than to young ones, thus, making a strong age-inflicting bias, retaining though the same treatment rate of 0.5: young people will be treated with a probability of 0.33 (totaling 500
The regression modeling by itself is not the way to discover the causal relationship, which is a well-known fact (Lipovetsky, Mandel, 2015, 2015a; Mandel, 2015, 2017), though it may help if something, which is not just potential but rather real, is taken into account. If instead of the “real”
Footnotes
Acknowledgments
I thank Drs. S. Brodsky and M. Zakharevich for discussions, and especially Dr. S. Lipovetsky for help at different stages of this study.
Conflict of interest
The author declares no conflict of interest.
Funding
This research received no external funding.
Appendix. Random coefficients regression
Let us briefly describe the approach of the random coefficients regression and yield analysis considered in (Hildreth, Houck, 1968; Demidenko, Mandel, 2005; Lipovetsky, 2007). The random-coefficient Hildreth and Houck model can be formulated as a multiple regression:
where
where
with the total error defined as
The model (A3) looks like an OLS regression, but its error term depends on the variables themselves. The errors in each coefficient (A2) are assumed to be independent and identically distributed, with zero means, and with the standard deviation
It is possible to obtain the OLS residuals
Such an approach corresponds to the so-called iteratively re-weighted least squares (IRLS) technique. The random coefficients (A2) for the model (A1) can be found via minimization of the penalized least squares objective
with the weights
This formula gives the explicit expression convenient in the practical estimation of the random coefficients
Therefore, the predicted values
