Limitations of the propensity scores approach: A simulation study

Abstract

Propensity scores (PS) have been studied for many years, mostly for matching confounders between the control and treatment groups. This work addresses the estimation of the causal impact of treatment versus control in observational studies, and it is based on the simulation of thousands of scenarios and the measurement of the causal outcome. The generated treatment effect was added to the outcome in simulation, then retrieved using the PS and regression estimations, and the results were compared with the original treatment values known from the simulation. It is shown that the propensity score only rarely solves the causality problem successfully, and that regressions often outperform the PS estimations. The results support, from a statistical point of view, the old philosophical critique of the counterfactual theory of causation.

Keywords

Causality; potential outcome; propensity score; regression models; random coefficients regression

1. Introduction

Problems of causal statistical inference have been considered in multiple works over many years. Modern approaches include Rubin’s potential-outcome and propensity score (PS) model, also known as Neyman-Rubin’s model (Rubin, 2006; Sekhon, 2009; Imbens & Rubin, 2015); Pearl’s structural causal model with its special diagrams and operators (Pearl, 2009; Pearl, 2015); merging structural models with machine learning (Peters et al. 2017); Bayesian causal inference and networks (Dagum, Luby, 1993; Zigler, Dominici, 2014; Baldi, Shahbaba, 2020); and various other techniques (Anderson, Vastag, 2004; Dawid et al. 2016; Cardenas et al. 2017; Lipovetsky, Mandel, 2015; Lipovetsky, Mandel, 2015a; Lipovetsky, 2016). Extensive reviews of PS theory with applications in medicine and biology, education and social studies, and many other areas can be found in numerous works (Rosenbaum, 2002; Stürmer et al. 2006; Imai et al. 2008; Guo, Fraser, 2009; Austin, 2011; Lane et al. 2012; Beal, Kupzyk, 2014; Williamson, Forbes, 2014; Leite, 2016; Bai, Clark, 2018). There are also simulation studies of the properties of various PS techniques (King, Nielsen, 2019; Ling et al. 2020; Zagar et al. 2017; Zagar et al. 2022; Zhang, 2021; Lipkovich et al. 2022). A Google search for “propensity scores” on 1/8/24 yielded 0.8m references, while “potential outcomes” (PO) yielded 6.6m (!). Such huge popularity demonstrates that the whole approach is firmly established in statistical practice, despite some rare warnings that it doesn’t work well in many situations (King, Nielsen, 2019).

The current paper is based on the simulation of thousands of scenarios in specially designed data, with PS and regression modeling of the causal outcome. The main purpose of the experiment was to answer the question: do propensity scores estimate the real effect of treatment better than other techniques in different situations? The treatment effect was generated and used in the additive model via simulation of the outcome variable, then the latent treatment values were retrieved by the PS and regression estimations, and the results were compared with the originally generated, and therefore known, treatment values. It is worth mentioning that in reality the actual mechanism that creates both observable and potential outcomes is one thing, while the data and mechanism used for estimation are something different. In simulation, however, we assume that they are identical at least in one very simple respect: the same variables used in generating the outcomes are used for their estimation. Otherwise, an uncontrolled bias would arise, which is an interesting topic in itself but beyond the scope of this work: if, say, doctors used one set of variables for assigning the treatment, but only part of it is observed, how does that affect the results of the PO or any other approach?

It is shown that the common opinion that the PS approach can successfully solve the causality problem is supported only in a small fraction of cases, and we investigate in which situations the PS approach works and in which it does not. The counterfactual approach to causality may have originated from some of Hume’s remarks and was developed most soundly from a philosophical standpoint a long time ago (Lewis, 1973), but its drawbacks and troubling aspects have been criticized by philosophers and methodologists (Menzies, 2019; Ingthorson, 2021), and their opinion is supported by our results. The paper shows that the counterfactual theory fails not only theoretically, but empirically as well. Its findings could be used by both philosophers of science and statisticians as an invitation to use other methods of causal estimation that are based solely on observable, not imaginary, data. The main (not all) results of the study were described in the preprint (Mandel, Lipovetsky, 2022).

The work is structured as follows: Section 2 describes the methodology and design of the simulation, Section 3 presents the numerical results, and Section 4 discusses the general issues and summarizes the findings. The Appendix considers some formal features of the modeling techniques.

2. Simulation methodology and design

The final goal of the potential outcome theory is to organize observational data in such a way that it will be as close as possible to data from a randomized experiment, the “gold standard” of science. If successful, it allows estimating the actual strength of the treatment, thus answering the causality question. The following motivating example explains why this is needed.

Let’s say one has data about the effect of a medicine given by doctors to their patients. Some patients got the medicine, and some did not. The question is: how does the medicine (treatment) really work? The naïve answer is to take the average results for the treated and the non-treated and subtract one from the other. This method, in fact, is not too naïve: it will work under two very strong conditions, namely that treatments were given completely at random and that patients are more or less the same in all important factors, including health. This is the basis for the design of experiments. In an observational study, which is the subject of this paper, both requirements are violated: doctors (intuitively or not) will give the medicine to patients in worse condition, which is, in turn, determined by such factors as age, gender, and so on. Those variables are called confounders and, technically, are correlated with both the outcome (older people would die more often, regardless of the medicine) and the treatment (doctors will definitely give it more often to older people). This creates a problem: how to eliminate the effect of confounding variables, to make the “treated vs. non-treated” comparison closer to the one in a pure randomized experiment? The PO approach claims to solve this problem. Technically, it can be briefly described as follows (the notation is from the recent book (Cunningham, 2021), although many other sources give about the same things).

Let us assume a binary variable $d$ that takes a value of 1 if a particular unit $i$ receives the treatment and 0 if it does not. Potential outcomes are defined as ${Y}_{i}^{1}$ if unit i received the treatment and as ${Y}_{i}^{0}$ if the unit did not. Each unit i has exactly two potential outcomes: under a state of the world where the treatment occurred and where the treatment did not occur.

Observable or “actual” outcomes, ${Y}_{{i}}$ are distinct from potential outcomes…Whereas potential outcomes are hypothetical random variables that differ across the population, observable outcomes are factual random variables. It is important to note that the generative mechanism of the PO is a certain (unknown) continuous function, defined in a very wide space of all affecting variables, or, as was remarked in (Richardson, Robins, 2023), “In the standard presentation of the potential outcome approach, random variables corresponding to the outcomes for an individual under all possible interventions are assumed to exist, living on a common probability space”. The procedure of estimation of the “true mechanism” of PO generation, respectively, should also be continuous, which is reflected, in particular, in the use of the logistic function in the most popular estimation procedures.

The unit’s observable outcome is a function of its potential outcomes determined according to the switching equation:

$\displaystyle{Y}_{i}={d}_{i}{Y}_{i}^{1}+\left({1-{d}_{i}}\right){Y}_{i}^{0}$

where ${d}_{i}$ equals 1 if the unit received the treatment and 0 if it did not. The unit-specific treatment effect, or causal effect, is the difference between the two states of the world:

$\displaystyle\delta_{i}={Y}_{i}^{1}-{Y}_{i}^{0}$

It is clear that this equation cannot be directly solved, because only one out of two outcomes is observable at each object. But if one calculates somehow the population means, their difference could provide the estimand of a general causal effect. Usually, two statistics are used for that purpose: the average treatment effect, ATE

$\displaystyle\textit{ATE}={E}[{Y}_{i}^{1}]-{E}[{Y}_{i}^{0}]$

and average treatment effect for the treatment group, ATT

$\displaystyle\textit{ATT}={E}[{Y}_{i}^{1}|{d}_{i}=1]-{E}[{Y}_{i}^{0}|{d}_{i}=1].$

We used both, but primarily it was ATT (assumed in the text below unless otherwise specified).
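
The definitions above can be illustrated with a minimal sketch in which both potential outcomes are known for every unit, which is possible only in simulation. The function names `observed_outcomes` and `ate_att` and the toy numbers are hypothetical and serve only to show the switching equation, ATE, and ATT side by side.

```python
def observed_outcomes(y1, y0, d):
    """Switching equation: Y_i = d_i * Y_i^1 + (1 - d_i) * Y_i^0."""
    return [di * a + (1 - di) * b for a, b, di in zip(y1, y0, d)]


def ate_att(y1, y0, d):
    """Return (ATE, ATT) when BOTH potential outcomes are known (simulation only)."""
    effects = [a - b for a, b in zip(y1, y0)]           # unit-level delta_i
    ate = sum(effects) / len(effects)                   # average over all units
    treated = [e for e, di in zip(effects, d) if di == 1]
    att = sum(treated) / len(treated)                   # average over treated units
    return ate, att


# Toy data (hypothetical): four units, two of them treated.
y0 = [10.0, 12.0, 9.0, 11.0]
y1 = [13.0, 13.0, 12.0, 15.0]
d = [1, 0, 1, 0]

y = observed_outcomes(y1, y0, d)
ate, att = ate_att(y1, y0, d)
```

In observational data only `y` and `d` are available, so neither quantity can be computed this way; the sketch only fixes what the estimators discussed below are trying to recover.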

How can these quantities be estimated when one outcome is always unknown? The general logic of PO (for ATT) consists of four steps:

1. Divide objects into two groups, treated (with an observable after-treatment outcome) and non-treated;
2. Calculate somehow the closeness of each object in the treated group to all objects in the non-treated one by confounding variables $x$ ;
3. Select the closest objects (one or several) from the untreated group to each treated object;
4. Set ${Y}_{i}^{0}$ equal to the outcomes of those non-treated objects and calculate ATT based on it.

If, say, data has just one confounding variable – the problem is very straightforward, direct sorting by this variable would solve it. If, however, there are many of them – more complicated methods should be applied: one can calculate the absolute, Euclidean, or Mahalanobis distance in multidimensional space (King, Nielsen, 2019); stratify objects by some criteria; use data fusion techniques (Lipovetsky, 2013) and so on. But the most popular method by far is to calculate so-called propensity scores (PS): the probability of a unit being assigned to a particular treatment given a set of observed covariates, proposed in (Rosenbaum, Rubin, 1983). The procedure of obtaining the causal effect of treatment then has two parts: estimation of the PS and matching the units based on this (now one-dimensional) quantity; it looks as follows.

a. A binary treatment variable $d$ is considered as the dependent variable and all confounders $x$ as independent variables. A logit regression of $d$ on $x$ is built to estimate the probability (that is, the propensity score) of each observation belonging to the treated ( $d=$ 1) or untreated ( $d=$ 0) group.
b. For each unit in the treated group, one finds the “best match” in the untreated group so that the scores in both groups are close. This procedure is called matching. The most popular ways to do matching are the nearest neighbor (NN) and the caliper. In the NN approach, each unit from the treated group is paired with the closest unit in the untreated group; in the caliper approach, a match is accepted only if the difference between the PS in the two groups lies within a certain interval (the caliper). A typical value for the caliper is 0.25 of the standard deviation of the PS.
c. After matching, the average level of the actual outcomes in the treated group is compared with the average level of outcomes of the matched units. The difference is interpreted as the average causal effect of the treatment for the treated group, ATT.
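
Steps a to c can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the study: the logit is fitted by simple gradient ascent on standardized covariates, matching is done with replacement, and the function names (`propensity_scores`, `att_matching`) as well as the toy data with a constant effect of 50 are hypothetical.

```python
import math
import random


def propensity_scores(X, d, steps=500, lr=0.5):
    """Step a: logit of d on X by gradient ascent; returns fitted P(d=1 | x)."""
    n, k = len(X), len(X[0])
    mu = [sum(r[j] for r in X) / n for j in range(k)]
    sd = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in X) / n) or 1.0 for j in range(k)]
    # Standardize columns and prepend an intercept so a fixed step size is stable.
    Z = [[1.0] + [(r[j] - mu[j]) / sd[j] for j in range(k)] for r in X]
    w = [0.0] * (k + 1)

    def prob(z):
        return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, z))))

    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for z, di in zip(Z, d):
            err = di - prob(z)
            for j in range(k + 1):
                grad[j] += err * z[j]
        w = [a + lr * g / n for a, g in zip(w, grad)]
    return [prob(z) for z in Z]


def att_matching(ps, y, d, caliper=None):
    """Steps b-c: NN matching on PS (with replacement, an assumption); returns ATT."""
    treated = [i for i, v in enumerate(d) if v == 1]
    control = [i for i, v in enumerate(d) if v == 0]
    width = math.inf
    if caliper is not None:  # caliper given as a fraction of the std of PS
        m = sum(ps) / len(ps)
        width = caliper * math.sqrt(sum((p - m) ** 2 for p in ps) / len(ps))
    diffs = []
    for i in treated:
        j = min(control, key=lambda c: abs(ps[c] - ps[i]))  # nearest neighbor
        if abs(ps[j] - ps[i]) <= width:
            diffs.append(y[i] - y[j])
    return sum(diffs) / len(diffs) if diffs else float("nan")


# Toy data (hypothetical): constant true effect of 50, assignment confounded by x1.
rng = random.Random(0)
n, true_effect = 400, 50.0
x1 = [rng.gauss(20, 5) for _ in range(n)]
x2 = [rng.gauss(20, 5) for _ in range(n)]
m1 = sum(x1) / n
d = [1 if rng.random() < (0.8 if v >= m1 else 0.3) else 0 for v in x1]
y = [300 + 1.2 * a + 1.2 * b + true_effect * t + rng.gauss(0, 5)
     for a, b, t in zip(x1, x2, d)]

ps = propensity_scores(list(zip(x1, x2)), d)
att_nn = att_matching(ps, y, d)                 # nearest neighbor
att_cal = att_matching(ps, y, d, caliper=0.25)  # caliper of 0.25 std of PS
```

Matching with replacement keeps the sketch short; matching without replacement or with several neighbors would change only the inner loop of `att_matching`.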

To test the effectiveness of the PS approach we had to make a plausible mechanism of the data generation, apply the PS and other techniques to the generated data, and compare the results to the known treatment. In our simulation, the five variables are considered: two confounded numerical variables $x_{1}$ and $x_{2}$ , the treatment assignment binary variable $d$ , the numerical effect of treatment $T$ (i.e., how the outcome changed in the unit after receiving treatment), and the numerical outcome $y$ . Four of them, except $T$ , can be observed in practice, and $T$ is used for checking the results of the simulation. The goal is to estimate the real, independent of assignment, causal effect of $T$ on $y$ in observational settings.

The data sets for simulations are generated as follows. Two numerical predictor variables (confounders) $x_{1}$ and $x_{2}$ are taken from the probability density function (pdf) of the random normal distribution $N$ (20, 5) with a mean of 20 and standard deviation of 5. The $x_{1}$ and $x_{2}$ are correlated variables with the correlation coefficient taken at several levels (Table 1, row 7), which was achieved by the algorithm described in (Middleton, 2003).

The basic level of the outcome before treatment for all units is defined as

$\displaystyle U_{i}=a_{0}+a_{1}x_{1i}+a_{2}x_{2i}+e_{1i},$ (1)

where the intercept is fixed, e.g., $a_{0}=$ 300, parameters $a_{1}$ and $a_{2}$ are varying (see Table 1, row 1), $e_{1i}$ is the random error $N$ (0, 30), with $i=$ 1, 2, …, $N$ – the number of observations. The typical outcome $U$ in Eq. (1) is around $a_{0}=$ 300, so the value std $=$ 30 of the noise gives a moderate level of variation for $U$ about 10%. Note that Eq. (1) does not have any causal connotation, it just describes how the basic outcome level was generated.

The assignment variable $d$ is a binary indicator taken with frequency 0.3 if $x_{1i}<$ mean ( $x_{1}$ ) and with the frequency $f_{1}\geqslant$ 0.3 if $x_{1i}\geqslant$ mean ( $x_{1}$ ) (see Table 1, row 2). It creates a confounding effect: for instance, if treatment is rarely assigned to the patients with a small value of $x_{1}$ (frequency of d in this group will be 30%), and more often to those with high values (with the frequency of 80%), it would create a positive correlation between d and $x_{1}$ . This effect, supposedly, should be further eliminated by applying the PS procedures.

For $d=$ 1, the numerical variable of the unobserved treatment outcome is defined as

$\displaystyle T_{i}=T_{0}+b_{1i}x_{1i}+b_{2i}x_{2i}+e_{2i},$ (2)

where the basic level $T_{0}$ for treatment is defined by the uniform distribution (see Table 1, row 3); the random coefficients are taken from the normal pdf $N$ (2, std( $b))$ (Table 1, row 4), and the standard deviation of $e_{2}$ is defined by several variants (Table 1, row 5). The observed outcome variable is defined as the sum of the components (1) and (2),

$\displaystyle y_{i}=U_{i}+T_{i}+e_{3i},$ (3)

with additional random noise $e_{3i}$ from the normal distribution $N$ (0, 300* $f_{2}$ ), where the parameter $f_{2}$ defines a fraction of the basic level of $U$ for the residual errors in $y$ (Table 1, row 6).
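
The generating mechanism of Eqs. (1)–(3) can be sketched as follows. This is a minimal illustration and not the code used for the study; it rests on several assumptions that are ours: the second argument of $N$ is read as a standard deviation, the correlation between $x_{1}$ and $x_{2}$ is induced by a simple linear mixing of independent normals rather than by the algorithm of (Middleton, 2003), and $T_{i}$ is set to zero for untreated units. The function name `simulate` and its defaults are hypothetical.

```python
import math
import random


def simulate(n=1000, f1=0.8, std_b=0.75, f2=0.5, cor=0.5, a0=300.0, seed=0):
    """Generate one data set following Eqs. (1)-(3); returns a list of dicts."""
    rng = random.Random(seed)
    a1, a2 = rng.uniform(1, 1.5), rng.uniform(1, 1.5)  # Table 1, row 1
    t0 = rng.uniform(5, 10)                            # Table 1, row 3
    sd_e2 = rng.uniform(0.05, 0.5) * t0                # Table 1, row 5
    # Correlated N(20, 5) confounders via linear mixing (assumption, see lead-in).
    z = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    x1 = [20 + 5 * p for p, _ in z]
    x2 = [20 + 5 * (cor * p + math.sqrt(1 - cor ** 2) * q) for p, q in z]
    mean_x1 = sum(x1) / n
    rows = []
    for v1, v2 in zip(x1, x2):
        u = a0 + a1 * v1 + a2 * v2 + rng.gauss(0, 30)             # Eq. (1)
        # Assignment: frequency 0.3 below the mean of x1, f1 at or above it.
        d = 1 if rng.random() < (f1 if v1 >= mean_x1 else 0.3) else 0
        t = 0.0                                                   # assumption: T = 0 if untreated
        if d == 1:
            b1, b2 = rng.gauss(2, std_b), rng.gauss(2, std_b)     # random coefficients
            t = t0 + b1 * v1 + b2 * v2 + rng.gauss(0, sd_e2)      # Eq. (2)
        y = u + t + rng.gauss(0, 300 * f2)                        # Eq. (3)
        rows.append({"x1": v1, "x2": v2, "d": d, "T": t, "y": y})
    return rows


data = simulate()
treated = [r for r in data if r["d"] == 1]
share_treated = len(treated) / len(data)
mean_T_treated = sum(r["T"] for r in treated) / len(treated)  # the known T_1
```

Because `T` is stored for every unit, the true average effect in the treated group is available for comparison with any estimate, which is exactly what an observational data set lacks.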

Table 1
Variables used for simulation

	Variables	Notation	Description
1	Coefficients of influence $x_{1}$ and $x_{2}$ on $U$	$a_{1}$ , $a_{2}$	Uniform [1, 1.5]
2	Frequency for $d$ if $x_{1i}\geqslant$ mean ( $x_{1})$	$f_{1}$	0.3; 0.5; 0.8; Uniform [0.3, 0.8]
3	Basic treatment level	$T_{0}$	Uniform [5, 10]
4	The standard deviation of random coefficients for treatment	std( $b$ )	0.0001; 0.75; 2.5
5	The standard deviation of errors $e_{2}$ in treatment as a fraction of $T_{0}$	std	Uniform [0.05, 0.5]
6	Fraction of basic level of $U$ for $y$ residual error $e_{3}$ regulation	$f_{2}$	0.05; 0.5; 0.9; Uniform [0.05, 0.9]
7	Correlation between $x_{1}$ and $x_{2}$	cor	0; 0.5; 0.9
8	Caliper levels, % of the standard deviation of PS	cal	0.05; 0.2; 0.25

To summarize, the variable $d$ is the binary assignment variable, and all the rules of the PS approach apply to it. The variable $T$ describes how the treatment (say, taking a pill) creates a different effect on the outcome $y$ for different patients. It is obvious in any realistic setting that the same treatment can create different outcomes in different individuals, and we want to simulate this effect.

Besides the PS approach, estimation for the theoretical model was performed on the observed variables using the ordinary least squares (OLS) linear regression of $y$ on the predictors and the binary 1–0 treatment-control indicator $d$ :

$\displaystyle y_{i}=a_{0}+a_{1}x_{1i}+a_{2}x_{2i}+a_{3}d_{i}.$ (4)

We also experimented with interaction terms (like $xd$ ), but they didn’t improve the model and just made it more cumbersome. For that reason, we do not show them here. In fact, the regression could be specified (or misspecified) in multiple ways; we deliberately use the simplest possible version to avoid any problems of this kind.

When model (4) is built and the fitted values are found, the causal effect in the PO style can be evaluated as the difference of the averaged estimated values of $y$ in the two groups defined by $d$ (1 and 0). The mean values of the two groups estimated by regression coincide with the averaged observed values in the corresponding groups (see the proof in (Mandel, Lipovetsky, 2022)), so the whole regression (4) could be replaced by a simple comparison of averages in the treated and non-treated groups.
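
A brief sketch of why this coincidence holds, assuming that (4) is fitted by OLS with an intercept; the residual notation $\hat{e}_{i}=y_{i}-\hat{y}_{i}$ and the group sizes $N_{g}$ are introduced here only for illustration. The normal equations for the intercept and for $d$ give

$\displaystyle\sum_{i}\hat{e}_{i}=0,\qquad\sum_{i}d_{i}\hat{e}_{i}=\sum_{i:\,d_{i}=1}\hat{e}_{i}=0,$

so the residuals also sum to zero over the untreated units, and in each group the mean of the fitted values equals the mean of the observed values:

$\displaystyle\frac{1}{N_{g}}\sum_{i:\,d_{i}=g}\hat{y}_{i}=\frac{1}{N_{g}}\sum_{i:\,d_{i}=g}y_{i},\qquad g=0,1,$

where $N_{g}$ is the number of units with $d_{i}=g$ .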

The second method of estimation employs the random coefficients regression (RCR), first considered in (Hildreth, Houck, 1968). For its yield analysis version, (Demidenko, Mandel, 2005) showed how to estimate the random coefficients for each unit analytically. In (Lipovetsky, 2007) it was demonstrated that the solution proposed in (Demidenko, Mandel, 2005) gives a complete decomposition of the predicted $y_{i}$ value into contributions associated with each independent variable (see Appendix). Accordingly, the causal component is the one attributed to the $d$ variable in (4), which equals the estimate of the effect for the treated group.

Values of all variables used for simulation are presented in Table 1. Coefficients of influence of $x_{1}$ and $x_{2}$ to $U$ from Eq. (1) and random coefficients from (2) of $x_{1}$ and $x_{2}$ to $T$ (rows 1 and 4–5 in Table 1, respectively) have been regulated by two independent distributions each. Together with other variables (see Notation in Table 1), they determine a set of parameters for the simultaneous simulation by their different combinations.

As one can see, the simulation mechanism covers a very wide area of possible scenarios, from almost deterministic to almost completely stochastic data. The final purpose of the simulation was to estimate the causal effect of treatment, as well as to understand how the quality of the evaluation is expressed by its different characteristics (i.e. which aspect of data affects the quality the most or the least). Propensity scores matching techniques included the nearest neighbor (NN) and calipers with several levels measured in fractions of the standard deviation of PS (row 8). Those values for the caliper were selected based on typical recommendations in the literature, which gravitate to 0.25. Two regression techniques, ordinary least squares (OLS) and random coefficients regression (RCR), were employed. The sample size of the data for each run consists of 1000 cases, with the number of random simulations for each parameter’s combination varied from 10 to 50.

The way of the data simulation may seem too complicated, but it reflects our intention to simulate a critical aspect of data – namely, that individual reactions to medicine (or any other treatment) with regard to influential $x$ variables will not be the same for all respondents. It seems and is obvious, but, paradoxically, the whole concept of matching in the PO approach is based on the opposite idea, that the same $x$ values generate the same or very similar $y$ values. We discuss it from a methodological standpoint in part 4, but here just mention that we at least take this fact into account in the simulation mechanism by making the outcomes vary respectively. Noticeably, all three components of the observed outcomes, as seen in (3), are subject to random simulation, in accordance with what is going on in reality.

We do not emulate many possible complex causal mechanisms (different networking relations between objects, or hierarchical structures), i.e., we follow the ignorability assumption (Rosenbaum, Rubin, 1983), which postulates that objects are independent of each other. We assume just a simple linear dependence with non-random and random coefficients. Of course, this setting requires more parameters, than the OLS regression used in earlier simulation studies (Zagar et al., 2022).

Several characteristics describe the results of the simulation: the average outcomes in the two groups, $y_{0}$ and $y_{1}$ , as calculated by any technique (either by PS or regression); the average outcome in the treated group $y_{1e}$ ; and the estimated causal effect of treatment $T_{1e}=y_{1e}-y_{1}$ . The actual average treatment effect in the treated group $T_{1}$ is not observed in practice, but we know it in the simulation and use it to check the validity of the estimation via abs ( $T_{1}-T_{1e}$ ).

Checking the percentage of overlap for each confounder (or for the PS themselves) in histograms before and after matching is recommended (Austin, 2011) to verify that there are similar $x$ values that supposedly produce similar $y$ values; otherwise, the matching and balancing make no sense. This hypothesis can also be tested by direct measurement of the actual $y$ distribution in the treated group: if we sort the data in the treated group by the PS and break them into small chunks with 10 or 20 items in each, then within each chunk the variations of the PS and of $y$ are expected to be small, and we can propose two indicators. In the $D_{1}$ discordance indicator, we normalize $y$ to the 0–1 scale and take the ratio of the variances of $y$ and PS within each chunk. Then those ratios are averaged over all chunks (this resembles the correlation ratio in regression analysis). In the $D_{2}$ discordance indicator, we just calculate the standard deviation of the normalized $y$ in each chunk and average them. The higher these indicators, the worse the conditions for successful matching and thus for PS success.
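
The two indicators can be sketched as below. The text leaves several details open, so the following choices are assumptions of this sketch: min-max normalization of $y$, population variances within chunks, and dropping an incomplete last chunk. The function name `discordance` and the two toy inputs are hypothetical.

```python
import math
import statistics


def discordance(ps, y, chunk=20):
    """Return (D1, D2) for the treated units, following the description above."""
    lo, hi = min(y), max(y)
    yn = [(v - lo) / (hi - lo) for v in y]               # normalize y to the 0-1 scale
    order = sorted(range(len(ps)), key=lambda i: ps[i])  # sort units by PS
    ratios, sds = [], []
    for start in range(0, len(order) - chunk + 1, chunk):
        idx = order[start:start + chunk]
        var_y = statistics.pvariance([yn[i] for i in idx])
        var_ps = statistics.pvariance([ps[i] for i in idx])
        if var_ps > 0:                                   # skip chunks with constant PS
            ratios.append(var_y / var_ps)
        sds.append(math.sqrt(var_y))
    d1 = sum(ratios) / len(ratios) if ratios else float("nan")
    d2 = sum(sds) / len(sds)
    return d1, d2


# Two extreme toy cases (hypothetical): y fully determined by PS vs. unrelated to PS.
ps = [i / 100 for i in range(100)]
d1_smooth, d2_smooth = discordance(ps, ps)
d1_noisy, d2_noisy = discordance(ps, [i % 2 for i in range(100)])
```

In the first case $y$ moves together with the PS and $D_{2}$ is small; in the second case $y$ alternates between its extremes inside every chunk and $D_{2}$ is large, which is the situation where matching on PS cannot be expected to produce similar outcomes.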

For the PS, the difference between the mean values of the estimated and actual treatment effects can be assessed by the $t$ -statistic, defined as $T_{1e}-T_{1}$ divided by the standard error for unequal variances:

$\displaystyle t\left(\textit{PS}\right)=\frac{T_{1e}-T_{1}}{\sqrt{s_{1}^{2}/N_{1}+s_{2}^{2}/N_{2}}},$ (5)

where $s_{1}$ and $s_{2}$ are standard deviations of the groups, and $N_{1}$ and $N_{2}$ are their sizes. The paired $t$ -statistic for differences between the estimated and actual treatment effects by regression can be found as follows:

$\displaystyle t\left(\textit{Regr}\right)=\frac{T_{1e}-T_{1}}{\sqrt{s_{1}^{2}/N_{1}}},$ (6)

where $s_{1}^{2}$ is the variance of differences between the actual and estimated effects in the first group.

Besides statistical criteria, we use a more practical measure, the relative error (RE), in percent:

$\displaystyle\textit{RE}=\frac{\textit{abs}\left({T_{1e}-T_{1}}\right)}{T_{1}}\cdot 100.$ (7)
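
The three criteria (5)–(7) translate directly into code. The sketch below assumes sample (not population) variances and treats the two “groups” in (5) as the two samples whose means are $T_{1e}$ and $T_{1}$ , which is our reading of the text; the function names and the toy vectors are hypothetical.

```python
import math
import statistics


def t_unequal_variances(est, act):
    """Eq. (5): difference of means over the standard error for unequal variances."""
    se = math.sqrt(statistics.variance(est) / len(est)
                   + statistics.variance(act) / len(act))
    return (statistics.mean(est) - statistics.mean(act)) / se


def t_paired(est, act):
    """Eq. (6): paired t-statistic based on unit-level differences est_i - act_i."""
    diffs = [e - a for e, a in zip(est, act)]
    return statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / len(diffs))


def relative_error(t1e, t1):
    """Eq. (7): relative error of the estimated average effect, in percent."""
    return abs(t1e - t1) / t1 * 100


# Toy unit-level effects (hypothetical).
est = [2.0, 4.0, 6.0]
act = [1.0, 2.0, 3.0]
t5 = t_unequal_variances(est, act)
t6 = t_paired(est, act)
re = relative_error(statistics.mean(est), statistics.mean(act))
```

A case counts as “good” in the sense used below when the absolute value of the corresponding $t$ -statistic is less than 1.96.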
3. Numerical results


Taking into account the design complexity, we ran multiple simulations to catch different aspects of the problem; the results of the numerical experiments are as follows.

3.1 The general results

Table 2
Overall performance by different estimation techniques

Criteria	OLS model	RCR model	PS, NN	PS, caliper 0.1	PS, caliper 0.2	PS, caliper 0.25
When $t$ -statistics (5)–(6) is less than 1.96 for “effect estimated – effect actual”, %	80.2	44.3	77.5	77.7	77.5	77.5
Average relative error (7), %	10.2	10.1	12.9	12.9	12.9	12.9
When RCR outperforms other models by $t$ -statistics, %	24.3		23.8	23.9	23.9	23.9
When RCR outperforms other models by the relative error, %	53.1		56.7	56.8	56.8	56.7

Table 2 presents the main results obtained in simulations of 30,000 scenarios when all parameters were varied randomly in their intervals (listed in Table 1).

By $t$ -statistics criteria, the OLS regression is a bit better than any PS techniques (80% of “good cases” vs. 77.5%); by the level of errors, both OLS and RCR regression results are noticeably better than any PS estimations (10% vs. 13%).

Statistics in the last two rows of Table 2 show how often the RCR outperforms the other models. By $t$ -statistics, it is better only in a quarter of cases (about 24%); by the level of errors, it is better in more than half of all cases (53% for OLS and up to 57% for PS).

Different PS matching methods give practically identical results; therefore, discussions about the preference for one method over another within the PS paradigm are not very important. In particular, the caliper level 0.25 usually is not better than the others, and the NN approach typically just slightly outperforms other variants.

3.2 Different levels of randomness

As mentioned before, two sources of randomness determine the outcome in simulation: variation of the coefficients, both those in (1) affecting $U$ (row 1 in Table 1) and the random coefficients in (2) for $T$ (row 4 in Table 1), and the level of residual errors affecting different variables (rows 5 and 6 in Table 1). In the experiments, we observed the following behavior of the characteristics under investigation.

When std( $b$ ) $=$ 0.0001 (Table 1, row 4), the random coefficients $b$ in (2) become practically constant and equal to the benchmark value 2 (the mean of their normal distribution). When std( $b$ ) $=$ 2.5, the coefficients vary widely around this level, from about 0 to 6, which is not very typical in real-life situations. The most noticeable result is that the larger std( $b$ ), the more clearly the OLS outperforms PS by $t$ -statistics. With minimum variance, the OLS and PS had about 73% of “good cases” each; with maximum variance, the OLS had 90% vs. 83% for PS. The RCR also increased its performance from 36% to 56%, which is still noticeably lower than OLS and PS. The reason is that the RCR is intended to estimate random coefficients in a straightforward way when they affect $y$ directly, but in the model (2)–(3) they affect the entire outcome only via the treatment. The relative errors grew for every technique as expected, but the advantages of both OLS and RCR became clear with the rising variance: at the lowest variance, the error is 9.8% for OLS and RCR vs. 12.5% for PS, and at the highest variance it is 10.6% for OLS and RCR vs. 13.8% for PS. In general, the more volatile the coefficients regulating the outcome, the better the results of the regressions in comparison with the PS.

We also observed the following effects when the standard deviation of the $y$ residual error varied from 5% to 90% of the basic $U$ level (Table 1, row 6). The coefficient of multiple determination ( $R^{2}$ ) of the models decreased from 32% with the smallest noise to 16% with the largest. While OLS improves its performance measured by $t$ -statistics from 71% to 85% of “good values”, the PS worsens from 87% to 71%. The RCR quality, however, drops from 75% to just 22% by $t$ -statistic, i.e., being better than OLS with small noise, it becomes much worse as the noise grows. At the same time, the error rate approximately doubled for all methods, still remaining 20–30% lower for OLS and RCR than for PS. These tendencies show that the PS produces worse results than the regressions as the quality of the $y$ approximation goes down. The regression improves its performance in estimating the causal effects when the approximation level of $y$ decreases because, as the errors grow, the denominator in the $t$ -statistic (the variances in (5) or (6)) rises and the $t$ -values go down, so in more simulated cases the estimated and real treatment values are judged to be close. The OLS improves its performance, while the RCR reacts differently to the two types of randomness, i.e., the random coefficients and the noise.

Table 3
Different indicators for three levels of randomness in the data

	Indicators	Level of randomness in data
		Low	Average	High
1	OLS coefficient of determination, $R^{2}$	0.67	0.23	0.15
2	The errors odds ratio for group 1, PS/OLS	0.76	1.28	1.25
3	When OLS is better than PS by $t$ -statistic, %	32	52	55
4	When RCR is better than OLS by error rate, %	58	53	50
5	When RCR is better than PS by error rate, %	71	67	77
6	Average correlation $U$ , $y$	0.63	0.32	0.24
7	Average correlation (confoundedness) $x_{1}$ , $y$	0.38	0.19	0.15
8	Average correlation $d$ , $y$	0.77	0.38	0.3
9	Average correlation $T$ , $y$	0.81	0.45	0.35
10	Overlapping $x_{1}$ in 0–1 groups before matching	0.76	0.75	0.75
11	Overlapping $x_{1}$ in 0–1 groups after matching	0.74	0.74	0.73
12	Overlapping PS in 0–1 groups after matching	0.66	0.66	0.66
13	When PS estimates in groups 0–1 significantly differ by $t$ -statistic, %	32.3	11.1	8.1
14	The difference in relative errors of the PS estimates in groups 0 and 1, to actual treatment, %	5.4	13.6	18.6
15	Discordance $D_{2}$	0.113	0.141	0.145

Table 3 summarizes the findings for combinations of two sources of randomness: the variability of random coefficients and the level of the residual error. The first numerical column shows the indicators when both levels are very low, the second when both are intermediate, and the third when both are at the maximum level. Comparison with PS was done by the NN approach because the calipers do not improve the results.

The coefficient $R^{2}$ differs quite significantly across those conditions, from 67% to 15% (Table 3, row 1). When randomness increases, PS becomes less accurate than regression, which is shown by the odds ratio PS/OLS (Table 3, row 2; this trend is even more visible for group 0, with odds ratios changing from 1.3 to 0.8). The same effect is observed for $t$ -statistics (Table 3, row 3): OLS underperforms in the low randomness setting and outperforms PS when randomness increases (32% vs. 55%).

The RCR demonstrates relative stability with regard to OLS by the error rate (yet performing better with low randomness), but decisively outperforms PS in all categories, especially in the one with high randomness (77%). To summarize, with high randomness, PS performs worse than the OLS and RCR by both the $t$ -statistic and error rate criteria; with low randomness, the results are mixed, so we cannot conclude that PS works better than regression (i.e., the group means difference).

Correlations of $y$ with different variables, including the basic outcome level $U$ , confounder $x_{1}$ (similarly with $x_{2}$ ), the treatment assignment variable $d$ , and level of treatment $T$ (Table 3, rows 6–9) decrease with a rise in randomness. These trends, together with a substantial overlapping for each of $x$ or PS variables (Table 3, rows 10–12) correspond to conditions of the successful PS analysis at any level of randomness. Ideally, the causal estimates made by matching from group 1 to group 0 and from group 0 to group 1 should be equal or very similar. However, it is not observed in the experiment: the percent of cases with a significant difference in PS causal effect estimates in groups 0 and 1 by $t$ -statistic (Table 3, row 13) is quite large, especially for the low randomness level. In contrast to it, the difference in relative errors of the PS estimates in groups 0 and 1 (Table 3, row 14) grows from 5% to about 19% with an increased level of randomness, so the measures of accuracy could lead to different conclusions and should be used with caution.

The discordance indicator $D_{1}$, described in Table 1, turned out to be non-informative, but the indicator $D_{2}$ shows the expected direction of the correlations by randomness level (Table 3, row 15) across all 30,000 scenarios. This suggests that $D_{2}$ captures some of the desired features, but it needs further study.

4. Discussion and conclusion

The obtained results can be considered within a general framework of the causal theory where they get a new meaning. Here we list some theoretical and practical aspects of that theory, which could help to explain our results.

First, the whole PO approach is the strongest manifestation of the so-called counterfactual logic of causation, in which one assumes that non-existing but “potential” values may be used to explain what has really happened. The originators of this approach never hid the counterfactual paradigm at its roots. For example, the founding paper (Rosenbaum, Rubin, 1983) starts as follows: “Inferences about the effects of treatments involve speculations about the effect one treatment would have had on a unit which, in fact, received some other treatment”. This is classical counterfactual speculation, and the authors noted (ibid., p. 41) that “Since each unit receives only one treatment, either $r_{1i}$ or $r_{0i}$ is observed, but not both, so comparison of $r_{1i}$ and $r_{0i}$ imply some degree of speculation. In a sense, estimating the causal effect of treatment is a missing data problem, since either $r_{1i}$ or $r_{0i}$ is missing”. The whole quoted paper and thousands of subsequent ones have been devoted to the problem of how to materialize those “speculations” about the missing (i.e., non-existing, counterfactual) data. Later, summarizing his decades-long development, D. Rubin wrote: “…“the Rubin Causal Model”…has two essential parts: the definition of the scientific situation using “potential outcomes” to define causal effect estimands, and the formulation of the real or hypothetical “assignment mechanism”; and a third optional part, the modeling of the science to produce imputation of missing potential outcomes” (Rubin, 2006, p. 460). There are countless statements like this one in the literature on PO and PS. But counterfactual logic has always been a very controversial topic, regardless of its use in PO theory.

It was rooted in some of Hume’s remarks and philosophically founded by D. Lewis (1973). Later philosophical developments were summarized by P. Menzies (2019) and critically discussed by many philosophers of science (J. Woodward, J. Schaffer, N. Cartwright, among others). All that just shows how unconvincing the whole theory is. Indeed, it has been demonstrated many times that the presumption that something that might happen in one of the “possible worlds” (Lewis’ terminology) is “the cause” of something that has uniquely happened in reality is a stretch that contradicts the whole body of knowledge and scientific methodology of the last four hundred years. As was stressed in (Mandel, 2015, 2017, and in the references therein), it is a version of an “alternative history” that has been categorically rejected by historians but, surprisingly, resurfaced in modern statistics. The paper by T. VanderWeele (2016), for example, which was a response to the earlier critique, was primarily devoted to the defense of the PO approach in a statistical (as opposed to purely philosophical) setting, but in our view it did not achieve its goal; the fundamental objections to the use of counterfactual logic for science, not for fiction, remain unchanged. It is impossible to go into all the arguments for and against counterfactuals here; let us just state that any assumptions about something nonexistent cannot replace the analysis of what has really happened. No results in science have been achieved using counterfactual logic; they have been achieved only by understanding the real causal mechanisms of events.

Let us consider just one inherent contradiction of the approach. Someone wants to learn the causal effect of gender on the outcome of some medicine, i.e., “gender” is a treatment. Within the PO paradigm, she has to assume that, say, the “male’s potential outcomes” would be derived from the females’ real outcomes (we apologize to the followers of the modern concepts that gender is “fluid”; this is simply irrelevant in this case). But this very assumption (the switch from male to female) requires so many other assumptions to hold (that gender does not affect height, weight, strength, etc.) that the obtained results become unreliable and rather absurd from the very beginning. The objections that the “male-female” example is somewhat “extreme”, and that nobody would use gender as a treatment because genders are not “exchangeable”, unlike, say, patients in clinical trials, do not withstand scrutiny. An attempt to draw the line distinguishing “good and bad”, “workable and not” counterfactual variables would create even more troubles, which cannot be discussed here. A comprehensive critique of the counterfactual theories from a philosophical standpoint can be found, for example, in the recent book (Ingthorson, 2021), and we bring just one point from there: before counterfactuals may somehow explain the “cause”, this cause should already be known, not vice versa.

Second, there is a more specific trouble with the PS approach: its whole idea is to balance two parts of the sample, treated and control, in such a way that they contain more or less the same values of the confounding variables $x$, which affect both the outcome and the assignment variable. In an experimental setting the assignment is fully randomized; in an observational setting it is not, and PS should make it as close to randomized as possible. Therefore, if the two groups have identical sets of values of the predictors $x$, the problem of “balancing” does not exist and, respectively, PS matching is not even needed. The exact identity of the groups is, of course, an idealization, but it emphasizes the real problem. If the $x$ variables predict the assignment variable $d$ well enough (say, $R^{2}$ is about 0.7–0.8, which is what we observed), it means that the assignment is very far from randomized, which contradicts the PS approach, intended to avoid such a situation. Conversely, if $x$ does not predict the assignment at all and $R^{2}$ is close to zero, then matching by PS in the two groups will be absolutely random, i.e., any $y$ from group 1 could be matched to any $y$ from group 0, and thus any “causal” conclusion will be meaningless.

This creates quite a paradoxical situation in which PS is either not needed or not working. Logistic regression, used in PS, by definition tries to maximize the goodness of fit, i.e., to obtain the best prediction possible. But the better the fit, the more distinct, in general (besides very special cases), the $x$ values are in the two groups, for otherwise a good prediction would not be obtained. The more the confounders’ values differ between the two groups, the less accurate the matching is, thus resembling the situation when the goodness of fit is low. The balance between those two scenarios is very subtle. Another aspect is that if the sample size is too small, the logistic approximation of the assumed step function will be poor. To avoid this additional trouble, we use simulations with 1,000 observations each, which is large enough to minimize such a risk.
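
The tension between fit and balance can be illustrated with a small sketch. It is not the simulation design of this study: it assumes a single standard normal confounder, a logistic assignment with a hypothetical strength parameter `beta`, and it uses the true propensity instead of a fitted one. The function name `group_separation` and both summary measures are illustrative choices.

```python
import numpy as np

def group_separation(beta, n=1000, seed=0):
    """Simulate one confounder x and a logistic assignment d, then return
    (standardized mean difference of x between the groups,
     difference of mean propensity between the groups)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                    # one confounder (assumption)
    p = 1.0 / (1.0 + np.exp(-beta * x))       # true propensity P(d=1 | x)
    d = rng.random(n) < p                     # assignment drawn from p
    pooled_sd = np.sqrt(0.5 * (x[d].var() + x[~d].var()))
    smd = (x[d].mean() - x[~d].mean()) / pooled_sd
    ps_gap = p[d].mean() - p[~d].mean()
    return smd, ps_gap

# Larger beta means x predicts d better, and the groups drift apart in x.
for beta in (0.0, 0.5, 1.0, 3.0):
    smd, gap = group_separation(beta)
    print(f"beta={beta:3.1f}  SMD of x={smd:5.2f}  mean PS gap={gap:5.2f}")
```

With `beta = 0` the assignment ignores $x$ and every unit has the same propensity, so matching on it is arbitrary; with large `beta` the assignment is predicted well, but the two groups share few comparable $x$ values.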

The third problem, related to the second, is that propensity scores in each group are usually very small. If there are high probabilities of belonging to one group for one set of objects and low probabilities for another, then matching looks easy and logical (this may happen with very strong confounders, but then the troubles of point two above arise). But if those probabilities are about 0.1–0.2 for each group, matching by whatever criterion becomes quite a chaotic exercise, which explains, in many cases, why PS so often yields to OLS or RCR.

Fourth, the whole idea of matching is based on the assumption that objects that are similar by the $x$ variables produce the same outcome (this is the exact reason for estimating the non-realized outcome ${Y}_{i}^{0}$ by the values of other objects). Ideally, if the $x$ values in the two groups are identical, the goal is achieved, and the closer the situation is to this ideal, the better. This illusion lies at the heart of the whole PS approach, which tries to escape the insolvability of the ATT equation, caused by the absence of half of the data, by pretending that, as a “missing data problem”, it does have a solution. But it is rather reminiscent of a cargo cult, where the visual similarity between an aircraft of the US Army and one made of wood by the inhabitants of a remote island would not let the latter fly. Why should similar $x$ produce a similar $y$? Adding random components in the data generation, as we did, eliminates this unrealistic hidden requirement, and it also contributed to the not-so-positive results of the simulation for PS: the higher the randomness of the data, the better regression or random coefficients regression performs compared to PS (see Table 3, rows 2, 3, 5). Other troubles with PS matching were discussed, for example, in (Pearl, 2009; King, Nielsen, 2019; Mandel, 2017).

The most important conclusion from the experiment is that in a wide variety of situations, PS results can be easily outperformed by regular regressions or by models with random coefficients. In fact, the PS results are always worse than those of regressions by the level of errors and often by $t$-statistics. As was said, the regular regression is just a direct comparison of the $y$ values in groups 1 and 0, and the whole propensity score machinery with its assumptions and versions fails to produce results significantly better than the most obvious traditional technique from Ronald Fisher’s times. For many years now, the search for real causes has slid into a search for imaginary potential outcomes, but nothing has really improved since then. This fact may seem counterintuitive: how could it be that a simple comparison of outcomes in two groups performs better than sophisticated methods of PS (or other techniques)?

To clarify it, let us consider a simple example. There are two clinical trials with 1,000 people in each and an equal distribution of old and young people. In the first, the treatment (say, vaccination) is assigned randomly among old and young people with a probability of 0.5 (the assumption of equal distributions is not actually required; it is made just for the sake of calculation simplicity). In the second study, having the same budget, doctors intuitively assign the same treatment to old patients twice as often as to young ones, thus introducing a strong age-related bias while retaining the same overall treatment rate of 0.5: young people are treated with a probability of 0.33 (totaling 500 $\times$ 0.33 $=$ 165) and old ones with a probability of 0.67 (totaling 335). In both studies, the treatment’s actual effect is 30 for old people and 10 for young, i.e., the doctors’ intuition is justified (we assume that “the more, the better”). The “background value” of the outcome of interest for both old and young participants is 100. It is easy to show that under those conditions the causal effect estimated by regression (the difference between average values in treated and non-treated groups) based on the observed outcome (the sum of the basic value and the treatment effect) in the treated group is 120.4 for random trials and 123.1 for biased trials. The actual value of the treatment effect for the treated group is 20.4 for random trials and 23.1 for biased trials. As seen, there are huge differences between actual and estimated values in both cases; however, the relative error (ATE) for random trials is 491%, while for biased ones it is 433%, i.e., smaller. This example with “age” as a variable affecting the “intrinsic bias” of the doctors is given just for illustrative purposes; in reality, the causes of non-random treatment assignments are much more complex.
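
The comparison of relative errors can be retraced from the rounded figures quoted in this example. The sketch below assumes that the relative error is the absolute difference between the estimated and actual values divided by the actual value; the function name `relative_error_pct` is illustrative. Because the inputs are rounded, the first result comes out near 490%, which agrees with the quoted 491% only up to that rounding.

```python
def relative_error_pct(estimated, actual):
    """Relative error in percent, assumed to be |estimated - actual| / actual."""
    return 100.0 * abs(estimated - actual) / actual

# Estimated and actual values for the treated group, as quoted in the text.
random_err = relative_error_pct(120.4, 20.4)   # randomized trial
biased_err = relative_error_pct(123.1, 23.1)   # age-biased trial

print(f"random: {random_err:.1f}%  biased: {biased_err:.1f}%")
```

The absolute gap is the same in both trials (the background value of 100), so the trial with the larger actual effect on the treated shows the smaller relative error.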

Regression modeling by itself is not a way to discover a causal relationship, which is a well-known fact (Lipovetsky, Mandel, 2015, 2015a; Mandel, 2015, 2017), though it may help if something that is not just potential but real is taken into account. If, instead of the “real” $x$ variables, one uses their irrelevant substitutes, as was once done with the number of storks “substituting” for the number of babies in a famous demonstration of fictitious correlation (Matthews, 2000), the causal source will be revealed by neither PS nor regression. The current paper just demonstrates that in practice regressions with reasonable variables can often be used in place of the sophisticated PS methods, but its main methodological message is “to stay away from counterfactual assumptions”, which lead into a deeper swamp than the typical “usual” regression misspecifications. The most promising studies in causal statistical implementation, such as (Peters et al. 2017), combining different ideas and merging the structural approach with machine learning, could eventually agree that the counterfactual theories lead to counterproductive results. Statistics was and is a science about the factual, not the counterfactual, universe, and this should be unequivocally recognized in practical and theoretical studies.

Acknowledgments

I thank Drs. S. Brodsky and M. Zakharevich for discussions, and especially Dr. S. Lipovetsky for help at different stages of this study.

Conflict of interest

The author declares no conflict of interest.

Funding

This research received no external funding.

Appendix. Random coefficients regression

Let us briefly describe the approach of the random coefficients regression and yield analysis considered in (Hildreth, Houck, 1968; Demidenko, Mandel, 2005; Lipovetsky, 2007). The random-coefficient Hildreth and Houck model can be formulated as a multiple regression:

$\displaystyle y_{i}=b_{i0}x_{i0}+b_{i1}x_{i1}+\ldots+b_{\textit{in}}x_{\textit{in}}+\varepsilon_{i},\qquad\text{(A1)}$

where $y_{i}$ are observations of the dependent variable ($i=$ 1, 2, …, $N$, where $N$ is the number of observations), $x_{\textit{ij}}$ are the predictors ($j=$ 0, 1, 2, …, $n$), and $\varepsilon_{i}$ are the errors. The coefficients $b_{\textit{ij}}$ differ across the observations, and the random intercept $b_{i0}$ corresponds to the dummy variable $x_{i0}=$ 1. The coefficients $b_{\textit{ij}}$ are assumed to be randomly distributed around the fixed center defined by the parameters $a_{j}$:

$\displaystyle b_{\textit{ij}}=a_{j}+\delta_{\textit{ij}},\quad j=0,\;1,\;2,\;\ldots,\;n,\qquad\text{(A2)}$

where $\delta_{\textit{ij}}$ is unobserved random noise for each $j^{\textit{th}}$ variable and $i^{\textit{th}}$ observation. Using (A2) in (A1) yields the model with the fixed coefficients:

$\displaystyle y_{i}=a_{0}x_{i0}+a_{1}x_{i1}+\ldots+a_{n}x_{\textit{in}}+u_{i},\qquad\text{(A3)}$

with the total error defined as

$\displaystyle u_{i}=\delta_{i0}x_{i0}+\delta_{i1}x_{i1}+\ldots+\delta_{\textit{in}}x_{\textit{in}}.\qquad\text{(A4)}$

The model (A3) looks like an OLS regression, but its error term depends on the variables themselves. The errors in each coefficient (A2) are assumed to be independent and identically distributed, with zero means and with the variance $\sigma_{j}^{2}$ for the $j$-th variable. Then the mean $E\left({u_{i}}\right)=$ 0, and the relation (A4) produces the expectation of the squared error in each $i^{\textit{th}}$ observation:

$\displaystyle E\left(u_{i}^{2}\right)=\sigma_{0}^{2}x_{i0}^{2}+\sigma_{1}^{2}x_{i1}^{2}+\ldots+\sigma_{n}^{2}x_{\textit{in}}^{2}.\qquad\text{(A5)}$
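
The step from (A4) to (A5) can be written out explicitly. This is a short sketch which assumes, as stated above, that the noise terms $\delta_{\textit{ij}}$ have zero means and are independent across the variables $j$, so that all cross terms vanish:

$\displaystyle E\left(u_{i}^{2}\right)=E\left[\Big(\sum_{j=0}^{n}\delta_{\textit{ij}}x_{\textit{ij}}\Big)^{2}\right]=\sum_{j=0}^{n}\sum_{k=0}^{n}x_{\textit{ij}}x_{\textit{ik}}E\left(\delta_{\textit{ij}}\delta_{\textit{ik}}\right)=\sum_{j=0}^{n}\sigma_{j}^{2}x_{\textit{ij}}^{2},$

since $E\left(\delta_{\textit{ij}}\delta_{\textit{ik}}\right)=0$ for $j\neq k$ and $E\left(\delta_{\textit{ij}}^{2}\right)=\sigma_{j}^{2}$.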

It is possible to obtain the OLS residuals $u_{i}$ in the regression (A3) and to treat their squares $u_{i}^{2}$ as the dependent variable in the model (A5), which can be considered a regression on the squared predictors $x_{j}^{2}$. Then the coefficients $\sigma_{j}^{2}$ of this regression serve as the variance parameters of the random coefficients (A2), and the error variances estimated by (A5) can be used in the weighted least squares regression (A3) with weights $w_{i}$ defined as:

$\displaystyle w_{i}=(\sigma_{0}^{2}x_{i0}^{2}+\ldots+\sigma_{n}^{2}x_{\textit{in}}^{2})^{-1}.\qquad\text{(A6)}$

Such an approach corresponds to the so-called iteratively re-weighted least squares (IRLS) technique. The random coefficients (A2) for the model (A1) can be found via minimization of the penalized least squares objective

$\displaystyle LS=\mathop{\sum}\limits_{i=1}^{N}w_{i}\left({y_{i}-b_{i0}x_{i0}-\ldots-b_{\textit{in}}x_{\textit{in}}}\right)^{2}+\mathop{\sum}\limits_{j=0}^{n}\frac{\left({b_{\textit{ij}}-a_{j}}\right)^{2}}{\sigma_{j}^{2}},\qquad\text{(A7)}$

with the weights $w_{i}$ (A6). The second sum in (A7) corresponds to minimizing the squared deviations of the random coefficients from the fixed coefficients, $b_{\textit{ij}}-a_{j}=\delta_{\textit{ij}}$ (A2), in the units of their variance. Derivatives of the objective (A7) with respect to the unknown random coefficients $b_{\textit{ij}}$ yield the equations $w_{i}u_{i}x_{\textit{ij}}=\left({b_{\textit{ij}}-a_{j}}\right)/\sigma_{j}^{2}$, where the residual errors $u_{i}$ (A3)–(A4) are used as estimators for the residuals (A1), and the weights $w_{i}$ are defined in (A6). From these equations, the random coefficients can be written as follows:

$\displaystyle b_{\textit{ij}}=a_{j}+w_{i}\sigma_{j}^{2}x_{\textit{ij}}u_{i}=a_{j}+\frac{\sigma_{j}^{2}x_{\textit{ij}}}{\sum\nolimits_{k=0}^{n}{\sigma_{k}^{2}x_{\textit{ik}}^{2}}}\left({y_{i}-a_{0}x_{i0}-\ldots-a_{n}x_{\textit{in}}}\right).\qquad\text{(A8)}$

This formula gives an explicit expression convenient for the practical estimation of the random coefficients $b_{\textit{ij}}$ via the fixed coefficients $a_{j}$ and their variances $\sigma_{j}^{2}$. The formula (A8) distributes the residual error from the model with the fixed coefficients among the predictors proportionally to the shares of their variances in the total variance. It is an exact decomposition of the residual errors in (A3), so in the model (A1) with the random coefficients (A8) the residuals are identically equal to zero. The predicted values of the dependent variable are:

$\displaystyle\tilde{y}_{i}=\mathop{\sum}\limits_{j=0}^{n}x_{\textit{ij}}b_{\textit{ij}}=\mathop{\sum}\limits_{j=0}^{n}x_{\textit{ij}}a_{\textit{j}}+\frac{\mathop{\sum}\limits_{j=0}^{n}\sigma_{\textit{j}}^{2}x_{\textit{ij}}^{2}}{\mathop{\sum}\limits_{k=0}^{n}\sigma_{\textit{k}}^{2}x_{\textit{ik}}^{2}}\left(y_{i}-\mathop{\sum}\limits_{j=0}^{n}x_{\textit{ij}}a_{j}\right)=y_{i}.\qquad\text{(A9)}$

Therefore, the values $\tilde{y}_{i}$ predicted by the random coefficients coincide with the observed values of the dependent variable, and there is no residual error. Estimating the variance parameters via regression (A5) usually produces poor results with many negative coefficients $\sigma_{j}^{2}$, although those should be non-negative by definition. A simple way to improve this part of the estimation is to take, as the variance parameters, the regular estimates of the variance of regression coefficients, proportional to the diagonal elements of the inverted covariance matrix.
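
The steps (A3)–(A9) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the study: it performs a single re-weighting pass instead of iterating, and it handles negative variance estimates by clipping them to a small positive `floor`, which is an assumption of this sketch and differs from the remedy suggested above. The function name `random_coefficients` and the test data are hypothetical.

```python
import numpy as np

def random_coefficients(X, y, floor=1e-8):
    """X: (N, n+1) design matrix with a leading column of ones; y: (N,).
    Returns fixed coefficients a, variance parameters s2, and the matrix B
    of individual coefficients b_ij from (A8)."""
    # (A3): OLS for the fixed coefficients and residuals.
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ a
    # (A5): regress squared residuals on squared predictors.
    X2 = X ** 2
    s2, *_ = np.linalg.lstsq(X2, u ** 2, rcond=None)
    s2 = np.maximum(s2, floor)        # assumption: clip negative estimates
    # (A6): weights, then one weighted least squares pass for (A3).
    w = 1.0 / (X2 @ s2)
    sw = np.sqrt(w)
    a, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    u = y - X @ a
    # (A8): b_ij = a_j + w_i * sigma_j^2 * x_ij * u_i.
    B = a + (w * u)[:, None] * (X * s2)
    return a, s2, B

# Hypothetical data with noise that grows with the first predictor.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5 + 0.5 * np.abs(X[:, 1]))

a, s2, B = random_coefficients(X, y)
y_rebuilt = (X * B).sum(axis=1)       # (A9): should reproduce y
print(np.allclose(y_rebuilt, y))
```

The final check mirrors (A9): whatever positive variance parameters are used, the individual coefficients redistribute the whole residual, so the reconstructed values coincide with the observed ones.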

References

1. Anderson R.D., & Vastag (2004). Causal modeling alternatives in operations research, overview and application. European J of Operational Research, 156(1), 92-109.

2. Austin P.C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399-424.

3. Bai, & Clark M.H. (2018). Propensity score methods and applications. Sage, Los Angeles, USA.

4. Baldi, & Shahbaba (2020). Bayesian causality. The American Statistician, 74(3), 1-16.

5. Beal S.J., & Kupzyk K.A. (2014). An introduction to propensity scores, what, when, and how. J of Early Adolescence, 34(1), 66-92.

6. Cardenas I.C., Voordijk, & Dewulf (2017). Beyond theory, towards a probabilistic causation model to support project governance in infrastructure projects. International J of Project Management, 35(3), 432-450.

7. Cunningham (2021). Causal Inference, The Mixtape. Yale University Press.

8. Dagum, & Luby (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60(1), 141-153.

9. Dawid A.P., Musio, & Fienberg S.E. (2016). From statistical evidence to evidence of causality. Bayesian Analysis, 11(3), 725-752.

10. Demidenko, & Mandel (2005). Yield Analysis and Mixed Model. Proceedings of Joint Statistical Meeting, ASA, Minneapolis, USA.

11. Guo, & Fraser M.W. (2009). Propensity score analysis, statistical methods and applications. Sage, Thousand Oaks, USA.

12. Hildreth, & Houck J.P. (1968). Some estimators for a linear model with random coefficients. J of the American Statistical Association, 63, 584-595.

13. Imai, King, & Stuart E.A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. J of the Royal Statistical Society, Series A, 171, 481-501.

14. Imbens G.W., & Rubin D.B. (2015). Causal inference for statistics, social, and biomedical sciences, an introduction. Cambridge University Press, USA.

15. Ingthorson R.D. (2021). A powerful particular view of causation. Routledge, USA.

16. King, & Nielsen (2019). Why propensity scores should not be used for matching. Political Analysis, 27(4), 435-454.

17. Lane F.C., To Y.M., Shelley, & Henson R.K. (2012). An illustrative example of propensity score matching with education research. Career and Technical Education Research, 37(3), 187-212.

18. Leite (2016). Practical propensity score methods using R. Sage, Los Angeles, USA.

19. Lewis (1973). Counterfactuals. Oxford, Blackwell.

20. Ling, Montez-Rath, Mathur, Kapphahn, & Desai (2020). How to apply multiple imputation in propensity score matching with partially observed confounders, a simulation study and practical recommendations. J of Modern Applied Statistical Methods, 19(1), 2-64.

21. Lipkovich, Ratitch, Zhang, Shan, & Mallinckrodt (2022). Using principal stratification in analysis of clinical trials. Statistics in Medicine, 1-41. doi: 10.1002/sim.9439.

22. Lipovetsky (2013). Data fusion in several algorithms. Advances in Adaptive Data Analysis, 5(3), 1-12. doi: 10.1142/S1793536913500143.

23. Lipovetsky (2018). Causal nets, interventionism, and mechanisms, philosophical foundations and applications, by Gebharter, book review. Technometrics, 60(1), 127.

24. Lipovetsky (2016). Combined Granger-Koyck causality distributed lag modeling. International J of Operations and Quantitative Management, 22(4), 317-333.

25. Lipovetsky (2007). Iteratively re-weighted random-coefficient models and Shapley value regression. Model Assisted Statistics and Applications, 2, 201-212.

26. Lipovetsky, & Mandel (2015). Modeling probability of causal and random impacts. J of Modern Applied Statistical Methods, 14(1), 180-195.

27. Lipovetsky, & Mandel (2015a). Handbook of causal analysis in social research, by Morgan S.L., Ed., book review. Technometrics, 57(2), 298-300.

28. Mandel (2017). Causality modeling and statistical generative mechanisms. In Braverman Reading in Machine Learning, Lecture Notes in Artificial Intelligence, Springer, 148-188.

29. Mandel (2015). Troublesome dependency modeling, causality, inference, statistical learning, SSRN.

30. Mandel, & Lipovetsky (2022). Propensity Scores – Do They Really Work? Simulation Study, SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4215007.

31. Matthews (2000). Storks Deliver Babies (p = 0.008). Teaching Statistics, 22(2), 36-38.

32. Menzies (2019). Counterfactual theories of causation, Stanford Encyclopedia of Philosophy, https://plato.stanford.edu/entries/causation-counterfactual/.

33. Middleton (2003). Decision modeling using Excel, https://www.excelforum.com/excel-general/334950-generating-correlated-random-values-in-excel.html.

34. Pearl (2009). Causality, Models, Reasoning, and Inference, 2nd ed., Cambridge University Press, New York, USA.

35. Pearl (2015). Causes of effects and effects of causes. Sociological Methods and Research, 44(1), 149-164.

36. Peters, Janzing, & Scholkopf (2017). Elements of Causal Inference, The MIT Press, USA.

37. Richardson T.S., & Robins J.M. (2023). Potential outcome and decision theoretic foundations for statistical causality. https://arxiv.org/pdf/2302.03899.pdf.

38. Rosenbaum P.R. (2002). Observational Studies, 2nd ed., Springer-Verlag, New York, USA.

39. Rosenbaum P.R., & Rubin D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.

40. Rubin D.B. (2006). Matched Samples for Causal Effects. Cambridge University Press, Cambridge, USA.

41. Sekhon (2009). The Neyman-Rubin model of causal inference and estimation via matching methods. In The Oxford Handbook of Political Methodology, Box-Steffensmeier J.M., Brady H.E., Collier, Eds., Oxford University Press, UK, 271-299.

42. Stürmer, Joshi, Glynn R.J., Avorn, Rothman K.J., & Schneeweiss (2006). A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. J of Clinical Epidemiology, 59, 437-447.

43. VanderWeele (2016). Commentary, on causes, causal inference, and potential outcomes. Int J Epidemiol, 45(6), 1809-1816. doi: 10.1093/ije/dyw230.

44. Williamson E.J., & Forbes (2014). Introduction to propensity scores. Respirology, 19, 625-635.

45. Zagar A.J., Kadziola, Lipkovich, & Faries D.E. (2017). Evaluating different strategies for estimating treatment effects in observational studies. J of Biopharmaceutical Statistics, 27(3), 535-553.

46. Zagar A.J., Kadziola, Lipkovich, Madigan, & Faries D.E. (2022). Evaluating bias control strategies in observational studies using frequentist model averaging. J of Biopharmaceutical Statistics, 1-30. doi: 10.1080/10543406.2021.1998095.

47. Zhang, Yang, Faries D.E., Lipkovich, & Kadziola (2021). Practical recommendations on double score matching for estimating causal effects. Statistics in Medicine, 1-25. doi: 10.1002/sim.9289.

48. Zigler M.C., & Dominici (2014). Uncertainty in propensity score estimation, Bayesian methods for variable selection and model-averaged causal effects. Journal of the American Statistical Association, 109, 95-107.

