Discovering context specific causal relationships

Abstract

With the increasing need for personalised decision making, such as personalised medicine and online recommendations, growing attention has been paid to the discovery of the context and heterogeneity of causal relationships. Most existing methods, however, assume a known cause (e.g. a new drug) and focus on identifying from data the contexts of heterogeneous effects of the cause (e.g. patient groups with different responses to the new drug). There is no approach for efficiently detecting context specific causal relationships directly from observational data, i.e. discovering the causes and their contexts simultaneously. In this paper, by taking advantage of highly efficient decision tree induction and the well established causal inference framework, we propose the Tree based Context Causal rule discovery (TCC) method for efficient exploration of context specific causal relationships from data. Experiments with both synthetic and real world data sets show that TCC can effectively discover context specific causal rules from the data.

Keywords

Decision trees, context specific causal rules, potential outcome model

1. Introduction

Causal relationships reveal the causes behind phenomena and provide insights into the mechanisms of complex systems, so finding causal relationships is a central task in many areas. Several causal models, such as causal Bayesian networks [22], structural equation models [4] and the potential outcome model [28], have been proposed to represent and infer causal relationships which are global or context free.

In reality, a variable (e.g. a therapeutic procedure) often has a strong causal effect on an outcome only when other variables (e.g. genomic profiles) take specific values. The former variable is called a cause or treatment, while the latter are contexts that define a subpopulation. Such causal relationships are called context specific causal relationships in this paper.

The discovery of context specific causal relationships has important applications in various areas [11, 35, 10]. For example, for most economic outcomes, it is important to know, for different industries (contexts), the most effective policies (causes/treatments) to implement. To maximise profit, it is essential to find the customer groups with different shopping profits (contexts) and the profitable products (causes/treatments) for the corresponding groups.

Context specific causal relationships, however, are hidden and difficult to discover, since the overall causal effect may be averaged out to be marginal in the whole population. For example, if some patients respond positively to a treatment and some respond negatively, the overall effect among all patients can be marginal. A straightforward solution is to assess the treatment effect under all different conditions/contexts, but this is infeasible given the large number of possible conditions.

Recently, researchers have sought to apply data mining and machine learning techniques to the investigation of treatment effect heterogeneity [34, 2]. These techniques are utilised to efficiently find the contexts (subpopulations) across which heterogeneous effects of a treatment can be observed. This work has made it practical to discover the contexts and heterogeneity of causal effects.

However, from the data mining perspective, these techniques bear a major limitation. They assume a known cause (i.e. treatment) variable and focus on finding the proper contexts where the cause has heterogeneous causal effects on the outcome. Therefore, they are not suitable for the exploration of context specific causal relationships in data, where the causes are unknown. With the assumption of a known cause relaxed, a big challenge arises in finding context specific causal relationships directly from data, namely how to distinguish potential causal/treatment variables from context variables.

Our goal is to design a data mining method to discover context specific causal relationships without knowing or assuming a cause, that is, to find both contexts and the causal relationships under the contexts simultaneously. Our approach to this challenge, TCC (Tree based Context Causal rule discovery) adapts decision tree induction, like the work in [2], but in a very different way. In [2], a causality based criterion is used to build a causal tree for finding the subpopulations across which a treatment has heterogeneous effects. Instead we directly make use of the highly efficient and mature decision tree algorithm [25] to find candidate causes and context variables with respect to a given target. Then within the much reduced search space, we employ the potential outcome model [28] to assess the candidates to identify causes and their contexts.

We use a decision tree as the base for the following two reasons. Firstly, a rational assumption is that contexts and causes are all highly related to the target, so it is reasonable to use a decision tree to select the candidates. Meanwhile, each decision rule encodes context specific relationships between predictor variables and the target, which are likely indicators of context specific causal relationships. Secondly, decision tree induction is efficient for both large and high dimensional data, and hence basing TCC on it will be practical for various applications. In contrast, building a causal tree is multiple orders of magnitude slower than building a normal decision tree, as the causality based criterion is evaluated at each split of the tree construction to examine each variable when choosing the optimal branching variable [2, 15].

We further extend decision rules to context specific causal rules for actionable decision making. For example, along a path in a decision tree, a decision rule like $(X_{1}=1,X_{2}=1)\rightarrow(Y=1)$ shows the co-occurrence of $(X_{1}=1,X_{2}=1)$ and $Y=1$ , which is sufficient for classification. However, such a rule is insufficient for actions, since it is important to know which variable leads to the change of $Y$ in actionable decision making. For example, in biomedical experimental design, the decision rule can be interpreted as “ $X_{1}\rightarrow Y|X_{2}=1$ ” or “ $X_{2}\rightarrow Y|X_{1}=1$ ”, which means totally different manipulation operations. The former refers to manipulating $X_{1}$ under the context $X_{2}=1$ , while the latter is to manipulate $X_{2}$ when $X_{1}=1$ . Therefore, causality based examination is in demand to identify the causes and their contexts for evidence based decision making.

We regard this work as a journey of causal knowledge discovery from large data sets. Our method design therefore focuses on practicality, and the TCC algorithm aims at quickly finding meaningful causal signals and their contexts in a large data set. The experimental results show that TCC performs consistently on both synthetic and real world data sets, and they also demonstrate its high efficiency.

One significance of our work is that we demonstrate that a supervised learning method can be easily adapted for causal discovery with high efficiency and high quality.

In the rest of this paper, Section 2 reviews related work. The problem statement is presented in Section 3, followed by a practical definition of context specific causal rules under the potential outcome model. The proposed method is discussed in Section 4. Section 5 demonstrates the performance of the proposed method. Finally, we conclude the paper in Section 6.

2. Related work

Much attention has been paid to causal discovery from observational data. Various causal models have been developed for causal relationship discovery [22, 15, 17, 29]. The potential outcome model [28] has been widely used for the estimation of causal relationships. Matching methods [32] have been developed to remove confounding when estimating the average causal effect of the treatment on the outcome. Rosenbaum and Rubin [27] proposed propensity score matching for average causal effect estimation, where a logistic regression is used to estimate the propensity score.

A growing literature focuses on modelling and finding context specific relationships. A stream of research is to derive Context Specific Independence (CSI) based on a known Bayesian network [5, 12]. Researchers intended to speed up the Bayesian network inference algorithms by introducing the concept of CSI. Instead of aiming at fast inference with Bayesian networks, some others focused on extending a Bayesian network by adding special notations, such as labelled graphical models [6], gates [19] and stratified Gaussian graphical models [21], such that the extended Bayesian network can explicitly present the context specific causal relationships. These methods normally assume that global dependency relationships between variables are known in advance.

Another main stream of research related to context specific causal discovery is subgroup analysis. Subgroup analysis is commonly used to evaluate the treatment effects in a specific subpopulation defined by some context variables. Su et al. [33] adapted the idea of recursive partitioning to construct an interaction tree for causal effect estimation. Dudík et al. [8] developed an approach to obtain the optimal policy via doubly robust estimation. Supervised machine learning approaches have also been applied to estimate heterogeneous causal effects [34, 2].

However, these methods are designed to validate hypothesised causal effects in subgroups, and the hypotheses are provided from domain knowledge at the commencement of a study. Relying on subjective hypotheses may mean that previously unobserved patterns and relationships are never tested. Our aim is not only to validate hypothesised causal relationships, but also to find previously unobserved ones. Thus computational methods are required to discover causal relationships from observational data automatically.

The causal decision tree method [15] was developed to explore both general and context specific causal relationships. Specifically, the causal relationship between the root node and the outcome is context free, while non-root nodes are causes of the outcome under the context of their parent nodes. Although such trees have wide practical applications, they have the limitation that the contexts of causes must themselves be causes (or context specific causes) of the outcome.

3. Problem statement and definitions

In this section, we firstly state the research problem of this work, then we define context specific causal rules, and discuss how to identify a context specific causal rule from data.

3.1 Research problem

The objective of the work is to find context specific causal relationships in data. Specifically, we aim to find context specific causal rules as stated below.

Given a data set $\bm{D}$ with a set of predictor variables $\bm{V}$ and the target variable $Y$ , find all the potential treatment variables $X_{p}\in\bm{V}$ and the corresponding context variables $\bm{X}_{c}$ ( $\bm{X}_{c}\subset\bm{V}\backslash X_{p}$ ), such that $X_{p}\rightarrow Y$ is a causal rule when $\bm{X}_{c}=\bm{x}_{c}$ .

A rule $X\rightarrow Y$ is causal, if the treatment variable $X$ has a significant causal effect on the target variable $Y$ , that is, varying $X$ will result in a significant change of $Y$ . In other words, a causal rule $X\rightarrow Y$ satisfies two major conditions: (i) the variable $X$ precedes the target $Y$ , and (ii) if $X$ had not happened, $Y$ would be different.

The first condition specifies a temporal relationship between variables $X$ and $Y$ , which normally can be identified with domain knowledge. In our study, we always assume that all treatment variables precede the outcome temporally. The second condition is at the conceptual level and it indicates that outcome $Y$ would be different when the same individual received a treatment and did not receive it. The difference between the two outcomes under treatment and no treatment is typically called the treatment/causal effect [29, 22].

Similarly, we have the following criteria for identifying a context specific causal rule. A rule $X_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c}$ is a causal rule in the context $\bm{X}_{c}=\bm{x}_{c}$, if (i) the treatment variable $X_{p}$ and the context variables $\bm{X}_{c}$ are disjoint, i.e. $X_{p}\cap\bm{X}_{c}=\emptyset$, and (ii) within the context $\bm{X}_{c}=\bm{x}_{c}$, $X_{p}$ has a significant causal effect on $Y$. In a special case, $\bm{X}_{c}$ can be an empty set, and thus the context specific causal rule becomes a general causal rule (i.e. context free).

In the next section, we will formally present a practical definition of causal rules and context specific causal rules, and the estimation of causal effects.

3.2 Causal rule definition

The potential outcome model [28, 20] is widely used in the estimation of causal effects in social science, health and medical research. In this model, an individual $i$ in a population has two potential outcomes with respect to a treatment (we only consider binary treatment in this paper): when taking the treatment ( $X_{p}=1$ ), the potential outcome is $Y_{1}(i)$ ; and when not taking the treatment ( $X_{p}=0$ ), the potential outcome is $Y_{0}(i)$ .

However, for an individual $i$ , we can only observe one of the two potential outcomes, either $Y_{1}(i)$ or $Y_{0}(i)$ . The unobserved outcomes, namely the counterfactual outcomes, need to be estimated by using the observed outcomes, such that we can compare the difference of outcomes when receiving treatment or control.

The individual level causal effect is expressed as $Y_{1}(i)-Y_{0}(i)$ . The causal effects of individuals in a population are normally aggregated to get the Average Causal Effect (ACE) as defined below:

$\displaystyle\textit{ACE}(X_{p}\rightarrow Y)=E[Y_{1}]-E[Y_{0}]$ (1)

where $E[.]$ stands for the expectation operator in probability theory. Note that $i$ is omitted when we focus on the population level potential outcomes.

With the definition of ACE, the practical definitions of causal rules and context specific causal rules are formally presented in the following.

Definition (Causal rules).

Given a data set $\bm{D}$ with a set of predictor variables $\bm{V}$ and the target variable $Y$ , a rule $X_{p}\rightarrow Y$ ( $X_{p}\in\bm{V}$ ) is a causal rule, if $\textit{ACE}(X_{p}\rightarrow Y)\geqslant\eta$ in $\bm{D}$ , where $\eta$ is the minimal causal effect threshold.

Definition (Context specific causal rules).

Given a data set $\bm{D}$ with a set of predictor variables $\bm{V}$ and the target variable $Y$, a rule $X_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c}$ ($X_{p}\in\bm{V},\bm{X}_{c}\subset\bm{V}$ and $X_{p}\cap\bm{X}_{c}=\emptyset$) is a causal rule in the context $\bm{X}_{c}=\bm{x}_{c}$, if $\textit{ACE}(X_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c})\geqslant\eta$, where $\eta$ is the minimal causal effect threshold.

The threshold $\eta$ can be determined based on domain knowledge.

Note that in this paper we assume that the differences of individuals could be captured by the covariates, i.e. the set of variables used for stratification. This assumption implies that there are no hidden confounding variables to bias the causal effect estimation.

3.3 Causal effect estimation

The major issue for ACE estimation is to unbiasedly estimate the counterfactual outcomes, e.g. what the effect would be if a person had not taken a treatment (actually the person did take the treatment). If we have two groups of individuals, one group taking a treatment and another not, and the two groups of individuals have the same characteristics apart from being treated or not, we can straightforwardly estimate the counterfactual outcomes based on the observed outcomes. In this process, the indistinguishability of two groups apart from treated or not is essential.

Randomised treatment assignment is a way to achieve indistinguishability. However, with observational data, such random assignments of treatments are often not guaranteed. In this case, stratification of the data set is a way of trying to achieve the indistinguishability. In each stratified sub data set, the records of all covariates take the same values in the treatment ( $X=1$ ) and control ( $X=0$ ) groups, respectively. Thus under the stable unit treatment value assumptions [29], the individuals of the two groups in a stratum are indistinguishable, except the state of the treatment. Then in each stratum, we can unbiasedly estimate the counterfactual outcomes and obtain ACE.

Now we present the details of the procedure of causal effect estimation with observational data.

3.3.1 Variables used for stratification

The first step of causal effect estimation is to determine the set of covariate variables (denoted by $\bm{C}$ in the paper) to be used for stratifying data. In a non-experimental study, a key assumption for the variable selection is the unconfoundedness [27]:

The treatment assignment $X$ is independent of the potential outcomes ( $Y_{0}$ , $Y_{1}$ ) given the covariates $\bm{C}$ , i.e. $X\perp(Y_{0},Y_{1})|\bm{C}.$

Figure 1. A causal diagram.

The causal diagram in Fig. 1 is used to help with the following discussion of covariate selection. In this figure, the nodes represent variables and the edges denote the causal links between the nodes. (Note that a causal diagram is different from an influence diagram, where an arrow denotes an influence and does not necessarily imply a causal relation.)

For example, the connection $X\rightarrow Y$ means $X$ is a cause of the target $Y$. Apart from the treatment $X$ and the target $Y$, the other variables are categorised into four different types: (i) Indirect causes (e.g. $I$), which indirectly cause a change of $Y$; (ii) Confounders (e.g. $C$), which are the common causes of $X$ and $Y$; (iii) Direct causes (e.g. $D$), which are direct causes of $Y$, apart from $X$; and (iv) Irrelevant variables (e.g. $U$), which are totally independent of both $X$ and $Y$.

From this causal structure, we can see that the confounders $C$ are the ones that may influence the causal effect estimation of the treatment $X$ on $Y$. Thus to satisfy the assumption of unconfoundedness, variables known to have causal effects on both treatment assignment and the outcome, i.e. the confounders $C$ shown in Fig. 1, are required to be included in stratification [32]. Unfortunately, the causal graph is typically unknown. In this paper, all variables that are associated with both the treatment variable $X$ and the target $Y$ are included in the covariates $\bm{C}$ for stratification, as a variable can never be a cause of another if they are independent. The covariates $\bm{C}$ are normally a superset of the confounders $C$ in Fig. 1. With the propensity score method used in this paper (details in Section 3.3.2), it has been shown that including variables that do not actually affect the causal effect estimation of $X$ on $Y$ costs less, in terms of increased bias, than excluding potentially important confounders [32].
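The covariate selection rule above can be sketched as follows. The text does not state which association test is used, so the Pearson chi-square statistic on a 2x2 table and the 3.84 cut-off (5% level, one degree of freedom) are assumptions of this sketch, as are the function names and the dictionary-of-columns data layout with binary variables.

```python
def chi_square_2x2(xs, ys):
    """Pearson chi-square statistic for two binary (0/1) sequences."""
    n = len(xs)
    a = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a constant variable carries no association
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_covariates(data, treatment, target, critical=3.84):
    """Return the variables associated with BOTH the treatment and the target.

    data: dict mapping variable name -> list of 0/1 values (hypothetical layout).
    critical: assumed chi-square cut-off (5% level, 1 degree of freedom).
    """
    covariates = []
    for name, values in data.items():
        if name in (treatment, target):
            continue
        if (chi_square_2x2(values, data[treatment]) > critical
                and chi_square_2x2(values, data[target]) > critical):
            covariates.append(name)
    return covariates
```

A variable that is unrelated to either the treatment or the target, such as $U$ in Fig. 1, yields a statistic near zero for at least one of the two tests and is left out of $\bm{C}$.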

3.3.2 Distance measures and stratification

The second step is to choose a distance measure for stratification. Perfect stratification (i.e. all samples in a stratum have exactly the same covariate values) is ideal for eliminating bias, but it does not work when the number of covariates is large, since the statistical power is lost quickly as the number of covariates increases. To improve the statistical power, approximate stratification methods have been developed to group individuals with similar (not identical) covariate distributions.

Various distance measures, e.g. the Minkowski distance and the Mahalanobis distance, can be used for stratification, but most of them do not perform well when there are many covariates under study [9]. The propensity score [27, 32] is another commonly used distance measure, which summarises the covariates $\bm{C}$ into one scalar: the probability of an individual receiving the treatment conditional on $\bm{C}$:

$\displaystyle e(\bm{C})=\textit{Prob}(X=1|\bm{C}=\bm{c}).$ (2)

Subclassification on propensity score [32] is used here to do stratification, i.e. grouping individuals with similar propensity scores to a stratum, such that individuals are indistinguishable (in terms of receiving the treatment or not) within one stratum.
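A minimal sketch of propensity score subclassification is given below. It is not the estimator used in the paper: the score is approximated here by the empirical treated fraction among records sharing the same covariate values, as a stand-in for a fitted model such as logistic regression, and the choice of five equal-width strata is an assumption. All names are hypothetical.

```python
from collections import defaultdict


def propensity_scores(covariate_rows, treatments):
    """Estimate e(C) = Prob(X=1 | C=c) by the empirical treated fraction
    among records that share the same covariate values."""
    treated = defaultdict(int)
    total = defaultdict(int)
    for row, x in zip(covariate_rows, treatments):
        key = tuple(row)
        total[key] += 1
        treated[key] += x
    return [treated[tuple(row)] / total[tuple(row)] for row in covariate_rows]


def subclassify(scores, n_strata=5):
    """Assign each record to one of n_strata equal-width propensity bins."""
    # A score of exactly 1.0 is placed in the last bin.
    return [min(int(s * n_strata), n_strata - 1) for s in scores]
```

Records that land in the same bin have similar estimated probabilities of being treated, which is the sense in which they are treated as indistinguishable within a stratum.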

3.3.3 Causal effect estimation

After the data set has been stratified based on propensity scores, we can estimate the causal effect within each stratum and then aggregate the causal effects over the strata to obtain the overall causal effect. In each stratum $\bm{C}=\bm{c}_{k}$, a contingency table, shown in Table 1, is generated for the estimation of the average causal effect. Here $a$, $b$, $c$ and $d$ are the counts of records for each combination of values of $X_{p}$ and the outcome $Y$, and $n_{k}=a+b+c+d$ is the number of samples/individuals in the sub data set with the context $\bm{C}=\bm{c}_{k}$.

Table 1
An example of notations for a contingency table

$X_{p}$	$Y=1$	$Y=0$	Total
1	$a$	$b$	$a+b$
0	$c$	$d$	$c+d$
Total	$a+c$	$b+d$	$n_{k}$

Referring to the definition of ACE, the causal effect is the difference of the outcomes in two groups. Thus in the stratum $\bm{c}_{k}$ , the average causal effect is expressed as

$\displaystyle\textit{ACE}(X_{p}\rightarrow Y|\bm{c}_{k})=\frac{a}{a+b}-\frac{c}{c+d}.$ (3)

The ACE in a population is determined by aggregating the ACEs in all strata

$\displaystyle\textit{ACE}(X_{p}\rightarrow Y)=\sum_{k}w_{k}\textit{ACE}(X_{p}\rightarrow Y|\bm{c}_{k}),$ (4)

where $w_{k}$ is the weight of the stratum $\bm{C}=\bm{c}_{k}$ . In this paper, $w_{k}$ is set as the ratio of the sample size of $\bm{c}_{k}$ to the size of data $\bm{D}$ .
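Eqs. (3) and (4) can be illustrated with a short sketch. The function names are hypothetical, and skipping strata that contain no treated or no control records is an assumption of this sketch; the text does not say how such strata are handled.

```python
def stratum_ace(a, b, c, d):
    """Eq. (3): outcome rate among the treated minus the rate among controls."""
    return a / (a + b) - c / (c + d)


def overall_ace(tables):
    """Eq. (4): weighted sum of stratum ACEs, with weights proportional to n_k.

    tables: list of (a, b, c, d) counts, one tuple per stratum, using the
    notation of Table 1. Strata lacking treated or control records are
    skipped (an assumption), and the weights are normalised over the
    strata that remain.
    """
    usable = [t for t in tables if t[0] + t[1] > 0 and t[2] + t[3] > 0]
    n_total = sum(sum(t) for t in usable)
    return sum(sum(t) / n_total * stratum_ace(*t) for t in usable)
```

For two strata with counts (30, 10, 10, 30) and (20, 20, 10, 10), the stratum effects are 0.5 and 0, the weights are 80/140 and 60/140, and the aggregated effect is 2/7.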

4. Context specific causal rule discovery

In this section, we firstly present the proposed algorithm, TCC, for mining context specific causal rules with a single decision tree, then we introduce a variant of TCC to explore context specific causal rules with multiple trees.

4.1 TCC with a single decision tree

As shown in Algorithm 1, TCC contains two major parts: decision rule selection (lines 1 to 5) and causal rule discovery with a pruning strategy (lines 6 to 17).

TCC firstly picks up a proper search base for finding causal rules by learning a decision tree from the data. C4.5 [25] is employed to build a decision tree from data. We restrict the minimum number of instances per leaf, such that there are enough samples for the ACE estimation. Each path in the decision tree is a decision rule $\bm{R}$ expressed as $(\bm{X}=\bm{x})\rightarrow(Y=y)$ or $\bm{x}\rightarrow y$ , where $\bm{X}$ and $Y$ are the predictor variables and the target on a path of the decision tree, and $\bm{x}$ and $y$ are the corresponding values respectively.

Algorithm 1. Tree based Context Causal rule discovery (TCC) algorithm

Input: A data set $\bm{D}$ for predictor variable set $\bm{V}$ and the target $Y$, the minimal confidence threshold $\theta$, and the minimal causal effect threshold $\eta$.
Output: $\bm{C_{Y}}$, the set of causal rules to the target $Y$.

    // Building single or multiple decision trees
1:  $\bm{T}=\emph{decisionTree}(\bm{V},Y)$
2:  $\bm{C_{Y}}=\emptyset$
3:  for each decision rule $\bm{R}$ of decision trees $\bm{T}$ do
4:      extract the predictor variables $\bm{X}$ and $Y$ and their corresponding values $\bm{x}$ and $y$ from $\bm{R}$
5:      if $\emph{confTest}(\bm{X},Y)\leqslant\theta$ then continue
        // Global causal test
6:      for each variable $X_{p}\in\bm{X}$ do
7:          $\emph{ACEValue}=\emph{causalTest}(\bm{D},X_{p},\emptyset,\emptyset,Y)$
8:          if $\emph{ACEValue}>\eta$ then
9:              $\bm{C_{Y}}=\bm{C_{Y}}\cup\{X_{p}\rightarrow Y\}$
        // Context specific causal test
10:     for each variable $X^{\prime}_{p}\in\bm{X}$ do
11:         $\bm{X^{\prime}}=\bm{X}\backslash\{X^{\prime}_{p}\}$
12:         for each variable set $\bm{X}_{c}\subset\bm{X^{\prime}}$ with values $\bm{X}_{c}=\bm{x}_{c}$ do
13:             if $\emph{redundantTest}(X^{\prime}_{p},\bm{X}_{c},\bm{x}_{c},\bm{C_{Y}})$ then continue
14:             $\emph{ACEValue}=\emph{causalTest}(\bm{D},X^{\prime}_{p},\bm{X}_{c},\bm{x}_{c},Y)$
15:             if $\emph{ACEValue}>\eta$ then
16:                 $\bm{C_{Y}}=\bm{C_{Y}}\cup\{X^{\prime}_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c}\}$
17: Output $\bm{C_{Y}}$

To guarantee statistical significance, we also use Fisher's exact test to prune branches of a decision tree [16]. With the notation in Table 1, $X_{p}$ and $Y$ here refer to a branching variable and the outcome, and the $p$-value is given by:

$p([a,b;c,d])=\sum_{i=0}^{\min(b,c)}\frac{(a+b)!(c+d)!(a+c)!(b+d)!}{n_{k}!(a+i)!(b-i)!(c-i)!(d+i)!}$

A low $p$ -value means that the null hypothesis (i.e. $X_{p}$ and $Y$ are independent) is rejected. We only keep branches that are statistically significant (with low $p$ -values).
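The $p$-value formula above can be computed as in the following sketch, which uses binomial coefficients in place of the factorial ratio (the two forms are algebraically equal). The function name is hypothetical, and the significance level used to decide which branches to keep is not specified in the text.

```python
from math import comb


def fisher_p_value(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [a, b; c, d].

    Sums the hypergeometric probabilities of the observed table and of every
    table with the same margins in which a is larger, i.e. the tables
    (a+i, b-i, c-i, d+i) for i = 0..min(b, c).
    """
    n = a + b + c + d
    denom = comb(n, a + c)
    return sum(comb(a + b, a + i) * comb(c + d, c - i)
               for i in range(min(b, c) + 1)) / denom
```

For the table [3, 1; 1, 3] the sum has two terms, 16/70 and 1/70, giving a $p$-value of 17/70, so such a branch would not be kept at a 5% level.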

Given all decision rules of a decision tree, the predictor variables in each decision rule are considered as the search base for both potential causes and contexts, as a decision rule encodes context specific relationships. Then a confidence test (line 5 in Algorithm 1) is conducted to remove decision rules with low confidence, since the causal signal in a low confidence rule is weak. Here the confidence of $\bm{x}\rightarrow y$ is defined as the proportion of individuals containing $\bm{x}$ that also contain $y$. Only a decision rule with high confidence, i.e. exceeding the specified minimal confidence threshold $\theta$, is inserted into the candidate set for causal rule discovery.
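The confidence of a decision rule, as defined above, can be computed as in this sketch. The record layout (a list of dictionaries) and the function name are hypothetical.

```python
def rule_confidence(records, antecedent, target, target_value):
    """Fraction of records matching the antecedent x that also have Y = y.

    records: list of dicts mapping variable name -> value (hypothetical layout).
    antecedent: dict of predictor values along one path of the decision tree.
    """
    matching = [r for r in records
                if all(r[var] == val for var, val in antecedent.items())]
    if not matching:  # no record supports the antecedent
        return 0.0
    hits = sum(1 for r in matching if r[target] == target_value)
    return hits / len(matching)
```

A rule is then dropped when this value does not exceed the threshold $\theta$, matching the comparison on line 5 of the algorithm.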

For a high confidence decision rule $\bm{x}\rightarrow y$ , global causal tests are performed to detect if $X_{p}\rightarrow Y$ ( $X_{p}\in\bm{X}$ ) is a global causal rule. Lines 6 to 9 show this process, where Eq. (1) is employed to estimate the ACE.

Then we move to context specific causal rule discovery. The discovery of context specific causal rules from a decision rule includes two nested loops (lines 10 to 17 in Algorithm 1). In the outer loop, we traverse the predictor variables $\bm{X}$ in $\bm{R}$ as the candidate treatment variable $X^{\prime}_{p}$, while the inner loop enumerates the subsets of $\bm{X}\backslash\{X^{\prime}_{p}\}$ to find the contexts. With each subset $\bm{X}_{c}$ of $\bm{X}\backslash\{X^{\prime}_{p}\}$, the subset of data is extracted from the original data with $\bm{X}_{c}=\bm{x}_{c}$, where $\bm{x}_{c}$ is the value of $\bm{X}_{c}$ as indicated in the decision rule $\bm{R}$.
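The two nested loops can be sketched as a generator of candidate (treatment, context) pairs. The function name and the dictionary representation of a rule antecedent are hypothetical, and emitting contexts from smallest to largest is an assumption of this sketch, chosen so that general rules are examined before their specialisations.

```python
from itertools import combinations


def candidate_rules(antecedent):
    """Yield (treatment, context) pairs from one decision rule.

    antecedent: dict of predictor values along a tree path,
    e.g. {"X1": 1, "X2": 1}. The empty context is omitted because it
    corresponds to the global causal test.
    """
    variables = list(antecedent)
    for treatment in variables:                    # outer loop: treatment
        others = [v for v in variables if v != treatment]
        for size in range(1, len(others) + 1):     # inner loop: contexts
            for subset in combinations(others, size):
                yield treatment, {v: antecedent[v] for v in subset}
```

For the rule $(X_{1}=1,X_{2}=1)\rightarrow(Y=1)$ this yields the two candidates discussed in the introduction: $X_{1}$ under the context $X_{2}=1$, and $X_{2}$ under the context $X_{1}=1$.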

A bottleneck for context specific causal rule discovery is the enumeration of different contexts from the variable set in the antecedent of a decision rule. Thus a pruning strategy is developed to address the efficiency problem. The function redundantTest() (line 13) is invoked to test whether the rule $X^{\prime}_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c}$ is redundant. Only if the rule is not redundant is the causal test (line 14) performed to estimate the causal effect of $X^{\prime}_{p}$ on $Y$ under the context $\bm{X}_{c}=\bm{x}_{c}$.

As we know, if a causal relationship holds in a population, then it should hold in each of the subpopulations. In other words, if $X_{p}\rightarrow Y|\bm{X}_{c}=\bm{x}_{c}$ is a context specific causal rule, $X_{p}\rightarrow Y|\{\bm{X}_{c}=\bm{x}_{c},\bm{X_{a}}=\bm{x_{a}}\}$ is also a context specific causal rule, where $\bm{X_{a}}$ is an additional condition defining a specific subpopulation. The more specific rules (i.e. with more conditions than the general one) are implied by the general causal rule. For example, if Children’s Panadol is effective for relieving children under 12 from fever and pain (i.e. $\textit{Panadol}\rightarrow\textit{recovery}|\textit{age}<12$), then we can conclude that it is also effective for boys under 12 (i.e. $\textit{Panadol}\rightarrow\textit{recovery}|\{\textit{age}<12,\textit{sex}=\textit{male}\}$). We call such more specific rules redundant rules.

We are not interested in redundant rules, since the causal relationships they express (if any) are already implied by their more general context specific causal rules. Thus we exclude redundant rules in the algorithm to reduce the search space. Once we find a context specific causal rule (including a global causal rule, where the context variable set $\bm{X}_{c}=\emptyset$), we stop searching for its more specific rules.
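The redundancy check can be sketched as a subset test on contexts. The paper does not give the body of redundantTest(), so the function below, its name and its data representation are assumptions that follow the definition of redundant rules given above.

```python
def is_redundant(treatment, context, found_rules):
    """Return True if a more general rule with the same treatment is known.

    context: dict of context variable -> value for the candidate rule.
    found_rules: list of (treatment, context_dict) pairs already accepted;
    a global rule is stored with an empty context dict.
    """
    for known_treatment, known_context in found_rules:
        if known_treatment != treatment:
            continue
        # The known rule is more general if all of its conditions also
        # appear, with the same values, in the candidate context.
        if all(context.get(var) == val for var, val in known_context.items()):
            return True
    return False
```

With the Panadol example, once the rule with context age < 12 has been accepted, the candidate that adds sex = male is reported as redundant and its causal test is skipped.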

4.2 TCC with multiple decision trees

The performance of TCC could be sensitive to the results of decision tree construction. A decision tree normally makes use of only a small subset of variables in its decision rules, so a key limitation of using a single decision tree for our purpose is that it may not cover all possible causal factors and context variables, and thus we may miss some potential causal relationships. In this section, we present a variant of TCC with an ensemble classifier, Diversified Multiple Trees (DMT) [14], to address this false negative issue.

DMT uses C4.5 [25] to sequentially build $m$ decision trees, where attributes used in a tree are not used in the construction of the next tree. Thus the output decision trees are disjoint in the attributes they use. Then the decision rules extracted from the output trees are used as the search space of the TCC algorithm.
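The attribute exclusion loop described here can be sketched as follows. This does not reproduce DMT [14] itself: the tree learner is passed in as a hypothetical callable standing in for C4.5, and the sketch assumes that attributes used by a tree are withheld from all later trees, which is what the disjointness of the output trees suggests.

```python
def build_diversified_trees(attributes, learn_tree, m):
    """Sequentially learn up to m trees on non-overlapping attribute sets.

    learn_tree: hypothetical callable taking the list of allowed attributes
    and returning (tree, used_attributes); it stands in for a C4.5 learner.
    """
    remaining = list(attributes)
    trees = []
    for _ in range(m):
        if not remaining:
            break
        tree, used = learn_tree(remaining)
        if not used:  # the learner found no useful split
            break
        trees.append(tree)
        # Attributes used by this tree are withheld from all later trees.
        remaining = [a for a in remaining if a not in used]
    return trees
```

The decision rules of every returned tree would then be passed to the rule loop of the TCC algorithm in the same way as the rules of a single tree.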

To avoid confusion, we call the TCC algorithm with DMT “TCC$_{m}$”, where $m$ is the number of decision trees built, and “TCC” without a subscript denotes the TCC algorithm with a single tree as introduced in Section 4.1.

As DMT is capable of detecting more attributes that are highly correlated with the target, TCC$_{m}$ potentially has fewer false negatives and is expected to achieve higher accuracy.

4.3 Complexity analysis

The time complexity of the proposed method TCC comes from three main parts: tree construction, general causal rule extraction, and context specific causal rule extraction. Here we focus on analysing the performance of context specific causal extraction, since the complexity of two other parts is significantly lower than the complexity of this part. We denote the height of a tree as $h$ , the number of variables as $m$ , and the number of samples as $n$ .

The number of paths of the tree is $2^{h^{\prime}-1}$, where $h^{\prime}<h$ considering that not all paths have the same length $h$. For each path, we enumerate the contexts of all variables along the path, giving $2^{h^{\prime\prime}-1}$ possible contexts, where $h^{\prime\prime}\ll h$ because of the effect of pruning. The total number of context specific tests is in the order of $O(2^{\beta h})$, where $0.5<\beta<2$. In each test, finding covariates costs $O(m^{2})$. For computing the propensity score, the complexity ranges from $O(n\log(n))$ (regression tree) to $O(n^{\alpha})$ (logistic regression), where $2\leqslant\alpha\leqslant 3$ [13]. The overall complexity is therefore between $O(2^{\beta h}(n\log(n)+m^{2}))$ and $O(2^{\beta h}(n^{\alpha}+m^{2}))$. Considering that $h$ is normally a small integer and $m\ll n$, the complexity is approximately in the order of $O(l\cdot n\log(n))$, where $l$ is in the range of hundreds to thousands.

5. Experiments

In this section, we firstly introduce the process of synthetic data generation. Then we present the experiments on TCC and TCC$_{m}$ with the synthetic data sets, and compare the performance of TCC and TCC$_{m}$ with the Causal Tree (CT) method [2]. CT is designed to examine the heterogeneity of causal effects across subsets of the population, while assuming a known cause. Specifically, it applies a regression tree with a modified MSE (Mean Squared Error) criterion to partition the population into multiple subgroups.

Then we apply TCC and TCC$_{m}$ to a clinical data set, the METABRIC data set, to capture meaningful context specific causal rules.

5.1 Synthetic data

In order to evaluate the proposed method, we generate several synthetic data sets containing context specific causal relationships.

Each of the synthetic data sets is generated in four main steps: (i) randomly create two Causal Bayesian Networks (CBNs) with the same number of variables by using the TETRAD software tool (http://www.phil.cmu.edu/tetrad/), where the Directed Acyclic Graphs (DAGs) and Conditional Probability Tables (CPTs) are both created randomly by the software; (ii) generate two data sets from the two causal Bayesian networks based on their respective conditional probability tables via the built-in Bayes Instantiated Model; (iii) add one more column to each of the two data sets as the context variable $X_{c}$, such that $X_{c}\equiv 0$ in the first data set and $X_{c}\equiv 1$ in the other one; and (iv) concatenate these two new data sets, aligned by columns, to obtain the final data set.
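Steps (iii) and (iv) can be sketched as below, assuming that the concatenation stacks the samples of the two data sets on top of each other (both share the same variables, so only the context column distinguishes them). The function name, the column name "Xc" and the list-of-dictionaries layout are hypothetical; sampling from the networks in steps (i) and (ii) is done in TETRAD and is not reproduced here.

```python
def add_context_and_stack(data0, data1, context_name="Xc"):
    """Tag each data set with a constant context value and stack the records.

    data0, data1: lists of dicts (one dict per sample) generated from the
    two networks; they are assumed to share the same variable names.
    The input records are copied, not modified.
    """
    tagged0 = [{**row, context_name: 0} for row in data0]
    tagged1 = [{**row, context_name: 1} for row in data1]
    return tagged0 + tagged1
```

In the resulting data set, the causal structure of the first network holds only where the context variable equals 0 and that of the second only where it equals 1, which is what makes the planted relationships context specific.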

We use the above procedure to generate five synthetic data sets with 10, 20, 30, 40 and 50 variables (Syn-10, Syn-20, Syn-30, Syn-40 and Syn-50, respectively). Each data set has 10K samples. We use precision ( $P$ ), recall ( $R$ ) and the $F_{1}$ -measure ( $F_{1}$ ) as the metrics to evaluate the accuracy of TCC, TCC ${}_{m}$ and CT. Unlike TCC, CT focuses on a fixed treatment variable and estimates the differences in the causal effects of that treatment across subpopulations, where the treatment variable is a hypothesised cause of the target variable. To compare CT with our proposed method, we conduct multiple independent runs of the CT algorithm, each with a different predictor variable set as the treatment. Here we only discuss the results within the contexts $X_{c}=0$ and $X_{c}=1$, since the ground truth is known only for the causes in the contexts defined by $X_{c}$.
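The three metrics can be computed from the set of discovered rules and the set of ground-truth rules as in the sketch below. The rule encoding (a pair of a cause name and a context value) and the example rules are hypothetical and serve only to illustrate the calculation.

```python
def precision_recall_f1(discovered, truth):
    """Return (P, R, F1) for a set of discovered rules against the ground truth."""
    discovered, truth = set(discovered), set(truth)
    true_positives = len(discovered & truth)
    p = true_positives / len(discovered) if discovered else 0.0
    r = true_positives / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Hypothetical rules encoded as (cause, value of X_c).
truth = {("X1", 0), ("X2", 0), ("X3", 1), ("X4", 1)}
found = {("X1", 0), ("X2", 0), ("X3", 1), ("X9", 1)}

p, r, f1 = precision_recall_f1(found, truth)
```

Here three of the four discovered rules are correct and three of the four true rules are recovered, so all three metrics equal 0.75.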

Table 2
The accuracy of TCC, TCC ${}_{3}$ , TCC ${}_{5}$ and CT on synthetic data sets

	TCC			TCC ${}_{3}$			TCC ${}_{5}$			CT
	$P$	$R$	$F_{1}$	$P$	$R$	$F_{1}$	$P$	$R$	$F_{1}$	$P$	$R$	$F_{1}$
Syn-10	0.83	0.83	0.83	0.83	0.83	0.83	0.83	0.83	0.83	0.75	1.00	0.86
Syn-20	1.00	0.83	0.91	1.00	0.83	0.91	1.00	0.83	0.91	0.50	0.75	0.60
Syn-30	1.00	0.83	0.91	1.00	0.83	0.91	1.00	0.83	0.91	0.38	1.00	0.55
Syn-40	1.00	1.00	1.00	1.00	1.00	1.00	0.86	1.00	0.92	0.29	0.67	0.40
Syn-50	1.00	0.83	0.91	0.83	0.83	0.83	0.83	0.83	0.83	–	–	–

The results of TCC, TCC ${}_{3}$, TCC ${}_{5}$ and CT are shown in Table 2. CT achieves high performance on the small data set, Syn-10, but its performance drops sharply as the number of variables increases. CT becomes infeasible on the larger data set, Syn-50, as its efficiency is sensitive to both the dimension and the size of a data set. In contrast, TCC, TCC ${}_{3}$ and TCC ${}_{5}$ consistently achieve high performance, with an average $F_{1}$ score larger than 0.80. These three methods also obtain very similar results. This is because, on these five data sets, a single decision tree already includes almost all potential causes and the corresponding contexts, so TCC with multiple trees cannot bring further improvement.
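The averages referred to above can be checked directly from the $F_{1}$ columns of Table 2. The sketch below is only a convenience calculation; for CT the average is taken over the four data sets on which it returned a result, which is our choice and not stated in the text.

```python
# F1 scores copied from Table 2 (Syn-10 ... Syn-50; CT has no Syn-50 result).
f1_scores = {
    "TCC": [0.83, 0.91, 0.91, 1.00, 0.91],
    "TCC_3": [0.83, 0.91, 0.91, 1.00, 0.83],
    "TCC_5": [0.83, 0.91, 0.91, 0.92, 0.83],
    "CT": [0.86, 0.60, 0.55, 0.40],
}

# Arithmetic mean of the F1 scores for each method.
mean_f1 = {name: sum(vals) / len(vals) for name, vals in f1_scores.items()}

for name, value in mean_f1.items():
    print(f"{name}: {value:.3f}")
```

The three TCC variants average roughly 0.91, 0.90 and 0.88, all above 0.80, while CT averages about 0.60 on the data sets it could process.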

Table 3

Performance of TCC and TCC ${}_{m}$ on Syn-100

	TCC	TCC ${}_{3}$	TCC ${}_{5}$	TCC ${}_{7}$
$P$	0.80	0.82	0.90	0.90
$R$	0.80	0.90	0.90	0.90
$F_{1}$	0.80	0.86	0.90	0.90

To further examine the performance of TCC and TCC ${}_{m}$, we run TCC and TCC ${}_{m}$ ( $m=3$, 5 and 7) on a larger data set, Syn-100. Table 3 shows the results. The accuracy improves when DMT is employed, and more decision trees (larger values of $m$ ) bring a bigger improvement. This is because multiple trees may include more potential causes and context variables than a single decision tree, which in turn improves the chance of detecting context specific causes. Once all possible causal factors and context variables are included in the output trees, the accuracy stops improving, which explains why TCC ${}_{7}$ and TCC ${}_{5}$ obtain the same $F_{1}$ score.

5.2 Scalability

To evaluate the efficiency of TCC and TCC ${}_{m}$ , we run experiments on five synthetic data sets with 50, 100, 150, 200 and 250 variables, respectively. The experiments are performed on a computer with a 3.4 GHz Quad-core CPU and 16 GB of memory.

Figure 2.

Scalability evaluation with TCC and TCC ${}_{m}$ .

We run TCC, TCC ${}_{3}$ and TCC ${}_{5}$ with $|\bm{X}_{c}|=1$ and $|\bm{X}_{c}|=2$ respectively, where $|\bm{X}_{c}|$ denotes the cardinality of the context variable set, and compare the running time with that of CT. Note that only the results returned within 5 hours are shown in Fig. 2. From Fig. 2(a) we see that TCC is the most efficient algorithm, while the run time of TCC ${}_{m}$ (here $m=3$, 5) is roughly $m$ times that of TCC. CT is much slower than our proposed methods and is not feasible with large data sets, so it is not suitable for exploring context specific causal relationships in large data sets. Fig. 2(b) shows the efficiency of TCC, TCC ${}_{3}$ and TCC ${}_{5}$ when the size of the context variable set is 2. All three methods remain efficient and scalable. Specifically, in this setting, TCC and TCC ${}_{m}$ require slightly more than double the time used when $|\bm{X}_{c}|=1$.

5.3 METABRIC data set

We also apply the proposed method to real world data, the METABRIC data set [31]. The data set contains clinical traits and outcomes for 1981 primary breast cancer patients collected from participants of the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) trial. We run the proposed method on the data set with two different outcomes, 10 years Overall Survival (OS) and 10 years Disease Free Survival (DFS), respectively. The 10 years OS status indicates whether a patient died from breast cancer within 10 years or is alive 10 years after the initial consultation. The 10 years DFS status indicates whether or not a patient survives more than 10 years without any signs or symptoms of breast cancer after the primary treatment for the cancer ends. The METABRIC data set contains 570 and 762 patients who have positive and negative 10 years OS status, respectively, and 820 and 516 patients with positive and negative 10 years DFS status, respectively. Note that, to avoid introducing unexpected noise or errors, we directly removed the records with missing values from the METABRIC data set instead of imputing the missing data [3].
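Removing records with missing values (listwise deletion) can be sketched as below. The field names, the example records and the markers treated as missing are hypothetical and are not taken from the METABRIC files.

```python
def drop_incomplete(records, missing=(None, "", "NA")):
    """Keep only the records in which no field holds a missing-value marker."""
    return [
        rec for rec in records
        if all(value not in missing for value in rec.values())
    ]


# Hypothetical patient records; field names are illustrative only.
patients = [
    {"age": 52, "chemotherapy": "yes", "os_10y": "positive"},
    {"age": None, "chemotherapy": "no", "os_10y": "negative"},
    {"age": 67, "chemotherapy": "NA", "os_10y": "negative"},
    {"age": 45, "chemotherapy": "no", "os_10y": "positive"},
]

complete = drop_incomplete(patients)
```

Only the first and last records survive, since each of the other two has at least one missing field.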

First, the decision tree method is employed to detect association based relationships with respect to breast cancer. Each decision rule is then used as the search base for both potential causes and contexts. In the experiments with this real world data set, available domain knowledge is used to further reduce the search space and improve computational efficiency: a potential cause can only be one of three treatment therapies, namely chemotherapy, hormone therapy and radiotherapy. Note that these three therapies can also serve as contexts that define specific subpopulations of patients who received multiple treatments.

Table 4
Top context specific causal rules discovered by TCC from METABRIC with 10 years OS as the outcome

Treatment	Context	10 years OS	Reference
Chemotherapy	Breast surgery $=$ breast conserving	Yes	[30]
	Cellularity $=$ high & Nottingham prognostic index $\leqslant$ 5	No
	Cellularity $=$ low	No

We first run TCC on the METABRIC data set with 10 years OS as the outcome, and extract the top context specific causal rules based on causal effects, shown in Table 4. The causal rules discovered by TCC are supported (or partially supported) by domain knowledge and the literature. The first rule in the table indicates that chemotherapy is very effective in the subpopulation of patients who received a lumpectomy, which removes a part of the breast tissue instead of the entire breast [30]. Some interesting context specific causes are also detected. The other two significant causal rules show that chemotherapy is not an effective treatment for the subpopulation with low tumor cellularity or for the subpopulation with high tumor cellularity and a low Nottingham prognostic index.

Table 5

Top context specific causal rules discovered by TCC from METABRIC with 10 years DFS as the outcome

Treatment	Context	10 years DFS	Reference
Chemotherapy	IntClust $>$ 2 & Claudin subtype $=$ claudin-low	Yes	[1, 24]
	Pam50 subtype $=$ normal	Yes	[23]
	Age $>$ 60	No
	IntClust $>$ 2 & Claudin subtype $=$ not classified	No	[1]
Hormone therapy	Chemotherapy $=$ no & Three gene $=$ ER+/HER2-high proliferation	Yes	[26]
	Inferred menopausal state $=$ post & Radiotherapy $=$ no	Yes
	Three gene $=$ ER-/HER2-	No	[26]
Radiotherapy	Hormone therapy $=$ yes	Yes
	Chemotherapy $=$ no & Age $\leqslant$ 60	Yes	[18]
	Chemotherapy $=$ no & Pam50 subtype $=$ HER2	No	[18]
	Chemotherapy $=$ no & Pam50 subtype $=$ not classified	No

We then set 10 years DFS as the outcome and run the proposed method on the data set. Some interesting and reasonable results are found (Table 5). For example, the third rule in the table shows that chemotherapy has a negative impact on older patients (Age $>$ 60). TCC also confirms the effectiveness of chemotherapy when IntClust $>$ 2 and Claudin subtype $=$ claudin-low [24], where the IntClust approach [7] classifies breast cancer into ten subtypes based on gene expression, and survival is better when IntClust is larger than 2 [1]. The results also show that hormone therapy is not effective for patients with negative Estrogen Receptor status (ER-) [26]. Radiotherapy does not bring a survival benefit to patients with the HER2 Pam50 subtype [18].

Table 6

Top context specific causal rules discovered by TCC ${}_{3}$ from METABRIC with 10 years OS as the outcome

Treatment	Context	10 years OS	Reference
Chemotherapy	Claudin subtype $=$ claudin-low	No	[24]
Hormone therapy	Tumor size $\leqslant$ 57 & Chemotherapy $=$ no	Yes
	HER2 SNP6 $=$ loss	No
	Claudin subtype $=$ Luminal-A	No

We also apply TCC with three trees (TCC ${}_{3}$ ) to the data set with the two different outcome variables, 10 years OS and 10 years DFS, respectively. For the data set with 10 years DFS, TCC and TCC ${}_{3}$ capture similar causal rules with strong causal effects, so here we only show the results of TCC ${}_{3}$ on the data with 10 years OS as the outcome. Compared with the TCC results, more causal rules are discovered, and some examples are shown in Table 6. Patients with claudin-low tumors have poor overall survival outcomes, even if they received chemotherapy [24]. Hormone therapy is beneficial to patients with a small tumor size.

6. Conclusion

In this paper, a novel method, Tree based Context Causal rule discovery (TCC), has been proposed to explore context specific causal relationships from observational data. Finding causes and contexts simultaneously is important but challenging, and a decision tree is utilised to make this complex problem manageable. We have designed TCC based on a well-known causal framework, the potential outcome model, to assess context specific causal relationships. A variant, TCC ${}_{m}$ (i.e. TCC with multiple decision trees), is also introduced to help improve the performance of TCC.

The experimental results show that TCC achieves high performance on the synthetic data sets and finds insights in real world data sets. TCC also outperforms an existing causal tree method in discovering short and meaningful context specific causal relationships, and it is easier to operate because no candidate cause needs to be specified. TCC provides a scalable and automated way to address the increasing need for uncovering context specific causal relationships for personalised decision making.

Acknowledgments

This work has been supported by the Australian Research Council (ARC) Discovery Project Grant DP170101306 and the National Health and Medical Research Council (NHMRC) Grant 1123042.

References

1. Ali H.R., Rueda O.M., Chin S.-F., Curtis, Dunning M.J., Aparicio S.A. and Caldas, Genome-driven integrated classification of breast cancer validated in over 7,500 samples, Genome Biology 15(8) (2014).
2. Athey and Imbens, Recursive partitioning for heterogeneous causal effects, Proceedings of the National Academy of Sciences 113(27) (2016), 7353–7360.
3. Azur M.J., Elizabeth A.S., Constantine and Philip J.L., Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research 20(1) (2011), 40–49.
4. Bielby W.T. and Hauser A.R.M., Structural equation models, Annual Review of Sociology 3(1) (1977), 137–161.
5. Boutilier, Friedman, Goldszmidt and Koller, Context-specific independence in Bayesian networks, in Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, 1996, pp. 115–123.
6. Corander, Labelled graphical models, Scandinavian Journal of Statistics 30(3) (2003), 493–508.
7. Dawson S.-J., Rueda O.M., Aparicio and Caldas, A new genome-driven integrated classification of breast cancer and its implications, The EMBO Journal 32(5) (2013), 617–628.
8. Dudík, Langford and Li, Doubly robust policy evaluation and learning, arXiv preprint arXiv:1103.4601, 2011.
9. Gu and Paul R.R., Comparison of multivariate matching methods: Structures, distances, and algorithms, Journal of Computational and Graphical Statistics 2(4) (1993), 405–420.
10. Hunink M.M., Weinstein M.C., Wittenberg, Drummond M.F., Pliskin J.S., Wong J.B. and Glasziou P.P., Decision making in health and medicine: integrating evidence and values, Cambridge University Press, 2014.
11. Hyder A.A., Corluka, Winch P.J., El-Shinnawy, Ghassany, Malekafzali, Lim M.-K., Mfutso-Bengo, Segura and Ghaffar, National policy-makers speak out: are researchers giving them what they need? Health Policy and Planning, 2010, page czq020.
12. Koller and Friedman, Probabilistic graphical models: principles and techniques, MIT Press, 2009.
13. Komarek, Logistic regression for data mining and high-dimensional classification, Robotics Institute, 2004, p. 222.
14. Liu and Green, Building diversified multiple trees for classification in high dimensional noise data, arXiv preprint arXiv:1612.05888, 2016.
15. Liu and Liu, Causal decision trees, IEEE Transactions on Knowledge and Data Engineering (99) (2016), 1–14.
16. Liu, Chawla, Cieslak D.A. and Chawla N.V., A robust decision tree algorithm for imbalanced data sets, in Proceedings of the 2010 SIAM International Conference on Data Mining, SIAM, 2010, pp. 766–777.
17. Liu and Le, Mining combined causes in large data sets, Knowledge-Based Systems 92 (2016), 104–111.
18. Mao J.-H., Diest P.J.V., Perez-Losada and Snijders A.M., Revisiting the impact of age and molecular subtype on overall survival after radiotherapy in breast cancer patients, Scientific Reports 7(1) (2017), 12587.
19. Minka and Winn J.G., in Koller, Schuurmans, Bengio and Bottou, editors, Advances in Neural Information Processing Systems 21 (2009), pp. 1073–1080.
20. Morgan S.L. and Winship, Counterfactuals and causal inference, Cambridge University Press, 2014.
21. Nyman, Pensar, Koski and Corander, Stratified graphical models – context-specific independence in graphical models, Bayesian Analysis 9(4) (2014), 883–908.
22. Pearl, Causality: models, reasoning and inference, Cambridge University Press, Cambridge, 2000.
23. Prat, Fan, Fernández, Hoadley K.A., Martinello, Vidal, Viladot, Pineda, Arance, Muñoz, Laia, Cheang M.C.U., Adamo and Perou C.M., Response and survival of breast cancer intrinsic subtypes following multi-agent neoadjuvant chemotherapy, BMC Medicine 13 (2015).
24. Prat, Parker J.S., Karginova, Fan, Livasy, Herschkowitz J.I. and Perou C.M., Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer, Breast Cancer Research: BCR 12(5) (2010), R68.
25. Quinlan, C4.5: programs for machine learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
26. Rastelli and Crispino, Factors predictive of response to hormone therapy in breast cancer, Tumori Journal 94(3) (2008), 370–383.
27. Rosenbaum P.R. and Rubin D.B., The central role of the propensity score in observational studies for causal effects, Biometrika 70(1) (1983), 41–55.
28. Rubin D.B., Estimating causal effects of treatments in randomized and non-randomized studies, Journal of Educational Psychology 66(5) (1974), 688.
29. Rubin D.B., Bias reduction using Mahalanobis-metric matching, Biometrics, 1980, pp. 293–298.
30. Sanford R.A., Lei, Barcenas C.H., Mittendorf E.A., Caudle A.S., Valero, Tripathy, Giordano S.H. and Chavez-MacGregor, Impact of time from completion of neoadjuvant chemotherapy to surgery on survival outcomes in breast cancer patients, Annals of Surgical Oncology 23(5) (2016), 1515–1521.
31. Soulakis N.D., Carson M.B., Lee Y.J., Schneider D.H., Skeehan C.T. and Scholtens D.M., Visualizing collaborative electronic health record usage for hospitalized patients with heart failure, Journal of the American Medical Informatics Association 22(2) (2015), 299–311.
32. Stuart E.A., Matching methods for causal inference: a review and a look forward, Statistical Science: A Review Journal of the Institute of Mathematical Statistics 25(1) (2010), 1.
33. Tsai C.-L., Wang, Nickerson D.M. and Li, Subgroup analysis via recursive partitioning, The Journal of Machine Learning Research 10 (2009), 141–158.
34. Weisberg H.I. and Pontes V.P., Post hoc subgroups in clinical trials: Anathema or analytics? Clinical Trials (London, England) 12(4) (2015), 357–364.
35. Zsambok C.E. and Klein, Naturalistic decision making, Psychology Press, 2014.
