Abstract
This paper concerns a study indicating that the expression levels of genes in signaling pathways can be modeled using a causal Bayesian network (BN) that is altered in tumorous tissue. These results open up promising areas of future research that can help identify driver genes and therapeutic targets. So, it is most appropriate for the cancer informatics community.
Our central hypothesis is that the expression levels of genes that code for proteins on a signal transduction network (STP) are causally related and that this causal structure is altered when the STP is involved in cancer. To test this hypothesis, we analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways, using a breast cancer gene expression level dataset containing 529 cases and 61 controls. We identified all the genes related to each of the 22 pathways and developed separate gene expression datasets for each pathway. We obtained significant results indicating that the causal structure of the expression levels of genes coding for proteins on STPs, which are believed to be implicated in both breast cancer and in all cancers, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.
Keywords
Introduction
There is evidence that similar cancers have many variations at the molecular level, and each has its own clinical course. This is called the heterogeneity of tumors. For example, in the case of human epidermal growth factor receptor 2 (HER2)-amplified breast cancer, the survival of many patients is vastly improved with the drug Herceptin, 1 but less than 50% respond 2 and the drug can be toxic. 3 This phenomenon is not peculiar to HER2-amplified breast cancer. As another example, breast cancer patients with positive estrogen receptor (ER) expression and negative lymph node metastasis (ER+/node-) have better clinical outcomes than other subtypes of patients; however, a sub-population has recurrence. So, we cannot provide optimal treatment for many patients because responses to treatments are often different for patients who have similar clinical features.
A signal transduction pathway (STP) is a network of information flow in the cells that initiates with a signal outside the cell and results in a cellular response. Many aberrant STPs have been associated with various cancers.4–10 For example, we now know that the ERbB, PI3K–Akt, and Wnt pathways are associated with breast cancer. The signal aberrations associated with a disease often result from one or more mutated genes that code proteins on the pathways. There has been an explosion of new genomic and proteomic datasets providing us with unprecedented and rich resources to reveal the mechanisms of STPs. We have datasets concerning single nucleotide polymorphisms (SNPs), somatic mutations, copy number, methylation levels, and expression levels in both cancerous and non-cancerous tissues.11–13 We have flow cytometry datasets providing us with simultaneous observations of many signaling molecules in a multitude of individual cells.14,15
To develop optimal treatments for cancer patients, it is necessary to address two fundamental issues regarding STPs: (1) the discovery of which STPs are implicated in a cancer or cancer subtype and (2) the prediction of how stimulations and inhibitions will affect the overall activity of the STP.
Using gene expression datasets, a good deal of effort has been devoted to the first issue just mentioned. Initially, techniques such as over-representation analysis16–18 were employed. Such methods ignore the topology of the network, and hence do not account for key biological information. That is, if a pathway is activated through a single receptor and that protein is not produced, the pathway will be severely impacted. However, a protein that appears downstream may have a limited effect on the pathway. Recently, researchers have developed methods that account for the topology of an STP when analyzing gene expression data to determine whether the STP is implicated in a cancer.19–21 Signaling pathway impact analysis (SPIA) 19 is a software package (http://bioinformaticsprb.med.wayne.edu/SPIA) for identifying whether a signaling network is relevant in a given condition that accounts for the topology of the network. However, it is not model based, and does not provide a predictive causal model of an STP. PARADIGM 20 creates a model of a single patient rather than the population, and is able to incorporate copy number variations (CNV) and even mutations. Not being population based, it does not provide an overall causal model of the altered STPs in a given cancer.
To address the second issue (the prediction of how stimulations and inhibitions will affect the overall activity of the STP), we need a causal model of the variables related to an STP. A number of studies22–24 have shown that STPs can be modeled as causal Bayesian networks (BNs) if each node in the network represents the phosphorylation activity of a protein. A strength of BNs is that they represent probabilistic relationships, and therefore they can manage the noise in biological data. A second strength is that they can model the natural causal relationships in biology.
On the one hand, protein phosphorylation assays are slow, relatively expensive, and can be performed for only a tiny but important fraction of the genome. On the other hand, the gene expression level data are widely available because they are inexpensive and genome wide. As noted previously, methods have already been developed that account for the topology of an STP when analyzing gene expression data to determine whether the STP is implicated in a cancer.19–21 However, the correlation of gene expression with activity is not well established. Studies show that the protein expression level (abundance) is often not positively correlated with activity 25 and that the gene expression level is often not correlated with protein abundance. 26 Hence, the gene expression levels might be at most loosely correlated with the activity, which means that the causal structure of an STP might not be represented by the relationships among the gene expression levels. More fundamentally, it is an open question as to whether there are causal relationships among the expression levels of genes coding for proteins on an STP.
We investigated this question. Specifically, the central hypothesis to be investigated in this paper is that the expression levels of genes that code for proteins on a given STP are causally related, and that this causal structure is altered when the STP is involved in a particular cancer. If this hypothesis is correct, using the ample gene expression datasets and BN learning algorithms, we can learn the causal network structure of the gene expression levels in an STP that is altered in a given cancer, and then identify driver genes based on the topology of the network.
The Cancer Genome Atlas (TCGA) makes available a breast cancer dataset that contains data on SNPs and the expression levels of 17,814 genes. There are 529 cases and 61 controls for which this information is available. Using these datasets and BN technology, we investigate the causal structure of genes that code for proteins on 5 STPs believed to be associated with breast cancer, 7 STPs believed to be associated with other cancers, and 10 randomly chosen pathways. We obtain significant results indicating that the causal structure of the STPs, which are believed to be implicated in both breast cancer and all cancer, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.
Method
As our method applies BNs to modeling STPs, we first review both of these.
BNs.
BNs27–29 are increasingly being used for uncertain reasoning and machine learning. A BN consists of a directed acyclic graph (DAG) G = (V, E) whose nodeset V contains random variables and whose edges E represent relationships among the random variables, the prior probability distribution of every root variable in the DAG and the conditional probability distribution of every non-root variable given each set of values of its parents. Often the DAG is a causal DAG, which is a DAG containing the edge X ↠ Y only if X is a direct cause of Y. 29 The probability distribution of the variables in a BN must satisfy the Markov condition, which states that each variable in the network is probabilistically independent of its non-descendents conditional on its parents.
Figure 1 shows a BN representing the causal relationships among variables related to lung disorders. In this BN, h1, for example, denotes an individual has a smoking history and h2 denotes the individual does not. Using this BN, we can determine conditional probabilities of interest with a BN inference algorithm. 29

A BN representing a subset of the variables related to lung disorders. There is an edge from node A to node B if A has a direct causal influence on B.
A BN DAG model consists of a DAG G = (V, E), where V is a set of random variables, and a parameter set θ whose members determine conditional probability distributions for G, but without numerical assignments to the parameters. The task of learning a BN DAG model from the data is called model selection.
In the constraint-based approach,
30
we learn a DAG model from the conditional independencies that the data suggest are present in the generative probability distribution P. In the score-based approach, we assign a score to a DAG based on how well the DAG fits the data. The Bayesian score is the probability of the Data given the DAG model.
31
A popular variant of this score is the Bayesian Dirichlet equivalent uniform (BDeu) score.
32
If the set of variables in model G is {X1, X2, …, Xn}, this score is as follows:
Many biological processes have been modeled using BNs including molecular phylogenetics, 33 gene regulatory networks,34–36 genetic linkage, 37 genetic epistasis,38–42 and STPs.22–24 Figure 2 shows a BN representing a small gene regulatory network.

A BN for a small gene regulatory network (based on a figure in Ref 33). Only the conditional probability distribution for node S is shown. Each variable is continuously distributed, and defined to be “high” if its value is higher than 1 and “low” if its value is less than 1. The notation ρ(s|C = low, D = low) = NormaIDen(s;0.6,0.1) means S is normally distributed with mean 0.6 and standard deviation 0.1 if C and D are both “low.”
STPs Modeled as BNs.
An STP is a network of intercellular information flow initiated when extracellular signaling molecules bind to cell-surface receptors. The signaling molecules become modified, causing a change in their functional capability and affecting a change in the subsequent molecules in the network. This cascading process culminates in a cellular response. Consensus STPs have been developed based on the composite of studies concerning individual STP components. Figure 3 shows part of the consensus STP of human primary naive CD4 T cells, downstream from CD3, CD28, and LFA-1 activation. Kyoto Encyclopedia of Genes and Genomes (KEGG) 43 has a collection of manually drawn pathways representing our knowledge of about 136 pathways. STPs are not thought to be stand-alone networks, but rather they have inter-pathway communication. 44

A portion of the consensus STP of human primary naive CD4 T cells, downstream from CD3, CD28, and LFA-1 activation. Arcs are used to illustrate connections between signaling molecules. In some cases, the connections may be indirect and may involve specific phosphorylation sites of the signaling molecules. MAPKKK appears twice because MEK4/7 and MEK3/6 each have a MAPKKK that is its activator. This figure is based on a figure in Ref. 14; see that paper for more details.
If we represent the phosphorylation level of each protein in an STP by a random variable and draw an arc from X to Y if there is an edge from protein X to protein Y in the STP, then we are modeling the STP as a BN. For this BN to represent the joint probability distribution of the random variables, the Markov condition must be satisfied. Woolf et al. 22 argue that the steady-state concentrations should satisfy this condition. For example, in Figure 3 the phosphorylation activity of MEK1/2 should be dependent on the phosphorylation activity of PKA because high PKA activity implies high RAF activity, which in turn implies high MEK1/2 activity. However, once we know the activity of RAF, the implication link is broken, which is what the Markov condition entails. Sachs et al. 14 performed a proof of principle study concerning this conjecture, and confirmed this. A number of other papers15,23,24 successfully modeled STPs as BNs using phosphorylation activity.
As discussed in the Introduction, gene expression level seems to be at most loosely correlated with activity. So, if there are causal relationships among the expression levels of genes coding for proteins on an STP, the BN representing these relationships may not represent the biological flow of an STP. This means it would be difficult to learn STPs from the gene expression levels. However, if our goal is to investigate how variables concerning known STPs are modified in tumors, not to learn the structure of unknown STPs, then the causal structure of the gene expression levels in tumors can provide us with important information. As also mentioned in the Introduction, it is an open question as to whether there are even causal relationships among the expression levels of genes coding for proteins on an STP. This paper investigates this question.
Identifying Aberrant STPs Using BNs and Gene Expression Level Data.
In what follows, for simplicity we will say that a gene coding for a protein on an STP is on the STP itself. We assume that we have two sets of data. The first set contains the gene expression levels of all (at least most) genes in a set of cases (tumors) and the second set contains the gene expression levels of all genes in a set of controls. Let STPX be an STP we are investigating, Data1 be the data concerning the cases for genes on STPX, and Data2 be the data concerning controls for genes on STPX.
There are two models. Model MA represents that the same causal structure (BN) is generating both Data1 and Data2. In this case, the two datasets can be considered as coming from the same population and are therefore combined. Model MB represents that two different causal structures (BNs) are generating the data. We compute the log Bayes factor of model MB relative to model MA as follows. We first compute
An alternative method would be to approximately learn the most likely DAG model G1 based on Data1, the most likely DAG model G2 based on Data2, and the most likely DAG model G3 based on Data1 and Data2. Then take the maximum likelihood estimates
The larger the value of K or L, the more the data indicate that the causal structure of STPX is altered in the tumorous tissue. The advantage of using the Bayes factor is that it automatically includes a penalty for model complexity. However, it is costly to compute. In our investigations, we approximate the Bayes factor by approximately learning the most likely model and then using the Bayesian information criteria (BIC) to approximate the probability of the data given the model. In the limit, the BIC and the BDeu score (Equation 1) choose the same model. 29
Evaluation Methodology
It is difficult to assess a pathway analysis model or methodology using real data because the ground truth is not known. In the absence of a gold standard, we can perform our analysis based on the existing biological knowledge. Hence, to investigate whether the causal structure of the expression levels of genes on an STP is altered when the STP is involved in cancer, we compared results obtained using the breast cancer data for 5 STPs implicated in breast cancer, 7 STPs implicated in other cancers, and 10 random pathways. We investigated STPs implicated in other cancers because it is believed that there are commonalities across tumor lineages. 45 The pathways investigated are listed in Table 1. The first column lists the five STPs believed to be implicated in breast cancer. The PI3K pathway is one of the most important pathways in cancer metabolism in general, and has recognized as an important target in breast cancer management for years. 46 Hyperactive Wnt signaling has been shown to contribute to cancer in a wide range of human tissue, and Wnt genes have been identified as oncogenes in mouse mammary tumorigenesis. 47 Over-expression of the ErbB2 gene occurs in approximately 20% of breast cancers. 48 The Hedgehog 49 and Notch 50 pathways have also been associated with breast cancer, but perhaps less strongly. The second column of Table 1 lists the seven STPs implicated in other cancers, and the last column shows the randomly chosen pathways. We did not investigate the glioma, pancreatic, and colorectal cancer pathways because of their overlap with the PI3K and Wnt pathways. The 10 non-cancerous pathways were obtained by first eliminating the 12 cancer pathways investigated and then randomly choosing 10 pathways from all the pathways in the KEGG database.
Pathways investigated.
The cancer genome atlas (TCGA) makes available a breast cancer dataset that contains data on SNPs and the expression levels of 17,814 genes. There are 529 cases and 61 controls for which this information is available. Using the KEGG database, we identified all the genes related to each of the 22 pathways. We extracted gene expression profiles for the 529 breast cancer patients and 61 controls in the TCGA database. By mapping the gene names of the genes in the gene sets identified using the KEGG pathways and the gene names in the TCGA data, we were able to extract the gene expression profiles for each of the 22 pathways for the 529 patients and 61 controls. All the expression levels were discretized to values low, medium, and high using the equal width discretization technique, which discretizes the data into partitions of K equally sized intervals (K = 3 in our application). Using the resultant datasets, we computed the approximate Bayes factor for each of the 22 pathways. We used the BN learning package HUGIN 51 to approximately learn the most probable DAG models, and to calculate the BICs.
All experiments were run using a Dell PowerEdge R515, which has two AMD Opteron™ 4276HE, 2.6 GHz, 8C, Turbo CORE, 8M L2/8M L3, 1600 MHz Max Mem single processors.
Results
Table 2 lists the pathways, along with their Bayes factors, in a decreasing order. It is notable that PI3K, which is “probably one of the most important pathways in cancer metabolism and growth,” 52 scored much higher than all other pathways. The Wnt and ErbB pathways are also near the top of the list. However, the Notch and Hedgehog pathways are not. In general, however, the cancer-related pathways are concentrated at the top of the list. Figure 4 shows the average Bayes factor and standard error for each of the three categories.
Bayes factors for 22 pathways. There is an “X” if the pathway is implicated in breast cancer or any cancer.

Average Bayes factors and standard error for breast cancer pathways, all cancer pathways, and other pathways, when causation is modeled.
Table 3 shows the P-values, obtained using the non-parametric Mann-Whitney test, comparing the 5 breast cancer pathways to the 10 randomly chosen pathways (listed as “other pathways”) and comparing all 12 cancer pathways to the 10 randomly chosen pathways. In both cases, the results were significant with the respective P-values of 0.049 and 0.040. These significant values were obtained even though the Hedgehog and Notch pathways, which scored in the bottom half of the list, were included in the cancer sets. However, as mentioned previously, we do not have absolute ground-truth STPs. Perhaps, these pathways are not substantially implicated in breast cancer, which is what our results suggest.
P-values obtained with causal modeling and without causal modeling.
The possibility exists that these significant results were obtained simply because the genes are over or under expressed in cancer-related STPs and the causal structure is not relevant. To test this possibility, we redid the study with all the BNs constrained to having no causal edges. Table 3 shows the resultant P-values which are significant. Figure 5 shows the average Bayes factor and standard error for each of the three categories when there are no causal edges. Note that the ranges overlap more than those in Figure 4, which is obtained when causation is modeled. These results support that there is an underlying causal structure among the expression levels of genes on an STP and that this causal structure is altered when an STP is involved in cancer.

Average Bayes factors and standard error for breast cancer pathways, all cancer pathways, and other pathways, when causation is not modeled.
All networks learned are fairly complex. As an example, Figure 6 shows the network learned from cases for the ErbB pathway.

The causal BN learned from breast cancer cases for the ErbB pathway.
Discussion
We analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways. Based on modeling the relationships among the expression levels of genes on the pathways as causal BNs, we obtained results indicating that the causal structure of the cancer-related STPs is significantly more altered in breast cancer tissue than the randomly chosen pathways. These results support that the expression levels of genes on STPs are causally related and that this causal structure is altered in the tumorous tissue when an STP is involved in cancer.
These results are significant for a number of reasons. First, we can use the methodology to develop a method for investigating whether an STP is involved in cancer, which can be compared to the existing methods.30,31,53 Second, these results open up a promising area of future research involving the use of BN technology to model the causal relationships among the expression levels of genes on an STP. Using such a network, we can learn possible driver genes, and the effect of genetic variants on these driver genes and therefore on the network. Such investigations would enable us to better identify therapeutic targets in a patient-specific fashion.
In future research, we can implement the Bayes factor calculation (Equation 1), and see if it yields better results than the approximation used in the given studies. Furthermore, we can develop and implement a method that better learns the causal edges among the genes in the STP. Rather than just learning a single highly likely model using a package like HUGIN, we can do approximate model averaging to learn the strength of the edges. Finally, we can develop and test an entire BN that contains both expression levels and genetic causes of expression levels.
Conclusion
We conclude that our study supports that the relationships among the expression levels of genes on an STP can be modeled using a causal BN, and that this network is altered in the tumorous tissue. This result opens up new avenues for identifying driver genes on STPs.
Author Contributions
XJ conceived and designed the experiments. DX processed the data, developed the datasets representing the pathways, and analyzed the data. RN wrote the first draft of the manuscript. XJ contributed to the writing of the manuscript. RN and XJ jointly developed the structure and arguments for the paper. All authors reviewed and approved the final manuscript.
Disclosures and Ethics
As a requirement of publication the authors have provided signed confirmation of their compliance with ethical and legal obligations including but not limited to compliance with ICMJE authorship and competing interests guidelines, that the article is neither under consideration for publication nor published elsewhere, of their compliance with legal and ethical guidelines concerning human and animal research participants (if applicable), and that permission has been obtained for reproduction of any copyrighted material. This article was subject to blind, independent, expert peer review. The reviewers reported no competing interests.
References
ghici S. et.al. 