Abstract
This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and applying that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to the cancer genome atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).
Introduction
Microarray technology is providing us with increasingly abundant gene expression-level datasets. For example, the cancer genome atlas (TCGA) makes available gene expression-level data on cases and controls in five different types of cancer. Translating the information in these data into a better understanding of underlying biological mechanisms is of paramount importance to identifying therapeutic targets for cancer. In particular, if the data can inform us as to whether and how a signal transduction pathway (STP) is altered in the cancer, we can investigate targets on that pathway.
An STP is a network of intercellular information flow initiated when extracellular signaling molecules bind to cell-surface receptors. The signaling molecules become modified, causing a change in their functional capability and affecting a change in the subsequent molecules in the network. This cascading process culminates in a cellular response. Consensus pathways have been developed based on the composite of studies concerning individual pathway components. Figure 1 shows a portion of the signaling pathway of human primary naive CD4 T cells, downstream from CD3, CD28, and LFA-1 activation. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway 1 is a collection of manually drawn pathways representing our knowledge of the molecular interaction and reactions for about 136 pathways. Signaling pathways are not stand alone, but rather it is believed there is inter-pathway communication. 2

A portion of the STP of human primary naive CD4 t cells, downstream from CD3, CD28, and LFA-1 activation.
Many aberrant STPs have been associated with various cancers.3–9 For example, we now know that the ErbB, PI3K-Akt, and Wnt pathways are associated with breast cancer. To develop optimal treatments for cancer patients, it is important to discover which STPs are implicated in a cancer or cancer-subtype.
The phosphorylation activity state of each protein in an STP corresponds to the information flow on the STP. However, protein phosphorylation assays are slow, relatively expensive, and can be performed for a tiny but important fraction of the genome. Protein expression level (abundance) is correlated with activity, and gene expression level (mRNA abundance) is associated with protein abundance (correlation coefficient of 0.4–0.6). So, it seems gene expression data should be loosely correlated with activity. Furthermore, as mentioned above, microarray technology is providing us with increasingly abundant gene expression-level datasets. So, researchers developed techniques that investigate which STPs are implicated in a cancer by analyzing gene expression datasets. Initially, techniques such as over-representation analysis10–12 were employed. These techniques simply determine which genes are differentially expressed in the sample groups. Such methods ignore the topology of the network, and so do not account for key biological information. That is, if a pathway is activated through a single receptor and that protein is not produced, the pathway will be severely impacted. However, a protein that appears downstream may have a limited effect on the pathway. Recently, researchers have developed methods that account for the topology of an STP when analyzing gene expression data to determine whether the STP is implicated in a cancer.13–15 Signaling pathway impact analysis (SPIA) 13 is a software package (http://bioinformaticsprb.med.wayne.edu/SPIA) that analyzes gene expression data to identify whether a signaling network is relevant in a given condition that combines over-representation analysis with a measurement of the perturbation measured in a pathway.
However, the correlation of gene expression with activity is not well established. Some studies show that protein expression level (abundance) is often not positively correlated with activity 16 and that gene expression level is often not correlated with protein abundance. 17 Thus, gene expression levels might at most be loosely correlated with activity, which means that the causal structure of an STP might not be represented by the relationships among gene expression levels. More fundamentally, it remained an open question as to whether there even are causal relationships among the gene expression levels of genes coding for proteins on an STP. Neapolitan et al. 18 investigated this question. Specifically, they used a Bayesian network (BN) model to study whether the expression levels of genes that code for proteins on a given STP are causally related and whether this causal structure is altered when the STP is involved in a particular cancer. The results of their study supported that there is a causal structure and that it is altered.
The technique used in the investigation in Ref. 18 provides us with a new method for analyzing whether an STP is implicated in a cancer using gene expression data. In this paper, we present this technique. Then we apply both this technique and SPIA 13 to the analysis of the ovarian carcinoma dataset provided by TCGA. We obtained highly correlated results using the two methods, and we identified biologically plausible STPs as being the ones to be most likely implicated in ovarian carcinoma.
Method
As our method applies BNs to modeling STPs, we first review BNs.
BNs
A BN19–21 consists of a directed acyclic graph (DAG) G = (V,E) whose nodeset V contains random variables, and whose edges E represent relationships between the random variables, the prior probability distribution of every root variable in the DAG, and the conditional probability distribution of every non-root variable given each set of values of its parents. Often the DAG is a causal DAG, which is a DAG containing the edge X→Y only if X is a direct cause of Y. 19 The probability distribution of the variables in a BN must satisfy the Markov condition, which states that each variable in the network is probabilistically independent of its nondescendents conditional on its parents.
Figure 2 shows a BN representing the causal relationships among a subset of the variables related to lung cancer. Using this BN, we can determine conditional probabilities of interest using a BN inference algorithm. 19 For example, we can determine P(L = yes|H = yes, X = yes, T = no).

A BN containing a subset of the variables related to lung cancer.
A BN DAG model consists of a DAG G = (V,E) where V is a set of random variables, and a parameter set θ whose members determine conditional probability distributions for G, but without numerical assignments to the parameters. The task of learning a BN DAG model from data is called model selection.
In the constraint-based approach,
22
we learn a DAG model from the conditional independencies that the data suggested are present in the generative probability distribution P. In the score-based approach, we assign a score to a DAG based on how well the DAG fits the data. The Bayesian score is the probability of the Data given the DAG model.
23
A popular variant of this score is the Bayesian Dirichlet equivalent uniform (BDeu) score.
24
If the set of variables in DAG model G is {X1, X2, …, X
n
}, this score is as follows:
When learning a DAG model from data, we can only learn a Markov equivalence class of DAG models rather than a unique DAG model. Two DAGs are called Markov equivalent if they entail the same conditional independencies. 19 For example, the DAGs X → Y → Z and X ← Y ← Z are Markov equivalent.
Many biological processes have been modeled using BNs including molecular phylogenetics, 25 gene regulatory networks,26–28 genetic linkage, 29 genetic epistasis,30–34 and STPs.35–39
STPs modeled as BNs
If we represent the phosphorylation activity state of each protein in an STP by a random variable and draw an arc from X to Y if there is an edge from protein X to protein Y in the STP, then we are modeling the STP as a BN. For this BN to represent the joint probability distribution of the random variables, the Markov condition must be satisfied. Woolf et al. 38 argue that steady-state concentrations should satisfy this condition. For example, in Figure 1 the phosphorylation activity of MEK1/2 should be dependent on the phosphorylation activity of PKA because the activity of PKA affects the activity of RAF, which in turn affects the activity of MEK1/2. However, once we know the phosphorylation activity of RAF, the implication link is broken, which is what the Markov condition entails. Sachs 37 performed a proof of principle study concerning this conjecture, and found that it is true.
Causal analysis of STP aberration (CASA)
In what follows for simplicity we will say that a gene coding for a protein on an STP is on the STP itself. We assume we have two sets of data. The first set contains the gene expression levels of all (or at least most) genes in a set of cases (tumors) and the second set contains the gene expression levels of all genes in a set of controls. Let X be an STP we are investigating, Data1 be the data concerning cases for genes on X, and Data2 be the data concerning controls for genes on X.
There are two models. Model M
A
represents that the same causal structure (BN) is generating both Data1 and Data2. In this case, the two datasets can be considered as coming from the same population and therefore combined. Model M
B
represents that two different causal structures (BNs) are generating the data. We compute the log Bayes factor of model M
B
relative to model M
A
as follows. We first compute
The larger the value of K, the more the data indicate that the causal structure of STP X is altered in the tumorous tissue. In our investigations, we approximate the Bayes factor by approximately learning the most likely model and then use the Bayesian information criteria (BIC) to approximate the probability of the data given that model. In the limit, the BIC and the BDeu score (Equation 1) choose the same model. 19
We call the method CASA.
Jiang et al. 18 used Equations (2) and (3) to analyze 5 STPs associated with breast cancer, 10 STPs associated with other cancers, and 10 randomly chosen STPs, using a breast cancer gene expression-level dataset containing 529 cases and 61 controls. They obtained significant results indicating that K (Equation 4) is larger in the cancer-related STPs than in the randomly chosen ones. These results support that the causal structure is altered in the cancer-related pathways. However, the possibility exists that these significant results were obtained simply because the genes are over or under expressed in cancer-related STPs, and the causal structure is not relevant. To test this possibility, they redid the study with all BNs constrained to having no causal edges. They obtained results that had no significance at all. Hence, their overall results support that there is an underlying causal structure among expression levels of genes on an STP and that this causal structure is altered when an STP is involved in cancer. These results indicate that CASA should be able to effectively identify cancer-related STPs.
Application to ovarian cancer
TCGA makes available datasets concerning breast cancer, colon adenocarcinoma, glioblastoma, lung squamous cell carcinoma, and ovarian carcinoma. Each dataset contains data on the expression levels of 17,814 genes in cases (tumorous tissue) and controls (non-tumorous tissue). Table 1 shows the number of cases and controls in each of these datasets.
The number of cases and controls in the five TCGA datasets.
These datasets are highly unbalanced in that there are many more cases that controls. Difficulties can occur with unbalanced datasets. For example, predictive accuracy, a method often used for evaluating the performance of a classifier, is not appropriate when the data are unbalanced. 40 However, our application is discovery, not prediction. The BDeu score, which we employ, automatically incorporates both the number of cases and controls into the resultant score. Too few data items would make it more difficult to distinguish models, but not produce an inappropriate measure. Furthermore, to increase the number of controls, we used the controls from all five datasets, resulting in a total of 102 controls.
We investigated the ovarian carcinoma dataset. We analyzed 20 cancer-related STPs to see which ones the data indicate are involved in ovarian cancer. We also analyzed six arbitrary STPs to see whether STPs involved in cancer in general have more implicating scores than arbitrary STPs. These pathways were selected at random from the KEGG pathways list after removing the cancer-related pathways. The STPs analyzed appear in Table 2.
The STPs analyzed using the TCGA ovarian carcinoma dataset.
Using the KEGG database, we identified all the genes related to each of the 26 pathways. We extracted gene expression profiles for the 591 ovarian carcinoma cases and 102 controls in the TCGA database. By mapping the gene names of the genes in the gene sets identified using KEGG pathways and the gene names in TCGA data, we were able to extract the gene expression profiles for each of the 26 pathways for the 591 cases and 102 controls. All expression levels were discretized to values low, medium, and high using the equal width discretization technique, which discretizes the data into partitions of K equally sized intervals (K = 3 in our application).
Using the resultant datasets, we used CASA to learn Bayes factors and SPIA to determine P-values for the 26 pathways. We used the BN learning package HUGIN 41 to approximately learn the most probable DAG models and to calculate the BICs. We obtained SPIA from http://bioinformaticsprb.med.wayne.edu/SPIA.
Results
Table 3 shows the results for CASA, and Table 4 shows the results for SPIA. The P-values for CASA were obtained by making the null hypothesis that the log Bayes factor is #0, assuming a normal distribution, and approximating the variance by the variance of the observed log Bayes factors.
The Casa results for the STPs analyzed using the TCGA ovarian carcinoma dataset.
The SPIA results for the STPs analyzed using the TCGA ovarian carcinoma datasets.
Notice that in both tables, the cancer-related pathways in general are near the top. Based on a two-sample t-test, the cancer-related pathways scored higher (larger K values for CASA and smaller P-values for SPIA) than the noncancer pathways at the 0.047 level for CASA and at the 0.083 level for SPIA. Also, the P-values for the two methods are highly correlated (correlation coefficient = 0.405; P = 0.040).
We combined the P-values using Brown's 42 modification of Fisher's method because both CASA and SPIA analyzed the same dataset and therefore we do not have independence. The combined P-values appear in Table 5. In this table, we show whether CASA and SPIA individually found each STP noteworthy, where by noteworthy we mean a P-value no larger than 0.05.
The combined results for the STPs analyzed using the TCGA ovarian carcinoma dataset. There is an X in the far right columns if Casa or SPIA separately found the STP noteworthy.
Many of the results obtained are plausible according to current knowledge. PI3 K, which is “probably one of the most important pathways in cancer metabolism and growth,” 43 has P-value essentially equal to 0 based on each method individually and based on the combined results. Furthermore, PI3 K, Ras, ErbB, and Wnt, all of which rank high, are known players in normal growth regulation and deregulation in cancer cells.
Discussion
We developed CASA, which is a BN-based method for investigating whether STPs are implicated in cancer using case–control gene expression datasets. We applied both CASA and another topology-based method, SPIA, to the TCGA ovarian carcinoma dataset to analyze 20 cancer-related STPs and 6 randomly selected STPs. The results of the two methods were highly correlated. CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047) but SPIA did not (P = 0.083). Furthermore, several of the STPs that ranked highest are linked to all cancers based on current knowledge.
These results open up avenues for future research. In particular, we can analyze all 136 pathways in KEGG pathway with the purpose of identifying undiscovered pathways related to ovarian cancer. This analysis will require a good deal of manual effort to develop the individual STP datasets from the manually drawn pathways and the TCGA datasets. Second, we can analyze the remaining four cancers in the TCGA datasets, and perform a pan cancer analysis, looking for STPs involved across all cancers.
Conclusion
We conclude that our study supports that both CASA and SPIA can identify aberrant STPs in cancer using case/control gene expression-level data. These results open up avenues for discovering cancer-related STPs across different types of cancers.
Author Contributions
XJ conceived and designed the experiments. XJ processed the data, developed the datasets representing pathways, and analyzed the data. RN wrote the first draft of the manuscript. XJ contributed to the writing of the manuscript. RN and XJ jointly developed the structure and arguments for the paper. Both authors reviewed and approved the final manuscript.
