Abstract
Driver genes are directly responsible for oncogenesis and identifying them is essential in order to fully understand the mechanisms of cancer. However, it is difficult to delineate them from the larger pool of genes that are deregulated in cancer (ie, passenger genes). In order to address this problem, we developed an approach called TRIAngulating Gene Expression (TRIAGE through clinico-genomic intersects). Here, we present a refinement of this approach incorporating a new scoring methodology to identify putative driver genes that are deregulated in cancer. TRIAGE triangulates - or integrates -three levels of information: gene expression, gene location, and patient survival. First, TRIAGE identifies regions of deregulated expression (ie, expression footprints) by deriving a newly established measure called the Local Singular Value Decomposition (LSVD) score for each locus. Driver genes are then distinguished from passenger genes using dual survival analyses. Incorporating measurements of gene expression and weighting them according to the LSVD weight of each tumor, these analyses are performed using the genes located in significant expression footprints. Here, we first use simulated data to characterize the newly established LSVD score. We then present the results of our application of this refined version of TRIAGE to gene expression data from five cancer types. This refined version of TRIAGE not only allowed us to identify known prominent driver genes, such as MMP1, IL8, and COL1A2, but it also led us to identify several novel ones. These results illustrate that TRIAGE complements existing tools, allows for the identification of genes that drive cancer and could perhaps elucidate potential future targets of novel anticancer therapeutics.
Introduction
Cancer is characterized by the accumulation of genomic abnormalities that result in activated oncogenes and inactivated tumor suppressor genes. These deregulated genes are known as “driver genes.” Identifying genes that “drive” oncogenesis is central to improving our understanding of the mechanisms of cancer and to developing new anticancer therapies. Driver genes can be used as biomarkers of cancer susceptibility. For instance, inherited mutations in BRCA1 and BRCA2 a are strong indicators of breast and ovarian cancer risk. 1 Driver genes can also be used to define common genetic profiles shared by subgroups of patients who may benefit from targeted treatment strategies. For example, ERBB2 b (also known as HER2/neu) is amplified and overexpressed in 20% to 25% of breast cancers 2 and is the target of the monoclonal antibody trastuzumab (marketed as Herceptin, http://www.herceptin.com/breast/), a drug that is effective only when ERBB2 is amplified and overexpressed. Those who seek to compile a catalog of additional driver genes must attempt to distinguish them from the larger number of “passenger genes,” which have been disrupted as a result of cancer progression but do not confer growth or survival (dis)advantage.
breast cancer 1/2, early onset.
v-erb-b2 erythroblastic leukemia viral oncogene homolog 2, neuro/glioblastoma derived oncogene homolog avian).
Driver genes may be deregulated through a number of mechanisms, operating at the levels of both DNA and RNA to trigger oncogenesis. The first genomic aberration consistently found to be associated with malignancy in humans was a translocation between BCR c and ABL d on chromosomes 9 and 22, a discovery that led chromosome 9 to be known as the Philadelphia chromosome.3,4 Following this discovery, a drug named Imatinib (commercialized as Gleevec, http://www.gleevec.com/) was developed to specifically inhibit the resulting fusion gene BCR-ABL. A translocation between TMPRSS2 e and the ETS (E 26) family of genes (ERG f , ETV1 g and ETV4 h ) occurs frequently in prostate cancer. 3 Copy number alterations, such as genomic amplifications or deletions, are also common in cancer (eg, amplification of ERBB2 is common in breast cancer, as mentioned above). In addition, mutations can cause deregulation of driver genes leading to oncogenesis. For instance, the TP53 i mutation, which makes the cell insensitive to signals of apoptosis, 5 is present in most human tumors. Epigenomic modifications, such as histone methylation, acetylation, and chromatin modifications, also contribute to tumor formation and progression. By activating downstream oncogenes (eg, HRAS j in gastric cancer) and by silencing tumor suppressors (eg, RB1 k in retinoblastoma 6 ), these modifications lead to chromosomal instability and to more frequent and aggressive tumors.
breakpoint cluster region.
c-abl oncogene 1, non-receptor tyrosine kinase.
transmembrane protease, serine 2.
v-ets erythroblastosis virus E 26 oncogene homolog (avian).
ets variant 1.
ets variant 4.
tumor protein p53.
v-Ha-ras Harvey rat sarcoma viral oncogene homolog.
retinoblastoma 1.
We have recently developed a data-mining strategy called TRIAngulating Gene Expression (TRIAGE through clinico-genomic intersects) to guide the identification of potential driver genes, which are typically deregulated in only a subset of tumor samples. TRIAGE triangulates three levels of information: gene expression, gene location and clinical survival. We have used TRIAGE to discover and validate a novel oncogene RAB11FIP1 l that promotes metastasis in breast cancer. 7 TRIAGE has also been used to characterize patients with the fusion gene RPS6KB1-VMP1 m , a mutation caused by tandem duplications. 8 In this work, we describe recent refinements to the TRIAGE scoring methodology and we present the results of simulations testing this new scoring. Further, we describe the results obtained when we applied the newly refined TRIAGE approach to discover new candidate oncogenes and tumor suppressors in five human cancers.
RAB11 family interacting protein 1 (class I).
ribosomal protein S6 kinase, 70kDa, polypeptide 1– vacuole membrane protein 1.
The first step in the TRIAGE methodology is to identify “expression footprints” (ie, regions that are either induced or repressed at the RNA level and are therefore referred to as “induced” or “repressed” expression footprints, respectively). These areas, which are identified using a novel measure called a Local Singular Value Decomposition (LSVD) score, may overlap at the level of DNA with other genomic events including copy number alterations, mutations or epigenomic changes and may contain driver genes. TRIAGE then uses dual survival analyses to distinguish driver genes from passenger genes located in the same expression footprint. The first survival analysis identifies the genes that are significantly associated with the time-to-event outcome (eg, time to local or distant recurrence) by fitting a Cox proportional hazards model 9 over all patients in the cohort. The second survival analysis identifies potential driver genes by testing associations with the time-to-event outcome in the samples that are not characterized by these expression footprints.
TRIAGE represents several improvements over classical approaches to the analysis of differential expression. First, unlike single whole cohort survival analysis, the TRIAGE approach allows one to distinguish between driver and passenger genes. Second, it is sensitive enough to detect driver genes that are deregulated in a small subset of patients, whereas classical analyses are only able to detect genes that are commonly deregulated in most patients (as described in a number of detailed reviews10–13). Third, it is able to identify the samples and genes that contribute to the expression footprint. Furthermore, contrary to previous methods, 14 which derive a measure of significance for each sample separately, TRIAGE analyzes the whole tumor cohort simultaneously. Finally, unlike other methods, 15 it does not require samples from normal tissues.
Here, we present the statistical properties of the LSVD score, which we characterized using simulated data. We then present the results of our analysis to identify potential candidate driver genes by applying TRIAGE to five human cancers and we discuss the resulting catalog.
Methods
The TRIAGE approach comprised three main steps, as outlined in Figure 6, and described in detail below:
The LSVD score is used to identify induced (or repressed) genomic expression footprints from gene expression data. Genomic regions containing a substantial proportion of genes that are either over- or under-expressed in multiple tumors contain potential oncogenes or tumor suppressor genes, respectively. The LSVD score that we used to perform the analyses presented here has been refined in this version of TRIAGE.
Unselected survival analysis is used to identify associations between patient survival and gene expression profiles in the expression footprints. Gene expression profiles that are significantly associated with either increased or reduced risk of failure by Cox proportional hazards regression models are indicative of potential oncogenes or tumor suppressor genes, respectively.
Selected survival analysis is used to distinguish driver genes from passenger genes in the expression footprints. While the expression of passenger genes may be associated with survival, driver genes are expected to be associated with survival even in samples where the respective expression footprint is present; this is not the case for passenger genes. This expectation is based on the assumption that driver gene expression in tumors will often be deregulated by mechanisms other than copy number alteration or other regional events. The genes that are significantly associated with survival in both the unselected and selected survival analyses are interpreted as potential driver genes.
These three steps are further detailed below.
Using LSVD Score to Identify Induced (Or Repressed) Genomic Expression Footprints
The objective of this step is to identify regions (ie, expression footprints) of co-expressed (ie, co-induced or co-repressed) genes and corresponding subgroups of tumors that share the same expression footprints. The problem can be posed as the analysis of an undirected bipartite graph between the set of tumor samples and the set of genes. An edge between a tumor sample and a gene is established if the gene is overexpressed (or repressed) in that particular sample, ie, if its expression is above (or below) a predefined threshold (denoted by a) in that particular sample, an edge between the tumor sample and the gene is established. Next, we identify dense subgraphs in which the connectivity between tumor samples and genes is higher.
The link structure is then analyzed using Singular Value Decomposition (SVD) constrained upon the localization of gene nodes in the genome as described below.
In the following procedures, we consider the measurement of gene expression for a set of n tumors over m genes. For each chromosome c;c∈ {1,…, K}, let E c denote the matrix of log2 of gene expression of dimensions n × m c (with ∑ c m c = m), where m c is the number of genes on chromosome c.
The expression footprints are identified by analyzing E c using LSVD according to the following steps:
Defining the bipartite graph structure by transforming E c into binary connectivity matrix Y c
Deriving chromosome localized matrices Y lc , for each location “l c ”
Applying SVD and computing the connectivity or LSVD score Δ lc
Identifying the regions of interest (ie, expression footprints).
While the first step is slightly different for the analysis of induced and repressed expression footprints (as described below), steps 2, 3, and 4 remain the same.
Defining the Gene–Tumor Bipartite Graph
E c is transformed into binary connectivity matrices A c for induced expression footprints and D c for repressed expression footprints through the discretization rule.
Induced expression footprints.
For each tumor sample j and gene i, the transformation of the expression data into A
c
is obtained as follows:
Repressed expression footprints
For each tumor sample j and gene i, matrix E
c
is transformed into D
c
as
Y c = A c to identify induced expression footprints and Y c = D c to identify repressed expression footprints.
Deriving Localized Matrices
To account for the localization of expression footprints in the genome, we derive localized connectivity matrices. Local matrices Y
lc
at location “l
c
” are derived from Y
c
using genes located in
Performing SVD of Localized Matrices Y lc
SVD decomposes a matrix Y
lc
of dimensions n × m
lc
into a product of three matrices U, ∑, and V
T
such that
The largest or principal singular value of Y lc summarizes the density of the network. Its value increases with the number of links but it does not allow one to distinguish between networks in which links are concentrated around a few genes and tumor samples and networks in which links are spread among different genes and/or tumor samples. To account for this observation, in the newer version of the LSVD score, the singular vectors associated with the principal singular value are also appropriately included in the definition of improved version of LSVD score as described in the next subsection.
Identifying the Regions of Interest
As shown by Kleinberg,
17
the discriminative ability of the principal singular value increases with the number of repeated multiplications of the matrix to be decomposed. In order to build the final score, SVD is thus applied on matrices P
lc
and Q
lc
, which are based on the repeated multiplication of the square matrices
These matrices can be decomposed by SVD as
Then, an LSVD score Δ
lc
at l
c
is obtained by weighing the principal singular value of P
lc
denoted by
In the above formula, the value of
Higher the LSVD score, the higher confidence in the expression footprint around l c indicating that the genes at this location are contributing to the expression footprint in a subset of tumor samples.
Finally, the genomic regions with consecutive LSVD scores above the predetermined threshold represent the expression footprints. The weights
Dual Survival Analysis
Dual survival analysis is used to distinguish between driver and passenger genes.
Let E ij denote the log2 of expression of gene i;i ∈ {1,…, m} in tumor sample j; j ∈ {1,…, n}. Let T be the possibly censored survival time for each tumor sample j.
Unselected survival analysis
First, for each gene i, in a selected expression footprint, a Cox proportional hazards (Cox-PH) model 9 is fit.
The Cox-PH model is defined by the following hazards function
The score statistic and associated P-value are then used to assess the significance of the association.
Selected survival analysis
Since passenger genes located in the expression footprints may also be associated with the survival outcome, a so-called selected survival analysis is conducted to reevaluate the association between survival and gene expression in the absence of expression footprint. Driver genes are indeed assumed to influence the survival outcome even in the tumor samples that do not have the expression footprint. 7 For this reason, model (1) is applied to the tumor samples that do not contribute to the footprint (ie, the tumor samples with weights U lc [k∈{1,…,n},1] equal to 0 for the expression footprint in consideration). The score statistic and associated P-value are used to assess the significance of the survival association.
Finally, genes that are significantly associated with survival in both the unselected and selected survival analyses are interpreted as candidate driver genes.
Simulations to Study the Properties of the LSVD Score
Simulation Scheme
We conducted a simulation study to evaluate the statistical properties of the LSVD score (ΔA lc ), when used to identify expression footprints. Induced and repressed expression footprints were considered separately.
We simulated gene expression datasets composed of m = 1,000 genes profiled among n = 100 tumor samples. Gene expression values were simulated with a log-normal distribution Log − N (μ, σ) with expectation
As genes involved in the same or related pathway are likely to be co-expressed, we generated datasets with so-called “clumpy” dependence (ie, while gene measurements are dependent upon each other in small groups, measurements in each group are independent from the other groups) using the following procedures.19,20 For each group of 10 genes indexed by k;k = 1, …, 10, a random vector R = R
ik
, i = 1, …, n, was generated from a standard normal distribution N (0, 1). The data matrix E was then built so that
Number of configurations were considered in order to study the influence of (i) percentage of tumors contributing to the expression footprint, (ii) number of genes forming the expression footprint, (iii) mean value of the log-normal distribution μ, (iv) window size ω (v) threshold parameter a used to define deregulated expression, and (vi) correlation parameter ρ. These configurations are summarized in Table 1. Additional details for each configuration are provided in Supplementary File 1. For each configuration, 200 repetitions of simulations were performed.
Parameter settings used in the six simulated configurations.
Simulation Results
Simulations results for configurations (i), (ii), and (iii) are presented in Figures 1, 2, and 3, respectively. Simulation results for configurations (iv), (v), and (vi) are shown in Figures S1, S2, and S3 in Supplementary File 2, respectively.

Graph (

Graph (

Graph of the score value for 1,000 overexpressed (
Figure 1 presents the variation of the LSVD score for different percentages of tumor samples contributing to the induced expression footprint. The LSVD score Δ lc allows one to detect the expression footprints that are shared by 5 to 30% of the samples. Simulations for repressed expression footprints yielded similar results (data not shown).
Results for expression footprints of varying sizes (Fig. 2) show that the value of Δ lc is saturated for expression footprints >10 genes because of the window size of ω = 5 genes. The value of Δ lc is smaller for expression footprints of sizes 10 or fewer genes. We obtained similar results for repressed expression footprints (data not shown).
The influence of the mean expression change is depicted in Figure 3A for induced expression footprints and in Figure 3B for repressed expression footprints. The score increases with the absolute value of the gene effect. The greater the absolute mean value of the log-normal distribution, the higher the score.
Figure S1 in Supplementary File 2 presents the results of simulations similar to configuration (ii) for different window sizes (ω = 5; 10; 20). The LSVD score is lower for smaller window sizes while its variance is larger.
Figure S2 in Additional File 2 shows the results of simulations similar to configuration (ii) for different threshold parameters (a = 1; 1.5; and 2) indicating that the value of Δ lc is robust to the variation of a.
Finally, the influence of the correlation parameter on Δ lc was determined for simulation configuration (vi) for ρ = 0; 0.25; 0.5; 0.75. Figure S3 in Additional File 2 shows that the higher the correlation, the smaller the value of Δ lc , as the addition of the correlation introduces noise in the dataset. Even with this noise, expression footprints are still detectable. Simulations using different sample sizes (n = 50; 100; 200; 500) yielded similar results (data not shown). Thus, the sample size did not affect the value of Δ lc .
Deriving Driver Gene Catalogs in Five Cancers
In this section, we present the results obtained when we used TRIAGE to identify candidate driver genes that are deregulated in subpopulations of tumors. We used five large datasets representing cancers of the breast, ovary, lung, colon, and glioma.
The datasets that we used are summarized in Table 2; sample sizes varied from 111 to 741 patient tumors. Gene expression was measured using Affymetrix HU133A, HU133B, and HU133Plus2.0 arrays (Affymetrix, Santa Clara, CA, USA). We used na32 annotation files obtained from Affymetrix (http://www.affymetrix.com). Raw data were normalized using quantile normalization. 21 We averaged the measurements of transcripts that corresponded to the same gene on a chromosome. Different types of survival outcomes were available in different datasets, defined as follows. Overall survival (OS) was defined as the time from inclusion of the respective patient in the study (eg, surgery) until death or last follow-up. Relapse free survival (RFS) was defined as the time from inclusion until disease-related death, disease recurrence (either local or distant), or last follow-up. Disease metastasis-free survival (DMFS) was defined as the time interval between inclusion to the first distant recurrence event or to last follow-up.
Description of the different cancer studies. The two rightmost columns give the number of putative driver genes identified by TRIAGE.
GSE3494, GSE1456, GSE6532, GSE4922.
For each dataset, Δ lc was calculated for each chromosomal arm with a sliding window of size ω = 5. Induced and repressed expression footprints were identified separately. A threshold corresponding to median(Δ lc ) ± 2mad(Δ lc ), with “mad” representing adjusted MAD, was chosen to identify relevant expression footprints. Regions with LSVD score exceeding this threshold were selected and extended by ω on either side for dual survival analyses. The extended regions were considered to be expression footprints.
Unselected and selected survival analyses were performed using the genes within the expression footprints. Associations between gene expression and “poor” prognosis were obtained for the genes located within induced expression footprints in order to identify potential oncogenes. Associations between gene expression and “good” prognosis were obtained for those within repressed expression footprints in order to identify potential tumor suppressor genes. A threshold of P = 0.05 was used to indicate statistical significance for both sets of survival analyses. Circular plots 22 (see Supplementary File 3) provide an overview of these results, including the value of Δ lc for each chromosome and the location of potential oncogenes and tumor suppressors along the genome. For instance, the plot in Figure S5 in Supplementary File 3 (from the breast cancer study) indicates that the highest values of Δ lc were located on chromosome 17q for the induced expression footprint and on chromosome 7p for the repressed expression footprint. The largest sets of potential oncogenes were observed on chromosomes 17q, 19, and 8. Potential tumor suppressors were located throughout the genome.
Supplementary Files 4 and 5 provide lists of putative oncogenes and tumor suppressors selected by TRIAGE for different cancers studied. A pathway analysis on the selected genes (1638 oncogenes and 1196 tumor suppressors) performed using Ingenuity Pathway Analysis (Ingenuity Systems, www.ingenuity.com); see Supplementary File 6) shows significant enrichment in cancer annotated genes, most specifically in carcinoma, in solid tumor, and in several other types of tumors and cancers. A total of 786 genes were classified under this category. Other pathways commonly observed in cancers including apoptosis, cell death, cell growth and proliferation, and tumor morphology were also significantly enriched.
From the lists of potential oncogenes and tumor suppressors, we intersected those common to the different cancer types (see Fig. 4 for summary statistics). The number of genes commonly expressed in different cancer types was relatively small compared to the total number of genes identified for each individual cancer, indicating that driver genes are specific to a given cancer type. The most commonly expressed genes are listed in Table 3; the top four genes from this list are present in at least three different cancers. Among them, MMP1 n belongs to the matrix metalloproteinase (MMP) family, which is known to play a role in metastasis as up-regulation of MMPs lead to enhanced cancer cell invasion. 23 IL8 o is an important mediator of the inflammatory response and is implicated in various cancer types.24,25 COL1A2 P encodes one of the chains for type I collagen and is methylated in multiple cancer cell lines.26,27 It is also involved in a fusion with gene PLAG1 in lipoblastoma, a benign infant tumor. 28
matrix metallopeptidase 1 (interstitial collagenase).
interleukin 8.
collagen, type I, alpha 2.

Number of genes in common among different cancer types.
Genes common to the five cancer studies.
Figure 5 displays the centered and normalized LSVD score for the different windows containing MMP1, IL8, and COL1A2 (0 corresponds to the window centered on the considered gene). Stars indicate studies where the gene was significantly associated with both the unselected and selected survival. For most studies, the LSVD score for MMP1 was high for all window sizes, suggesting that this gene belongs to a larger region of deregulation. In a similar way, IL8 is deregulated as part of a large expression footprint. The deregulation of MMP1 and IL8 is strongly associated with a dosage effect. For COL1A2, high LSVD scores are more localized indicating that, although this gene does not belong to a large region of deregulation, it might be activated by other mechanisms, such as methylation or fusion.

Heatmap representation of the LSVD score for the windows centered on (

Overview of the TRIAGE methodology.
Discussion
In this paper, we presented refinements of the TRIAGE method, an approach that we developed to identify potential driver genes. We first characterized the LSVD score using simulation. We next identified known and novel driver genes in cancer using gene expression data, genomic information, and survival data. TRIAGE uses two main steps. First, Δ lc is derived using a LSVD score to identify regions of deregulated expression, called expression footprints. Then, two survival analyses were performed on the genes located within these selected loci. The score derived in the first step represents the linkage structure between the set of genes and the set of tumor samples. This score is obtained using three factors: the principal singular value, which quantifies the number of links between the tumor samples and the genes and the two ordered principal singular vectors of the LSVD, which together summarize the connectivity of the network. Calculating this score using local matrices allows us to take into account the location of expression footprints throughout the genome. Indeed, genes located in the same region are more likely to influence each other or to be influenced by chromosomal or epigenetic events.
In the second step, dual survival analyses are used to distinguish driver genes from passenger genes. First, unselected survival analysis is used to identify the genes that are significantly associated with time-to-event outcomes. A selected survival analysis is conducted next by excluding the tumor Samples that contribute to the expression footprint in order to distinguish driver genes from passenger genes. Driver genes are presumed to have an impact on survival even in the absence of the corresponding expression footprint, whereas passenger genes are selected only because they are co-located with a driver gene and thus belong to the expression footprint. Potential driver genes are those that are significantly associated with survival in both the unselected and selected survival analyses.
Our simulation results illustrated that the value of DL lc increases with the size of both the expression footprint size and the relative risk. While it is robust to varying threshold parameters (ie, within a range of 1–2), it is affected by the size of the window, although it is important to note that this is not a problem if the same window size is used throughout the analysis. Indeed, the score was only saturated in larger expression footprints, which were few in number.
Using real datasets derived from five different cancer types, we illustrated that TRIAGE was able to identify potential driver genes that were enriched for biological processes known to be involved in cancer progression. Among the selected genes, known oncogenes such as MMP1, IL8, and COL1A2 were identified as drivers for multiple cancers. Many new potential driver genes were identified and further biological validation studies would be invaluable to confirm or disprove the importance of these genes in the etiology of cancer.
Our results illustrate that TRIAGE offers several advantages over traditional methods of expression analysis, which select genes that are commonly over- or underexpressed. In contrast, TRIAGE relies on patient heterogeneity to highlight different subtypes of gene expression. TRIAGE is thus a useful tool for identifying the genes that distinguish between subgroups of patients having the same disease but differing in their genomic profiles, including differences in active driver genes. These subpopulations could thus potentially benefit from different treatments. Such patient-specific approaches are central to the increasingly influential field of personalized medicine.
TRIAGE is not without some limitations. Here, we focused on the analysis of gene expression. However, the mechanisms that underlie cancer are tremendously complex, involving a host of other genomic aberrations including copy number variations, mutations, and fusions. We anticipate that future refinements to the TRIAGE approach will allow us to account for these influences. TRIAGE is limited to the identification of driver genes harbored in regions associated with deregulated gene expression. However, many genes become deregulated in isolation through many mechanisms. For example, p53 is deregulated by a deleterious mutation but the expression of its genomic region is not deregulated. Similarly, fusion genes may be formed by translocations in the absence of tandem duplications. These limitations notwithstanding, TRIAGE is a valuable tool to identify driver genes that are associated with regions of deregulated gene expression in cancer and may perhaps be applicable to other vexing conditions as well.
Author Contributions
Conceived and designed the experiments: RKMK, LDM. Analyzed the data: SR, RKMK. Wrote the first draft of the manuscript: SR. Contributed to the writing of the manuscript: RKMK, LDM. Agree with manuscript results and conclusions: SR, LDM, RKMK. Jointly developed the structure and arguments for the paper: LDM, RKMK. Made critical revisions and approved final version: LDM, RKMK. All authors reviewed and approved of the final manuscript.
Footnotes
Acknowledgments
We thank Genome Institute of Singapore and The Jackson Laboratory for supporting this work. We also thank Joshy George for valuable comments and Tara Mclaughlin for helping editing the manuscript.
