Graphical Identification of Cancer-Associated Gene Subnetworks Based on Small Proteomics Data Sets

Abstract

Proteomics is a rapidly emerging frontier in post-genomics medicine and biology, but the quantitative analysis and validation of proteomic data still need improvement. Before selecting candidate proteomic biomarkers, it is important to understand the broader context of how biological processes are regulated under different conditions or in different phenotypes. The enrichment of proteomic data consists of extracting as much biological meaning as possible from curated, pathway-based, functional protein interaction networks. Currently, most enrichment tools are intended for microarray data and require parametric data, whereas proteomic data are often nonparametric. In this study, we aimed to select a suite of interactive tools that can enrich proteomic results with a graphical overview, which facilitates diagnosis and interpretation prior to further analysis. From a list of proteins, a network was constructed using a map of the most severely disrupted biological processes, and the disease entity was then identified on the basis of clinical data. Taken together, this graphical and interactive method ranks potential proteins via functional analysis in order to improve the choice of biomarkers for validation, with the following advantages: 1) it adds neighbor proteins that are not selected by mass spectrometry analysis, but could in fact be key proteins; 2) it pinpoints the biological process most often involved; and 3) it predicts the most likely disease on the basis of clinical data.

Introduction

Differential proteomic analysis has been widely used to identify candidate proteins for potential use as biomarkers. Recent progress in mass spectrometry technology provides high mass accuracy and detection sensitivity. However, an alternative analysis (ELISA, Western blot) remains helpful for verifying protein quantification (Rifai et al., 2006); recent examples include Zhi et al. (2011) and Xiao et al. (2012).

Protein identification and quantification are only one-half of the proteomic story. The choice of statistical parameters for selecting candidates, their functional clustering, predicting their network interactions, scoring the pathways, and identifying the disease involved constitute the other half of proteomic investigations.

The choice of candidates for validation remains arbitrary, and is limited to a small number of proteins. For this reason, it is important to choose the most specific candidates for validation. To obtain a global overview of candidates, it is essential to have in silico tools that can be used to integrate experimental results (fold change, p value/FDR, replicates, variability, correlations) and functional biology data from curated databases (interaction, function, biological process, pathway, and disease). Such an overview provides a better classification of the candidates, and makes it possible to select the most relevant ones for subsequent validation. We describe here an interactive graphical tutorial that integrates several existing programs in order to improve the selection of candidates.

In most cases, the programs developed for gene set enrichment analysis have been adapted to microarray data and require many chip sets and several thousand genes. Proteomic data, however, are often nonparametric and involve only a few replicates and just a few hundred proteins. We have investigated the existing computational tools and have selected only those that are suitable for small data sets such as those found in proteomics.

Methods

Gene set enrichment

Gene set enrichment analysis is a widely used strategy for scoring gene sets on the basis of their differential expression and known pathway databases. This strategy yields a list of the pathways that include the greatest number of significantly modulated genes between two biological states (Allison et al., 2006). Several methods and tools have been developed (see Supplementary Table S1 at http://www.liebertonline.com/omi). Briefly, they differ mainly in their databases of known gene sets [GO (Ashburner et al., 2000), KEGG (Kanehisa and Goto, 2000), Reactome (Vastrik et al., 2007), MsigDB (Subramanian et al., 2005)] and in the statistical method used to assess enrichment (sample randomization or gene randomization). For further details, see two articles that describe the existing approaches and propose solutions (Luo et al., 2009; Merico et al., 2010). To date, few tools provide interactive exploration, and fewer still can be run using nonparametric data.
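To make the idea of gene randomization concrete, the following is a minimal sketch, not the procedure of any of the tools cited above. It assumes that each measured gene carries a single score (for instance an absolute fold change) and that a gene set is scored by the mean of its members; the function name, the parameters, and the choice of the mean as the set statistic are illustrative assumptions.

```python
import random


def gene_randomization_p(scores, gene_set, n_perm=2000, seed=0):
    """Empirical p value for a gene set under gene randomization.

    scores   : dict mapping gene name -> numeric score (assumed input)
    gene_set : iterable of gene names forming the set to test
    The set statistic (mean score) is an illustrative assumption.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    genes = list(scores)
    members = [g for g in gene_set if g in scores]
    if not members:
        raise ValueError("gene set has no measured members")

    observed = sum(scores[g] for g in members) / len(members)

    # Draw random gene sets of the same size and count how often
    # their mean score reaches the observed one.
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if sum(scores[g] for g in sample) / len(members) >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because only the gene labels are shuffled, this kind of test does not depend on the number of samples per state, which is one reason gene randomization is attractive for small data sets; its known drawback is that it ignores gene-gene correlation.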

The Enrichment MAP program (Subramanian et al., 2005) is a network-based method for gene set enrichment and visualization, but it uses sample randomization and requires more than 8 gene chips per state. The GAGE program (Luo et al., 2009) is a generally applicable gene set enrichment tool for pathway analysis. It could be adapted for use with nonparametric data, and it is able to handle data sets with different sample sizes or experimental designs, but it does not provide graphical tools. The GeneMANIA (Montojo et al., 2010), ClueGO (Bindea et al., 2009), and Reactome FI (Wu et al., 2010) programs all have Cytoscape plug-ins (Shannon et al., 2003; Smoot et al., 2011) and are graphical tools requiring only a gene list. Unfortunately, they share at least two shortcomings: they consider neither the gene expression profile nor the gene correlations.

Multi-correlation network analysis

Estimating the correlation between gene expression profiles is fundamental for clustering functionally relevant gene sets within a cellular pathway. Two genes whose expression is closely correlated are likely to be involved in the same biological process (Langfelder and Horvath, 2008). Eight R packages were tested on our data (see Supplementary Table S1). Only CORREP (Zhu et al., 2007) was able to cope with the small sample size, and it handles replicates as independent samples.

Disease-based network

A network of diseases associated with particular biological processes and the corresponding genes offers a platform for exploring the common genetic origin of many diseases in a single graph (Becker et al., 2004; Goh et al., 2007). The cBio Cancer Genomics Portal (Cerami et al., 2012) (http://www.cbioportal.org/) was used to overlap the gene expression profiles of twenty cancers (clinical data) with our experimental data (Supplementary proteinEXP.txt). This completed the map of the biological processes induced by the modulated genes, and was expected to help to shed light on the panel of genes that can lead to a particular disease.

Proteomics data

We assume that the protein list and its quantitative analysis have already been compiled (see Supplementary files proteinEXP.txt, only-modulatedEXP.txt, and Patient-profile.png at http://www.liebertonline.com/omi).

The proteomic data came from 6 blood samples taken from subjects in two occupational conditions: three (biological replicates) worked in a Radiobiology Department, and the other three (biological replicates) were administrative office workers. The general characteristics of the patients/volunteers are given in the Supplementary Patient-profile.png file.

The institutional Ethics Committee approved the study. All participants were asked to complete a standardized questionnaire including items concerning smoking habits, alcohol intake, drug consumption, medical history, and years of employment.

Figure 1 shows the workflow of our approach. It starts with quantitative expression data, as in the Supplementary proteinEXP.txt file. From this table, the fold change, the FDR, and the multivariate correlation were computed as described in the Supplementary Tutorial. The gene list was then submitted to the Reactome FI plug-in (Wu et al., 2010) through Cytoscape (Shannon et al., 2003; Smoot et al., 2011) to predict the functional interaction network (Fig. 2, network with gray edges). The colors of the nodes were assigned by the cancer module from the National Cancer Institute (Wu et al., 2010). The pertinent genes were selected from this network and submitted to the ClueGO plug-in (Bindea et al., 2009) to annotate the functional network with biological processes (Fig. 2, biological process ellipse). The fold change was indicated by the color of the node borders (red: upregulated, blue: downregulated), and the false discovery rate (FDR) was indicated by the width of the node border.
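The first step of this workflow can be illustrated with a short sketch. It is not the Supplementary Tutorial itself: it assumes that intensities for the two groups are available as plain lists, that per-protein p values have already been obtained by whatever test suits the data, and that the FDR is controlled with the Benjamini-Hochberg procedure (an assumption, since the correction method is not stated here). Function names are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(exposed, control):
    """log2 ratio of the mean intensity in the two groups."""
    return log2(mean(exposed) / mean(control))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, in the input order.

    The choice of this procedure is an assumption for illustration.
    """
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With three replicates per group, the fold change is computed from three values on each side, so it is worth inspecting the replicate variability alongside the adjusted p values before mapping either quantity to node border color or width.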

FIG. 1.

Workflow of an enrichment network, drawn using Inkscape (inkscape.org).

FIG. 2.

Enrichment network for human exposure to ionizing radiation. The network was constructed using Cytoscape software (Shannon et al., 2003; Smoot et al., 2011) with the Reactome FI (Wu et al., 2010) and ClueGO (Bindea et al., 2009) plug-ins. The gene expression correlation was computed with the CORREP package, and the cancers were predicted using the cBio portal database. The node color was assigned by a cancer module from the National Cancer Institute (Wu et al., 2010). The edge color mapping was optimized: cancer node interactions are shown in green, and correlation interactions in blue (negative) or red (positive). The edge width was adjusted to pinpoint the genes most frequently regulated in cancers or the most closely correlated gene expression.

The multivariate correlation analysis was done using the CORREP package (Zhu et al., 2007). Two closely correlated proteins (r > 0.831) were linked by an edge shown in red or blue (Fig. 2, multicorrelation network ellipse).
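The thresholding step can be sketched as follows. This is a simplified stand-in: it uses the ordinary Pearson coefficient rather than the multivariate estimator implemented in CORREP, and it assumes that the 0.831 cutoff is applied to the absolute value of r so that both positive (red) and negative (blue) edges are retained. The function names and the input layout are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Ordinary Pearson correlation (stand-in for the CORREP estimator)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Note: a constant profile (zero variance) would divide by zero here.
    return sxy / sqrt(sxx * syy)


def correlation_edges(profiles, threshold=0.831):
    """Return (protein_a, protein_b, r, color) for strongly correlated pairs.

    profiles : dict mapping protein name -> list of abundances per sample
    Applying the cutoff to |r| is an assumption made for this sketch.
    """
    edges = []
    for (a, xa), (b, xb) in combinations(profiles.items(), 2):
        r = pearson(xa, xb)
        if abs(r) > threshold:
            color = "red" if r > 0 else "blue"  # positive / negative
            edges.append((a, b, r, color))
    return edges
```

The resulting edge list can be written to a tab-delimited file and imported into Cytoscape as a network table, with the color column driving the edge color mapping.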

The gene expression profile was compared with the gene expression profiles of clinical cancer data curated by the cBio Cancer Genomics Portal (Cerami et al., 2012) (http://www.cbioportal.org/). Only protein expression profiles that displayed more than 75% overlap with clinical cases were linked to diseases and merged into the same network (Fig. 2, green edges, cancer network ellipse).
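One possible reading of this filter is sketched below. It assumes that, for each cancer, the portal query yields the number of clinical cases in which a given gene is altered, and that a protein is linked to a cancer when that fraction exceeds 75%; the fraction can then drive the edge width. This interpretation, the input layout, and all names in the code are assumptions made for illustration.

```python
def cancer_edges(altered_cases, total_cases, min_fraction=0.75):
    """Link proteins to cancers where they are altered in most cases.

    altered_cases : dict cancer -> dict protein -> number of altered cases
    total_cases   : dict cancer -> total number of clinical cases
    Both inputs are hypothetical summaries of a portal query.
    Returns (protein, cancer, fraction) sorted by decreasing fraction.
    """
    edges = []
    for cancer, per_protein in altered_cases.items():
        n_cases = total_cases[cancer]
        for protein, n_altered in per_protein.items():
            fraction = n_altered / n_cases
            # Strict inequality: "more than 75%" of the clinical cases.
            if fraction > min_fraction:
                edges.append((protein, cancer, fraction))
    return sorted(edges, key=lambda e: e[2], reverse=True)
```

Under this reading, a cancer linked to many proteins becomes a large, highly connected node, whereas a cancer linked to a single protein by a high fraction appears as a small node with one thick edge.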

A 3-D enhanced version of this tutorial is available in Supplementary video at http://www.liebertonline.com/omi.

Results

Figure 2 shows the results of the enriched proteomic data. The following rules are needed to decipher the information it contains: (1) The nodes with colored borders represent the genes that are up- or downregulated in the example proteomic analysis. (2) The square nodes are added by the Reactome FI plug-in in order to strengthen the links between genes and specific biological processes. (3) Cancer nodes that are notably large or small, and gene nodes with thick borders, are the keys to the network. (4) Thick, short edges are the most important interactions to take into consideration.

The network has three node clusters (densely connected nodes). The multicorrelation network is formed mainly by proteins involved in blood homeostasis via coagulation cascades. Most of these proteins were identified by mass spectrometry, but not selected as being either up- or downregulated. However, it is important to note that some of them were closely correlated with key proteins (nodes with borders: up- or down-regulated). For example, EFEMP1 and ADAMTS13 were positively correlated with the CFHR2 protein, and SPTBN4 was positively correlated with PPBP.

The subnetwork of biological processes can be useful in two ways: (1) it shows that responses to oxidative stress and to ionizing radiation exposure are the most likely processes involved, and (2) the interactions between these biological processes highlight the proteins involved in the responses to these stresses. This makes the roles of the proteins easier to visualize and therefore helps to identify the most useful biomarker.

The subnetwork of cancers illustrates the most frequent cancers and highlights some interesting proteins that could be potential biomarkers. For example, ovarian cancer is the most frequent cancer (the biggest node), but it does not have any specific protein (there is no thick edge to a protein). In contrast, kidney cancer is not frequent (a small node), because the gene expression profiles of cases from the cBio portal database do not match our expression profile (proteinEXP.txt), but it does display a strong interaction (short and thick edge) with the ADH4 protein, which could therefore be the best candidate for this cancer (Fig. 3).

FIG. 3.

Zoom-in of the cancer network cluster. The most pertinent interaction is one that joins a small cancer node through a thick, short edge. This is the case for ADH4 and kidney cancer: kidney cancer did not involve many proteins, but ADH4 was the most frequently altered gene in the majority of cases. This means that the gene is predominantly altered in kidney cancer. In contrast, the gene expression profile of ovarian cancer implicates many proteins (several connections), which therefore cannot be specific for this disease.

Discussion

The biological processes selected by ClueGO show that most of the proteins are involved in the responses to oxidative stress and ionizing radiation. This finding supports our approach, since the proteomic data came from samples taken from individuals who may have been exposed to ionizing radiation while working in a Radiobiology Department. Supplementary Table 2 lists the proteins with related diseases.

All proteins connected to P53 could be interesting to study. P53 was not included in our first list of modulated proteins, but it is involved in the four most interesting biological processes (Fig. 2. Oxidative stress and ionizing radiation response). For example, YWHAZ (14-3-3z protein) was identified as upregulated and it connects only to the ovarian and breast cancer nodes (female cancers). It seems to be a specific candidate, because it is not involved in many cancers.

The third cluster consists of cancers showing the greatest overlap of proteins expressed. Ovarian carcinoma was the most likely disease (biggest node, many edges), but it did not have any specific protein (many interactions with genes). In contrast, the clinical expression profile of kidney carcinoma did not overlap sufficiently (few interactions with genes) except with ADH4, which could therefore be specific. The wide edge between ADH4 and kidney carcinoma indicates that ADH4 is frequently altered in kidney carcinoma. The short distance between ADH4 and kidney carcinoma could be a second indicator of the specificity of ADH4 to kidney carcinoma (Fig. 3). If we check the statistics for kidney cancer (Supplementary Edge-Diseases.txt file, Kidney renal clear cell carcinoma), we can see that ADH4 ranks first for both lung (99% of 178 cases) and kidney (99.7% of 368 cases) cancers. ADH4 is also frequently altered in ovarian cancer (83% of 316 cases). It was easier to reach these conclusions using the graph than from the tables.

Conclusion

This graphical and interactive method ranks potential proteins via functional analysis in order to improve the choice of biomarkers for validation. It is based on clinical data and adapted to small datasets. It could be used to enrich proteomic data, thus enhancing the choice of candidates for validation: 1) It adds neighbor proteins that are not selected by mass spectrometry analysis, but could in fact be key proteins, 2) it pinpoints the biological process most often involved, and 3) it predicts the most likely disease on the basis of clinical data.

Acknowledgments

I thank Prof. Vural Ozdemir for his amendments to the main text, and the two anonymous reviewers for their valuable comments, which greatly improved the presentation of this work. I would also like to thank Dr. Marc Edery for technical and scientific support, and Monika Ghosh for linguistic assistance. This work was supported by the National Center for Nuclear Sciences and Technologies.

Disclosure Statement

The author declares that there are no conflicting financial interests.

References

Allison, Cui, Page, Sabripour. 2006. Microarray data analysis: From disarray to consolidation and consensus. Nature Rev Genet, 7:55–65.

Ashburner, Ball, Blake et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nature Genet, 25:25–29.

Becker, Barnes, Bright, Wang. 2004. The genetic association database. Nature Genet, 36:431–432.

Bindea, Mlecnik, Hackl et al. 2009. ClueGO: A Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics, 25:1091–1093.

Cerami, Gao, Dogrusoz et al. 2012. The cBio Cancer Genomics Portal: An open platform for exploring multidimensional cancer genomics data. Cancer Disc, 2:401–404.

Goh K-I, Cusick, Valle, Childs, Vidal, Barabási A-L. 2007. The human disease network. Proc Natl Acad Sci USA, 104:8685–8690.

Kanehisa, Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 28:27–30.

Langfelder, Horvath. 2008. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform, 9:559.

Luo, Friedman, Shedden, Hankenson, Woolf. 2009. GAGE: Generally applicable gene set enrichment for pathway analysis. BMC Bioinform, 10:161.

Merico, Isserlin, Stueker, Emili, Bader. 2010. Enrichment map: A network-based method for gene-set enrichment visualization and interpretation. PloS One, 5:e13984.

Montojo, Zuberi, Rodriguez et al. 2010. GeneMANIA Cytoscape plugin: Fast gene function predictions on the desktop. Bioinformatics, 26:2927–2928.

Rifai, Gillette, Carr. 2006. Protein biomarker discovery and validation: The long and uncertain path to clinical utility. Nature Biotechnol, 24:971–983.

Shannon, Markiel, Ozier et al. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res, 13:2498–2504.

Smoot, Ono, Ruscheinski, Wang P-L, Ideker. 2011. Cytoscape 2.8: New features for data integration and network visualization. Bioinformatics, 27:431–432.

Subramanian, Tamayo, Mootha et al. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 102:15545–15550.

Vastrik, D'Eustachio, Schmidt et al. 2007. Reactome: A knowledge base of biologic pathways and processes. Genome Biol, 8:R39.

Wu, Feng, Stein. 2010. A human functional protein interaction network and its application to cancer data analysis. Genome Biol, 11:R53.

Xiao, Zhang, Zhou, Lee, Garon, Wong DTW. 2012. Proteomic analysis of human saliva from lung cancer patients using two-dimensional difference gel electrophoresis and mass spectrometry. Mol Cell Proteom MCP, 11:M111.012112.

Zhi, Sharma, Purohit et al. 2011. Discovery and validation of serum protein changes in type 1 diabetes patients using high throughput two dimensional liquid chromatography-mass spectrometry and immunoassays. Mol Cell Proteom MCP, 10:M111.012203.

Zhu, Li. 2007. Multivariate correlation estimator for inferring functional relationships from replicated genome-wide data. Bioinformatics, 23:2298–2305.
