GeneExpressionSignature: an R package for discovering functional connections using gene expression signatures

Abstract

Comparisons of gene expression signatures provide a way to explore functional connections among biological events in global aspects of cell response. GeneExpressionSignature is an R package developed for the large-scale analysis of gene expression signatures. The package implements two rank-merging algorithms and two similarity-scoring algorithms. The functions of GeneExpressionSignature provide a flexible solution for gene expression signature-based studies and hold great potential in biomedical research applications, such as drug repurposing. GeneExpressionSignature is released under GPL v2 within the Bioconductor project and is freely available at http://www.bioconductor.org/packages/release/bioc/html/GeneExpressionSignature.html.

Introduction

Genome-wide mRNA profiling provides a cost-efficient snapshot of the global state of cells (Kutalik et al., 2008). The gene expression signature, a subset of significantly differentially expressed genes, can serve as a marker of this snapshot with less noise. Comparative analyses of gene signatures can establish functional connections among diseases, genes, and drugs.

In vitro and in vivo experiments have successfully verified many of the drug–drug, drug–target, and drug–disease associations revealed by gene signature comparisons, which indicate potential for drug repositioning (Dudley et al., 2011; Hu and Agarwal, 2009; Iorio et al., 2010; Lamb et al., 2006; Kutalik et al., 2008; Sirota et al., 2011). At their core, these large-scale transcription pattern-matching studies measure the similarity between gene expression signatures. Lamb et al. (2006) introduced a rank-based scoring metric for gene expression signatures that was based on the Gene Set Enrichment Analysis (GSEA) algorithm (Subramanian et al., 2005). They used the GSEA algorithm to construct a large association map (Connectivity Map, C-MAP) of 1309 Food and Drug Administration (FDA)-approved drugs. A rank-merging step was introduced to synthesize gene expression profiles from different cultured human cells treated with drugs at different concentrations.

A web application (MANTRA, http://mantra.tigem.it) and the R package PGSEA (Kim and Volsky, 2005) are currently available for gene expression signature analysis. Unfortunately, neither of these tools supports the integration of gene expression profiles, and both are limited in how many data resources can be used in a single study. Here, we present an R package named GeneExpressionSignature for comparative studies of different transcriptional responses based on gene expression signatures; it provides solutions from data preprocessing to similarity scoring.

Materials and Methods

GeneExpressionSignature is developed as a package for the statistical computing environment R and is released under the GNU General Public License within Bioconductor. The package provides a simple data preprocessing method and integrates two rank-merging and two similarity-scoring methods. All of the functions in the GeneExpressionSignature package, except getRLs, accept ratio, log-ratio, and rank data stored as assay data in the “ExpressionSet” object of the Biobase package. The column labels, together with the phenotypic data in the “ExpressionSet” object, describe the biological state of each gene expression profile.

The GeneExpressionSignature package offers a preprocessing method for gene expression profile data, named getRLs, which produces ranked lists by sorting the microarray probe-set identifiers according to their expression values. It should be noted that there is no standard method for data preprocessing, so the function getRLs, which follows the method used in C-MAP, is provided for reference only.
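The ranking idea can be illustrated with a minimal sketch. It is written in Python for illustration only and does not reproduce the getRLs function or the C-MAP preprocessing, whose details are not given here; the function names, the probe identifiers, and the values are hypothetical, and the sketch simply assumes that each profile is a mapping from probe-set identifier to an expression value sorted in descending order.

```python
def rank_probes(expression):
    """Return probe-set IDs ordered from highest to lowest expression value.

    `expression` maps a probe-set ID to a value such as a log-ratio.
    Ties are broken alphabetically by ID so the result is deterministic.
    """
    return sorted(expression, key=lambda probe: (-expression[probe], probe))


def ranks_from_order(ordered_ids):
    """Map each ID to its 1-based position in the ranked list."""
    return {probe: i for i, probe in enumerate(ordered_ids, start=1)}


# Hypothetical profile: four probe sets with made-up log-ratios.
profile = {"probe_a": 1.8, "probe_b": -0.4, "probe_c": 0.3, "probe_d": -2.1}
ranked = rank_probes(profile)
print(ranked)
print(ranks_from_order(ranked))
```

The most upregulated probe set ends up at the top of the list and the most downregulated one at the bottom, which is the form of input that the merging and scoring steps operate on.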

The function RankMerging of GeneExpressionSignature merges a group of ranked lists of differentially expressed genes into a single ranked list that represents a certain kind of biological response. The resulting list is referred to as a prototype ranked list (PRL). GeneExpressionSignature implements two rank-merging approaches for two cases: 1) ‘equally weighted’, in which all ranked lists with the same biological state are treated as equally important; and 2) ‘adaptively weighted’, in which each individual ranked list has its own weight.

For the first case, a simple but useful algorithm, named ‘equally weighted rank merging’, is implemented. The algorithm uses the average ranking technique and consists of two steps: calculating the average rank of each item across the ranked lists, and then constructing the final ranking from these averages.
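The two steps of average-rank merging can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the package's R implementation: the function name `merge_equal_weight` is hypothetical, all lists are assumed to contain exactly the same items, and ties in the average rank are broken alphabetically, which is a choice made here for determinism.

```python
def merge_equal_weight(ranked_lists):
    """Merge ranked lists over the same items by their average rank."""
    items = ranked_lists[0]
    if any(set(lst) != set(items) for lst in ranked_lists):
        raise ValueError("all ranked lists must contain the same items")

    # Step 1: average rank (1-based position) of every item across lists.
    total = {item: 0 for item in items}
    for lst in ranked_lists:
        for position, item in enumerate(lst, start=1):
            total[item] += position
    average = {item: total[item] / len(ranked_lists) for item in items}

    # Step 2: final ranking, ordered by average rank (ties broken by name).
    return sorted(items, key=lambda item: (average[item], item))


# Three hypothetical ranked lists for the same biological state.
lists = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g3", "g4"],
    ["g1", "g3", "g2", "g4"],
]
print(merge_equal_weight(lists))  # ['g1', 'g2', 'g3', 'g4']
```

Every list contributes equally to the average, which is why this variant suits replicate-like data.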

For the second case, an ‘adaptively weighted rank merging’ algorithm is implemented, which consists of an aggregating step and a merging step. In the aggregating step, the iterative rank-aggregating algorithm proposed by Iorio et al. (2010) is used. This graph-theory algorithm was originally presented for finding minimum spanning trees. In each iteration, the two closest ranked lists are selected and merged into one unified ranked list, and the iteration continues until all ranked lists are merged. In the merging step, a consensus-based voting algorithm (the Borda count) is used. Each ranked list acts as a voter and awards points to every item according to the item's rank, and the merged list is ordered by the total points. The aggregating step needs the distances among ranked lists, so measuring the distance between two ranked lists is the key step in the adaptively weighted algorithm. The most commonly used measurements are Spearman's Footrule distance and Kendall's tau distance, and both are provided by the GeneExpressionSignature package. Because computing Kendall's tau distance can be extremely slow for large data sets, the practicality of rank merging is limited by the size of the merging problem when Kendall's tau distance is used. Spearman's Footrule distance is therefore set as the default option for efficiency and is recommended in most cases.
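A minimal sketch of the pieces described above may help: the two distance measures, a Borda count merge of two lists, and the iterative merging of the closest pair. It is an illustrative Python sketch and not the package's R code or the exact procedure of Iorio et al. (2010); all function names are hypothetical, the lists are assumed to contain the same items, ties in the Borda points are broken alphabetically, and the Kendall's tau distance uses the naive pairwise count.

```python
from itertools import combinations


def positions(ranked):
    """Map each item to its 0-based position in a ranked list."""
    return {item: i for i, item in enumerate(ranked)}


def footrule(a, b):
    """Spearman's Footrule: sum of absolute position differences, O(n)."""
    pa, pb = positions(a), positions(b)
    return sum(abs(pa[item] - pb[item]) for item in a)


def kendall_tau(a, b):
    """Kendall's tau distance: number of item pairs ordered differently.

    This naive count inspects all pairs, so it is O(n^2).
    """
    pb = positions(b)
    return sum(1 for x, y in combinations(a, 2) if pb[x] > pb[y])


def borda_merge(a, b):
    """Borda count: each list awards n - position points; sort by total."""
    n = len(a)
    pa, pb = positions(a), positions(b)
    points = {item: (n - pa[item]) + (n - pb[item]) for item in a}
    return sorted(a, key=lambda item: (-points[item], item))


def merge_adaptive(ranked_lists, distance=footrule):
    """Repeatedly Borda-merge the two closest lists until one remains."""
    pool = [list(lst) for lst in ranked_lists]
    while len(pool) > 1:
        i, j = min(combinations(range(len(pool)), 2),
                   key=lambda ij: distance(pool[ij[0]], pool[ij[1]]))
        merged = borda_merge(pool[i], pool[j])
        pool = [lst for k, lst in enumerate(pool) if k not in (i, j)]
        pool.append(merged)
    return pool[0]


a = ["g1", "g2", "g3", "g4"]
b = ["g2", "g1", "g3", "g4"]
c = ["g4", "g3", "g2", "g1"]
print(footrule(a, b), kendall_tau(a, b))  # 2 1
print(merge_adaptive([a, b, c]))
```

Because the two most similar lists are merged first and then vote as a single list, a group of near-identical profiles does not dominate the final PRL. The pairwise loop in `kendall_tau` also shows why that distance becomes slow for genome-wide lists, whereas the Footrule distance needs only one pass.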

The choice of rank-merging algorithm depends on the experimental conditions of the expression data. In most cases, the conditions of the expression data to be merged are ‘unbalanced’ (e.g., several treatments with a single drug are available on one cell line, but only one or a few have been performed on the other cell lines); the adaptively weighted algorithm, which is the default option, should then be used. If the expression data come from a ‘balanced’ setting (i.e., replicates of the same experiment, or the same number of treatments with a drug on different cell lines), the equally weighted algorithm should be used.

Once ranked lists with the same biological state are merged into a single PRL, the functions ScoreGSEA and ScorePGSEA are used to measure the distance between PRLs of different biological states. These two functions use the GSEA (Subramanian et al., 2005) and PGSEA (Kim and Volsky, 2005) algorithms, respectively. The corresponding p values are also estimated: ScoreGSEA calculates the p value with Monte Carlo procedures (North et al., 2002), and ScorePGSEA estimates the p value with the method in the PGSEA package. To reduce noise in the PRLs further, Iorio et al. (2010) used only part of each PRL as the gene signature (the so-called “optimal signature”). Following this method, for a given ranked list y, GeneExpressionSignature takes the most upregulated gene subset p_y and the most downregulated gene subset q_y to form an optimal signature {p_y, q_y}. The Inverse Total Enrichment Score (TES) of PRLs x and y is then introduced as

\begin{align*} \mathrm{TES}_{\{x,y\}} = 1 - \frac{ES_x^{p_y} - ES_x^{q_y}}{2} \tag{1} \end{align*}

where $ES_x^{p_y}$ and $ES_x^{q_y}$ are the enrichment scores of p_y and q_y, respectively, with respect to x, and can be calculated by the GSEA or PGSEA algorithm. The total enrichment score (ES_x^{p_y} − ES_x^{q_y})/2 ranges from −1 to +1 and describes how many genes in the optimal signature {p_y, q_y} have similar rankings in x; Eq. (1) inverts it so that smaller values indicate greater similarity. Based on TES, Iorio et al. (2010) defined two distance measurements between PRLs: the Average Enrichment-Score Distance D_avg=(TES_{x,y}+TES_{y,x})/2, and the Maximum Enrichment-Score Distance D_max=Max{TES_{x,y},TES_{y,x}}. D_avg is more stringent than D_max, whereas D_max is more sensitive to weak similarities, with lower precision but larger recall. GSEA is more time consuming than PGSEA because of the repeated computation on permuted data sets. Therefore, GSEA is especially useful when the data set is small or moderate in size. Compared with GSEA, PGSEA requires less computation and is more convenient for large data sets. Both of these similarity-scoring methods are implemented. Because GSEA has already been accepted as a standard method, it is set as the default option in the GeneExpressionSignature package.
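The scoring scheme can be sketched in a few lines. This is an illustrative Python sketch, not the ScoreGSEA or ScorePGSEA implementation: it assumes a simplified, unweighted Kolmogorov-Smirnov-style running sum as the enrichment score (the GSEA statistic of Subramanian et al. (2005) is a weighted variant), it takes the signature as the top and bottom `size` genes of a PRL, and it writes D_avg and D_max exactly as given in the text. All function names and the gene labels are hypothetical.

```python
def enrichment_score(ranked, gene_set):
    """Simplified, unweighted running-sum enrichment score in [-1, 1].

    Walk down `ranked`, stepping up by 1/|set| on a hit and down by
    1/(N - |set|) on a miss, and return the deviation from zero with
    the largest magnitude. Assumes `gene_set` is a non-empty proper
    subset of `ranked`.
    """
    hits = set(gene_set)
    n_hit = len(hits)
    n_miss = len(ranked) - n_hit
    running, extreme = 0.0, 0.0
    for gene in ranked:
        running += 1.0 / n_hit if gene in hits else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def signature(ranked, size):
    """Signature {p, q}: the `size` top and `size` bottom genes of a PRL."""
    return ranked[:size], ranked[-size:]


def inverse_tes(x, y, size):
    """Eq. (1): 1 - (ES_x^{p_y} - ES_x^{q_y}) / 2."""
    p_y, q_y = signature(y, size)
    return 1 - (enrichment_score(x, p_y) - enrichment_score(x, q_y)) / 2


def d_avg(x, y, size):
    """Average Enrichment-Score Distance, as written in the text."""
    return (inverse_tes(x, y, size) + inverse_tes(y, x, size)) / 2


def d_max(x, y, size):
    """Maximum Enrichment-Score Distance, as written in the text."""
    return max(inverse_tes(x, y, size), inverse_tes(y, x, size))


x = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
y = list(reversed(x))
print(round(d_avg(x, x, 2), 6))  # identical PRLs: distance close to 0
print(round(d_avg(x, y, 2), 6))  # reversed PRLs: distance close to 2
```

With this simplified score, identical PRLs give a distance near 0 and fully reversed PRLs a distance near 2, which matches the direction of Eq. (1).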

Results

GeneExpressionSignature measures similarity based on gene expression profiles. To illustrate its use in an analysis of gene expression signatures, a subset of C-MAP (Lamb et al., 2006) was used as sample data. The preprocessed sample data are a collection of 50 genome-wide expression profiles from cultured human cells treated with 15 different small molecules. The 50 gene expression profiles were merged into 15 PRLs according to their biological states, and the distances between the PRLs were then computed. The example data from C-MAP form a very small set and are merged with adaptive weighting. All calculations were performed with the default options (the adaptively weighted algorithm and Spearman's Footrule distance for rank merging, and the GSEA algorithm for similarity scoring). Finally, from the distance matrix, the community structure of the associations among gene expression signatures can be revealed by the affinity propagation clustering algorithm and visualized with Cytoscape (Fig. 1).

FIG. 1.

Clustering result. The 15 drugs are divided into 3 groups based on the distances between their gene signatures. Rectangles (green), circles (red), and triangles (blue) represent the different groups of drugs. For readability, only the top 25% most significant (highest similarity score) links between drugs are kept.

These results are consistent with the findings of previous studies that drugs sharing a common mode of action (MOA) fall into the same groups (Iorio et al., 2010). This illustration provides valuable information for further analysis.

Conclusion and Discussion

We developed a gene expression profile-based tool GeneExpressionSignature to compute the distances among gene expression profiles. Analysis based on gene expression signatures is widely recognized as a powerful tool in translational bioinformatics, because it can quickly connect global biological events and provide valuable clues for experimental research. The GeneExpressionSignature package provides a complete solution for gene expression signature-based studies. The integration of typical rank-merging and similarity-scoring algorithms gives the package great potential in terms of biomedical applications, such as drug repurposing.

The current version of GeneExpressionSignature can be used only with data from the same platform, because the most downregulated and upregulated gene sets are taken as the gene signature (the example data are on the HG-U133A platform). Another limitation is that no standard pipeline is available for data preprocessing; the data ranking function getRLs in GeneExpressionSignature is provided for reference only. Finally, computing the distance metrics in the merging methods is time consuming for large data sets, particularly for Kendall's tau metric. Likewise, because of the Monte Carlo procedures, p value estimation with the GSEA algorithm is extremely slow. We plan to address these problems in future versions, which we believe will increase the practicality of GeneExpressionSignature.

Acknowledgments

This study was funded by the National Key Technologies R&D Program for New Drugs (2012ZX09301-003) and the National Natural Science Foundation of China (81102419, 81273488).

Author Disclosure Statement

The authors declare that no conflicting financial interests exist.

References

Dudley, Sirota, Shenoy, et al. 2011. Computational repositioning of the anticonvulsant topiramate for inflammatory bowel disease. Sci Transl Med, 3:96ra76.

Hu, Agarwal. 2009. Human disease-drug network based on genomic expression profiles. PLoS ONE, 4:e6536.

Iorio, Bosotti, Scacheri, et al. 2010. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci USA, 107:14621–14626.

Kim S-Y, Volsky. 2005. PAGE: Parametric analysis of gene set enrichment. BMC Bioinformat, 6:144.

Kutalik, Beckmann, Bergmann. 2008. A modular approach for integrative analysis of large-scale gene-expression and drug-response data. Nature Biotechnol, 26:531–539.

Lamb, Crawford, Peck, et al. 2006. The connectivity map: Using gene-expression signatures to connect small molecules, genes, and disease. Science, 313:1929–1935.

North, Curtis, Sham. 2002. A note on the calculation of empirical P values from Monte Carlo procedures. Am J Hum Genet, 71:439–441.

Sirota, Dudley, Kim, et al. 2011. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci Transl Med, 3:96ra77.

Subramanian, Tamayo, Mootha, et al. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 102:15545–15550.