Pathway-Based Functional Analysis of Metagenomes

Abstract

Metagenomic data enables the study of microbes and viruses through their DNA as retrieved directly from the environment in which they live. Functional analysis of metagenomes explores the abundance of gene families, pathways, and systems, rather than their taxonomy. Through such analysis, researchers are able to identify those functional capabilities most important to organisms in the examined environment. Recently, a statistical framework for the functional analysis of metagenomes was described that focuses on gene families. Here we describe two pathway level computational models for functional analysis that take into account important, yet unaddressed issues such as pathway size, gene length, and overlap in gene content among pathways. We test our models over carefully designed simulated data and propose novel approaches for performance evaluation. Our models significantly improve over the current approach with respect to pathway ranking and the computations of relative abundance of pathways in environments.

1. Introduction

Metagenomics is an increasingly prevalent approach for the study of microbial communities directly from the environment in which they live. Unlike in traditional microbiology, random DNA pieces (called reads), collected directly from the environment without a culturing stage, are sequenced. Avoiding the culturing stage makes it possible to study the vast majority of microbes on earth, more than 99% according to some estimates (Amann et al., 1995), which cannot be cultured. To date, metagenomics has been applied to the study of several environments and microbial functions (DeLong et al., 2006; Gill et al., 2006; Rusch et al., 2007; Tyson et al., 2004; Warnecke et al., 2007; Yooseph et al., 2007). Notable discoveries, including the identification of proteorhodopsin (Beja et al., 2000) and the discovery of photosystem I genes in viral genomes (Sharon et al., 2009a), were made using metagenomics.

Analysis of metagenomic data poses analytical challenges resulting from the short length of the DNA reads of which the data consists. Traditional Sanger sequencing generates reads of average length 900 bps; newer high-throughput sequencers produce even shorter reads, ranging from less than 100 bps (e.g., the Illumina Solexa and ABI SOLiD sequencers) to 500 bps (the 454 Life Sciences sequencer). Even with recent and expected advances in sequencing technology, read length is likely to remain a major issue in metagenomic analysis, one that requires novel computational methods different from those used for the analysis of complete genomes. Such methods have been emerging at an increasing rate lately, including methods and strategies for assembly, gene calling, community structure prediction, and more (Raes et al., 2007).

The functional analysis of metagenomes aims to identify those functional capabilities most significant to organisms living in the environment under study. Usually, analysis is done either at the single gene level, focusing on the abundance of gene families, or at the pathway level in which the occurrence of genes in pathways is taken into account. These processes start by identifying genes in the data and predicting their function, where function prediction is done by aligning the data against function-oriented databases. Such databases include COG (Tatusov et al., 2003), Pfam (Finn et al., 2006), and TIGRFAM (Haft et al., 2003) for gene level analysis, and KEGG (Kanehisa and Goto, 2000), MetaCyc (Caspi et al., 2008, 2010), or SEED (Overbeek et al., 2005) for systems or pathway level analysis¹. For each function, the function prediction process generates its read count, i.e., the number of reads associated with the function in the metagenome. Once determined, read counts can be used for computing the relative abundance of each function in the metagenome. Previous works ignored issues related to gene length and the minimum portion of a gene required in order to identify it (DeLong et al., 2006; Markowitz et al., 2006; Rodriguez-Brito et al., 2006) and estimated the relative abundance of each function f, both at the gene family and pathway levels, as the share of its read count among the read counts of all functions in the function database F:

\begin{align*} \mathrm{freq}(f) = \frac{\mathrm{read\_count}(f)}{\sum\limits_{f' \in F} \mathrm{read\_count}(f')} \tag{1} \end{align*}

We refer to this as the read count approach. It is straightforward when complete genomes are considered and the relative abundance of functions is computed based on gene count, namely the number of genes associated with the different functions. However, it results in inherently biased estimates when read counts are considered, due to the fact that longer genes are expected to have a higher read count simply due to their length. This problem is addressed in a recently published work (Sharon et al., 2009b) that presents a statistical framework for the functional analysis at the gene family level. The model presented in that article is based on the assumption that the number of reads beginning at each position across any genome is Poisson-distributed (Lander and Waterman, 1988). While this framework fits gene families, it may not be suitable as it is for functional analysis at the pathway level, most notably due to the presence of the same genes in several pathways.
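The read count approach of Equation 1 can be illustrated with a minimal Python sketch. The function name `read_count_frequencies` and the toy counts are hypothetical and serve only as an illustration; they are not taken from any of the cited tools.

```python
from typing import Dict


def read_count_frequencies(read_counts: Dict[str, int]) -> Dict[str, float]:
    """Equation 1: freq(f) = read_count(f) / sum of read counts over all f' in F."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any function")
    return {f: count / total for f, count in read_counts.items()}


# Hypothetical read counts for three functions (illustrative values only).
example_counts = {"f1": 30, "f2": 50, "f3": 20}
example_freqs = read_count_frequencies(example_counts)
```

Note that nothing in this computation depends on gene length, which is exactly the source of the bias discussed above.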

Functional analysis at the pathway level is mainly used for two purposes: computation of pathway relative abundance, and pathway content comparison. Computing relative abundance of pathways within a single sample provides an overall view of the environment and was used in many studies and platforms (Dinsdale et al., 2008; Edwards et al., 2006; Overbeek et al., 2005; Rodriguez-Brito et al., 2006). Comparing pathways' abundance between samples makes it possible to identify pathways that are enriched within one of the environments with respect to the other (DeLong et al., 2006; Edwards et al., 2006). Derivatives of pathway content comparison may be used for clustering functionally similar environments using metrics over pathway abundances vectors (DeLong et al., 2006; Feingersch et al., 2010).

Pathway reconstruction is a related problem in which the most likely set of pathways in a genome or a metagenome is determined, without estimating their abundance. A commonly used naive approach to this problem would be to collect all pathways with at least one representative in the data. However, this approach is expected to yield an inflated list of pathways. Recently, a method called MinPath was described that attempts to deduce the minimal set of pathways required for supporting an observed set of functions (Ye and Doak, 2009). The method uses Integer Programming for deciding whether a pathway is present, based on the observed functions. Note that in this case the relative abundance of the different functions is not taken into account, and no estimation of the relative abundance of the different pathways is done.

Here, we present two models for the functional analysis of metagenomes at the pathway level. Both models ignore pathway topology and treat pathways as gene sets. We begin with a short description of the model described in Sharon et al. (2009b) and deduce the independent pathways model that can be regarded as a natural extension of the previous work. Next, we present the pathway intersection model that takes into account the co-occurrences of genes in more than one pathway. We test both models on synthetic data and compare the results to the currently used read-count approach. Our tests focus on the above-mentioned two common functional analysis tasks, namely sample comparison and the computation of relative abundance of pathways in the environment.

2. Methods

2.1. The Poisson model for computing gene family abundance

A metagenome M is a set of R sequence reads of length r each, extracted randomly with uniform probability for all positions across all genomes from some DNA sample of size L bps. A gene family G represents a set of functionally similar genes, which can be defined, for example, via sequence similarity. COG, Pfam, and other databases are often used as references for the identification of gene families in metagenomic data. We denote a collection of gene families by D^GENE; the association between M's reads and gene families is defined in terms of the read count, R_G, representing the number of reads (out of R) carrying a detectable portion of one of G's members. Assuming that the abundance of a gene family G∈D^GENE in the DNA pool is C_G (i.e., the DNA sample has C_G copies of genes that are members of G), the read count, R_G, is Poisson distributed with mean λ_G (Sharon et al., 2009b):

\begin{align*} \Pr(R_G = k) = \mathrm{Poisson}(k; \lambda_G) = \frac{\lambda_G^k \cdot e^{-\lambda_G}}{k!} \tag{2} \end{align*}

where

\begin{align*} \lambda_G = \frac{R}{L} (r + L_G - 2T) \cdot C_G \tag{3} \end{align*}

In this formula, R/L is the rate of read starts per base pair. The term (r + L_G − 2T) reflects the average number of starting positions for reads carrying a detectable portion of a single copy of G, where L_G is the average length of G's members, T is the minimum portion of a gene required to be present on a read in order to be associated with its family, and r is the read length.
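As a worked illustration of Equations 2 and 3, the following Python sketch computes the Poisson mean and the probability of a given read count. The function names are hypothetical, and the numeric values (in particular the sample size L and the copy number C_G) are assumptions chosen only to make the arithmetic easy to follow.

```python
import math


def expected_read_count(R: int, L: float, r: int, L_G: float, T: int, C_G: float) -> float:
    """Equation 3: lambda_G = (R / L) * (r + L_G - 2T) * C_G."""
    return (R / L) * (r + L_G - 2 * T) * C_G


def poisson_pmf(k: int, lam: float) -> float:
    """Equation 2: Pr(R_G = k) for a Poisson mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Assumed toy values: 100,000 reads of 900 bps from a 250 Mbps DNA pool,
# a gene family of average length 1000 bps present in 10 copies, T = 100.
lam = expected_read_count(R=100_000, L=250_000_000, r=900, L_G=1000, T=100, C_G=10)
prob_of_five_reads = poisson_pmf(5, lam)
```

With these assumed values the rate of read starts is 0.0004 per base pair and each gene copy offers 1700 detectable starting positions, so a longer gene with the same copy number would receive a proportionally larger mean read count.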

An estimator for a gene read count, $\hat{R}_G$, can be computed using BLAST (Altschul et al., 1990) with a certain threshold. A Maximum Likelihood Estimate (MLE) $\hat{C}_G$ for C_G can be calculated from Equation 3 and $\hat{R}_G$:

\begin{align*} \hat{C}_G = \frac{\hat{R}_G}{\frac{R}{L} \cdot (r + L_G - 2T)} \tag{4} \end{align*}

All parameters in this formula are known, except for L, and hence an explicit calculation of gene family abundance is impossible. In previous work (Sharon et al., 2009b), the above formula was used to compute frequency estimators for gene families, which is the relative abundance of a certain gene family out of the total abundance of all gene families in the DNA sample pool (which eliminates the dependency on L)². In this article, we resolve the problem of the unknown DNA sample length L by computing the abundance of a gene family per organism in the sample, instead of the absolute abundance. This requires an estimation of the average genome length in the DNA sample, as shown next.

2.2. Estimating the average genome length in the DNA sample

The estimation of the average genome length relies on a group of genes that are known to be present exactly once per genome in all bacterial species. Several known single-copy genes, such as bacterial rpoB, recA, and gyrA, have been used both as phylogenetic markers (Mollet et al., 1997; Venter et al., 2004) and for the normalization of the abundance of genes in metagenomic samples (Howard et al., 2006; Loy et al., 2009; Rusch et al., 2007; Venter et al., 2004; Yutin et al., 2007). Notably, several other approaches make it possible to estimate the average genome length without relying on single-copy genes, either experimentally (Grossart et al., 2000) or computationally (Angly et al., 2009).

In the case of a single-copy gene SCG, the number of copies in the entire DNA sample, C_SCG, is equal to the number of organisms in the sample, N₀; hence, it is possible to deduce an MLE for the average genome length based on Equation 4:

\begin{align*} \frac{L}{N_0} \approx \frac{L}{\hat{C}_{SCG}} = \frac{R}{\hat{R}_{SCG}} (r + L_{SCG} - 2T) \tag{5} \end{align*}

A more accurate estimation of the average genome length is achieved by averaging the estimated values for several single copy genes.

Using the estimated average genome length together with Equation 4, the abundance of a gene family G per organism in the DNA sample can be calculated as follows:

\begin{align*} \frac{\hat{C}_G}{N_0} = \left( \frac{L}{N_0} \right) \cdot \frac{\hat{R}_G}{R \cdot (r + L_G - 2T)} \tag{6} \end{align*}
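Equations 5 and 6 can be sketched together in Python. The function names are hypothetical, and the single-copy gene read counts below are invented for illustration; the gene lengths are the averaged values quoted in Section 2.5.

```python
from statistics import mean
from typing import Dict


def average_genome_length(R: int, r: int, T: int,
                          scg_read_counts: Dict[str, int],
                          scg_lengths: Dict[str, float]) -> float:
    """Equation 5, averaged over all single-copy genes with at least one read."""
    estimates = [
        (R / count) * (r + scg_lengths[gene] - 2 * T)
        for gene, count in scg_read_counts.items()
        if count > 0
    ]
    if not estimates:
        raise ValueError("no single-copy gene was observed in the reads")
    return mean(estimates)


def abundance_per_organism(R_G: int, L_G: float, R: int, r: int, T: int,
                           genome_length: float) -> float:
    """Equation 6: copies of gene family G per organism."""
    return genome_length * R_G / (R * (r + L_G - 2 * T))


# Lengths from Section 2.5; the read counts are hypothetical.
scg_lengths = {"gyrA": 2670, "recA": 1040, "rpoB": 3520}
scg_read_counts = {"gyrA": 139, "recA": 74, "rpoB": 173}
genome_length = average_genome_length(100_000, 900, 100, scg_read_counts, scg_lengths)
copies = abundance_per_organism(R_G=68, L_G=1000, R=100_000, r=900, T=100,
                                genome_length=genome_length)
```

A useful sanity check on any implementation is that a single-copy gene, evaluated with the genome length estimated from that gene alone, yields an abundance of exactly one copy per organism.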

2.3. Computing pathway abundance: the independent pathways model

In the context of the current analysis, a pathway P is defined as a set of gene families $P = \{G_1^P, \ldots, G_m^P\}$, $G_i^P \in D^{GENE}$. Several repositories of pathways exist, for example, KEGG and MetaCyc, and they can be used in this study. We denote a collection of pathways by D^PATH.

The independent pathways model assumes that all gene families within a certain pathway, P, have the same number of occurrences (i.e., $C_1^P = C_2^P = \ldots = C_m^P$), and refers to this number of occurrences as the abundance of the pathway, C^P. In this section, we assume that pathways' abundances in an organism are mutually independent (Fig. 1a). Analogously to the case of gene families, for each pathway P∈D^PATH our goal here is to compute the abundance of the pathway per organism, denoted by W^P. Based on the latter assumptions, for each pathway P an estimation of its abundance per organism can be calculated by averaging the estimated abundance of the member gene families:

\begin{align*} W^P = \frac{\hat{C}^P}{N_0} = \left( \frac{L}{N_0} \right) \cdot \frac{1}{m} \sum_{i=1}^m \frac{\hat{R}_{G_i^P}}{R \cdot \left( r + L_{G_i^P} - 2T \right)} \tag{7} \end{align*}

FIG. 1.

(a) The independent pathways model. In this model, a gene that is shared among several pathways is assumed to have a copy for each pathway in which it appears. For example, G₅ belongs to three pathways and is thus assumed to have three copies. (b) The pathway intersection model. Each gene that appears in one or more pathways is assumed to appear once. In this case, G₅ will have a single copy, shared between P², P³, and P⁴.

Note that it is also possible to express the relative abundance of a pathway with respect to all other pathways in a sample by dividing W^P by the sum of Wⁱ for all Pⁱ∈D^PATH. In this case, the estimation for the average genome length (L/N₀) is eliminated.
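A minimal Python sketch of Equation 7 and of the normalization described above is given here. The function names and the dictionary-based data layout are assumptions made for illustration; genes with no observed reads are assumed to contribute a read count of zero.

```python
from typing import Dict, Iterable


def independent_pathway_abundance(members: Iterable[str],
                                  read_counts: Dict[str, int],
                                  gene_lengths: Dict[str, float],
                                  R: int, r: int, T: int,
                                  genome_length: float) -> float:
    """Equation 7: average per-organism abundance of the pathway's gene families."""
    members = list(members)
    if not members:
        raise ValueError("a pathway must contain at least one gene family")
    per_gene = [
        read_counts.get(g, 0) / (R * (r + gene_lengths[g] - 2 * T))
        for g in members
    ]
    return genome_length * sum(per_gene) / len(members)


def relative_abundance(abundances: Dict[str, float]) -> Dict[str, float]:
    """Divide each W^P by the sum over all pathways; L/N0 cancels out here."""
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("all pathway abundances are zero")
    return {p: w / total for p, w in abundances.items()}
```

Because the average genome length multiplies every W^P by the same factor, calling `relative_abundance` on values computed with any positive `genome_length` gives the same result, which is why the estimate of L/N₀ is not needed for relative abundances.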

2.4. Computing pathway abundance: the pathways intersection model

Pathways—being a descriptive tool—are not necessarily disjoint modules, but rather they share common proteins. Ignoring the overlap in gene family content between pathways may lead the method of Section 2.3 to overestimate the abundance of pathways that share proteins with other pathways. Here, we describe a second model that accounts for non-empty pathway intersections by jointly computing the abundance of all pathways within a collection of pathways.

The pathways intersection model assumes that a given pathway Y is either present or absent in an organism in the sample, where the presence of the pathway entails the presence of all of its member gene families in the organism. We denote by W^Y the random Boolean variable that represents the presence of a pathway Y in an organism. The probability that a gene family G is present in the genome of the organism is given by:

\begin{align*} P(G \mid W) = 1 - \prod_{\{Y \in D^{PATH} \mid G \in Y\}} [1 - P(W^Y = 1)] \tag{8} \end{align*}

The abundance of G in the sample, C_G, is deduced by multiplying this probability by the number of organisms in the sample, N₀. Consequently the read count, R_G, is Poisson distributed with the following mean:

\begin{align*} \lambda_G = \frac{R}{L} (r + L_G - 2T) \cdot \left( 1 - \prod_{\{Y \in D^{PATH} \mid G \in Y\}} [1 - P(W^Y = 1)] \right) \cdot N_0 \tag{9} \end{align*}

This can be computed for various estimates of the W variables, using the estimated average of the genome lengths in the sample (see Section 2.2).
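Equations 8 and 9 can be sketched in Python as follows. The function names and the representation of pathways as sets of gene identifiers are assumptions for illustration. The sketch uses the identity (R/L)·N₀ = R/(L/N₀), so that only the estimated average genome length is required.

```python
from typing import Dict, Set


def gene_presence_probability(gene: str,
                              pathways: Dict[str, Set[str]],
                              p_present: Dict[str, float]) -> float:
    """Equation 8: one minus the probability that every pathway containing the gene is absent."""
    prob_all_absent = 1.0
    for name, members in pathways.items():
        if gene in members:
            prob_all_absent *= 1.0 - p_present[name]
    return 1.0 - prob_all_absent


def intersection_read_mean(gene: str, L_G: float,
                           pathways: Dict[str, Set[str]],
                           p_present: Dict[str, float],
                           R: int, r: int, T: int,
                           genome_length: float) -> float:
    """Equation 9, with (R / L) * N0 rewritten as R / (L / N0)."""
    presence = gene_presence_probability(gene, pathways, p_present)
    return (R / genome_length) * (r + L_G - 2 * T) * presence


# Toy layout in the spirit of Fig. 1b: G5 is shared by three pathways.
toy_pathways = {"P2": {"G4", "G5"}, "P3": {"G5", "G6"}, "P4": {"G5"}}
toy_p = {"P2": 0.5, "P3": 0.5, "P4": 0.5}
g5_presence = gene_presence_probability("G5", toy_pathways, toy_p)
```

In this toy layout the shared gene G5 is present with probability 1 − 0.5³ = 0.875, rather than being counted once per pathway as under the independent pathways model.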

Using the observed number of reads, $\hat{R}_G$, we estimate $P(W^Y = 1 \mid \hat{R}_G)$ via a Markov Chain Monte Carlo (MCMC) posterior sampling. We assume a uniform prior for P(W), and estimate $P(\hat{R}_G \mid W^Y = 1)$ using Equation (2), with λ_G given by Equation (9). The average of the obtained samples is used as the estimated posterior probability for the presence of each pathway in an organism.
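The text does not specify the sampler, so the following Python sketch should be read as one possible realization rather than the procedure actually used: a single-site random-walk Metropolis sampler over the presence probabilities P(W^Y = 1), with a uniform prior on [0, 1] and the Poisson likelihood of Equations 2 and 9. All function names, the step size, the iteration counts, and the starting point of 0.5 are assumptions. Every gene in `read_counts` is assumed to belong to at least one pathway.

```python
import math
import random
from typing import Dict, Set


def log_poisson(k: int, lam: float) -> float:
    """Log of Equation 2; a zero mean only explains a zero read count."""
    if lam <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_likelihood(p: Dict[str, float], pathways: Dict[str, Set[str]],
                   read_counts: Dict[str, int], gene_lengths: Dict[str, float],
                   R: int, r: int, T: int, genome_length: float) -> float:
    """Sum of log Poisson terms over genes, with means given by Equation 9."""
    total = 0.0
    for gene, k in read_counts.items():
        absent = 1.0
        for name, members in pathways.items():
            if gene in members:
                absent *= 1.0 - p[name]
        lam = (R / genome_length) * (r + gene_lengths[gene] - 2 * T) * (1.0 - absent)
        total += log_poisson(k, lam)
    return total


def sample_pathway_presence(pathways, read_counts, gene_lengths, R, r, T,
                            genome_length, n_iter=5000, burn_in=1000,
                            step=0.1, seed=0) -> Dict[str, float]:
    """Return the posterior mean of P(W^Y = 1) for each pathway."""
    rng = random.Random(seed)
    names = sorted(pathways)
    p = {name: 0.5 for name in names}
    args = (pathways, read_counts, gene_lengths, R, r, T, genome_length)
    current = log_likelihood(p, *args)
    sums = {name: 0.0 for name in names}
    kept = 0
    for it in range(n_iter):
        name = rng.choice(names)
        proposal = p[name] + rng.gauss(0.0, step)
        # Proposals outside [0, 1] have zero prior mass and are rejected.
        if 0.0 <= proposal <= 1.0:
            old = p[name]
            p[name] = proposal
            candidate = log_likelihood(p, *args)
            delta = candidate - current
            if delta >= 0 or rng.random() < math.exp(delta):
                current = candidate
            else:
                p[name] = old
        if it >= burn_in:
            for n in names:
                sums[n] += p[n]
            kept += 1
    return {n: sums[n] / kept for n in names}
```

A fixed seed is used so that runs are reproducible; in practice one would also check convergence, for example by comparing several chains started from different points.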

2.5. Materials

In order to test both our models, we have generated five synthetic metagenomes based on simulated organisms, with different community complexities and metagenome sizes.

Generating organisms. We have generated two sets of organisms, KEGG10 and KEGG125, consisting of 10 and 125 synthetic species, respectively. First, the number of pathways and frequency of “dummy genes” (i.e., genes that do not belong to any pathway and that were chosen at random) were chosen either manually (KEGG10) or at random using a normal distribution with manually set parameters (KEGG125). Next, that number of pathways was selected at random from the KEGG database. Having done that, all genes from the selected pathways and the dummy genes were placed at random on the genome, using lengths as they appear in KEGG (for enzymes) or 1000 bps (for dummy genes). In addition to these genes, three single-copy genes, gyrA, recA, and rpoB, were also located randomly on each genome using their true lengths, averaged over instances from several bacterial genomes (2670, 1040, and 3520 bps, respectively). Note that our simulated data is based on the pathway intersection model (see Section 2.4), namely a single copy for every gene that appears in at least one pathway. Overall, the average genome length, number of pathways, and frequency of dummy genes were 2.5 Mbps, 57, and 75% (respectively) for KEGG10, and 2.9 Mbps, 81, and 68% for KEGG125. (The choice of parameters was made in accordance with metagenomes in the IMG/M system [Markowitz et al., 2006]).

Generating populations. For each simulated population, a different organisms' prevalence and a different population structure were used. Population complexity, which refers to the relative abundance among species, was either high (similar abundance for most species) or low (a few relatively dominant species, low abundance for the rest).

Metagenome generation. The number of reads per metagenome was manually set; read length (r) and minimum detectable gene portion (T) were set to 900 (typical of Sanger sequencing [Rusch et al., 2007]) and 100 (corresponds to e-value ≈ 1e-100 in BLAST) base pairs, respectively. The number of reads per species is proportional to its DNA share in the population, defined as (genome length × frequency in the population)/(sum of (genome length × frequency in the population) over all species).
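The DNA share used for allocating reads to species can be sketched in Python as follows. The function name and the toy population are hypothetical, and the sketch returns expected (fractional) read counts rather than reproducing the random sampling of an actual simulator.

```python
from typing import Dict


def expected_reads_per_species(genome_lengths: Dict[str, float],
                               frequencies: Dict[str, float],
                               total_reads: int) -> Dict[str, float]:
    """Allocate reads in proportion to genome length times population frequency."""
    weights = {s: genome_lengths[s] * frequencies[s] for s in genome_lengths}
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("population has no DNA")
    return {s: total_reads * w / total_weight for s, w in weights.items()}


# Two equally frequent species; the one with the longer genome gets more reads.
toy_reads = expected_reads_per_species(
    {"species_A": 2_000_000, "species_B": 4_000_000},
    {"species_A": 0.5, "species_B": 0.5},
    total_reads=90_000,
)
```

The example shows why read share and organism frequency differ: at equal frequency, the species with twice the genome length contributes twice as many reads.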

2.6. Evaluation of the different methods

Functional comparison. In this test, the quality of each method with respect to pathway-based functional comparison of two samples is evaluated. Given two metagenomes, M and M', and a method for pathway abundance estimation, the frequency of each pathway in both M and M' is estimated, and the absolute difference between the two frequencies is computed. Next, pathways are ranked based on their differential enrichment, and the intersection between the true and estimated most differentially enriched pathways is computed for every prefix size m (≤100).
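The prefix-intersection measure described above can be sketched in Python. The function names are hypothetical, and breaking ties by pathway name is an assumption, since the text does not say how ties in the ranking are handled.

```python
from typing import Dict, Set


def differential_enrichment(freq_a: Dict[str, float],
                            freq_b: Dict[str, float]) -> Dict[str, float]:
    """Absolute difference between a pathway's frequencies in two metagenomes."""
    pathways = set(freq_a) | set(freq_b)
    return {p: abs(freq_a.get(p, 0.0) - freq_b.get(p, 0.0)) for p in pathways}


def top_m(scores: Dict[str, float], m: int) -> Set[str]:
    """The m highest-scoring pathways; ties are broken by name (an assumption)."""
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return set(ranked[:m])


def prefix_overlap(true_scores: Dict[str, float],
                   estimated_scores: Dict[str, float], m: int) -> int:
    """Number of pathways shared by the true and estimated top-m lists."""
    return len(top_m(true_scores, m) & top_m(estimated_scores, m))
```

Calling `prefix_overlap` for each m up to 100 and dividing by m gives the normalized agreement curve; the same function applies to rankings by estimated frequency within a single sample.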

Pathway abundance estimations. For each method, pathways are ranked based on their estimated frequencies. Similarly to the case of functional comparison, we use the number of pathways that are common to both the true and estimated m most abundant pathways as a measure of quality. The hypergeometric distribution was used in order to evaluate the statistical significance of the results. In short, the probability that the intersection between the two lists of size m contains exactly k pathways is given by

\begin{align*} \Pr(X = k) = \mathrm{Hypergeometric}(k; N, m, n) = \frac{\binom{m}{k} \binom{N-m}{n-k}}{\binom{N}{n}} \tag{10} \end{align*}

where N is the total number of pathways and n = m is the prefix size. The significance of the observed k is given by the Hyper-Geometric Tail (HGT):

\begin{align*} \Pr(X \geq k) = \sum_{i=k}^{m} \mathrm{Hypergeometric}(i; N, m, n) \tag{11} \end{align*}
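Equations 10 and 11 translate directly into Python using the standard-library function `math.comb` (available from Python 3.8). The function names are hypothetical; the value N = 250 in the example is the number of pathways reported for the simulation in the legend of Fig. 4, and k is assumed to be non-negative.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> float:
    """Equation 10: probability that the two lists share exactly k pathways."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def hypergeom_tail(k: int, N: int, m: int, n: int) -> float:
    """Equation 11: probability of observing an intersection of size k or more."""
    return sum(hypergeom_pmf(i, N, m, n) for i in range(k, min(m, n) + 1))


# Significance of sharing 8 of the top 10 pathways out of N = 250.
p_value = hypergeom_tail(8, N=250, m=10, n=10)
```

Exact integer binomials avoid the underflow that a floating-point product of factorials would suffer for large N.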

In addition to the above, we have also used the Pearson correlation coefficient for evaluating the degree of agreement between the lists of true and estimated frequencies.

3. Results

To evaluate the performance of our methods in predicting pathway abundances, we generated synthetic metagenome data with various community complexities and sizes (see Section 2.5; Table 1). Our tests focus on two of the most interesting tasks in the context of metagenomics: (i) comparing pathways' abundance between samples and (ii) computing relative abundance of pathways within a single sample. As a baseline, we compared the performance of our methods to that of a standard read-count approach, estimating the relative abundance of each pathway as the relative abundance of its read counts out of the total number of read counts in all considered pathways (Equation 1).

Table 1.

General Information on Simulated Metagenomes

Metagenome	Organisms	Population complexity (% of most abundant species)	No. of reads
M1	KEGG10	High (10%)	100,000
M2	KEGG10	Low (50%)	100,000
M3	KEGG125	High (1.4%)	100,000
M4	KEGG125	Low (10.8%)	100,000
M5	KEGG125	Low (10.8%)	10,000

To evaluate the performance of the various prediction methods on the task of function comparison, we compared all pairs of metagenomes and evaluated the resulting lists (Fig. 2; also see Fig. 5 in the Appendix). The pathway intersection model showed superior performance over the other models in six out of 10 scenarios (Fig. 2a). The independent pathways model performed slightly better than the read-count model in most cases. The relatively low improvement in performance in this task is somewhat expected, since our models aim to correct biases in the estimation of pathway abundances introduced by differences in pathway size and gene lengths, while these biases are largely eliminated when comparing the same pathway over two samples.

FIG. 2.

Agreement between true and estimated lists of most differentially enriched pathways in selected pairs of metagenomes. For each m ≤ 100 (x-axis), the y-axis shows the number of pathways (normalized by m) that are ranked as being among the m most differentially enriched pathways between two simulated metagenomes, by both the true ranking and the predicted ranking by the various prediction methods. Results for differently enriched pathways between metagenomes M1 and M5 (left) and M3 and M4 (right) are presented. The Appendix includes the results for the remaining metagenomes.

Since the original aim of the read-count method was to address the above task of computing changes in gene set abundances across metagenomes, it is not suitable for computing the relative abundance of pathways in a sample, as it does not account for differences in pathway sizes (an inherent factor for this task). Therefore, as a baseline for assessing the performance of our methods here, we implemented a fourth method, the normalized read-count, that is based on read counts but also accounts for pathway sizes. The relative abundance of a pathway in this method is given by:

\begin{align*} \mathrm{freq}(P) = \frac{\mathrm{read\_count}(P) / \mathrm{size}(P)}{\sum\limits_{P' \in D^{PATH}} \mathrm{read\_count}(P') / \mathrm{size}(P')} \tag{12} \end{align*}
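The normalized read-count of Equation 12 can be sketched in Python as follows. The function name is hypothetical, and size(P) is assumed here to be the number of gene families in the pathway, which the text does not state explicitly.

```python
from typing import Dict


def normalized_read_count_frequencies(read_counts: Dict[str, int],
                                      sizes: Dict[str, int]) -> Dict[str, float]:
    """Equation 12: per-gene read counts, normalized over all pathways."""
    per_gene = {p: read_counts[p] / sizes[p] for p in read_counts}
    total = sum(per_gene.values())
    if total == 0:
        raise ValueError("no reads were assigned to any pathway")
    return {p: v / total for p, v in per_gene.items()}


# Two pathways with equal read counts but different sizes (toy values).
toy_freqs = normalized_read_count_frequencies(
    {"P1": 100, "P2": 100}, {"P1": 10, "P2": 40}
)
```

Under the plain read-count approach the two toy pathways would each receive a frequency of 0.5; after dividing by size, the smaller pathway is ranked as four times more abundant.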

To evaluate the performance of the various methods in predicting the relative abundance of pathways across a single sample, we tested the agreement between the rankings of pathways based on their true abundances and predicted abundances by the various methods (Fig. 3; also see Fig. 6 in the Appendix). Quite expectedly, the performance of the read-count method is significantly worse than that of the other methods. A significant improvement is achieved by the normalized read-counts, taking into account pathway sizes. An additional marked improvement is achieved by the independent pathway model, accounting for variation in gene lengths. The performance of the pathway intersection method is inferior to that of the independent pathways method when ranking sets of highly abundant pathways. On the other hand, the performance of the pathway intersection method is superior to all other methods when considering pathways with lower abundances.

FIG. 3.

Ranking pathways based on their enrichment in metagenome M3. (Left) The intersection between the true and predicted m most abundant pathways using the various prediction methods (y-axis), for different values of m (x-axis). (Right) Statistical significance (hyper-geometric p-values) for the intersection between the true and predicted highly abundant pathway sets shown in the left panel. Other simulated metagenomes exhibited similar behavior (see Fig. 6 in the Appendix).

To further evaluate the performance of the various methods in predicting relative abundance of pathways across a single sample, we computed the Pearson correlation coefficient between the true and predicted relative abundances (Fig. 4a). Consistent with its poor performance in the ranking test, the read-count method shows no correlation with the true pathway abundances across all metagenomes. The independent pathways and the pathway intersection methods perform better than or equal to the normalized read-count method in all cases. In particular, the pathway intersection method outperforms the other approaches when low-abundance pathways are considered (Fig. 4b), as also shown above in the ranking tests. The success of the pathway intersection method on rare (or missing) pathways may be due to the fact that it does not count a gene common to several pathways multiple times, while the other methods do. Frequencies assigned by the other models will therefore be higher than the true frequencies; while this also happens with abundant pathways, its influence on rare pathways is much higher. The pathway intersection model does not suffer from this bias.

FIG. 4.

Pearson correlation between predicted and true pathway abundances across the various metagenomes. Correlations obtained for the entire set of 250 pathways used in the simulation (left) and correlations obtained for the set of 150 less abundant pathways (right).
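The correlation test can be sketched as follows. The abundance values are hypothetical and the helper names `pearson` and `pearson_on_least_abundant` are introduced here for illustration; the restriction to the k least abundant pathways mirrors the right-hand panel, under the assumption that the subset is chosen by true abundance.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def pearson_on_least_abundant(true, predicted, k):
    """Correlation restricted to the k pathways with the lowest true abundance."""
    keep = sorted(true, key=true.get)[:k]
    return pearson([true[p] for p in keep], [predicted[p] for p in keep])

# Hypothetical relative abundances; the predictor overestimates the rarest pathway.
true_abundance = {"P1": 0.40, "P2": 0.30, "P3": 0.20, "P4": 0.06, "P5": 0.04}
predicted_abundance = {"P1": 0.38, "P2": 0.29, "P3": 0.19, "P4": 0.06, "P5": 0.08}

names = list(true_abundance)
r_all = pearson([true_abundance[p] for p in names],
                [predicted_abundance[p] for p in names])
r_low = pearson_on_least_abundant(true_abundance, predicted_abundance, k=3)
```

With these toy values the correlation over the low-abundance subset is lower than over the full set, which is the kind of gap the two panels are designed to expose.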

FIG. 5.

Agreement between true and estimated lists of most differentially enriched pathways in pairs of metagenomes. Refer to the legend of Figure 2 for description. Read count (blue), independent pathways (green), and pathway intersection (red) models are compared.

FIG. 6.

Ranking pathways based on their abundances in metagenomes M1, M2, M4, and M5. Refer to the legend of Figure 3 for description. Read count (blue), normalized read-count (cyan), independent pathways (green), and pathway intersection (red) models are compared.

4. Conclusion

In this work, we have proposed two models for functional analysis of metagenomes at the pathway (systems) level, reflecting two different assumptions regarding the sharing of genes among pathways. The two models eliminate biases resulting from variation in the number of genes across pathways, as well as biases resulting from variation in gene lengths (Sharon et al., 2009b). Our methods performed much better than the read-count method in predicting the relative abundance of pathways. Each of our two methods was shown to have its own strength: the pathway intersection method outperforms the other approaches in predicting pathway abundances when focusing on low-abundance pathways, whereas the independent pathways method is superior in ranking highly abundant pathways. Both our methods performed only slightly better than the read-count method when used for functional comparison, despite the failure of the latter in the second task of predicting the absolute frequencies of the different pathways. One possible explanation for this behavior is that frequency estimation biases of specific pathways tend to be similar in both compared datasets and thus cancel each other when computing the relative abundances. For example, the relative abundance of a gene family or a pathway whose members are relatively long is likely to be overestimated by the read-count method in both samples. Such mutual compensation does not hold in the general case, suggesting that a more robust method is needed.
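The cancellation argument can be made concrete with a toy calculation. The sketch below assumes, for illustration only, that the expected read count of a pathway is its coverage times its total gene length divided by the read length, and that both samples are sequenced to the same depth; all names and numbers are hypothetical.

```python
READ_LENGTH = 100  # hypothetical read length in bp

# Hypothetical total gene length per pathway (bp) and true coverage per sample.
lengths = {"P1": 3000, "P2": 1000}
true_s1 = {"P1": 1.0, "P2": 2.0}
true_s2 = {"P1": 2.0, "P2": 2.0}

def expected_reads(true_abundance):
    """Expected raw read count: coverage * pathway length / read length."""
    return {p: true_abundance[p] * lengths[p] / READ_LENGTH for p in true_abundance}

reads_s1 = expected_reads(true_s1)
reads_s2 = expected_reads(true_s2)

# Within one sample, raw counts rank the longer pathway P1 above P2,
# although P1 is the less abundant of the two.
misranked_within_sample = (reads_s1["P1"] > reads_s1["P2"]
                           and true_s1["P1"] < true_s1["P2"])

# Between samples, the length factor appears in numerator and denominator
# and cancels, so the fold change from raw counts matches the true one.
fold_change_reads = {p: reads_s2[p] / reads_s1[p] for p in lengths}
fold_change_true = {p: true_s2[p] / true_s1[p] for p in lengths}
```

The cancellation is exact here only because the bias is a fixed multiplicative factor per pathway and the depth is identical in both samples; once counts are normalized by sample totals or the bias differs between samples, it no longer holds exactly.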

The pathway intersection method relies on the availability of single-copy genes that are present in the vast majority of species in the studied environment. Single-copy genes have been used in the past as phylogenetic markers (Yooseph et al., 2007) and for estimating gene abundance (Loy et al., 2009; Rusch et al., 2007; Yutin et al., 2007). There are several families of single-copy genes that are known to be present across all known bacterial species, but these families are not present in Archaea and Eukaryotes. Hence, the pathway intersection method is more appropriate for environments in which the vast majority of sampled microbes are bacteria, such as marine environments, but is likely to yield skewed frequencies when applied to environments in which either Archaeal or Eukaryotic species are abundant (such as acid mine drainage).

Functional characterization of metagenomic data, such as that discussed in this study, depends first and foremost on the quality of the employed pathway annotation data. Specifically, all pathway analysis methods rely on the basic assumption that a pathway is a coherent functional module that is either entirely present or entirely absent in an organism. However, pathways defined in databases such as KEGG and MetaCyc do not fully meet this requirement and, in many cases, have only a fraction of their genes actually present in many species. Future advances in pathway curation are expected to significantly improve the outcome of the presented methods.
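One simple way to check how far a pathway definition departs from the all-or-none assumption is to measure the fraction of its genes found in a genome. The following is a minimal sketch with hypothetical gene identifiers; the function name `pathway_completeness` is introduced here for illustration and does not refer to any KEGG or MetaCyc tool.

```python
def pathway_completeness(pathway_genes, genome_genes):
    """Fraction of the pathway's genes that are present in the genome."""
    return len(pathway_genes & genome_genes) / len(pathway_genes)

# Hypothetical gene identifiers.
pathway = {"g1", "g2", "g3", "g4"}
genome = {"g1", "g3", "g9"}

completeness = pathway_completeness(pathway, genome)

# Under the all-or-none assumption this value would be either 0.0 or 1.0.
violates_all_or_none = 0.0 < completeness < 1.0
```

A pathway whose completeness is often strictly between 0 and 1 across reference genomes is a candidate for splitting into smaller modules before abundance estimation.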

To our knowledge, this is the first in-depth study of functional analysis of metagenomic data at the pathway level. It provides further means for exploring metagenomes and their functions via environment-based comparative analysis.

5. Appendix

Footnotes

Acknowledgments

S.B. is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.

Disclosure Statement

No competing financial interests exist.

1

The SEED database is commonly used for functional analysis that is defined in terms of subsystems rather than pathways. However, since our models treat pathways as gene sets and do not consider issues such as pathway topology and products, the theory described in this paper is also applicable to databases such as SEED.

2

Note that λ_G in [18] refers to the expected number of clone inserts whose two sides are sequenced. Here reads are assumed to be independent of each other; the adjustment to paired-end sequencing should be straightforward.

References

1. Altschul S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215:403–410.

2. Amann R.I., Ludwig, Schleifer K.H. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev., 59:143–169.

3. Angly F.E., Willner, Prieto-Davo, et al. 2009. The GAAS metagenomic tool and its estimations of viral and microbial average genome size in four major biomes. PLoS Comput. Biol., 5:e1000593.

4. Beja, Aravind, Koonin E.V., et al. 2000. Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science, 289:1902–1906.

5. Caspi, Altman, Dale J.M., et al. 2010. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 38:D473–D479.

6. Caspi, Foerster, Fulcher C.A., et al. 2008. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res., 36:D623–D631.

7. DeLong E.F., Preston C.M., Mincer, et al. 2006. Community genomics among stratified microbial assemblages in the ocean's interior. Science, 311:496–503.

8. Dinsdale E.A., Edwards R.A., Hall, et al. 2008. Functional metagenomic profiling of nine biomes. Nature, 452:629–632.

9. Edwards R.A., Rodriguez-Brito, Wegley, et al. 2006. Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics, 7:57.

10. Feingersch, Suzuki M.T., Shmoish, et al. 2010. Microbial community genomics in eastern Mediterranean Sea surface waters. ISME J., 4:78–87.

11. Finn R.D., Mistry, Schuster-Bockler, et al. 2006. Pfam: clans, web tools and services. Nucleic Acids Res., 34:D247–D251.

12. Gill S.R., Pop, Deboy R.T., et al. 2006. Metagenomic analysis of the human distal gut microbiome. Science, 312:1355–1359.

13. Grossart H.P., Steward G.F., Martinez, et al. 2000. A simple, rapid method for demonstrating bacterial flagella. Appl. Environ. Microbiol., 66:3632–3636.

14. Haft D.H., Selengut J.D., White. 2003. The TIGRFAMs database of protein families. Nucleic Acids Res., 31:371–373.

15. Howard E.C., Henriksen J.R., Buchan, et al. 2006. Bacterial taxa that limit sulfur flux from the ocean. Science, 314:649–652.

16. Kanehisa, Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28:27–30.

17. Lander E.S., Waterman M.S. 1988. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2:231–239.

18. Loy, Duller, Baranyi, et al. 2009. Reverse dissimilatory sulfite reductase as phylogenetic marker for a subgroup of sulfur-oxidizing prokaryotes. Environ. Microbiol., 11:289–299.

19. Markowitz V.M., Korzeniewski, Palaniappan, et al. 2006. The integrated microbial genomes (IMG) system. Nucleic Acids Res., 34:D344–D348.

20. Mollet, Drancourt, Raoult. 1997. rpoB sequence analysis as a novel basis for bacterial identification. Mol. Microbiol., 26:1005–1011.

21. Overbeek, Begley, Butler R.M., et al. 2005. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res., 33:5691–5702.

22. Raes, Foerstner K.U., Bork. 2007. Get the most out of your metagenome: computational analysis of environmental sequence data. Curr. Opin. Microbiol., 10:490–498.

23. Rodriguez-Brito, Rohwer, Edwards R.A. 2006. An application of statistics to comparative metagenomics. BMC Bioinformatics, 7:162.

24. Rusch D.B., Halpern A.L., Sutton, et al. 2007. The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol., 5:e77.

25. Sharon, Alperovitch, Rohwer, et al. 2009a. Photosystem I gene cassettes are present in marine virus genomes. Nature, 461:258–262.

26. Sharon, Pati, Markowitz V.M., et al. 2009b. A statistical framework for the functional analysis of metagenomes. Proc. RECOMB 2009, 496–511.

27. Tatusov R.L., Fedorova N.D., Jackson J.D., et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41.

28. Tyson G.W., Chapman, Hugenholtz, et al. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428:37–43.

29. Venter J.C., Remington, Heidelberg J.F., et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304:66–74.

30. Warnecke, Luginbuhl, Ivanova, et al. 2007. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature, 450:560–565.

31. Ye, Doak T.G. 2009. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput. Biol., 5:e1000465.

32. Yooseph, Sutton, Rusch D.B., et al. 2007. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol., 5:e16.

33. Yutin, Suzuki M.T., Teeling, et al. 2007. Assessing diversity and biogeography of aerobic anoxygenic phototrophic bacteria in surface waters of the Atlantic and Pacific Oceans using the Global Ocean Sampling expedition metagenomes. Environ. Microbiol., 9:1464–1475.