Learning Microbial Interaction Networks from Metagenomic Count Data

Abstract

Many microbes associate with higher eukaryotes and impact their vitality. To engineer microbiomes for host benefit, we must understand the rules of community assembly and maintenance that, in large part, demand an understanding of the direct interactions among community members. Toward this end, we have developed a Poisson-multivariate normal hierarchical model to learn direct interactions from the count-based output of standard metagenomics sequencing experiments. Our model controls for confounding predictors at the Poisson layer and captures direct taxon–taxon interactions at the multivariate normal layer using an ℓ₁ penalized precision matrix. We show in a synthetic experiment that our method handily outperforms state-of-the-art methods such as SparCC and the graphical lasso (glasso). In a real in planta perturbation experiment of a nine-member bacterial community, we show our model, but not SparCC or glasso, correctly resolves a direct interaction structure among three community members that associates with Arabidopsis thaliana roots. We conclude that our method provides a structured, accurate, and distributionally reasonable way of modeling correlated count-based random variables and capturing direct interactions among them.

1. Introduction

Microbes are the most diverse form of life on the planet. Many associate with higher eukaryotes, including humans and plants, and perform key metabolic functions that underpin host viability (The Human Microbiome Project Consortium, 2012; Lundberg et al., 2012). Importantly, they coexist in these ecologies in various symbiotic relationships (Konopka, 2009). Understanding the structure of their interaction networks may simplify the list of microbial targets that can be modulated for host benefit or assembled into small artificial communities that are deliverable as probiotics.

Microbiomes can be measured by sequencing all host-associated 16S rRNA gene content. Because the 16S gene is a faithful phylogenetic marker, this approach readily reveals the taxonomic composition of the host metagenome (Segata et al., 2013). Given that such sequencing experiments output an integral, non-negative number of sequencing reads, the final output for such an experiment can be summarized in the n-samples × o-taxa count table, Y, where Y_ij denotes the number of reads that map taxon j in sample i. It is assumed Y_ij is proportional to taxon j's true abundance in sample i.

To study interrelationships between taxa, we require a method that transforms Y into an undirected graph represented by a weighted o × o adjacency matrix, A, where a nonzero entry in position (i, j) indicates an association between taxon i and taxon j. Correlation-based methods are a popular approach to achieve this end (Faust and Raes, 2012; Faust et al., 2012; Friedman and Alm, 2012). Nevertheless, correlated taxa need not directly interact if, for example, they are coregulated by a third taxon. Gaussian graphical models remedy this concern by estimating a conditional independence network in which A_ij = 0 if and only if taxon i and taxon j are conditionally independent given all remaining taxa under consideration (Meinshausen and Bühlmann, 2006; Friedman et al., 2007; Wainwright and Jordan, 2008). However, they also assume the columns of Y are normally distributed, which is unreasonable for a metagenomic sequencing experiment. Finally, neither correlation nor Gaussian graphical modeling offers a systematic way to control for confounding predictors, such as measured biological covariates (e.g., body site or plant fraction), experimental replicate, sequencing plate, or sequencing depth.

As baseline methods, we consider the commonly used correlation network method, SparCC (Friedman and Alm, 2012), and a state-of-the-art method for inferring Gaussian graphical models, the graphical lasso (glasso) (Friedman et al., 2007). SparCC calculates an approximate linear correlation between taxa after taking the log ratio of their abundances, and through an iterative procedure, prunes correlation edges that are not robust. In this way, it not only aims to produce a sparse network but also avoids negative correlations between taxa that arise from data compositionality—a common problem in metagenomics experiments, in which counts of taxa can only be interpreted relative to each other and not as absolute abundance measurements. Importantly, the authors point out that SparCC does not make any parametric assumptions.

The glasso aims to construct a sparse graphical model, in which nonzero edges can be interpreted as direct interactions between taxa. Model inference proceeds by optimizing the likelihood of a standard multivariate normal distribution with respect to the precision matrix, subject to an ℓ₁ constraint on each entry. The magnitude of this ℓ₁ penalty controls the degree of sparsity or, equivalently, model parsimony.

In this work, we develop a Poisson-multivariate normal hierarchical model that can account for a correlation structure among count-based random variables. Our model controls for confounding predictors at the Poisson layer and captures direct taxon–taxon interactions at the multivariate normal layer using an ℓ₁ penalized precision matrix.

2. Methods

2.1. Preliminaries

Let n, p, and o denote the number of samples, number of predictors, and the number of response variables under consideration, respectively. Throughout this article, response variables will be read counts of bacteria and will be referred to as such, although in practice, any count-based random variable is relevant. Let Y be the n × o response matrix, where Y_ij denotes the count of bacteria j in sample i. Let X be the n × p design matrix, where X_ij denotes predictor j's value for sample i. For a matrix M, we will use the notation M_:i and M_i_: to index the entire ith column and row, respectively. The Frobenius norm of M is defined to be \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\parallel M{\parallel_F} = \sqrt { \sum \nolimits_i \sum \nolimits_j M_{ij}^2}$$ \end{document} .

2.2. The model

We wish to model direct interaction relationships among bacteria measured in a metagenomic sequencing experiment while also controlling for the confounding biological and/or technical predictors encoded in X. Toward this end, we propose the following Poisson-multivariate normal hierarchical model. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{w_{i:}}\, \sim\, { \rm{Multivariate}} - { \rm{Normal}} \left( {{ \bf{0}} , { \Sigma ^{ - 1}}} \right) \tag{1}\end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{Y_{ij}} \,\sim\, { \rm{Poisson}} ( \exp \{ {X_{i:}}{ \beta _j} + {w_{ij}} \} ) \tag{2}\end{align*} \end{document}

Here 0 and Σ⁻¹ are the 1 × o zero mean vector and o × o precision matrix of the multivariate normal, and w is an n × o latent abundance matrix. The coefficient matrix, β, is p × o such that β_ij denotes predictor i's coefficient for taxon j. It is important that in our decision to restrict the mean of w to 0, we assume that X contains an intercept term. Under this assumption, the mean of w and coefficient corresponding to the intercept term of X are redundant in expectation.

The log posterior of this model is given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mathop \sum \limits_ { j = 1 } ^o \mathop \sum \limits_ { i = 1 } ^n \left[ { { y_ { ij } } ( { x_ { i: } } { \beta _ { :j } } + { w_ { ij } } ) - \exp {\{ { x_ { i: } } { \beta _ { :j } } + { w_ { ij } } \}} } \right] + \frac { n } { 2 } \log \mid { \Sigma ^ { - 1 } } \mid - \frac { n } { 2 } tr \left( { S ( w ) { \Sigma ^ { - 1 } } } \right) \tag { 3 } \end{align*} \end{document}

where S(w) = w^Tw/n is the empirical covariance matrix of w.

Intuitively, the columns of w are adjusted, “residual” abundance measurements of each bacterium, after controlling for confounding predictors in X. Assuming all relevant confounding covariates are indeed included in X, the only signal that remains in these residuals must arise from interactions between the bacteria being modeled. Therefore, we wish to model direct interactions, or equivalently, conditional independences at the level of these latent abundances, rather than the observed counts. Recall if Σ_ij⁻¹ = 0, then w_:i and w_:j are conditionally independent, and so too are Y_i_: and Y_j_: since the probability density of Y_:k given w_:k is completely determined. Thus, assuming a correct model, Σ_ij⁻¹ = 0 is sufficient to conclude that bacteria i and bacteria j do not interact and are conditionally independent given all other bacteria. Similarly, if Σ_ij⁻¹ ≠ 0, we would conclude that bacteria i and bacteria j do directly interact.

To appreciate the degree of coupling between two bacteria, we must normalize Σ_ij⁻¹ to Σ_ii⁻¹ and Σ_jj⁻¹. A large |Σ_ij⁻¹| need not be indicative of a strong coupling if, for example, Σ_jj⁻¹ and Σ_ii⁻¹—the conditional variance of bacteria i and j given all others—are much larger. Therefore, in subsequent results and visualizations, we consider a transformation of Σ⁻¹ to its partial correlation matrix, P, whose entries are specified as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${P_{ij}} = - { \Sigma _{ij}} / \sqrt { \Sigma _{ii}^{ - 1} \Sigma _{jj}^{ - 1}}$$ \end{document} .

Finally, we wish to have an estimate of an interaction network that not only well explains the correlated count data we observe but also does so parsimoniously, in a manner that maximizes the number of correct hypotheses and minimizes the number of false ones that lead to wasted testing effort. Toward this end, we impose an adjustable ℓ₁-penalty on the entries of the precision matrix during optimization, which encourages the precision matrix to be sparse. Importantly, from a Bayesian perspective, the ℓ₁ penalty can be seen as a zero-mean Laplace distribution (with parameter λ) over the model parameter it is regularizing.

2.2.1. Model learning

The ℓ₁-penalized log posterior, modulo unnecessary constants, is given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \begin{split} & \mathop { { \rm { argmax } } } \limits_ { _ { \beta , w , { \Sigma ^ { - 1 } } } } \mathop \sum \limits_ { j = 1 } ^o \mathop \sum \limits_ { i = 1 } ^n \left[ { { y_ { ij } } ( { x_ { i: } } { \beta _ { :j } } + { w_ { ij } } ) - \exp \{ { x_ { i: } } { \beta _ { :j } } + { w_ { ij } } \} } \right] + \frac { n } { 2 } \log \mid { \Sigma ^ { - 1 } } \mid - \frac { n } { 2 } { \rm { tr } } \left( { S ( w ) { { \rm { \Sigma } } ^ { - 1 } } } \right) - \frac { { \lambda n } } { 2 } \mid \mid { \Sigma ^ { - 1 } } { \parallel _1 } + { o^2 } \log \left( { \frac { { n \lambda } } { 4 } } \right) \end{split} \tag { 4 }\end{align*} \end{document}

where λ is a tuning parameter and ||·||₁ denotes the ℓ₁-norm, which for a matrix M equals ∑_i ∑_j|M_ij|. Note we have presented the ℓ₁ penalty as a Laplace distribution with parameter 2/(nλ). In other words, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$f ( \Sigma _{ij}^{ - 1} \mid 2 / ( n \lambda ) ) = n \lambda \exp ( - n \lambda \mid \Sigma _{ij}^{ - 1} \mid / 2 ) / 4$$ \end{document} .

We optimize this objective using an iterative conditional mode algorithm, in which parameters are sequentially updated to their mode value given current estimates of the remaining parameters (Besag, 1986). Given the estimates for w and Σ⁻¹, the conditional objective for β is given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mathop {{ \rm{argmax}}} \limits_{_ \beta } \mathop \sum \limits_{j = 1}^o \mathop \sum \limits_{i = 1}^n \left[ {{y_{ij}} ( {x_{i:}}{ \beta _{:j}} + {{ \hat w}_{ij}} ) - \exp \{ {x_{i:}}{ \beta _{:j}} + {{ \hat w}_{ij}} \} } \right] \tag{5}\end{align*} \end{document}

This is efficiently and uniquely optimized by setting β_:k to the solution of the Poisson regression of Y_:k onto X using a log link and w_:k as an offset for all k∈{1,2,…,o}.

Given the estimates for β and Σ⁻¹, the conditional objective for w is given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mathop { { \rm { argmax } } } \limits_w \mathop \sum \limits_ { j = 1 } ^o \mathop \sum \limits_ { i = 1 } ^n \left[ { { y_ { ij } } { w_ { ij } } - \exp \ { { x_ { i: } } { { \hat \beta } _ { :j } } + { w_ { ij } } \ } } \right] - \frac { n } { 2 } { \rm { tr } } \left( { S ( w ) { { \hat \Sigma } ^ { - 1 } } } \right) \tag { 6 } \end{align*} \end{document}

Each row of w is independent of all other rows in this objective and can therefore be updated separately. To obtain the conditional update for w_i_:, we apply Newton–Raphson. The gradient vector, g_i, and Hessian, H_i, are given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{g_i} = {y_{i:}} - \exp \{ {x_{i:}}{ \hat \beta _{:j}} + {w_{ij}} \} - {w_{i:}}{ \hat \Sigma ^{ - 1}} \tag{7}\end{align*} \end{document} \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{H_i} = - { \hat \Sigma ^{ - 1}} - { \rm{diag}} ( \exp \{ {x_{i:}} \hat \beta + {w_{i:}} \} ) \tag{8}\end{align*} \end{document}

Because \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \hat \Sigma ^{ - 1}}$$ \end{document} is positive definite and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\exp \{ {x_{i:}} \hat \beta + {w_{i:}} \} > 0$$ \end{document} for all components, H_i is always negative definite. Thus, the conditional update for w_i_: is a unique solution.

Given β and w, the conditional objective for Σ⁻¹ is given as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mathop {{ \rm{argmax}}} \limits_{_{{ \Sigma ^{ - 1}}}} \log \mid { \Sigma ^{ - 1}} \mid - { \rm{tr}} \left( {S ( w ) { \Sigma ^{ - 1}}} \right) - \lambda \parallel { \Sigma ^{ - 1}}{ \parallel _1} \tag{9}\end{align*} \end{document}

which is convex and efficiently optimized using the glasso (Friedman et al., 2007).

2.2.2. Model initialization

In a manner similar to our conditional update for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\beta$$ \end{document} , we initialize \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\beta _{:k}}$$ \end{document} to be the solution of the Poisson regression of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${Y_{:k}}$$ \end{document} onto X using a log link, but with no offset for all \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$k \in \{ 1 , 2 , \ldots , o \} $$ \end{document} . Given \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\beta$$ \end{document} , the predicted mean of the associated Poisson distribution is given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${{\mathbb E}} ( {Y_{ij}} \mid X ) = \exp ( {X_{i:}}{ \beta _{:j}} )$$ \end{document} . Note, however, in the original formulation of our model, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ {\mathbb E}} ( {Y_{ij}} \mid X ) = \exp ( {X_{i:}}{ \beta _{:j}} + {w_{ij}} )$$ \end{document} . This suggests a natural initialization for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${w_{ij}}$$ \end{document} : \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${w_{ij}} = \log ( {Y_{ij}} ) - {X_{i:}}{ \beta _{:j}}$$ \end{document} . To complete the initialization, we set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Sigma ^{ - 1}}$$ \end{document} to be the generalized pseudoinverse of S(w)—a numerically stable estimate of the precision matrix of w. The rationale behind this initialization is consistent with the previously presented intuitions underlying the components of each model and in practice leads to quick convergence.

2.2.3. Model selection

The ℓ₁ tuning parameter, λ, is a hyperparameter that must be set before the model can be learned. In supervised learning, cross validation is a popular method used to set such penalties. In our model, however, w is a sample specific parameter that consequently must be estimated for held out data before prediction error can be evaluated. This breaks the independence assumption between training data and test data and, in general, results in poor or undeterminable model selection; less penalizing (smaller) values of λ tend to always produce statistically lower test set prediction error, because w is allowed to adapt to the test set samples. Instead of cross validation, we assume only for the purpose of selecting a value for λ that there is a joint distribution between λ and the remaining parameters, in which λ has an improper flat prior (the prior probability density of λ always equals 1). Then, differentiating Eq. (4), setting equal to 0 and solving for λ gives us \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\hat \lambda = 2{o^2} / \left( {n \parallel \Sigma _{init}^{ - 1} \parallel} \right)$$ \end{document} , which is the value of λ we use throughout the optimization. Here, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Sigma _{init}^{ - 1}$$ \end{document} is our initial estimate of Σ⁻¹ and is obtained as described in the previous section.

We note here a qualitative connection to empirical Bayes inference, in which hyperparameter values are set to be the maximizers of the marginal likelihood—the probability density of the data given only the hyperparameters. In effect, empirical Bayes calculates the expected posterior density by averaging over model parameters and then chooses the hyperparameter value that maximizes it. In our case, instead of marginalizing over parameters, we make an intelligent guess at their value and condition on these values to set our hyperparameter λ. In both methods, hyperparameters are set in an unbiased and objective way by looking first at the data.

2.3. Synthetic experiment

To test our model's accuracy, efficiency, and performance relative to other leading methods, we constructed a 20-node synthetic experiment composed of 100 samples. As our precision matrix, Σ⁻¹, we generated a random, 20 × 20, 85% sparse (total of 27 nonzero, off-diagonal entries) positive definite matrix using the sprandsym function in MATLAB. From Σ⁻¹, we generated latent abundances, w, for 100 samples using a standard, multivariate, normal random variable generator based on the Cholesky decomposition. We then generated two confounding covariates, X₁ and X₂. X₁ was a vector of 100 independent and identically distributed normal (4,1) random variables. X₂ was a 100-long vector where the first 50 entries equaled 1 and the last 50 equaled 0. The weights, β_1j and β_2j, on each confounding covariate, were set to be −0.5 and 6, respectively, for all nodes (i.e., for all j∈{1,2,…,20}). These coefficient values were chosen such that the combined effect size of these confounding covariates on the response was three times larger than the effect size of the latent abundances, or equivalently, the contribution of the interactions encoded in the precision matrix. The 100 × 20 response matrix, Y, was generated according to Y_ij ∼ Poisson(X_i₁β_1j + X_i₂β_2j + w_ij). Finally, for the same precision matrix, we generated 20 replicate response matrices in this manner.

We applied our model to the 100 × 20 synthetically generated response matrix Y and entered the confounding covariates, X₁ and X₂, as predictors. We also applied SparCC and the glasso to illustrate the performance of a state-of-the-art correlation-based method and a widely used method for inferring graphical models, respectively. Because SparCC models correlations at the level of log₂ ratios between variables, the authors claim that SparCC does not require any type of normalization (e.g., rarefaction to a common depth) or standardization of variables across samples.

While we applied SparCC to Y only, we ran glasso on a matrix composed of the column-wise concatenation of a rarefied version of Y and X, effectively learning a joint precision matrix over nodes represented in Y and the covariates in X. Each sample in Y was rarefied to 1000 counts, which is a large number for measuring 20 variables. Applying glasso in this manner allowed it to account for the confounding predictors in X. To compare the glasso-learned precision matrix to the true precision matrix, we use only the 20 × 20 subset matrix corresponding to the nodes represented in Y. The ℓ₁ tuning parameter for glasso was chosen by cross validation, where the selection criterion was the test set log likelihood. We additionally note that when running glasso, we supply the sample correlation matrix, which accounts for differences in the scale of each variable.

2.4. Artificial community experiment

To test the model with real data, we constructed a nine-member artificial community composed of Escherichia coli (a putative negative root colonization control) and eight other bacterial strains originally isolated from Arabidopsis thaliana roots grown in two wild soils (Lundberg et al., 2012). These eight isolates were chosen based on their potential to confer beneficial phenotypes to the host (unpublished data) and to maximize phylogenetic diversity. Into each of the 94 2.5-inch-square pots filled with 100 mL of a 2× autoclaved, calcined clay soil substrate, we inoculated the nine isolates in varying relative abundances to perturb their underlying interaction structure. For all pots, all strains were present, but ranged in input abundance from 0.5% to 50%.

To each of these inoculated pots, we carefully and aseptically transferred a single sterilely grown Col-0 A. thaliana seedling. Pots were spatially randomized and placed in growth chambers providing short days of 8 hours light at 21°C and 16 hours dark at 18°C. The plants were allowed to grow for 4 weeks, after which we harvested their roots and for each performed 16S profiling (includes DNA extraction, polymerase chain reaction [PCR], and sequencing) of the V4 variable region. To quantify the relative amount of each input bacterium, sequencing reads were demultiplexed, quality filtered, adjusted to ConSeqs if applicable (see Batch B processing below), and then mapped using the Burrows Wheeler Aligner (Li and Durbin, 2010) to a previously constructed sequence database of each isolate's V4 sequence. Mapped ConSeqs or reads to a given isolate in a given sample were counted and subsequently assembled into a 94-samples × 9-isolates count matrix.

While all 94 samples were harvested over 2 days, they were thereafter processed in two batches, A (52 samples) and B (42 samples), ∼4 months apart. Batch B samples were 16S profiled using the method described in Lundberg et al. (2013). This PCR method partially adjusts for sequencing error and PCR bias by tagging all input DNA template molecules with a unique 13-mer molecular tag before PCR. After sequencing, this tag is then used to informatically collapse all tag-sharing amplicon reads into a single consensus sequence or ConSeq. Batch A samples were 16S profiled by using a more traditional PCR. Having two distinct sample sets, each processed using different protocols, allowed us to assess our model's ability to statistically account for batch effects when inferring the interaction network of our nine-member community.

2.5. In vitro coplating validation experiments

To test predicted interactions from our artificial community experiment, we grew liquid cultures of predicted interactors and noninteractors to OD 1 in 2XYT liquid media. We then coplated six 5 uL dots of predicted interactors and noninteractors on King's B media agar plates, either 1 cm apart (three dots each) or 12 cm (three dots each) apart on the same plate. We then examined each strain for growth enhancement or restriction that was specific to its proximity to the potential interactor it was tested against.

3. Results

3.1. Synthetic experiment

Figure 1 illustrates performances for the three methods. With the exception of SparCC, Figure 1a illustrates the Frobenius norm of the difference between the partial correlation transformations of the true precision matrix and the estimated one. The Frobenius norm, also called the Euclidean norm, is equivalent to an entrywise Euclidean distance between two matrices and is therefore a measure of the closeness of the two matrices when computed on their entrywise difference. For SparCC, the difference is calculated between the true partial correlation matrix and the estimated correlation matrix.

FIG. 1.

The Poisson-multivariate normal hierarchical model outperforms SparCC and glasso in a synthetic experiment. (a) Frobenius norm of the difference between the partial correlation transformed true precision matrix and the estimated precision matrix for each method. The glasso was run jointly over all response variables and covariates and is therefore suffixed with “w.c.” Shaded blue bands represent 2× standard deviation and shaded red bands represent 2× standard error. (b) False discovery rate of each method as a function of the number of magnitude-ordered edges called significant. The solid thick line illustrates the average FDR curve across all replicates. The shaded bands illustrate the 5th and 95th percentile FDR curve considering all replicates. Network representations of the (c) true partial correlation transformed precision matrix, (d) correlation matrix outputted by SparCC, (e) partial correlation transformed precision matrix outputted by glasso w.c., and (f) partial correlation transformed precision matrix outputted by our Poisson-multivariate normal hierarchical model. glasso, graphical lasso; FDR, false discovery rate; w.c. with covariates.

SparCC's correlation matrix is the most different from the true partial correlation matrix, followed by glasso with covariates (w.c.) entered as variables. Our Poisson-multivariate normal hierarchical model performs the best and, interestingly, is the most consistent across replicates than the other methods.

Figure 1b illustrates a complimentary measure of accuracy, the false discovery rate (FDR), which is defined to be the number of false nonzero edges inferred divided by the total number of nonzero edges inferred. More specifically, Figure 1b illustrates FDR as a function of the number of edges (ordered by descending magnitude) called significant. Here again we see SparCC has the least desirable performance with the FDR curve that nearly majorizes glasso w.c. and completely majorizes our method. The glasso has the next most desirable FDR curve, but still has 3–4 false discoveries in the top 10 nonzero edges. Our method outperforms the other two and incurs nearly 0 false discoveries in the top 10 (in magnitude) edges it discovers across all replicates.

Figure 1d–f illustrates network representations for the average (across all replicates) correlation or partial correlation matrix learned by each method. Figure 1c provides the network representation of the true partial correlation matrix used in this synthetic experiment. These networks visually support previous claims. The network produced by SparCC is not sparse and is visually most distant from the true network. The glasso w.c. method is considerably more sparse and seems to recover several of the top positive edges. Our method's network is visually closest to the true network and recovers considerably more of the top edges. However, it also detects them with less magnitude.

3.2. Artificial community experiment

We applied our model to the 94 root samples × 9 isolates count matrix. Starting input abundances and processing batch (Fig. 2a, left) were entered as covariates in our model. Before running the model, the design matrix was standardized so that coefficients on each variable could be directly comparable.

FIG. 2.

Recolonization and isolate–isolate interaction results from the nine-member synthetic community. (a) Design (left) and response (right) matrices. The design matrix was composed of a binary vector indicating processing batch and the relative input abundances of each input isolate except Escherichia coli (to preserve rank). Before running the model, the design matrix was standardized so that coefficients on each variable could be directly comparable. Response matrix illustrates raw counts on a log₁₀ scale. (b) Latent abundance matrix, w, inferred from our model. These latent abundances are read counts of each bacterium adjusted for the covariates encoded in X. (c) Dotted box plot illustrating the effect size of each predictor on each isolate, presented as a single dot for each predictor. Purple bands illustrate 2× standard deviation and red bands illustrate 2× standard error. (d) Network visualizations of correlation matrices outputted from SparCC run on raw counts (left), partial correlation transformation of the precision matrix outputted by glasso w.c. (middle), and the partial correlation matrix obtained from the sparse precision matrix inferred from our mode (right). (e) Interaction and noninteraction predictions tested in vitro on agar plates. The rightmost coplating among the confirmed interactions attempts to directly test the conditional independence structure of the (i181, i50, and i105) triad.

On examining the response matrix (Fig. 2, right), we notice a clear difference in the number of counts between Batch A samples and Batch B samples. This is due to the molecule tag correction that was available and applied to Batch B samples. The molecule tag correction collapses all reads sharing the same molecule tag into a single ConSeq—a representative of the original template molecule of DNA, before PCR. However, on examining the latent abundances, w (Fig. 2b), we notice the model has successfully adjusted for these effects. As we would also expect, Figure 2c illustrates that the learned effect size of the batch variable is an influential predictor of bacterial read counts, more so than the starting abundance of each bacterium.

Interestingly, in further scrutinizing Figure 2b, we notice an interesting correlation in the latent abundances of i181 and i105 and to a lesser extent between i50 and i105. As a corollary, the latent abundances of i181 and i50 are also correlated. These correlations are suggestive of direct interaction relationships among these three bacteria, but a number of direct interaction structures could explain these correlations.

Figure 2d illustrates network visualizations of either correlation matrices outputted from SparCC (left), or partial correlation transformed precision matrices outputted by glasso w.c. entered as variables, or by our model. SparCC applied to the raw response matrix suggests a number of negative correlations that include all community members except i303. Interestingly, SparCC, which operates on log ratios of the bacterial counts, seems immune to the positive correlations among the community members one would expect to arise due to the processing batch effect. The glasso with covariates entered as variables affords the simplest model and only suggests a positive interaction between i8 and i105.

We note that varying λ by two orders of magnitude of the selected value minorly changed the sparsity pattern, and the two edges described above always persisted. For smaller (relaxed) values of λ, the incoming edges were small in magnitude and not significant after bootstrapping.

The precision matrix our model infers is sparse, containing only two edges—one between i105 and i181, and another between i105 and i50. The network representation of the partial correlation matrix of our precision matrix reveals a strong predicted direct antagonism between i181 and i105 and also to a lesser extent between i105 and i50. Note that the model does not predict any interaction between i181 and i50.

In vitro coplating experiments corroborate the model's predictions exactly in direction and also semiquantitatively (Fig. 2e). In particular, they show that (i105, i181) and (i105, i50) are, indeed, antagonistic interaction pairs and that, moreover, i181 and i50 are the inhibitors. In addition, the (i181, i105) inhibition appears more pronounced than the (i181, i50) inhibition, just as the model suggests. The model also predicts conditional independence of i50 and i181 given i105. Indeed, the inward facing edges of the i181 and i50 colonies do not appear deformed in either the paired coplating or the triangular coplating and therefore suggest a noninteraction. Note that naively interpreting the SparCC network edges as evidence of direct interaction would falsely lead one to conclude that a direct, positive interaction exists between i50 and i181.

Note that our model predicts that i181, i50, and i105 do not interact with many of the other community members. As support for this prediction, we see that neither i105, i50, nor i181 interacts with iEc.

Finally, to examine the robustness of these results to the selection of λ, the ℓ₁ tuning parameter, we examined network solutions obtained from running the model at 20 λ values ranging from two orders of magnitude above or below (0.0001–1, logarithmically spaced) the selected value of 0.080. For all of these values, the two nonzero edges shown in the Poisson+MVN network in Figure 2d had the greatest magnitude among all edges. The edge between i105 and i50 was pushed to 0, for λ values >0.234. The network became significantly less sparse for λ values <0.002, and at this point, edges between the predicted noninteractions tested in Figure 2e became nonzero. Taken together, these results suggest that our selection criterion for λ is reasonable and is one that balances false positives and false negatives.

4. Discussion

We demonstrated that our Poisson-multivariate normal hierarchical model can infer true and direct microbe–microbe interactions in synthetic and real data. Proper modeling of confounding predictors is necessary to detect the (i105, i181) and (i105, i50) interactions. Although not illustrated for brevity, without controlling for processing batch, the model detects a large number of positive interactions, none of which is supported in our coplating experiments.

While SparCC is capable of detecting the top correlations between directly interacting members, it is unable to successfully resolve the correct conditional independence structure among them, despite its intention to produce a sparse network. Although the glasso can infer direct interactions, its inability to properly model covariates or count-based abundance measurements greatly reduces its utility in metagenomic sequencing experiments, which are often multifactorial in nature. We note here that covariates can be technical or biological in nature. For example, in addition to finding direct interactions that are independent of batch effects (e.g., controlling for samples obtained from mice cohoused in separate cages), one might also be interested in finding interactions independent of the diet status of the mouse host.

Given the importance of properly controlling for covariates for correctly inferring direct interactions, it is clear that potential confounding variables must be diligently measured. This is especially true when the effect size of the covariates exceeds that of the direct interactions. For example, it is well known that diet strongly influences the gut microbiota in mice. Suppose, two community members do not directly interact but have very similar preferences for a particular micronutrient in the mouse diet. Whenever the mouse is fed, the abundance of both bacteria will concurrently increase, and if the community is sampled at a random time thereafter, it will misleadingly appear that the two bacteria may interact.

Finally, although not derived for brevity, we note that the Poisson-multivariate normal model has as flexible a mean-variance relationship as a negative binomial model and can therefore readily handle overdispersion. Intuitively, it is modeling the Poisson mean as a log-normal random variable that affords this flexibility.

We conclude that our method provides a structured, accurate, and distributionally reasonable way of modeling correlated count-based random variables and capturing direct interactions among them.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

References

Besag

1986. On the statistical analysis of dirty pictures. J. R. Stat. Soc., 48, 259–302.

Faust

, and Raes

2012. Microbial interactions: From networks to models. Nat. Rev. Microbiol., 10, 538–550.

Faust

, Sathirapongsasuti

J.F.

, Izard

, et al. 2012. Microbial co-occurrence relationships in the human microbiome. PLoS Comput. Biol., 8, e1002606.

Friedman

, and Alm

E.J.

2012. Inferring correlation networks from genomic survey data. PLoS Comput. Biol., 8, 1–11.

Friedman

, Hastie

, and Tibshirani

2007. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9, 432–441.

The Human Microbiome Project Consortium. 2012. Structure, function and diversity of the healthy human microbiome. Nature, 486, 207–214.

Konopka

2009. What is microbial community ecology?. ISME J., 3, 1223–1230.

, and Durbin

2010. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595.

Lundberg

D.S.

, Lebeis

S.L.

, Paredes

S.H.

, et al. 2012. Defining the core Arabidopsis thaliana root microbiome. Nature, 488, 86–90.

10.

Lundberg

D.S.

, Yourstone

, Mieczkowski

, et al. 2013. Practical innovations for high-throughput amplicon sequencing. Nat. Methods, 10, 999–1002.

11.

Meinshausen

, and Bühlmann

2006. High-dimensional graphs and variable selection with the Lasso. Ann. Stat., 34, 1436–1462.

12.

Segata

, Boernigen

, Tickle

T.L.

, et al. 2013. Computational meta'omics for microbial community studies. Mol. Syst. Biol., 9, 666.

13.

Wainwright

M.J.

, and Jordan

M.I.

2008. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1, 1–305.