Fast Metagenomic Binning via Hashing and Bayesian Clustering

Abstract

We introduce GATTACA, a framework for fast unsupervised binning of metagenomic contigs. Similar to recent approaches, GATTACA clusters contigs based on their coverage profiles across a large cohort of metagenomic samples; however, unlike previous methods that rely on read mapping, GATTACA quickly estimates these profiles from kmer counts stored in a compact index. This approach can result in over an order of magnitude speedup, while matching the accuracy of earlier methods on synthetic and real data benchmarks. It also provides a way to index metagenomic samples (e.g., from public repositories such as the Human Microbiome Project) offline once and reuse them across experiments; furthermore, the small size of the sample indices allows them to be easily transferred and stored. Leveraging the MinHash technique, GATTACA also provides an efficient way to identify publicly available metagenomic data that can be incorporated into the set of reference metagenomes to further improve binning accuracy. Thus, enabling easy indexing and reuse of publicly available metagenomic data sets, GATTACA makes accurate metagenomic analyses accessible to a much wider range of researchers.

1. Introduction

Despite their important role, microbes constitute the dark matter of the biological universe. Thousands of species live in the human gut, but only a small fraction can be isolated and studied in a laboratory and very little is known about those that cannot be cultured. The short-read lengths of modern sequencing instruments—combined with various challenges inherent to complex bacterial environments—make it difficult to accurately identify what bacterial strains are present in metagenomic samples, reconstruct their genomic sequences, and assess their abundance. Specialized library preparation techniques such as Hi-C or synthetic long reads are often more accurate, but also highly complex and expensive. As a result, more popular metagenomic analysis approaches rely on standard short-read data and multistep bioinformatic pipelines.

A common first step in such pipelines is de novo assembly of metagenomic reads into contigs. Since this initial assembly typically does not produce fully reconstructed microbial genomes, the partially assembled contigs are then deconvoluted using metagenomic binning, which groups together contigs that belong to the same microorganism. Current binning techniques fall into mainly two categories: (1) supervised classification of contigs into known taxons via comparisons to previously cataloged species (Huson et al., 2007; Rosen et al., 2011; Wood and Salzberg, 2014) and (2) unsupervised clustering techniques using features derived directly from the metagenomic sample data (Kislyuk et al., 2009; Albertsen et al., 2013; Alneberg et al., 2014; Wu et al., 2014; Kang et al., 2015; Lu et al., 2016; Luo et al., 2016), where unsupervised clustering has the clear advantage of binning contigs that pertain to previously unknown species. While some unsupervised techniques (Kislyuk et al., 2009; Strous et al., 2012) perform clustering based only on the contig sequence composition (the frequency of certain short motifs, e.g., all tetramers), the most successful recent approaches (Albertsen et al., 2013; Alneberg et al., 2014; Wu et al., 2014; Kang et al., 2015; Lu et al., 2016) also incorporate contig coverage profiles across multiple metagenomic samples.

In brief, given a cohort of metagenomic samples, these techniques compute the average coverage of each contig in each sample by mapping sample reads to the contigs using standard read mapping tools, such as Bowtie2 (Langmead and Salzberg, 2012). Naturally, contigs belonging to the same species will have similar abundances across different samples, resulting in similar coverage profiles, which allow them to be clustered together. This approach is accurate but has two main limitations: it requires a large cohort of samples, as well as sizable compute resources for read alignment. We address both of these limitations in this work.

In particular, we present GATTACA, a lightweight framework for metagenomic binning, which (1) avoids read alignment without loss of accuracy and (2) enables efficient stand-alone analysis of single metagenomic samples. Both results are based on the finding that we can approximate contig coverages using kmer counts while still achieving the same binning accuracy as leading alignment-based methods. In addition to offering a significant speedup in coverage estimation, using kmer counts, as opposed to alignment, provides us with the exciting ability to index offline any publicly available metagenomic sample and incorporate it into the coverage profile of the contigs being processed. This allows us to efficiently pull in data from large growing repositories, such as the Human Microbiome Project (HMP; Turnbaugh et al., 2007) or EBI Metagenomics archive (Hunter et al., 2014), into any metagenomic study (especially one where only a single or few samples are available) at almost no cost. For example, the kmer count index of an HMP sample used in our experiments (containing 20 million reads) only requires: 150 MB on average. We achieve the small space requirement by leveraging memory-efficient hashing with minimal perfect hash functions (MPHFs) and the probabilistic Bloom filter data structure. In contrast, using these data sets with read alignment would require much larger downloads and expensive subsequent handling to map the reads.

While using small indices allows us to incorporate a large number of publicly available samples into a given study, not all existing samples will carry content relevant to the study in question. Namely, samples that do not contain any of the species present in a given set of contigs cannot contribute any useful information for grouping the contigs. The same logic applies also to samples that carry content identical to a sample that has already been included. This motivates the need to additionally define appropriate sample selection criteria, for which we propose two metrics: (1) relevance and (2) diversity. More specifically, we would like to select a panel of samples that share content with the samples being analyzed (our query) but that also differ in the content that is shared. We use the MinHash technique (Broder et al., 2000) to find such samples efficiently. At a high level, we create and index small MinHash fingerprints for each sample in the database (offline) and then extract the appropriate samples according to the fingerprint of the query. The resulting index can be separately downloaded and used to determine which samples to include into the analysis; it needs to be updated only occasionally when new samples become available.

We evaluate GATTACA in clustering contigs assembled across multiple samples (coassemblies) and from individual samples, using both synthetic and real data sets. We compare our results with several state-of-the-art methods in metagenomic binning: CONCOCT (Alneberg et al., 2014), MetaBAT (Kang et al., 2015), and MaxBin (Wu et al., 2014), using standardized cluster evaluation metrics and benchmarks (reusing evaluation scripts from existing methods when appropriate). GATTACA was implemented in C++ and Python and is freely available at https://github.com/viq854/gattaca-public.

2. Methods

2.1. Index of kmer counts

To quickly estimate contig coverages, GATTACA builds a small index of kmer counts for each sample in the cohort. Many solutions have been proposed for exact [e.g., using hash tables (Marçais and Kingsford, 2011; Rizk et al., 2013) or minimum perfect hash functions (Patro et al., 2014)] and approximate kmer counting [e.g., using the count-min sketch (Zhang et al., 2014)]. Since the content of each sample in our panel is static, our index uses an MPHF (Cichelli, 1980) to store the kmer counts without loss of accuracy, resulting in a drastic reduction in space when compared to traditional hash tables (we also found it to be more space efficient than the count-min sketch solution for the same binning accuracy). At a high level, given a set S of n keys, an MPHF h provides a mapping between the keys and n consecutive integers from 0 to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$n - 1$$ \end{document} ; that is, h is an injection on S, guaranteeing no collisions among its keys (for x and y in S, if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x \ne y$$ \end{document} , then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$h ( x ) \ne h ( y )$$ \end{document} ) and exactly n possible outputs from the integer set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ 0 , 1 , 2 , \ldots , n - 1 \} $$ \end{document} . We use the BDZ algorithm based on random r-partite hypergraphs (Botelho et al., 2007) for constructing the MPHFs.

2.1.1. Index construction

To construct the index, we first generate the kmers from all the reads in the sample (accounting for both forward and reverse complement strands) and exclude kmers that occur only once, since these are most likely present due to sequencing errors. We use a kmer length of 31-bp in our experiments (compacting the kmers into 64-bit integers for convenience). We then generate the MPHF, h_S, for the resulting set of distinct kmers, S, and store their counts in an integer array A \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$( \vert A \vert = \vert S \vert )$$ \end{document} , at the indices given by h_S; namely, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A [ {h_S} ( x ) ] = count ( x )$$ \end{document} , for each kmer \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x \in S$$ \end{document} . We found 8 bits to be sufficient for storing the kmer counts (and since many counts are small, these can be compressed even further using techniques such as varint encoding).

Finally, we need to store the elements in S to support lookups, since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${h_S} ( z )$$ \end{document} for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$z \notin S$$ \end{document} will return a valid but incorrect index into A. One direct solution for storing S would be to rely on the MPHF, using a secondary array B and setting \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B [ {h_S} ( x ) ] = x$$ \end{document} for all kmers \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x \in S$$ \end{document} ; then we could check upon lookup of a key y, if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B [ {h_S} ( y ) ]$$ \end{document} is equal y and determine whether y was in the set. However, this solution requires storing the array B of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\vert S \vert$$ \end{document} 64-bit integers, which is 4 × larger than A, and would substantially increase the index. So instead, we store the set S in a Bloom filter, BF, which is a widely used probabilistic data structure for testing set membership that offers space efficiency at the expense of possible false positives (no false negatives are possible). We have configured the size of BF based on a false-positive probability of 0.05. As a result, our index for each sample consists of (1) the MPHF, h_S, (2) the array of counts, A, and (3) the Bloom filter storing the elements of S, BF. As an example, the size of the index constructed for an HMP sample containing 20 million 100-bp long reads was 108 MB.

2.1.2. Coverage estimation

Given a contig c and an index I of a cohort sample, we estimate the coverage of c in this sample by performing lookups in I for each kmer in c and then computing the median of the resulting counts. More specifically, we return the median of the set of counts \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$C = \{ \ldots {\rm count} ( x ) . \ldots \vert \; \forall \;{\rm kmers} \;x \in C \} $$ \end{document} , where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\rm count} ( x ) = \left\{ \begin{matrix}{I.A [ I.{h_S} ( x ) ] , } \hfill & {{\rm{if }}\ x \in I.BF} \hfill \\ {0,} \hfill & {{\rm{otherwise.}}} \end{matrix} \right. \tag{1} \end{align*} \end{document}

2.2. Contig representation

Given a set of contigs assembled from a single or multiple metagenomic samples, our goal is to bin together the contigs that belong to the same class (e.g., species or strain). Similar to existing methods, for example, CONCOCT, we first represent each contig as a multidimensional vector using both its sequence composition and coverage profile across multiple samples, where our coverages are approximated using kmer counts instead of read mapping, as previously described. Namely, given M reference samples (either from the same study or from a public database), our coverage profile is the median count of the contig kmers in each sample, while the composition profile is the normalized frequency of each possible tetramer in the contig and its reverse complement (resulting in a total of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$F = 136$$ \end{document} such features); the normalization of composition features is done according to the CONCOCT procedure (Alneberg et al., 2014). Therefore, each contig is a vector \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V = [ {c_1} , \ldots , {c_M} , {f_1} , \ldots , {f_F} ]$$ \end{document} , where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${c_i}: = $$ \end{document} the median kmer count in sample i and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_j}: = $$ \end{document} the frequency of tetramer j in the contig sequence.

2.3. Clustering algorithm

Given the resulting vector representations, we cluster the contigs using a Bayesian Gaussian mixture model (GMM) with a Dirichlet prior. In brief, we define a mixture distribution p of K Gaussian components over n data points \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${x_i} \in { \mathbb{R}^d}$$ \end{document} and (unobserved) assignment labels \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${z_i} \in { \{ 0 , 1 \} ^K}$$ \end{document} for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$i = 1 , \ldots , n$$ \end{document} . Our model is the product of a likelihood term \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p ( z , X \vert \theta ) = \prod \limits_{i = 1}^n \; \prod \limits_{k = 1}^K \pi _k^{{z_{ik}}}{ \cal N}{ ( {x_i} \vert { \mu _k} , \Lambda _k^{ - 1} ) ^{{z_{ik}}}} \end{align*} \end{document}

and a prior term \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} p ( \theta ) = { \rm{Dir}} ( \pi \vert { \alpha _0} ) \prod \limits_{k = 1}^K { \cal N} ( { \mu _k} \vert {m_0} , ( { \beta _0}{ \Lambda_k})^{ - 1}) Wi ( { \Lambda_k} | {L_0}{ \nu _0} ) . \end{align*} \end{document}

Here, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$X \in { \mathbb{R}^{d \times n}}$$ \end{document} is the matrix of data points and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal N}{ ( \cdot \vert { \mu _k} , \Lambda_k^{ - 1} ) ^{{z_{ik}}}}$$ \end{document} is a multivariate Gaussian with mean \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \mu _k}$$ \end{document} and inverse covariance matrix \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \Lambda_k}$$ \end{document} . The \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi \in { \Delta _{K - 1}}$$ \end{document} form a vector of cluster weights. Together, the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \mu _k} , { \Lambda _k} , \pi$$ \end{document} form the parameter vector \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document} of the likelihood. The prior over \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta$$ \end{document} is a product of a Dirichlet with hyperparameter \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \alpha _0} \in { \Delta _{K - 1}}$$ \end{document} , a multivariate normal with hyperparameters \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${m_0} \in { \mathbb{R}^d}$$ \end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \beta _0} > 0$$ \end{document} , and a Wishart distribution parameterized by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${L_0} \in { \mathbb{R}^{d \times d}}$$ \end{document} p.s.d. and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \nu _0} > 0$$ \end{document} .

We perform inference by maximizing the marginal log-likelihood \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\log p ( X )$$ \end{document} using variational inference. In brief, we maximize the evidence lower bound \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \log p ( X ) \ge { \mathbb{E}_{q ( z , \theta ) }} \left[ { \log p ( z , X , \theta ) - \log q ( X , z ) } \right] \end{align*} \end{document}

over the set of approximating distributions q. By our choice of conjugate prior, the posterior \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$p ( \theta , z \vert X )$$ \end{document} and hence the optimal q have the same form, which factors over \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q ( z \vert \theta ) q ( \theta )$$ \end{document} . We optimize the bound using variational expectation-maximization, which consists of repeatedly choosing updating \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q ( z \vert \theta )$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q ( \theta )$$ \end{document} . Each update has a closed-form solution by our choice of conjugate prior. We conclude the algorithm by assigning each data point to its maximum a-posteriori label according to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q ( z \vert \theta )$$ \end{document} . We refer the reader to section 21.6.1 in the standard textbook Murphy (2012) for the full derivation of this algorithm.

At a high level, the above model is very similar to automatic relevance determination (ARD), which is used by CONCOCT. We have found this approach to perform slightly better in practice than ARD, especially for automatically determining the number of clusters in the data; however, both algorithms are implemented in our software package and can be experimented with. Other alternative clustering methods can also be easily plugged into GATTACA's binning pipeline.

2.4. Sample selection

Given a query sample, Q, we would like to select samples from the public database, which can provide discriminatory features for clustering the contigs of Q (where the features represent the coverage of the contigs in the respective samples). Intuitively, the selected samples must share some content with Q (have relevance), as well as have pairwise diversity among themselves to guarantee coverage of different contigs of Q. Similar relevance and diversity concepts can be found in online recommendation systems [e.g., for articles or music (Abbar et al., 2013; Chen and Cong, 2015)].

A standard approach for comparing strings is to represent them as sets and then measure the relative size of their sets' intersection. For example, text documents can be represented as sets of words, while DNA sequences are commonly represented as sets of overlapping kmers. In this case, the similarity of two given sets A and B is measured using the Jaccard coefficient \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$J ( A , B ) = { \frac { \vert A \cap B \vert } { \vert A \cup B \vert } } $$ \end{document} . Since computing the Jaccard coefficient directly can be very inefficient for large sets, in practice it is often estimated from a much smaller set of representations or fingerprints using the MinHash technique (Broder et al., 2000). MinHash has been extensively applied in textual data mining and information retrieval systems, for example, to efficiently compare documents, websites, and news articles.

Due to its simplicity and efficiency, it has also been used for various tasks involving DNA sequence comparisons, such as motif discovery (Wijaya et al., 2005), de novo assembly (Berlin et al., 2014), read alignment (Popic and Batzoglou, 2017), and clustering of DNA sequence collections, including metagenomic sequences (Ondov et al., 2016). Here we use it to quickly estimate the relevance and diversity of the set of metagenomic samples considered for inclusion into the binning panel. Internally, GATTACA uses the implementation of MinHash for DNA sequence comparisons adapted from Popic and Batzoglou (2017). At a high level, we first breakdown the reads of a sample Q into a set of overlapping kmers, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$K = \{ {s_0} , {s_1} , \ldots , {s_{n - 1}} \} $$ \end{document} , and then compute the MinHash fingerprint of K as the vector \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$F = [ {f_0} , {f_1} , \ldots , {f_{L - 1}} ]$$ \end{document} , where each fingerprint entry f_i is simply the minimum set element under hash function h_i: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${f_i} = \min \{ {h_i} ( {s_0} ) , {h_i} ( {s_1} ) , \ldots , {h_i} ( {s_{n - 1}} ) \} $$ \end{document} (for efficiency, we use universal hash functions of the form: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${h_i} ( x ) = {a_i}x + {b_i}$$ \end{document} ).

We define the relevance score between a sample A and B as simply the ratio, r, of the number of entries shared by their fingerprints over L (which approximates their original Jaccard similarity; Broder et al., 2000). Then, the user can include all samples S for which \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$r ( S , Q ) > T$$ \end{document} into the panel (where T is a configurable threshold) or according to their rank. By indexing the fingerprints of all the samples in the database into L tables (based on the value of each fingerprint entry, respectively), we can find all the samples that share at least one fingerprint entry in common with Q using simple lookups, as well as rank them according to relevance.

Finally, if the number of relevant samples is too high, the panel can be reduced using the diversity criterion. That is, given all the relevant samples, we can select n samples that maximize the diversity of the set. This problem is known as the dispersion problem (Erkut, 1990), where the objective is to locate k points among n, such that some function of distances between the k points is maximized. One popular optimality criterion is the MAX–MIN, which maximizes the minimum distance between a pair of points. This problem is known to be NP-hard; however, an efficient greedy heuristic algorithm exists for the MAX–MIN dispersion problem when the distances satisfy the triangle inequality, with provable performance guarantee of 2 (Ravi et al., 1991). Given two samples A and B, we define their diversity as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$D ( A , B ) = 1 - J ( A , B )$$ \end{document} and apply the greedy algorithm of Ravi et al. (1991) to find the n samples (where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\usepackage{upgreek}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$J ( A , B )$$ \end{document} is estimated as previously described). Currently, this simple scheme can be efficiently used to select samples with distinctive kmer sets; as an extension of this work, we are also evaluating alternative distance metrics and locality sensitive hashing techniques that can also discriminate between samples which differ in kmer frequency only.

3. Results

3.1. Data sets

3.1.1. Synthetic data sets

We used two synthetic data sets generated by Alneberg et al. (2014) from the 16S rRNA samples of the HMP (Turnbaugh et al., 2007). The first data set (“Species-Mock”) consists of 96 samples containing a mixture of 101 different species (without strain-level variation), while the second data set (“Strain-Mock”) consists of 64 samples comprising a mixture of 20 different organisms, of which some represent strains of the same species (e.g., this data set includes five different E. coli strains). The relative abundance profiles of the species and strains in each sample were assigned according to the distribution of the 101 and 20 most abundant organisms in the original HMP samples, respectively. Reads (100-bp long) were simulated from random positions of the genomes present in the sample based on their relative abundance, for a total of 7.75 million reads and 11.75 million reads in each “Species-Mock” and “Strain-Mock” sample, respectively. Both data sets contain the set of contigs coassembled across all the samples by Alneberg et al. (2014) using the Ray assembler (Boisvert et al., 2010), and partitioned into fragments of 10 kb when appropriate.

We used the default minimum contig length of 1000-bp when running CONCOCT, MaxBin, and GATTACA; this parameter was set to 1500-bp for MetaBAT, which is the smallest length supported by this method. As a result, the “Species-Mock” included 37,627 valid contigs and the “Strain-Mock” included 9411 valid contigs (for all tools except MetaBAT). To determine the genome assignment for each contig, the simulated reads were mapped back to the contigs and the genome was selected based on the majority of the mapped reads (with the requirement that it corresponds to at least 50% of the reads and there are at least 10 unambiguously mapped reads); this is the same criteria used by Alneberg et al. (2014).

To evaluate single-sample binning, we also assembled the contigs for several simulated samples of the “Species-Mock” independently. We used the SPADES (Bankevich et al., 2012) assembler with default parameters and automatic kmer selection, and applied the same postprocessing and labeling criteria.

3.1.2. Real data sets

To evaluate coassembly-based binning, we used the small Sharon et al. (2013) data set comprising 11 fecal samples from a preterm infant (Sharon et al., 2013). We downloaded 1.372 × 10⁸ 100-bp short reads from the SRA052203 NCBI archive as 18 separate samples (of which 7 were resequenced from the original 11 samples). The coassembled contigs for this data set were made public by Alneberg et al. (2014). To test binning accuracy, we used the CheckM (Parks et al., 2015) method based on single-copy core genes (SCGs), which are housekeeping genes that occur in single copies in the microbial genomes and have been previously used to evaluate the purity of the metagenomic clusters (Alneberg et al., 2014).

To evaluate binning of contigs assembled from single samples, we downloaded Illumina raw sequencing reads from the following two publicly available gut microbiome sample collections: (1) gut samples from HMP (148 total) and (2) 56 gut samples from the study of Albertsen et al. (2013) available through the European Nucleotide Archive. We used the SPADES assembler (Bankevich et al., 2012) with default parameters to assemble the contigs of individual samples (cutting contigs into 10 kb fragments and filtering contigs shorter than 1000-bp, as in simulation). Since the HMP samples were very large (over 30 GB/sample uncompressed), we subsampled them using the seqtk program to 20 million paired-end reads per sample during index construction. To label the contigs in each evaluated sample, we used the TAXAssign program. Labels were assigned based on the top 100 hits against the BLAST NCBI genome database that had at least 95% identity and 90% query coverage. A taxonomic label was assigned if 90% of the hits at a given level belong to the taxon associated with that label. This procedure classified roughly 15% of them on average at the species level depending on the sample. Since only a small fraction of contigs could be classified via TAXAssign, we also used SCGs to evaluate binning accuracy with CheckM.

3.2. Clustering evaluation metrics

We apply the following three standardized metrics for evaluating metagenomic clustering of labeled contigs: (1) recall—measures how many same-class contigs are placed in the same cluster, (2) precision—measures the purity of each cluster, and (3) adjusted Rand index (ARI)—measures how often pairs of same-class contigs are clustered together. On synthetic data sets (for which we know the ground-truth genome assignment) and the contigs confidently classified using TAXAssign, we compute these metrics using the Validate.pl script provided by CONCOCT directly (Alneberg et al., 2014). For SCG-based evaluation, we report the metrics computed by CheckM, using the “benchmark.R” script provided by MetaBAT (Kang et al., 2015) for visualization.

3.3. Synthetic benchmarks

First, we show that replacing read mapping with kmer counting for coverage estimation does not result in loss of clustering accuracy, using two synthetic benchmarks created by Alneberg et al. (2014). Namely, we compare the performance of GATTACA, CONCOCT, MaxBin, and MetaBAT on the “Species-Mock” and “Strain-Mock” data sets, using the published coassembled contigs. Here, we built the index of kmer counts only from the cohort of samples associated with each data set. As a result, our “Species-Mock” index contains 96 samples (with a total size of 12 GB) and our “Strain-Mock” index contains 64 samples (size of 5.1 GB). Table 1 shows the accuracy results for each tool. GATTACA achieves similar accuracy to CONCOCT and MaxBin (all higher than MetaBAT) on both data sets, with a somewhat lower recall and higher precision and ARI. MetaBAT achieves the highest precision on both data sets; however, it clusters substantially fewer contigs.

Table 1.

Binning Performance on Synthetic Data

Data set	Contigs	Labeled contigs	True clust.	Pred. clust.	Rec	Prec	ARI	Program
Species-Mock	37,628	37,589	101	107	0.986	0.994	0.985	GATTACA
				101	0.998	0.988	0.983	CONCOCT
				101	0.997	0.989	0.982	MaxBin
	37,501	37,490	101	139	0.950	0.999	0.953	MetaBAT
Strain-Mock	9417	9109	20	23	0.971	0.979	0.975	GATTACA
				21	0.982	0.942	0.945	CONCOCT
				116	0.967	0.917	0.879	MaxBin
	6281	6244	20	34	0.856	0.998	0.857	MetaBAT

Evaluation metrics were computed with the Validate.pl script provided by CONCOCT.

ARI, adjusted Rand index.

Since one of the goals of this project is to support single-sample analysis, we also evaluated single-sample contig clustering on simulated data. In particular, we assembled several samples of the synthetic “Species-Mock” data set separately (which resulted in 11 thousand contigs on average per sample) and used the remaining samples to build the index of kmer counts in a “leave-out-out” manner; namely, for a given sample Q, the index was built from all the samples in the original cohort except Q. Table 2 shows GATTACA's performance on three different samples. As expected, the accuracy results are somewhat lower but still comparable to the coassembly benchmark.

Table 2.

Evaluation of Single-Sample Binning

Sample	Contigs	Labeled contigs	True clust.	Pred. clust.	Rec	Prec	ARI
S1023	8245	8237	25	23	0.964	0.986	0.952
S1063	12,845	12,824	25	29	0.992	0.997	0.992
S1095	12,863	12,848	31	29	0.996	0.996	0.997

Contigs were assembled from individual samples of the “Species-Mock” data set and clustered using a reference index built from all other samples in this data set.

3.4. Real data benchmarks

Next, we evaluate the binning performance of each method on the coassembly of the real Sharon et al. (2013) data set. Since the true reference genomes are not available for this experiment, we used the CheckM (Parks et al., 2015) method based on SCGs to approximate recall and precision. The GATTACA index was built from the 18 samples of the data set, having a total size of 2.2 GB. Figure 1 shows the number of genomes identified by each method with a precision higher that 90% and 95%, respectively, and varying recall, as reported by CheckM. All methods, except MetaBAT, produce three highly accurate genomes with precision and recall ≥95% (MetaBAT generates only one such genome). GATTACA's performance is very similar to CONCOCT here, with the exception of an unreported bin at precision ≥90%, which CONCOCT produces with a recall of 30%. Figure 2 shows how these performance metrics vary across the generated bins.

FIG. 1.

Binning performance on the real Sharon data set reported by CheckM. The number of genomes identified with ≥90% (left) and ≥95% (right) is shown for varying recall. The image was generated using the “benchmark.R” script (slightly modified) provided by MetaBAT.

FIG. 2.

Binning performance on the real Sharon data set reported by CheckM. The image was generated using the “benchmark.R” script (slightly modified) provided by MetaBAT.

We also evaluate the binning of single-sample assemblies using a panel of publicly available data sets of the gut microbiome. Table 3 shows the binning accuracy computed using TAXAssign labels for three HMP stool samples clustered using a panel of 95 different stool samples from the HMP database. Due to the long read mapping runtime, we show results for CONCOCT, MetaBAT, and MaxBin on only one of the HMP samples, SRS011239. GATTACA's performance is overall stable and accurate across samples, with an ARI of about 90%. On SRS011239, GATTACA achieves an accuracy similar to that of other methods. Figure 3 also shows the SCG-based results for SRS011239 obtained using CheckM.

FIG. 3.

Binning performance on the real SRS011239 sample and a panel of 95 HMP metagenomes reported by CheckM. The number of genomes identified with ≥90% (left) and ≥95% (right) is shown for varying recall. The image was generated using the “benchmark.R” script (slightly modified) provided by MetaBAT. HMP, Human Microbiome Project.

Table 3.

Binning Performance for Single Real Human Microbiome Project Samples Using a Panel of 95 Other Human Microbiome Project Metagenomes

Data set	Contigs	Labeled contigs	True clust.	Pred. clust.	Rec	Prec	ARI	Program
SRS011239	28,839	5430	46	131	0.910	0.963	0.929	GATTACA
	33,180	5456	48	212	0.902	0.959	0.906	CONCOCT
	33,166	5456	48	52	0.928	0.936	0.910	MaxBin
	16,136	3419	17	51	0.920	0.988	0.937	MetaBat
SRS011134	37,645	5200	54	128	0.848	0.920	0.892	GATTACA
SRS011302	28,034	4712	55	115	0.875	0.929	0.898	GATTACA

Contigs that were not covered by at least a single sample were not included in the clustering step.

Finally, we evaluate GATTACA in clustering contigs from a single sample of the Albertsen et al. (2013) study using an index built from 46 samples of the study and an index augmented with 83 HMP samples selected using the relevance criteria previously described. Table 4 shows the results based on the TAXAssign labels. As previously shown by Alneberg et al. (2014), binning accuracy increases with larger cohort sizes. Similarly, in this experiment, it can be seen that the binning precision improves with the addition of the HMP samples, motivating the inclusion of publicly available data in studies with fewer available samples.

Table 4.

Binning Performance for Sample SAMEA1965191 from the Study of Albertsen et al. Using a Smaller Index of Samples from the Study and an Index Augmented with Human Microbiome Project Samples

Index	Contigs	Labeled contigs	True clust.	Pred. clust.	Rec	Prec	ARI
Nielsen	30,345	2655	45	97	0.699	0.833	0.660
Nielsen+HMP(83)	30,821	2658	45	129	0.711	0.882	0.735

Contigs that were not covered by at least a single sample were not included in the clustering step.

HMP, Human Microbiome Project.

3.5. Runtime and space consumption

To estimate the reduction in runtime and space consumption of our approach, we compare the runtime and space requirements of the CONCOCT pipeline, which is based on read mapping [internally using Bowtie2 (Langmead and Salzberg, 2012), samtools (Li et al., 2009), and Picard tools], to the GATTACA kmer counting implementation for computing the coverage of preassembled contigs. The reported runtimes were obtained on a single 2.67 GHz Intel Xeon X5550 processor, in single-threaded mode, although both of the tools can be easily run in parallel. We outline the performance across several data sets hereunder.

For the “Species-Mock” data set, coverage estimation in one given sample with the CONCOCT pipeline took 62 minutes; with an additional 16 minutes for the initial sample download from the repository. This single sample also requires 3 GB (uncompressed)/580 MB (compressed) disk space to be stored in FASTA format (without base qualities), with an additional 450 MB to store its alignment results. With GATTACA, the kmer count lookups in this sample took 2.7 minutes and the GAC index construction took 6 minutes. Ignoring the download time, the resulting speedup in coverage estimation is 23 × if the index for this sample has already been precomputed offline; otherwise, the speedup is 7 × when the index is computed on the fly. Moreover, the GAC index only takes 140 MB and it is the only data structure required to use this sample for binning.

For the “Strain-Mock” data set, coverage estimation in one given sample with the CONCOCT pipeline took 101 minutes, with an additional 24 minutes for the initial sample download from the repository. This single sample also requires 4.8 GB (uncompressed)/842 MB (compressed) disk space to be stored; with an additional 685 MB to store its alignment results. With GATTACA, the kmer count lookups in this sample took 0.75 minutes and the GAC index construction took 6.2 minutes. Therefore, the resulting speedup in coverage estimation is 135 × with offline indexing and 14.5 × speedup with online indexing. Moreover, the GAC index only takes 91 MB.

Finally, for the real HMP data set, coverage estimation in one given sample (containing 20 million reads) with the CONCOCT pipeline took 51.2 minutes, while GATTACA took 2.7 minutes for lookups and 5.3 minutes in GAC index construction, resulting in a 19 × speedup with offline indexing and 6.4 × speedup with online indexing. This sample also requires 7 GB (uncompressed)/2.6 GB (compressed) of disk space [with an initial download size of 7 GB (compressed) before subsampling]. On the contrary, its resulting GAC index only requires 108 MB.

Given these results and considering that binning panels usually consist of one hundred or more samples, both the runtime and disk savings for switching to GATTACA can be very substantial. Note, the reported performance was obtained using the BDZ algorithm (Botelho et al., 2007) for index construction, and our recent preliminary results show that using the BBhash algorithm (Limasset et al., 2017) can speed up the runtime further.

3.6. Evaluation of sample selection

We performed the following experiment to evaluate the performance of MinHash for sample selection. Given a randomly selected real query sample Q (namely, SRS011134 from the HMP), we created a panel of 50 samples (each containing 10 M reads of length 100-bp) that share a varying fraction of reads with Q, ranging from 2% to 100%. More specifically, we first started with Q and 49 samples containing reads simulated from the chicken genome (under the assumption that there will be little content similarity between this genome and the species present in SRS011134). We then replaced varying fractions of the chicken reads with the reads of SRS011134. Figure 4 shows the true Jaccard similarity between the panel samples and Q, as well as the Jaccard similarity estimated using MinHash, with varying fingerprint lengths. We found that MinHash fingerprints of length 1024 provide a good trade-off between runtime and sensitivity (fingerprint length is a configurable parameter in our software). This fingerprint length requires only 8K additional space per metagenomic sample. The fingerprint comparisons performed in this experiment took less than a second in total, and can easily scale to large publicly available metagenomic data sets. More experiments of interest on MinHash can be found in the recent article (Ondov et al., 2016); for example, it shows the results of applying MinHash to cluster HMP samples.

FIG. 4.

Jaccard similarities estimated using MinHash for a query sample Q (SRS011134) and a panel of 49 samples sharing varying levels of content similarity with Q. The true Jaccard similarity is shown along with estimates computed from fingerprints of different lengths.

4. Conclusion

GATTACA is a lightweight method for metagenomic binning, which achieves comparable binning accuracy with leading alignment-based methods at a fraction of the cost by approximating contig abundances using kmer counts. Furthermore, through offline indexing of publicly available cohorts of metagenomes, it enables efficient analysis of single metagenomic samples, catering to settings with scarce computational and data resources. As a result, an ongoing effort of this project is to index all the publicly available samples from various metagenomic archives and make them available to the public.

Footnotes

Acknowledgments

V.K. was supported by a VMWare Stanford Graduate Fellowship. The authors thank Roye Rozov, Rayan Chikhi, Johannes Alneberg, and the anonymous reviewers for their helpful feedback on the article and/or the software.

Author Disclosure Statement

S.B. is a cofounder and advisor to DNAnexus and member of the scientific advisory boards of 23andMe, Eve Biomedical, and GenapSys. S.B. and V.P. are currently employed by Illumina, Inc. M.S. is a cofounder of Personalis and a member of the scientific advisory boards of Personalis, AxioMx, and GenapSys.

References

Abbar

, Amer-Yahia

, Indyk

, et al. 2013. Real-time recommendation of diverse related articles. Proceedings of the 22nd international conference on World Wide Web, Rio de Janeiro, Brazil, pp. 1–12.

Albertsen

, Hugenholtz

, Skarshewski

, et al. 2013. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol., 31, 533–538.

Alneberg

, Bjarnason

B.S.

, de Bruijn

, et al. 2014. Binning metagenomic contigs by coverage and composition. Nat. Methods., 11, 1144–1146.

Bankevich

, Nurk

, Antipov

, et al. 2012. Spades: A new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol., 19, 455–477.

Berlin

, Koren

, Chin

C.S.

, et al. 2015. Assembling large genomes with single-molecule sequencing and locality sensitive hashing. Nat. Biotechnol. 33, 623.

Boisvert

, Laviolette

, and Corbeil

2010. Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J. Comput. Biol., 17, 1519–1533.

Botelho

F.C.

, Pagh

, and Ziviani

2007. Simple and space-efficient minimal perfect hash functions. Proceedings of the 10th international conference on Algorithms and Data Structures, Halifax, Canada, pp. 139–150.

Broder

A.Z.

, Charikar

, Frieze

A.M.

, et al. 2000. Min-wise independent permutations. J. Comput. Syst. Sci., 60, 630–659.

Chen

, and Cong

2015. Diversity-aware top-k publish/subscribe for text stream. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, pp. 347–362.

10.

Cichelli

R.J.

1980. Minimal perfect hash functions made simple. Commun. ACM., 23, 17–19.

11.

Erkut

1990. The discrete p-dispersion problem. Eur. J. Oper. Res., 46, 48–60.

12.

Hunter

, Corbett

, Denise

, et al. 2014. EBI metagenomics—a new resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 42, D600–D606.

13.

Huson

D.H.

, Auch

A.F.

, Qi

, et al. 2007. Megan analysis of metagenomic data. Genome Res. 17, 377–386.

14.

Kang

D.D.

, Froula

, Egan

, et al. 2015. Metabat, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 3, e1165.

15.

Kislyuk

, Bhatnagar

, Dushoff

, et al. 2009. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinf. 10, 1.

16.

Langmead

, and Salzberg

S.L.

2012. Fast gapped-read alignment with bowtie 2. Nat. Methods., 9, 357–359.

17.

, Handsaker

, Wysoker

, et al. 2009. The sequence alignment/map format and samtools. Bioinformatics. 25, 2078–2079.

18.

Limasset

, Rizk

, Chikhi

, et al. 2017. Fast and scalable minimal perfect hashing for massive key sets. arXiv Preprint arXiv:1702.03154.

19.

Y.Y.

, Chen

, Fuhrman

J.A.

, et al. 2016. COCACOLA: Binning metagenomic contigs using sequence composition, read coverage, co-alignment, and paired-end read linkage. Bioinformatics. 33, 791–798.

20.

Luo

, Zeng

, Berger

, et al. 2016. Low-density locality-sensitive hashing boosts metagenomic binning. arXiv Preprint arXiv:1604.02699.

21.

Marçais

, and Kingsford

2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27, 764–770.

22.

Murphy

K.P.

2012. Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.

23.

Ondov

B.D.

, Treangen

T.J.

, Melsted

, et al. 2016. Mash: Fast genome and metagenome distance estimation using minhash. Genome Biol. 17, 132.

24.

Parks

D.H.

, Imelfort

, Skennerton

C.T.

, et al. 2015. Checkm: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055.

25.

Patro

, Mount

S.M.

, and Kingsford

2014. Sailfish enables alignment-free isoform quantification from rna-seq reads using lightweight algorithms. Nat. Biotechnol., 32, 462–464.

26.

Popic

, and Batzoglou

2017. A hybrid cloud read aligner based on minhash and kmer voting that preserves privacy. Nat. Commun. 8, 15311.

27.

Ravi

, Rosenkrantz

D.J.

, and Tayi

G.K.

1991. Facility dispersion problems: Heuristics and special cases. Proceedings of the 10th international conference on Algorithms and Data Structures, Halifax, Canada, pp. 355–366.

28.

Rizk

, Lavenier

, and Chikhi

2013. Dsk: K-mer counting with very low memory usage. Bioinformatics. 29, 652–653.

29.

Rosen

G.L.

, Reichenberger

E.R.

, and Rosenfeld

A.M.

2011. Nbc: The naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 27, 127–129.

30.

Sharon

, Morowitz

M.J.

, Thomas

B.C.

, et al. 2013. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120.

31.

Strous

, Kraft

, Bisdorf

, et al. 2012. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410.

32.

Turnbaugh

P.J.

, Ley

R.E.

, Hamady

, et al. 2007. The Human Microbiome Project: Exploring the microbial part of ourselves in a changing world. Nature. 449, 804.

33.

Wijaya

, Rajaraman

, Brahmachary

, et al. 2005. A hybrid algorithm for motif discovery from dna seq.s. 3rd Asia-Pacific Bioinformatics Conference, Citeseer.

34.

Wood

D.E.

, and Salzberg

S.L.

2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, 1.

35.

Y.W.

, Tang

Y.H.

, Tringe

S.G.

, et al. 2014. Maxbin: An automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2, 26.

36.

Zhang

, Pell

, Canino-Koning

, et al. 2014. These are not the k-mers you are looking for: Efficient online k-mer counting using a probabilistic data structure. PLoS One, 9, e101271.