Mutant-Bin: Unsupervised Haplotype Estimation of Viral Population Diversity Without Reference Genome

Abstract

High genetic variability in viral populations plays an important role in disease progression, pathogenesis, and drug resistance. The last few years have seen significant progress in the development of methods for reconstruction of viral populations using data from next-generation sequencing technologies. These methods identify the differences between individual haplotypes by mapping the short reads to a reference genome. Much less has been published about resolving the population structure when a reference genome is lacking or is not well-defined, which severely limits the application of these new technologies to resolve virus population structure.

We describe a computational framework, called Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised—the number of haplotypes does not have to be specified in advance; and (iii) it identifies the polymorphic sites that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population. The method was evaluated on simulated reads with sequencing errors and 454 pyrosequencing reads from HIV samples. Our method clustered a high percentage of haplotypes with low false-positive rates, even at low genetic diversity.

1. Introduction

Next generation sequencing (NGS) technologies generate data more efficiently, economically, and with greater depth than previously possible, which enables many new applications. In particular, the characterization of genetic diversity in heterogeneous viral populations is of significant interest because of its impact on disease and therapeutic intervention strategies (Shankarappa et al., 1999; Gaschen et al., 2002). As compared to existing technologies, reads produced by NGS are typically shorter and more error-prone. Thus, many computational challenges arise while analyzing deep sequence data from heterogeneous populations (Pop, 2009).

At any given time, within-host virus populations consist of a collection of distinct, albeit closely related genetic variants known as quasispecies. Each individual variant, a haplotype, occurs with a different relative frequency. The high genetic diversity of a pathogen population has important consequences in disease progression as it allows the virus to respond to changes in the host environment such as evading host defenses and therapeutic interventions. The high coverage and enormous sequence data output by NGS technologies have the potential to resolve the genetic variation within a virus sample and thereby infer the population structure and composition, which will directly benefit research on disease progression, drug resistance, vaccine design, and viral evolution (Hoffmann et al., 2007).

A number of methods have been published for quasispecies reconstruction that are able to infer genomes of individual haplotypes as well as their prevalence (Jojic et al., 2008; Astrovskaya et al., 2011; Beerenwinkel et al., 2012). In Eriksson et al. (2008) and Zagordi et al. (2010a), the authors proposed a set of methodologies based on a graph theoretic solution, where a set of haplotypes was obtained by calculating a minimal coverage set of paths over a graph of aligned reads. In Prosperi et al. (2011), a reconstruction algorithm based on combinations of multinomial distributions was designed to account for the overlaps between reads with similar frequencies. These methods have been applied to HIV datasets with diversity in the range of 3–10%, and they require a reference genome to which the reads are aligned. De novo assembly of short reads has been modeled as an NP-hard optimization problem. Moreover, the presence of rampant structural polymorphisms, high mutation rates, and sequencing artifacts makes it difficult to obtain a meaningful consensus sequence. In Beerli et al. (2002), the authors explored the limitations of consensus sequencing. In some regions of the HIV env gene (e.g., V1, V2, V5), where insertions and deletions accumulate rapidly, the consensus sequence has no biological meaning, making its use for phylogenetic analyses questionable. Also, for viruses with variable regions, alignment of reads to the consensus is problematic. To the best of our knowledge, not much has been published about resolving the population structure when an assembled reference is not well defined.

We present a computational framework, Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking, and hence avoids the costly clonal sequencing and compute-intensive alignment required by other methods; (ii) the number of haplotypes does not have to be specified in advance; and (iii) it identifies the regions with co-occurrence of polymorphic sites in a subset of haplotypes and the frequency with which they appear in the population. Our main motivation is to use this framework to survey the genetic diversity of viral populations in situations where there is substantial structural variation among haplotypes.

Mutant-Bin is tested on both simulated and real deep sequencing datasets. We evaluate its accuracy as a function of diversity, sequencing errors, relative frequencies, and number of haplotypes. We assess its performance in isolating regions containing polymorphic sites into clusters and estimating haplotype frequencies. Also, we compare our method to the existing state of the art, ShoRAH (Zagordi et al., 2010a).

2. Methods

Consider a quasispecies with K haplotypes of genome length G that appear with frequencies $\mathbf{x} = \{x_k\}_{k=1}^K$ in the population. The number of haplotypes K and their frequencies are unknown. Within a population, the fraction of reads sequenced from haplotype k will be proportional to its frequency x_k. Our objective is to identify regions within each haplotype that contain the polymorphic sites, to cluster individual haplotypes, and to infer their frequency in the population in the absence of a reference genome.

We define the pairwise difference d_ij between two sequences s_i and s_j as the number of base positions on which the two sequences differ. The diversity of a population containing K haplotypes is then

$$D_{pop} = \frac{\sum_{i=1}^{K-1} \left( \sum_{j=i+1}^{K} d_{ij} \right)}{K(K-1)/2} \tag{1}$$

We designate as the reference the variant with the lowest average pairwise difference in the population. The regions within the reference genome that have polymorphic sites with derived nucleotides that result in different haplotypes are called suspect regions. Even a single base difference, if it appears consistently in all copies of a variant, will lead to its identification as a new haplotype.
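As a rough illustration of Equation (1) and of the choice of reference, the following sketch computes the population diversity and picks the variant with the lowest average pairwise difference. It assumes that the haplotypes are already known and have equal length (true only in a simulation setting); the function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def pairwise_difference(s_i, s_j):
    """Number of base positions at which two equal-length sequences differ."""
    if len(s_i) != len(s_j):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s_i, s_j))


def population_diversity(haplotypes):
    """Equation (1): mean pairwise difference over all K(K-1)/2 pairs."""
    k = len(haplotypes)
    if k < 2:
        return 0.0
    total = sum(pairwise_difference(a, b) for a, b in combinations(haplotypes, 2))
    return total / (k * (k - 1) / 2)


def reference_index(haplotypes):
    """Index of the variant with the lowest average pairwise difference."""
    k = len(haplotypes)
    if k < 2:
        return 0

    def average_difference(i):
        others = (haplotypes[j] for j in range(k) if j != i)
        return sum(pairwise_difference(haplotypes[i], s) for s in others) / (k - 1)

    return min(range(k), key=average_difference)


# Toy population of three hypothetical haplotypes.
haps = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
print(population_diversity(haps))  # pairwise differences 1, 2, 1 -> mean 4/3
print(reference_index(haps))       # the middle variant is closest to the others
```

Note that Equation (1) yields a count of differing positions; dividing by the genome length G would express it as a fraction.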

Mutant-Bin is an application of the Lander-Waterman model to viral population estimation and is based on the l-tuple (ordered sequence of length l) content of the reads. Its framework consists of three steps. First, the distribution of l-tuples within the population is modeled as a mixture of Poisson distributions, whose means correspond to the coverage of suspect regions, determined by the subset of haplotypes in which the polymorphic sites co-occur. Second, the l-tuples are clustered by their frequency using the Variable Bandwidth Mean Shift (VBMS) algorithm. We bin the l-tuples by their cluster centers to identify the l-tuples spanning the polymorphic sites that co-occur in a subset of haplotypes and determine the frequency with which they appear. Finally, we propose a greedy heuristic to bin the l-tuples by the genomes they originate from and thereafter infer the local haplotype structure.

2.1. Mixture of Poisson distributions

According to the Lander-Waterman model, the probability that a base is sequenced m times follows a Poisson distribution, $P(m) = \lambda^m \frac{e^{-\lambda}}{m!}$, where λ is the coverage of the experiment (Port et al., 1995). We define l such that all l-tuples appear at most once within the genome. Then, the count of l-tuples in a set of reads also follows a Poisson distribution with parameter N(L−l+1)/(G−L+1)≈NL/G, where G is the genome length, N is the number of reads, and L is the read length.
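The Poisson parameter for l-tuple counts can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and the probability mass is computed in log space to avoid overflow at high coverage.

```python
import math


def ltuple_coverage(n_reads, read_len, tuple_len, genome_len):
    """Expected count of a unique l-tuple: N(L - l + 1) / (G - L + 1)."""
    return n_reads * (read_len - tuple_len + 1) / (genome_len - read_len + 1)


def poisson_pmf(m, lam):
    """P(m) = lam^m * exp(-lam) / m!, evaluated in log space (requires lam > 0)."""
    return math.exp(m * math.log(lam) - lam - math.lgamma(m + 1))


# Values in the range of the simulations described in this paper:
# 10,000 reads of 250 bp, l = 10, and a 2000 bp region.
lam = ltuple_coverage(10_000, 250, 10, 2000)
print(lam)                    # about 1376
print(10_000 * 250 / 2000)    # the cruder N*L/G value, 1250
print(poisson_pmf(round(lam), lam))
```

For a region this short relative to the read length, the exact expression and NL/G differ noticeably; the approximation tightens as G grows relative to L.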

In a viral population, at any given time each haplotype occurs with a different relative frequency. In Figure 1, the frequency spectrum illustrates the distribution of l-tuples within a quasispecies being modeled as a mixture of Poissons for a population containing two haplotypes that appear with a coverage of x_A and x_B (proportional to their relative frequencies in the population). The distribution of l-tuples that span the portion of the genomes that is common to both haplotypes will be Poisson with mean x_A+x_B, and those that span the suspect regions unique to A and B will be Poisson with means x_A and x_B, respectively. The problem of identifying different suspect regions is thus transformed into that of modeling a mixture of Poisson distributions.
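The two-haplotype case can be made concrete with a toy, error-free example. In this sketch the "reads" are simply whole copies of two hypothetical 8 bp haplotypes mixed 1:3, so the counts are exact rather than Poisson distributed; the function names are illustrative.

```python
from collections import Counter


def count_ltuples(reads, l):
    """Count every l-tuple (substring of length l) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def frequency_spectrum(counts):
    """Map each observed count to the number of distinct l-tuples having it."""
    return Counter(counts.values())


hap_a = "ACGTTGCA"   # hypothetical haplotype A, sampled once
hap_b = "ACGTAGCA"   # hypothetical haplotype B, differs at one site, sampled 3 times
reads = [hap_a] + [hap_b] * 3

counts = count_ltuples(reads, 3)
print(counts["ACG"])   # shared by both haplotypes: x_A + x_B = 4
print(counts["GTT"])   # spans the site unique to A: x_A = 1
print(counts["GTA"])   # spans the site unique to B: x_B = 3
print(frequency_spectrum(counts))
```

The spectrum has three peaks, at 1, 3, and 4, matching the means x_A, x_B, and x_A+x_B described above.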

FIG. 1.

Virus population with two haplotypes in frequencies x_A and x_B. Frequency spectrum of l-tuple counts in the population. The figure depicts two genomes of length G and a shaded region that corresponds to the base positions on which the two genomes differ (i.e., the suspect region).

The frequencies $\{x_i\}_{i=1}^K$ of K haplotypes in the viral population are called basic frequencies. A polymorphic site at a given position can occur in one or more of K haplotypes. Polymorphic sites for which exactly one haplotype in the population has a derived nucleotide will be most common, followed by those for which exactly two haplotypes in the sample have a derived nucleotide, and so on (Johnson and Slatkin, 2006). The distribution of l-tuples that span the suspect region with polymorphic sites that co-occur in r haplotypes, say $(x_{j_1}, x_{j_2}, \ldots, x_{j_r})$, will be Poisson with mean $y_i = \sum_{k=1}^{r} x_{j_k}$. The maximum number of Poisson distributions that can be obtained is 2^K−1, though one usually does not observe all of them in a given population.
Let us assume that we obtain M different Poisson distributions with means $\mathbf{y} = \{y_i\}_{i=1}^M$ in the population. We call these the composite frequencies. Our immediate goal is to determine the mean values of the different Poisson distributions that correspond to the composite frequencies and cluster the l-tuples using them. We note that a similar method was used by Li and Waterman (2003) for estimating the repeat content within a genome. The proposed method was recently adapted by Wu and Ye (2010) to classify reads within a metagenome.
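To see where the 2^K−1 bound comes from, the sketch below enumerates every non-empty subset of a set of basic frequencies and records its sum. The function name and the example frequencies are hypothetical, and integer frequencies are used so that sums compare exactly.

```python
from itertools import combinations


def composite_frequencies(basic):
    """Map each subset sum y to the list of haplotype index subsets producing it."""
    sums = {}
    for r in range(1, len(basic) + 1):
        for idx in combinations(range(len(basic)), r):
            y = sum(basic[i] for i in idx)
            sums.setdefault(y, []).append(idx)
    return sums


basic = [1, 3, 5]                 # hypothetical basic frequencies, K = 3
comp = composite_frequencies(basic)
print(sorted(comp))               # up to 2**3 - 1 = 7 composite frequencies
print(comp[4])                    # haplotypes 0 and 1 share these sites
```

When two different subsets have the same sum, they collapse into a single mode, which is one reason fewer than 2^K−1 distributions may be observed.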

2.2. Cluster l-tuples using variable bandwidth mean shift analysis

The frequency spectrum of l-tuples is multimodal, with modes corresponding to the frequency of suspect regions unique to each subset of haplotypes. We propose to use the VBMS method to assign each l-tuple, based on its count, to a cluster that represents a mode of the Poisson distribution. The mean shift procedure was originally described in Comaniciu et al. (2001, 2002) and adapted in Zhao et al. (2010) to detect and remove sequencing errors prior to assembly. Mean shift analysis is a robust nonparametric method that estimates the modes of the distribution. This technique is attractive because it needs no prior knowledge of the number of clusters and is not affected by the variance in the data due to the presence of sequencing biases.

Given the set of reads R sequenced from the viral population, the algorithm starts by counting l-tuples in it (Algorithm 1). Let $\{c_i\}_{i=1}^L$ denote the counts of the different l-tuples in R, L being the total number of possible l-tuples. The multivariate kernel density estimate (KDE) with kernel K(x) for l-tuple k is defined as:

$$\tilde{f}(c_k) = \frac{a}{L} \sum_{i=1}^{L} \frac{1}{h_i} K\left( \left\| \frac{c_k - c_i}{h_i} \right\|^2 \right) \tag{2}$$

where a is the normalization constant and h_i is the kernel bandwidth that determines the range of influence of the kernel located at l-tuple i. Here, $\{c_i\}_{i=1}^L$ represents a random sample from an unknown density f. The kernel, K(x), is taken to be a spherically symmetric, non-negative function centered at zero and integrating to one. The adaptive bandwidth procedure estimates the density at each point c_i by taking the average of differently scaled kernels centered at each of the data points. For multivariate kernels, the optimum kernel yielding minimum mean-integrated square error is the 1-d Epanechnikov kernel, with its profile defined as

$$K_E(x) = \begin{cases} 1 - x & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

While using the adaptive KDE $\tilde{f}(c_k)$ for clustering, our objective is to assign each data point to a cluster, based on the mode that point evolves to under a gradient ascent algorithm. The gradient of the kernel density estimate is:

$$\nabla \tilde{f}(x) = \frac{a}{L} \sum_{i=1}^{L} \frac{1}{h_i} \nabla K_E\left( \left\| \frac{x - c_i}{h_i} \right\|^2 \right) \tag{4}$$

We define $g(x) = -K_E^{\prime}(x)$. The mean shift vector is defined as an estimate of the normalized gradient of the underlying distribution and is obtained from Equation (4):

$$m(x) = \frac{\sum_{i=1}^{L} c_i \, g\left( \left\| \frac{x - c_i}{h_i} \right\|^2 \right)}{\sum_{i=1}^{L} g\left( \left\| \frac{x - c_i}{h_i} \right\|^2 \right)} - x \tag{5}$$

This process converges to a point at which the estimate has zero gradient, that is, a mode of the density (Comaniciu et al., 2002). Our method clusters l-tuples based on their corresponding mode at convergence. The obtained modes will correspond to the composite frequencies, $\mathbf{y} = \{y_i\}_{i=1}^M$. We represent the set of l-tuples clustered with mode y_i by C_yi. Therefore, each cluster C_yi contains l-tuples spanning the polymorphic sites that co-occur in a subset of haplotypes for which ∑_k x_k = y_i.

Algorithm 1

VBMS Analysis

Input: $\{c_i\}_{i=1}^L$

Output: Modes in the spectrum, $\mathbf{y} = \{y_i\}_{i=1}^M$

1. Compute the fixed bandwidth h₀ using the 1-dimensional plug-in rule (Turlach, 1993).

2. Calculate the initial KDE $\tilde{f}(c_k)$ of each l-tuple k using the fixed bandwidth h₀ (Silverman, 1986).

3. For each l-tuple c_i, compute its adaptive bandwidth h_i from the initial KDE.

4. For each l-tuple k, initialize the window position x with c_k, the count of l-tuple k to be clustered, and repeat until convergence:

(a) Compute the mean shift vector m(x) using Equation (5).

(b) Translate the density estimation window: x ← x + m(x).

5. l-Tuples that converge to the same mode y_i form cluster C_yi.
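The steps of Algorithm 1 can be sketched for one-dimensional count data as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the fixed bandwidth h0 is passed in rather than derived from the plug-in rule, the adaptive bandwidth uses the rule of Comaniciu et al. (2001), h_i = h0 · sqrt(λ / f(c_i)) with λ the geometric mean of the pilot density (assumed here, since the formula is not reproduced in the text), and the mean shift update follows Equation (5) as written. All function names are hypothetical.

```python
import math


def g_profile(u):
    """g(u) = -K_E'(u) for the Epanechnikov profile: 1 on [0, 1], else 0."""
    return 1.0 if 0.0 <= u <= 1.0 else 0.0


def pilot_density(counts, h0):
    """Fixed-bandwidth KDE at every data point (normalization constant omitted)."""
    n = len(counts)
    return [
        sum(max(0.0, 1.0 - ((ck - ci) / h0) ** 2) for ci in counts) / (n * h0)
        for ck in counts
    ]


def adaptive_bandwidths(counts, h0):
    """ASSUMED rule: h_i = h0 * sqrt(lam / f(c_i)), lam = geometric mean of f."""
    dens = pilot_density(counts, h0)  # each value is > 0 (a point covers itself)
    lam = math.exp(sum(math.log(d) for d in dens) / len(dens))
    return [h0 * math.sqrt(lam / d) for d in dens]


def mean_shift_modes(counts, h0, tol=1e-6, max_iter=500):
    """Move each count along the mean shift vector of Equation (5) to its mode."""
    h = adaptive_bandwidths(counts, h0)
    modes = []
    for ck in counts:
        x = float(ck)
        for _ in range(max_iter):
            w = [g_profile(((x - ci) / hi) ** 2) for ci, hi in zip(counts, h)]
            denom = sum(w)
            if denom == 0.0:  # no kernel covers x; stop rather than divide by 0
                break
            shift = sum(ci * wi for ci, wi in zip(counts, w)) / denom - x
            x += shift        # translate the density estimation window
            if abs(shift) < tol:
                break
        modes.append(x)
    return modes


def cluster_by_mode(modes, merge_tol=0.5):
    """Group points whose converged modes lie within merge_tol of a center."""
    centers, labels = [], []
    for m in modes:
        for j, c in enumerate(centers):
            if abs(m - c) <= merge_tol:
                labels.append(j)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


counts = [9, 10, 10, 11, 29, 30, 30, 31]   # toy l-tuple counts with two modes
centers, labels = cluster_by_mode(mean_shift_modes(counts, h0=3.0))
print(centers, labels)
```

The loop above costs time quadratic in the number of l-tuples; a practical version would likely run the iteration once per distinct count value.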

2.3. Greedy heuristic for the generating set

Once we determine the composite frequencies $\mathbf{y} = \{y_i\}_{i=1}^M$, our next task is to find the basic frequencies $\mathbf{x} = \{x_i\}_{i=1}^K$, such that every y_i is the sum of a subset of x and the size of x is minimal. Such a set x is known as a generating set of y. Determining the generating set is an NP-complete problem (Collins et al., 2007). Using the greedy heuristic of Algorithm 2, an approximate solution can be obtained in polynomial time. Most of the algorithms for the generating set in the literature are heuristic or approximative. The work by Collins et al. provides some nontrivial lower bounds given certain constraints.

We propose a greedy heuristic for constructing the generating set x of y. We do not allow x and y to be multisets. If some x is repeated, then without loss of generality, we can replace x by 2x in the set. Let P_yi be the representation of y_i (i.e., the subset of x such that $y_i = \sum_{x_k \in P_{y_i}} x_k$). Let D be the set of all possible differences in y. The main idea is that while constructing x, at each step, we choose the least y_i that does not already have a representation in x. The condition $y_i \notin \mathrm{mode}(D)$ ensures that we do not delete some x_r such that x_p+x_q = x_r for x_p, x_q, $x_r \in \mathbf{x}$.

Here, x corresponds to the basic frequencies. Cluster C_yi contains l-tuples spanning the polymorphic sites that co-occur in haplotypes $x_k \in P_{y_i}$. We can now cluster the l-tuples by the basic frequencies of the genomes from which they originate. Initialize K bins, $\{B_k = \emptyset\}_{k=1}^K$. For each haplotype k, consider all y_i for which $x_k \in P_{y_i}$ and bin together the l-tuples from such C_yi into B_k. That is, for each composite frequency y_i and its representation P_yi, set $B_k = B_k \cup C_{y_i}$ for every k such that $x_k \in P_{y_i}$. Ultimately, each B_k will consist of all l-tuples corresponding to haplotype k.
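The binning step amounts to a union of clusters per haplotype. A minimal sketch follows, assuming the clusters C_yi and representations P_yi are available as dictionaries keyed by composite frequency; this data layout, the function name, and the toy values are assumptions made for illustration.

```python
def bin_ltuples(clusters, representations, n_haplotypes):
    """Build B_k as the union of every cluster C_y whose representation P_y contains k.

    clusters:        dict mapping composite frequency y -> set of l-tuples (C_y)
    representations: dict mapping composite frequency y -> set of haplotype indices (P_y)
    """
    bins = [set() for _ in range(n_haplotypes)]
    for y, ltuples in clusters.items():
        for k in representations.get(y, ()):
            bins[k] |= ltuples
    return bins


# Toy input: two haplotypes with basic frequencies 1 and 3.
clusters = {
    1: {"GTT", "TTG", "TGC"},   # unique to haplotype 0
    3: {"GTA", "TAG", "AGC"},   # unique to haplotype 1
    4: {"ACG", "CGT", "GCA"},   # common to both (1 + 3)
}
representations = {1: {0}, 3: {1}, 4: {0, 1}}

bins = bin_ltuples(clusters, representations, n_haplotypes=2)
print(sorted(bins[0]))
print(sorted(bins[1]))
```

Each bin ends up holding both the l-tuples unique to its haplotype and those shared with other haplotypes.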

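Algorithm 2

Generating Set

Input: Composite frequencies $\mathbf{y} = \{y_i\}_{i=1}^M$

Output: Generating set x and the representation P_u of each composite frequency u

z ← y, x ← ∅

D = {|y_i − y_j|}_∀i,j

m ← min(z), x ← x ∪ {m} & z ← z − {m}

while z ≠ ∅ do

    m ← min(z), x ← x ∪ {m} & z ← z − {m}

    for all subsets S ⊆ x do

        u ← sum of the elements of S

        if u ∈ z & u ≠ mode(D) then

            P_u ← S (representation of u)

            z ← z − {u}

        end if

    end for

end while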
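A direct Python transcription of Algorithm 2 might look like the sketch below. It is illustrative only: it takes u to be the sum of the subset S, treats mode(D) as the set of most frequent pairwise differences (ties included), and relies on exact equality, so the example uses integer frequencies. With real-valued modes a tolerance would be needed. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def generating_set(y):
    """Greedy heuristic: returns (x, rep), where rep[u] is the subset of x summing to u."""
    z = sorted(set(y))                      # multisets are not allowed
    if not z:
        return [], {}
    diffs = Counter(abs(a - b) for a, b in combinations(z, 2))
    top = max(diffs.values()) if diffs else 0
    modes = {d for d, c in diffs.items() if c == top}   # mode(D), ties included

    x, rep = [], {}

    def take_min():
        m = z.pop(0)                        # z is kept sorted, so z[0] is min(z)
        x.append(m)
        rep[m] = (m,)

    take_min()
    while z:
        take_min()
        for r in range(1, len(x) + 1):      # all non-empty subsets S of x
            for s in combinations(x, r):
                u = sum(s)
                if u in z and u not in modes:
                    rep[u] = s              # S is the representation of u
                    z.remove(u)
    return x, rep


# Composite frequencies generated by the hypothetical basic frequencies 1, 3, 5.
x, rep = generating_set([1, 3, 4, 5, 6, 8, 9])
print(x)        # [1, 3, 5]
print(rep[9])   # (1, 3, 5)
```

Because the inner loop enumerates subsets of x, the cost grows exponentially with the size of the generating set; this stays manageable only while the number of haplotypes is small.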

3. Results

We evaluated our method on simulated and real deep sequencing data from HIV samples. All of our simulations were based on a 2000 bp region at the 5′ end of the HIV-1 genome. We used Metasim's population sampler to simulate heterogeneous HIV samples of different diversities, evolved from a single parent genome (Richter et al., 2008). Reads were generated from these populations by mixing the haplotypes in various proportions and coverage depths using Metasim, which replicates the error process of 454/Roche sequencing.

We require the l-tuple size to be large enough to avoid repetition within the genome, yet small enough that a large number of reads contain the l-tuple. We select a lower bound on l from 1/p^l > G, where p is the probability that the most frequent nucleotide will appear at a given position and G is the genome size. When G is unknown, we approximate l using 1/p^l > NL, where N is the number of reads and L is the average length of reads (Li and Waterman, 2003). Experimentally, an l-tuple length of 10 bp gives us optimal results. A default read length of 250 bp and error rate of 0.5% were used for the simulations. Note that in our method, even a single base difference, if it appears consistently in all copies, will lead to its identification as a new haplotype.
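The lower bound on l is easy to evaluate numerically. The sketch below finds the smallest integer l satisfying 1/p^l > S, where S is either the genome size G or, when G is unknown, the total sequenced length NL (assuming the same form of bound applies in both cases). The function name and the value p = 0.35 are illustrative assumptions.

```python
def min_tuple_length(size, p=0.25):
    """Smallest integer l such that 1 / p**l > size, for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    l = 1
    while (1.0 / p) ** l <= size:
        l += 1
    return l


print(min_tuple_length(2000, p=0.25))           # 6, since 4**6 = 4096 > 2000
print(min_tuple_length(2000, p=0.35))           # 8 for a more skewed composition
print(min_tuple_length(10_000 * 250, p=0.25))   # 11 when bounding by N*L instead of G
```

These are lower bounds only; the value used in practice also has to stay small enough for many reads to contain each l-tuple.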

Our goal is to identify mutations that co-occur in each subset of haplotypes and the frequency with which they appear in the quasispecies. We report precision and recall averaged over different suspect regions (i.e., over the l-tuples that vary between the haplotypes and that do not include the large number of l-tuples common to all). Our method estimates cluster C_yi to contain l-tuples that span the polymorphic sites that co-occur in haplotypes $x_k \in P_{y_i}$. The counts of these l-tuples converge to mode $y_i = \sum_{x_k \in P_{y_i}} x_k$. Subsequently, the subset of haplotypes that contribute to this mode is recovered as P_yi. Let T_yi denote the true cluster assignment of such l-tuples.
$$\begin{aligned} \text{Recall} &= \frac{|\, \text{l-tuples in } T_{y_i} \cap \text{l-tuples in } C_{y_i} \,|}{|\, \text{l-tuples in } T_{y_i} \,|} \\ \text{Precision} &= \frac{|\, \text{l-tuples in } T_{y_i} \cap \text{l-tuples in } C_{y_i} \,|}{|\, \text{l-tuples in } C_{y_i} \,|} \\ \text{F-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} \tag{6}$$
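For a single cluster, these scores reduce to set operations on l-tuples. The sketch below uses the standard harmonic-mean form of the F-measure, 2PR/(P+R); the function name and the toy sets are hypothetical.

```python
def cluster_scores(true_set, predicted_set):
    """Precision, recall, and F-measure of a predicted cluster against the truth."""
    overlap = len(true_set & predicted_set)
    recall = overlap / len(true_set) if true_set else 0.0
    precision = overlap / len(predicted_set) if predicted_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


t = {"AAA", "AAC", "ACA", "CAA"}           # true l-tuples of a suspect region
c = {"AAC", "ACA", "CAA", "GGG", "TTT"}    # l-tuples placed in the cluster
precision, recall, f_measure = cluster_scores(t, c)
print(precision, recall, f_measure)        # 0.6, 0.75, about 0.667
```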

Our method derives its strength from high and uniform coverage of the data. However, sequencing biases due to both statistical (e.g., CG bias) and biological effects (e.g., mappability bias) can skew the coverage by causing certain regions of the genome to be unevenly sampled. Elimination of these biases is imperative for avoiding spurious conclusions regarding the data. In the presence of a reference genome, the adverse effects of such sequencing biases can be mitigated by normalizing the nucleotide bias in the data (Schwartz et al., 2011).

Error rates with the Roche GS20 system have been estimated at 5–10 errors per kbp (Huse et al., 2007; Wang et al., 2007). We simulated datasets with sequencing errors varying from 3–40 errors per kbp, using a 454/Roche sequencing error model. Sequencing errors produce an excess of l-tuples that appear only once in the population, as opposed to polymorphic sites that appear several times. The true distribution of l-tuple counts is expected to be a mixture of an exponential distribution, for erroneous l-tuples, and a series of Poisson distributions describing the true l-tuple counts (Zhao et al., 2010). We combine two existing techniques to handle errors. Prior to clustering, our method uses spectral alignment to perform error correction (Pevzner, 2001). Subsequently, we configure our method to discard l-tuples that appear below a cutoff threshold defined by the first minimum in the frequency spectrum. More tailored approaches for error correction have been discussed elsewhere (Li et al., 2008; Zagordi et al., 2010a; Zhao et al., 2010; Yang et al., 2012). We observe from Table 1 that the outlined error correction scheme resulted in a significant performance gain. Thus, Mutant-Bin is fairly robust to error rates of up to 2%. However, when the error rate exceeds 2.5%, the accuracy drops, as more than 22% of the l-tuples are expected to be contaminated.

Table 1.

Effect of Error Correction Using Spectral Alignment with Thresholding on Reduction in the Number of Erroneous l-Tuples

	F-measure (% Erroneous l-tuples)
Error rate (%)	No error correction	Thresholding with spectral alignment
0.4	98.6 (2.8)	99.7 (0.04)
0.8	84.1 (4.4)	99.3 (0.1)
1.4	27.8 (6.6)	98.9 (0.3)
1.9	34.9 (11)	98.1 (0.4)
2.6	28.3 (14.6)	97.4 (0.9)
2.9	28 (17)	49.6 (2.2)
3.4	27 (18.8)	32.4 (4.7)

Parameters: Simulated datasets with two haplotypes in mixing proportion of 1:3, at diversities of 4–6%, containing 10,000 reads. If we consider an error rate of 0.5% per bp, then approximately 4.9% of the l-tuples and 71% of the reads are expected to be contaminated with at least one sequencing error.
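The contamination percentages quoted here follow from a simple calculation if errors are assumed to be independent and uniform across bases (an idealization of the 454 error process). The function name is hypothetical.

```python
def contaminated_fraction(error_rate, length):
    """Probability that a sequence of the given length has at least one error,
    assuming independent per-base errors at a uniform rate."""
    return 1.0 - (1.0 - error_rate) ** length


print(contaminated_fraction(0.005, 10))    # l-tuples of 10 bp: about 4.9%
print(contaminated_fraction(0.005, 250))   # reads of 250 bp: about 71%
print(contaminated_fraction(0.025, 10))    # at a 2.5% error rate: about 22%
```

The third value corresponds to the regime in which the accuracy reported in Table 1 begins to fall off.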

Next, we assess the sensitivity of the method to different diversities, depth of coverage, and number of haplotypes. Figure 2 shows uniformly high recall values (above 0.85) at all diversities. Precision, on the other hand, is strongly correlated with diversity. The main problem at low diversities is that sequencing errors can masquerade as polymorphic sites, resulting in low precision. The evaluation as a function of depth of coverage indicates that, given uniform coverage, our method is robust to the overall depth of coverage in the population (Fig. 3). The overall accuracy for populations with different numbers of haplotypes is shown in Figure 4. For five or more haplotypes, we obtain high precision at the cost of low recall. Even when the cardinality of the generating set obtained using the greedy heuristic is greater than the optimal, the true basic frequencies were correctly identified and formed a subset of the generating set.

FIG. 2.

Precision (A) and recall (B). The x-axes show diversities varying from 0.1% to 10%. Parameters: Simulated datasets with two haplotypes in a mixing proportion of 1:3, containing 10,000 reads.

FIG. 3.

F-measure with varying coverage. Parameters: Simulated datasets with two haplotypes in mixing proportion of 1:3 at diversities of 4–6%.

FIG. 4.

F-measure as a function of the number of haplotypes in the population. Number of reads = 20,000. Haplotypes were mixed in the ratios 1:3, 1:3:5, 1:3:5:10, and 1:3:5:10:12.

We assessed the ability of the algorithm to recover the true haplotype frequencies using the greedy heuristic. We consider populations consisting of 2 to 5 haplotypes. Figures 5 and 6 compare the estimated haplotype frequencies of Mutant-Bin and the state-of-the-art method ShoRAH (Zagordi et al., 2010a) on the same datasets. ShoRAH aligns the reads to a reference genome to extract a minimal subset of haplotypes that explain the observed reads, while Mutant-Bin finds the minimal set of basic frequencies that explain the observed frequencies of l-tuples and clusters the l-tuples of haplotypes with similar frequency together, without any alignment. The mixing proportions predicted by our method are in agreement with the true mixing proportions and are comparable to those of ShoRAH. Note that our method, being frequency-based, will conflate haplotypes with identical frequencies. For instance, for a population containing four haplotypes in the ratio 1:3:3:10, our method bins the l-tuples into three clusters, with predicted frequencies in the ratio 1:3:10. Our method predicted the right number of haplotypes in 88.1% of the cases. The remaining cases can be explained by the number of outliers in datasets containing haplotypes with similar frequencies. The sum of the frequencies of the K most frequent haplotypes predicted was more than 90% in 88.5% of the cases, where K is the actual number of haplotypes in the population. For ShoRAH, by contrast, the sum of the top K frequencies accounted for 90% of the data in 98.93% of the cases.
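The two ideas in this paragraph, conflation of equal frequencies and the top-K frequency mass used as an evaluation criterion, can be sketched as follows. This is a simplified illustration only: it operates on known mixing ratios and on a list of predicted frequencies, whereas the method itself clusters observed l-tuple counts. The function names `distinct_frequency_bins` and `top_k_mass` are hypothetical.

```python
def distinct_frequency_bins(ratio):
    """Collapse a mixing ratio to the distinct values a purely
    frequency-based binner could tell apart."""
    return sorted(set(ratio))


def top_k_mass(predicted, k):
    """Fraction of the total predicted frequency carried by the k most
    frequent predicted haplotypes."""
    total = sum(predicted)
    top = sorted(predicted, reverse=True)[:k]
    return sum(top) / total


# Four haplotypes in the ratio 1:3:3:10 yield only three resolvable bins.
bins = distinct_frequency_bins([1, 3, 3, 10])
print(bins)

# Hypothetical prediction with one spurious low-frequency cluster;
# K = 3 is the true number of haplotypes in this made-up example.
mass = top_k_mass([0.5, 0.3, 0.15, 0.05], k=3)
print(mass > 0.9)
```

In the evaluation described above, a dataset counts as a success under the second criterion when the top-K mass exceeds 90%.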

FIG. 5.

Haplotype frequencies estimated by our method (left, rectangular box plot) and ShoRAH (right, circular box plot) when sampling from the mixing ratios (solid black lines) indicated beneath each panel, for diversities of 0.1–10%. All parameters of ShoRAH were set to default values, and the algorithm was run for 5,000 iterations.

FIG. 6.

(Top) Snapshot of a 600-bp stretch of three genomes in a sample (with the crosses representing the derived nucleotides). The lowest genome is the designated reference. (Bottom) Frequency spectrum of l-tuple counts along the length of the genome for a viral population containing three haplotypes in the mixing ratio of 1:3:5.

Our final evaluation, on real deep-sequencing HIV data obtained from the 454/Roche FLX pyrosequencing platform, allows for a hard assessment of the performance (Zagordi et al., 2010a). Of the four datasets, two are of subtype A and two of subtype B. The 454 reads from the two subtypes, which are at a diversity of 10.9%, were taken in the proportion of 1:4. After error correction by spectral alignment, our method obtained an average F-measure of 87.3% and accurately predicted the haplotype frequencies (results for ShoRAH in Zagordi et al., 2010a). In addition, we generated reads in silico with 454 sequencing errors from the 1.5-kbp region of the HIV pol gene that has been sequenced from cloned sequences (Zagordi et al., 2010b). We applied our method to reads sampled from four clonal sequences that are at a diversity of 7.8%, mixed in proportions of 1:3:5:10. Our method achieved an F-measure of 94.2% and identified the frequencies of the three most frequent haplotypes correctly.

4. Conclusions

In this article, we have proposed an unsupervised method for quantifying the genetic diversity of heterogeneous populations in datasets for which no reference genome is available. Phylogenetic and evolutionary studies that do not require knowledge of the sequence itself, but only the number of variable sites and the nucleotides at these sites, can benefit from such a method. For viruses with variable regions, alignment of reads to the consensus is problematic. Hence, there is much added value in methods that avoid alignment. Our method determines the population structure and haplotype frequencies without the use of a reference genome and hence avoids the compute-intensive alignment required by other methods. Note that our method does not reconstruct the haplotypes; instead, it identifies suspect regions within the population. Deep sequencing technologies can produce much shorter reads of about 36 bp at a higher coverage. Our method is especially suitable for such short reads, as long as the length of the reads exceeds the length of the l-tuple.

In its current implementation, our model has several limitations. The method performs well on datasets obtained at uniform coverage. However, in real sequencing projects, the frequency spectrum will be obscured by errors stemming from the sequencing process and by nonuniform coverage across the genomes. Our method relies on the frequencies of the haplotypes in the population being distinct, which cannot always be guaranteed. Determining exact values of composite frequencies becomes difficult as the number of haplotypes increases and the coverage decreases. However, with better mode detection algorithms and the high depth provided by NGS technologies, it should be possible to obtain a good separation between the distributions and improve the scalability of the method with the number of haplotypes. Currently, the output of our method is a set of clusters of l-tuples, each corresponding to a haplotype in the population. We intend to extend our method to enable assembling the l-tuples in each cluster into a genome of the corresponding haplotype. Given how mutations occur within a population, it is possible to predict the shape of the frequency spectrum in theory (Johnson and Slatkin, 2006). Using our method, we can work backward to infer the population parameters and the evolutionary tree, given the shape of the spectrum. Such analyses play an important part in evolutionary and population genetics.

We foresee the application of this method in the context of cancer and bacterial communities, which are also characterized by increased genetic diversity. The method can also be used for calling SNPs without a reference genome and may find application in conservation efforts for nonmodel organisms and endangered species (Ratan et al., 2010). An SNP commonly has only two haplotypes in the population, and hence determining the composite frequencies will be much easier.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

Acknowledgments

We are grateful to Osvaldo Zagordi and Niko Beerenwinkel for providing the HIV deep sequencing data.

Footnotes

a. The frequencies of the haplotypes in the population can be approximated by their coverages. Hence, we use the terms interchangeably.

References

Astrovskaya, Tork, Mangul, et al. 2011. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC Bioinformatics, 12:S1.

Beerenwinkel, Gunthard H.F., Roth, et al. 2012. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front Microbiol, 3:329.

Beerli, Grassly, Kuhner, et al. 2002. Population genetics of HIV: parameter estimation using genealogy-based methods, 217–252. In Rodrigo A.G., Learn G.H. Jr., eds. Computational and Evolutionary Analysis of HIV Molecular Sequences. Springer: New York.

Collins M.J., Kempe, Saia, et al. 2007. Nonnegative integral subset representations of integer sets. Inf. Process. Lett., 101:129–133.

Comaniciu, Meer. 2002. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603–619.

Comaniciu, Ramesh, Meer. 2001. The variable bandwidth mean shift and data-driven scale selection. Proc. 8th Intl. Conf. on Computer Vision, 438–445.

Eriksson, Pachter, Mitsuya, et al. 2008. Viral population estimation using pyrosequencing. PLoS Comput. Biol., 4:e1000074.

Gaschen, Taylor, Yusim, et al. 2002. Diversity considerations in HIV-1 vaccine selection. Science, 296:2354–2360.

Hoffmann, Minkah, Leipzig, et al. 2007. DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res., 35:91.

Huse, Huber, Morrison, et al. 2007. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol., 8:R143.

Johnson P.L., Slatkin. 2006. Inference of population genetic parameters in metagenomics: A clean look at messy data. Genome Research, 16:1320–1327.

Jojic, Hertz, Jojic. 2008. Population sequencing using short reads: HIV as a case study. Proc. Pac. Symp. Biocomput, 114–125.

Li, Ruan, Durbin. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18:1851–1858.

Li, Waterman M.S. 2003. Estimating the repeat structure and length of DNA sequences using l-tuples. Genome Research, 13:1916–1922.

Pevzner P.A. 2001. A new approach to fragment assembly in DNA sequencing. RECOMB, 256–267.

Pop. 2009. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10:354–366.

Port, Sun, Martin, et al. 1995. Genomic mapping by end characterized random clones: A mathematical analysis. Genomics, 26:84–100.

Prosperi, Prosperi, Bruselles, et al. 2011. Combinatorial analysis and algorithms for quasispecies reconstruction using next-generation sequencing. BMC Bioinformatics, 12:5.

Ratan, Zhang, Hayes, et al. 2010. Calling SNPs without a reference sequence. BMC Bioinformatics, 11:130.

Richter D.C., Ott, Auch A.F., et al. 2008. MetaSim: a sequencing simulator for genomics and metagenomics. PLoS ONE, 3:3373.

Schwartz, Oren, Ast. 2011. Detection and removal of biases in the analysis of next-generation sequencing reads. PLoS ONE, 6:e16685.

Shankarappa, Margolick J.B., Gange S.J., et al. 1999. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol., 73:10489–10502.

Silverman B.W. 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall.

Turlach B.A. 1993. Bandwidth selection in kernel density estimation: a review. In CORE and Institut de Statistique.

Wang, Mitsuya, Gharizadeh, et al. 2007. Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res., 17:1195–1201.

Wu Y.-W., Ye. 2010. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. RECOMB, 6044:535–549.

Yang, Chockalingam S.P., Aluru. 2012. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics, 14:56–66.

Zagordi, Geyrhofer, Roth, et al. 2010a. Deep sequencing of a genetically heterogeneous sample: local haplotype reconstruction and read error correction. J. Comput. Biol., 17:417–428.

Zagordi, Klein, Dumer, et al. 2010b. Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies. Nucleic Acids Res., 38:7400–7409.

Zhao, Palmer L.E., Bolanos, et al. 2010. EDAR: an efficient error detection and removal algorithm for next generation sequencing data. J. Comput. Biol., 17:1549–1560.