Abstract
Abstract
The recent proliferation of next generation sequencing with short reads has enabled many new experimental opportunities but, at the same time, has raised formidable computational challenges in genome assembly. One of the key advances that has led to an improvement in contig lengths has been mate pairs, which facilitate the assembly of repeating regions. Mate pairs have been algorithmically incorporated into most next generation assemblers as various heuristic post-processing steps to correct the assembly graph or to link contigs into scaffolds. Such methods have allowed the identification of longer contigs than would be possible with single reads; however, they can still fail to resolve complex repeats. Thus, improved methods for incorporating mate pairs will have a strong effect on contig length in the future. Here, we introduce the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates mate pair information into the graph structure itself instead of analyzing mate pairs at a post-processing step. This graph has the potential to be used in place of the de Bruijn graph in any de Bruijn graph based assembler, maintaining all other assembly steps such as error-correction and repeat resolution. Through assembly results on simulated perfect data, we argue that this can effectively improve the contig sizes in assembly.
1. Introduction
The first generation of assemblers followed the overlap-layout-consensus paradigm, where overlaps were heuristically used to join reads together into contigs (Myers, 1995; Batzoglou et al., 2002). Later, the introduction of de Bruijn graphs led to significant improvements in assembly (Pevzner, 1989; Idury and Waterman, 1995; Pevzner et al., 2001). In contrast to the overlap-layout-consensus approach, these assemblers first constructed a graph where the original genome is spelled by a series of walks through the graph, and non-branching walks correspond to substrings (contigs) of the genome. Compared to the earlier heuristic approaches, de Bruijn graphs produced longer contigs and gave rise to more powerful techniques for correcting errors and resolving repeats—identical, or nearly identical, stretches of DNA (Schatz et al., 2010). Their success led to the development of other types of graphs for sequence assembly: A-Bruijn graphs (Pevzner et al., 2004) and closely related string graphs (Myers, 2005), which together have become an essential part of most modern assembly tools, including EULER-SR (Chaisson and Pevzner, 2008), Velvet (Zerbino and Birney, 2008), ALLPATHS (Butler et al., 2008), ABySS (Simpson et al., 2009), and others.
Despite these advances, the challenge of resolving repeats remains. When the length of a repeat is longer than twice the read length, it becomes difficult to correctly match its upstream and downstream regions. In order to alleviate this problem, sequencing technologies were extended to produce mate pairs (Weber and Myers, 1997)—pairs of reads between which the genomic distance (called the insert size) is well estimated. Because insert sizes could be much longer than the read length, mate pairs were able to span long repeats and could potentially match up the regions surrounding a repeat.
The challenge of algorithmically incorporating mate pair information into de Bruijn graph assemblers was first addressed by Pevzner and Tang (2001), who proposed a heuristic to look for a path between the two reads of a mate pair with a length of the insert size. If exactly one such path was found, then a mate pair transformation could be applied to “unwind” this path in the graph. Essentially, this amounted to transforming two mated reads into one long read where the gap between the mates was filled in with the nucleotide sequence representing the found path, thus potentially connecting the surrounding regions of a repeat. Several other heuristic approaches for utilizing mate pair information in the de Bruijn graph were developed (Zerbino and Birney, 2008; Butler et al., 2008; Medvedev and Brudno, 2008).
Such methods had a great impact on genome assembly, allowing the construction of much longer contigs; however, they could still fail in complex repeat-rich regions, where there are multiple paths between the read pairs. Many current technologies, including Complete Genomics (Drmanac et al., 2010) and Helicos (Harris et al., 2008), still generate very short reads (around 25 nt) for which the resulting de Bruijn graph is very tangled (even for bacterial genomes). In such cases, mate pair transformations often fail because of multiple paths. Additionally, the percentage of mate pairs that can be successfully transformed deteriorates when the insert size is high (Chaisson et al., 2009), and the search for paths between mates becomes prohibitively time-consuming. Unfortunately, these difficulties result in shorter contigs in complex repeat-rich regions. The limitations of the existing heuristics for analyzing mate pairs is thus a major hurdle towards assembling large contigs with short reads.
We believe that the shortcomings of current mate pair algorithms stem from the fact that they are heuristic approaches that are applied after the construction of the de Bruijn graph. The de Bruijn graph does an excellent job of incorporating the sequence information from the single reads; however, it ignores any mate pair information that is available. This information has to be recovered after the graph construction, and only then applied in a heuristic manner. In this article, we propose the paired de Bruijn graph, a generalization of the de Bruijn graph that incorporates the mate pair information into the structure of the graph itself, as opposed to a post-processing step. Just as moving from the heuristic overlap-layout-consensus paradigm to the de Bruijn graph paradigm resulted in better assemblies, we believe that moving from heuristic mate pair algorithms to paired de Bruijn graphs could result in a more effective use of mate pair information. The paired de Bruijn graph is a potential replacement of the de Bruijn graph in existing de Bruijn graph based assemblers; existing assembly stages, including error correction and scaffolding, would not need to be substantially modified.
Through assembly results on simulated perfect data, we argue that when mate pair information is used in this manner, the read length (once above a small threshold) becomes much less relevant (Chaisson et al., 2009). We find that the contig sizes in an assembly are largely dictated by the average insert size—when it exceeds 6000 nt, we can assemble all of E. coli into one contig and most of the human chromosome 22 into 15 contigs. Though this paper falls short of analyzing real data, we believe that, similar to how early error-free studies of de Bruijn graphs laid the foundation for their use in assembly (Pevzner, 1989), the paired de Bruijn graph can become the basis of practical assemblers.
2. From de Bruijn Graphs to Paired de Bruijn Graphs
2.1. Preliminaries
To simplify the presentation, we assume that the genome is a circular string (i.e., one circular, single-stranded chromosome) and that all reads have the same length l; extending our approach for multiple linear chromosomes or varying read length is straightforward. Moreover, we assume that reads are error-free (see Section 5 for a discussion). In this setting, a mate pair is an ordered pair of strings of length l drawn from the genome at positions i and j, respectively. Normally, the relative distance between reads is expressed in terms of the insert size, the number of nucleotides from the first nucleotide of a to the last nucleotide of b: j − i + l. However, for the purposes of our construction, it is more convenient to express it in terms of d = j − i, the difference in their leftmost coordinates. Note that d is the insert size minus one read length (Fig. 1a).

Mate pairs and the de Bruijn graph.
As with any de Bruijn graph based approach, our algorithms have a parameter k that dictates the size of the substrings into which the reads are chopped up. Thus, though our input is a set of mate pairs of reads of any length, we immediately chop them up into smaller pieces. Formally, each mate pair of reads is replaced by its constituent l − k (sub-)mate pairs, where the reads of each (sub-)mate pair now have length k + 1. Therefore, for the remainder of the paper, we will assume without loss of generality that the reads are immediately given with length k + 1. We now give some definitions.
Below we describe three A-Bruijn graphs: de Bruijn graphs (for unpaired reads); paired de Bruijn graphs (for mate pairs with an exact distance); and approximate paired de Bruijn graphs (for mate pairs with an approximate distance).
2.2. De Bruijn graphs (modeling unpaired reads)
Let C be a set of (k + 1)-mers from a circular string S. We construct an A-Bruijn graph based on C as follows.
• First, we define an initial graph G0 consisting of m = 2|C| vertices and |C| isolated edges. For each (k + 1)-mer • Second, we glue certain vertices of G0 together, by forming an m × m binary matrix A and setting Ai,j = 1 to indicate that vertices i and j should be glued together. For this construction, we set Ai,j = 1 when vertices i and j have the same label.
The labeled directed graph G = DB(C, k) obtained from these gluings is the de Bruijn graph of C (Pevzner et al., 2004) (Fig. 1b–d). It may be considered as either a simple graph (without parallel edges but with loops), or as a multigraph where the multiplicity of each edge is determined by the number of times the (k + 1)-mer it represents is present in C. Consider a walk through edge/label sequence
Traditionally, the de Bruijn graph is also defined on a string S by setting the vertex set equal to the k-spectrum of S. For every (k + 1)-mer a of S, define an edge prefix(a) → suffix(a) labeled by a. Explicitly, for each Sk+1(i) of S, define an edge Sk(i) → Sk(i + 1) labeled by Sk+1(i) (for
In the case that C is the (k + 1)-spectrum of S, the de Bruijn graph built on C using the gluing approach is identical to the one built directly on the genome S. Moreover, there is a covering cycle that spells S, where a covering cycle is a cyclical walk that visits every edge at least once. In this graph, the cycle is the sequence of edges
2.3. Graph complexity
The usefulness of a graph representation of a genome can vary widely. In general, the number of vertices can serve as a rough indicator of how useful the graph is—as the number of vertices grows (and the number of edges stays the same), the graph is likely to become less entangled, and the contigs are likely to become longer. Figure 2a shows that in the de Bruijn graph, the number of repeated k-mers in E. coli drops as k increases, indicating that the de Bruijn graph has more vertices and likely becomes less entangled. Alternatively, consider pairs of k-mers, i.e., (k, d)-mers. Figure 2b shows that, after fixing k = 50, the number of repeated (k, d)-mers drops as d increases. This is not surprising due to the repeat structure of genomes—the bigger the d, the less common it is to have pairs of repeats spaced a distance of d apart. Figure 2a,b illustrates alternatives for improving contig lengths: increasing the read length (pursued by companies such as Pacific Biosciences) versus increasing the insert size, as advocated by Chaisson et al. (2009). While the increase in the read length remains a difficult technological challenge, increasing the insert size (up to tens of thousands of nucleotides) is already within the power of current technologies. Thus, if we could build a graph whose vertices represent (k, d)-mers instead of k-mers, then the length of the contigs is likely to increase as the insert size grows. This is the basic motivation for the paired de Bruijn graph, and, as we will show in Section 3, the contig lengths in the paired de Bruijn graph do in fact increase with d.

The effect of increasing k and d.
2.4. Paired de Bruijn graphs (modeling paired reads with exact distance)
We now define a graph modeling mate pairs in the special case that all pairs are exactly the same distance d apart. This is an idealized case unachievable with current sequencing technologies, but the next section will generalize the construction to varying distances. Given a set of (k + 1, d)-mers C (modeling mate pairs), construct an A-Bruijn graph as follows:
• Define an initial graph G0 on m = 2|C| vertices. For each bilabel • Glue vertices of G0 together when they have the same label. The graph G so obtained is called the paired de Bruijn graph of C.
This procedure is illustrated in Figure 3a,b, and Figure 4 gives another example of the graph. An alternate construction of the paired de Bruijn graph is to define the vertex set as the (k, d)-mers present in C, and the edges as connecting prefix(a|b) to suffix(a|b) for every element of C.

The (approximate) paired de Bruijn graph:

Example of the standard and paired de Bruijn graphs: The reads are the (5,12)-spectrum generated from the cyclic sequence ATCGGGATGACTATGTCGCTCCTAATCGGGAAGACTATGCCGCTCCTT.
As with the regular de Bruijn graph, in this construction, every vertex of G inherits the label common to all the vertices of G0 that were glued together to form it, and this label is unique to that vertex. Any walk through the graph on edge sequence
The (k, d)-spectrum of a string S is
Just as with the de Bruijn graph, this is a key property that makes the paired graph useful for spelling contigs.
2.5. Approximate paired de Bruijn graphs (modeling inexact distance)
We now define a graph modeling mate pairs where the distance between the two reads in each pair is only known to lie within some range d ± Δ. The parameter Δ can be estimated based on the mate pair generation protocol.
Let C be an arbitrary set of (k + 1, d, Δ)-mers, representing the input data. The key insight is that if two (k, d, Δ)-mers (a|b) and (a|b′) both arise from the same instance of a in S, then in the de Bruijn graph of S, there is a directed path from b to b′, or vice-versa, with distance at most 2Δ. This insight was used for repeat resolution in Medvedev and Brudno (2008), albeit as a post-construction modification step. We construct an A-Bruijn graph from C as follows:
• The initial graph G0 consists of |C| isolated edges on 2|C| vertices. For each • For each k-mer α, glue together all vertices with labels (α|β),(α|β′) if there exists a directed path from β to β′ (or vice-versa) in the de Bruijn graph D = DB(C, k) of length at most 2Δ. Here, we assume that the construction of D implicitly breaks the (k + 1)-mer bilabels of C into independent (k + 1)-mers.
The graph G = APDB(C, k, d, Δ) so obtained is the approximate paired de Bruijn Graph of C (Fig. 3c–e). The effect of this gluing is to merge all vertices (k-mer bilabels) that might align to the same position in the genome; vertices that align to the same position are thus guaranteed to be merged. However, the converse does not hold; vertices aligning to different positions in the genome are sometimes merged, either due to repeats that are not resolved by the given parameters, or due to chance short paths in D.
In the case that k > 2Δ, we observed that if there is a directed path between β and β′ in the de Bruijn graph D of length at most 2Δ, then β and β′ should share an overlap of at least k − 2Δ characters. This observation leads to an alternate rule to glue vertices of G0: for each k-mer α, glue together all vertices with labels (α|β),(α|β′) if β and β′ share an overlap of at least k − 2Δ characters. Note that this rule can only be used if k > 2Δ and may lead to a different graph; however, it is easier to implement.
Unlike our earlier constructions of the de Bruijn and paired de Bruijn graphs, the vertices of G do not inherit a single label from G0; the vertices glued together have the same left label, but may have different right labels. In an edge walk
A set C of (k + 1)-mer bilabels is a covering spectrum of S if for every position
Theorem 1
Let S be a circular string, and C a set of (k, d, Δ)-mers that is a covering spectrum of S. Then there is a covering cycle through the graph G = APDB(C, k, d, Δ) that spells out S.
Proof
For
3. Results
We implemented a prototype assembly algorithm to test the effectiveness of the (approximate) paired de Bruijn graph approach under the ideal conditions of perfect coverage and error-free reads. We experimented with E. coli (4.6 Mbp) and Human chromosome 22 (35 Mbp after removal of ambigious bases). The reads were generated with perfect coverage, meaning for every position in the genome we generated a single (k, d, Δ)-mer aligning to it. The insert size was picked uniformly at random from the specified range. We report as contigs the (left) words spelled by all maximal walks of the graph whose interior vertices have just one out-neighbor. We validated that any generated contigs mapped perfectly back to the original genome—this was the case for all the contigs.
Constructing the de Bruijn graph and finding all its non-branching paths takes time O(n log n), where n is the number of k-mers. The construction of the approximate paired de Bruijn graph has an additional cost of searching all neighbors within a distance 2Δ of each node. Therefore, the running time of the algorithms is O(n log n + n min{2Δ, n}), where n is the number of (k, d, Δ)-mers. However, since de Bruijn graphs are sparse, the searches in the graph are usually very fast, and in practice, even the run on chr22 with Δ = 200 took less than 2 hours on an 8-core processor with 16G RAM. Moreover, the algorithm could be easily distributed over a large cluster to deal with larger Δ.
Our motivation for the paired de Bruijn graph approach was that the number of repeated (k, d)-mers quickly drops as d increases (Fig. 2b), and hence the contigs of the paired de Bruijn graph based on these (k, d)-mers could be longer. To test this hypothesis, we generated a set of mate pairs with varying insert sizes and plotted the length of the obtained contigs (Fig. 5a). To isolate the effect of the insert size, the coverage of the data was perfect (the (k, d)-spectrum), the insert sizes were perfect (Δ = 0), and the read length was fixed to 50. We observed that contig lengths improved dramatically as the insert size increased. With an insert size of 6000 nt, all of E. coli was covered with just one contig, while for chr22, an insert size of 5000 nt enabled us to cover 98% of the chromosome with the 15 largest contigs. We thus believe that properly using mate pairs has a strong potential to increase contig lengths.

Contig lengths. Cumulative contig lengths (for standard and paired de Bruijn graphs) on simulated data with perfect coverage. Contigs are sorted in order from largest to smallest. Point (x, y) means the largest x contigs have cumulative length y.
To explore the role that read length plays relative to the insert size, we generated sets of mate pairs with varying read lengths but with a fixed insert size (1000 nt). To isolate the effect of the read length, we had perfect coverage and no variation in the insert size. For E. coli, we found that, for an insert size of 1000 nt, once the read length grew over a small threshold of 10–20 nt, the contig lengths nearly reached the theoretical optimum that could be achieved by simply generating reads of length equal to the insert size (Fig. 5b). For Human, we needed to increase the read length to 300 nt in order to reach the optimum with 1000 nt insert size (Fig. 5b). However, for a longer insert size (5000 nt), a read length of 50 came close (Fig. 5a) to achieving the optimum (which, with 5000 nt reads, was a single contig). Therefore, by properly using mate pairs with large enough insert size, one can significantly reduce the limitations caused by short read length.
We measure the effect of increasing variability in the insert size (Δ) on the assembly. We fix the insert size to be 1000 nt and generate 50-long reads with perfect coverage, while varying Δ (Fig. 5c). We found that the assembly deteriorates with increasing Δ, especially for the Human genome. When Δ is large, the chance of two vertices of the de Bruijn graph being connected increases, and, hence, the number of vertices (bilabels) that do not align but nevertheless get glued together increases. In this situation, the read length is still important in determining the complexity of the (non-paired) de Bruijn graph. Some recent datasets achieve a small Δ, such as the Bentley et al. (2008) human dataset with a mean insert size of 208 nt and a standard deviation of 13 nt. Nevertheless, we see robustness with respect to Δ as an important direction for improving the practical usefulness of our method.
4. Towards a Practical Paired de Bruijn Graph Assembler
We believe that, similarly to early studies of idealized fragment assembly with error-free k-mers (Pevzner, 1989), the (approximate) paired de Bruijn graphs can be of use in practical assemblers that utilize paired reads. Though this paper falls short of analyzing real data, we present here potential ways to remove our simplifications, and to move from the current de Bruijn graph assemblers to (approximate) paired de Bruijn graphs.
5. Conclusion
In this article, we introduced the paired de Bruijn graph and motivated its use in genome assembly. Instead of incorporating mate pairs into a post-graph-construction step, we have used them to construct the graph itself. Any procedures that could be performed on the regular de Bruijn graph (e.g., error correction) can be performed in the same manner on the paired de Bruijn graph. For instance, even when there are repeats that the paired de Bruijn graph does not resolve, mate pair transformations can still be applied to the graph to help resolve the remaining repeats.
By formulating an alternative to mate pair transformations, the paired de Bruijn graph approach provides a potential method for assembly with short read mate pairs, like the ones generated by Complete Genomics (Drmanac et al., 2010) and Helicos (Harris et al., 2008). By not requiring unique paths between paired reads in the de Bruijn graph, the paired approach could still resolve repeats despite the short read length (Fig 4). Moreover, the algorithms we describe can be extended to the strobes generated by Pacific Biosciences, which extend the notion of the mate pair to a set of multiple (more than two) reads separated by some distances.
A future direction lies in the use of the right labels on edges of the approximate paired de Bruijn graph. Currently, we spell out each contig using only the left label. The positions of the right labels are only known approximately, but this is often sufficient to form a righthand word displaced approximately d from the lefthand word. Moreover, after encountering an edge (a|b) in a walk, we must encounter some edge (b|c) approximately d edges away (unless it is past the end of the walk). This compatibility requirement may help to narrow the choice of valid paths when encountering branching vertices, thereby resolving longer repeats and improving contig lengths.
Footnotes
Acknowledgments
G.T. and P.M. were supported in part by the NIH (grant 3P41RR024851-02S1).
Disclosure Statement
No competing financial interests exist.
