SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding

Abstract

Scaffolding is an important subproblem in de novo genome assembly, in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets.

1. Introduction

De novo genome assembly is a classical problem in bioinformatics, in which short DNA sequence reads are assembled into longer blocks of contiguous sequence (contigs), which are then arranged into linear chains of contigs separated by gaps (scaffolds). Previously, only single-end short reads were available from sequencing experiments. Modern genome sequencing technology allows reporting reads in pairs, commonly known as mate pairs or paired-end reads. The distance between the two reads of a pair plus the two read lengths (the insert length) approximately follows a normal distribution determined during the experimental construction of the library. Some genome projects also include mate-pair libraries with several different insert lengths. Although there are experimental differences between mate pairs and paired-end reads, we will refer to them interchangeably as mate pairs since we can treat them identically from an algorithmic point of view. Mate pairs are particularly important for de novo assembly since, in addition to building contigs, we can now hypothesize about neighbors of a contig whenever the reads of a pair fall on different contigs. This opens the possibility of scaffolding contigs.

Computational genome assembly is typically performed in at least two stages—the contig building stage and the scaffolding stage. In this article we do not address the contig building problem but rather assume that we have access to a set of contigs produced by an independent algorithm. However, we discuss the relationship of the contig building and scaffolding stages later in the discussion. The scaffolding problem tries to string contigs into a chain such that the order of the contigs in the scaffold reflects their real order in the genome. For the scaffolding problem, the most popular strategy is to construct the contig graph in which nodes represent contigs and edges represent sets of mate pairs connecting two contigs (i.e., the two reads of the mate pair fall in the two different contigs). The edges are given weights equal to the number of mate pairs connecting the two contigs.

We then try to find a walk in the graph such that the minimum number of mate pairs is violated. A mate pair is violated when the contigs it connects do not have the relative orientation or position suggested by the mate pair. Just finding the optimal orientation assignment is reducible to the Maximum Cut problem, which is known to be NP-complete (Garey, 1979). Consequently, finding the optimal walk to get the optimal scaffolding is also NP-complete. The genome can (and often does) have repeated regions; e.g., approximately 50% of the human genome is accounted for by repeats (Haubold and Wiehe, 2006). However, the contig builder is likely to report one contig per repeated region. This repetitive structure of the genome makes scaffolding harder, as it introduces loops and cycles in the contig graph. We also have false edges resulting from misassembly of reads into contigs. Unfortunately, the number of false edges is not negligible, so filtering them is a major preprocessing step.

A common procedure is to filter out unreliable edges by picking a small threshold (commonly 2–5) and removing all edges with weight less than that threshold. For the remaining edges, a majority vote is used to decide on the relative orientation and position of the contigs. This simple majority voting strategy is implemented in a number of commonly used assemblers and stand-alone scaffolders, including ARACHNE (Batzoglou et al., 2002), BAMBUS (Pop et al., 2004), SOPRA (Dayarian et al., 2010), and SOAPdenovo (Li et al., 2010), with various choices of threshold. Opera (Gao et al., 2011) and the Greedy Path-Merging algorithm (Huson et al., 2002) use a different strategy to bundle edges. Given a set of mate pairs connecting two contigs, these algorithms calculate the median and standard deviation of the insert lengths of the set of mate pairs and create a bundle using only mate pairs with insert lengths that are close to the median. ALLPATHS (Butler et al., 2008) and VELVET (Zerbino and Birney, 2008) do not build the contig graph and thus do not have a read-filtering step similar to the other assemblers mentioned. The majority voting procedure implicitly assumes that misleading mate pairs are random and independently generated and that majority voting should eliminate the problematic mate pairs. However, this assumption often does not hold because of the complex repeat structure of large genomes, such as the human genome.
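The threshold-plus-majority-vote procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used by any of the cited tools; the function name `majority_vote`, the vote labels, and the tie-handling rule (return no prediction) are assumptions made for this example.

```python
from collections import Counter

def majority_vote(votes, threshold):
    """Decide the label for one contig-graph edge.

    votes: one label per mate pair on the edge (e.g. "same" / "different").
    threshold: minimum edge weight (number of mate pairs) to keep the edge.
    Returns the winning label, or None if the edge is filtered out or tied.
    """
    if len(votes) < threshold:
        return None  # edge weight below the cutoff: discard the edge
    ranked = Counter(votes).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most common labels (assumed rule)
    return ranked[0][0]

kept = majority_vote(["same", "same", "different"], threshold=2)
dropped = majority_vote(["same"], threshold=2)
```

Here `kept` is decided by the two agreeing mate pairs, while `dropped` is `None` because the edge has fewer mate pairs than the threshold.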

In this article, we show that unreliable mate pairs can be reliably filtered using SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line. SLIQ therefore produces a reduced subset of reliable mate pairs and hence a sparser graph, which results in a simpler optimization problem for the scaffolding algorithm. More importantly, SLIQ can be used to predict the relative positions and orientations of the contigs, yielding a directed contig graph. Our experiments show that both SLIQ and majority voting are very accurate at predicting relative orientations, but SLIQ is clearly more accurate at predicting relative positions for complex genomes.

The simplicity of SLIQ makes it very easy to integrate as a preprocessing step into any existing scaffolder, including recent scaffolders such as MIP scaffolder (Salmela et al., 2011), Bambus 2 (Koren et al., 2011), and SSPACE (Boetzer et al., 2011). To illustrate the effectiveness of SLIQ, we implemented a naive scaffolding algorithm that produces linear scaffolds from the contig digraph. We show that despite its simplicity, our naive scaffolder very quickly provides accurate draft scaffolds that are comparable to or better than those of the more complicated state-of-the-art methods. These scaffolds can either be output directly or used as reasonable starting points for further refinement with more complex scaffolding algorithms. An implementation of the naive assembler using the inequalities is available online.

2. Algorithms

We begin with a high level outline of our algorithm for constructing a directed contig graph (Algorithm 1). The crux of the algorithm is SLIQ, a set of simple linear inequalities that are used to filter mate pairs and predict the relative position and orientation of contigs. In subsequent sections, we will present proofs for the SLIQ inequalities and a detailed version of the digraph construction algorithm (Algorithm 2). Finally, we will present a simple scaffolding algorithm (Algorithm 3) that uses the contig digraph to construct draft scaffolds. Throughout the article, we will abbreviate mate-pair reads as MPR.

Algorithm 1.

Construct Contig Digraph (Outline)

Require: input: P = a set of MPRs that connect two contigs, C = a set of contigs

1: Construct the contig graph G with vertex set C and edges representing MPRs from P that pass a certain majority cutoff.

2: Find a good orientation assignment Θ = {Θ_1, Θ_2, …}, where Θ_i is the orientation of the ith contig, for example, by finding a spanning tree of G.

3: Define M_p to be the set of MPRs that satisfies the SLIQ inequalities

4: Construct a directed contig graph G_d with vertex set C and edges representing MPRs from M_p that pass certain criteria.
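Step 2 of the outline above can be sketched as a breadth-first traversal that propagates orientations along a spanning tree of the contig graph. This is only an illustration under stated assumptions: the function name `orient_by_spanning_tree`, the input format (each edge labeled `True` if its mate pairs suggest the same orientation), and the +1/−1 encoding are hypothetical, and edges outside the spanning tree are not checked for conflicts.

```python
from collections import deque

def orient_by_spanning_tree(contigs, edges):
    """Assign +1/-1 orientations by walking a BFS spanning tree.

    contigs: iterable of contig identifiers.
    edges: {(a, b): same_orientation} where same_orientation is a bool.
    Returns {contig: +1 or -1}; each connected component is rooted at +1.
    """
    adjacency = {c: [] for c in contigs}
    for (a, b), same in edges.items():
        adjacency[a].append((b, same))
        adjacency[b].append((a, same))

    orientation = {}
    for root in adjacency:
        if root in orientation:
            continue
        orientation[root] = +1  # arbitrary choice for the component root
        queue = deque([root])
        while queue:
            current = queue.popleft()
            for neighbor, same in adjacency[current]:
                if neighbor not in orientation:  # tree edge: propagate
                    sign = 1 if same else -1
                    orientation[neighbor] = sign * orientation[current]
                    queue.append(neighbor)
    return orientation

theta = orient_by_spanning_tree(
    ["a", "b", "c"], {("a", "b"): True, ("b", "c"): False}
)
```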

2.1. Definitions and assumptions

For the sake of deriving the SLIQ inequalities, we assume that we know the position of the contigs on the reference genome. However, this information cancels out later on, which allows us to analyze the MPRs without access to prior contig position information. For the derivation, we also assume that all the contigs have the same orientation. Later, we will not need this information.

Let P_i be the position of contig C_i in the genome, and l_i be the length of the contig (Fig. 1). We define the gap g_ij to be the difference between the start position of contig C_j and the end position of contig C_i, and similarly for g_ji:

\begin{align*}
\begin{split}
g_{ij} &= P_j - P_i - l_i, \\
g_{ji} &= P_i - P_j - l_j.
\end{split} \tag{1}
\end{align*}

FIG. 1.

The geometry of two contigs, C_i and C_j, arranged on a line with relevant quantities indicated. Here, L is the insert length, P_i is the start position of contig C_i, l_i is the length of the contig C_i, o_i is the offset of the read of the mate-pair read (MPR) that falls on C_i, R is the read length. g_ij = P_j−P_i−l_i. The quantities for C_j are defined similarly.

FIG. 2.

Plot of Equation (5) showing the dependence of the quantity g_ij−g_ji on the relative positions of the contigs.

We assume that the maximum overlap of two contigs is one read length, R. In practical contig-building software based on De Bruijn graphs, the maximum overlap is usually one k-mer where R > k, so our assumption is valid.

2.2. Derivation of two gap equations

If we assume that P_i < P_j as in Fig. 1, and that the maximum overlap between two contigs is R (i.e., the minimum gap g_ij is −R), then

\begin{align*}
P_j - P_i - l_i &\geq -R, \\
P_j - P_i &\geq l_i - R. \tag{2}
\end{align*}

Now consider the quantity g_ij−g_ji. Using Equations (1) and (2), we can derive the following inequality, which we call Gap Equation 1:

\begin{align*}
g_{ij} - g_{ji} &= 2 ( P_j - P_i ) + ( l_j - l_i ) \\
&\geq 2l_i - 2R + l_j - l_i \\
&= l_i + l_j - 2R. \tag{3}
\end{align*}

Therefore, we have shown that (P_i < P_j) ⇒ (g_ij−g_ji ≥ l_i+l_j−2R). Next consider the quantity g_ij+g_ji. Adding the two expressions in Equation (1), we can easily derive Gap Equation 2:

\begin{align*}
g_{ij} + g_{ji} = - ( l_j + l_i ). \tag{4}
\end{align*}
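As a small numeric illustration of the two gap equations, the sketch below evaluates the gap definitions from Equation (1) for one made-up layout. The helper name `gaps` and all coordinates are hypothetical values chosen for this example.

```python
def gaps(p_i, l_i, p_j, l_j):
    """Return (g_ij, g_ji) as defined in Equation (1)."""
    g_ij = p_j - p_i - l_i  # start of C_j minus end of C_i
    g_ji = p_i - p_j - l_j  # start of C_i minus end of C_j
    return g_ij, g_ji

# Hypothetical layout: C_i starts at 100 (length 500), C_j at 700 (length 300).
read_len = 50
g_ij, g_ji = gaps(p_i=100, l_i=500, p_j=700, l_j=300)

# Gap Equation 1 (valid here because P_i < P_j and the gap is at least -R).
gap_eq_1_holds = g_ij - g_ji >= 500 + 300 - 2 * read_len
# Gap Equation 2 holds for any layout.
gap_eq_2_holds = g_ij + g_ji == -(500 + 300)
```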

Now, we will prove the other direction of the implication in Gap Equation 1, and show that (g_ij−g_ji ≥ l_i+l_j−2R) ⇒ (P_i < P_j). Using Gap Equation 1 and Equation (1), we get

\begin{align*}
\begin{split}
g_{ij} - g_{ji} &\geq l_i + l_j - 2R, \\
2 ( P_j - P_i ) + ( l_j - l_i ) &\geq l_i + l_j - 2R, \\
2 ( P_j - P_i ) &\geq 2l_i - 2R, \\
P_j - P_i &\geq l_i - R.
\end{split} \tag{5}
\end{align*}

No contig length can be less than R, the length of a read. In practice, contigs of length R are not very reliable. Our experiments show that such contigs almost always fail to align to the reference. We suggest that scaffolders enforce a minimum contig length greater than R. We make the assumption l_i−R > 0, which gives us P_j−P_i > 0, or P_i < P_j. Therefore, (g_ij−g_ji ≥ l_i+l_j−2R) ⇒ (P_i < P_j), and together we have proven

\begin{align*}
( g_{ij} - g_{ji} \geq l_i + l_j - 2R ) \Leftrightarrow ( P_i < P_j ). \tag{6}
\end{align*}
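The equivalence above can be spot-checked by brute force on a small integer grid. The sketch below is only a sanity check under the stated assumptions (contigs longer than R, overlap of at most R, and P_i ≠ P_j); the function name `equivalence_holds` and the grid bounds are arbitrary choices for this example.

```python
from itertools import product

def equivalence_holds(read_len=3, max_len=8, max_pos=20):
    """Check (g_ij - g_ji >= l_i + l_j - 2R) <=> (P_i < P_j) on a grid."""
    p_i = 0  # only the difference P_j - P_i matters, so fix P_i
    lengths = range(read_len + 1, max_len + 1)  # contigs longer than R
    for p_j, l_i, l_j in product(range(-max_pos, max_pos + 1), lengths, lengths):
        if p_j == p_i:
            continue  # the derivation neglects P_i = P_j
        g_ij = p_j - p_i - l_i
        g_ji = p_i - p_j - l_j
        # The gap after the left-most contig must be at least -R.
        left_gap = g_ij if p_i < p_j else g_ji
        if left_gap < -read_len:
            continue  # overlap larger than R: excluded by assumption
        inequality = g_ij - g_ji >= l_i + l_j - 2 * read_len
        if inequality != (p_i < p_j):
            return False
    return True

all_cases_agree = equivalence_holds()
```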

2.3. Using the gap equations to predict relative positions

Our definitions in Equation (1) used the quantities P_i and P_j, which are not available in practice in de novo assembly. Thus, we need to define the gaps g_ij and g_ji in terms of quantities we know, such as the insert length L and the read offsets relative to the contigs, o_i and o_j. Note that the insert length for each MPR is an unknown constant, so treating it as a constant in the proof is justified. In practice, we use $L = \bar{L} + 2\sigma$, where $\bar{L}$ is the reported or computed mean and σ is the standard deviation of the insert length distribution.
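The working value of L described above can be computed from observed insert lengths, for instance as in the sketch below. The function name `insert_length_bound` and the sample values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption, since the text does not specify which estimator is meant.

```python
from statistics import mean, stdev

def insert_length_bound(insert_lengths):
    """Return mean + 2 * standard deviation of the observed insert lengths."""
    return mean(insert_lengths) + 2 * stdev(insert_lengths)

# Hypothetical library with mean 200 and sample standard deviation 10.
L = insert_length_bound([190, 200, 210])
```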

Let L be the insert length, o_i and o_j be the offsets of the start positions of the paired reads in C_i and C_j, respectively, and Θ_i and Θ_j be the orientations of C_i and C_j, respectively. To simplify the notation we abbreviate Θ_i = Θ_j as Θ_i=j and Θ_i ≠ Θ_j as Θ_i≠j. Then, if P_i < P_j and Θ_i=j (Fig. 3), we can redefine the gaps g_ij and g_ji without using the contig start positions P_i and P_j:

\begin{align*}
g_{ij} &= L - l_i + o_i - o_j - R, \\
g_{ji} &= -L - l_j + o_j + R - o_i. \tag{7}
\end{align*}

FIG. 3.

The geometry of two contigs arranged on a line in terms of quantities known in de novo assembly.

Note that these definitions remain consistent with Gap Equation 2 [Equation (4)]. Taking the difference of the two expressions in Equation (7), we can similarly remove P_i and P_j from Gap Equation 1:

\begin{align*}
g_{ij} - g_{ji} = 2L - 2R + 2 ( o_i - o_j ) + ( l_j - l_i ). \tag{8}
\end{align*}

Using Equations (8) and (5), we derive the following inequality:

\begin{align*}
2L - 2R + 2 ( o_i - o_j ) + ( l_j - l_i ) &\geq l_i + l_j - 2R, \\
2L + 2 ( o_i - o_j ) + ( l_j - l_i ) &\geq l_i + l_j, \\
L + ( o_i - o_j ) &\geq l_i.
\end{align*}

Consequently, we obtain that (P_i < P_j) ∧ Θ_i=j ⇒ L+(o_i−o_j) ≥ l_i. Taking the contrapositive (and neglecting the case P_i = P_j) gives

\begin{align*}
\neg ( L + ( o_i - o_j ) \geq l_i ) &\rightarrow \neg ( ( P_i < P_j ) \wedge \Theta_{i = j} ), \\
L + ( o_i - o_j ) < l_i &\rightarrow ( P_i > P_j ) \vee \Theta_{i \neq j}.
\end{align*}

Now, without loss of generality, we can assume that Θ_i≠j is false. This is possible because our experiments later show that both the SLIQ and majority voting procedures are very accurate at predicting relative orientation (Table 2), so we can first determine the relative orientations of the contigs and flip the orientation of one contig if required. Thus we have

\begin{align*}
L + ( o_i - o_j ) < l_i \rightarrow ( P_i > P_j ). \tag{9}
\end{align*}

Table 2.

Summary of the Results of SLIQ vs. Majority Filtering for Contig Graph Edges of Five Real Datasets

Set ID	n	w_e	n_o	e_o	n_p	e_p	w_m	n_o′	e_o′	n_p′	e_p′
PSU	4454	2	2507	99.69%	3803	99.21%	4	3942	99.59%	3925	94.87%
PSY	2086	2	1628	98.40%	1852	95.62%	4	2019	98.56%	1990	98.59%
PST	2291	1	1233	75.18%	1516	87.33%	2	1365	97.87%	1336	16.54%
DS	8738	1	6305	92.18%	7097	80.55%	2	6390	91.87%	5861	77.25%
HS	36346	1	31799	79.56%	31153	89.71%	2	32676	79.14%	25750	75.62%

n, total number of edges connecting two different contigs; w_e, minimum weight of an edge for SLIQ prediction; n_o, the number of edges for which we can predict relative orientation; e_o, the accuracy of relative orientation prediction; n_p, the number of edges for which we can predict relative position; e_p, the accuracy of relative position prediction; w_m, minimum weight of an edge for majority prediction. The same notation, marked with a prime, is used for majority filtering.

In addition, we introduce two filters that are very useful in practice for removing unreliable MPRs. To derive the first filter, note that if P_j < P_i, then, using the minimum gap g_ji ≥ −R in the second line,

\begin{align*}
L &= l_j - o_j + g_{ji} + o_i + R \\
&\geq l_j - o_j - R + o_i + R, \\
o_j - o_i &\geq l_j - L, \\
o_i - o_j &\leq -l_j + L. \tag{10}
\end{align*}

The second filter is to discard an MPR if it passes the test for both P_i < P_j and P_j < P_i.
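The position test of Equation (9), its counterpart with i and j exchanged, and the second filter can be combined as in the following sketch. It assumes the two contigs have already been brought to the same orientation; the function name `predict_relative_position`, the returned labels, and the numeric inputs are hypothetical. The first filter from Equation (10) is deliberately left out to keep the example short.

```python
def predict_relative_position(insert_len, o_i, o_j, l_i, l_j):
    """Predict the order of two same-orientation contigs from one MPR.

    Returns "P_i > P_j", "P_i < P_j", or None when no prediction is made.
    """
    # Equation (9): L + (o_i - o_j) < l_i implies P_i > P_j.
    i_after_j = insert_len + (o_i - o_j) < l_i
    # Equation (9) with the roles of i and j exchanged implies P_i < P_j.
    j_after_i = insert_len + (o_j - o_i) < l_j
    if i_after_j and j_after_i:
        return None  # second filter: the MPR passes both tests, discard it
    if i_after_j:
        return "P_i > P_j"
    if j_after_i:
        return "P_i < P_j"
    return None  # neither test fires

# Read near the start of C_i and near the end of C_j: C_j comes first.
clear_case = predict_relative_position(80, o_i=10, o_j=400, l_i=500, l_j=450)
# Both reads in the middle of long contigs: both tests pass, no prediction.
ambiguous = predict_relative_position(80, o_i=250, o_j=250, l_i=500, l_j=500)
```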

2.4. Using the gap equations to predict relative orientations

So far, we have only predicted relative positions when Θ_i=j. Now we show that we can also use the gap equations to infer the relative orientations of the contigs. First, if (P_i < P_j) and the minimum gap is −R, then we have

\begin{align*}
g_{ij} = L - l_i + o_i - o_j - R \geq -R. \tag{11}
\end{align*}

Similarly, if (P_j < P_i), then we define $\bar{g}_{ji}$ and write

\begin{align*}
\bar{g}_{ji} = L - l_j + o_j - o_i - R \geq -R. \tag{12}
\end{align*}

Note that $\bar{g}_{ji}$ is different from g_ji, which we defined under the assumption P_i < P_j in Equation (7).

Since (P_i < P_j) and (P_j < P_i) are mutually exclusive and exhaustive (neglecting P_i = P_j), at least one of Equations (11) and (12) will be true. Note that both could also be true. For example, if P_i < P_j then g_ij ≥ −R. Now (P_j < P_i) must be false, but that does not imply that $\bar{g}_{ji} \geq -R$ is false. If both Equations (11) and (12) are true, then we can add them to get 2L ≥ l_i+l_j. To summarize,

\begin{align*}
\big( ( g_{ij} \geq -R ) \wedge ( \bar{g}_{ji} \geq -R ) \big) &\rightarrow 2L \geq l_i + l_j, \\
2L < l_i + l_j &\rightarrow \big( \neg ( g_{ij} \geq -R ) \vee \neg ( \bar{g}_{ji} \geq -R ) \big).
\end{align*}

Recalling again that at least one of Equations (11) and (12) is true, we see that 2L < l_i+l_j is a sufficient condition for mutual exclusion (the XOR relation is denoted by ⨁):

\begin{align*}
\Theta_{i = j} \wedge ( 2L < l_i + l_j ) &\rightarrow ( g_{ij} \geq -R ) \oplus ( \bar{g}_{ji} \geq -R ), \\
\neg \big( ( g_{ij} \geq -R ) \oplus ( \bar{g}_{ji} \geq -R ) \big) &\rightarrow \neg \big( \Theta_{i = j} \wedge ( 2L < l_i + l_j ) \big), \\
\neg \big( ( g_{ij} \geq -R ) \oplus ( \bar{g}_{ji} \geq -R ) \big) &\rightarrow \big( \Theta_{i \neq j} \vee ( 2L \geq l_i + l_j ) \big).
\end{align*}

If we use this equation only when the MPR and contigs satisfy the inequality 2L < l_i+l_j, we can then make the relative orientation prediction

\begin{align*}
\neg \big( ( g_{ij} \geq -R ) \oplus ( \bar{g}_{ji} \geq -R ) \big) \rightarrow \Theta_{i \neq j}. \tag{13}
\end{align*}

Intuitively, the condition 2L < l_i+l_j means that the contig lengths should be large relative to the insert length in order for the SLIQ method to work. To find contigs of the same orientation, we arbitrarily flip one contig and run the above tests again, only this time if Equation (13) holds, then we conclude that the contigs were actually of the same orientation. Say we flip C_i. We call the new offset $o_{\hat{i}}$. Then

\begin{align*}
\neg \big( ( g_{\hat{i}j} \geq -R ) \oplus ( \bar{g}_{j\hat{i}} \geq -R ) \big) \rightarrow \Theta_{\hat{i} \neq j} \rightarrow \Theta_{i = j}.
\end{align*}
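The orientation test of Equation (13) can be sketched as below. The function name `predicts_different_orientation` and the numeric inputs are hypothetical. The flipped-contig variant is not shown because the text does not spell out how the flipped offset is computed, so it is not guessed here.

```python
def predicts_different_orientation(insert_len, read_len, o_i, o_j, l_i, l_j):
    """Apply Equation (13) to one MPR.

    Returns True if the MPR implies the contigs have different orientations,
    False if the test is applicable but yields no conclusion, and None if the
    precondition 2L < l_i + l_j is not met.
    """
    if 2 * insert_len >= l_i + l_j:
        return None  # contigs too short relative to the insert length
    g_ij = insert_len - l_i + o_i - o_j - read_len      # Equation (11)
    g_bar_ji = insert_len - l_j + o_j - o_i - read_len  # Equation (12)
    first = g_ij >= -read_len
    second = g_bar_ji >= -read_len
    exactly_one = first != second  # XOR of the two conditions
    return not exactly_one         # Equation (13): not XOR implies different

# Reads in the middle of long contigs: neither gap condition can hold.
mid_reads = predicts_different_orientation(80, 20, 250, 250, 500, 500)
# Read near the end of C_i and the start of C_j: exactly one condition holds.
end_reads = predicts_different_orientation(80, 20, 450, 10, 500, 500)
# Insert length too large for these contigs: the test does not apply.
too_short = predicts_different_orientation(600, 20, 450, 10, 500, 500)
```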

Again, we introduce two additional filters that are very useful in practical applications. First, if we find an MPR that predicts both Θ_i≠j and Θ_i=j, then we leave it out of consideration. Second, if the SLIQ equations imply Θ_i≠j, then we require that both the reads of the MPR have the same mapping directions on the contigs and similarly for Θ_i=j.

We summarize our results in the following lemmas and Algorithm 2.

Algorithm 2.

Construct Contig Digraph

Require: input: M=a set of MPRs connecting contigs, C=a set of contigs, w=cutoff weight

1: Define E′={(C_i,C_j) : an MPR connects C_i and C_j}

2: Let wt(i,j)=(number of MPRs suggesting that C_i and C_j have the same orientation) - (number of MPRs suggesting that C_i and C_j have different orientations)

3: E={(C_i,C_j) : (i,j) ∈ E′ ∧ wt(i,j) ≥ w}

4: Construct a contig graph G with vertex set C and edge set E.

5: Find a good orientation assignment Θ = {Θ_1, Θ_2, …} for the contigs, for example, by finding a spanning tree of G.

6: Set M_p={}

7: for all p ∈ M do

8: Let C_i and C_j be the contigs connected by p.

9: if Θ_i=j then

10: if (L + (o_i−o_j)<l_i) AND (o_i−o_j<−l_i + L) then

11: predict P_i>P_j

12: M_p=M_p ∪ {p}

13: end if

14: if (L + (o_j−o_i)<l_j) AND (o_j−o_i<−l_j + L) then

15: predict P_i<P_j

16: M_p=M_p ∪ {p}

17: end if

18: end if

19: end for

20: Let E(i,j) be the set of MPRs from M_p that predict P_i < P_j and E(j,i) be the set of MPRs from M_p that predict P_j< P_i.

21: Define E_d={(C_i,C_j) : |E(i,j)| > |E(j,i)|}

22: Output a contig digraph G_d with vertex set C and edge set E_d.
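Lines 7 to 21 of the listing above can be sketched in Python as follows. The conditions are transcribed from lines 10 and 14 as printed. The function name `build_contig_digraph`, the tuple layout of a mate pair, and the `same_orientation` callback (standing in for the orientation assignment from line 5) are assumptions made for this example.

```python
from collections import defaultdict

def build_contig_digraph(mate_pairs, lengths, same_orientation, insert_len):
    """Return the directed edge set E_d as a set of (left, right) pairs.

    mate_pairs: iterable of (i, j, o_i, o_j) tuples.
    lengths: {contig: length}.
    same_orientation: callable (i, j) -> bool.
    """
    # votes[(a, b)] counts the MPRs that predict P_a < P_b.
    votes = defaultdict(int)
    for i, j, o_i, o_j in mate_pairs:
        if not same_orientation(i, j):
            continue  # line 9: only same-orientation pairs are tested
        l_i, l_j = lengths[i], lengths[j]
        # Line 10: predict P_i > P_j.
        if insert_len + (o_i - o_j) < l_i and o_i - o_j < -l_i + insert_len:
            votes[(j, i)] += 1
        # Line 14: predict P_i < P_j.
        if insert_len + (o_j - o_i) < l_j and o_j - o_i < -l_j + insert_len:
            votes[(i, j)] += 1
    # Line 21: keep the direction supported by strictly more MPRs.
    return {(a, b) for (a, b), n in votes.items() if n > votes.get((b, a), 0)}

pairs = [("A", "B", 10, 440), ("A", "B", 20, 445), ("A", "B", 490, 5)]
edges = build_contig_digraph(
    pairs, {"A": 500, "B": 450}, lambda i, j: True, insert_len=80
)
```

In this made-up input, two mate pairs predict that B precedes A and one predicts the reverse, so only the edge from B to A is kept.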

Lemma 1

If the maximum overlap between contigs is R and 2L < l_i+l_j, then

\begin{align*}
&\neg \big( ( g_{ij} \geq -R ) \oplus ( \bar{g}_{ji} \geq -R ) \big) \rightarrow \Theta_{i \neq j}, \\
&\neg \big( ( g_{\hat{i}j} \geq -R ) \oplus ( \bar{g}_{j\hat{i}} \geq -R ) \big) \rightarrow \Theta_{i = j}.
\end{align*}

Lemma 2

If the maximum overlap between contigs is R and the contigs have the same orientation (i.e., Θ_i=j), then

\begin{align*}
\big( L + ( o_i - o_j ) < l_i \big) \rightarrow ( P_i > P_j ).
\end{align*}

We also summarize the SLIQ inequalities:

\begin{align*}
g_{ij} - g_{ji} &\geq l_i + l_j - 2R, \\
g_{ij} + g_{ji} &= - ( l_j + l_i ), \\
( g_{ij} - g_{ji} \geq l_i + l_j - 2R ) &\Leftrightarrow ( P_i < P_j ), \\
g_{ij} - g_{ji} &= 2L - 2R + 2 ( o_i - o_j ) + ( l_j - l_i ).
\end{align*}

2.5. Illustrative cases and examples from real data

In this section, we present two illustrative cases that provide the intuition underlying the SLIQ equations. The ideal case for an MPR connecting two contigs is illustrated in Figure 1. In that case, the contigs are long compared to the insert length, and the reads are mapped to the ends of the contigs. However, this situation does not always occur. Suppose the contigs are short such that the two reads of an MPR fall exactly in the center of the contigs. Then, the right-hand side of Equation (8) reduces to 2L−2R. So for both cases, P_i < P_j and P_j < P_i, the right-hand side of Equation (8) has the same value, making it impossible to predict the relative positions of the two contigs. This situation is illustrated in Figure 4 on the left. It is easy to see that prediction becomes easier as the contigs get longer and the reads move away from the center of the contigs.
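To make the centered-read case explicit, the following short derivation substitutes mid-contig offsets into Equation (8). It assumes that "exactly in the center" can be modeled as o_i = l_i/2 and o_j = l_j/2, which is an idealization introduced here for illustration.

```latex
\begin{align*}
g_{ij} - g_{ji}
  &= 2L - 2R + 2\left( \tfrac{l_i}{2} - \tfrac{l_j}{2} \right) + ( l_j - l_i ) \\
  &= 2L - 2R + ( l_i - l_j ) + ( l_j - l_i ) \\
  &= 2L - 2R.
\end{align*}
```

The contig lengths cancel, so the value carries no information about which contig comes first. Exchanging i and j gives the same result.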

FIG. 4.

Illustrative cases in which both reads of the MPR fall in the center of the contigs (left) and the contigs have reversed positions (right).

Now assume that the working assumption is P_i < P_j but in reality, the reverse (P_j < P_i) is true. Then given that the contigs are long and reads map to the edges of the contigs, the insert length L would suggest the scenario depicted in Figure 4 (right side). This would make both g_ij and g_ji [as calculated from Equations (6) and (7)] smaller than they should be. In reality, the position of the contigs is similar to that shown in Figure 1, where we see that both g_ij and g_ji are larger than in Figure 4 (right side). These wrong values would then be too small to satisfy the left-hand side of Equation (5) and this would demonstrate that the working assumption of P_i < P_j is wrong.

It is also instructive to consider examples from real data. We show three cases from a real data set: one in which SLIQ made a correct prediction, one in which SLIQ made a wrong prediction, and one in which SLIQ did not make any prediction (Fig. 5). We explain precisely which inequalities are violated in the figure caption. The real examples show the difficulties of making SLIQ predictions when the reads fall close to the center of a contig or when the contig lengths are small relative to the insert size.

FIG. 5.

Three real examples of SLIQ predictions from the PSY dataset. For the correct prediction, the equation L+(o_i−o_j)< l_i evaluates to 3385<5043. In the wrong prediction, it should have satisfied L+(o_j−o_i)< l_j but one of the contigs is smaller than the insert length so it evaluates to 262<217 (false). However L+(o_i−o_j)<l_i evaluates to 498<863 so the wrong prediction is made. In the no-prediction case, the condition o_i−o_j <−l_j+L is violated. Even if that did not fail, since one of the offsets falls almost in the center of a contig, both the conditions L+(o_j−o_i)<l_j, (299<1384) and L+(o_i−o_j)<l_i, (461<506) are satisfied, and we would not give a prediction for this MPR. To simplify the calculations we used L=80.

2.6. Naive scaffolding algorithm

The contig digraph constructed in Algorithm 2 can be directly processed to build linear scaffolds. To illustrate this point, here we present a naive scaffolding algorithm (Algorithm 3).

Algorithm 3.

Naive Scaffolder

1: G(V,E) = Construct Contig Digraph (Algorithm 2)

2: Identify and remove junctions from G. Junctions are defined as articulation nodes with degree ≥ 3 that connect at least 3 subgraphs of G of size larger than some given threshold. The size of a subgraph is defined as the sum of all contig sizes in that subgraph.

3: Identify all simple cycles in G and remove the edge with the lowest weight from each simple cycle.

4: If G still contains strongly connected components, those components are removed. G is now a directed acyclic graph.

5: Output each weakly connected component of G as a separate scaffold.

6: The order of contigs in each scaffold is computed by taking the topological ordering of the nodes of their respective weakly connected component in G.
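Steps 3 to 6 of Algorithm 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the input format (a dictionary mapping directed edges to weights) and the function names are hypothetical, junction removal (step 2) is omitted, and instead of enumerating all simple cycles the sketch repeatedly finds one cycle by depth-first search and deletes its lowest-weight edge until the graph is acyclic.

```python
from collections import defaultdict, deque


def find_cycle(nodes, succ):
    """Return one directed cycle as a list of edges, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in nodes}
    for root in nodes:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        path = [root]                        # nodes on the current DFS path
        stack = [iter(sorted(succ[root]))]   # one successor iterator per path node
        while stack:
            for w in stack[-1]:
                if color[w] == GREY:         # back edge closes a cycle
                    cyc = path[path.index(w):] + [w]
                    return list(zip(cyc, cyc[1:]))
                if color[w] == WHITE:
                    color[w] = GREY
                    path.append(w)
                    stack.append(iter(sorted(succ[w])))
                    break
            else:                            # all successors explored
                color[path.pop()] = BLACK
                stack.pop()
    return None


def naive_scaffold(edges):
    """edges: {(u, v): weight}. Returns a list of scaffolds (ordered contig lists)."""
    edges = dict(edges)
    nodes = sorted({v for e in edges for v in e})
    # Break cycles: drop the lowest-weight edge of each cycle found.
    while True:
        succ = {v: set() for v in nodes}
        for u, v in edges:
            succ[u].add(v)
        cycle = find_cycle(nodes, succ)
        if cycle is None:
            break
        del edges[min(cycle, key=lambda e: edges[e])]
    # Topological order of the resulting DAG (Kahn's algorithm).
    indeg = {v: 0 for v in nodes}
    undirected = {v: set() for v in nodes}
    for u, v in edges:
        indeg[v] += 1
        undirected[u].add(v)
        undirected[v].add(u)
    order, queue = [], deque(v for v in nodes if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in sorted(succ[u]):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    # Label weakly connected components; each becomes one scaffold.
    comp = {}
    for v in nodes:
        if v in comp:
            continue
        comp[v] = v
        todo = [v]
        while todo:
            for y in undirected[todo.pop()]:
                if y not in comp:
                    comp[y] = v
                    todo.append(y)
    scaffolds = defaultdict(list)
    for v in order:
        scaffolds[comp[v]].append(v)
    return list(scaffolds.values())


example = {("A", "B"): 5, ("B", "C"): 4, ("C", "A"): 1, ("D", "E"): 2}
print(naive_scaffold(example))  # [['A', 'B', 'C'], ['D', 'E']]
```

In the example, the cycle A→B→C→A is broken at its weakest edge (C→A), leaving two weakly connected components that are each emitted in topological order.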

To analyze the computational complexity of the naive scaffolding algorithm, let N be the number of MPRs in the library. Constructing G takes O(N) time. Finding articulation points takes O(n+m) time, where n = |V| and m = |E| (Hopcroft and Tarjan, 1973). If there are a articulation nodes, then finding junctions takes O(an) time. Identifying and breaking simple cycles takes O((n+m)(c+1)) time, where c is the number of simple cycles (Johnson, 1975). Finally, topological sorting takes O(n+m) time. In total, the complexity of the naive scaffolding algorithm is O(N)+O(n+m)+O(an)+O((n+m)(c+1)) = O(N)+O(an)+O((n+m)(c+1)). In practical data sets, a and c are small constants and N ≫ n, m. Thus, for practical purposes the time complexity of the algorithm is O(N).

3. Experimental Results

To demonstrate the performance of our algorithms in practice, we ran them on five real data sets and two synthetic data sets. The data sets represent genomes ranging in size from small bacterial genomes (3 Mb) to large animal genomes (3.3 Gb) (see Table 1 for details). More importantly, they also vary in repetitiveness, from almost nonrepetitive bacteria to moderately repetitive Drosophila to highly repetitive human genomes.

Table 1.

Descriptive Statistics About the Datasets

Set ID	Organism	Size	Ref. genome	Read lib	R	cov	L	L_r	σ
PSU	P. suwonensis	3.42 Mb	CP002446.1	SRR097515	76	870x	300	188.78	18.77
PSY	P. syringae	6.10 Mb	NC_007005.1	(Farrer et al., 2009)	36	40x	350	384.11	67.13
SY-CE	C. elegans	100.26 Mb	NC_003279-85	SRR006878	35	38x	200	232.13	54.44
PST	P. stipitis	15.40 Mb	(Chapman et al., 2011)	(Chapman et al., 2011)	75	25x	3.2K	3.27K	241.50
DS	D. simulans	109.69 Mb	NT_167066.1-68.1, NT_167061.1, NC_011088.1-89.1, NC_005781.1	SRR121548, SRR121549	36	62x	N/A	187.99	61.47
SY-HS	H. sapiens	3.30 Gb	NCBI36/hg18	ERA015743	100	45x	300	310.63	20.74
HS	H. sapiens	3.30 Gb	NCBI36/hg19	ERA015743	100	45x	300	310.63	20.74

R, read length; cov, coverage; L, reported insert length; L_r, the real insert length calculated by mapping reads to the reference genome; σ, standard deviation of L_r.

For each data set, we obtained a publicly available mate-pair library. We used publicly available pre-built contigs for the Drosophila simulans (DS) and human (HS) (Gnerre et al., 2011) data sets. Pre-built contigs were not available for the three microbial data sets, P. suwonensis (PSU), P. syringae (PSY), and P. stipitis (PST), so we used the short read assembler VELVET (Zerbino and Birney, 2008) to construct contigs. All software parameters and sources for the data are provided in Table 4. For the two synthetic data sets, C. elegans (SY-CE) and human (SY-HS), we constructed contigs by mapping reads back to the reference genome and declaring high-coverage regions to be contigs. Thus, for these experiments, we have synthetic contigs but real reads. We will discuss the performance of the algorithms on the synthetic data sets at greater length in the Discussion. We mapped the reads to the contigs using the program Bowtie (v. 0.12.7) (Langmead et al., 2009). Below we only report results for the uniquely mapped reads because we know the ground truth for them.

Table 4.

Parameter Values Used in the Analysis of all Datasets

Data set	v	Contig construction	Contig mapping
PSU	2	(velvet) Hash length = 21, cov_cutoff = 5, min_contig_lgth = 150	(vmatch) Min match length l = 150, Hamming distance h = 0
PSY	0	(velvet) Hash length = 21, cov_cutoff = 5, min_contig_lgth = 150	(vmatch) Min match length l = 150, Hamming distance h = 0
PST	0	(velvet) Hash length = 35, cov_cutoff = auto, min_contig_lgth = 100	(vmatch) Min match length l = 200, Hamming distance h = 5
SY-CE	1	(synthetic) Cov cutoff = 5, min contig len = L	Available from synthetic construction
DS	2	Accession number AASR01000001-AASR01050477	(vmatch) Min match length l = 200, Hamming distance h = 5
SY-HS	2	(synthetic) Cov cutoff = 3, min contig len = 2R	Available from synthetic construction
HS	3	Accession number AEKP01000001:AEKP01231194	(vmatch) Min match length l = 300, Hamming distance h = 0

v is the number of mismatches allowed in read mapping (Bowtie v.0.12.7).

3.1. Comparison of SLIQ and majority voting predictions

On all the real data sets, SLIQ was highly accurate in predicting both relative orientation (>75%) and position (>80%) (Table 2). For orientation prediction, SLIQ and majority voting produced almost identical accuracies except for the case of P. stipitis (PST), where SLIQ had lower accuracy (75% vs. 97%). One possible reason might be that the PST library used long mate-pair reads, which may be more inaccurate than the other libraries we tested. Conversely, for PST, majority voting gave far worse accuracy (16.5%) than SLIQ (75%) in relative position prediction, confirming that this data set is an outlier.

Focusing only on the position predictions, SLIQ showed a significant advantage in both the number and accuracy of the predictions compared to majority voting for the more complex genomes — D. simulans and human (Fig. 6). Importantly, the improvement was particularly large for the human genome.

FIG. 6.

Comparison of the accuracy of SLIQ and majority voting for relative position prediction using the same data shown in Table 2.

Finally, Table 3 gives a more detailed comparison of cases in which the SLIQ and majority voting predictions disagreed. When the two methods disagreed, SLIQ clearly outperformed the majority voting procedure. For example, for human, when the methods disagreed, SLIQ was right in 1852 cases and majority voting in only 165 cases. SLIQ was also generally more accurate when considering only the predictions made uniquely by each method, except in one case (PSY).

Table 3.

Comparison of Position Predictions Between the SLIQ and Majority Voting Methods

Set ID	n_a	n_d	n_de	n_dm	n_e′	e_q	n_m′	e_m
PSU	3089	646	643	3	68	95.58%	190	90.52%
PSY	1519	287	235	52	46	86.95%	184	96.19%
PST	290	794	784	10	432	58.56%	252	25.00%
DS	2447	820	804	16	409	93.15%	2035	76.41%
HS	16425	2017	1852	165	12711	85.67%	7308	52.73%

n_a, the number of predictions where the methods agreed; n_d, the number of predictions where the methods disagreed; n_de, the number of predictions not in agreement where SLIQ was correct; n_dm, the number of predictions not in agreement where majority voting was correct; n_e′, the number of predictions made only by SLIQ; e_q, the accuracy of predictions made only by SLIQ; n_m′, the number of predictions made only by majority voting; e_m, the accuracy of predictions made only by majority voting.

3.2. Computing the optimal insert length

In our experiments, we found that using a value for L slightly larger than that reported or estimated (e.g., 20 bp larger for PSY) increased both n_p, the number of MPRs for which we could make a relative position prediction (by 49), and e_p, the accuracy of relative position prediction (by 2%). This may seem surprising at first given Equation (9). However, for n_p, it can be seen from Figure 1 that underestimating L would reduce g_ij, which would lead to more overlaps between contigs. Since we assume that the maximum contig overlap is R, underestimating L would remove many MPRs from the predictions. At the moment, we do not have an explanation for the observed increase in e_p, the prediction accuracy.

On the other hand, using a slightly smaller value for L increased n_o, the number of MPRs for which we could make a relative orientation prediction, while e_o, the prediction accuracy for orientation, remained constant. We suspect that a lower L makes Equations (11) and (12) harder to pass and thus fewer MPRs are excluded by the mutual exclusion test.

3.3. Computing the rank of MPRs

Our experimental results also agree with our illustrative cases (Section 2.5) in that the prediction accuracy decreases as 2(o_i−o_j) gets closer to (l_i−l_j), which intuitively means that the reads are falling closer to the center of the contigs. To address this issue, we can rank the MPRs by the minimum value of c for which they fail to pass the more stringent inequality |2(o_i−o_j)−(l_i−l_j)|>cR. We say that an MPR has rank c if and only if c is the smallest positive integer such that |2(o_i−o_j)−(l_i−l_j)|≤cR, and MPRs with higher rank are considered more confident with regard to their prediction. Figure 7 shows how the prediction accuracy depends on the rank of the MPRs in the PSY dataset.
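The rank definition translates directly into a few lines of code. The following is a minimal sketch, assuming integer offsets, contig lengths, and read length; the function name `mpr_rank` and its argument order are hypothetical, chosen only for illustration.

```python
def mpr_rank(o_i, o_j, l_i, l_j, R):
    """Smallest positive integer c with |2(o_i - o_j) - (l_i - l_j)| <= c * R."""
    x = abs(2 * (o_i - o_j) - (l_i - l_j))
    # Integer ceiling of x / R; an MPR with x == 0 still gets rank 1,
    # because c must be a positive integer.
    return max(1, (x + R - 1) // R)


# Reads near the contig ends: |2*80 - 0| = 160, and 160 <= 5*36 but 160 > 4*36.
print(mpr_rank(100, 20, 150, 150, 36))  # 5
# Both reads at the contig centers: the quantity is 0, the lowest rank.
print(mpr_rank(75, 75, 150, 150, 36))   # 1
```

An MPR whose reads sit at the contig centers receives the lowest rank, matching the intuition from the illustrative cases that such pairs are the least informative.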

FIG. 7.

Change in the prediction accuracy, e_p, as we restrict our analysis to MPRs of higher rank (c).

3.4. Effect of the number of Mate Pairs

More mate pairs connecting two contigs give greater confidence in scaffolding. However, we observed that this improvement is significant only up to a certain threshold (4–5 for majority voting and 2–3 for the SLIQ equations). Beyond that, the improvement in scaffolding correctness is not worth the reduction in the number of edges in the contig graph. For example, for the DS dataset, increasing the cutoff by 1 improves the position prediction by 3% but reduces the number of edges by 1520. This reduction also depends on the coverage of the read library. For the high-coverage PSU dataset, increasing the cutoff by 1 has almost no effect (a reduction of 50 edges). All of this assumes that the contigs are of reasonable quality. With misassembled or chimeric contigs, more mate pairs can create more loops and high-degree nodes in the contig graph, which are not removed by the cutoff threshold and result in worse scaffolding.
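The trade-off between edge count and cutoff can be explored with a small helper. This is a sketch under an assumption the text does not spell out, namely that the cutoff is a minimum number of supporting MPRs per directed contig pair; the function name `filter_edges` and the input format are hypothetical.

```python
from collections import Counter


def filter_edges(mpr_links, cutoff):
    """Keep contig-pair edges supported by at least `cutoff` mate pairs.

    mpr_links: iterable of (contig_i, contig_j) tuples, one per MPR.
    Returns {edge: number of supporting MPRs}.
    """
    support = Counter(mpr_links)
    return {edge: n for edge, n in support.items() if n >= cutoff}


links = [("c1", "c2")] * 3 + [("c2", "c3")] + [("c3", "c4")] * 2
for cutoff in (1, 2, 3):
    # Raising the cutoff shrinks the edge set: 3, then 2, then 1 edge here.
    print(cutoff, len(filter_edges(links, cutoff)))
```

Running the loop over a range of cutoffs on a real library gives the edge counts needed to judge where additional stringency stops paying off.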

3.5. Performance of the naive scaffolder

We summarize the results of our naive scaffolder on the five real data sets in Tables 6 and 7. For all data sets, the orientation accuracy was very high (>97%) and the position accuracy was also high (>89%). While the genome coverages of PSU and DS may appear surprising, note that the PSU library had very high coverage, whereas the DS library had low coverage and was made up of a number of different D. simulans strains. It is likely that the PSU contigs include misassembled fragments, making the total length of the contigs larger than the genome size. For DS, the combination of low coverage and relatively high rates of sequence differences between the different D. simulans strains likely resulted in lower genome coverage.

Table 6.

Summary of the Results of Our Naive Scaffolder on Real Data

Data set	N50	Genome coverage	Orientation accuracy	Position accuracy
PSU	17K	116.1%	99.64%	97.95%
PSY	75K	90.98%	98.26%	93.42%
PST	215K	97.89%	98.90%	89.89%
DS	942	59.48%	97.52%	96.07%
HS	18K	79.27%	98.28%	98.03%

N50 is the length n such that 50% of bases are in a scaffold of length at least n. The position accuracy measures how many neighboring contigs in the scaffold were placed in the correct order.
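The N50 definition above can be computed as follows. This is a minimal sketch; the function name `n50` is illustrative, and the input is assumed to be a non-empty list of scaffold lengths in bases.

```python
def n50(lengths):
    """Largest length n such that scaffolds of length >= n hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of the bases are covered.
    for n in sorted(lengths, reverse=True):
        running += n
        if 2 * running >= total:
            return n
    raise ValueError("lengths must be non-empty")


# Total is 32 bases; the two scaffolds of length 8 already cover 16 of them.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```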

Table 7.

Run time Comparison of Our Naive Scaffolder with Two Other State-of-the-Art Scaffolders, SOPRA and MIP Scaffolder

Data set	Naive Scaffolder	SOPRA	MIP Scaffolder
PSU	6m40.39s	237m27.237s	23m32.55s
PSY	59.36s	44m57.604s	3m14.03s
PST	67.21s	3009m29.224s	124m42.68s
DS	7m7.449s	N/A	36m42.05s
HS	241m33.928s	N/A	N/A

All times are the sum of the user and system times reported by the Linux time command. We ran all software on a 48-core Linux server with 256 GB of memory.

4. Discussion

In conclusion, we have presented a mathematical approach and an algorithm for constructing a contig digraph that encodes the relative positions of contigs based on mate-pair read data. Our main insight is a set of simple linear inequalities, derived from the geometry of contigs on the line, that we call SLIQ. We can use SLIQ both to efficiently filter out unreliable mate-pair reads (MPRs) and to predict the relative positions and orientations of contigs. We have shown that SLIQ outperforms the commonly used majority voting procedure for the prediction of the relative positions of contigs, while both methods are very accurate for orientation prediction. The contig digraph can also be directly processed into a set of linear scaffolds, and we have presented a simple scaffolding algorithm for doing so. Our naive scaffolder has high accuracy on all data sets tested and is very efficient: for practical purposes, it takes time linear in the size of the mate-pair library, and it is also very fast compared to other state-of-the-art scaffolders. The output of our naive scaffolder can either be used directly as draft scaffolds or serve as a reasonable starting point for refinement with the more complex optimization procedures used in other scaffolders.

One interesting and unexpected finding of our experiments was that the simple majority voting procedure performs very well for predicting the relative positions of contigs if the contigs have few errors. This can be seen from the performance of the majority voting procedure on synthetic contigs, which are constructed not with de novo assembly tools but by mapping the reads back to a reference genome and identifying regions of high coverage, and which are therefore expected to be of much higher quality (Table 5). This observation suggests a novel way to approach the scaffolding problem, in which the contig builder would output smaller but higher-quality contigs and leave the remainder of the assembly to the scaffolder. We believe this is a significant change from the philosophy of genome assembly programs to date, in which the contig-building step generally attempts greedily to build contigs that are as long as possible. This viewpoint also differs considerably from previous approaches to scaffolding, in which the focus was on resolving conflicts between mate pairs that gave conflicting information about the relative orientation and position of contigs.

Table 5.

Summary of the Results of Majority Prediction for Synthetic Datasets for C. elegans (SY-CE) and Human (SY-HS)

Data set	n	w_m	n_o	e_o	n_p	e_p
SY-CE	17620	3	17620	99.52%	17532	99.85%
SY-HS	878380	3	878380	98.93%	868877	99.47%

n, total number of edges connecting two different contigs; w_m, minimum weight of an edge for majority prediction; n_o, the number of edges for which we can predict relative orientation; e_o, the accuracy in relative orientation prediction; n_p, the number of edges for which we can predict relative position; e_p, the accuracy in relative position prediction.

Finally, we are exploring several possible extensions of the SLIQ method. The first extension is to find the optimal value for L, the insert length, so that we optimize the number and accuracy of relative position and orientation predictions. The second extension is to assign numerical values to the accuracy of prediction of MPRs of a particular rank. Finally, for the multiply mapped MPRs, which were not included in the results, we plan to identify the most likely mapping for the MPR, for example, by using their ranks.

Footnotes

Disclosure Statement

No competing financial interests exist.

References

Batzoglou, Jaffe D.B., Stanley, et al. 2002. Arachne: a whole-genome shotgun assembler. Genome Res., 12:177–189.

Boetzer, Henkel C.V., Jansen H.J., et al. 2011. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 27:578–579.

Butler, MacCallum, Kleber, et al. 2008. Allpaths: de novo assembly of whole-genome shotgun microreads. Genome Res., 18:810–820.

Chapman J.A., Ho, Sunkara, et al. 2011. Meraculous: de novo genome assembly with short paired-end reads. PLoS One, 6:e23501.

Dayarian, Michael T.P., Sengupta A.M. 2010. Sopra: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics, 11:345.

Farrer R.A., Kemen, Jones J.D.G., et al. 2009. De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiol. Lett., 291:103–111.

Gao, Nagarajan, Sung W.-K. 2011. Opera: Reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. LNBI, 6577:437–451.

Garey M.R., Johnson D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman: New York.

Gnerre, MacCallum, Przybylski, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U S A, 108:1513–1518.

Haubold, Wiehe. 2006. How repetitive are genomes? BMC Bioinformatics, 7:541.

Hopcroft, Tarjan. 1973. Efficient algorithms for graph manipulation. Communications of the ACM, 16:372–378.

Huson D.H., Reinert, Myers. 2002. The greedy path-merging algorithm for contig scaffolding. Journal of the ACM, 49:603–615.

Johnson D.B. 1975. Finding all the elementary circuits of a directed graph. SIAM Journal on Computing, 4:77–84.

Koren, Treangen T.J., Pop. 2011. Bambus 2: scaffolding metagenomes. Bioinformatics, 27:2964–2971.

Langmead, Trapnell, Pop, et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10:R25.

Li, Zhu, Ruan, et al. 2010. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20:265–272.

Pop, Kosack D.S., Salzberg S.L. 2004. Hierarchical scaffolding with Bambus. Genome Res., 14:149–159.

Salmela, Mäkinen, Välimäki, et al. 2011. Fast scaffolding with small independent mixed integer programs. Bioinformatics, 27:3259–3265.

Zerbino D.R., Birney. 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18:821–829.