Abstract
Abstract
The rapid adoption of high-throughput next generation sequence data in biological research is presenting a major challenge for sequence alignment tools—specifically, the efficient alignment of vast amounts of short reads to large references in the presence of differences arising from sequencing errors and biological sequence variations. To address this challenge, we developed a short read aligner for high-throughput sequencer data that is tolerant of errors or mutations of all types—namely, substitutions, deletions, and insertions. The aligner utilizes a multi-stage approach in which template-based indexing is used to identify candidate regions for alignment with dynamic programming. A template is a pair of gapped seeds, with one used with the read and one used with the reference. In this article, we focus on the development of template families that yield error-tolerant indexing up to a given error-budget. A general algorithm for finding those families is presented, and a recursive construction that creates families with higher error tolerance from ones with a lower error tolerance is developed.
1. Introduction
For re-sequencing and counting applications, the first step in the analysis is the correct placement of the DNA sequence reads on a reference genome or transcriptome. This is a challenging step due to the need to provide sensitivity in the presence of errors and sequence variations of all types (insertion, deletions, and substitutions) in reads as short as 25 nucleotides, and the need to rapidly align a very large number of reads, up to 109 per run for a HeliScope. Furthermore, the different short read technologies have very different error profiles, with substitutions being most common in the Illumina system, while deletions and insertions are the dominant error type in single molecule sequencing. Hence, in order to achieve an accurate, cost-effective and timely analysis, the aligner should be capable of aligning massive amounts of short reads with errors or sequence variations of all types to large genomic references.
In this article, we present a high-throughput alignment tool for effectively aligning variable length short reads that contain mutations or errors of all types. Our aligner has three stages: a novel error-tolerant indexing stage for seeding alignments, an edit distance (Myers, 1999) based filter for filtering out non-promising candidate locations, and finally a full dynamic programming alignment. Many aligners use a similar multi-stage strategy, including BLAST, BLASTZ, and BLAT (Altschul et al., 1997; Kent, 2002; Schwartz et al., 2003); and in the short read arena, SHRIMP and Mosaik (Rumble et al., 2009; Stromberg, 2009).
This article focuses on the seeding stage in which the portions of the reference that are most similar to a read are identified. Seeding alignments is generally achieved by finding small patterns of nucleotides, called “seeds,” shared exactly by the read and the reference. This process is fast and can be accelerated using a lookup table. Seeding is generally followed by an expensive alignment step, so reducing the portion of the reference considered in the second stage greatly reduces the overall computational cost. In lossless seeding, there is a guarantee that all regions of the reference within a certain error budget from a fixed size window of the query are found. In the lossy case, many but not all such regions are found.
In standard seeding with k-mers, the patterns are contiguous stretches of k nucleotides called “words” or “k-mers.” This approach is subject to the following tradeoff: short k-mers yield good sensitivity at the expense of lower specificity (more false positive locations) and longer computation time. Long k-mers increase specificity but reduce sensitivity. An increase in specificity with only a small reduction in sensitivity can also be achieved by requiring that each read have multiple short k-mer matches in the same reference locus. Aligners that use this approach include BLAST (Altschul et al., 1997), BLAT (Kent, 2002), and Mosaik (Stromberg, 2009) for short reads. Another extension to the contiguous seed model is found in BLAT (Kent, 2002), where each seed is allowed to contain a mismatch in at most one location.
Substantial increases in sensitivity with the same specificity can be achieved by using gapped words or gapped-seeds. Here, the common pattern between a read and a reference consists of a subset of nucleotides in a word, with mismatches allowed in the intermediate positions. For example, if only the first, fifth, and tenth nucleotide from a 10-mer are used, those nucleotides are concatenated into a 3-mer word, which is then compared to a reference's set of words generated by the same pattern. This approach has been widely studied in both the lossy (Buhler, 2001; Buhler et al., 2003; Burkhardt and Kärkkäinen, 2002; Califano and Rigoutsos, 1993; Choi and Zhang, 2004; Keich et al., 2004; Kucherov et al., 2004; Li et al., 2004; Ma et al., 2002; Noé and Kucherov, 2004; Sun and Buhler, 2004; Xu et al., 2004) and the lossless case (Burkhardt and Kärkkäinen, 2003; Kucherov et al., 2005; Pevzner and Waterman, 1995). Aligners that use gapped-seeds include FLASH (Califano and Rigoutsos, 1993), patternHunter (Ma et al., 2002), and blastZ (Schwartz et al., 2003), and the short read aligners include SHRIMP (Rumble et al., 2009) and MAQ (Li et al., 2008). MAQ use pairs of gapped seeds.
PatternHunter (Ma et al., 2002) uses a single seed with optimal sensitivity. The optimization is for a fixed specificity and a uniform probability model for substitutions. Algorithmic solutions to the problem of finding an optimal seed are presented in Choi and Zhang (2004) and Keich et al. (2004).
Multiple gapped seeds improve the sensitivity-specificity tradeoff even further. FLASH (Califano and Rigoutsos, 1993) uses multiple randomly chosen gapped seeds, and a similar approach was taken in LHS-ALL_PAIRS (Buhler, 2001). Algorithms for designing sensitive sets of gapped seeds are presented in Buhler et al. (2003), Li et al. (2004), Sun and Buhler (2004), and Xu et al. (2004). Computing the seeds assuming a Markov alignment model was considered in Buhler et al. (2003) and Sun and Buhler (2004). Extensions that enable protein similarity searches are presented in Brejova et al. (2003) and Brown (2004).
With the exception of the work of Burkhardt and Kärkkäinen (2002), prior approaches are useful only for finding regions of un-gapped similarity. Specifically, the gapped seed approach does not generate a match when insertions or deletions occur inside a seed, even in positions where a mismatch is allowed.
The seeding we develop here is both lossless and capable of finding gapped similarities. This is achieved by using an extension to the multiple gapped seed approach that we call covering template families. Here, a template is a pair of gapped seeds: one associated with the read and the other with the reference. A template generates a match if the read seed from the query matches a reference seed at some reference position. The gapped seeds discussed in Burkhardt and Kärkkäinen (2002) are equivalent to templates for the one gap case.
Covering template families guarantee that, for a given error budget, at least one template in the family generates a match between a fixed size sequence and another at edit distance no greater than the error budget. The edit distance is the minimal number of insertions, deletions, or substitutions between two sequences. This approach has “perfect sensitivity” (sensitivity is equal to 1) within the error budget, without sacrificing specificity. Covering sets of gapped seeds for un-gapped similarity were computed in Xu et al. (2004).
Here we present several approaches for developing covering template families. First a greedy exhaustive algorithm is presented. Then we present a recursive construction that creates families with higher error tolerance from ones with lower error tolerance.
The article is organized as follows. In Section 2.1, we introduce definitions of key concepts. In Section 2.2, we describe the process of indexing with covering template families. Section 2.3 describes the exhaustive greedy algorithm for finding covering template families, and indicates some properties of those suited to common usage. In Section 2.3, we present an explicit construction that leverages a family with a certain error budget to create a family with a higher error budget. Section 3 includes various results.
2. Methods
2.1. Definitions
In this section, we introduce several definitions, with the ultimate goal of introducing the concept of covering template families.
2.1.1. Gapped words
A word S of size N is a contiguous stretch of N nucleotides, N is the word size. A key of weight w is a vector of w distinct non-negative integers sorted in increasing order Keys are used to define a subsequence pattern within an arbitrary sequence. The key size is one plus the maximum integer in the key. Note that 0 based indexing is used throughout the article.
We denote the length of a sequence S by L(S), and given sequences S1 and S2, we denote their concatenation by S1 + S2.
Let
For a sequence
2.1.2. Templates and template families
In order to develop an indexing scheme that is tolerant of both substitutions and indels, gapped words are not sufficient. What is required is the template construct, which is essentially a pair of keys: one used with the query and one with the reference.
Definition
Given two keys K1 and K2 of same weight w, the pair T = (K1, K2) is called a template. K1 is called the reference key and denoted by T.referenceKey, while K2 is called the query key and is denoted by T.queryKey.
Definition
A collection of M templates
Definition
An e-error instance E derived from a word S is the sequence of nucleotides obtained by introducing e errors (insertions deletions or substitutions) into S.
Definition
Let E be an e-error instance derived from a word S. We say that a template T induces a match between E and S provided E[T.queryKey] = S[T.referenceKey]. Here, T has reference key size no greater than L(S) and read key size no greater than L(E).
Definition
An (N,w,f) template family
2.2. Indexing with covering template families
In this section, we motivate the use of covering template families for error tolerant indexing, and explain the actual algorithm for indexing with them. We also highlight computational complexity issues and tradeoffs associated with the use of template families.
2.2.1. Template indexing algorithm
Consider a read D sequenced from a known reference R in the presence of sequencing errors and/or mutations. We assume that the prefix E of D was derived from a word S of size N in R by introducing e errors and that L(E) = f ≤ N + e. The goal of the indexing algorithm is to discover the match between D and R in order to seed an alignment between them.
Any (N, w, f, e) covering template family
This match between a small section of the read and a small section of the reference is found despite the presence of e errors or mutations in that section.
The template T* and m required to find the match are unknown. Therefore, equation (1) should be tested for all members of

Basic template indexing algorithm.
The computational complexity of the indexing algorithm of Figure 1 is O(|
The loop in line A yields the |
The loop in line C yields the |R| term and can be avoided by using an index lookup. Specifically, an index of all w-mers R[m + T.referenceKey], 0 ≤ m ≤ L(R)−N, is computed for each
The sensitivity of the algorithm is increased at the expense of run-time when a large number of positions in the read are probed. The step-size Δ on line B controls this tradeoff.
Note that conventional indexing with k-mers is a special case of the algorithm of Figure 1 with
2.2.2. Sensitivity and specificity
It is useful to compare the sensitivity and specificity of indexing with k-mers to that of indexing with an (N,w,f,e) covering template family
The use of
Moreover, a match is guaranteed when all of E′s positions are queried, i.e., Δ = 1 in line B of Figure 1.
To determine K(
The second term in the numerator on the right of (3) also found in (4) is the number of positions indexed in the read, and 1/4k is the probability of a word matching at a given location in the reference. In equation (4) |
Equating the right sides of equations (3) and (4), and solving for k yields the k-mer index size K(
In Sections 3.1.1 and 3.1.2, k(
2.3. Greedy algorithm for finding covering template families
We now present a greedy algorithm for finding small covering template families. At each iteration of the algorithm, a new template is added to that family. This template covers as many yet uncovered error instances as possible. The algorithm generates an (N,w,f,e) covering template family provided that N−e ≥ w and N ≤ f ≤ N + e. We then prove that the algorithm terminates with a correct solution.
2.3.1. Algorithm description
The algorithm is presented in Figure 2. and we refer to it as the greedy covering template family algorithm. It requires two inputs: the set of error instances and the set of query keys which we now describe.

Covering template family algorithm.
We illustrate these sets for the (N, w, f, e) = (8, 6, 8, 1) family. Table 1 presents the set of error instances, while Table 2 presents the key set
The description of the greedy algorithm requires the following definition.
Definition
Given a key K and an error instance
Definition
A position j in
2.3.2. Proof of correct termination
We now prove in three steps that the algorithm generates a covering template family
Lemma 1
For each
Introducing the e + 1st error into E′ followed by a truncation or padding with the symbol N yields E. If this error is an insertion, the positions that follow are shifted to the right and only the last position in the prefix of E′ is no longer in that of E. Hence, the latter has at least (N−e) − 1 = N − (e + 1) valid positions. The case of substitution and deletion are proved in a similar fashion. ▪
The next lemma shows that the greedy algorithm always terminates in a bounded number of steps.
Lemma 2
At each iteration of the block beginning line A, the size of errorInstanceSet decreases by at least one.
The next lemma shows that the template family generated by the algorithm yields a match between all
Lemma 3
Let
Error instances of
Remark 1
Let E be an error instance generated by introducing n < e errors into W, and let
We summarize the analysis of this section with the following theorem.
Theorem 1
The greedy covering template family algorithm always terminates in a bounded number of steps and finds a covering template family.
We note that the speed of the greedy algorithm depends on the parameters of the family, as the latter affect the size of the error instance set, and the key set. We also note that the covering template family algorithm can be used with error instances involving either errors of all types, or errors of just one type (e.g., just deletions). In fact, any combination of error types is possible.
2.4. Modular template family construction
We now construct a family tolerant to k + 1 errors using a family tolerant of k errors. Specifically, a (N + w/2, w, N + w/2 + k + 1, k + 1) covering template family
2.4.1. Preliminary results
We begin by compiling preliminary definitions and results.
Definition
A (K, D) symmetric key, is the key
A symmetric key is used to extract from a sequence two words of length K separated by a gap of length D.
Definition
A symmetric template family STF(K, D, k)
Lemma 1
Consider word S and an error instance E obtained by introducing n errors
since a substitution does not change the length of a sequence while an insertion or a deletion respectively increases or decreases the length by one. Introducing
Combining (6), (7), and the definition of Φ proves the lemma ▪
Lemma 2
Consider a word S of size N = 2K + D and its subsequences: S1 its first K nucleotides, S2 the next D nucleotides and S3 the last K nucleotides. Let
2.4.2. The modular construction
To construct
We determine
Group 1
Group 2
E has k + 1 errors and
The number of errors in
Moreover, lemma 1 implies
It follows that
This implies that there is a
Group 3
We summarize this analysis in the following theorem.
Theorem 2
Let
Then
In section 2.4.3 we show that indexes of
2.4.3. The modular recursion
The construction of Section 2.4.2 can be applied recursively to obtain a hierarchy of covering template families
Lemma 3
The local error rate k/Nk is monotonically increasing with k and
Dividing k by Nk and taking the limit with respect to k yields (13). To show monotonic behavior, we take the derivative with respect to k of k/Nk to obtain
Both numerator and denominator in (15) are positive. ▪
The derivative in (15) converges to zero as k → ∞ , indicating that only a few applications of the recursion are useful in practice.
We now determine the number of indexes and templates required to search with
Definition
Let T be a template then Ts = (m + T.referenceKey, n + T.queryKey) m, n ≥ 0 is a shifted version of T.
Lemma 4
Let (a) Any template of (b) The number of templates in Base(
In (16) 2(j + 1) + 1 is the number of templates in STF(w/2, Nj-w/2, j + 1) ▪
Lemma 5
Let (a) The index for T can be used for Ts. (b) If Ts.queryKey = T.queryKey then Ts is redundant.
Lemma 6
The number of indexes required by the k'th family is O(k).
Lemma 7
The number of non-redundant templates in
Note that for some n [T,n] may be empty and that
Moreover, if m* = min M(T, n) and T* = (T.referenceKey + m*,T.queryKey + n) Lemma 5 b) implies that all templates in [T, n] are redundant except T*. In conjunction with (17) this yields the bound |Base(
3. Results
3.1. Families generated by greedy algorithm
3.1.1. General properties
We now present properties of template families generated by the greedy covering template family algorithm. Table 3 presents the number of templates, k(
The larger the value of K(
3.1.2. Comparison with k-mer indexing
In this section, we verify computationally the results of Sections 2.2.2 and 3.1.1 for the (26,16,26,2) template family
When a query word matches more than M = 10000 locations in the reference, it will be ignored. Table 4 presents the percentage of reads with at least one match to their location of origin (% Found), the total number of candidate locations generated in each alignment run (Candidate positions), and the number of candidate locations normalized by the number generated by
The fraction of reads that match their location of origin when seeding with
3.1.3. Template family structure
In this section, we study the template structure of a frequently used family generated by the greedy covering template family algorithm. This structure yields a compression scheme for pre-computed indexes in one of the implementations of the indexing algorithm which is described in Section 3.3.2.
The Appendix presents the templates of the (18,16,18,1) family. Tables 6–9 correspond to the substitution only, deletion only, insertions only and all error types, respectively.
The nine reference keys Ki,
The last template in Tables 6 and 7 are identical so the total number in all three tables is 26, and is equal to the one in Table 9. We also note that there is some redundancy in the templates at the edge of the word. For example, the first templates in Tables 6 and 7 are equivalent, because the reference is indexed at every position.
3.2. Modular families
Here, we present the parameters of the families generated by the modular recursion of Section 2.4.3, where
We see that a relatively high error tolerance can be achieved with a modest number of indexes.
3.3. Implementation details
3.3.1. Indexing with merge sort
Indexing is implemented as a merge-sort process. Specifically, all gapped words are encoded as integers, by translating each w-mer as a base 4 number with {A, C, G, T} respectively mapped to {0, 1, 2, 3}. Translation can be done either left to right or right to left, respectively called right significant and left significant.
Matches are found by sorting reference and query gapped words, and merging the resulting sorted lists. Only a portion of the sorted index needs to be in memory at any point in time.
3.3.2. Database compression
It is desirable to pre-compute the indexes in order to reduce computation time. A substantial compression of the index data-base is achieved for certain families by leveraging the structure of the reference keys.
We illustrate this for the (18, 16, 19, 1) family. The 9 reference keys of Table 6 consist of two groups. The first five share their last 8 positions, while the last four share their first 8. A single database for the first five reference keys, and last 4 keys is generated by encoding the full 18 nucleotide word in a right and left significant mode, respectively. The resulting lists are then sorted to create two databases.
When processing a specific reference key bit-wise operations are used to extract the encoding for that key from that of the word. This results in a partially sorted index, which is fully sorted by the most significant 16 bit radix. This index is loaded into memory in small sections partitioned along the significant radix boundaries. Each section can be fully sorted with a small amount of additional computation.
3.3.3. Software
The helisphere distribution contains several programs associated with the aligner:
IndexDP: very flexible version of the aligner, designed for references of size up to 100MB. Templates: program for generating template families used by indexDP. It is based on the greedy template family algorithm. IndexDPgenomic: designed for large references it has a small memory footprint. It makes use of the pre-computed compressed database discussed in section 3.3.2. PreprocessDB: creates compressed index database for indexDPgenomic.'
4. Discussion
We have developed several approaches for generating covering template families and have demonstrated that they provide a very efficient form of indexing. Specifically, for a given error budget, they yield a match with substantially fewer spurious false positive matches than k-mer indexing. This new form of indexing enabled us to create a short read aligner that is tolerant of errors of all type.
5. Appendix
Here we present the templates associated with the (18,16,18,1) family, generated by the algorithm of Section 2.3. Tables 6–8 correspond to the case where the algorithm was run on each error type separately. Table 9 corresponds to the case where all error types where considered simultaneously.
Footnotes
Disclosure Statement
All of the authors of this manuscript (except G.M.) are or have been employees of Helicos BioSciences. G.M. was on the scientific advisory board of Helicos BioSciences Corporation.
1
Reference is assumed to be generated by an iid model, with equal probability for each nucleotide.
