Fast and Accurate Algorithms for Mapping and Aligning Long Reads

Abstract

For DNA sequence analysis, we are facing challenging tasks such as the identification of structural variants, the sequencing of repetitive regions, and the phasing of alleles. These tasks suffer from the short length of sequencing reads, where each read may cover fewer than two single nucleotide polymorphisms (SNPs) or fewer than two occurrences of a repeated region. Long reads are believed to help with these tasks. In this study, we have designed new algorithms for mapping long reads to reference genomes. We have also designed efficient and effective heuristic algorithms for local alignment of long reads against the corresponding segments of the reference genome. To design the new mapping algorithm, we formulate the problem as the longest common subsequence with distance constraints. The local alignment heuristic is based on recursive alignment of k-mers, where the value of k differs in each round. We have implemented all the algorithms in C++ and produced a software package named mapAlign. Experiments on real data sets showed that the new approach generates better alignments in terms of both identity and alignment scores for both Nanopore and single molecule real time sequencing (SMRT) data sets. For the human Nanopore and SMRT data sets, the new method successfully matches/aligns 91.53% and 85.36% of the letters in the reads to identical letters on the reference genome, respectively. In comparison, the best known method aligns only 88.44% and 79.08% of the letters for the Nanopore and SMRT data sets, respectively. Our method is also faster than the best known method.

1. Introduction

Many fields in biological studies have been significantly changed by new sequencing technologies. For DNA sequence analysis, we are facing challenging tasks such as the identification of structural variants, the sequencing of repetitive regions, and the phasing of alleles. These tasks suffer from the short length of sequencing reads, where each read may cover fewer than two single nucleotide polymorphisms (SNPs) or fewer than two occurrences of a repeated region. Long reads are believed to help with these tasks. However, current long-read sequencing technology has high error rates. For DNA sequence analysis, the first step is read mapping. The other basic step is to align reads with the corresponding segments of the reference genome. Many methods exist for read mapping, but most of them do not work properly for long reads with high error rates. A few read mapping methods designed specifically for long reads are now available. However, they are still very slow. In addition, the accuracy of existing local alignment methods needs to be improved when long reads are used.

The problem of read mapping has been well solved for short reads. A large number of tools have been developed for short reads that can map large data sets (such as data sets for human individuals) in reasonable time. Bwa is often considered the best tool for short reads (Li and Durbin, 2009). In general, tools for short reads do not work well for large, long-read data sets. Recently, a few tools for long-read mapping have been developed. For single molecule real time sequencing (SMRT) data, Basic Local Alignment with Successive Refinement (Chaisson and Tesler, 2012) is a tool developed by the PacBio company. In 2016, GraphMap (Sović et al., 2016) was proposed. It was the first tool reported to be designed for Nanopore data. The minimizer technique was introduced in MashMap (Jain et al., 2017). Recently, Minimap (Li, 2016) and Minimap2 (Li, 2017) have been developed for long-read mapping. The latest version, Minimap2, can both map reads and produce alignments of long reads against the corresponding segments of the reference genome. Minimap2 also uses minimizers to reduce the size of the reference genome index and speed up the mapping algorithm. A dynamic programming approach is then used to find the location of each read by calculating a score that reflects the number of matched minimizers/k-mers and other factors.

In terms of local alignment, many tools have been developed for aligning reads with reference genomes. The seed-based method (Altschul et al., 1990) was first proposed in 1990. A seed DNA sequence is found using a “hash table” containing all k-mers present in the first DNA sequence. The hash table is then used to locate the occurrences of the k-mer sequence in the other DNA sequence. Subsequently, the seed is extended on both sides to complete the alignment. Other tools such as Blat (Kent, 2002), Soap (Li et al., 2008), SeqMap (Jiang and Wong, 2008), mrsFast (Hach et al., 2010), and Pass (Campagna et al., 2009) also use the seed technique. The method is simple and quick for shorter sequences. However, it is more memory-intensive for long sequences.

PatternHunter (Ma et al., 2002) extended the seed approach and proposed the “spaced seeds” concept. The main idea is to require only some positions of the seed to be matched. Methods that use multiple seed hits and allow indels include Shrimp (Rumble et al., 2009) and RazerS (Weese et al., 2009). Several very fast tools such as Ssaha2 (Ning et al., 2001), Bwasw (Li and Durbin, 2010), Yoabs (Galinsky, 2012), and BowTie (Langmead et al., 2009) have been created based on the “Burrows-Wheeler Transform,” a technique that was first used for data compression (Burrows and Wheeler, 1994). Alignment tools such as Soap3 (Liu et al., 2012a), BarraCUDA (Klus et al., 2012), and Cushaw (Liu et al., 2012b) combine “retrieval-based” approaches with general-purpose computing on graphics processing units (GPUs), taking advantage of parallel GPU cores to accelerate the process. Nevertheless, the above approaches are still very slow when handling Nanopore and SMRT data sets.

Single instruction multiple data (SIMD) is a technology that uses a single instruction to control multiple processing units and perform the same operation on each set of data (also known as a “data vector”) to achieve data-level parallelism. This technique is used to accelerate the dynamic programming-based pairwise alignment algorithms (Needleman and Wunsch, 1970; Smith and Waterman, 1981). Farrar's striped dynamic programming algorithm (Farrar, 2007) is the most popular way of applying SIMD to pairwise alignment. Ssw (Zhao et al., 2013) and Parasail (Daily, 2016) both adopted Farrar's algorithm. Suzuki and Kasahara (2018) improved Farrar's algorithm by introducing a difference recurrence relation. Minimap2 adopted Suzuki and Kasahara's algorithm for fast pairwise alignment between the fixed anchors. The SIMD technique is a key to the success of Minimap2, which is very fast. On multi- and many-core processors, Hou et al. (2016) combined SIMD vectorization strategies to accelerate pairwise alignment algorithms and proposed an efficient vectorization framework. GPUs and field-programmable gate arrays are also used for hardware acceleration of pairwise alignment algorithms. Some methods using GPUs are given in Striemer and Akoglu (2009), Benkrid et al. (2012), Li et al. (2014), Ahmed et al. (2019), and Sundfeld et al. (2020).

In this study, we have designed new algorithms for mapping long reads to reference genomes. We have also designed efficient and effective heuristic algorithms for local alignment of long reads against the corresponding segments of the reference genome. To design the new mapping algorithm, we formulate the problem as the longest common subsequence with distance constraints (LCSDC). The local alignment heuristic is based on recursive alignment of k-mers, where the value of k differs in each round. We have implemented all the algorithms in C++ and produced a software package named mapAlign. Experiments on real data sets showed that the new approach generates better alignments in terms of both identity and alignment scores for both Nanopore and SMRT data sets.

For the human Nanopore and SMRT data sets, the new method successfully matches/aligns 91.53% and 85.36% of the letters in the reads to identical letters on the reference genome, respectively. In comparison, the best known method aligns only 88.44% and 79.08% of the letters for the Nanopore and SMRT data sets, respectively. Our method is also faster than the best known method, even though, unlike the state-of-the-art method, we did not use the SIMD technique. Our new mapping method is based on a model called the longest common subsequence with approximate distance constraint ($LCSDC_{\delta}$). We propose an $O(m \log n)$ time heuristic algorithm for $LCSDC_{\delta}$. The heuristic local alignment algorithm recursively aligns k-mers, using a different value of k in each round.

2. Methods

The whole method includes two parts, where the first part identifies the location of each read in the reference genome and the second part aligns the read against the corresponding segment in the reference genome. The second part is referred to as local alignment.

2.1. Identifying the location of a read in the reference genome

All existing methods for locating a long read in the reference genome share a common technique: they start with a set of k-mers from the read and use the positions of those k-mers in the reference genome to find the correct location of the read. Multiple k-mers are needed because a long read can be very long, for example, 100k base pairs. Methods differ in how they select these k-mers.

2.1.1. Reference genome index

Homopolymer-compressed (HPC) k-mers are used to improve overlap sensitivity for SMRT reads (Li, 2017). They were first proposed in SmartDenovo (https://github.com/ruanjue/smartdenovo; J. Ruan). An HPC k-mer is obtained from a sequence by compressing every run of identical letters into one letter. It contains k letters, and any two consecutive letters in the k-mer are different. We use HPC k-mers for reference genomes.
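As an illustration, homopolymer compression can be sketched in a few lines of Python. This is a minimal sketch and not the mapAlign implementation (which is written in C++); the function names `hpc_compress` and `hpc_kmers` are hypothetical, and reporting the start of the first run as the position of a k-mer is an assumption made for this example.

```python
def hpc_compress(seq):
    """Collapse every run of identical letters into a single letter."""
    out = []
    for ch in seq:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def hpc_kmers(seq, k):
    """Yield (start, kmer) for every window of k letters of the compressed
    sequence. `start` is the index in the ORIGINAL sequence where the first
    run of the window begins (an assumption of this sketch)."""
    starts, letters = [], []
    for pos, ch in enumerate(seq):
        if not letters or letters[-1] != ch:
            starts.append(pos)      # a new run begins here
            letters.append(ch)
    for j in range(len(letters) - k + 1):
        yield starts[j], "".join(letters[j:j + k])
```

For example, the sequence AAACCGTTTA compresses to ACGTA, so it contains three HPC 3-mers even though it is ten letters long.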

The popular approach to converting a k-mer into an integer index is the base-4 hash function in Eq. (1). It converts a k-mer string $s = a_{1} a_{2} a_{3} \dots a_{k}$ into an integer index ($0 \leq \mathit{index} < 4^{k}$), where k is the k-mer size, $a_{i} \in \{A, T, G, C\}$, and $H(A) = 0$, $H(T) = 1$, $H(G) = 2$, $H(C) = 3$.

$K(s) = H(a_{1}) + H(a_{2}) \times 4 + H(a_{3}) \times 4^{2} + \dots + H(a_{k}) \times 4^{k - 1}$ (1)

To increase the speed of our algorithm, we use another approach. We directly use an array of size $4 \times 3^{k - 1}$ as a hash table. Each k-mer p corresponds to an integer $i(p)$ between 0 and $4 \times 3^{k - 1} - 1$. We use $i(p)$ to indicate the location of k-mer p in the indexing array. When k is very large, $i(p)$ can serve as the key in a normal hash table. Each cell in the hash table (array) corresponds to a list of positions in the reference genome where the k-mer appears. This works because any two consecutive letters in an HPC k-mer are different, so after the first letter there are only three choices for each subsequent letter. The new hash function is given in Eq. (2), where $s = a_{1} a_{2} a_{3} \dots a_{k}$ is a k-mer string of size k and $M(A, A) = 0$, $M(T, T) = 1$, $M(G, G) = 2$, $M(C, C) = 3$, $M(A, T) = M(T, A) = M(C, G) = M(G, C) = 0$, $M(A, G) = M(G, A) = M(T, C) = M(C, T) = 1$, $M(A, C) = M(C, A) = M(T, G) = M(G, T) = 2$.

$K(s) = M(a_{1}, a_{1}) \times 3^{k - 1} + M(a_{1}, a_{2}) \times 3^{k - 2} + M(a_{2}, a_{3}) \times 3^{k - 3} + \dots + M(a_{k - 1}, a_{k}) \times 3^{0}$ (2)
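The hash function in Eq. (2) can be sketched as follows. This is an illustrative Python sketch rather than the authors' C++ code; the names `FIRST`, `STEP`, and `hpc_hash` are hypothetical. The tables simply transcribe the values of M listed in the text, and the loop evaluates Eq. (2) by Horner's rule.

```python
# M(a, a): digit for the first letter of the k-mer.
FIRST = {"A": 0, "T": 1, "G": 2, "C": 3}

# M(a, b) for a != b, transcribed from the definition of Eq. (2).
STEP = {
    ("A", "T"): 0, ("T", "A"): 0, ("C", "G"): 0, ("G", "C"): 0,
    ("A", "G"): 1, ("G", "A"): 1, ("T", "C"): 1, ("C", "T"): 1,
    ("A", "C"): 2, ("C", "A"): 2, ("T", "G"): 2, ("G", "T"): 2,
}


def hpc_hash(kmer):
    """Evaluate Eq. (2): one digit in {0..3} for the first letter, followed
    by k-1 base-3 digits, one per pair of consecutive letters."""
    key = FIRST[kmer[0]]
    for prev, cur in zip(kmer, kmer[1:]):
        # Raises KeyError when prev == cur, i.e. the input is not an HPC k-mer.
        key = key * 3 + STEP[(prev, cur)]
    return key
```

Because each letter has exactly three possible successors and M assigns them the distinct digits 0, 1, and 2, every HPC k-mer receives a distinct key between 0 and $4 \times 3^{k - 1} - 1$, so the key can be used directly as an array index.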

The compressed k-mer and the new hash function have the following advantages. (1) We can noticeably reduce the key space of the hash function for converting a k-mer. For example, for k = 15 the largest keys are $4^{15} - 1$ and $4 \times 3^{14} - 1$ for the hash function in Eq. (1) and our new approach, respectively, so the key space of our new approach is over 50 times smaller than that of Eq. (1). (2) We can use a short compressed k-mer string to cover a much longer uncompressed string. This is because compression collapses each run of identical letters into a single letter, so the uncompressed string behind a compressed k-mer has at least k letters and is usually longer. We did experiments on the first chromosome of human genome GRCh38 with different k-mer sizes, namely k = 14, 15, and 16. The corresponding distribution of the original (uncompressed) sequence length is given in Table 1. We can see that a compressed k-mer of size k usually covers a much longer uncompressed string.

Table 1.

The Distribution of Original k-mer Sequence Length of Compressed k-mer for k = 14, 15, and 16 (Chromosome 1 of GRCh38)

Original sequence length	k = 14	k = 15	k = 16
14	1.84%	-	-
15	4.75%	1.44%	-
16	9.02%	3.72%	1.15%
17	12.45%	7.54%	2.94%
18	14.25%	10.98%	6.23%
19	13.59%	13.24%	9.55%
20	11.88%	13.37%	12.15%
21	9.50%	12.29%	12.99%
22	7.14%	10.19%	12.33%
23	5.13%	8.05%	10.76%
24	3.50%	6.01%	8.85%
> 24	6.95%	13.17%	23.05%

2.1.2. Identifying the location of a read

To find the location of a read in the reference genome, the main difficulty is that each k-mer on a read may appear at several places in the reference genome. To find the “true” location of a read in the reference genome, multiple k-mers on the read should be used. To handle multiple k-mers, we need to design corresponding efficient and effective algorithms.

2.1.2.1. Sampling k-mers from a read

There are different ways to sample k-mers. The first issue is how many k-mers to sample. The other issue is how to combine the information from those k-mers to identify the true location of the read in the reference genome. GraphMap uses all the k-mers on a read, decomposes the whole genome into a set of (overlapping) buckets, and looks at the number of k-mers in each bucket (Sović et al., 2016). Note that the number of buckets is proportional to the total length of the genome, so it may be very large for large genomes. Moreover, the number of k-mers in each bucket is not an accurate measure, because the order among the k-mers and the distance between two consecutive k-mers are not taken into consideration. GraphMap therefore has to use all the k-mers on the read, and as a result it is very slow compared with the best known tools, for example, Minimap2 (Li, 2017).

To reduce the number of k-mers, Minimap2 adopted the minimizer technique and designed a dynamic programming heuristic to identify the true location of the read in the reference genome, using an objective function that considers the number of matched k-mers and other factors.

In this study, we propose a new method that needs only a relatively small number n of k-mers distributed approximately evenly over the read. The default value is $n = 128$. Note that, for the SMRT data sets, the average read length is 6k to 18k bps. Thus, n = 128 is very small compared with 18k (if all the k-mers of a read are used) and 18k/10 (if minimizers with window size 10 are used). Since n is small, we need a more accurate measure to handle the n sampled k-mers. Therefore, we formulate the problem as the LCSDC problem.

For a read r, we select n k-mers that are evenly distributed over the read. Let $r = r_{1} r_{2} \dots r_{n}$ be a sequence of k-mers, where the n k-mers r_i are evenly distributed over the read and the distance between two consecutive k-mers is denoted as d. We treat each r_i as a letter in the sequence r.

Each k-mer r_i occurs at several locations over the reference genome. Let $g = g_{1} g_{2} \dots g_{m}$ be a sequence, where each g_j is an occurrence of a k-mer r_i for some $1 \leq i \leq n$ .

2.1.2.2. The LCSDC problem

Let $r = r_{1} r_{2} \dots r_{n}$ be the sequence obtained by sampling n k-mers from the read. Let $g = g_{1} g_{2} \dots g_{m}$ be the sequence of the occurrences of those n k-mers in the reference genome. When both r and g are treated as sequences, we assume that each g_j is a letter identical to r_i for some $1 \leq i \leq n$ . Here we want to compute a longest common subsequence $s = s_{1} s_{2} \dots s_{t}$ for g and r such that for any two consecutive letters s_i and $s_{i + 1}$ in s, the distances between s_i and $s_{i + 1}$ in r and g are identical.

Since there are indels, we have the approximate version, the $L C S D C_{δ}$ problem, where we want to find a longest common subsequence $s = s_{1} s_{2} \dots s_{t}$ for r and g such that for any two consecutive letters s_i and $s_{i + 1}$ in s, the distances between s_i and $s_{i + 1}$ in r and g can differ by at most $δ$ .

2.1.2.3. An $O (m)$ expected time exact algorithm for LCSDC

Let $r = r_{1} r_{2} \dots r_{n}$ be the sequence of n k-mers for the read. For each r_i, we have a list of all the occurrences of r_i in the genome. We represent g as n lists L₁, L₂, $\dots$, L_n, where each list L_i stores the positions (sorted from left to right) of the occurrences of r_i on the genome. Two consecutive lists, say L_i and $L_{i + 1}$, can be merged into a new list $L_{i, i + 1}$ as in Algorithm 1.

It is easy to see that the running time of Algorithm 1 is linear in terms of the total length of L_i and $L_{i + 1}$ .

The complete algorithm is as follows. We merge L_i and $L_{i + 1}$ for i = 1, 3, 5, …, n-1 (assuming that n is even). In each round, we repeatedly merge pairs of consecutive lists, so the number of lists is halved after each round. The process stops after $\log n$ rounds. Finally, we scan the remaining list once more and report the longest (merged) item, which is the LCSDC. This would give an $O(m \log n)$ time algorithm, provided that the merge step is correct in every round.

Algorithm 1. Merging Two Consecutive Sorted Lists
Input: two sorted lists L_i and $L_{i + 1}$
$x \leftarrow$ first item in L_i, $y \leftarrow$ first item in $L_{i + 1}$
while (x is not the end of L_i) and (y is not the end of $L_{i + 1}$) do
    (Remark: x should always be before y on the genome.)
    if $d(x, y) < d$ then
        add y into the new list $L_{i, i + 1}$
        $y \leftarrow y.\mathit{next}$
    else if $d(x, y) > d$ then
        add x into the new list $L_{i, i + 1}$
        $x \leftarrow x.\mathit{next}$
    else
        create a new item (x, y) by merging the two items x and y
        add (x, y) into the new list $L_{i, i + 1}$
        $x \leftarrow x.\mathit{next}$, $y \leftarrow y.\mathit{next}$
    end if
end while
add the remaining items of L_i or $L_{i + 1}$ to the list $L_{i, i + 1}$
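For concreteness, a single merge step of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the mapAlign code: the function name `merge_exact` is hypothetical, positions are plain integers, $d(x, y)$ is taken to be $y - x$, and each item of the output list is represented as a tuple of the positions it contains.

```python
def merge_exact(list_i, list_j, d):
    """Merge two sorted position lists as in Algorithm 1.

    Returns a list of tuples: (x,) or (y,) for unmerged items and (x, y)
    for a pair whose genome distance y - x equals d exactly."""
    merged = []
    a = b = 0
    while a < len(list_i) and b < len(list_j):
        x, y = list_i[a], list_j[b]
        gap = y - x                      # d(x, y), assumed to be y - x
        if gap < d:                      # y is too close to x (or before it)
            merged.append((y,))
            b += 1
        elif gap > d:                    # x is too far behind y
            merged.append((x,))
            a += 1
        else:                            # exact distance: merge the pair
            merged.append((x, y))
            a += 1
            b += 1
    # At most one of the two lists still has items left.
    merged.extend((x,) for x in list_i[a:])
    merged.extend((y,) for y in list_j[b:])
    return merged
```

Each iteration advances at least one of the two indices, which is why the running time is linear in the total length of the two lists.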

It is easy to see that the merge process only works well for the first round and fails on the rest of the rounds. The merge process heavily depends on the fact that the “distance” between the two lists is a fixed value d. For the newly created lists, there are items from different lists. Thus, the distance between two items from the two merged lists may differ. For example, the distances between two items from $L_{1, 2}$ and $L_{3, 4}$ may have four different values.

To overcome the problem, when creating the new list $L_{i, i + 1}$ from L_i and $L_{i + 1}$, we move each item y in $L_{i + 1}$ that cannot be merged with any item in L_i forward by the distance between the original lists L_i and $L_{i + 1}$. In this way, when we merge $L_{i, i + 1}$ and $L_{i + 2, i + 3}$, we can always use the distance between L_i and $L_{i + 2}$ as the distance between the two lists. With this modification, the $O(m \log n)$ algorithm works correctly.

Theorem 1. There is an $O(m)$ expected running time algorithm for the LCSDC problem.

Proof. The first step is to move every item in list L_i, for $i = 2, 3, \dots, n$, forward by the distance between L₁ and L_i to obtain a new list $L'_{i}$ (with $L'_{1} = L_{1}$). Note that the distance between L₁ and L_i is always known, since it is the distance on the read between r₁ and r_i. After that, for each position in $L'_{i}$, $i = 1, 2, \dots, n$, we use a counter to store the number of times that position is visited during the whole process. This takes $O(m)$ expected time if we use a hash table with the positions (after moving) as keys and the counters as values. At the end, we output the largest counter over all the (at most m) positions as the length of the LCSDC. A common subsequence can also be generated with a backtracking process.
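The counting argument in the proof can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the name `lcsdc_exact` is hypothetical, consecutive sampled k-mers are assumed to be exactly d apart on the read, and "moving forward" is realized by subtracting $(i - 1) \times d$ from every position in L_i so that all lists share the coordinates of L₁.

```python
from collections import Counter


def lcsdc_exact(lists, d):
    """Return (length, start) of an exact LCSDC.

    lists[i] holds the genome positions of the (i+1)-th sampled k-mer.
    `start` is the genome position that the first sampled k-mer would have
    under the winning placement, or None when there are no occurrences."""
    votes = Counter()
    for i, positions in enumerate(lists):
        for p in set(positions):       # set() guards against duplicate entries
            votes[p - i * d] += 1      # shift into the coordinates of L_1
    if not votes:
        return 0, None
    start, length = max(votes.items(), key=lambda kv: kv[1])
    return length, start
```

Every occurrence is touched once and each hash-table update takes expected constant time, which gives the $O(m)$ expected bound. For instance, with d = 100 and lists [1000, 5000], [1100, 7000], [1200, 5200], [9000], the shifted position 1000 is visited three times, so the LCSDC has length 3.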

2.1.2.4. An $O(m \log n)$ running time heuristic algorithm for $LCSDC_{\delta}$

When there are indels, the approximate version $LCSDC_{\delta}$ is more complicated. We do not know of any efficient algorithm that obtains an optimal solution. Since speed is crucial, we give an $O(m \log n)$ time heuristic algorithm for $LCSDC_{\delta}$. The main difficulty arises when many identical k-mers occur at nearby locations in the reference genome: those nearby identical k-mers compete to match the original k-mer on the read, and we need to decide which of them best matches the k-mer on the read. The following heuristic algorithm works well in practice for $LCSDC_{\delta}$.

First, we remove identical k-mers within distance $0.25\delta$ from both g and r. After that, we use an algorithm similar to Algorithm 1 to merge two consecutive lists L_i and $L_{i + 1}$. The only differences are that (a) the condition $d(x, y) = d$ is replaced with the condition $d - \delta \leq d(x, y) \leq d + \delta$; and (b) when the first y in $L_{i + 1}$ is found to satisfy $d - \delta \leq d(x, y) \leq d + \delta$, the next few elements after y in $L_{i + 1}$ are checked and we choose the one with the smallest error. Since this is the approximate version, we cannot use the exact position as the key of a hash table. Here $\delta$ is proportional to the distance between the two lists, so it becomes very large when the two items come from lists that are far apart. The algorithm for $LCSDC_{\delta}$ is given in Algorithm 2.

Algorithm 2. Merge Two Sorted Lists with Error $\delta$
Input: two sorted lists L_i and $L_{i + 1}$, and $\delta$, where $0 \leq \delta < d$
$x \leftarrow$ first item in L_i, $y \leftarrow$ first item in $L_{i + 1}$
while (x is not the end of L_i) and (y is not the end of $L_{i + 1}$) do
    (Remark: x should always be before y on the genome.)
    if $d(x, y) < d - \delta$ then
        add y into the new list $L_{i, i + 1}$
        $y \leftarrow y.\mathit{next}$
    else if $d(x, y) > d + \delta$ then
        add x into the new list $L_{i, i + 1}$
        $x \leftarrow x.\mathit{next}$
    else
        $t \leftarrow y.\mathit{next}$
        // check the next element after y to see if there is a better match
        if (t is not the end of $L_{i + 1}$) and ($|d(x, t) - d| \leq |d(x, y) - d|$) then
            add y into the new list $L_{i, i + 1}$
            $y \leftarrow t$
        else
            create a new item (x, y) by merging the two items x and y
            add (x, y) into the new list $L_{i, i + 1}$
            $x \leftarrow x.\mathit{next}$, $y \leftarrow y.\mathit{next}$
        end if
    end if
end while
add the remaining items of L_i or $L_{i + 1}$ to the list $L_{i, i + 1}$
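A single merge step of Algorithm 2 can be sketched in Python in the same style. Again this is only a sketch under assumptions: `merge_approx` is a hypothetical name, positions are plain integers, $d(x, y)$ is taken to be $y - x$, and output items are tuples of positions. The removal of nearby identical k-mers and the multi-round bookkeeping are omitted.

```python
def merge_approx(list_i, list_j, d, delta):
    """Merge two sorted position lists as in Algorithm 2 (tolerance delta)."""
    merged = []
    a = b = 0
    while a < len(list_i) and b < len(list_j):
        x, y = list_i[a], list_j[b]
        gap = y - x                          # d(x, y), assumed to be y - x
        if gap < d - delta:
            merged.append((y,))
            b += 1
        elif gap > d + delta:
            merged.append((x,))
            a += 1
        else:
            # Look one element ahead: if it fits at least as well, skip y.
            has_next = b + 1 < len(list_j)
            if has_next and abs(list_j[b + 1] - x - d) <= abs(gap - d):
                merged.append((y,))
                b += 1
            else:
                merged.append((x, y))
                a += 1
                b += 1
    merged.extend((x,) for x in list_i[a:])
    merged.extend((y,) for y in list_j[b:])
    return merged
```

With d = 50 and $\delta$ = 10, the item 100 in the first list could be paired with either 145 or 152 in the second list; the look-ahead prefers 152 because its error (2) is smaller than that of 145 (5).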

2.1.2.5. Shifting

Note that the running time is proportional to the total length of the k-mer lists. To reduce the running time, for each of the n k-mers from the read, we look at the next m base pairs with sliding step s and choose the k-mer with the shortest list (in this subsection, m denotes this window length rather than the total list length). Since the read length may vary from hundreds to millions of base pairs, we adjust m and s according to the read length l and the k-mer number n. The shifting strides are given in Table 2.

Table 2.

The Shifting Stride for Different Read Lengths

Read length	m	s
$80 \leq l \leq 225$	2	1
$225 < l \leq 640$	3	1
$640 < l \leq 1280$	4	1
$1280 < l \leq 2560$	6	1
$2560 < l \leq 10240$	7	1
$10240 < l \leq 20480$	8	1
$l > 20480$	9	2
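The lookup in Table 2 and the choice of the k-mer with the shortest list can be sketched in Python as follows. This is a hedged illustration only: the names `STRIDES`, `shifting_stride`, and `pick_kmer` are hypothetical, `list_len` stands for any function that returns the number of genome occurrences of the k-mer starting at a given read offset, and reading "the next m base pairs with sliding step s" as the offsets start, start + s, ... below start + m is an assumption of this sketch.

```python
# (upper bound on read length, m, s); rows copied from Table 2.
STRIDES = [(225, 2, 1), (640, 3, 1), (1280, 4, 1), (2560, 6, 1),
           (10240, 7, 1), (20480, 8, 1)]


def shifting_stride(read_len):
    """Return (m, s) for a read of the given length, or None below 80."""
    if read_len < 80:
        return None                 # Table 2 starts at length 80
    for upper, m, s in STRIDES:
        if read_len <= upper:
            return m, s
    return 9, 2                     # reads longer than 20480


def pick_kmer(start, m, s, list_len):
    """Among the offsets start, start + s, ... below start + m, return the
    one whose k-mer has the shortest occurrence list."""
    return min(range(start, start + m, s), key=list_len)
```

Choosing the shortest list among a few neighbouring k-mers keeps the sampled k-mers approximately evenly spaced while reducing the total list length that the merge rounds have to process.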

2.1.2.6. Implementation of the algorithms

We have implemented the algorithm in C++. For each k-mer, the list containing all occurrences of the k-mer in the genome is stored in an array (instead of a linked list), and this is one of the keys to speeding up the program. It is worth emphasizing that accessing the list for a k-mer in a huge array/hash table is time-consuming, so we access the huge array/hash table only once for each k-mer on the read. When the final list is obtained after $\log n$ rounds of merging, we select the item in that list that corresponds to the longest subsequence. When the length of the LCSDC is less than 4, we drop the read and treat its location in the genome as not found.

We set the k-mer number $n = 128$ as the default value. However, there are some short reads in both the Nanopore and SMRT data sets (see Section 3.1), and we decrease the k-mer number n for them. We set n to 64 when the read length is between 225 and 640, and to 32 when the read length is between 80 and 225. We ignore reads with length $< 80$, since our methods are not suitable for them.

We use small Nanopore and SMRT data sets (see Section 3.1, data set 3 and data set 4) to test the tool for the $LCSDC_{\delta}$ model. The histograms of the LCSDC lengths are given in Figure 1. We can see that for the Nanopore and SMRT data sets, the average LCSDC lengths are 21.57 and 35.80, respectively.
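The choice of n and the even sampling of k-mers can be sketched in Python as follows. This is a sketch under assumptions: the names `kmer_count` and `sample_offsets` are hypothetical, the boundary cases at lengths 225 and 640 follow the intervals of Table 2, and the offsets are computed on the read by simple integer arithmetic, which may differ from what mapAlign does.

```python
def kmer_count(read_len):
    """Number n of sampled k-mers for a read; 0 means the read is ignored."""
    if read_len < 80:
        return 0
    if read_len <= 225:
        return 32
    if read_len <= 640:
        return 64
    return 128                      # default value


def sample_offsets(read_len, k, n):
    """Return n start offsets spread approximately evenly over the read,
    from offset 0 to the last possible k-mer start, read_len - k."""
    if n < 2:
        return [0] * n
    last = read_len - k
    return [i * last // (n - 1) for i in range(n)]
```

For example, a read of length 1016 with k = 16 and n = 5 yields the offsets 0, 250, 500, 750, and 1000, so consecutive sampled k-mers are d = 250 apart.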

FIG. 1.

The distributions of the LCSDC lengths for the data set 3 (Nanopore) and data set 4 (SMRT) with GRCh38 as the reference. LCSDC, longest common subsequence with distance constraint; SMRT, single molecule real time sequencing.

2.1.2.7. k-mer size versus the algorithm speed

The running time of our read mapping algorithm is $O(m \log n)$, where n is the number of k-mers selected from the read and m is the total length of the n lists for the k-mers. The running time heavily depends on the value of k used when indexing the reference genome. Let $l(k)$ be the expected length of the list for a k-mer. The expected total length of the n lists is $m(k) = n \times l(k)$. The expected length of the list for a (k + 1)-mer is $l(k + 1) = l(k) / 4$, since the list of a k-mer is decomposed into 4 lists when one more letter is added. Thus, our algorithm runs faster as the value of k increases. On the other hand, increasing the value of k decreases the chance of finding a matched pair of k-mers.

To illustrate the speed of our algorithm for different k values, we did experiments on a Nanopore data set and an SMRT data set (see Section 3.1, data set 3 and data set 4). The results are given in Table 3. We can see that when k increases, the running time decreases, while the average LCSDC length decreases slightly for data set 3 and stays roughly the same for data set 4. For $k > 16$, a more efficient way of handling the hash table is needed; this will be implemented later.

Table 3.

The Comparison of Our Algorithm for Data Set 3 and Data Set 4 with Different k-Values of k-mer (GRCh38 As the Reference)

Data set 3	k = 14	k = 15	k = 16
Average length	24.38	22.69	21.57
Alignment score	89.42%	89.61%	89.70%
Identity score	75.41%	75.66%	75.80%
Time	262 seconds	189 seconds	128 seconds
Failure cases	1.55%	1.97%	2.39%
Peak memory	19.6	22.1	25.7
Data set 4	k = 14	k = 15	k = 16
Average length	35.32	35.92	35.80
Alignment score	86.34%	86.42%	86.53%
Identity score	81.10%	81.23%	81.30%
Time	185 seconds	141 seconds	105 seconds
Failure cases	2.00%	2.29%	2.64%
Peak memory	19.0	20.7	25.2

2.1.2.8. $δ$ value versus the LCSDC length

The $\delta$ value of $LCSDC_{\delta}$ controls the distance tolerance when merging two lists in Algorithm 2. In the implementation, we use a proportion factor r to calculate $\delta$ from the distance d between the two lists to be merged. That is, $\delta = d \times r$ ($0 \leq r < 1$). The smaller $\delta$ is, the stricter the merge condition. Thus, the proportion factor r heavily influences the final LCSDC length and has to be adjusted carefully. To illustrate the influence of r on the LCSDC length, we also did experiments on data set 3 and data set 4 (see Section 3.1). The results are given in Table 4. We can see that when r increases, the average LCSDC length also increases. This is because a larger r tolerates more insertions and deletions than a smaller one.

Table 4.

The Longest Common Subsequence with Distance Constraint Length of Our Algorithm for Data Set 3 and Data Set 4 with Different r-Values (GRCh38 As the Reference and k-mer Size = 15)

Data set 3	r = 1/10	r = 1/5	r = 1/3
Average length	17.91	22.23	22.69
Failure cases	4.76%	2.42%	1.97%
Data set 4	r = 1/10	r = 1/5	r = 1/3
Average length	31.24	35.09	35.92
Failure cases	5.87%	3.11%	2.29%

2.2. Algorithms for local alignment

Once the location of the read on the reference genome is found, we need to align the read with the reference genome. Of course, a dynamic programming algorithm with quadratic running time gives the optimal solution. However, that algorithm is too slow to handle large data sets such as Nanopore or SMRT data for human individuals. Here we propose a heuristic algorithm for local alignment whose running time is proportional to the length of the sequences. Moreover, in our experiments, our method is more accurate than the existing heuristic methods.

We treat each DNA sequence (over 4 letters) as a sequence of k-mers. When the DNA sequence is long, we use large k-mers to find an alignment of the k-mers between the two sequences. Such an alignment of k-mers decomposes the whole sequence region into many smaller regions, and for each pair of smaller regions we can recursively use smaller k-mers to do the alignment. When the pair of regions is small enough, for example, shorter than 15 letters, we use the dynamic programming algorithm to compute the alignment. We refer to our algorithm as the recursive variable length k-mer alignment algorithm.

Recall that our program has found an LCSDC for k-mers with $k = 16$ for read mapping. Starting with a matched k-mer, we try to extend the matching to the left and right. The LCSDC for k-mers with $k = 16$ decomposes the whole read into many smaller regions, whose lengths range from a few hundred to a few thousand letters. Each time, we look at a pair of segments with 256 DNA letters from both sequences. We then view the two segments as sequences of k-mers with $k = 9$ and use the heuristic Algorithm 3 to align the two sequences. The right ends of the two sequences stop at the positions where the last pair of matched k-mers with $k = 9$ ends. Our algorithm can still handle large indels (longer than 256 letters), since the extension from the other direction can handle them.

2.2.1. The heuristic algorithm for alignment of k-mer sequences

Let $L = L_{1} L_{2} \dots L_{n}$ be a sequence of k-mers generated from the read. Each L_i is a list of positions on the reference genome that the k-mer appears. We use $| L_{i} |$ to represent the length of the list. Let p₀ and l₀ be integers representing the position in the reference genome and the location in the read, where the previous pair of k-mers matches. Our algorithm (Algorithm 3) looks at each L_i for $i = 1, 2, \dots, n$ . If the list L_i is empty, that is, $| L_{i} | = 0$ then the corresponding k-mer does not have a match on the reference genome. If $L_{i} | > 3$ , then there are too many occurrences of the k-mer and a repeated region is found. In this case, we do not try to identify the match. When $| L_{i} | > 0$ and $| L_{i} | \leq 3$ , our algorithm will try to find a match that satisfies the condition $| (l - l_{0}) - (p - p_{0}) | \leq \frac{min {l - l_{0}, p - p_{0}}}{2}$ . Such a condition is used to ensure that the lengths of the newly created pair of regions (by determining the pair of matched k-mer) on the reference genome and the read are roughly the same. Once a match of a pair of k-mers is found, the algorithm tries to extend the length of the match by looking at the next pair of DNA letters whenever possible.

It is easy to see that the running time of Algorithm 3 is linear in terms of the input sequences and is much simpler and faster than the heuristic algorithm for $L C S D C_{δ}$ . The key point here is that once a pair of k-mers is matched, the region is decomposed into two parts of smaller sizes. We use a conservative strategy to do k-mer match since missing some k-mer matches does not affect the final alignment as long as the matched pairs are correct.

Next, we go to the last pair of matched k-mers (with $k = 9$), take the next pair of segments of 256 DNA letters, and repeat the process. The process stops when we reach the next matched k-mer with $k = 16$ (obtained in the read mapping process) or when no pair of k-mers is matched for $k = 9$.

If Algorithm 3 returns $hit = 0$, we reduce k by 1 and repeat the process. The process is run at most 6 times, as k goes down from 9 to 4.
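The retry with decreasing k can be sketched as a small driver loop. This is only a sketch under stated assumptions: the name `alignWithDecreasingK` and the callback `alignRound` are hypothetical, and the callback stands in for one run of Algorithm 3 at a given k, returning true when hit = 1.

```cpp
#include <cassert>
#include <functional>

// Hypothetical driver (not taken from mapAlign): tries k = kStart,
// kStart - 1, ..., kMin and stops at the first k for which the supplied
// round reports a hit. Returns that k, or 0 if no k in the range succeeds.
int alignWithDecreasingK(const std::function<bool(int)>& alignRound,
                         int kStart = 9, int kMin = 4) {
    assert(kStart >= kMin && kMin > 0);
    for (int k = kStart; k >= kMin; --k) {
        if (alignRound(k)) {
            return k;  // at least one pair of k-mers matched at this k
        }
    }
    return 0;  // no hit for any k in [kMin, kStart]
}
```

With the default arguments the callback is invoked at most 6 times, for k = 9, 8, 7, 6, 5, and 4.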

Algorithm 3. A Heuristic Algorithm for the Alignment of Two k-mer Sequences
Input: a sequence $L = L_{1} L_{2} \dots L_{n}$ and integers $p_{0}$ and $l_{0}$.
hit = 0
for i = 1 to n do
    if $| L_{i} | > 0$ and $| L_{i} | \leq 3$ then
        for each position p in $L_{i}$ do
            let l be the location of the i-th k-mer on the read
            if $| (l - l_{0}) - (p - p_{0}) | \leq \frac{\min \{ l - l_{0}, p - p_{0} \}}{2}$ then
                match the current pair of k-mers and set hit = 1
                while the next pair of letters is identical do
                    extend the match
                end while
            end if
        end for
    end if
end for
return hit
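The pseudocode of Algorithm 3 can be made concrete as follows. This is a minimal sketch, not the mapAlign source: the names `Match`, `lengthsConsistent`, and `alignKmerSequences` are hypothetical, and three details that the pseudocode leaves open are filled in as assumptions (the anchor $(l_{0}, p_{0})$ moves to the end of each new match, only the first acceptable position in a list is taken, and k-mers covered by an extended match are skipped).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// One gap-free matched stretch: read[readPos, readPos + length) equals
// ref[refPos, refPos + length).
struct Match {
    long readPos;
    long refPos;
    long length;
};

// Length condition |(l - l0) - (p - p0)| <= min{l - l0, p - p0} / 2,
// evaluated in integer arithmetic (both sides multiplied by 2).
bool lengthsConsistent(long l, long p, long l0, long p0) {
    const long dl = l - l0;
    const long dp = p - p0;
    return 2 * std::labs(dl - dp) <= std::min(dl, dp);
}

// lists[i] holds the reference positions of the k-mer that starts at read
// position firstReadPos + i. Returns true iff some pair was matched (hit = 1).
bool alignKmerSequences(const std::string& read, const std::string& ref,
                        const std::vector<std::vector<long>>& lists,
                        long firstReadPos, long k, long l0, long p0,
                        std::vector<Match>& matches) {
    assert(k > 0 && firstReadPos >= 0);
    const long readLen = static_cast<long>(read.size());
    const long refLen = static_cast<long>(ref.size());
    const long n = static_cast<long>(lists.size());
    bool hit = false;
    for (long i = 0; i < n; ++i) {
        const std::vector<long>& Li = lists[i];
        if (Li.empty() || Li.size() > 3) continue;  // no match, or a repeat
        const long l = firstReadPos + i;
        for (long p : Li) {
            if (!lengthsConsistent(l, p, l0, p0)) continue;
            long len = k;
            // Extend to the right while the next pair of letters is identical.
            while (l + len < readLen && p + len < refLen &&
                   read[l + len] == ref[p + len]) {
                ++len;
            }
            matches.push_back({l, p, len});
            hit = true;
            l0 = l + len;  // assumption: anchor moves to the end of the match
            p0 = p + len;
            i += len - 1;  // assumption: skip k-mers covered by the match
            break;         // assumption: take the first acceptable position
        }
    }
    return hit;
}
```

Testing each of at most three positions per list keeps the work per k-mer constant, which is consistent with the linear running time stated above.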

3. Results

In this section, we compare mapAlign with different existing methods in terms of the running time and alignment quality. We use some public data sets to illustrate.

3.1. Data sets

Two large human data sets (data set 1 and data set 2) are used for the full comparison. We also extracted two downsized data sets (data set 5 and data set 6) of different human individuals from the National Center for Biotechnology Information for further comparison. Two small human data sets (data set 3 and data set 4) were generated for quick testing. A detailed description of these data sets is given in Table 5. Human genome GRCh38 is used as the reference genome. Experiments were conducted on a cluster with 96 CPUs (Intel(R) Xeon(R) CPU E7-8860 v3 @ 2.20GHz) running the Ubuntu 16.04 operating system.

Table 5.

The Detailed Description of the Data Sets Used for Experiments

Index	Technology	Read number	Description
1	Nanopore	15,666,888	Downloaded at http://s3.amazonaws.com/nanopore-human-wgs/rel6/rel_6.fastq.gz (NA12878; Jain et al., 2018)
2	SMRT	22,565,609	Downloaded at http://bit.ly/chm1p5c3 (Ono et al., 2013)
3	Nanopore	10,000	Randomly selected from data set 1
4	SMRT	12,000	Randomly selected from Ashkenazim trio-Child NA24385
5	Nanopore	100,000	The first 100,000 reads of ERR2631601 for NA19240
6	SMRT	100,000	The first 100,000 reads of SRR7615963 for HG00733

SMRT, single molecule real time sequencing.

3.2. Alignment quality comparison with Basic Local Alignment Search Tool

We define two different scores (alignment score and identity score) to evaluate the quality of the generated alignments. The alignment score is defined as the number of identically matched pairs of letters over the total length of the read. The identity score is defined as the number of identically matched pairs of letters over the total length of the alignment. The alignment score quantifies the percentage of matched read letters, while the identity score gives the percentage of identically matched columns in the alignment. If we align a read to itself, both the alignment score and the identity score are 100%. A reasonable alignment approach should try to achieve high values for both scores; using only one of them may be biased. For example, all letters of a read with 200 base pairs may be perfectly matched to a long segment of 2000 base pairs in the reference genome. Such an alignment has a high alignment score and a low identity score, and should be considered very poor. Conversely, a read with 2000 letters may have 200 consecutive letters that are identical to a segment of 200 letters in the reference genome. Then we have a local alignment containing 200 identically matched pairs. The identity score is 100% in this case, while the alignment score is 10%, that is, only 10% of the letters in the read are matched.
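The two definitions can be sketched in code as follows. The representation is an assumption made for illustration: the alignment is given as two rows of equal length with '-' marking a space, and the names `AlignmentScores` and `scoreAlignment` are hypothetical rather than part of mapAlign.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct AlignmentScores {
    double alignmentScore;  // identical columns / total read length
    double identityScore;   // identical columns / alignment length
};

// readRow and refRow are the two rows of a pairwise alignment ('-' = space).
// readLength is the full length of the read, including any letters that are
// left outside the local alignment.
AlignmentScores scoreAlignment(const std::string& readRow,
                               const std::string& refRow,
                               std::size_t readLength) {
    assert(readRow.size() == refRow.size());
    assert(readLength > 0 && !readRow.empty());
    std::size_t identical = 0;
    for (std::size_t c = 0; c < readRow.size(); ++c) {
        // A column counts only if both rows carry the same letter.
        if (readRow[c] != '-' && readRow[c] == refRow[c]) ++identical;
    }
    return {static_cast<double>(identical) / readLength,
            static_cast<double>(identical) / readRow.size()};
}
```

For the second example in the text (a 2000-letter read with a 200-column gap-free local alignment), this returns an alignment score of 0.1 and an identity score of 1.0.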

The Basic Local Alignment Search Tool (BLAST; Altschul et al., 1990) is a well-known tool for local alignment. To evaluate the performance of our method, we did a preliminary comparison with BLAST (version 2.9.0). To accelerate the experiments, we used a small human data set (data set 3) against the human reference GRCh38. The alignment score and identity score of mapAlign are 89.70% and 75.80%, while those of BLAST are 70.40% and 77.77%, respectively. Our method has a slightly lower identity score than BLAST but a better alignment score: 89.70% of the read letters are perfectly matched to the reference genome by our method, whereas BLAST aligns only 70.40% of the read letters. We carefully compared the details of the alignments generated by our method with those of BLAST and found that our method introduces slightly more spaces into the alignment to reach a better alignment score. An extra $89.70\% - 70.40\% = 19.30\%$ of the read letters are aligned to the reference genome by our method, so the slightly lower identity score is a reasonable trade-off. BLAST often excludes the two ends of the read from the resulting alignment and thus is not good at aligning the read ends. Our method is over 30 times faster than BLAST, so we do not give a detailed comparison of running time.

3.3. Comparison with other methods

According to Li (2017), Minimap2 is over 30 times faster than GraphMap. Minimap2 also has better alignment quality than GraphMap. Here we compare mapAlign with Minimap2 using two large-size data sets, that is, data set 1 and data set 2. The results are shown in Table 6.

Table 6.

Comparison Between mapAlign and Minimap2 for Data Set 1 and Data Set 2 with GRCh38 As the Reference

Data set 1	mapAlign	Minimap2
Alignment score	91.53%	88.44%
Identity score	83.45%	81.71%
Time	141,559 seconds	266,375 seconds
Failure cases	17.72%	16.83%
Peak memory	27.1	13.5
Data set 2	mapAlign	Minimap2
Alignment score	85.36%	79.08%
Identity score	79.12%	74.17%
Time	247,005 seconds	390,029 seconds
Failure cases	6.26%	4.38%
Peak memory	26.2	9.2

From Table 6, we can see that mapAlign is faster than Minimap2 for both data set 1 and data set 2. Our method uses about 53% and 63% of the CPU time of Minimap2 for data set 1 and data set 2, respectively. Moreover, the alignment quality of our method (in terms of both the identity and alignment scores) is better than that of Minimap2 for both data sets. Our failure rate is slightly higher than that of Minimap2. That is because the failed reads are not similar to the reference segments at all and have very low alignment quality in terms of both the alignment score and the identity score. We use more memory than Minimap2 because we use a full reference k-mer index, whereas Minimap2 builds its reference k-mer index from minimizers. We also compare the two methods on data sets from two different human individuals, data set 5 and data set 6. The results are in Table 7. The alignment scores and identity scores of mapAlign and Minimap2 for data set 5 and data set 6 are given in Figures 2 and 3, respectively.
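The running-time ratios quoted for Table 6 follow directly from the reported times:

$$\frac{141{,}559}{266{,}375} \approx 0.531, \qquad \frac{247{,}005}{390{,}029} \approx 0.633.$$

Applying the same computation to the times in Table 7 gives

$$\frac{1623}{2271} \approx 0.715, \qquad \frac{2522}{3220} \approx 0.783$$

for data set 5 and data set 6, respectively.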

FIG. 2.

The comparisons of alignment score and identity score between mapAlign and Minimap2 for data set 5 with GRCh38 as the reference.

FIG. 3.

The comparisons of alignment score and identity score between mapAlign and Minimap2 for data set 6 with GRCh38 as the reference.

Table 7.

Comparison Between mapAlign and Minimap2 for Data Set 5 and Data Set 6 with GRCh38 As the Reference

Data set 5	mapAlign	Minimap2
Alignment score	90.14%	86.09%
Identity score	81.19%	78.44%
Time	1623 seconds	2271 seconds
Failure cases	12.88%	11.89%
Peak memory	25.2	11.2
Data set 6	mapAlign	Minimap2
Alignment score	87.50%	82.83%
Identity score	79.68%	76.34%
Time	2522 seconds	3220 seconds
Failure cases	9.76%	8.55%
Peak memory	25.5	10.9

4. Discussion

In this study, we presented fast and accurate algorithms for mapping and aligning long reads and implemented them in a software tool, mapAlign. For read mapping, we proposed the LCSDC problem and its approximate version, the $\mathrm{LCSDC}_{\delta}$ problem, and we model read mapping as the $\mathrm{LCSDC}_{\delta}$ problem. We proved that there is an $O(m)$ expected running time algorithm for the LCSDC problem and proposed an $O(m \log n)$ running time heuristic algorithm for $\mathrm{LCSDC}_{\delta}$. For aligning the read against the reference segment, we proposed a k-mer-based local alignment approach with a variable value of k. With these methods, we achieved better running time and alignment quality than the state-of-the-art method on the large human data sets tested. Our experiments show that our methods are well suited to the long noisy reads of Oxford Nanopore and PacBio SMRT.

Our $\mathrm{LCSDC}_{\delta}$ model for read mapping is the key to the speed of our method. The state-of-the-art method (Minimap2) uses minimizers and sparse dynamic programming to solve the read mapping problem. A minimizer is the smallest representative of n consecutive k-mers, where n is the sliding window size and $n = 10$ is a typical setting. For a read of length 10,000, for example, Minimap2 has to handle nearly 1,000 k-mers and their corresponding lists in the sparse dynamic programming. In contrast, with our $\mathrm{LCSDC}_{\delta}$ model we can use fewer sampled k-mers than Minimap2 (128 is a typical value for most reads). The $\mathrm{LCSDC}_{\delta}$ model still generates accurate mapping results because it considers the distances between k-mers on both the read and the reference. Therefore, we need less time than Minimap2 to solve the read mapping problem. Minimap2 could increase the window size n to decrease its running time, but the accuracy of the mapping results decreases as the window size increases.
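The notion of a minimizer used above can be illustrated with a naive sketch. This is not Minimap2's implementation: the function name `minimizerPositions` is hypothetical, and k-mers are ordered lexicographically here purely for simplicity, whereas practical tools commonly order them by a hash value.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// For every window of n consecutive k-mers of seq, selects the start
// position of the lexicographically smallest k-mer (leftmost on ties).
// Each selected position is reported once.
std::set<std::size_t> minimizerPositions(const std::string& seq,
                                         std::size_t k, std::size_t n) {
    assert(k > 0 && n > 0);
    std::set<std::size_t> positions;
    if (seq.size() < k + n - 1) return positions;  // no complete window
    const std::size_t numKmers = seq.size() - k + 1;
    for (std::size_t w = 0; w + n <= numKmers; ++w) {
        std::size_t best = w;
        for (std::size_t j = w + 1; j < w + n; ++j) {
            // Compare the k-mer at j with the current best k-mer.
            if (seq.compare(j, k, seq, best, k) < 0) best = j;
        }
        positions.insert(best);
    }
    return positions;
}
```

Because neighboring windows often share the same smallest k-mer, far fewer positions are selected than there are k-mers, and a larger n selects fewer still.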

There are several reasons why we get better alignment quality. (1) After we get the matched pairs with a larger k-mer size (e.g., 16 or 15) from our $\mathrm{LCSDC}_{\delta}$ model, we use a smaller k-mer size (e.g., 9 or 7) to extend these matches. We use small sliding windows (e.g., 256 or 128) to extend the read and reference segments. Thus, the corresponding k-mer lists are very short (most of them have length 1 or 2) and we can handle them very efficiently. (2) Our extending approach with a variable value of k efficiently finds the "correctly" matched smaller k-mers by considering the relative "slope" between the current k-mer and the previously matched k-mer. Thus, we get many matched k-mers of different sizes (e.g., 16, 9, 8, 7, 6, 5, or 4). (3) The unmatched segments between these matched k-mers are often very small (most of them have length ≤15), so we can handle them with exact dynamic programming, which is efficient on such short segments. On the other hand, the mapping condition of our algorithm is quite strict, and many cases may be dropped if the two segments from the read and the reference are not similar. That is why we have slightly more failure cases than Minimap2. A few failure cases can consume a lot of computational time in our algorithm: when the two sequences are not similar, the decomposed regions are large and the quadratic algorithm becomes very slow.
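Reason (3) relies on exact dynamic programming over short unmatched segments. The sketch below shows the standard quadratic recurrence on unit-cost edit distance; the scoring scheme actually used by mapAlign is not specified here, so the unit costs and the name `editDistance` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Unit-cost edit distance between two short segments, computed by the
// standard quadratic dynamic programming with only two rows of the table.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // Match or substitution, deletion from a, insertion into a.
            const std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For two segments of at most 15 letters the table has at most 16 × 16 cells, which is why this step is cheap; the same recurrence becomes slow on the large regions that arise in failure cases.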

We believe our method can be accelerated further. For example, we could adopt the SIMD technique used in Minimap2. For read mapping, we use a merge process to get the LCSDC, which involves many sequential operations. Currently, the read mapping stage occupies a large proportion of the total running time, so our future work will focus on accelerating read mapping. Our merge process is similar to sorted-set intersection in the database area, where much work has been done; for example, Schlegel et al. (2011) proposed an efficient approach to sorted-set intersection using SIMD instructions. This gives us many new clues for future work.

Footnotes

Author Disclosure Statement

The authors declare they have no competing financial interests.

Funding Information

This work is supported by a grant from the National Science Foundation of China (NSFC 61972329), and a GRF grant for the Hong Kong Special Administrative Region, China (CityU 11210119).

References

Ahmed, Lévy, Ren, et al. 2019. GASAL2: A GPU accelerated sequence alignment library for high-throughput NGS data. BMC Bioinformatics, 20, 520.

Altschul S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.

Benkrid, Akoglu, Ling, et al. 2012. High performance biological pairwise sequence alignment: FPGA versus GPU versus cell BE versus GPP. Int. J. Reconfig. Comput. 2012, 15.

Burrows, and Wheeler D.J. 1994. A block-sorting lossless data compression algorithm. Digit. Equip. Corp. 124, 18.

Campagna, Albiero, Bilardi, et al. 2009. PASS: A program to align short sequences. Bioinformatics, 25, 967–968.

Chaisson M.J., and Tesler 2012. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): Application and theory. BMC Bioinformatics, 13, 238.

Daily 2016. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics, 13, 81.

Farrar M. 2007. Striped Smith–Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23, 156–161.

Galinsky V.L. 2012. YOABS: Yet other aligner of biological sequences—An efficient linearly scaling nucleotide aligner. Bioinformatics, 28, 1070–1077.

Hach, Hormozdiari, Alkan, et al. 2010. mrsFAST: A cache-oblivious algorithm for short-read mapping. Nat. Methods, 7, 576–577.

Hou, Wang, and Feng 2016. Aalign: A SIMD framework for pairwise sequence alignment on x86-based multi-and many-core processors, 780–789. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Chicago, IL, IEEE, May 23–27.

Jain, Dilthey, Koren, et al. 2017. A fast approximate algorithm for mapping long reads to large reference databases, 66–81. 21st Annual International Conference on Research in Computational Molecular Biology, Springer.

Jain, Koren, Miga K.H., et al. 2018. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat. Biotechnol. 36, 338–345.

Jiang, and Wong W.H. 2008. SeqMap: Mapping massive amount of oligonucleotides to the genome. Bioinformatics, 24, 2395–2396.

Kent W.J. 2002. BLAT–the BLAST-like alignment tool. Genome Res. 12, 656–664.

Klus, Lam, Lyberg, et al. 2012. BarraCUDA—A fast short read sequence aligner using graphics processing units. BMC Res. Notes, 5, 27.

Langmead, Trapnell, Pop, et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.

Li 2016. Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110.

Li 2017. Minimap2: Fast pairwise alignment for long nucleotide sequences. Bioinformatics, 34, 3094–3100.

Li, and Durbin 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

Li, and Durbin 2010. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26, 589–595.

Li, Ranka, and Sahni 2014. Pairwise sequence alignment for very long sequences on GPUs. Int. J. Bioinform. Res. Appl. 2, 10, 345–368.

Li, Li, Kristiansen, et al. 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics, 24, 713–714.

Liu C.M., Wong, Wu, et al. 2012a. SOAP3: Ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics 28, 878–879.

Liu, Schmidt, and Maskell D.L. 2012b. CUSHAW: A CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics 28, 1830–1837.

Ma, Tromp, and Li 2002. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18, 440–445.

Needleman S.B., and Wunsch C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453.

Ning, Cox A.J., and Mullikin J.C. 2001. SSAHA: A fast search method for large DNA databases. Genome Res. 11, 1725–1729.

Ono, Asai, and Hamada 2013. PBSIM: pacBio reads simulator—toward accurate genome assembly. Bioinformatics, 29, 119–121.

Rumble S.M., Lacroute, Dalca A.V., et al. 2009. SHRiMP: Accurate mapping of short color-space reads. PLoS Comput. Biol. 5, e1000386.

Schlegel, Willhalm, and Lehner 2011. Fast sorted-set intersection using SIMD instructions. ADMS@ VLDB, 1, 8.

Smith T.F., and Waterman M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.

Sović, Šikić, Wilm, et al. 2016. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat. Commun. 7, 1–11.

Striemer G.M., and Akoglu 2009. Sequence alignment with GPU: Performance and design challenges, 1–10. 2009 IEEE International Symposium on Parallel and Distributed Processing, IEEE.

Sundfeld, Teodoro, Havgaard J.H., et al. 2020. Using GPU to accelerate the pairwise structural RNA alignment with base pair probabilities. Concurrency Comput. Pract. Exp. 32, e5468.

Suzuki, and Kasahara 2018. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics, 19, 33–47.

Weese, Emde A.K., Rausch, et al. 2009. RazerS–fast read mapping with sensitivity control. Genome Res, 19, 1646–1654.

Zhao, Lee W.P., Garrison E.P., et al. 2013. SSW library: An SIMD Smith-Waterman C/C++ library for use in genomic applications. PLoS One, 8, e82138.