On the Hardness of Sequence Alignment on De Bruijn Graphs

Abstract

The problem of aligning a sequence to a walk in a labeled graph is of fundamental importance to Computational Biology. For an arbitrary graph $G = (V, E)$ and a pattern P of length m, a lower bound based on the Strong Exponential Time Hypothesis implies that an algorithm for finding a walk in G exactly matching P significantly faster than $O (| E | m)$ time is unlikely. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is solvable in $O (| E | m)$ time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets. Moreover, NP-completeness continues to hold even when edits are restricted to only substitutions. Despite the popularity of the de Bruijn graphs in Computational Biology, the complexity of approximate pattern matching on the de Bruijn graphs remained unknown. We investigate this problem and show that the properties that make the de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. Specifically, we prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. We also demonstrate that an algorithm significantly faster than $O (| E | m)$ is unlikely for the de Bruijn graphs in the case where substitutions are only allowed to the pattern. This stands in contrast to pattern-to-text matching, where exact matching is solvable in linear time, as on the de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $\tilde{O} (n \sqrt{m})$ time, where n is the text's length.

1. INTRODUCTION

De Bruijn graphs play an essential role in Computational Biology. Their application to de novo assembly dates back to the 1980s (Pevzner, 1989) and has been the topic of extensive research since then (Chikhi et al, 2015; Chikhi and Rizk, 2013; Georganas et al, 2014; Lin et al, 2016; Peng et al, 2010, 2013; Ren et al, 2012; Zerbino and Birney, 2008). More recently, de Bruijn graphs have been applied in metagenomics and in the representation of large collections of genomes (Flick et al, 2017; Kamal et al, 2017; Li et al, 2015; Pell et al, 2012; Ye and Tang, 2016) and for solving other problems such as read error correction (Limasset et al, 2020; Morisse et al, 2018) and compression (Benoit et al, 2015; Holley et al, 2018).

This popularity of the de Bruijn graphs for the modeling of sequencing data makes having efficient algorithms to find walks in a de Bruijn graph matching (or approximately matching) a given query pattern important to numerous applications. For example, in metagenomics, such an algorithm can be used to quickly detect the presence of a particular species within genetic material obtained from an environmental sample. Or, in the case of read error correction, such an algorithm can be used to efficiently find the best mapping of reads onto a “cleaned” reference de Bruijn graph with low-frequency k-mers removed (Limasset et al, 2020). To facilitate such tasks, several algorithms and software tools that perform pattern matching on the de Bruijn (and sometimes general) graphs have been developed (Almodaresi et al, 2018; Heydari et al, 2018; Holley and Peterlongo, 2012; Kavya et al, 2019; Limasset et al, 2016; Liu et al, 2016; Navarro, 2000; Rautiainen and Marschall, 2017). These are often based on seed-and-extend heuristics.

With respect to theory, there has been a recent surge of interest in pattern matching on labeled graphs. This has led to many new fascinating algorithmic and computational complexity results. However, even with this improved understanding of the theory of pattern matching on labeled graphs, our knowledge is still lacking in many respects concerning specific, yet extremely relevant, classes of graphs, such as the de Bruijn graphs. An overview of the current state of knowledge is provided in Table 1.

Table 1.

The Computational Complexity of Pattern Matching on Labeled Graphs

	Exact matching	Approximate matching
Easy	Solvable in linear time: Wheeler graphs (Gagie et al, 2017), e.g., de Bruijn graphs and NFAs for multiple strings	Solvable in $O (| E | m)$ time: DAGs, substitutions/edits to graph (Kavya et al, 2019); general graphs, substitutions/edits to pattern (Amir et al, 2000); de Bruijn graphs, substitutions to pattern, with no strongly sub-$O (| E | m)$ algorithm (this study)
Hard	No strongly sub-$O (| E | m)$ algorithm: general graphs (Equi et al, 2019; Gibney et al, 2021), including DAGs with total degree $\leq 3$	NP-complete: general graphs, substitutions/edits to graph (Amir et al, 2000; Jain et al, 2019); de Bruijn graphs, substitutions to vertex labels (this study)

DAGs, directed acyclic graphs; NFA, nondeterministic finite automaton.

For general graphs, we can consider exact and approximate matching. For exact matching, conditional lower bounds based on the Strong Exponential Time Hypothesis (SETH), and other conjectures in circuit complexity, indicate that an $O (| E | m^{1 - ε} + | E |^{1 - ε} m)$ time algorithm with any constant $ε > 0$ , for a graph with $| E |$ edges and a pattern of length m, is highly unlikely (as is the ability to shave more than a constant number of logarithmic factors from the $O (| E | m)$ time complexity) (Equi et al, 2019; Gibney et al, 2021). These results hold for even very restricted types of graphs, for example, directed acyclic graphs (DAGs) with maximum total degree three and binary alphabets. For approximate matching, when edits are only allowed in the pattern, the problem is solvable in $O (| E | m)$ time (Amir et al, 2000). If edits are also permitted in the graph, but the graph is a DAG, matching can be done in the same time complexity (Kavya et al, 2019).

However, the problem becomes NP-complete when edits are allowed in arbitrary cyclic graphs. This was originally proven in Amir et al (2000) for large alphabets and more recently proven for binary alphabets in Jain et al (2019). These results hold even when edits are restricted to only substitutions. The distinction between modifications to the graph and modifications to the pattern is important as these two problems are fundamentally different. When changes are made to cyclic graphs, the same modification can be encountered multiple times while matching a pattern with no additional cost [see section 3.1 in Jain et al (2019) for a detailed discussion]. Furthermore, algorithmic solutions appearing in the studies by Kavya et al (2019), Navarro (2000), and Rautiainen and Marschall (2017) are for the case where modifications are performed only to the pattern.

The de Bruijn graphs are an interesting class of graphs from a theoretical perspective. They fall within a more general class of graphs that allow for the extension of the Burrows–Wheeler transform-based techniques that enable efficient pattern matching. Sufficient conditions for doing this are captured by the definition of Wheeler graphs, introduced in the study of Gagie et al (2017) and further studied in Alanko et al (2019, 2020), Egidi et al (2020), Gagie (2021), and Gibney and Thankachan (2019). The de Bruijn graphs are themselves Wheeler graphs, which in turn implies that exact pattern matching is solvable in linear time on a de Bruijn graph. However, the complexity of approximate matching in the de Bruijn graphs when permitting modifications to the graph or modifications to the pattern remained open (Jain et al, 2019).

We make two important contributions, which are indicated in Table 1. First, we prove that for the de Bruijn graphs, despite exact matching being solvable in linear time, the approximate matching problem with vertex label substitutions is NP-complete. Second, we prove that a strongly subquadratic time algorithm for the approximate pattern matching problem on the de Bruijn graphs, where substitutions are only allowed to the pattern, is not possible under the SETH. This second result establishes the optimality of the known quadratic time algorithms up to subpolynomial factors. Note that, in contrast, pattern-to-text matching (under substitutions) can be solved in subquadratic $\tilde{O} (n \sqrt{m})$ time, where n is the text's length (Abrahamson, 1987). To the best of our knowledge, these are the first such results for any type of Wheeler graph.

1.1. Technical preliminaries

1.1.1. Notation for edges

For a directed edge from a vertex u to a vertex v, we will use the notation $(u, v)$ . Additionally, we will refer to u as the tail of $(u, v)$ , and v as the head of $(u, v)$ .

1.1.2. Walks versus paths

A distinction must be made between the concept of a walk and a path in a graph. A walk is a sequence of vertices $v_{1}, v_{2}, \dots, v_{t}$ such that for each $i \in [1, t - 1]$, $(v_{i}, v_{i + 1}) \in E$. Vertices can be repeated in a walk. A path is a walk where vertices are not repeated. The length of a walk is defined as the number of edges in the walk, $t - 1$, or equivalently one less than the number of vertices in the sequence (counted with multiplicity). This work concerns the existence of walks, not paths.
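The distinction can be made concrete with a small sketch. The function names and the representation of $E$ as a set of (tail, head) pairs are assumptions made for illustration only.

```python
def is_walk(seq, edges):
    """True if every consecutive pair of vertices in seq is an edge."""
    return all((a, b) in edges for a, b in zip(seq, seq[1:]))


def is_path(seq, edges):
    """True if seq is a walk in which no vertex is repeated."""
    return is_walk(seq, edges) and len(set(seq)) == len(seq)


def walk_length(seq):
    """Number of edges traversed: one less than the number of vertices."""
    return len(seq) - 1


# A directed 3-cycle: going around it twice is a walk but not a path.
edges = {(1, 2), (2, 3), (3, 1)}
twice_around = [1, 2, 3, 1, 2, 3]
print(is_walk(twice_around, edges), is_path(twice_around, edges))
```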

1.1.3. Induced subgraphs

An induced subgraph of a graph $G = (V, E)$ consists of a subset of vertices $V' \subseteq V$ , and all edges $(u, v) \in E$ such that $u, v \in V'$ . This is in contrast to an arbitrary subgraph of G, where an edge can be omitted from the subgraph, even if both of its incident vertices are included.

1.1.4. De Bruijn graphs

An order-k full de Bruijn graph is a compact representation of all k-mers (strings of length k) from an alphabet $Σ$ of size $σ$. It consists of $σ^{k}$ vertices, each corresponding to a unique k-mer (which we call its implicit label) in $Σ^{k}$. There is a directed edge from each vertex with implicit label $s_{1} s_{2} \dots s_{k} \in Σ^{k}$ to the $σ$ vertices with implicit labels $s_{2} s_{3} \dots s_{k} α$, $α \in Σ$. We will work with induced subgraphs of the full de Bruijn graphs in this article. We assign to every vertex v a label $L (v) \in Σ$, such that the implicit label of v is $L (u_{1}) L (u_{2}) \dots L (u_{k - 1}) L (v)$, where $u_{1}, u_{2}, \dots, u_{k - 1}, v$ is any length $k - 1$ walk ending at v. This is equivalent to the notion of a de Bruijn graph constructed from k-mers commonly used in Computational Biology.
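As an illustration of this definition, the following minimal sketch builds the full order-k de Bruijn graph and an induced subgraph on a chosen set of k-mers. The function names and the use of k-mer strings as vertex identifiers are assumptions for the example; the single-symbol label $L(v)$ is taken to be the last symbol of the implicit label, as the definition above implies.

```python
from itertools import product


def full_de_bruijn(alphabet, k):
    """Vertices and edges of the order-k full de Bruijn graph."""
    vertices = {"".join(t) for t in product(alphabet, repeat=k)}
    # Edge from s_1 s_2 ... s_k to s_2 ... s_k a, for every symbol a.
    edges = {(s, s[1:] + a) for s in vertices for a in alphabet}
    return vertices, edges


def induced_de_bruijn(kmers, alphabet):
    """Induced subgraph of the full de Bruijn graph on the given k-mers."""
    vertices = set(kmers)
    edges = {
        (s, s[1:] + a)
        for s in vertices
        for a in alphabet
        if s[1:] + a in vertices  # keep an edge only if both endpoints are present
    }
    return vertices, edges


def vertex_label(kmer):
    """Single-symbol label L(v): the last symbol of the implicit label."""
    return kmer[-1]


V, E = induced_de_bruijn({"ACG", "CGT", "GTA"}, "ACGT")
print(sorted(E))
```

Because the subgraph is induced, an edge is present exactly when both k-mers are present and overlap in $k - 1$ symbols; no edge between two included k-mers is ever omitted.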

1.1.5. Strings and matching

For a string S of length n indexed from 1 to n, we use $S [i]$ to denote the $i^{\text{th}}$ symbol in S. We use $S [i, j]$ to denote the substring $S [i] S [i + 1] \dots S [j]$. If $j < i$, then we take $S [i, j]$ as the empty string. As mentioned above, we will consider every vertex v as labeled with a single symbol $L (v) \in Σ$. A pattern $P [1, m]$ matches a walk $v_{1}, v_{2}, \dots, v_{m}$ iff $P [i] = L (v_{i})$ for every $i \in [1, m]$.
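To make the matching definition concrete, here is a minimal sketch that decides whether some walk matches a pattern exactly. It is a straightforward $O (| E | m)$ frontier propagation over general labeled graphs, not the linear-time technique available for Wheeler graphs. The names `labels` (a dictionary from vertex to symbol) and `edges` (a set of (tail, head) pairs) are assumptions for the example.

```python
def has_matching_walk(labels, edges, pattern):
    """True if some walk v_1, ..., v_m satisfies pattern[i] == L(v_i) for all i."""
    if not pattern:
        return True
    # Vertices at which a walk matching the first symbol can end.
    frontier = {v for v, c in labels.items() if c == pattern[0]}
    for symbol in pattern[1:]:
        # Extend every partial walk by one edge whose head carries the next symbol.
        frontier = {v for (u, v) in edges if u in frontier and labels[v] == symbol}
        if not frontier:
            return False
    return bool(frontier)


# A 2-cycle labeled a, b: vertices may repeat, so "ababa" matches a walk.
labels = {1: "a", 2: "b"}
edges = {(1, 2), (2, 1)}
print(has_matching_walk(labels, edges, "ababa"))
```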

With these definitions in hand, we can formally define our first problem.

Problem 1 (Approximate matching with vertex label substitutions). Given a vertex labeled graph $D = (V, E)$ with alphabet $Σ$ of size $σ$ , pattern $P [1, m]$ , and integer $δ \geq 0$ , determine if there exists a walk in D matching P after at most $δ$ substitutions to the vertex labels.

Theorem 1. Problem 1 is NP-complete on the de Bruijn graphs with $σ = 4$ .

Theorem 1 is proven in Section 2. Intuitively, our reduction transforms a general directed graph into a de Bruijn graph that maintains key topological properties related to the existence of walks. The distinct problem of approximately matching a pattern to a path in a de Bruijn graph was shown to be NP-complete in the study by Limasset et al (2016). As mentioned by the authors of that work, the techniques used there do not appear to be easily adaptable to the problem for walks. Our approach uses edge transformations more closely inspired by those used in the study by Kapun and Tsarev (2013) for proving hardness on the paired de Bruijn sound cycle problem.

Problem 2 (Approximate matching with substitutions to the pattern). Given a vertex labeled graph $D = (V, E)$ with alphabet $Σ$ of size $σ$ , pattern $P [1, m]$ , and integer $δ \geq 0$ , determine if there exists a walk in D matching P after at most $δ$ substitutions to the symbols in P.
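For intuition about Problem 2, the sketch below computes the minimum number of pattern substitutions with a textbook dynamic program that runs in $O (| E | m)$ time. It is an illustrative implementation under assumed names (`labels`, `edges`, `min_pattern_substitutions`), not a reproduction of any cited algorithm. The recurrence is: the cost of ending at $v$ after $i$ symbols equals the best cost among in-neighbors of $v$ after $i - 1$ symbols, plus one if $P [i] \neq L (v)$.

```python
def min_pattern_substitutions(labels, edges, pattern):
    """Fewest substitutions to pattern so that it matches some walk.

    Returns None when the graph has no walk with len(pattern) vertices.
    """
    if not pattern:
        return 0
    inf = float("inf")
    # cost[v]: best cost of matching the prefix read so far with a walk ending at v.
    cost = {v: int(labels[v] != pattern[0]) for v in labels}
    for symbol in pattern[1:]:
        best_pred = {v: inf for v in labels}
        for u, v in edges:
            if cost[u] < best_pred[v]:
                best_pred[v] = cost[u]
        cost = {v: best_pred[v] + int(labels[v] != symbol) for v in labels}
    best = min(cost.values(), default=inf)
    return None if best == inf else best


labels = {1: "a", 2: "b"}
edges = {(1, 2), (2, 1)}
print(min_pattern_substitutions(labels, edges, "aab"))
```

An instance of Problem 2 is then a yes-instance when the returned value is not `None` and is at most $δ$.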

For Problem 2, we provide a hardness result based on the SETH, which is frequently used for establishing conditional optimality of polynomial time algorithms (Abboud et al, 2018; Backurs and Indyk, 2016; Equi et al, 2019; Gibney, 2020; Gibney et al, 2021; Hoppenworth et al, 2020). We refer the reader to the study of Williams (2015) for the definition of the SETH and for the reduction to the orthogonal vectors (OV) problem, which is utilized to prove Theorem 2.

Theorem 2. Conditioned on the SETH, for all constants $ε > 0$ , there does not exist an $O (| E | m^{1 - ε} + | E |^{1 - ε} m)$ time algorithm for Problem 2 on the de Bruijn graphs with $σ = 4$ .

Note that the order, k, of the de Bruijn graphs used in our proofs is $Θ (\log^{2} | V |)$ for Theorem 1 and $Θ (\log | V |)$ for Theorem 2.

2. HARDNESS OF PROBLEM 1 ON THE DE BRUIJN GRAPHS

Our proof of NP-completeness uses a reduction from the Hamiltonian cycle problem on directed graphs, which is the problem of deciding if there exists a cycle through a directed graph that visits every vertex exactly once. The Hamiltonian cycle problem has been proven NP-complete, even when restricted to directed graphs where the number of edges is linear in the number of vertices (Plesník, 1979). To present our reduction, we introduce the concept of merging two vertices. To merge vertices u and v, we first create a new vertex w. We then take all edges with either u or v as their head and make w their new head. Next, we take all edges with either u or v as their tail and make w their new tail. This makes the edges $(u, v)$ and $(v, u)$ (if they existed) into self-loops for w. If identical self-loops are formed, we delete all but one of them. Finally, we delete the original vertices u and v.
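The merge operation can be sketched as follows, assuming the graph is stored as a set of vertices and a set of (tail, head) edge pairs; the function name and the explicit argument `w` for the new vertex are illustrative. Note that storing edges in a set collapses identical self-loops automatically, and it also collapses parallel edges, which is an assumption that goes slightly beyond the description above.

```python
def merge_vertices(vertices, edges, u, v, w):
    """Merge vertices u and v into a new vertex w.

    Every edge whose head or tail is u or v is redirected to w, so the
    edges (u, v) and (v, u), if present, become the self-loop (w, w).
    """
    def rename(x):
        return w if x in (u, v) else x

    new_vertices = (set(vertices) - {u, v}) | {w}
    # The set comprehension removes duplicate edges created by the renaming.
    new_edges = {(rename(a), rename(b)) for a, b in edges}
    return new_vertices, new_edges


V = {1, 2, 3}
E = {(1, 2), (2, 1), (2, 3), (3, 1)}
print(merge_vertices(V, E, 1, 2, "w"))
```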

2.1. Reduction

We start with an instance of the Hamiltonian cycle problem on a directed graph where the number of edges is linear in the number of vertices. We can assume that there are no self-loops or vertices with in-degree or out-degree zero. To simplify the proof, we first eliminate any cycles of length 2 using the gadget in Figure 1. We denote the resulting graph as $D = (V, E)$ and let $n = | V |$. We assign each vertex $v \in V$ a unique integer $L (v) \in [0, n - 1]$. Let $ℓ = \lceil \log n \rceil$, let $\mathrm{bin}(i)$ be the standard binary encoding of i using $ℓ$ bits, and let $Σ = \{\$, \#, 0, 1\}$. Define $W = | \mathrm{enc}(i) |$ and $k = 3 W$.

FIG. 1.

Gadget to remove cycles of length 2 from the initial input graph.

We construct a new (de Bruijn) graph $D' = (V', E')$ as follows: Initially, $D'$ is the empty graph. For $i = 0, 1, \dots, n - 1$, for each edge $(u, v) \in E$ where $L (v) = i$, create a new path whose concatenation of vertex labels is $\#^{W} \mathrm{enc}(i) \$^{W} \mathrm{enc}(i)$. The vertex u will correspond with a new vertex $ϕ (u)$ at the start of this path, and the vertex v will correspond with a new vertex $ϕ (v)$ at the end of this path. The vertex $ϕ (v)$ has the implicit label $\mathrm{enc}(L (v)) \$^{W} \mathrm{enc}(L (v))$. The vertex $ϕ (u)$ is “temporarily assigned” the implicit label $\mathrm{enc}(L (u)) \$^{W} \mathrm{enc}(L (u))$. See Figure 2. We call vertices with implicit labels of the form $\mathrm{enc}(L (\cdot)) \$^{W} \mathrm{enc}(L (\cdot))$ marked vertices. We use the notation $ϕ ((u, v))$ to denote the path created when applying this transformation to $(u, v) \in E$. After the path $ϕ ((u, v))$ is created, vertices in $V'$ having the same implicit label are merged, and parallel edges are deleted (Figs. 3 and 4). See Figure 5 for a complete example. Finally, let $δ = 2 ℓ (n - 1)$ and

$$P = \#^{W} \mathrm{enc}(0) \$^{W} \mathrm{enc}(0) \#^{W} \mathrm{enc}(1) \$^{W} \mathrm{enc}(1) \#^{W} \dots \#^{W} \mathrm{enc}(n - 1) \$^{W} \mathrm{enc}(n - 1) \#^{W} \mathrm{enc}(0) \$^{W} \mathrm{enc}(0).$$

FIG. 2.

The transformation from edges to paths used in our reduction.

FIG. 3.

Vertices with the same implicit label are merged while transforming D to $D'$ , causing edges with shared head vertex to become paths with multiple shared vertices.

FIG. 4.

Vertices with the same implicit label are merged while transforming D to $D'$ , causing edges with shared tail vertex to become paths with multiple shared vertices.

FIG. 5.

(Top) A graph before the reduction is applied to it. (Bottom) The transformed graph. Implicit labels for marked vertices are shown, and the path directions are annotated by arrows beside each path.

We will show that there exists a walk in $D'$ matching P with at most $δ$ vertex label substitutions iff D contains a Hamiltonian cycle.
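The shape of the pattern P and the budget $δ$ can be sketched in code. The encoding function is passed in as a parameter, since only its length W matters for the block structure shown here; the `toy_enc` used in the example is a hypothetical stand-in and is not the encoding used in the reduction. The sketch also assumes that the logarithm in $ℓ$ is base 2, which is consistent with a binary encoding on $ℓ$ bits.

```python
def build_pattern(n, enc):
    """Concatenate the blocks #^W enc(i) $^W enc(i) for i = 0, 1, ..., n-1, 0."""
    width = len(enc(0))  # W, assumed identical for every i
    order = list(range(n)) + [0]  # the pattern returns to vertex 0 at the end
    return "".join("#" * width + enc(i) + "$" * width + enc(i) for i in order)


def substitution_budget(n):
    """delta = 2 * ell * (n - 1), with ell = ceil(log2(n)) computed exactly."""
    ell = (n - 1).bit_length()
    return 2 * ell * (n - 1)


def toy_enc(i):
    """Hypothetical fixed-width placeholder encoding, for illustration only."""
    return format(i, "02b")


pattern = build_pattern(2, toy_enc)
print(pattern, len(pattern), substitution_budget(4))
```

Each block has length $4 W$ and there are $n + 1$ blocks, so the pattern has length $m = 4 W (n + 1)$.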

2.1.1. Proof of correctness

Lemma 1. The graph $D'$ constructed as above is a de Bruijn graph.

Proof. There are three properties that must be proven: (i) Implicit labels are unique, meaning for every implicit label at most one vertex is assigned that label; (ii) There are no edges missing, that is, if the implicit label of $y \in V'$ is $S α$ for some string $S [1, k - 1]$ and symbol $α \in Σ$ , and there exists a vertex $x \in V'$ with implicit label $β S [1, k - 1]$ for some symbol $β \in Σ$ , then $(x, y) \in E'$ ; (iii) Implicit labels are well defined, in that every walk of length $k - 1$ ending at a vertex $x \in V'$ matches the same string (the implicit label of x).

Property (i) holds since after every edge transformation, vertices with the same implicit label are merged, making every implicit label occur at most once.

For Property (ii), consider the completed graph $D'$ and an arbitrary vertex y on an arbitrary path $ϕ ((u, v))$ . Regarding a possible edge $(x, y) \in E'$ , we have the following cases: