A New Paradigm for Identifying Reconciliation-Scenario Altering Mutations Conferring Environmental Adaptation

Abstract

An important goal in microbial computational genomics is to identify crucial events in the evolution of a gene that severely alter the duplication, loss, and mobilization patterns of the gene within the genomes in which it disseminates. In this article, we formalize this microbiological goal as a new pattern-matching problem in the domain of gene tree and species tree reconciliation, denoted “Reconciliation-Scenario Altering Mutation (RSAM) Discovery.” We propose an $O (m \cdot n \cdot k)$ time algorithm to solve this new problem, where m and n are the number of vertices of the input gene tree and species tree, respectively, and k is a user-specified parameter that bounds from above the number of optimal solutions of interest. The algorithm first constructs a hypergraph representing the k highest scoring reconciliation scenarios between the given gene tree and species tree, and then interrogates this hypergraph for subtrees matching a prespecified RSAM pattern. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by $Ω (m \cdot n \cdot k)$ . We implement the new algorithm as a tool, called RSAM-finder, and demonstrate its application to the identification of RSAMs in toxins and drug resistance elements across a data set spanning hundreds of species.

1. Introduction

Prokaryotes can be found in the most diverse and severe ecological niches of the planet. Adaptation of prokaryotes to new niches requires expanding their repertoire of protein families, via two evolutionary processes: first, by selection of novel gene mutants carrying stable genetic alterations that confer adaptation, and second, by dissemination of an adaptively mutated gene. These two processes are correlated: an adaptation-conferring mutation in a gene could accelerate its mobilization across bacterial lineages populating the corresponding environmental niche (Poirel et al., 2009), and vice versa, the mobilization of a gene by transposable elements increases its chances to mutate or “pick up” novel genomic context. Thus, an important research goal is to identify gene-level mutations that affect the spreading pattern of the mutated gene within and across the genomes harboring it.

For example, consider mutations conferring adaptation of bacteria to a human pathogenesis environment. Here, a mutation to a resistance or virulence factor could enhance pathogenic adaptation, thus increasing the horizontal mobilization of the mutated gene within other human pathogens inhabiting this niche (Poirel et al., 2009). In this case, we say that the mutation has a causal association with the observed dissemination pattern of the mutated gene (i.e., the increased mobilization of the gene among pathogenic bacteria). Identifying such mutations could inform infectious disease monitoring and outbreak control, and assist in identifying potential drug targets.

The coevolution of genes and their host species is classically described by computing the most parsimonious reconciliation scenario between a given gene tree G and the corresponding species tree S, that is, a mapping of each vertex $u \in G$ to a vertex $x \in S$ . Three major evolutionary processes, traditionally considered by reconciliation approaches, are horizontal gene transfer, gene duplication, and gene loss (Tofigh et al., 2011). Each mapping of a vertex $u \in G$ to a vertex $x \in S$ is associated with one of these evolutionary events, and assigned a cost accordingly (Fig. 1). The optimization problem of computing a least-cost reconciliation between G and S, where the total cost is computed as the sum of the costs assigned to each of the mappings, is denoted as Duplication–Transfer–Loss (DLT) Reconciliation. (Previous works on this problem are reviewed in Section 1.1 below.)

FIG. 1.

An example of a DLT scenario. (A) The gene tree G. (B) The species tree S. (C) A possible reconciliation scenario between G and S. DLT, duplication–transfer–loss.

Motivated by examples such as the one given above, we formalize a new pattern-matching problem in the domain of DLT reconciliation. Given are a gene tree G, a corresponding species tree S, a mapping $σ$ from the leaves of G to the leaves of S, and (optionally) an environmental annotation labeling the leaves of the input trees. Let $ℋ$ denote some data structure, defined later in the article, that models the space of reconciliations between G and S. A DLT Reconciliation Scenario Pattern P denotes a mapping from a vertex $u \in G$ to a vertex $x \in S$ , which obeys a set of user-defined specifications regarding the corresponding reconciliation event, the labels on the paired vertices, and other features associated with the mapping. Mappings between pairs of vertices ( $u \in G$ , $x \in S$ ) that abide by the requirements specified by P are denoted instances of P in $ℋ$ . Given a prespecified DLT Reconciliation Scenario Pattern P and a data structure $ℋ$ modeling the space of reconciliations between G and S, a Reconciliation Scenario Altering Mutation (RSAM) of P in $ℋ$ is a vertex $v \in G$ representing a gene mutation with a putative causal association to instances of P in $ℋ$ . The RSAM Discovery problem is to identify RSAMs in G.

In what follows, we propose a three-stage solution to the RSAM Discovery problem defined above (illustrated in Fig. 2). The first stage constructs a hypergraph $ℋ$ that recursively aggregates all the k-best reconciliations of G and S. Each supernode in $ℋ$ consists of k hypernodes, where each hypernode represents a partial solution for the DLT-reconciliation problem. Our hypergraph ensemble approach is based on a model proposed by Patro and Kingsford (2013) for network evolution, which we extend and adapt here to the DLT model. This hypergraph of k-best reconciliations, intended to provide some robustness to the noise typical of this data, will serve as the search space for the pattern-matching stage. The second stage of our proposed solution consists of assigning a probability to each partial solution, that is, to each hypernode of $ℋ$ . Finally, in the third stage, instances of the sought RSAM-pattern P are identified within $ℋ$ , and RSAM-ranking scores are assigned accordingly to the vertices of G. Based on these scores, vertices representing putative RSAMs are identified in G and subjected to biological interpretation.

FIG. 2.

High-level overview of the RSAM-finder algorithm. RSAM, Reconciliation Scenario Altering Mutation.

The construction of $ℋ$ , in the first stage, is the computational bottleneck of the RSAM-Discovery pipeline mentioned above. Here, we adapt the approach proposed by Bansal et al. (2012) for the basic, 1-best variant of DLT reconciliation, extending it to an efficient k-best variant. This yields an $O (m \cdot n \cdot k)$ time algorithm for the problem, where m and n are the number of vertices of the input gene tree and species tree, respectively, and k is a user-specified parameter that bounds from above the number of optimal solutions of interest. Our algorithm is optimal in the sense that the number of hypernodes in the hypergraph can be lower bounded by $Ω (m \cdot n \cdot k)$ .

We remark that a simpler $O (m n (n + k) \log (n + k))$ algorithm for hypergraph construction can be obtained by directly building upon the dynamic programming (DP) algorithm of Tofigh et al. (2011) and using the method of cube pruning by Huang and Chiang (2005) to handle lists of (partial) k-best solutions efficiently. Pseudocode for this “naive” algorithm can be found in Zoller et al. (2019), section 2. Just as Bansal et al. (2012) shaved off one n factor in the time complexity of the algorithm by Tofigh et al. (2011), we shave the n factor in the term $(n + k)$ in the time complexity of the aforementioned naive algorithm. Surprisingly, we show that by relying on the improved DP algorithm, a further speedup is achieved: the usage of a queue becomes unnecessary, and therefore, the $\log (n + k)$ factor is eliminated.

Our proposed solution to the problem defined in this article is implemented as a tool called RSAM-finder, publicly available in Zoller (2019). We assess the performance of RSAM-finder in large-scale simulations, and exemplify its application to the identification of RSAMs across a data set spanning hundreds of species.

1.1. Previous related works

The DLT reconciliation problem has been extensively studied. In particular, two main DLT variants have been considered: (1) the undated DLT-reconciliation, where the species are undated, and (2) the fully dated DLT-reconciliation, where either each vertex in the species (and gene) tree is associated with an estimated date or the vertices of the species (and gene) tree are associated with a total order, and any reconciliation must respect these dates or order (i.e., a horizontal transfer (HT) event can occur only between coexisting species).

In the acyclic version of these variants, there cannot exist two genes such that one is a descendant of the other, yet the descendant is mapped (in the species tree S) to an ancestor of the other. Tofigh et al. (2011) showed that the acyclic undated version is NP-hard. However, the acyclic dated version becomes polynomially solvable (Libeskind-Hadas and Charleston, 2009).

Tofigh et al. (2011), Tofigh (2009), and David and Alm (2011) studied a version of the undated (cyclic) problem that ignores losses and proposed an $O (m n^{2})$ DP algorithm for it. They also gave a fixed-parameter tractable algorithm for enumerating all optimal solutions. The time complexity of the algorithm was improved to $O (m n)$ in Tofigh (2009) (under a restricted model that ignores losses) and in Bansal et al. (2012) (which does not ignore losses).

It is well known that the biological data used as input to the DLT reconciliation problem could be inaccurate, whether due to a sequencing problem, a problem in the reconstruction of G or S (Bapteste et al., 2009), or due to some other problem caused by noise. To overcome this problem, previous works try to examine more than one optimal solution, for example, see Donati et al. (2015) and Scornavacca et al. (2013). A probabilistic method for exploring the space of optimal solutions was suggested in Bansal et al. (2013) and Doyon et al. (2009), where the latter was improved in Doyon et al. (2011). Additional studies considered a space of candidate co-optimal scenarios within special variants of the DLT problem, some of which used special constraints to drive the search (Charleston, 1998; Merkle et al., 2010; Stolzer et al., 2012; To et al., 2015). Although all of the previous works reviewed in this paragraph compute a space of candidate reconciliation scenarios, none of these works considered the application of pattern matching on this space, as we do in this work.

DLT reconciliation variants where the reconciliation computation is guided by constraints derived from vertex-coloring information were proposed in applications studying host/parasite coevolution, such as Berry et al. (2018), where the vertex coloring (in both G and S) represents the geographical area of residence. However, the applied constraints were “hard-wired” to the specific problem addressed in that article. In contrast, the approach proposed in this article is more general, supporting a pattern search that is guided by a user-defined pattern. Our tool RSAM-finder provides the users with a query language able to express more robust patterns, according to the various applications where the pattern search is to be used.

2. Preliminaries

For a (binary) rooted tree T, let $L (T)$ , $V (T)$ , $I (T)$ , and $E (T)$ denote the sets of leaves, vertices, internal vertices, and edges, respectively, of T. In addition, let $V (T)^{⋆}$ denote the set of finite (ordered) vectors over $V (T)$ , that is, $V (T)^{⋆} = \{(v_{1}, v_{2}, \dots, v_{l}) \mid v_{i} \in V (T) \text{ for all } i \in \{1, \dots, l\}, l \in \mathbb{N}\}$ . When T is clear from context, let $V^{⋆} = V (T)^{⋆}$ . Throughout, we treat any (binary) rooted tree T as a directed graph whose edges are directed from root to leaves. Then, if $(u, v) \in E (T)$ , we say that v is a child of u, and u is the parent of v. For $u, v \in V (T)$ , the notation $v \leq_{T} u$ signifies that v is a descendant of u (alternatively, u is an ancestor of v), that is, there is a directed path from u to v or $u = v$ . We say that v is a proper descendant (respectively proper ancestor) of u if $v \leq_{T} u$ (respectively $v \geq_{T} u$ ) and $u \neq v$ , denoted $v <_{T} u$ (respectively $v >_{T} u$ ). When both $u \nleq_{T} v$ and $v \nleq_{T} u$ , we say that u and v are incomparable. For any $u, v \in V (T)$ , let $d_{T} (u, v)$ denote the number of edges in the (unique simple undirected) path between u and v in T. When T is clear from context, we drop it from the notations $v \leq_{T} u$ and $d_{T} (u, v)$ . For any $u \in V (T)$ , let $T_{u}$ denote the subtree of T rooted in u (then, $V (T_{u}) = \{v \in V (T) \mid v \leq u\}$ ).
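As an illustration of these definitions, the following minimal Python sketch implements the relation $v \leq_{T} u$ , incomparability, the least common ancestor, and the distance $d_{T} (u, v)$ on a tree stored as a child-to-parent map. The class name `RootedTree`, its method names, and the example tree are hypothetical and chosen only for this illustration; they are not part of RSAM-finder.

```python
from typing import Dict, Optional


class RootedTree:
    """Rooted tree stored as a child -> parent map (the root maps to None)."""

    def __init__(self, parent: Dict[str, Optional[str]]):
        self.parent = parent
        self.depth: Dict[str, int] = {}
        for v in parent:
            self._depth(v)

    def _depth(self, v: str) -> int:
        # Number of edges from the root to v, memoized.
        if v not in self.depth:
            p = self.parent[v]
            self.depth[v] = 0 if p is None else self._depth(p) + 1
        return self.depth[v]

    def is_descendant(self, v: str, u: str) -> bool:
        """True iff v <=_T u (v is a descendant of u, or v == u)."""
        node: Optional[str] = v
        while node is not None:
            if node == u:
                return True
            node = self.parent[node]
        return False

    def incomparable(self, u: str, v: str) -> bool:
        return not self.is_descendant(u, v) and not self.is_descendant(v, u)

    def lca(self, u: str, v: str) -> str:
        # Repeatedly lift the deeper vertex until both meet.
        while u != v:
            if self.depth[u] >= self.depth[v]:
                u = self.parent[u]
            else:
                v = self.parent[v]
        return u

    def dist(self, u: str, v: str) -> int:
        """d_T(u, v): number of edges on the unique path between u and v."""
        a = self.lca(u, v)
        return self.depth[u] + self.depth[v] - 2 * self.depth[a]


# Hypothetical species tree: r -> (a, b), a -> (x1, x2), b -> (x3, x4).
S = RootedTree({"r": None, "a": "r", "b": "r",
                "x1": "a", "x2": "a", "x3": "b", "x4": "b"})
```

This sketch favors clarity over speed: each query walks parent pointers, so it takes time proportional to the tree height.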

2.1. DLT scenario

A DLT scenario for two binary trees G (the gene tree) and S (the species tree) is a tuple $⟨σ, γ, Σ, Δ, Θ, Ξ⟩$ where $σ : L (G) \to L (S)$ is a mapping of the leaves of G to the leaves of S, $γ : V (G) \to V (S)$ is a mapping of the vertices of G to the vertices of S, and $(Σ, Δ, Θ)$ is a partition of $I (G)$ (the set of internal vertices of G) into three event classes: Speciation ( $Σ$ ), Duplication ( $Δ$ ), and Horizontal Transfer ( $Θ$ ). The subset $Ξ \subseteq E (G)$ specifies which edges are involved in horizontal transfer events. In addition, the following constraints should be satisfied.

Consistency of $σ$ and $γ$ . For each leaf $u \in L (G), γ (u) = σ (u)$ . This constraint ensures that $γ$ respects $σ$ —that is, each leaf of G is mapped to the species where it is found.

Consistency of $γ$ and ancestorship relations in S. For each $u \in I (G)$ with children v and w:

(a) $γ (u) \nless_{S} γ (v)$ and $γ (u) \nless_{S} γ (w)$ . This constraint ensures that each of the two children (in G) of the gene u is mapped by $γ$ to a species that is not a proper ancestor (in S) of the species to which the gene u is mapped; thus, it can be either a descendant of $γ (u)$ or incomparable to $γ (u)$ .

(b) At least one of $γ (v)$ and $γ (w)$ is a descendant of $γ (u)$ . This constraint ensures that at least one of the two children (in G) of the gene u is mapped by $γ$ to a species that is a descendant (in S) of the species to which the gene u is mapped.

Identifying horizontal transfer edges. For each edge $(u, v) \in E (G)$ , it holds that $(u, v) \in Ξ$ if and only if $γ (u) \nleq_{S} γ (v)$ and $γ (v) \nleq_{S} γ (u)$ . This constraint identifies which edges are horizontal transfer edges: specifically, a horizontal transfer edge is an edge $(u, v) \in E (G)$ from a gene u to a gene v that are mapped to species $γ (u)$ and $γ (v)$ that are incomparable.

Associating events with internal vertices. For each $u \in I (G)$ with children $v, w$ :

(a) Speciation. $u \in Σ$ only if both (i) $γ (u) = \mathrm{lca} (γ (v), γ (w))$ and (ii) $γ (v)$ and $γ (w)$ are incomparable (i.e., $γ (v) \nleq_{S} γ (w)$ and $γ (w) \nleq_{S} γ (v)$ ).

(b) Duplication. $u \in Δ$ only if $γ (u) \geq_{S} \mathrm{lca} (γ (v), γ (w))$ .

Figure 1 demonstrates a DLT scenario. The species are written below the leaves of S. The (noninjective) mapping $σ : L (G) \to L (S)$ is implied by the labels of the leaves of G: $σ (u_{1}) = x_{1}; σ (u_{2}) = x_{4}; σ (u_{3}) = x_{3}; σ (u_{5}) = x_{1}; σ (u_{8}) = x_{4}$ . In the DLT reconciliation of G, S, and $σ$ (Fig. 1C), the tubes illustrate the edges of S, and each edge of G is embedded inside the tube (edge of S) to which it is mapped by $γ$ . Then, $Σ = {u_{9}, u_{4}}$ , $Δ = {u_{7}}$ , and $Θ = {u_{6}}$ . Moreover, $Ξ = {(u_{6}, u_{4})}$ .
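The constraints above can be checked mechanically for a given mapping $γ$ . The following Python sketch, under stated assumptions, tests whether an edge of G belongs to $Ξ$ and lists the event classes whose stated necessary conditions hold at an internal vertex. The function names and the example species tree are hypothetical. Since the conditions for $Σ$ and $Δ$ are stated as "only if", more than one class may be admissible for the same vertex. Treating u as a transfer candidate when one of its outgoing edges lies in $Ξ$ is an assumption of this sketch.

```python
def ancestors(parent, v):
    """Return v followed by all of its ancestors, up to the root."""
    chain = []
    while v is not None:
        chain.append(v)
        v = parent[v]
    return chain


def leq(parent, v, u):
    """v <=_S u: v is a descendant of u, or v == u."""
    return u in ancestors(parent, v)


def lca(parent, a, b):
    """Least common ancestor of a and b."""
    anc_a = set(ancestors(parent, a))
    return next(x for x in ancestors(parent, b) if x in anc_a)


def is_transfer_edge(parent, gamma, u, v):
    """(u, v) is in Xi iff gamma(u) and gamma(v) are incomparable in S."""
    gu, gv = gamma[u], gamma[v]
    return not leq(parent, gu, gv) and not leq(parent, gv, gu)


def admissible_events(parent, gamma, u, v, w):
    """Event classes whose necessary conditions hold at u with children v, w."""
    gu, gv, gw = gamma[u], gamma[v], gamma[w]
    top = lca(parent, gv, gw)
    children_incomparable = (not leq(parent, gv, gw)
                             and not leq(parent, gw, gv))
    events = set()
    if gu == top and children_incomparable:
        events.add("speciation")
    if leq(parent, top, gu):  # gamma(u) >=_S lca(gamma(v), gamma(w))
        events.add("duplication")
    # Assumption of this sketch: u is a transfer candidate when one of its
    # outgoing edges is a horizontal transfer edge.
    if (is_transfer_edge(parent, gamma, u, v)
            or is_transfer_edge(parent, gamma, u, w)):
        events.add("transfer")
    return events


# Hypothetical species tree: r -> (a, b), a -> (x1, x2), b -> (x3, x4).
S_PARENT = {"r": None, "a": "r", "b": "r",
            "x1": "a", "x2": "a", "x3": "b", "x4": "b"}
```

For instance, mapping u to `a`, v to `x1`, and w to `x3` leaves only the transfer class admissible, because `a` and `x3` are incomparable and `a` is not an ancestor of the least common ancestor `r`.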

2.2. Losses

Our definition of a loss event is based on the definition given by Bansal et al. (2012). Consider a gene tree G, a species tree S, and a corresponding DLT scenario $α = ⟨σ, γ, Σ, Δ, Θ, Ξ⟩$ . Let $u \in V (G)$ with children v and w (if they exist). Define $LOSS_{α} (u)$ as the number of losses at u. Intuitively, the number of losses at a vertex u is the number of “skips” the gene made in the tree S at the evolutionary event that u represents. Formally, for an internal vertex u with children v and w,

$LOSS_{α} (u) = \begin{cases} | d_{S} (γ (u), γ (v)) - 1 | + | d_{S} (γ (u), γ (w)) - 1 | & \text{if } u \in Σ, \\ d_{S} (γ (u), γ (v)) + d_{S} (γ (u), γ (w)) & \text{if } u \in Δ, \\ d_{S} (γ (u), γ (v)) & \text{if } (u, w) \in Ξ. \end{cases}$

Recall that $d_{S} (x, y)$ is the distance between x and y in the species tree S. The formula above determines that the number of losses in a vertex $u \in V (G)$ is based on the event that occurred in u. First, if $u \in Σ$ (i.e., u represents a speciation event), then the number of losses is the sum of the distances in the species tree between $γ (u)$ and the images under $γ$ of its two children, without counting the first step toward each child. If $u \in Δ$ (i.e., u represents a duplication event), then the number of losses is the sum of the distances in the species tree between $γ (u)$ and the images of its two children. If $(u, w) \in Ξ$ (i.e., u represents a horizontal transfer event that happened in the edge $(u, w)$ ), then the number of losses is the distance in the species tree between $γ (u)$ and the image of its other child (i.e., v).

2.3. Costs

Let $c_{Σ}, c_{Δ}, c_{Θ}$ , and $c_{LOSS}$ denote the costs of a speciation event, a duplication event, a horizontal transfer event, and a loss event, respectively. Let $LOSS_{α} = \sum_{u \in V (G)} LOSS_{α} (u)$ . Let $| Σ | \cdot c_{Σ} + | Δ | \cdot c_{Δ} + | Θ | \cdot c_{Θ} + LOSS_{α} \cdot c_{LOSS}$ be the reconciliation cost of $α$ . When seeking a “best” DLT scenario, the goal is to find one that minimizes this cost.
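As a worked illustration of the cost definition, the following Python sketch computes the number of losses at a vertex from the event type and the species-tree distances to the images of its two children (following the case description in Section 2.2), and then sums event and loss costs. The function names, the input encoding, and the numeric cost values in the example are hypothetical and serve only this illustration.

```python
def losses_at(event, d_v, d_w):
    """Number of losses at a vertex u, given its event and the distances
    d_v = d_S(gamma(u), gamma(v)) and d_w = d_S(gamma(u), gamma(w))."""
    if event == "speciation":
        # Distances to both children, not counting the first step of each.
        return abs(d_v - 1) + abs(d_w - 1)
    if event == "duplication":
        return d_v + d_w
    if event == "transfer":
        # (u, w) is the transfer edge; only the other child v contributes.
        return d_v
    raise ValueError(f"unknown event: {event}")


def reconciliation_cost(events, c_spec, c_dup, c_transfer, c_loss):
    """Sum of event costs plus c_loss times the total number of losses.

    `events` lists one (event, d_v, d_w) triple per internal vertex of G.
    """
    unit = {"speciation": c_spec, "duplication": c_dup,
            "transfer": c_transfer}
    total = 0.0
    for event, d_v, d_w in events:
        total += unit[event] + c_loss * losses_at(event, d_v, d_w)
    return total


# Hypothetical scenario with three internal vertices.
example_events = [
    ("speciation", 1, 1),    # children mapped to the two children of gamma(u)
    ("duplication", 0, 2),   # one child stays at gamma(u), the other skips two edges
    ("transfer", 1, None),   # non-transferred child is one edge below gamma(u)
]
example_cost = reconciliation_cost(example_events, c_spec=0.0, c_dup=2.0,
                                   c_transfer=3.0, c_loss=1.0)
```

With these assumed unit costs, the three vertices contribute 0, 4, and 4, respectively, so `example_cost` is 8.0.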

3. Hypergraph of k-Best Scenarios

To represent k-best solutions, we use a directed hypergraph denoted by $ℋ$ based on the notation in Huang and Chiang (2005). The hypergraph is a tuple $ℋ = ⟨V, E⟩$ , where V is a finite set of hypernodes, and E is a finite set of (directed) hyperedges defined as follows. Each $e \in E$ is a pair $⟨T (e), h (e)⟩$ , where $h (e) \in V$ is the head of e, and $T (e) \in V^{*}$ (i.e., $T (e)$ is a vector of vertices in V) is its tail. In our setting, $| T (e) | = 2$ for every $e \in E$ . In what follows, we define the hypernodes and hyperedges of $ℋ$ with respect to our problem. To exemplify this, we refer the reader to Figure 3. In part B of this figure, the hypernode $(u_{5}, x_{5}, 1)$ is annotated with score 1 and event “S.” To extract the best solution from the hypergraph, we begin with the first (i.e., top-ranking) slot in the root of the hypergraph (which is $(u_{5}, x_{5}, 1)$ in the figure), and then follow the incoming hyperedges in top-down order. In Figure 3, the best solution is solution (1) in part C of the figure. To extract it from the hypergraph in part B of the figure, we map $u_{5}$ to $x_{5}$ with a cost of 1 and a speciation event. Then, by first following the hyperedges incoming to $(u_{5}, x_{5}, 1)$ , we derive the mapping of $u_{3}$ to $x_{1}$ , and of $u_{4}$ to $x_{3}$ . Finally, by following the hyperedges incoming to $(u_{3}, x_{1}, 1)$ , we also derive the mapping of $u_{1}$ to $x_{1}$ , and of $u_{2}$ to $x_{2}$ .

FIG. 3.

Various aspects of the problem addressed in this article. (A) The input trees G and S. (B) An example of a hypergraph constructed based on the input trees and parameter k. (C) Three top-scoring DLT-reconciliations for the input.

The second best solution [solution (2) in part C of the figure] is extracted in the same manner—now, we start with the hypernode $(u_{5}, x_{5}, 2)$ rather than $(u_{5}, x_{5}, 1)$ , and again follow incoming hyperedges in a top-down order until we reach the leaves. Similarly, we can extract all three non-nil solutions among the four best solutions (illustrated in part C). As before, the outer tubes illustrate the edges of S, and the edges of G are embedded inside based on the reconciliation.

Hypernodes. For every vertex u in G, vertex x in S, and integer $i \in \{1, \dots, k\}$ , we have a hypernode $(u, x, i)$ in $ℋ$ . Such a hypernode $(u, x, i)$ is associated with the ith best (where ties are broken arbitrarily) solution mapping the subtree of G rooted in u to the subtree of S rooted in x that is a DLT scenario. In addition, for every integer $i \in \{1, \dots, k\}$ , we have a hypernode $(root, i)$ in the hypergraph $ℋ$ . Such a hypernode $(root, i)$ is associated with the ith best solution of mapping G (entirely) to any subtree of S. Each hypernode $(u, x, i)$ has a score $C (u, x, i)$ , and each hypernode $(root, i)$ has a score $C (root, i)$ . Moreover, each hypernode $(u, x, i)$ is associated with the event corresponding to the mapping of u and x in the DLT scenario of $(u, x, i)$ (speciation, duplication, or horizontal transfer), denoted $event (u, x, i)$ .

Supernodes. For any vertex $u \in V (G)$ and vertex $x \in V (S)$ , we define the supernode $(u, x)$ as the list $\{(u, x, i) : 1 \leq i \leq k\}$ (i.e., $(u, x)$ is the set of k hypernodes corresponding to the mapping of the subtree of G rooted in u to the subtree of S rooted in x). This notation will simplify our presentation.

Hyperedges. We remind the reader that each hypernode $(u, x, i) \in V (ℋ)$ describes a DLT scenario. Each hypernode has exactly one incoming hyperedge, but it can have multiple outgoing hyperedges. In particular, for each hypernode $(u, x, i) \in V (ℋ)$ , the (only) incoming hyperedge $e = ⟨T (e), h (e)⟩ = ⟨[(v, y, j), (w, z, r)], (u, x, i)⟩$ describes the mapping of the subtrees of the children of u, namely, v and w, in the scenario of $(u, x, i)$ ; here, the subtree of v is mapped to the subtree of y as in the scenario of $(v, y, j)$ , and the subtree of w is mapped to the subtree of z as in the scenario of $(w, z, r)$ .
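The top-down extraction of a solution from $ℋ$ can be sketched as follows in Python. Each hypernode stores the tail of its single incoming hyperedge, and the extraction follows these tails from a chosen starting hypernode down to the leaves. The class `Hypernode`, the function `extract_mapping`, and the dictionary layout are hypothetical. The example mirrors the mappings described for Figure 3, where only the score 1 and event “S” of $(u_{5}, x_{5}, 1)$ are taken from the text; all other scores and events are placeholders assumed for the illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]  # a hypernode (u, x, i)


@dataclass
class Hypernode:
    cost: float
    event: str
    # Tail [(v, y, j), (w, z, r)] of the single incoming hyperedge;
    # None for hypernodes that match a leaf of G.
    tail: Optional[Tuple[Key, Key]] = None


def extract_mapping(nodes: Dict[Key, Hypernode], start: Key) -> Dict[str, str]:
    """Follow incoming hyperedges top-down and collect the map u -> x."""
    mapping: Dict[str, str] = {}
    stack = [start]
    while stack:
        key = stack.pop()
        u, x, _ = key
        mapping[u] = x
        tail = nodes[key].tail
        if tail is not None:
            stack.extend(tail)
    return mapping


# Placeholder costs and events, except for the root entry.
nodes = {
    ("u5", "x5", 1): Hypernode(1.0, "S", (("u3", "x1", 1), ("u4", "x3", 1))),
    ("u3", "x1", 1): Hypernode(0.0, "?", (("u1", "x1", 1), ("u2", "x2", 1))),
    ("u4", "x3", 1): Hypernode(0.0, "leaf"),
    ("u1", "x1", 1): Hypernode(0.0, "leaf"),
    ("u2", "x2", 1): Hypernode(0.0, "leaf"),
}
best = extract_mapping(nodes, ("u5", "x5", 1))
```

Starting instead from a hypernode $(u_{5}, x_{5}, 2)$ , if present in `nodes`, would extract the second best solution in the same manner.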

4. Framework and Algorithms

In this section, we elaborate on each of the three stages of the workflow in Section 1.

4.1. Stage 1: Hypergraph construction

The first stage of our framework is to construct the hypergraph described in Section 3. To this end, we develop an efficient algorithm that runs in time $O (m \cdot n \cdot k)$ and requires $O (m \cdot n \cdot k)$ space.

4.1.1. An overview of the algorithm

We iterate over all $u \in V (G)$ in postorder, as well as over all $x \in V (S)$ in postorder. (However, as explained below, when we consider a vertex $u \in V (G)$ , after iterating over all vertices $x \in V (S)$ in postorder, we also iterate over all vertices $x \in V (S)$ in preorder.) In each iteration, corresponding to a pair $(u, x)$ , we construct three lists: $p_{Σ}$ (speciation), $p_{Δ}$ (duplication), and $p_{Θ}$ (horizontal transfer). Specifically, $p_{Σ}$ should be a list of k-best solutions that are DLT scenarios where the subtree of G rooted in u is mapped to the subtree of S rooted in x under the restriction that the event corresponding to matching u and x is speciation. The meaning of the lists $p_{Δ}$ and $p_{Θ}$ is similar, where the restriction of speciation is replaced by duplication or horizontal transfer, respectively. Having these three lists suffices to construct the supernode $(u, x)$ .

To avoid repetitive computation, we maintain three additional lists: subtree, subtreeLoss, and incomp. Intuitively, subtreeLoss(u,x,i) represents the ith best cost of a reconciliation of the subtree of G rooted in u, such that u may be mapped to any $y \leq x$ with an additional cost of one loss per edge in the path from x to y; subtree(u,x,i) represents the same quantity without the additional loss cost; and incomp(u,x,i) represents the ith best cost of a reconciliation of the subtree of G rooted in u with some subtree of S whose root is a vertex y incomparable to x. The list subtree is used to efficiently compute incomp. The notations subtreeLoss(u,x), subtree(u,x), and incomp(u,x) refer to the lists of the k-best scores $\{subtreeLoss (u, x, i)\}_{i = 1}^{k}$ , $\{subtree (u, x, i)\}_{i = 1}^{k}$ , and $\{incomp (u, x, i)\}_{i = 1}^{k}$ , respectively, similarly to our usage of the notation of a supernode.

The efficient computation of $p_{Σ}$ , $p_{Δ}$ , and $p_{Θ}$ , along with the maintenance of subtreeLoss, subtree, and incomp themselves, is highly nontrivial. On a high level, we first initialize all five lists to contain only costs of ∞; then, still in the initialization phase, we add hypernodes that match between leaves of G and S in accordance with σ and update subtreeLoss and subtree accordingly. After the initialization, the main computation considers each $u \in V (G)$ in postorder, and performs two steps. In the first step, we consider each $x \in V (S)$ in postorder. Then, for each $i \in \{1, \dots, k\}$ , we compute $p_{Σ} (u, x, i)$ , $p_{Δ} (u, x, i)$ , and $p_{Θ} (u, x, i)$ based on somewhat involved recursive formulas. Afterward, we construct the supernode $(u, x)$ , as well as compute the lists subtreeLoss(u,x) and subtree(u,x). In the second step, we consider each $x \in I (S)$ with children y and z in preorder, and compute the lists incomp(u,y) and incomp(u,z).

Having constructed all hypernodes of the form $(u, x, i)$ along with their incoming hyperedges, it is trivial to construct the hypernodes of the form $(root, i)$ and their incoming hyperedges.

4.1.2. Pseudocode

The pseudocode is given in Figures 4 and 5. We use the notation $imin$ , defined as follows: Let X and Y be sets, and consider a function $f : X \to Y$ , and an index $i \in \{1, \dots, | X |\}$ . Then, $imin_{x' \in X} f (x') \triangleq f (x)$ where x is an element in X such that there are exactly i elements $x' \in X$ satisfying $f (x') \leq f (x)$ . In case f is not an injective function, and hence there are multiple choices for x, we break ties arbitrarily.
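The $imin$ notation can be read as "the ith smallest value of f over X." A minimal Python sketch follows; the function name `imin` and its argument order are choices made for this illustration, and ties are resolved by the sort order.

```python
def imin(i, xs, f):
    """Return the i-th smallest value of f over xs (1-based index).

    When f is not injective, equal values occupy consecutive ranks,
    which corresponds to breaking ties arbitrarily.
    """
    values = sorted(f(x) for x in xs)
    if not 1 <= i <= len(values):
        raise IndexError("i must satisfy 1 <= i <= |X|")
    return values[i - 1]
```

For example, over X = [3, 1, 2] with f(x) = x * x, `imin(1, ...)` is 1 and `imin(3, ...)` is 9. Sorting the whole collection costs more than necessary when i is small; it is used here only for readability.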

FIG. 4.

Pseudocode of the algorithm (first part). The pseudocode is continued in Figure 5.

FIG. 5.

Pseudocode of the algorithm (second part). This figure continues the pseudocode given in Figure 4.

We proceed with a few clarifications of the pseudocode.

Initialization: Lines 1–13. We initialize all lists to contain only scores of $\infty$ (lines 1–3). Then, the lists associated with a matching between leaves that complies with $σ$ (that is, supernodes of the form $(u, σ (u))$ for some $u \in L (G)$ ) are inserted into the hypergraph, and their topmost items are updated with a leaf event and cost 0, with the corresponding subtreeLoss and subtree values also set to 0 (because the cost of the best solution mapping a gene to its species is 0).

Division into First and Second Phases: Lines 14–39. For each vertex $u \in I (G)$ in postorder (line 14), we have two phases, on which we elaborate below. In the first phase (lines 15–34), we consider each vertex $x \in V (S)$ in postorder and perform most computations, and in the second phase (lines 35–38), we consider each vertex $x \in I (S)$ in preorder and compute the lists of incomp.

Recursive Formulas for $p_{Σ}$ , $p_{Δ}$ , and $p_{Θ}$ : Lines 16–30. In this part of the first phase, we find the k-best costs for mapping the subtree of G rooted in u to the subtree of S rooted in x for each possible event (speciation, duplication, or horizontal transfer), based on computations done in previous iterations or the initialization. The recursive formulas for these computations are directly given in the pseudocode.
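The recursive formulas themselves are given in Figures 4 and 5 and are not reproduced here. As a generic illustration of the underlying operation, the following Python sketch selects the k smallest sums of one cost from each of two sorted k-best lists, in the style of the cube pruning method mentioned in the Introduction. It uses a heap and is therefore closer to the “naive” variant than to the queue-free algorithm of this article. The function name `k_best_sums` is hypothetical.

```python
import heapq


def k_best_sums(a, b, k):
    """Return the k smallest values a[i] + b[j], in ascending order.

    `a` and `b` must be sorted in ascending order (for instance, the
    k-best cost lists of the two children of a vertex of G).
    """
    if not a or not b:
        return []
    heap = [(a[0] + b[0], 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        s, i, j = heapq.heappop(heap)
        out.append(s)
        # Only the two grid neighbors of a popped cell can be the next best.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))
    return out
```

Adding a constant event cost to each returned sum would then give candidate scores for one event type; that step is omitted because it depends on the exact formulas in the figures.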

Updating c, subtreeLoss, and subtree in First Phase: Lines 31–33. First, in line 31, we immediately find k-best costs for mapping the subtree of u to the subtree of x [i.e., we compute c(u,x)] by selecting k-best costs from the list that is the combination of $p_{Σ} (u, x)$ , $p_{Δ} (u, x)$ , and $p_{Θ} (u, x)$ . Notice that in this line, we also add the appropriate hypernodes and hyperedges to the hypergraph. event(u,x,i) is defined by the source list [ $p_{Σ} (u, x)$ , $p_{Δ} (u, x)$ , or $p_{Θ} (u, x)$ ] it came from. As before, if the combined list is shorter than k, we add hypernodes with event = NaN and cost = ∞. Second, in lines 32–33, we find k-best costs for mapping the subtree of u to the subtree of some vertex $x'$ in the subtree of x, with and without loss events [i.e., we compute subtreeLoss(u,x) and subtree(u,x)] by selecting k-best costs from the combination of precalculated lists.
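The selection of k-best costs from a combination of precalculated lists can be sketched in Python as a k-way merge of sorted lists. The recurrences used below are not copied from the pseudocode; they are an assumption inferred from the stated meaning of subtree and subtreeLoss: a vertex below x is either x itself or lies below one of the children of x, and every edge descended from x adds one loss cost in subtreeLoss. The function names are hypothetical.

```python
import heapq
from itertools import islice


def k_best(k, *sorted_lists):
    """Return the k smallest items of the union of ascending lists."""
    return list(islice(heapq.merge(*sorted_lists), k))


def update_subtree(k, c_ux, subtree_children):
    """subtree(u, x): best costs of mapping u to x or to any vertex below x.

    `subtree_children` holds subtree(u, y) for each child y of x
    (empty when x is a leaf of S).
    """
    return k_best(k, c_ux, *subtree_children)


def update_subtree_loss(k, c_ux, subtree_loss_children, c_loss):
    """subtreeLoss(u, x): as above, plus one loss per edge descended from x."""
    shifted = [[s + c_loss for s in lst] for lst in subtree_loss_children]
    return k_best(k, c_ux, *shifted)
```

Each call touches at most k items after the merge is set up, so for a binary species tree the work per pair (u, x) is proportional to k in this sketch.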

Updating incomp in Second Phase: Lines 35–38. To compute the lists of the form incomp(u,⋅), in the second phase we iterate over all vertices $x \in I (S)$ with children y and z in preorder. We note that now the traversal of S is in preorder rather than postorder because the computation of a list incomp(u,a) for a vertex $a \in V (S)$ that is not the root of S relies on having already computed the list incomp(u,b) where b is the parent of a in S. Specifically, for a vertex $x \in I (S)$ with children y and z, we compute the list incomp(u,y) by selecting k-best costs from the list that is the combination of incomp(u, x) and subtree(u, z), and symmetrically for incomp(u, z) (swapping the roles of y and z).
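For a fixed gene vertex u, the second phase described above can be sketched in Python as follows. The species tree is traversed in preorder, and for each internal vertex x with children y and z, incomp(u, y) is obtained from incomp(u, x) and subtree(u, z), and symmetrically for z. The function name `compute_incomp`, the dictionary-based tree encoding, and the example values are hypothetical.

```python
import heapq
import math
from itertools import islice


def compute_incomp(children, root, subtree, k):
    """Compute incomp(u, x) for every species vertex x, for one fixed u.

    children: maps each internal vertex of S to its pair of children.
    subtree:  maps each vertex x of S to the ascending list subtree(u, x).
    """
    # No vertex is incomparable to the root, so its list stays at infinity.
    incomp = {root: [math.inf] * k}
    stack = [root]  # parents are always processed before their children
    while stack:
        x = stack.pop()
        kids = children.get(x)
        if not kids:
            continue
        y, z = kids
        incomp[y] = list(islice(heapq.merge(incomp[x], subtree[z]), k))
        incomp[z] = list(islice(heapq.merge(incomp[x], subtree[y]), k))
        stack.extend((y, z))
    return incomp


# Hypothetical example: r -> (a, b), a -> (x1, x2), with k = 2.
inf = math.inf
children = {"r": ("a", "b"), "a": ("x1", "x2")}
subtree = {"x1": [1, inf], "x2": [3, inf], "a": [1, 3],
           "b": [2, inf], "r": [1, 2]}
incomp = compute_incomp(children, "r", subtree, k=2)
```

In this example, the vertices incomparable to `x1` are `x2` and `b`, whose best costs are 3 and 2, so `incomp["x1"]` is `[2, 3]`.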

Lemma 1. Given an instance $(G, S, σ)$ of the DLT problem and a positive integer k, the algorithm correctly constructs a hypergraph $ℋ$ that represents k-best solutions for $(G, S, σ)$ .

Proof of Lemma 1. We prove that for every pair of vertices $u \in V (G)$ and $x \in V (S)$ , and every index $i \in {1, \dots, k}$ , if there exists an ith best DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x, then the hypernode $(u, x, i)$ is inserted into the hypergraph $ℋ$ under construction with association to this scenario. In this lemma, the proof of this claim is done in conjunction with the proof that for every pair of vertices $u \in V (G)$ and $x \in V (S)$ , and every index $i \in {1, \dots, k}$ , the following equalities hold.

subtreeLoss(u, x, i) is the ith best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is a descendant of x, with additional cost of one loss per each edge in the path from x to y.

subtree(u, x, i) is the ith best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is a descendant of x.

incomp(u, x, i) is the ith best cost of a DLT scenario mapping the subtree of G rooted in u to some subtree of S whose root is a vertex y that is incomparable to x.

The proof is by induction on the order of computation.

In particular, table₁(u₁,x₁)<table₂(u₂,x₂) where table₁,table₂∈{c,subtreeLoss,subtree,incomp} if one of the following conditions holds:

The basis of the induction comprises the computation of hypernodes of the form $(u, x, i)$ where $u \in L (G)$ . To prove its correctness, consider such a hypernode $(u, x, i)$ . If $i = 1$ and $x = σ (u)$ , then the algorithm inserts the hypernode $(u, x, 1)$ , assigning it a score of 0, and setting the remaining fields as follows: subtreeLoss = 0, subtree = 0, and event = leaf; else, if $i = 1$ and $x \geq_{S} σ (u)$ , then the algorithm inserts the hypernode $(u, x, 1)$ , assigning it a score of 0 and setting the remaining fields as follows: subtreeLoss = $c_{LOSS} \cdot d_{S} (x, σ (u))$ , subtree = 0, and event = leaf; otherwise, the algorithm does not insert the hypernode (more precisely, it inserts a placeholder, whose event is NaN, with score $\infty$ and subtree value $\infty$ ). In all cases, the incomp value remains $\infty$ as in its creation. The correctness of these operations directly follows from the definitions of subtreeLoss, subtree, and incomp, and the fact that the only possible DLT scenario in this case maps u to an ancestor of $σ (u)$ , and the score of this match is 0 in case losses are not counted, or with the additional loss costs otherwise.

For the inductive step, we consider some pair of vertices $u \in I (G)$ and $x \in V (S)$ along with a table table ∈ {c, subtreeLoss, subtree, incomp}, and prove that the values in table of the supernode $(u, x)$ are computed correctly. For the inductive assumption, suppose that for every triple (table′, u′, x′) ordered before (table, u, x), the values in table′ of $(u', x')$ have already been computed correctly. Here we provide a proof for table = c and table = incomp. The full proof can be found in Zoller et al. (2019), section 3.

First, consider the case where table = incomp. By the pseudocode, if x is the root of S, then incomp(u, x) does not contain any item (having score different from $\infty$ ) as in its creation, which is correct because in this case, there exists no vertex incomparable to x, and hence, we cannot map one of the children of u as required in the definition of the DLT scenarios that correspond to incomp(u, x). Therefore, now suppose that x is not the root of S, and let p denote the parent of x in S, and s denote the sibling of x in S (i.e., the other child of p in S). Then, by the pseudocode, incomp(u, x) consists of the k-best scores from the lists incomp(u, p) and subtree(u, s). Observe that these two lists have already been computed. Thus, by the inductive hypothesis, incomp(u, p) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in some vertex incomparable to p, and subtree(u, s) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in some descendant of s. Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in some vertex incomparable to x if and only if it is a DLT scenario that maps the subtree of G rooted in u to one of the following subtrees: (i) a subtree of S rooted in some vertex incomparable to p; (ii) a subtree of S rooted in some descendant of s. Thus, it follows that incomp(u, x) is computed correctly.

Second, consider the case where $t a b l e = c$ . By line 31 of the pseudocode, $c (u, x)$ consists of the k-best scores from the lists $p_{Σ} (u, x)$ , $p_{Δ} (u, x)$ , and $p_{Θ} (u, x)$ . Thus, to prove the correctness of the computation of $c (u, x)$ , it suffices to prove that the following statement holds: $p_{Σ} (u, x)$ , $p_{Δ} (u, x)$ , and $p_{Θ} (u, x)$ consist of k-best DLT scenarios mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation, duplication, and horizontal transfer, respectively.

Toward the proof of the statement, consider the list $p_{Σ} (u, x)$ . If $x \in L (S)$ , then because $u \in I (G)$ , there does not exist a DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation, and hence, the assignment of $\infty$ to every element $p_{Σ} (u, x, i)$ of the list is correct. Now, suppose that $x \in I (S)$ . Then, by the pseudocode, $p_{Σ} (u, x)$ consists of the k-best scores present in the following multisets:

Observe that the lists $s u b t r e e L o s s (v, y), s u b t r e e L o s s (w, z)$ , $s u b t r e e L o s s (w, y)$ , and $s u b t r e e L o s s (v, z)$ have already been computed. By the definition of $L o s s_{α} (u)$ when the event occurred in u is speciation, $L o s s_{α} (u) = | d_{S} (x, γ (v)) - 1 | + | d_{S} (x, γ (w)) - 1 | = d_{S} (y, γ (w)) + d_{S} (z, γ (v))$ in case $γ (w) \leq_{S} y$ , and $L o s s_{α} (u) = | d_{S} (x, γ (v)) - 1 | + | d_{S} (x, γ (w)) - 1 | = d_{S} (z, γ (w)) + d_{S} (y, γ (v))$ otherwise. Thus, by the inductive hypothesis, $s u b t r e e L o s s (v, y)$ (respectively, $s u b t r e e L o s s (w, z)$ , $s u b t r e e L o s s (w, y)$ , and $s u b t r e e L o s s (v, z)$ ) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (respectively, $w, w$ and v) to a subtree of S rooted in some descendant of y (respectively, $z, y$ and z), with additional loss cost for each edge in the path from y to $γ (v)$ (respectively, $γ (w)$ , $γ (w)$ , and $γ (v)$ ). Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is speciation if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in v to a subtree of S rooted in a descendant of one child (y or z) of x, and the subtree of G rooted in w to a subtree of S rooted in a descendant of the other child of x. Thus, it follows that $p_{Σ} (u, x)$ is computed correctly.
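To make the combination step for $p_{Σ} (u, x)$ concrete, the following Python sketch selects the k best sums of two ascending score lists and then merges the two possible child assignments. It is an illustration under assumptions: the multisets are taken to pair $s u b t r e e L o s s (v, y)$ with $s u b t r e e L o s s (w, z)$, and $s u b t r e e L o s s (w, y)$ with $s u b t r e e L o s s (v, z)$, as the argument above suggests; the names `k_best_sums`, `p_sigma`, and `c_sigma` are hypothetical. The heap-based selection used here runs in $O (k \log k)$ time, so it is a simpler stand-in and not the $O (k)$ procedure assumed in Observation 1.

```python
import heapq
from itertools import islice
from math import inf


def k_best_sums(a, b, k):
    """Return the k smallest values a[i] + b[j] for ascending lists a and b."""
    if not a or not b:
        return [inf] * k
    heap = [(a[0] + b[0], 0, 0)]  # frontier of candidate index pairs
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        score, i, j = heapq.heappop(heap)
        out.append(score)
        # Only the two neighbours of (i, j) can become the next candidates.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (a[ni] + b[nj], ni, nj))
    return out + [inf] * (k - len(out))


def p_sigma(sl_vy, sl_wz, sl_wy, sl_vz, k, c_sigma=0.0):
    """k best speciation scores from the two child-to-subtree assignments."""
    first = k_best_sums(sl_vy, sl_wz, k)   # v below y, w below z
    second = k_best_sums(sl_wy, sl_vz, k)  # w below y, v below z
    return [s + c_sigma for s in islice(heapq.merge(first, second), k)]
```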

Now, consider the list $p_{Δ} (u, x)$ . In case $x \in I (S)$ , let y and z denote its children. By the pseudocode, $p_{Δ} (u, x)$ consists of the k-best scores obtained by adding $c_{Δ}$ to the costs present in the following multisets, where only the first one is relevant in case $x \in L (S)$ :

Observe that the lists $c (v, x), c (w, x), s u b t r e e L o s s (w, y)$, $s u b t r e e L o s s (v, z)$, $s u b t r e e L o s s (w, z)$, and $s u b t r e e L o s s (v, y)$ have already been computed. By the definition of $L o s s_{α} (u)$ when the event that occurred at u is duplication, $L o s s_{α} (u) = d_{S} (x, γ (v)) + d_{S} (x, γ (w))$. If one child of u is mapped to x and the other child is mapped to a subtree of S rooted in y or z, it holds that $L o s s_{α} (u) = d_{S} (y, γ (w)) + 1$ when v is mapped to x and w is mapped below y (respectively $L o s s_{α} (u) = d_{S} (y, γ (v)) + 1$ when w is mapped to x and v below y, $L o s s_{α} (u) = d_{S} (z, γ (w)) + 1$ when v is mapped to x and w below z, and $L o s s_{α} (u) = d_{S} (z, γ (v)) + 1$ when w is mapped to x and v below z). If both v and w are mapped to x, $L o s s_{α} (u) = 0$. Finally, if each of v and w is mapped to a subtree of S rooted in y or z, it holds that $L o s s_{α} (u) = d_{S} (y, γ (v)) + d_{S} (z, γ (w)) + 2$ when v is mapped below y and w below z (respectively $L o s s_{α} (u) = d_{S} (y, γ (v)) + d_{S} (y, γ (w)) + 2$ when both are mapped below y, $L o s s_{α} (u) = d_{S} (z, γ (v)) + d_{S} (y, γ (w)) + 2$ when v is mapped below z and w below y, and $L o s s_{α} (u) = d_{S} (z, γ (v)) + d_{S} (z, γ (w)) + 2$ when both are mapped below z).

Thus, by the inductive hypothesis, we have that (i) $c (v, x)$ (respectively $c (w, x)$) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (respectively w) to the subtree of S rooted in x, and (ii) $s u b t r e e L o s s (v, y)$ (respectively $s u b t r e e L o s s (w, z)$, $s u b t r e e L o s s (w, y)$, and $s u b t r e e L o s s (v, z)$) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (respectively $w, w$ and v) to the subtree of S rooted in some descendant of y (respectively $z, y$ and z) with additional loss cost for each edge in the path from y to $γ (v)$ (respectively $γ (w), γ (w)$, and $γ (v)$). Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is duplication if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in v to a subtree of S rooted in a descendant of x (which can be x itself), and the subtree of G rooted in w to a subtree of S rooted in a descendant of x (which can be x itself). Thus, it follows that $p_{Δ} (u, x)$ is computed correctly.

Finally, consider the list $p_{Θ} (u, x)$ . If x is the root of S, then there does not exist a DLT scenario mapping the subtree of G rooted in u to the subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is horizontal transfer (because there is no vertex incomparable to x to whom one of the children of u should be mapped), and hence, it is correct that each element $p_{Θ} (u, x, i)$ remains with the assignment of $\infty$ as it was created. Now, suppose that x is not the root of S. Then, by the pseudocode, $p_{Θ} (u, x)$ consists of the k-best scores present in the following multisets:

Observe that the lists $s u b t r e e L o s s (v, x), s u b t r e e L o s s (w, x)$, $i n c o m p (w, x)$, and $i n c o m p (v, x)$ have already been computed. By the definition of $L o s s_{α} (u)$ when $(u, w) \in Ξ$, $L o s s_{α} (u) = d_{S} (x, γ (v))$. Thus, by the inductive hypothesis, we have that (i) $s u b t r e e L o s s (v, x)$ (respectively $s u b t r e e L o s s (w, x)$) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (respectively w) to a subtree of S rooted in some descendant of x (which can be x itself), with additional loss cost for each edge in the path from x to $γ (v)$ (respectively $γ (w)$), and (ii) $i n c o m p (v, x)$ (respectively $i n c o m p (w, x)$) consists of the scores of the k-best DLT scenarios mapping the subtree of G rooted in v (respectively w) to a subtree of S rooted in some vertex incomparable to x. Notice that a DLT scenario maps the subtree of G rooted in u to a subtree of S rooted in x under the constraint that the event corresponding to the matching of u and x is horizontal transfer if and only if it is a DLT scenario that matches u and x, maps the subtree of G rooted in one of the children of u (v or w) to a subtree of S rooted in a descendant of x (which can be x itself), and the subtree of G rooted in the other child of u to a subtree of S rooted in a vertex incomparable to x. Thus, it follows that $p_{Θ} (u, x)$ is computed correctly.

Observation 1. Given an instance $(G, S, σ)$ of the DLT problem and a positive integer k, the algorithm runs in time $O (m \cdot n \cdot k)$ and requires $O (m \cdot n \cdot k)$ space.

Proof of observation 1. For each pair of vertices $u \in V (G)$ and $x \in V (S)$ , we construct a tuple of lists [ $p_{Σ} (u, x), p_{Δ} (u, x), p_{Θ} (u, x), s u b t r e e L o s s (u, x),$ $s u b t r e e (u, x), i n c o m p (u, x)$ ]. From the pseudocode, it is clear that the computation of each one of these lists is done in time $O (k)$ . Thus, we have that the total running time is $O (m \cdot n \cdot k)$ . As space is bounded by time, the observation follows.

4.2. Stage 2: Assigning probabilities

In the second stage, we assign a probability to each hypernode in $ℋ$ , so that a hypernode with best score has the highest probability, and hypernodes with score $\infty$ (the worst possible score) have probability 0.

4.2.1. Weight computation

Let $γ \in ℛ^{+}$ be a user-specified parameter that controls the gap between the probabilities assigned to poorly scoring nodes and those assigned to top-scoring nodes. As $γ$ decreases, hypernodes with higher (worse) scores are assigned probabilities much lower than hypernodes with lower scores.

Denote $r = \mathit{root}$, and let $m (r)$ be the largest integer $i \in \{1, \dots, k\}$ such that $c (r, i) \neq \infty$. [Recall that the notation $(\mathit{root}, i)$ was defined in Section 3.] For a node $(r, i)$ where $i \in \{1, \dots, m (r)\}$, define $w' (r, i) = e^{γ \frac{c (r, 1) - c (r, i)}{c (r, 1) - c (r, m (r))}}$. Then, the weight of a node $(r, i)$, which stands for the (unconditional) probability that the scenario described by $(r, i)$ happens, is defined as follows: if $i \in \{1, \dots, m (r)\}$, then $w (r, i) = \frac{w' (r, i)}{\sum_{j = 1}^{m (r)} w' (r, j)}$; otherwise (i.e., if $i \in \{m (r) + 1, m (r) + 2, \dots, k\}$), $w (r, i) = 0$.
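The root-weight definition can be sketched in a few lines of Python. The sketch transcribes the formula for $w' (r, i)$ exactly as printed and makes two assumptions that are not stated in the text: the root scores are stored in ascending order, so that the finite entries form a prefix of length $m (r)$, and the degenerate case $c (r, 1) = c (r, m (r))$ is handled by giving all finite entries equal weight. The name `root_weights` is hypothetical.

```python
from math import exp, inf


def root_weights(scores, gamma):
    """Weights w(r, 1..k) from ascending root scores; inf marks unused slots."""
    finite = [s for s in scores if s != inf]  # the first m(r) entries
    if not finite:
        return [0.0] * len(scores)
    best, worst = finite[0], finite[-1]
    span = best - worst  # c(r, 1) - c(r, m(r)), never positive
    if span == 0:
        raw = [1.0] * len(finite)  # assumption: equal scores share the weight
    else:
        # The ratio (best - s) / span is 0 for the best score, 1 for the worst.
        raw = [exp(gamma * (best - s) / span) for s in finite]
    total = sum(raw)
    return [x / total for x in raw] + [0.0] * (len(scores) - len(finite))
```

Note that, with the formula exactly as printed, the exponent is 0 for the best score, so the sign of $γ$ determines whether better or worse scores receive more weight; the sketch does not attempt to settle which convention RSAM-finder uses.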

We now turn to define the weight of a hypernode $(u, x, i)$, which should stand for the (unconditional) probability that the scenario described by $(u, x, i)$ happens. The definition is recursive. In the basis, where u is the root of G, we define $w (u, x, i)$ (for any $x \in V (S)$ and $i \in \{1, \dots, k\}$) as follows: if there exists an index $j \in \{1, \dots, k\}$ such that $(r, j)$ is derived from $(u, x, i)$ (here, it means that they represent the same scenario), then $w (u, x, i) = w (r, j)$; otherwise, $w (u, x, i) = 0$.

Now, consider v that is not the root of G. We define $w (v, y, i)$ (for any $y \in V (S)$ and $i \in \{1, \dots, k\}$) as follows. First, let $D (v, y, i)$ denote the collection of nodes $(u, x, j)$ such that $c (u, x, j)$ was derived from $c (v, y, i)$; in other words, the hypergraph has a hyperedge directed from $(v, y, i)$ (and some other node) to $(u, x, j)$. In particular, u is the parent of v in G, hence the weight $w (u, x, j)$ is calculated before the weight $w (v, y, i)$. Then, define $w (v, y, i) = \sum_{(u, x, j) \in D (v, y, i)} w (u, x, j)$.

Lemma 2. For any two compatible $u \in L (G)$ and $x \in L (S)$ , $w (u, x, 1) = 1$ .

Proof of Lemma 2. We will verify a stronger property than the one in the statement of the lemma: for every vertex u in the gene tree G, it holds that

$\sum_{x, i : (u, x, i) \in V (ℋ)} w (u, x, i) = 1 .$

Before we verify this property, observe that when u is a leaf, then $c (u, x, 1) = 0$ for the unique vertex x that is compatible with u, and $c (u, x, i) = \infty$ (which means that $D (u, x, i) = \emptyset$ and hence $w (u, x, i) = 0$ ) for any other pair $(x, i)$ . Thus, the stronger property implies the correctness of the weaker statement regarding leaves.

To prove the (stronger) property above, we use induction. In the basis, u is the root of the gene tree G. Then, we have that $\sum_{x, i : (u, x, i) \in V (ℋ)} w (u, x, i) = \sum_{i \in \{1, 2, \dots, k\}} w (r, i) = 1$, and therefore the property holds. Now, suppose that u is not the root of G, and that the property holds for each of its ancestors. Let v be the parent of u in G. Then, we have that

$\sum_{x, j : (u, x, j) \in V (ℋ)} w (u, x, j) = \sum_{x, j : (u, x, j) \in V (ℋ)} \sum_{(v, y, i) \in D (u, x, j)} w (v, y, i) = \sum_{y, i : (v, y, i) \in V (ℋ)} w (v, y, i) = 1 .$

Here, the first equality follows directly from the definition of weights. The second equality follows from the fact that each hypernode $(v, y, i)$ (for any y and i) that has positive weight is derived from exactly one hypernode $(u, x, j)$ (for some specific x and j). [However, each hypernode $(u, x, j)$ can be used to derive several hypernodes $(v, y, i)$ .] The last equality follows from the inductive hypothesis. This completes the proof.
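The recursive weight definition and the invariant used in the proof can be checked on a toy instance with the following Python sketch. The data layout is an assumption made for illustration: a hypernode is a tuple (gene vertex, species vertex, index), and `derivation` maps each hypernode of an internal gene vertex to the two child hypernodes it was derived from. The names `propagate_weights`, `derivation`, and `gene_order`, as well as the toy vertices, are hypothetical.

```python
from collections import defaultdict


def propagate_weights(root_weights, derivation, gene_order):
    """Push weights from the gene-tree root down to the leaves.

    root_weights : {hypernode: weight} for hypernodes of the root of G
    derivation   : {parent hypernode: (child hypernode, child hypernode)}
    gene_order   : gene vertices listed top-down (parents before children)
    """
    weight = defaultdict(float)
    weight.update(root_weights)
    parents_at = defaultdict(list)
    for parent in derivation:
        parents_at[parent[0]].append(parent)
    for u in gene_order:
        # All hypernodes of u are final here: they only receive weight from
        # hypernodes of the parent of u, which was processed earlier.
        for parent in parents_at[u]:
            for child in derivation[parent]:
                weight[child] += weight[parent]
    return dict(weight)


# Toy instance: r has children a and b; a has children c and d.
root_weights = {("r", "X", 1): 0.75, ("r", "X", 2): 0.25}
derivation = {
    ("r", "X", 1): (("a", "Y", 1), ("b", "Z", 1)),
    ("r", "X", 2): (("a", "Y", 2), ("b", "Z", 1)),
    ("a", "Y", 1): (("c", "P", 1), ("d", "Q", 1)),
    ("a", "Y", 2): (("c", "P", 1), ("d", "Q", 1)),
}
weights = propagate_weights(root_weights, derivation, ["r", "a", "b", "c", "d"])

# Total weight per gene vertex, the quantity bounded in the proof of Lemma 2.
totals = defaultdict(float)
for (u, _x, _i), value in weights.items():
    totals[u] += value
```

In this toy instance each leaf has a single hypernode, so the per-vertex totals coincide with the leaf weights stated in Lemma 2.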

Observation 2. Time and space complexity: iterating over the hypergraph takes $O (| V (ℋ) |) = O (m \cdot n \cdot k)$ time and space.

4.3. Stage 3.1: Pattern discovery

The current version of RSAM-finder allows pattern queries to be specified as follows. A pattern specification consists of a tuple $(E V, c o l o r, d i s t a n c e)$ where:

$E V \subseteq \{S, D, H T\}$ specifies the evolutionary event of the pattern ($S$ for speciation, $D$ for duplication, and $H T$ for horizontal transfer).

$c o l o r \in \{r e d, b l a c k, N o n e\}$ specifies a color representing the environmental niche to which the sought RSAM confers adaptation.

$d i s t a n c e \in \{T r u e, F a l s e\}$ is a Boolean indicator specifying whether or not to consider edge lengths (representing evolutionary distances) in the pattern specification.

For a colored query (having the second parameter in the specification set to $r e d$ or $b l a c k$), the user can provide, as part of the input, a function $c o l o r s : L (X) \to ϒ$ where X specifies whether the pattern refers to a subtree of S or a subtree of G, and $ϒ = \{r e d, b l a c k\}$. Here, colors represent a binary environmental annotation of the leaves. Then, a preprocessing step is applied, in which the vertices of S and G are colored based on the colors assigned to the leaves of the subtree they root. We defer the technical details of the implementation of this preprocessing step to section 1 of Zoller et al. (2019).

In addition to the settings described above, the user can select one of two modes:

Single-pattern mode. In this mode, the user specifies a single pattern and a threshold, and the sought RSAMs are identified as nodes $u \in I (G)$ such that G_u is enriched in the pattern, and $| V (G_{u}) |$ is bounded from below by the specified threshold.

Dual-pattern (contrasting) mode. In this mode, the user specifies two patterns and one threshold, and the sought RSAMs are identified as nodes $u \in I (G)$ with children $v, w \in V (G)$ such that G_v is enriched with one pattern, while G_w is enriched with the other pattern. Here, the subtree size-bound threshold refers to $| V (G_{v}) |$ and $| V (G_{w}) |$ .

The Pattern Identification algorithm proceeds as follows.

(1) For each pattern $P = (E V, c o l o r, d i s t a n c e)$ and for each hypernode $(u, x, i) \in V (ℋ)$, check whether both $e v e n t (u, x, i) \in E V$ and the colors obey the requirements derived from the $c o l o r$ field of the pattern specification [described in more detail in section 1 of Zoller et al. (2019)]. If so, mark $(u, x, i)$ as interesting.

(2) Reflect the interesting nodes identified in $ℋ$ to G, by assigning corresponding weights to $V (G)$. Each $u \in I (G)$ is assigned a score, which is the sum of the probabilities of instances of the pattern found in G_u, normalized by the number of possible events in G_u. Additional bookkeeping details regarding how this score is computed are given in Section 4.4.

(3) Based on the specified mode of the query (single pattern or dual pattern), identify the t top scoring vertices $u \in I (G)$. In case of a single-pattern mode, the scores are as defined in (2). In case of dual-pattern mode, let P₁ and P₂ be the patterns. For each $u \in V (G)$ with children $v, w \in V (G)$, the score of u is the score of v for P₁ [as defined in (2)] plus the score of w for P₂, and vice versa (i.e., each vertex is assigned two scores).
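The marking step and the dual-pattern scoring can be sketched in Python as follows. This is a simplified illustration, not the RSAM-finder implementation: the color requirement is reduced to an equality test on a single precomputed vertex color (the actual requirements are deferred to Zoller et al., 2019), the $d i s t a n c e$ field and the subtree-size threshold are ignored, and ranking a vertex by the larger of its two dual scores is an assumption. The names `Pattern`, `is_interesting`, `dual_scores`, and `top_t` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Pattern:
    ev: FrozenSet[str]     # subset of {"S", "D", "HT"}
    color: Optional[str]   # "red", "black", or None
    distance: bool         # carried along but unused in this sketch


def is_interesting(event, node_color, pattern):
    """Marking test for one hypernode, given its event and vertex color."""
    if event not in pattern.ev:
        return False
    return pattern.color is None or node_color == pattern.color


def dual_scores(children, score_p1, score_p2):
    """Two scores per internal vertex u with children (v, w):
    score_p1[v] + score_p2[w], and the same with v and w swapped."""
    return {
        u: (score_p1[v] + score_p2[w], score_p1[w] + score_p2[v])
        for u, (v, w) in children.items()
    }


def top_t(scores, t):
    """The t vertices whose better dual score is largest (an assumption)."""
    return sorted(scores, key=lambda u: max(scores[u]), reverse=True)[:t]
```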

Observation 3. Time and Space Complexity: Iterating over the hypergraph takes $O (| V (ℋ) |) = O (m \cdot n \cdot k)$ time and space.

4.4. Stage 3.2: Score computation

For each defined pattern $P = (E V, c o l o r, d i s t a n c e)$, let $c o u n t e r_{P} : V (G) \to ℛ^{+}$ be a counter, initialized by 0. For each $u \in I (G)$, let v and w be its right and left children, respectively. Let $ℐ_{u} = \{(u, x, i) \in V (ℋ) : (u, x, i) \text{ is marked as interesting with respect to } P\}.$

That is, for each vertex $x \in V (S)$ and $i \in {1, \dots, k}$ such that $(u, x, i) \in V (ℋ)$ was marked as interesting in stage 1 with respect to pattern P, $(u, x, i) \in ℐ_{u}$ . Let

where $w (u, x, i)$ are the probabilities assigned in Section 4.2. Intuitively, for each vertex $u \in V (G)$ , we calculate its probability to be interesting, with respect to the patterns we defined. To avoid a bias due to variation in the sizes of the subtrees rooted by the competitively estimated nodes in G, we normalize each value by the number of edges in the subtree rooted in the vertex times k, which is an upper bound on the number of possible patterns in all the solutions. That is, for each vertex $u \in V (G)$ and pattern P, let $c o u n t e r_{P} (u) = \frac{c o u n t e r_{P} (u)}{| E (G_{u}) | \cdot k}$ .
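The bookkeeping of this stage can be sketched in Python under one explicit assumption: the raw counter of an internal vertex u is taken to be the total weight of the interesting hypernodes in $ℐ_{u}$ plus the raw counters of its two children, which matches the description of the score as a sum over instances of the pattern found in G_u. The function name `pattern_scores` and its arguments are hypothetical. The raw counters are kept separate from the normalized scores so that a normalized child value never enters the sum of its parent.

```python
def pattern_scores(children, root, interesting_weight, k):
    """Normalized pattern score for every internal vertex of G.

    children           : {u: (v, w)} for internal vertices; leaves are absent
    interesting_weight : {u: total weight w(u, x, i) over the set I_u}
    k                  : number of best solutions kept per supernode
    """
    raw, edges, score = {}, {}, {}

    def visit(u):
        if u not in children:  # leaf: no events and no edges below it
            raw[u], edges[u] = 0.0, 0
            return
        v, w = children[u]
        visit(v)
        visit(w)
        raw[u] = interesting_weight.get(u, 0.0) + raw[v] + raw[w]
        edges[u] = edges[v] + edges[w] + 2  # |E(G_u)|
        score[u] = raw[u] / (edges[u] * k)

    visit(root)
    return score
```

The recursion is kept for readability; for very deep gene trees an explicit stack would avoid Python's recursion limit.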

5. Experimental Results

We implemented the algorithm described in this article as a tool, denoted RSAM-finder, and made it publicly available via GitHub (Zoller, 2019).

In this section, we test and exemplify the performance of RSAM-finder. The tests are based on large-scale simulations, where we demonstrate the engine's tolerance to noise (Section 5.2) and measure the practical running times of the proposed hypergraph construction algorithm as a function of increasing input size (Section 5.3). In Section 5.4, we exemplify an application of our proposed approach to the discovery and analysis of RSAMs in a beta-lactamase gene. However, first, in Section 5.1, we give the technical details regarding our simulations, tests, and experiments.

5.1. Methods and databases

Genes in our experiment are represented by their membership in clusters of orthologous groups (COGs) (Tatusov et al., 2000). The STRING database (Szklarczyk et al., 2016) was used to extract the chromosomal protein sequences for the COGs of interest, annotated with their corresponding species names as well as the corresponding NCBI IDs. Protein sequences were subjected to multiple sequence alignment and dendrogram construction via Clustal Omega (Sievers and Higgins, 2018). The list of NCBI IDs was used as input for NCBI Taxonomy Browser, which provided a (nonbinary) species tree. Both gene and species trees were converted to binary trees via the Ape R package (Popescu et al., 2012). Habitat labels for the species were extracted from PATRIC, and missing tags were manually annotated using information from the GOLD database (Mukherjee et al., 2016) and from the literature. CD Search (Marchler-Bauer and Bryant, 2004) was used to seek statistically significant discriminating domain-level mutations (i.e., the gain or loss of a protein functional domain). The simulator and our algorithm were implemented in Python, using the NetworkX package, DendroPy (Sukumaran and Holder, 2010), and the ETE Toolkit (Huerta-Cepas et al., 2016). Visualizations of the trees and plots were created using Matplotlib and Seaborn.

For the simulation-based experiments, we generated random binary trees. The generation of a random binary tree was done in a top-down manner, using the ETE Toolkit (Huerta-Cepas et al., 2016). We began with a given set of vertices, based on which we created a random binary tree. The tree was duplicated and one copy was denoted G, while the other was denoted S. The function $σ : L (G) \to L (S)$ was implemented as the matching between each leaf in G to its copy in S, and the function $c o l o r : L (S) \to {r e d, b l a c k}$ was implemented as a random binary function.
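The simulation setup described above can be approximated with the following self-contained Python sketch. It is a stand-in written with the standard library only and does not use the ETE Toolkit calls of the actual simulator; the nested-tuple tree representation, the leaf names, and the functions `random_binary_tree`, `leaf_names`, and `simulate` are all hypothetical.

```python
import random


def random_binary_tree(leaves, rng):
    """Build a binary tree top-down by recursively splitting the leaf set."""
    if len(leaves) == 1:
        return leaves[0]
    leaves = list(leaves)
    rng.shuffle(leaves)
    cut = rng.randint(1, len(leaves) - 1)  # both sides stay non-empty
    return (random_binary_tree(leaves[:cut], rng),
            random_binary_tree(leaves[cut:], rng))


def leaf_names(tree):
    """Leaves of a nested-tuple tree, listed left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_names(tree[0]) + leaf_names(tree[1])


def simulate(n_leaves, seed=0):
    """Return (G, S, sigma, color) for one simulated instance."""
    rng = random.Random(seed)
    names = [f"sp{i}" for i in range(n_leaves)]
    S = random_binary_tree(names, rng)
    G = S  # tuples are immutable, so the same structure serves as the copy
    sigma = {leaf: leaf for leaf in names}  # each gene leaf maps to its copy
    color = {leaf: rng.choice(["red", "black"]) for leaf in names}
    return G, S, sigma, color
```

Fixing the seed makes a simulated instance reproducible, which is convenient when the same input trees are reused across noise levels.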

To implant the pattern in the resulting random trees, we picked a random vertex $u \in V (G)$ , and modified the function $σ : L (G) \to L (S)$ for all vertices $w \in L (G_{u})$ in a way that created a horizontal transfer event. To this end, consider a vertex $w \in V (G_{u})$ . Vertex w is made to represent a horizontal transfer event as follows. Let $x \in V (S)$ be the copy of w in S. Let $L (S_{x})$ denote the copy of $L (G_{w})$ (the leaves of the subtree rooted in w) in the species tree, and thus, the function $σ$ maps each leaf of G_w to its copy in the leaves of S_x. Then, to create a horizontal transfer in w, we need to find a vertex $y \in V (S)$ such that y and x are incomparable, and change the mapping of the leaves of G_w to the leaves of S_y randomly—that is, for each vertex $r \in G_{w}$ define $σ (r)$ to be a random vertex $z \in S_{y}$ . This is likely to create a horizontal transfer in the DLT-reconciliation. Recall that in addition, we want to make those planted horizontal transfer events red-to-red events. To achieve this, we check to see if the random vertices u and y, which are the source and the target of the horizontal transfer, are “mostly red,” as defined in section 1 of Zoller et al. (2019). If they are not, we make another random choice and check the colors again. The query pattern $({H T}, r e d, T r u e)$ was used in the simulation-based experiments. According to this pattern, we sought subtrees that are enriched in red-to-red horizontal transfer events. (For additional details, see Section 4.3.)

5.2. Testing for noise tolerance

We tested our tool on a random data set that was generated as described above, by introducing into the simulations an additional “noise factor” affecting horizontal transfers and colors. Each noise level represents the level of random changes in $σ$ and random colors of the species. In particular, a noise level of 0% means that no changes were done to the mapping between the leaves of the gene tree to the leaves of the species tree except those of the planted pattern, and no change was made to the function $c o l o r : L (S) \to {r e d, b l a c k}$ , while a noise level of 100% means that the mappings of all of the vertices of the gene tree were randomly picked, and all of the species colors were randomly picked again.

Figure 6A demonstrates the advantage of our approach across different noise levels, following the strategy described above to generate randomized phylogenetic gene and species trees with a planted pattern. First, we constructed random phylogenetic species and gene trees with 600 leaves and one planted pattern (marked in purple in Fig. 6A). For each noise level between 0% and 20%, we constructed the corresponding hypergraph. For all experiments, we used $k = 100$ , set the minimum size of a subtree to 0.1% of the number of all edges, and set $c_{Δ} = c_{Θ} = 1 .$ Results for each noise level were computed as an average of 50 random choices for the same noise level, on the same input trees. The scores are as defined in Section 4.3.

FIG. 6.

(A) The scores of the vertices in different noise levels on the input. The purple dots represent the planted vertex, and they obey the sought pattern. (B) Running times of the naive and efficient algorithms. Green triangles represent the efficient version, and red circles represent the running times of the naive algorithm.

We found that, at the lower noise levels, the score of the planted vertex $u \in V (G)$ is higher than that of any other vertex, and this difference decreases as the noise level increases. Note that the additional noise increases the number and scores of false positives found. These findings support the claim that our method is able to find a pattern within noisy data.

5.3. Running time measurements

To demonstrate the practicality of the theoretical improvements presented in Section 4.1, we compared the running times of the efficient, $O (m n k)$ time algorithm for hypergraph construction proposed in Section 4.1, versus the naive, $O (m n (n + k) \log (n + k))$ time algorithm mentioned in Section 1.

The inputs to the compared algorithms were generated as follows. We picked random binary trees denoted S and G, with number of leaves ranging from 100 to 1000. For each number of leaves, we randomly created 10 such pairs of trees, and ran both the naive and the efficient algorithms on both data sets.

Figure 6B summarizes the measured time results. The green triangles correspond to the average of the time measured for the efficient version of the algorithm, and the red circles correspond to an average of the time measured for the naive version of the algorithm.

We found that as we increase the number of leaves, the differences in practical running times between the naive and the efficient algorithms become more significant. Furthermore, as expected in practice, the running time of the efficient algorithm is linear in the input size, while that of the naive one behaves as a nonlinear function.

5.4. Example: RSAM discovery in a beta-lactamase

Beta-lactamases are versatile enzymes, conferring resistance to the beta-lactam antibiotics, found in a diversity of bacterial sources. Their commonality is the ability to hydrolyze chemical compounds containing a beta-lactam ring (Bush, 2018). The secretion of antimicrobial compounds is an ancient mechanism with clear survival benefits for microbes competing with other microorganisms. Consequently, mechanisms that confer resistance are also ancient and may represent an underestimated reservoir in environmental bacteria (Bush, 2018). Antibiotic resistance factors, conferring adaptation to the pathogenesis environment, are widely spread by horizontal gene transfer mechanisms such as conjugation, transformation, and transduction (Navarro, 2006; Poirel et al., 2009). The persistent exposure of bacterial strains to a multitude of beta-lactams has induced dynamic and continuous production and mutation of beta-lactamases in these bacteria, expanding their activity even against the newly developed beta-lactam antibiotics (Stapleton et al., 2016). Thus, an important objective is to identify mutations in beta-lactamase genes conferring adaptation to human and animal hosts.

Motivated by the above, we exemplify a microbiological application of RSAM-finder to the discovery of RSAMs in beta-lactamase genes that confer adaptation to human and animal hosts. To this end, we use the pattern $((\{H T\}, r e d, T r u e), (\{S, D, H T\}, b l a c k, F a l s e))$. Here, colors represent a binary environmental annotation: species associated with human and animal hosts (219 species) were annotated “red,” while species associated with all other habitats (324 species), such as soil, water, and plant, were annotated “black.”

Among the known classes (A–D) of beta-lactamase, class D (represented by COG2602) is considered to be the most diverse (Evans and Amyes, 2014). Thus, we selected COG2602 (622 genes in 543 genomes) as the data set for our example. Parameters were set as follows: $k = 50$, the minimum size required per sought subtree was set to $0.1$ of the total number of leaves of G, $c_{Δ} = c_{Θ} = 1$, and $c_{Σ} = c_{L o s s} = 0$. A figure displaying G, where the top-ranking RSAM node is marked with a star, is given in Zoller et al. (2019). Also provided are the corresponding sequences, a figure displaying the corresponding S, and $σ$.

Within the top-ranking result for this query, we were interested in the subtree matching the first part of the pattern (i.e., enrichment in red-to-red HT edges). The gene set represented by the leaves of this subtree (denoted “identified gene set”) was found to be enriched in an additional domain, BlaR, a signal transducer membrane protein regulating beta-lactamase production (87/119 in the identified gene set vs. 118/622 in the background, p-value = 3.94e-52). The only transcriptional regulator currently known for beta-lactamase genes is the repressor protein BlaI, previously predicted to operate in a two-component regulatory system together with BlaR in Class A beta-lactamase (Alksne and Rasmussen, 1997). The positions adjacent to the instances of the identified gene set in the corresponding genomes were found to be enriched in BlaI (70/119 of the identified gene set instances vs. 90/622 of the background gene set instances, hypergeometric, p-value = 1.11e-41). Note that this result was obtained with $c_{L o s s}$ set to 0. When repeating the experiment with $c_{L o s s} = 1$, this result is still found among the two top ranking vertices.
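For readers who wish to examine enrichment tests of this kind, the following Python sketch computes an exact upper-tail hypergeometric probability with integer arithmetic. It is an illustration only: the function name `hypergeom_upper_tail` is hypothetical, and treating the 622 background genes as the population from which the 119 identified genes are drawn is an assumption about how the test was set up, not a detail stated in the text.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """Exact P(X >= observed) when `draws` items are sampled without
    replacement from `population` items, `successes` of which are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return Fraction(favourable, total)


# Counts quoted for the BlaR domain: 87 of 119 identified genes carry the
# domain, against 118 of 622 genes in the background.
p_blar = float(hypergeom_upper_tail(622, 118, 119, 87))
```

Using `Fraction` keeps the computation exact until the final conversion to a float, which avoids underflow in the intermediate binomial coefficients.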

In contrast to the identified gene set, the genes represented by the subtree that matches the second part of the pattern (frequent black events of all types) are not enriched in the BlaR domain (2/36), nor is there contextual enrichment in BlaI (4/36) in positions immediately adjacent to instances of these genes. Applying RSAM-finder to these data with simpler queries that take into account only enrichment in environmental coloring does not yield this result, nor does the application of RSAM-finder to these data with any part of the pattern on its own. This result is demonstrated in Figure 7.

FIG. 7.

Application of RSAM-finder to genes belonging to the class D beta-lactamase family. The sought pattern is $((\{H T\}, r e d, T r u e), (\{S, D, H T\}, b l a c k, F a l s e))$, which codes for two patterns: one of massive horizontal transfer events from red to red (right subtree) and the other of all black events (left subtree). The figure shows the top-scoring subtree and the corresponding sequences. The blue rectangle marks the mutation characterizing the sequences in the leaves of the right subtree: this insertion was identified as a BlaR (signal transducer) domain.

The identified gene set for this result spans a wide range of Firmicutes, including both pathogenic (e.g., Staphylococcus) and nonpathogenic species (e.g., various gut microbes from the Clostridiales order). Homology between BlaR receptor proteins and the extracellular domain of Class D beta-lactamases was previously observed (Massidda et al., 1996; Brandt et al., 2017), mainly in gram-negative bacteria (with focus on clinical samples). Thus, RSAM-finder identifies a putative beta-lactamase system in gram-positive bacteria, consisting of a COG2602-BlaR beta-lactamase receptor protein and its BlaI family repressor, predicted to confer adaptation to animal and human host environments. Further comparative sequence-level analysis (Toth et al., 2016) may reveal the affinity of this beta-lactamase system to specific beta-lactam drugs.

6. Conclusions

We defined a new optimization problem in the DLT reconciliation domain. The input to this problem consists of a gene tree, constructed for a given gene orthology group, a species tree constructed for the species harboring one or more members of this gene orthology group, and a pattern representing a sought scenario in the reconciliation of the two trees. The sought pattern could imply some evolutionary process of interest, such as a gene conferring adaptation of the species to a specific environmental niche. The goal of the problem is to compute, for any vertex in the gene tree, a score reflecting the probability that the genomic mutations associated with the edge leading into this vertex confer the occurrence of the sought pattern within high-scoring reconciliations of the subtree rooted by this vertex with corresponding subtrees in the species tree.

To solve this new problem, and overcome some of the noise associated with gene tree and species tree reconstruction, we proposed an algorithm that first constructs a hypergraph $ℋ$ that stores information regarding the k-best DLT reconciliation scenarios for a given problem instance. The time complexity of the algorithm we propose for the construction of this hypergraph is $O (m \cdot n \cdot k)$ , which is essentially optimal since the number of vertices (and hence also the size) of the hypergraph can be as large as $Ω (m \cdot n \cdot k)$ .

Interesting open problems include the goal of extending the tool to handle more robust variations of phylogenies, such as polytomies and phylogenetic networks. It may also be helpful to consider bootstrapping methods to train the parameters and thresholds utilized by RSAM-finder.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This work was supported by ISF Grant Nos. 1176/18 and 939/18 and by the Lynn and William Frankel Center for Computer Science.

References

Alksne, L.E., and Rasmussen, B.A. 1997. Expression of the AsbA1, OXA-12, and AsbM1 beta-lactamases in Aeromonas jandaei AER 14 is coordinated by a two-component regulon. J. Bacteriol. 179, 2006–2013.

Bansal, M.S., Alm, E.J., and Kellis 2012. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics 28, i283–i291.

Bansal, M.S., Alm, E.J., and Kellis 2013. Reconciliation revisited: Handling multiple optima when reconciling with duplication, transfer, and loss. J. Comput. Biol. 20, 738–754.

Bapteste, O'Malley, M.A., Beiko, R.G., et al. 2009. Prokaryotic evolution and the tree of life are two different things. Biol. Direct 4, 34.

Berry, Chevenet, Doyon, J.P., et al. 2018. A geography-aware reconciliation method to investigate diversification patterns in host/parasite interactions. Mol. Ecol. Resour. 18, 1173–1184.

Brandt, Braun, S.D., Stein, et al. 2017. In silico serine β-lactamases analysis reveals a huge potential resistome in environmental and pathogenic species. Sci. Rep. 7, 43232.

Bush 2018. Past and present perspectives on β-lactamases. Antimicrob. Agents Chemother. 62, e01076–18.

Charleston 1998. Jungles: A new solution to the host/parasite phylogeny reconciliation problem. Math. Biosci. 149, 191–223.

David, L.A., and Alm, E.J. 2011. Rapid evolutionary innovation during an Archaean genetic expansion. Nature 469, 93.

Donati, Baudet, Sinaimeri, et al. 2015. Eucalypt: Efficient tree reconciliation enumerator. Algorithms Mol. Biol. 10, 3.

Doyon, J.P., Chauve, and Hamel 2009. Space of gene/species trees reconciliations and parsimonious models. J. Comput. Biol. 16, 1399–1418.

Doyon, J.P., Hamel, and Chauve 2011. An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 26–39.

Evans, B.A., and Amyes, S.G. 2014. OXA β-lactamases. Clin. Microbiol. Rev. 27, 241–263.

Huang, and Chiang 2005. Better k-best parsing, 53–64. In Proceedings of the Ninth International Workshop on Parsing Technology. Association for Computational Linguistics, Vancouver, British Columbia, Canada.

Huerta-Cepas, Serra, and Bork 2016. ETE 3: Reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638.

Libeskind-Hadas, and Charleston, M.A. 2009. On the computational complexity of the reticulate cophylogeny reconstruction problem. J. Comput. Biol. 16, 105–117.

Marchler-Bauer, and Bryant, S.H. 2004. CD-Search: Protein domain annotations on the fly. Nucleic Acids Res. 32(suppl_2), W327–W331.

Massidda, Montanari, M.P., and Mingoia 1996. Borderline methicillin-susceptible Staphylococcus aureus strains have more in common than reduced susceptibility to penicillinase-resistant penicillins. Antimicrob. Agents Chemother. 40, 2769–2774.

Merkle, Middendorf, and Wieseke 2010. A parameter-adaptive dynamic programming approach for inferring cophylogenies. BMC Bioinformatics 11, S60.

Mukherjee, Stamatis, Bertsch, et al. 2016. Genomes OnLine Database (GOLD) v. 6: Data updates and feature enhancements. Nucleic Acids Res. 45(D1), D446–D456.

Navarro 2006. Acquisition and horizontal diffusion of beta-lactam resistance among clinically relevant microorganisms. Int. Microbiol. 9, 79.

Patro, and Kingsford 2013. Predicting protein interactions via parsimonious network history inference. Bioinformatics 29, i237–i246.

Poirel, Carrër, Pitout, J.D., et al. 2009. Integron mobilization unit as a source of mobility of antibiotic resistance genes. Antimicrob. Agents Chemother. 53, 2492–2498.

Popescu, A.A., Huber, K.T., and Paradis 2012. ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics 28, 1536–1537.

Scornavacca, Paprotny, Berry, et al. 2013. Representing a set of reconciliations in a compact way. J. Bioinform. Comput. Biol. 11, 1250025.

Sievers, and Higgins, D.G. 2018. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145.

Stapleton, P.J., Murphy, McCallion, et al. 2016. Outbreaks of extended spectrum beta-lactamase-producing Enterobacteriaceae in neonatal intensive care units: A systematic review. Arch. Dis. Child Fetal Neonatal Ed. 101, 72–78.

Stolzer, Lai, Xu, et al. 2012. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics 28, i409–i415.

Sukumaran, and Holder, M.T. 2010. DendroPy: A Python library for phylogenetic computing. Bioinformatics 26, 1569–1571.

Szklarczyk, Morris, J.H., Cook, et al. 2016. The STRING database in 2017: Quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368.

Tatusov, R.L., Galperin, M.Y., Natale, D.A., et al. 2000. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36.

T.H., Jacox, Ranwez, et al. 2015. A fast method for calculating reliable event supports in tree reconciliations via Pareto optimality. BMC Bioinformatics 16, 384.

Tofigh 2009. Using trees to capture reticulate evolution: Lateral gene transfers and cancer progression [PhD thesis]. KTH School of Computer Science and Communication.

Tofigh, Hallett, and Lagergren 2011. Simultaneous identification of duplications and lateral gene transfers. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 517–535.

Toth, Antunes, N.T., Stewart, N.K., et al. 2016. Class D β-lactamases do exist in Gram-positive bacteria. Nat. Chem. Biol. 12, 9.

Zoller 2019. RSAM-finder. Available at: https://github.com/ronizoller/RSAM (last viewed on November 28, 2019).

Zoller, Zehavi, and Ziv-Ukelson 2019. Supplementary materials. Available at: https://github.com/ronizoller/RSAM/tree/master