Improved Practical Algorithms for Rooted Subtree Prune and Regraft (rSPR) Distance and Hybridization Number

Abstract

The problem of computing the rooted subtree prune and regraft (rSPR) distance of two phylogenetic trees is computationally hard, and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by Hybridization Number Computation [HNC]). Since both problems are important in phylogenetics, they have been studied extensively in the literature, and quite a number of exact or approximation algorithms have been designed and implemented for them. In this article, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best, and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the use of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within a short time.

1. Introduction

Constructing the evolutionary history of a set of species is an important problem in the study of biological evolution. Phylogenetic trees are used in biology to represent the ancestral history of a collection of existing species. This is appropriate for many groups of species. However, due to reticulation events such as hybridization, recombination, and lateral gene transfer, there are certain groups for which the ancestral history cannot be represented by a tree. For such groups of species, it is more appropriate to represent their ancestral history by rooted acyclic digraphs, where vertices of in-degree at least two represent reticulation events.

More specifically, by looking at two different segments of sequences or two different sets of genes of a set of extant species, we may obtain two different phylogenetic trees T₁ and T₂ of the same extant species with high confidence. Given T₁ and T₂, we want to construct a reticulate network N with the smallest number of reticulation events needed to explain the evolution of the species under consideration. Roughly speaking, N is the smallest rooted acyclic digraph such that each of T₁ and T₂ is homeomorphic to a subgraph of N. The number of vertices of in-degree larger than 1 in N is called the hybridization number of T₁ and T₂. The problem of computing the hybridization number of two given phylogenetic trees, denoted by Hybridization Number Computation (HNC), is NP-hard (Hein et al., 1996; Bordewich and Semple, 2005), where NP stands for the class of problems solvable in nondeterministic polynomial time. For this reason, quite a number of approximation algorithms and fixed-parameter algorithms have been designed and implemented for HNC (Wu, 2009; Chen and Wang, 2016, 2012; Hill et al., 2010; Collins et al., 2011; Albrecht et al., 2012; Kelk et al., 2012; van Iersel et al., 2014). To the best of our knowledge, the previously best program for solving HNC approximately (respectively, exactly) is given in van Iersel et al. (2014) [respectively, Chen and Wang (2013)].

A problem closely related to HNC is the problem of computing the rooted subtree prune and regraft (rSPR) distance of two given phylogenetic trees T₁ and T₂ of the same extant species. The rSPR distance between T₁ and T₂ can be defined as the minimum number of edges that should be deleted from each of T₁ and T₂ to transform them into homeomorphic rooted forests F₁ and F₂. The problem of computing the rSPR distance of two trees, denoted by rSPR Distance Computation (RDC), is NP-hard (Hein et al., 1996; Bordewich and Semple, 2005). This has motivated researchers to design and implement either exact or approximation algorithms for RDC (Hein et al., 1996; Bordewich and Semple, 2005; Bonet et al., 2006; Whidden and Zeh, 2009; Wu, 2009; Whidden et al., 2010; Chen and Wang, 2013; Chen et al., 2015, 2016, 2018; Schalekamp et al., 2016). To the best of our knowledge, the previously best software for solving RDC either exactly or approximately is due to Chen et al. (2018) (http://rnc.r.dendai.ac.jp/rspr.html).

In this article, we first improve Chen and Wang's exact algorithm (Chen and Wang, 2013) for HNC. Since rSPR distance is a lower bound on hybridization number, the main idea is to use the lower bound on rSPR distance output by Chen et al.'s algorithm (Chen et al., 2018) to cut unnecessary branches of the search tree. Another main idea is to arrange the child recursive calls of each recursive call carefully. Our experimental results show that the resulting algorithm can be implemented as software that runs more than 80 times faster than Chen and Wang's UltraNet (Chen and Wang, 2013) for the easiest dataset used in the experiments. Moreover, its superiority in speed becomes even more significant for more difficult instances.

We then present a new approximation algorithm for RDC. Although this algorithm does not necessarily always output a better result than the algorithm in Chen et al. (2018), we can obtain a new algorithm that calls the two algorithms and outputs the better result returned by them. Our experimental results show that the resulting algorithm can be implemented as software that often outputs better results than Chen et al.'s program (Chen et al., 2018). We further propose to use the so-called Monte Carlo tree search (MCTS) method to improve any approximation algorithm A for RDC. In our application of MCTS, instead of performing a number of random play-outs in the simulation phase of each round, we make a single call of A and then, in the backpropagation phase, use its returned result to update information in the sequence of nodes selected for this round. Our experimental results show that the MCTS-based algorithm can be implemented as software that outputs much better and, indeed, always nearly optimal results.

We further combine our MCTS-based approximation algorithm for RDC with the integer-linear programming (ILP) approach in van Iersel et al. (2014) to obtain a new approximation algorithm for HNC. Our experimental results show that the new algorithm can be implemented as software that outputs much better (indeed always nearly optimal and often optimal) results than the previous best in van Iersel et al. (2014).

Our programs are available at http://rnc.r.dendai.ac.jp/rsprHN.html

2. Preliminaries

Throughout this article, a rooted forest always means a directed acyclic graph in which every vertex has in-degree at most 1 and out-degree at most 2.

Let F be a rooted forest. The roots (respectively, leaves) of F are those vertices whose in-degrees (respectively, out-degrees) are 0. The size of F, denoted by $| F |$ , is the number of roots in F minus 1. A vertex v of F is unifurcate if it has only one child in F. If a root v of F is unifurcate, then contracting v in F is the operation that modifies F by deleting v. If a nonroot vertex v of F is unifurcate, then contracting v in F is the operation that modifies F by first adding an edge from the parent of v to the child of v and then deleting v.

For a vertex v of F, the subtree of F rooted at v, denoted by F^v, is the subgraph of F whose vertices are the descendants of v in F and whose edges are those edges connecting two descendants of v in F. If v is a root of F, then F^v is a component tree of F; otherwise, it is a pendant subtree of F. For convenience, we view each vertex u of F as an ancestor and descendant of u itself. A vertex u is lower than another vertex $v \neq u$ in F if u is a descendant of v in F. The lowest common ancestor (LCA) of a sequence U of vertices in F, denoted by $ℓ_F(U)$, is the lowest vertex v in F such that for every vertex $u \in U$, v is an ancestor of u in F. Note that if no component tree of F contains all vertices of U, then $ℓ_F(U)$ does not exist. Two vertices u and v of F are incomparable if either $ℓ_F(u, v)$ does not exist or $ℓ_F(u, v) \notin \{u, v\}$. For two incomparable vertices u and v appearing in the same component tree of F, $D_F(u, v)$ denotes the set of all vertices w such that w is not a vertex of the (undirected) path $P_{u,v}$ between u and v in F but is the child of some inner vertex of $P_{u,v}$. For each pendant subtree T of F that has at least two leaves, the leaf-label set of T is a cluster of F.
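The definitions of the LCA and of the set $D_F(u, v)$ can be made concrete with a small sketch. The Python fragment below is illustrative only: it assumes that the forest is stored as a hypothetical `parent` map (vertex to parent, `None` for a root) together with a `children` map, a representation that is not prescribed by this article.

```python
def ancestors(parent, v):
    """Return v, parent(v), ... up to the root (v counts as its own ancestor)."""
    chain = [v]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def lca(parent, u, v):
    """Lowest common ancestor of u and v, or None if it does not exist."""
    seen = set(ancestors(parent, u))
    for w in ancestors(parent, v):
        if w in seen:
            return w
    return None  # u and v lie in different component trees


def d_set(parent, children, u, v):
    """D_F(u, v) for incomparable u, v in the same component tree."""
    a = lca(parent, u, v)
    up_u, up_v = ancestors(parent, u), ancestors(parent, v)
    # Vertices of the undirected path between u and v (it passes through a).
    path = set(up_u[:up_u.index(a) + 1]) | set(up_v[:up_v.index(a) + 1])
    inner = path - {u, v}
    # Children of inner path vertices that are not themselves on the path.
    return {c for w in inner for c in children[w] if c not in path}
```

For a caterpillar-like tree in which leaves a and b are siblings, the set returned for (a, b) is empty, whereas for two leaves farther apart it collects the subtrees hanging off the path between them.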

A rooted binary forest is a rooted forest in which the out-degree of every nonleaf vertex is 2. Let F be a rooted binary forest. F is a rooted binary tree if it has only one root. If v is a nonroot vertex of F with parent p, then detaching F^v is the operation that modifies F by first deleting the edge $(p, v)$ and then contracting p. A detaching operation on F is the operation of detaching a pendant subtree of F.
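The contracting and detaching operations can be sketched in a few lines of Python. The class below is a minimal illustration under the assumption that a forest is stored as parent and children maps; the class name `Forest` and its methods are hypothetical and are not taken from the programs described in this article.

```python
class Forest:
    """Rooted forest stored as parent/children maps (illustrative only)."""

    def __init__(self, parent):
        # parent maps each vertex to its parent, or to None for a root.
        self.parent = dict(parent)
        self.children = {v: [] for v in self.parent}
        for v, p in self.parent.items():
            if p is not None:
                self.children[p].append(v)

    def roots(self):
        return [v for v, p in self.parent.items() if p is None]

    def size(self):
        # |F| is the number of roots minus 1.
        return len(self.roots()) - 1

    def contract(self, v):
        """Contract a unifurcate vertex v (a vertex with exactly one child)."""
        (c,) = self.children[v]
        p = self.parent[v]
        self.parent[c] = p            # c becomes a root if v was a root
        if p is not None:
            self.children[p].remove(v)
            self.children[p].append(c)
        del self.parent[v], self.children[v]

    def detach(self, v):
        """Detach the pendant subtree rooted at the nonroot vertex v."""
        p = self.parent[v]
        self.children[p].remove(v)    # delete the edge (p, v)
        self.parent[v] = None
        self.contract(p)              # p is now unifurcate
```

Each call to `detach` increases the size of the forest by exactly 1, which is why the number of detaching operations is the natural cost measure in the following sections.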

2.1. Phylogenetic trees and forests

Let X be a set of existing species. A phylogenetic tree on X is a rooted binary tree whose leaf set is X. A phylogenetic forest is the graph obtained by applying a sequence of zero or more detaching operations on a phylogenetic tree. In other words, a phylogenetic forest is a graph whose connected components are phylogenetic trees on different sets of species.

An FF pair is a pair $(F_1, F_2)$, where F₁ and F₂ are two phylogenetic forests on the same set X of species. A TT pair is an FF pair $(F_1, F_2)$ such that both F₁ and F₂ are trees.

For an FF pair $(F_1, F_2)$, the labeled leaves of F₁ naturally correspond one-to-one to those of F₂ (i.e., each pair of corresponding leaves has the same label). We can extend the correspondence between the labeled leaves of F₁ and F₂ to (some of) their ancestors recursively as follows. Suppose that v₁ is a nonleaf vertex of F₁, v₂ is a nonleaf vertex of F₂, and each child of v₁ in F₁ corresponds to a child of v₂ in F₂. Then, v₁ corresponds to v₂.

An FF pair $(F_1, F_2)$ is proper if every root of F₁, except at most one, corresponds to a root in F₂. Obviously, a TT pair is also a proper FF pair. Simplifying a proper FF pair $(F_1, F_2)$ is to repeatedly perform the following operation on F₁ and F₂ until it is not applicable:

If some nonroot vertex v of F₁ corresponds to a root of F₂, then modify F₁ by detaching $F_1^v$.

Obviously, if $(F_1, F_2)$ is proper, then it remains proper after being simplified.

Throughout the remainder of this article, an FF pair always means a proper FF pair. A sub-FF pair of a TT pair $(T_1, T_2)$ is an FF pair $(F_1, F_2)$ such that for each $i \in \{1, 2\}$, F_i is obtained from T_i by performing zero or more detaching operations.

For an FF-pair $(F_1, F_2)$, if a vertex v₁ of F₁ and a vertex v₂ of F₂ correspond to each other, then both v₁ and v₂ are matched and they are the mates of each other. For brevity, if v is a matched vertex of F_i for some $i \in \{1, 2\}$, then we will also use v to denote its mate in $F_{3-i}$.

2.2. Agreement forests

Let $(F_1, F_2)$ be a sub-FF pair of a TT pair $(T_1, T_2)$. If we can apply a sequence of detaching operations on each of F₁ and F₂ so that they become the same forest F, then we refer to F as an agreement forest (AF) of $(F_1, F_2)$. A maximum agreement forest (MAF) of $(F_1, F_2)$ is an AF of $(F_1, F_2)$ whose size is minimized over all AFs of $(F_1, F_2)$. The size of an MAF of $(F_1, F_2)$ minus $|F_2|$ is called the rSPR distance of $(F_1, F_2)$, and it is denoted by $d(F_1, F_2)$. Obviously, an AF F of $(F_1, F_2)$ is also an AF of $(T_1, T_2)$. The following lemma is shown in Chen et al. (2018).

Lemma 2.1. Chen et al. (2018). Given an FF-pair $(F_1, F_2)$, we can compute a lower bound $b_ℓ$ and an upper bound b_u on the rSPR distance of $(F_1, F_2)$ in cubic time such that $b_u \leq 2 b_ℓ$.

Suppose that F is an AF of $(T_1, T_2)$. For each $i \in \{1, 2\}$, the leaves of T_i naturally correspond one-to-one to the leaves of F. For convenience, we hereafter identify each leaf v of F with the leaf of T_i corresponding to v. Similarly, the nonleaf vertices of F correspond to distinct nonleaf vertices of T_i. More precisely, a nonleaf vertex u of F corresponds to $ℓ_{T_i}(v_1, \dots, v_q)$, where $v_1, \dots, v_q$ are the leaf descendants of u in F. Again for convenience, we hereafter also use each nonleaf vertex u of F to denote the nonleaf vertex of T_i corresponding to u. With these correspondences, we can use F, T₁, and T₂ to construct a directed graph G_F as follows:

The vertices of G_F are the roots of F.

For every two roots r₁ and r₂ of F, there is an edge from r₁ to r₂ in G_F if and only if r₁ is an ancestor of r₂ in T₁ or T₂.

We refer to G_F as the decision graph associated with F. If G_F is acyclic, then F is an acyclic agreement forest (AAF) of $(T_1, T_2)$; otherwise, F is a cyclic agreement forest (CAF) of $(T_1, T_2)$. If F is an AAF of $(T_1, T_2)$ and its size is minimized over all AAFs of $(T_1, T_2)$, then F is a maximum acyclic agreement forest (MAAF) of $(T_1, T_2)$. The hybridization number of $(T_1, T_2)$ is the size of an MAAF of $(T_1, T_2)$, and it is denoted by $h(T_1, T_2)$.
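Building the decision graph and testing it for acyclicity can be sketched as follows. This is a minimal illustration, not the implementation used in our programs: it assumes that T₁ and T₂ are given as hypothetical `parent1` and `parent2` maps, that `pos1` and `pos2` map each root of F to the vertex it denotes in T₁ and T₂, and that an edge is created only between two distinct roots. The acyclicity test is a standard topological-sort check.

```python
from collections import deque


def is_ancestor(parent, a, d):
    """True if a is a proper ancestor of d in the tree given by parent."""
    d = parent[d]
    while d is not None:
        if d == a:
            return True
        d = parent[d]
    return False


def decision_graph(roots, pos1, pos2, parent1, parent2):
    """Edge r -> s iff r is an ancestor of s in T1 or in T2 (r != s)."""
    edges = {r: set() for r in roots}
    for r in roots:
        for s in roots:
            if r != s and (is_ancestor(parent1, pos1[r], pos1[s])
                           or is_ancestor(parent2, pos2[r], pos2[s])):
                edges[r].add(s)
    return edges


def is_acyclic(edges):
    """Kahn's algorithm: the digraph is acyclic iff every vertex is removed."""
    indeg = {v: 0 for v in edges}
    for v in edges:
        for w in edges[v]:
            indeg[w] += 1
    queue = deque(v for v in edges if indeg[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in edges[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return removed == len(edges)
```

An agreement forest F is then an AAF exactly when `is_acyclic(decision_graph(...))` holds for its roots.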

We are now ready to define the problems studied in this article:

Hybridization Number Computation (HNC):

Input: A TT-pair $(T_1, T_2)$.

Output: The hybridization number of $(T_1, T_2)$.

rSPR Distance Computation (RDC):

Input: A TT-pair $(T_1, T_2)$.

Output: The rSPR distance of $(T_1, T_2)$.

2.3. Transforming a CAF to an AAF

Suppose that F is a CAF of a TT-pair $(T_1, T_2)$. We construct a directed graph D as follows. For every nonleaf vertex of F, we create a vertex in D. There is an edge in D from a vertex u to a vertex v precisely if in either T₁ or T₂ (or in both), there is a directed path from u to v. A minimum directed feedback vertex set (MDFVS) of D is a minimum-sized set U of vertices in D such that modifying D by removing the vertices in U yields a directed acyclic graph.

Lemma 2.2. Kelk et al. (2012). Let U be an MDFVS of D. Then, to transform F to an AAF of $(T_1, T_2)$ by performing a minimum number of detaching operations on F, it suffices to modify F by removing the vertices corresponding to those in U and further contracting unifurcate vertices.

Let V be the set of vertices in D. By Lemma 2.2, to compute an MDFVS U of D, it suffices to solve the following ILP model (van Iersel et al., 2014):
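As a hedged illustration, a standard ordering-based ILP for the minimum directed feedback vertex set problem can be written with a binary variable $x_v$ (equal to 1 if v is removed) and an integer label $ℓ_v$ (the position of v in a topological order) for each $v \in V$. The formulation below is a sketch under the assumption that the model has this usual form; the edge set $E(D)$ is notation introduced here, and the exact formulation is the one given in van Iersel et al. (2014).

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{v \in V} x_v \\
\text{subject to}\quad & \ell_v \geq \ell_u + 1 - |V| \, (x_u + x_v)
                         && \text{for every edge } (u, v) \in E(D), \\
                       & 0 \leq \ell_v \leq |V| - 1
                         && \text{for every } v \in V, \\
                       & x_v \in \{0, 1\}, \ \ell_v \in \mathbb{Z}
                         && \text{for every } v \in V.
\end{aligned}
```

If neither endpoint of an edge $(u, v)$ is removed, the first constraint forces $ℓ_v \geq ℓ_u + 1$, so the labels of the remaining vertices form a topological order; if at least one endpoint is removed, the constraint is implied by the bounds on the labels.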

Fortunately, in our application, we will have an integer k and only want to know whether the optimal value of the objective function is bounded by k from above. So, we modify the model by replacing the objective function with any constant (say, 0) and adding the new constraint $\sum_{v \in V} x_v \leq k$. We refer to this modified model as the ILP model associated with $(T_1, T_2, F, k)$.

3. Solving HNC Exactly

Our algorithm for solving HNC exactly will use a subroutine for the following parameterized version of HNC.

Parameterized HNC (PHNC):

Input: $(T_1, T_2, F_1, F_2, k)$, where $(T_1, T_2)$ is a TT pair, $(F_1, F_2)$ is a sub-FF pair of $(T_1, T_2)$, and k is an integer.

Output: “Yes” if performing k more detaching operations on F₂ leads to an AAF of $(T_1, T_2)$; “no” otherwise.

Several definitions are in order. Let $(F_1, F_2)$ be an FF-pair, and $i \in \{1, 2\}$. A vertex v of F_i is active if v is a matched nonroot vertex of F_i and its parent in F_i is not matched. Note that all active vertices of F₁ fall into the same component tree of F₁. An active sibling-pair of F_i is a pair $(u, v)$ of active vertices in F_i such that u and v are siblings in F_i.

3.1. Key ideas

Basically, our algorithm is a significantly refined version of the algorithm for HNC implemented in Chen and Wang's UltraNet (Chen and Wang, 2013). In this subsection, we list the key new ideas behind our new algorithm for HNC.

First, the new algorithm builds on a recent 2-approximation algorithm for RDC (Chen et al., 2018). When we compute the hybridization number, we use the lower bound output by the approximation algorithm to bound the search for the hybridization number. Since the lower bound is often nearly optimal, this bounding idea makes it possible for our algorithm to find the hybridization number in a short time. Since the exact algorithm for RDC in Chen et al. (2018) is also fast, we can use it to bound the search for the hybridization number instead of using the 2-approximation algorithm for RDC.

Second, the new algorithm is recursive and we make child recursive calls in a careful order. More precisely, child recursive calls that appear to finish in shorter time are made earlier than those that look likely to finish in longer time.

Third, when we make a recursive call, we may know certain vertices v such that the subtree rooted at v should not be detached, and so we lock these vertices so that the subtrees rooted at them will never be detached in subsequent recursive calls. Moreover, the locked vertices help us make fewer child recursive calls.

Finally, when our algorithm needs to transform a CAF F of a TT-pair $(T_1, T_2)$ to an AAF of $(T_1, T_2)$, we use the ILP-based method outlined in Section 2.3. However, we modify the ILP model in Section 2.3 as follows.

Let D be the digraph constructed from F and $(T_1, T_2)$ as in Section 2.3. Since F is a CAF, D has a cycle and we need to remove at least one vertex from D to make D acyclic. Once D becomes acyclic, its number of vertices has decreased by at least 1. So, it is safe to modify the ILP model by changing the upper bound on the value of $ℓ_v$ from $|V| - 1$ to $|V| - 2$.

Some vertices of F may have been locked. So, for each locked vertex v of F, we can modify the model by fixing $x_v = 0$.

By Lemma 4 in Chen and Wang (2012), we know that for each edge $(p, c)$ of F, if removing a set U of vertices from D with $\{p, c\} \subseteq U$ makes D acyclic, then removing the vertices of $U \setminus \{c\}$ also makes D acyclic. Thus, for each edge $(p, c)$ of F, we can add the constraint $x_c \leq x_p$ to the model.
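The feasibility question behind the modified model (is there a set of at most k unlocked vertices whose removal makes D acyclic?) can be illustrated without an ILP solver. The sketch below swaps in plain brute-force enumeration for the ILP, so it is only practical for tiny digraphs, and it does not encode the parent-child constraint from Lemma 4. The function names and the edge-list representation are assumptions made for illustration.

```python
from itertools import combinations


def acyclic_after_removal(vertices, edges, removed):
    """Topological-sort check on the digraph with `removed` deleted."""
    keep = [v for v in vertices if v not in removed]
    indeg = {v: 0 for v in keep}
    for u, v in edges:
        if u in indeg and v in indeg:
            indeg[v] += 1
    stack = [v for v in keep if indeg[v] == 0]
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for a, b in edges:
            if a == u and b in indeg:
                indeg[b] -= 1
                if indeg[b] == 0:
                    stack.append(b)
    return seen == len(keep)


def has_feedback_set(vertices, edges, k, locked=frozenset()):
    """True iff removing at most k unlocked vertices makes the digraph acyclic."""
    candidates = [v for v in vertices if v not in locked]
    for size in range(min(k, len(candidates)) + 1):
        for subset in combinations(candidates, size):
            if acyclic_after_removal(vertices, edges, set(subset)):
                return True
    return False
```

Excluding locked vertices from the candidate list plays the same role as fixing $x_v = 0$ in the ILP model: it shrinks the search space without changing the answer, provided the locking is valid.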

3.2. The algorithm

Throughout this subsection, fix an instance $(T_1, T_2)$ of HNC.

Our algorithm for computing $h(T_1, T_2)$ exactly first uses the program in Chen et al. (2018) for RDC to compute $d(T_1, T_2)$. The program can also output an AF F of $(T_1, T_2)$ with size $d(T_1, T_2)$. So, our algorithm checks whether F is, indeed, an AAF of $(T_1, T_2)$ (by constructing the decision graph G_F associated with F and testing whether G_F is acyclic or not). If it is, then $d(T_1, T_2)$ is also $h(T_1, T_2)$ and so the algorithm outputs $d(T_1, T_2)$ and stops. Thus, we hereafter assume that F is not an AAF of $(T_1, T_2)$.

Our algorithm then repeatedly performs a cluster reduction on T₁ and T₂ until no such reduction is applicable. For the detail of cluster reductions, the reader is referred to Baroni et al. (2006). As the result of zero or more cluster reductions on T₁ and T₂, we obtain a sequence $(T_{1,1}, T_{2,1})$, …, $(T_{1,q}, T_{2,q})$ of instances of HNC such that $q \geq 1$ and $h(T_1, T_2) = \sum_{i=1}^{q} h(T_{1,i}, T_{2,i})$. Hence, it suffices to compute $h(T_{1,i}, T_{2,i})$ for each $i \in \{1, \dots, q\}$. Therefore, for simplicity, we hereafter assume that $q = 1$ and, in turn, $(T_1, T_2) = (T_{1,1}, T_{2,1})$.

To compute $h(T_1, T_2)$, it suffices to solve PHNC on input $(T_1, T_2, T_1, T_2, k)$ for $k = d(T_1, T_2), d(T_1, T_2) + 1, \dots$ (in this order) until a “yes” is returned. So, it remains to detail our algorithm for PHNC. During its execution, our algorithm will lock certain nonroot vertices v of F₂ at certain time points so that $F_2^v$ will never be detached thereafter; it will always maintain the following invariant:

Invariant 1: Whenever a nonroot vertex is locked by the algorithm, it knows that it will return “yes” with the locking if and only if it will return “yes” without the locking.
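The outer loop that reduces HNC to PHNC can be sketched as follows. This is a minimal illustration: `rspr_distance` and `solve_phnc` are hypothetical callbacks standing in for the RDC program of Chen et al. (2018) and for the recursive PHNC algorithm, respectively, and the trees are treated as opaque values.

```python
def hybridization_number(t1, t2, rspr_distance, solve_phnc):
    """Try k = d(T1, T2), d(T1, T2) + 1, ... until PHNC answers "yes".

    rspr_distance(t1, t2)        -> exact rSPR distance (a lower bound on h)
    solve_phnc(t1, t2, f1, f2, k) -> True for "yes", False for "no"
    """
    k = rspr_distance(t1, t2)
    # The initial sub-FF pair is the TT pair itself.
    while not solve_phnc(t1, t2, t1, t2, k):
        k += 1
    return k
```

Starting at the rSPR distance rather than at 0 skips all values of k that are known in advance to yield "no".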

Our algorithm for PHNC is recursive and proceeds as follows. It starts by checking whether $k \geq 0$. If $k < 0$, then this is Base Case 1 and it returns “no.” So, we hereafter assume $k \geq 0$. Then, it simplifies $(F_1, F_2)$ and further checks the following base case:

Base Case 2: All roots of F₁ are matched. In this case, F₁ and F₂ are the same forest, and, hence, F₂ is an AF of $(T_1, T_2)$. To test whether F₂ is an AAF, our algorithm constructs the decision graph $G_{F_2}$ associated with F₂ and tests whether it is acyclic or not. If $G_{F_2}$ is acyclic, then it returns “yes.” Otherwise, it checks whether $k \geq 1$ or not. If $k \leq 0$, then it returns “no.” On the other hand, if $k \geq 1$, then it constructs the ILP model associated with $(T_1, T_2, F_2, k)$ and solves the ILP model by an ILP solver (say, CPLEX or GUROBI); it returns “yes” if and only if the model is feasible.

We hereafter assume that one or more roots of F₁ are still not matched. Our algorithm then uses the program in Chen et al. (2018) to compute a lower bound $b_ℓ$ and an upper bound b_u on $d(F_1, F_2)$. The program will also return an AF F of $(F_1, F_2)$ with size b_u as a witness for b_u. If $k < b_ℓ$, then this is Base Case 3, and the algorithm returns “no.” Otherwise, the algorithm checks whether the ILP model associated with $(T_1, T_2, F, k)$ is feasible or not. If it is feasible, then this is Base Case 4, and the algorithm returns “yes.”

We hereafter assume that $k \geq b_ℓ$ and the ILP model associated with $(T_1, T_2, F, k)$ is infeasible. Clearly, both F₁ and F₂ must have at least one active sibling-pair. Our algorithm now distinguishes several cases in the following order.

Case 1: There is an active sibling-pair $(u, v)$ in F₁ such that $|D_{F_2}(u, v)| = 1$. In this case, we clearly know that to transform F₂ into an AF of $(F_1, F_2)$, we need to select at least one $x \in \{u, v, w\}$ and detach $F_2^x$, where $D_{F_2}(u, v) = \{w\}$. So, if all vertices of $\{u, v, w\}$ are locked, then this is Base Case 5, and the algorithm returns “no.” Thus, we may assume that at least one vertex of $\{u, v, w\}$ is not locked. As observed in Whidden et al. (2013), selecting $x = u$ is the same as selecting $x = v$ (which means that the former selection leads to a “yes”-output if and only if so does the latter). Hence, if u or v is not locked, then our algorithm chooses an arbitrary unlocked $x \in \{u, v\}$ and makes a recursive call on input $(T_1, T_2, F_1, F'_2, k - 1)$, where $F'_2$ is obtained from F₂ by detaching $F_2^x$. In addition, if w is also not locked, then our algorithm makes a recursive call on input $(T_1, T_2, F_1, F''_2, k - 1)$, where $F''_2$ is obtained from F₂ by detaching $F_2^w$ and further locking x in case the recursive call on input $(T_1, T_2, F_1, F'_2, k - 1)$ has been made. So, we make one or two recursive calls. If at least one call returns “yes,” the algorithm returns “yes”; otherwise, it returns “no.”

Case 2: There is an active sibling-pair $(u, v)$ in F₂ such that $|D_{F_1}(u, v)| = 1$ and the unique vertex w in $D_{F_1}(u, v)$ is active. This case is symmetric to Case 1; so, the algorithm proceeds as in Case 1 except that each of u, v, and w is replaced by its mate.

Case 3: Neither Case 1 nor 2 occurs. In this case, our algorithm searches F₁ for an active sibling-pair $(u, v)$ in the following order:

Type 1: Both u and v are locked in F₂.

Type 2: u and v belong to different connected components of F₂.

Type 3: Either u or v is locked in F₂.

Type 4: The sibling s of the parent of u and v in F₁ satisfies that either s is active or both children of s in F₁ are active.

Type 5: $(u, v)$ is of none of the types cited earlier.

We emphasize that the smaller type of an active sibling-pair in F₁ is, the more our algorithm prioritizes it. Intuitively speaking, choosing an active sibling-pair of a smaller type in F₁ will likely lead to fewer recursive calls.

Suppose that our algorithm has selected an active sibling-pair $(u, v)$ in F₁ as described above. Our algorithm constructs a family $ℱ$ of sets as follows. Initially, $ℱ$ is empty. For each $y \in \{u, v\}$ such that y is not locked in F₂, we add the set $\{y\}$ to $ℱ$. Moreover, if no vertex in $D_{F_2}(u, v)$ is locked in F₂, then we add $D_{F_2}(u, v)$ to $ℱ$. Since Case 1 does not occur, $|D_{F_2}(u, v)| \geq 2$. Clearly, to transform F₂ into an AF of $(F_1, F_2)$, we need to select a set $S \in ℱ$ and detach $F_2^w$ for all $w \in S$. Thus, if $ℱ$ is empty, then our algorithm returns “no.” Otherwise, it sorts the sets in $ℱ$ so that larger sets precede smaller sets. Let S₁, …, S_t be the sets in $ℱ$ in this order. For each $i \in \{1, \dots, t\}$, let $F_{2,i}$ be the phylogenetic forest obtained from F₂ by first detaching $F_2^y$ for all $y \in S_i$ and further distinguishing two cases as follows:

If $|S_i| \geq 2$, then lock both u and v in F₂.

If $i \geq 2$ and $|S_{i-1}| = |S_i| = 1$, then lock the vertex of $S_{i-1}$ in F₂.

Now, our algorithm makes t recursive calls on input $(T_1, T_2, F_1, F_{2,1}, k - |S_1|)$, …, $(T_1, T_2, F_1, F_{2,t}, k - |S_t|)$. If at least one call returns “yes,” the algorithm returns “yes”; otherwise, it returns “no.”
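The construction of the family of sets, the ordering of the child calls, and the locking rules in Case 3 can be sketched as follows. This is an illustration under simplifying assumptions: vertices are plain hashable labels, `d_uv` stands for the set $D_{F_2}(u, v)$, `locked` is the set of currently locked vertices, and the function name `plan_calls` is hypothetical. The sketch only plans the calls; it does not perform the detaching itself.

```python
def plan_calls(u, v, d_uv, locked, k):
    """Return a list of (S_i, locked set for the call, remaining budget)."""
    # Singleton sets {y} for each unlocked y in {u, v}.
    family = [{y} for y in (u, v) if y not in locked]
    # D_{F_2}(u, v) is added only if none of its vertices is locked.
    if not (set(d_uv) & set(locked)):
        family.append(set(d_uv))
    # Larger sets first; Python's sort is stable, so {u} stays before {v}.
    family.sort(key=len, reverse=True)

    calls = []
    for i, s in enumerate(family):
        new_locks = set()
        if len(s) >= 2:
            # Detaching D_{F_2}(u, v): u and v need never be detached later.
            new_locks = {u, v}
        elif i >= 1 and len(family[i - 1]) == 1:
            # The previous singleton was already tried; lock its vertex.
            new_locks = set(family[i - 1])
        calls.append((s, set(locked) | new_locks, k - len(s)))
    return calls  # an empty list corresponds to returning "no"
```

For unlocked u and v with a two-element $D_{F_2}(u, v)$ and k = 5, this plans three calls with budgets 3, 4, and 4, and the last call is made with u locked.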

4. Solving RDC Approximately

Basically, we want an approximation algorithm that outputs better results than the algorithm in Chen et al. (2018). Although the algorithm in Chen et al. (2016) has a worse theoretical guarantee than the algorithm in Chen et al. (2018), this does not necessarily mean that the former always outputs worse results. So, we obtain a new approximation algorithm for RDC, which simply runs the algorithms in Chen et al. (2016, 2018) and outputs the better result returned by them.

Our new idea is to use MCTS to improve the performance of any approximation algorithm for RDC. MCTS has a number of variants, but we here use the basic one (namely, the UCT algorithm) for its simplicity.

4.1. Outline of the algorithm

In the remainder of this section, fix an FF-pair $(F_1, F_2)$. Our algorithm for computing $d(F_1, F_2)$ approximately is recursive and starts by simplifying $(F_1, F_2)$ and further checking whether F₂ is already an AF of $(F_1, F_2)$. If it is, then this is Base Case 1 and it returns 0. So, assume that F₂ is not an AF of $(F_1, F_2)$. Then, F₁ has a unique nonmatched root r. If r has at most 6 leaf descendants in F₁, then this is Base Case 2 and our algorithm computes $d(F_1, F_2)$ in $O(1)$ time by brute force. Thus, we further assume that r has more than 6 leaf descendants in F₁. Now, our algorithm finds a promising vertex z in F₂, next detaches $F_2^z$, further makes a recursive call on the modified $(F_1, F_2)$, and finally returns $c + 1$, where c is the value returned by the recursive call.

How to find a promising z needs to be considered. In the next two cases, we know an optimal choice of z, that is, we know that the choice of z will lead to an optimal solution (Chen et al., 2015):

Optimal Case 1: $(u, v)$ is an active sibling-pair in F₁ with $|D_{F_2}(u, v)| = 1$. In this case, z is the unique vertex in $D_{F_2}(u, v)$.

Optimal Case 2: $(u, v)$ is an active sibling-pair in F₂ with $|D_{F_1}(u, v)| = 1$ and the unique vertex in $D_{F_1}(u, v)$ is a leaf. In this case, z is the mate of the unique vertex in $D_{F_1}(u, v)$.

We hereafter assume that none of the optimal cases cited earlier occurs. Next, we outline how to find a promising z with MCTS. The idea behind MCTS is to build a small-sized search tree $Γ$ . We will always use $ρ$ to denote the root of $Γ$ . In our case, each node $α$ of $Γ$ holds the following information:

$f(α)$: A sub-FF pair of $(F_1, F_2)$. (Comment: We use $f(α)_1$ and $f(α)_2$ to denote the first and the second forest in $f(α)$, respectively.)

$t(α)$: The number of times $α$ has been visited so far.

$s(α)$: The score of $α$.

$Q(α)$: The reward $α$ has received so far.

When creating a node $α$, we are always given a sub-FF pair $(\hat{F}_1, \hat{F}_2)$ of $(F_1, F_2)$ and initialize $f(α) = (\hat{F}_1, \hat{F}_2)$, $t(α) = 0$, $s(α) = 0$, and $Q(α) = 0$. To evaluate a child $α$ of a node $β$ of $Γ$, we use the UCT value of $α$, which is computed as follows: $\frac{Q(α)}{t(α)} + C \cdot \sqrt{\frac{2 \ln t(β)}{t(α)}},$

where C is a constant (called the balance constant and fixed to be 0.2 in our experiments). The best child of a node $β$ in $Γ$ is the child of $β$ in $Γ$ whose UCT value is maximized over all children of $β$ in $Γ$ .
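The UCT value and the choice of the best child can be sketched in Python as follows. One detail is an assumption of this sketch rather than something stated above: a child that has never been visited ($t(α) = 0$) is given the value infinity, so that it is tried before any visited sibling and no division by zero occurs. The function names are hypothetical.

```python
import math


def uct_value(q, t, t_parent, c=0.2):
    """UCT value Q/t + C * sqrt(2 ln t(parent) / t) of a child node."""
    if t == 0:
        return math.inf  # assumption: unvisited children are explored first
    return q / t + c * math.sqrt(2.0 * math.log(t_parent) / t)


def best_child(children, t_parent, c=0.2):
    """children is a list of (name, Q, t) triples; ties go to the first one."""
    return max(children, key=lambda ch: uct_value(ch[1], ch[2], t_parent, c))[0]
```

The first term favors children that have earned a high average reward, and the second term favors children that have been visited rarely; a small balance constant such as 0.2 puts most of the weight on the first term.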

Initially, $Γ$ has a unique node, namely, the root $ρ$ created with $(F_1, F_2)$. We then grow $Γ$ by repeatedly performing the following steps (in this order) for a predetermined number (fixed to be 60 in our experiments) of repetitions:

Select a leaf node $α$ in $Γ$ by starting at $ρ$ and repeatedly descending to the best child of the current node until reaching a leaf. (Comment: Ties are broken arbitrarily.)

Expand $α$ . (Comment: see Section 4.2.)

Perform a simulation for $α$ by calling an approximation algorithm [say, the algorithm in Chen et al. (2018)] on input $f(α)$, and then update $s(α)$ to $\mathrm{App}(f(α)) + |f(α)_2| - |f(ρ)_2|$, where $\mathrm{App}(f(α))$ means the approximate rSPR distance of $f(α)$ returned by the approximation algorithm. (Comment: We refer to this step as the simulation step.)

Compute the reward $Q(α) = \begin{cases} 1 & \text{if } s(α) \leq \text{the average score of the nodes in } Γ \\ 0 & \text{otherwise} \end{cases}$.

Backpropagate the reward $Q (α)$ from $α$ all the way to the root $ρ$ by performing the following step for all ancestors $β$ of $α$ in $Γ$ :

Increase $t (β)$ by 1 and increase $Q (β)$ by $Q (α)$ .

Once the growth of $Γ$ is finished as described above, we select the best child $γ$ of $ρ$. Finally, we set z to be the vertex in $f(ρ)_2$ such that $f(γ)_2$ is obtained from $f(ρ)_2$ by detaching the subtree rooted at z.
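The growth of the search tree (selection, expansion, simulation, reward, and backpropagation) can be sketched generically as follows. This is an illustration under stated assumptions, not the implementation used in our programs: `expand(state)` and `score(state)` are hypothetical callbacks, where `score` is meant to return the score computed in the simulation step; an unvisited child gets UCT value infinity; the average score is taken over all simulations performed so far; and the reward of a round is added once to the selected node and to each of its proper ancestors.

```python
import math


class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []
        self.t = 0      # number of visits
        self.s = 0.0    # score from the latest simulation
        self.q = 0.0    # accumulated reward


def uct(node, c):
    if node.t == 0:
        return math.inf  # assumption: unvisited children come first
    return node.q / node.t + c * math.sqrt(2 * math.log(node.parent.t) / node.t)


def grow(root_state, expand, score, rounds=60, c=0.2):
    """Grow the search tree and return the best child of the root."""
    root = Node(root_state)
    simulated = []  # scores of all simulations so far
    for _ in range(rounds):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=lambda ch: uct(ch, c))
        node.children = [Node(s, node) for s in expand(node.state)]  # expansion
        node.s = score(node.state)                # simulation: one call of A
        simulated.append(node.s)
        average = sum(simulated) / len(simulated)
        reward = 1.0 if node.s <= average else 0.0
        while node is not None:                   # backpropagation
            node.t += 1
            node.q += reward
            node = node.parent
    if not root.children:
        return root
    return max(root.children, key=lambda ch: uct(ch, c))
```

Because each round makes a single call of the approximation algorithm instead of many random play-outs, the cost of one round is essentially the cost of one run of that algorithm.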

4.2. Expanding a node $α$

Suppose that we have selected a leaf node $α$ to expand. We first simplify $f(α)$ and then search $f(α)_1$ and $f(α)_2$ for an active sibling-pair $(u, v)$ in the following order:

Type 1: $(u, v)$ is an active sibling-pair in $f(α)_1$ with $|D_{f(α)_2}(u, v)| = 1$.

Type 2: $(u, v)$ is an active sibling-pair in $f(α)_2$ such that $|D_{f(α)_1}(u, v)| = 1$ and the unique vertex in $D_{f(α)_1}(u, v)$ is a leaf.

Type 3: $(u, v)$ is an active sibling-pair in $f(α)_1$ such that u and v belong to different connected components of $f(α)_2$.

Type 4: $(u, v)$ is an active sibling-pair in $f(α)_1$ such that u and v belong to the same connected component of $f(α)_2$ and $ℓ_{f(α)_2}(u, v)$ is a root of $f(α)_2$.

Type 5: $(u, v)$ is of none of the types cited earlier.

We emphasize that the smaller type of an active sibling-pair is, the more our algorithm prioritizes it.

If (u,v) is not found, we know that f(α)₂ is an AF of f(α) and, hence, we have nothing to do with expanding α. Thus, we hereafter assume that (u,v) has been found. Then, we construct a family ℱ of sets as follows.

If $(u, v)$ is of Type 1 (respectively, 2), then $ℱ$ consists of only $D_{f(α)_2}(u, v)$ (respectively, $D_{f(α)_1}(u, v)$).

If $(u, v)$ is of Type 3 or 4, then $ℱ$ consists of $\{u\}$ and $\{v\}$.

If $(u, v)$ is of Type 5, then $ℱ$ consists of $\{u\}$, $\{v\}$, and $D_{f(α)_2}(u, v)$.

We now use ℱ to create the children of α as follows. For each set S ∈ ℱ, we create a child β_S, where f(β_S)₁=f(α)₁ and f(β_S)₂ is obtained from f(α)₂ by detaching the subtrees rooted at the vertices in S.
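The mapping from the type of the selected active sibling-pair to the family of sets can be written as a small lookup. The sketch below is illustrative: `d1` and `d2` stand for $D_{f(α)_1}(u, v)$ and $D_{f(α)_2}(u, v)$ and are assumed to have been computed by the caller, vertices are plain labels that also denote their mates, and the function name is hypothetical.

```python
def expansion_family(pair_type, u, v, d1, d2):
    """Return the family of vertex sets; one child node is created per set."""
    if pair_type == 1:
        return [set(d2)]            # the single vertex of D in the second forest
    if pair_type == 2:
        return [set(d1)]            # the single (leaf) vertex of D in the first forest
    if pair_type in (3, 4):
        return [{u}, {v}]           # only u or v is worth detaching
    return [{u}, {v}, set(d2)]      # Type 5: all three alternatives
```

Types 1 and 2 produce a single child, so expanding such a node does not widen the search tree, which is why these types are searched for first.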

5. Solving HNC Approximately

We say that an approximation algorithm A for RDC is useful if given a TT-pair $(T_1, T_2)$, A can not only output an approximate value $d'$ of $d(T_1, T_2)$ but also output an AF F of $(T_1, T_2)$ with $|F| = d'$. Our approximation algorithm given in Section 4 is useful and so are all known approximation algorithms for RDC. Using a useful approximation algorithm A for RDC, we can design an approximation algorithm for HNC, denoted by $A_{hn}$, as follows. Given a TT-pair $(T_1, T_2)$, $A_{hn}$ calls A to obtain an approximate value $d'$ of $d(T_1, T_2)$ and an AF F of $(T_1, T_2)$ with $|F| = d'$. If F is an AAF of $(T_1, T_2)$, then $d'$ is also an approximate value of $h(T_1, T_2)$ and hence $A_{hn}$ returns $d'$. So, assume that F is a CAF of $(T_1, T_2)$. Then, as in Section 2.3, we can transform F into an AAF of $(T_1, T_2)$ by solving an ILP model. Thus, $d'$ plus the optimal value of the objective function of the model gives us an approximate value of $h(T_1, T_2)$ and so $A_{hn}$ returns it.
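The way the HNC approximation wraps a useful RDC approximation can be sketched as follows. The three callbacks are hypothetical placeholders: `approx_rdc` stands for the useful algorithm A, `is_acyclic_af` for the decision-graph test of Section 2.2, and `min_feedback_size` for the optimal value of the ILP model of Section 2.3.

```python
def approx_hnc(t1, t2, approx_rdc, is_acyclic_af, min_feedback_size):
    """Approximate h(T1, T2) from an approximate rSPR distance and its witness AF."""
    d_prime, forest = approx_rdc(t1, t2)      # value d' and an AF F with |F| = d'
    if is_acyclic_af(forest, t1, t2):
        return d_prime                        # F is already an AAF
    # F is a CAF: add the minimum number of extra detaching operations.
    return d_prime + min_feedback_size(forest, t1, t2)
```

The quality of the result therefore depends on both the size of the AF returned by A and on how many further detaching operations are needed to break its cycles.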

6. Experimental Results

We have implemented our new algorithms in Java. In this section, we compare the practical performance of our programs against that of the previous bests. In our experiments, we use a Linux (x64) desktop PC with Intel i7-4790 CPU (4.00 GHz, 8 threads) and 32 GB RAM.

We define the average approximation ratio (AAR) of an approximation algorithm $A$ (for RDC or HNC) as follows. For a given instance $I$, we use $A(I)$ (respectively, $B(I)$) to denote the value output by $A$ (respectively, an exact algorithm) on input $I$; the approximation ratio of $A$ for $I$, denoted by $r_A(I)$, is $\frac{A(I)}{B(I)}$. The AAR of $A$ for a set $ℐ$ of instances is $\frac{\sum_{I \in ℐ} r_A(I)}{|ℐ|}$.
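As a small illustration of this definition, the following Python sketch computes the AAR from two parallel lists of values. The function name and the list-based interface are assumptions made here, and the numbers in the usage note are made up and are not experimental data.

```python
from typing import Sequence


def average_approximation_ratio(approx_values: Sequence[float],
                                exact_values: Sequence[float]) -> float:
    """Mean of A(I) / B(I) over all instances I of a dataset."""
    if len(approx_values) != len(exact_values) or not approx_values:
        raise ValueError("need two non-empty lists of equal length")
    if any(b <= 0 for b in exact_values):
        raise ValueError("exact values must be positive")
    ratios = [a / b for a, b in zip(approx_values, exact_values)]
    return sum(ratios) / len(ratios)
```

For three hypothetical instances with approximate values 11, 20, 33 and exact values 10, 20, 30, the per-instance ratios are 1.1, 1.0, and 1.1, so the AAR is about 1.067.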

To generate a simulated dataset, we use two parameters $(n, m)$. Given $(n, m)$, we use the program of Beiko and Hamilton (2006) to generate a dataset consisting of 120 TT-pairs, where each TT-pair is generated by first generating a random phylogenetic tree $T_1$ with $n$ leaves and then obtaining another phylogenetic tree $T_2$ by applying $m$ random rSPR operations on $T_1$. So, the rSPR distance of each pair $(T_1, T_2)$ in the dataset is at most $m$, but the hybridization number of $(T_1, T_2)$ may be larger than $m$. In our experiments stated next, we choose $(n, m)$ from $\{(100, 50), (200, 80), (200, 100)\}$ and generate a dataset $ℐ(n, m)$ for each $(n, m)$ in this set.
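To make the generation procedure concrete, here is a minimal Python sketch of producing one TT-pair by applying $m$ random rSPR operations to a random tree. It is not the program of Beiko and Hamilton (2006): the parent-map tree representation, the way the random tree and the random moves are drawn, and all function names are assumptions chosen for brevity.

```python
import random


def random_tree(n, rng):
    """Random rooted binary tree on leaves 0..n-1, as (parent map, root)."""
    parent, roots, next_id = {}, list(range(n)), n
    while len(roots) > 1:
        a = roots.pop(rng.randrange(len(roots)))
        b = roots.pop(rng.randrange(len(roots)))
        parent[a] = parent[b] = next_id  # join two roots under a new vertex
        roots.append(next_id)
        next_id += 1
    return parent, roots[0]


def subtree(parent, v):
    """All vertices in the subtree rooted at v (including v)."""
    kids = {}
    for child, par in parent.items():
        kids.setdefault(par, []).append(child)
    seen, stack = set(), [v]
    while stack:
        x = stack.pop()
        seen.add(x)
        stack.extend(kids.get(x, []))
    return seen


def rspr_move(parent, root, rng):
    """Apply one random rSPR operation; the input tree is not modified."""
    parent = dict(parent)
    v = rng.choice(sorted(parent))       # any non-root vertex
    below_v = subtree(parent, v)
    p = parent.pop(v)                    # prune: cut the edge (p, v)
    sibling = next(c for c in parent if parent[c] == p)
    if p == root:                        # contract the degree-2 vertex p
        del parent[sibling]
        root = sibling
    else:
        parent[sibling] = parent.pop(p)
    # Regraft: subdivide the edge entering t (or add a new root above t).
    t = rng.choice(sorted((set(parent) | {root}) - below_v))
    if t == root:
        root = p
    else:
        parent[p] = parent[t]
    parent[t] = p
    parent[v] = p
    return parent, root


def random_tt_pair(n, m, seed=0):
    """Return (T1, T2), where T2 results from m random rSPR moves on T1."""
    rng = random.Random(seed)
    t1 = random_tree(n, rng)
    parent, root = t1
    for _ in range(m):
        parent, root = rspr_move(parent, root, rng)
    return t1, (parent, root)
```

Because a move may regraft a subtree at its original position or undo an earlier move, the rSPR distance of the resulting pair can be smaller than $m$, in line with the "at most $m$" remark above.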

6.1. Results on approximating RDC

Since all programs used in our experiments for approximating RDC are fast, it is meaningless to compare them in terms of running time. So, we compare them in terms of their AARs. We use $ℐ(100, 50)$ and $ℐ(200, 100)$ in the experiment. Our experimental results are summarized in Table 1, where the first and the second rows show the results for $ℐ(100, 50)$ and $ℐ(200, 100)$, respectively; Svv, CMW, and CHN denote the algorithms in Schalekamp et al. (2016), Chen et al. (2016), and Chen et al. (2018), respectively; MCTS_CMW and MCTS_CHN denote our MCTS algorithm with CMW and CHN used in the simulation step, respectively; CombApp (respectively, CombMCTS) denotes the algorithm that runs CMW and CHN (respectively, MCTS_CMW and MCTS_CHN) and outputs the better of the two solutions returned. From the table, we can see that MCTS is very helpful in improving the performance of approximation algorithms for RDC. In particular, our best algorithm (namely, CombMCTS) achieves a significantly better AAR than the previous best (namely, CHN).

Table 1. Comparing the Average Approximation Ratios of Approximation Algorithms for rSPR Distance Computation

Dataset	Svv	CMW	CHN	CombApp	MCTS_CMW	MCTS_CHN	CombMCTS
$ℐ(100, 50)$	1.41	1.133	1.135	1.104	1.037	1.034	1.02
$ℐ(200, 100)$	1.391	1.369	1.127	1.11	1.044	1.038	1.027
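The combined algorithms CombApp and CombMCTS described in the text simply run two algorithms and keep the better result. A minimal Python sketch of this combinator follows; the name `combine` and the convention that each algorithm returns a `(value, forest)` pair are assumptions made for illustration.

```python
from typing import Callable, Tuple

Result = Tuple[int, object]  # (approximate value, agreement forest)


def combine(*algorithms: Callable[[object, object], Result]) -> Callable[[object, object], Result]:
    """Build an algorithm that runs all given ones and keeps the smallest value."""
    if not algorithms:
        raise ValueError("at least one algorithm is required")

    def combined(t1, t2) -> Result:
        results = [alg(t1, t2) for alg in algorithms]
        # RDC is a minimization problem, so the smaller value is the better one.
        return min(results, key=lambda r: r[0])

    return combined
```

Since the combined algorithm returns, on every instance, a value no larger than that of either component, its AAR cannot exceed the smaller of the two component AARs.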

6.2. Results on approximating HNC

Since we want the exact hybridization number to be known, we use the two easiest datasets (namely, $ℐ(100, 50)$ and $ℐ(200, 80)$) in this experiment to compare the AARs of our approximation algorithms for HNC against the previous bests. Our experimental results are summarized in Table 2, where the first and the second rows show the results for $ℐ(100, 50)$ and $ℐ(200, 80)$, respectively. From the table, we can see that MCTS is very helpful in improving the performance of approximation algorithms for HNC as well. In particular, our best algorithm (namely, CombMCTS$_{hn}$) achieves a much better AAR than the previous best (namely, Svv$_{hn}$). Indeed, our experimental results show that for about half of the tested instances, CombMCTS$_{hn}$ found optimal solutions.

Table 2. Comparing the Average Approximation Ratios of Approximation Algorithms for Hybridization Number Computation

Dataset	Svv$_{hn}$	CMW$_{hn}$	CHN$_{hn}$	CombApp$_{hn}$	MCTS_CMW$_{hn}$	MCTS_CHN$_{hn}$	CombMCTS$_{hn}$
$ℐ(100, 50)$	1.397	1.214	1.134	1.12	1.038	1.033	1.02
$ℐ(200, 80)$	1.419	1.088	1.083	1.062	1.022	1.02	1.015

6.3. Results on computing HNC exactly

To compare the speed of our new exact algorithm for HNC against the previous best [namely, UltraNet in Chen and Wang (2013)], we use $ℐ(100, 50)$ and $ℐ(200, 80)$ again. As the ILP solver, we use IBM CPLEX, which is freely available for academic research. For each tested instance, we set a 1-hour time limit on the running time of each program. As a result, UltraNet fails to solve 1 instance of $ℐ(100, 50)$ and 16 instances of $ℐ(200, 80)$, whereas our new program solves all instances. With the failed instances excluded, the average running time of UltraNet is 54.46 (respectively, 323.86) seconds for the first (respectively, second) dataset, whereas that of our new program is only 0.66 (respectively, 0.86) seconds. So, our new program is more than 82 times faster than UltraNet on the first dataset and more than 376 times faster on the second; that is, its superiority in speed over UltraNet becomes more significant for larger instances.
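The speedup factors follow directly from the average running times reported above. The short Python sketch below only redoes this arithmetic; the dictionary layout and key names are assumptions made for illustration.

```python
# Average running times in seconds, as reported in the text
# (instances on which UltraNet failed are excluded).
avg_seconds = {
    "I(100,50)": {"UltraNet": 54.46, "ours": 0.66},
    "I(200,80)": {"UltraNet": 323.86, "ours": 0.86},
}

# Speedup = running time of UltraNet divided by running time of the new program.
speedup = {name: t["UltraNet"] / t["ours"] for name, t in avg_seconds.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

The two ratios are roughly 82.5 and 376.6, which is why the gap is said to widen on the larger dataset.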

Footnotes

Author Disclosure Statement

The authors declare they have no competing financial interests.

Funding Information

Z.-Z.C. was supported in part by the Grant-in-Aid for Scientific Research of the Ministry of Education, Science, Sports and Culture of Japan, under Grant No. 18K11183. L.W. was supported by a GRF grant from Hong Kong SAR government Project No. [CityU 11256116] and a grant from National Foundation of China Project No. [61373048].

References

Albrecht, Scornavacca, Cenci, et al. 2012. Fast computation of minimum hybridization networks. Bioinformatics, 28, 191–197.

Baroni, Semple, and Steel 2006. Hybrids in real time. Syst. Biol. 55, 46–56.

Beiko R.G., and Hamilton 2006. Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol. 6, 159–169.

Bonet M.L., John K.St., Mahindru, et al. 2006. Approximating subtree distances between phylogenies. J. Comput. Biol. 13, 1419–1434.

Bordewich, and Semple 2005. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Comb. 8, 409–423.

Chen Z.-Z., Fan, and Wang 2015. Faster exact computation of rSPR distance. J. Comb. Optim. 29, 605–635.

Chen Z.-Z., Harada, Nakamura, et al. 2018. Faster exact computation of rSPR distance via better approximation. IEEE/ACM Trans. Comput. Biol. Bioinform. DOI: 10.1109/TCBB.2018.2878731

Chen Z.-Z., Machida, and Wang 2016. An approximation algorithm for rSPR distance. In 22nd International Computing and Combinatorics Conference, Ho Chi Minh City, Vietnam, August 2–4, 2016, pp. 468–479.

Chen Z.-Z., and Wang 2012. Algorithms for reticulate networks of multiple phylogenetic trees. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 372–384.

Chen Z.-Z., and Wang 2013. An ultrafast tool for minimum reticulate networks. J. Comput. Biol. 20, 38–41.

Chen Z.-Z., and Wang 2016. Hybridnet: A tool for constructing hybridization networks. Bioinformatics, 26, 2912–2913.

Collins, Linz, and Semple 2011. Quantifying hybridization in realistic time. J. Comput. Biol. 18, 1305–1318.

Hein, Jiang, Wang, et al. 1996. On the complexity of comparing evolutionary trees. Disc. Appl. Math. 71, 153–169.

Hill, Nordstrom K.J., Thollesson, et al. 2010. Sprit: Identifying horizontal gene transfer in rooted phylogenetic trees. BMC Evol. Biol. 10, 42.

Kelk, van Iersel, Lekic, et al. 2012. Cycle killer…qu'est-ce que c'est? On the comparative approximability of hybridization number and directed feedback vertex set. SIAM J. Discr. Math. 26, 1635–1656.

Schalekamp, van Zuylen, and van der Ster 2016. A duality based 2-approximation algorithm for maximum agreement forest. In 43rd International Colloquium on Automata, Languages and Programming, Rome, Italy, July 11–15, 2016, pp. 1–70.

van Iersel, Kelk, Lekic, et al. 2014. A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees. BMC Bioinformatics, 15, 127.

Whidden, Beiko R.G., and Zeh 2010. Fast FPT algorithms for computing rooted agreement forests: theory and experiments. In International Symposium on Experimental Algorithms, Naples, Italy, May 20–22, 2010, pp. 141–153.

Whidden, Beiko R.G., and Zeh 2013. Fixed-parameter algorithms for maximum agreement forests. SIAM J. Comput. 42, 1431–1466.

Whidden, and Zeh 2009. A unifying view on approximation and FPT of agreement forests. In 9th International Workshop on Algorithms in Bioinformatics, Philadelphia, PA, September 12–13, 2009, pp. 390–401.

2009. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics, 25, 190–196.
