Abstract
The problem of computing the rooted subtree prune and regraft (rSPR) distance of two phylogenetic trees is computationally hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by Hybridization Number Computation [HNC]). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this article, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation program's output has much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time.
1. Introduction
Constructing the evolutionary history of a set of species is an important problem in the study of biological evolution. Phylogenetic trees are used in biology to represent the ancestral history of a collection of existing species. This is appropriate for many groups of species. However, due to reticulation events such as hybridization, recombination, and lateral gene transfer, there are certain groups for which the ancestral history cannot be represented by a tree. For this kind of groups of species, it is more appropriate to represent their ancestral history by rooted acyclic digraphs, where vertices of in-degree at least two represent reticulation events.
More specifically, by looking at two different segments of sequences or two different sets of genes of a set of extant species, we may obtain two different phylogenetic trees T1 and T2 of the same extant species with high confidence. Given T1 and T2, we want to construct a reticulate network N with the smallest number of reticulation events needed to explain the evolution of the species under consideration. Roughly speaking, N is the smallest rooted acyclic digraph such that each of T1 and T2 is homeomorphic to a subgraph of N. The number of vertices of in-degree larger than 1 in N is called the hybridization number of T1 and T2. The problem of computing the hybridization number of two given phylogenetic trees, denoted by Hybridization Number Computation (HNC), is NP-hard (Hein et al., 1996; Bordewich and Semple, 2005), where NP stands for the class of problems solvable in nondeterministic polynomial time. For this reason, quite a number of approximation algorithms and fixed-parameter algorithms have been designed and implemented for HNC (Wu, 2009; Chen and Wang, 2016, 2012; Hill et al., 2010; Collins et al., 2011; Albrecht et al., 2012; Kelk et al., 2012; van Iersel et al., 2014). To the best of our knowledge, the previously best program for solving HNC approximately (respectively, exactly) is given in van Iersel et al. (2014) [respectively, Chen and Wang (2013)].
A problem closely related to HNC is the problem of computing the rooted subtree prune and regraft (rSPR) distance of two given phylogenetic trees T1 and T2 of the same extant species. The rSPR distance between T1 and T2 can be defined as the minimum number of edges that should be deleted from each of T1 and T2 to transform them into homeomorphic rooted forests F1 and F2. The problem of computing the rSPR distance of two trees, denoted by rSPR Distance Computation (RDC), is NP-hard (Hein et al., 1996; Bordewich and Semple, 2005). This has motivated researchers to design and implement either exact or approximation algorithms for RDC (Hein et al., 1996; Bordewich and Semple, 2005; Bonet et al., 2006; Whidden and Zeh, 2009; Wu, 2009; Whidden et al., 2010; Chen and Wang, 2013; Chen et al., 2015, 2016, 2018; Schalekamp et al., 2016). To the best of our knowledge, the previously best software for solving RDC either exactly or approximately is due to Chen et al. (2018) (http://rnc.r.dendai.ac.jp/rspr.html).
In this article, we first improve Chen et al.'s exact algorithm (Chen and Wang, 2013) for HNC. Since rSPR distance is a lower bound on hybridization number, the main idea is to use the lower bound on rSPR distance outputted by Chen et al.'s algorithm (Chen et al., 2018) to cut unnecessary branches of the search tree. Another main idea is to arrange the child recursive calls of each recursive call carefully. Our experimental results show that the resulting algorithm can be implemented into a software that runs more than 80 times faster than Chen et al.'s UltraNet (Chen and Wang, 2013) for the easiest dataset used in the experiments. Moreover, its superiority in speed becomes even more significant for more difficult instances.
We then present a new approximation algorithm for RDC. Although this algorithm does not necessarily always output a better result than the algorithm in Chen et al. (2018), we can obtain a new algorithm that calls the two algorithms and outputs the better result returned by them. Our experimental results show that the resulting algorithm can be implemented into a software that often outputs better results than Chen et al.'s program (Chen et al., 2018). We further propose to use the so-called Monte Carlo tree search (MCTS) method to improve any approximation algorithm A for RDC. In our application of MCTS, instead of performing a number of random play-outs in the simulation phase of each round, we make a single call of A and then in the backpropagation phase, use its returned result to update information in the sequence of nodes selected for this round. Our experimental results show that the MCTS-based algorithm can be implemented into a software that outputs much better and, indeed, always nearly optimal results.
We further combine our MCTS-based approximation algorithm for RDC with the integer-linear programming (ILP) approach in van Iersel et al. (2014) to obtain a new approximation algorithm for HNC. Our experimental results show that the new algorithm can be implemented into a software that outputs much better (indeed always nearly optimal and often optimal) results than the previous best in van Iersel et al. (2014).
Our programs are available at http://rnc.r.dendai.ac.jp/rsprHN.html
2. Preliminaries
Throughout this article, a rooted forest always means a directed acyclic graph in which every vertex has in-degree at most 1 and out-degree at most 2.
Let F be a rooted forest. The roots (respectively, leaves) of F are those vertices whose in-degrees (respectively, out-degrees) are 0. The size of F, denoted by
For a vertex v of F, the subtree of F rooted at v, denoted by Fv, is the subgraph of F whose vertices are the descendants of v in F and whose edges are those edges connecting two descendants of v in F. If v is a root of F, then Fv is a component tree of F; otherwise, it is a pendant subtree of F. For convenience, we view each vertex u of F as an ancestor and descendant of u itself. A vertex u is lower than another vertex
A rooted binary forest is a rooted forest in which the out-degree of every nonleaf vertex is 2. Let F be a rooted binary forest. F is a rooted binary tree if it has only one root. If v is a nonroot vertex of F with parent p, then detaching Fv is the operation that modifies F by first deleting the edge
2.1. Phylogenetic trees and forests
Let X be a set of existing species. A phylogenetic tree on X is a rooted binary tree whose leaf set is X. A phylogenetic forest is the graph obtained by applying a sequence of zero or more detaching operations on a phylogenetic tree. In other words, a phylogenetic forest is a graph whose connected components are phylogenetic trees on different sets of species.
An FF pair is a pair
For an FF pair
An FF pair If some nonroot vertex v of F1 corresponds to a root of F2, then modify F1 by detaching
Obviously, if
Throughout the remainder of this article, an FF pair always means a proper FF pair. A sub-FF pair of a TT pair
For an FF-pair
2.2. Agreement forests
Let
Suppose that F is an AF of
The vertices of GF are the roots of F.
For every two roots r1 and r2 of F, there is an edge from r1 to r2 in GF if and only if r1 is an ancestor of r2 in T1 or T2.
We refer to GF as the decision graph associated with F. If GF is acyclic, then F is an acyclic agreement forest (AAF) of
We are now ready to define the problems studied in this article:
2.3. Transforming a CAF to an AAF
Suppose that F is a CAF of a TT-pair
Let V be the set of vertices in D. By Lemma 2.2, to compute an MDFS U of D, it suffices to solve the following ILP model (van Iersel et al., 2014):
Fortunately, in our application, we will have an integer k and only want to know whether the optimal value of the objective function is bounded by k from above. So, we modify the model by replacing the objective function with any constant (say, 0) and adding the new constraint
3. Solving HNC Exactly
Our algorithm for solving HNC exactly will use a subroutine for the following parameterized version of HNC.
Several definitions are in order. Let
3.1. Key ideas
Basically, our algorithm is a significantly refined version of the algorithm for HNC implemented in Chen and Wang's UltraNet (Chen and Wang, 2013). In this subsection, we list the key new ideas behind our new algorithm for HNC.
First, the new algorithm builds on a recent 2-approximation algorithm for RDC (Chen et al., 2018). When we compute the hybridization number, we use the lower bound outputted by the approximation algorithm to bound the search of the hybridization number. Since the lower bound is often nearly optimal, this bounding idea makes it possible for our algorithm to find the hybridization number in short time. Since the exact algorithm for RDC in Chen et al. (2018) is also fast, we can use it to bound the search of the hybridization number instead of using the 2-approximation algorithm for RDC.
Second, the new algorithm is recursive and we make child recursive calls in a careful order. More precisely, child recursive calls that appear to finish in shorter time are made earlier than those that look likely to finish in longer time.
Third, when we make a recursive call, we may know certain vertices v such that the subtree rooted at v should not be detached, and so we lock these vertices so that the subtrees rooted at them will never be detached in subsequent recursive calls. Moreover, the locked vertices help us make fewer child recursive calls.
Finally, when our algorithm needs to transform a CAF F of a TT-pair
Let D be the digraph constructed from F and
Some vertices of F may have been locked. So, for each locked vertex v of F, we can modify the model by fixing
By Lemma 4 in Chen and Wang (2012), we know that for each edge
3.2. The algorithm
Throughout this subsection, fix an instance
Our algorithm for computing
Our algorithm then repeatedly performs a cluster reduction on T1 and T2 until no such reduction is applicable. For the detail of cluster reductions, the reader is referred to Baroni et al. (2006). As the result of zero or more cluster reductions on T1 and T2, we obtain a sequence
To compute
Our algorithm for PHNC is recursive and proceeds as follows. It starts by checking whether
We hereafter assume that one or more roots of F1 are still not matched. Our algorithm then uses the program in Chen et al. (2018) to compute a lower bound
We hereafter assume that
Case 1: There is an active sibling-pair
Case 2: There is an active sibling-pair
Case 3: Neither Case 1 nor 2 occurs. In this case, our algorithm searches F1 for an active sibling-pair
We emphasize that the smaller type of an active sibling-pair in F1 is, the more our algorithm prioritizes it. Intuitively speaking, choosing an active sibling-pair of a smaller type in F1 will likely lead to fewer recursive calls.
Suppose that our algorithm has selected an active sibling-pair
If
If
Now, our algorithm makes t recursive calls on input
4. Solving RDC Approximately
Basically, we want an approximate algorithm that outputs better results than the algorithm in Chen et al. (2018). Although the algorithm in Chen et al. (2016) has a worse theoretical guarantee than the algorithm in Chen et al. (2018), it does not necessarily mean that the former always outputs worse results. So, we obtain a new approximation algorithm for RDC, which simply runs the algorithms in Chen et al. (2016, 2018) and outputs the better result returned by them.
Our new idea is to use MCTS to improve the performance of any approximation algorithm for RDC. MCTS has a number of variants, but we here use the basic one (namely, the UCT algorithm) for its simplicity.
4.1. Outline of the algorithm
In the remainder of this section, fix an FF-pair
How to find a promising z needs to be considered. In the next two cases, we know an optimal choice of z, that is, we know that the choice of z will lead to an optimal solution (Chen et al., 2015):
We hereafter assume that none of the optimal cases cited earlier occurs. Next, we outline how to find a promising z with MCTS. The idea behind MCTS is to build a small-sized search tree
When creating a node
where C is a constant (called the balance constant and fixed to be 0.2 in our experiments). The best child of a node
Initially,
Select a leaf node
Expand
Perform a simulation for
Compute the reward
Backpropagate the reward
Increase
Once finishing growing
4.2. Expanding a node
Suppose that we have selected a leaf node
We emphasize that the smaller type of an active sibling-pair is, the more our algorithm prioritizes it.
If (u,v) is not found, we know that f(α)2 is an AF of f(α) and, hence, we have nothing to do with expanding α. Thus, we hereafter assume that (u,v) has been found. Then, we construct a family ℱ of sets as follows.
If
If
If
We now use ℱ to create the children of α as follows. For each set S ∈ ℱ, we create a child βS, where f(βS)1=f(α)1 and f(βS)2 is obtained from f(α)2 by detaching the subtrees rooted at the vertices in S.
5. Solving HNC Approximately
We say that an approximation algorithm A for RDC is useful if given a TT-pair
6. Experimental Results
To compare our new algorithms against the previous bests, we have implemented them in Java. In this section, we compare the real performance of our programs against that of the previous bests. In our experiments, we use a Linux (x64) desktop PC with Intel i7-4790 CPU (4.00 GHz, 8 threads) and 32 GB RAM.
We define the average approximation ratio (AAR) of an approximation algorithm A (for RDC or HNC) as follows. For a given instance I, we use
To generate a simulated dataset, we use two parameters
6.1. Results on approximating RDC
Since all programs used in our experiments for approximating RDC are fast, it is meaningless to compare them in terms of running time. So, we compare them in terms of their AARs. We use
Comparing the Average Approximation Ratios of Approximation Algorithms for rSPR Distance Computation
6.2. Results on approximating HNC
Since we want the exact hybridization number to be known, we use the two easiest datasets [namely,
Comparing the Average Approximation Ratios of Approximation Algorithms for Hybridization Number Computation
6.3. Results on computing HNC exactly
To compare the speed of our new exact algorithm for HNC against the previous best [namely, UltraNet in Chen and Wang (2013)], we use
Footnotes
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
Z.-Z.C. was supported in part by the Grant-in-Aid for Scientific Research of the Ministry of Education, Science, Sports and Culture of Japan, under Grant No. 18K11183. L.W. was supported by a GRF grant from Hong Kong SAR government Project No. [CityU 11256116] and a grant from National Foundation of China Project No. [61373048].
