Due to hybridization events in evolution, studying different genes of a set of species may yield two or more related but different phylogenetic trees for the set of species. In this case, we want to combine the trees into a reticulate network with the fewest hybridization events. In this article, we develop a software tool (named UltraNet) for several fundamental problems related to the construction of minimum reticulate networks from two or more phylogenetic trees. Our experimental results show that UltraNet is much faster than all previous tools for these problems.
1. Introduction
Due to hybridization events in evolution, studying different genes of a set of species may yield related but different phylogenetic trees for the set of species. In this case, we want to combine the trees into a reticulate network with the fewest hybridization events. This problem is NP-hard even when the number of trees is two (Hein et al.; 1996; Bordewich and Semple, 2005). Several tools had previously been developed for this problem and its variants (Albrecht et al.; 2012, Colins et al.; 2011; Chen and Wang, 2012a; Wu, 2009; Wang and Wu, 2010). However, the previously fastest tools can still take hours to finish even when only two trees are given and their size is moderate. In this article, we develop a new tool (called UltraNet) for these problems by implementing and utilizing two recent algorithms for rSPR distance and for hybridization number of two given trees (Chen and Wang, 2012b). Our experimental results show that UltraNet is much faster than the best tools previously used for these problems—namely, FastHN (Chen et al, 2012); Dendroscope 3 (Albrecht et al, 2012); CMPT and MaafB (Chen and Wang, 2012a); and PIRN (Wu, 2010).
2. Problem Definitions
A binary tree is a rooted tree in which each nonleaf vertex has exactly two children. Let X be a set of existing species. A phylogenetic X-tree is a binary tree whose leaf set is X. For our purpose, a reticulate network on X is a directed acyclic graph N in which the set of vertices of out-degree 0 (still called the leaves) is X, each non leaf vertex has out-degree 2, and there is exactly one vertex of in-degree 0 (called the root). A vertex of in-degree larger than 1 in N is called a reticulate vertex. Intuitively speaking, a reticulate vertex corresponds to a reticulation event. The hybridization number (HybNum for short) of N is the number of reticulate vertices in N. The size of N is E–H, where E is the total number of edges entering reticulate vertices in N, and H is the HybNum of N.
A reticulate network N on X displays a phylogenetic tree T on X if T can be obtained from N by first deleting some edges and then merging each vertex of out-degree 1 (resulting from the edge deletions) and its single child into a single vertex. We are interested in the following problem (denoted by HybNum) (Chen and Wang, 2012a):
Input: Phylogenetic trees \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} with the same leaf set.
HybNum is closely related to the problem of computing a maximum acyclic agreement forest (MAAF) of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document}. Indeed, the HybNum of N equals the number of trees in an MAAF of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} minus one (Baroni et al., 2005; Chen and Wang, 2012a).
In some cases, we may want to enumerate all minimum-HybNum reticulate networks of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document}. Unfortunately, it is not hard to construct example trees \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} for which there are too many minimum-HybNum reticulate networks. So, we instead want to enumerate only a representative set of minimum-HybNum reticulate networks for them. This motivates us to consider the following problem (denoted by EnumHN) (Albrecht et al., 2012; Chen and Wang, 2012a; Chen et al., 2012):
Input: Phylogenetic trees \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} with the same leaf set.
Output: All MAAFs of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} together with a minimum-HybNum reticulate network (displaying \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document}) for each MAAF.
We also consider the following problem (denoted by SizeLB) (Chen and Wang, 2012a, Wu, 2010):
Input: Phylogenetic trees \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document} with the same leaf set.
Output: A lower bound on the size of a reticulate network displaying \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$$T_1 , \ldots , T_k$$
\end{document}.
3. Methods
In this section, we list the key ideas behind UltraNet. First, we obtain an ultrafast subroutine for computing the rSPR distance of two given trees by implementing a recent fast algorithm for this problem (Chen and Wang, 2012b). Second, we obtain an ultrafast subroutine for computing the minimum HybNum of a reticulate network displaying two given trees by implementing a recent fast algorithm for this problem (Chen and Wang, 2012b).
Third, we use the two aforementioned subroutines to speed up the best tools previously used for HybNum and EnumHN [namely, CMPT (Chen and Wang, 2012a)] and the best tool previously used for SizeLB [namely, MaafB (Chen and Wang, 2012a)].
4. Results and Discussion
Since the two ultrafast subroutines for computing the rSPR distance or the minimum HybNum of two given trees are the key components of UltraNet, here we only compare them with the best previously used—namely, RSPR (Whidden et al., 2010) and FastHN (Chen and Wang, 2012b). The experiment has been performed on a Windows-7 (x64) desktop PC with i7-975 CPU and 6GB RAM.
We use the program of Beiko and Hamilton (2006) to generate 60 pairs (T1, T2) of trees, each with 200 leaves, where T2 is obtained from T1 by performing 50 random rSPR operations. Figure 1 summarizes the average running times of the reliable version (v1.03) of RSPR1 and UltraNet for computing the rSPR distances between the generated tree-pairs, where each average is taken over those tree-pairs with the same rSPR distance. As can be seen from the figure, UltraNet is much faster than RSPR. This difference in speed becomes clearer as the rSPR distance becomes larger.
Comparing our UltraNet against the reliable version (v1.03) of Whidden et al.'s RSPR (2010) on 60 randomly generated tree-pairs, where each tree has 200 leaves.
To compare the running times of UltraNet, FastHN, and Dendroscope 3 for computing hybridization number or enumerating all MAAFs, we also use the 60 tree-pairs generated in the above by setting n = 200 and r = 50. Figure 2 summarizes the average running time of UltraNet for computing the hybridization numbers or enumerating all MAAFs of the generated tree-pairs, where each average is taken over those tree-pairs with the same hybridization number. For each of the 60 tree-pairs, FastHN cannot finish within one day and Dendroscope 3 fails to finish. In contrast, UltraNet usually finishes within 1 minute.
The average running time of UltraNet on 60 randomly generated tree-pairs, where each tree has 200 leaves. For each of the 60 tree-pairs, FastHN cannot finish within three days and Dendroscope 3 fails to finish.
Executables of UltraNet for Windows XP (x86), Windows 7 (x86-64), Linux (x86), Linux (x86-64), and Mac OS X (x86) together with the datasets used in our experiments are available from the authors upon request.
Footnotes
Acknowledgment
L.W. is supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 121608].
Author Disclosure Statement
The authors declare that no competing financial interests exist.
1
For some datasets used in our experiments, the newest version (namely, v1.1.0) of RSPR fails to output the correct rSPR distance.
References
1.
AlbrechtB., ScornavaccaC., CenciA.et al.2012. Fast computation of minimum hybridization networks. Bioinformatics, 28:191–197.
2.
BaroniM., GrunewaldS., MoultonV.et al.2005. Bounding the number of hybridisation events for a consistent evolutionary history. J. of Math. Biol., 51:171–182.
3.
BeikoR.G., HamiltonN.2006. Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol., 6:159–169.
4.
BordewichM., SempleC.2005. On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics, 8:409–423.
5.
ChenZ.-Z., WangL.2012. Algorithms for reticulate networks of multiple phylogenetic trees. IEEE/ACM Trans. on Computational Biology and Bioinformatics, 9:372–384.
6.
ChenZ.-Z., WangL.2012. Faster exact algorithms for hybridization number and rSPR distance. Submitted for publication. http://rnc.r.dendai.ac.jp/ultraNet.pdf
7.
ChenZ.-Z., WangL., YamanakaS.2012. A fast tool for minimum hybridization networks. BMC Bioinformatics, 13:155.
8.
CollinsL., LinzS., SempleC.2011. Quantifying hybridization in realistic time. J. of Comput. Biol., 18:1305–1318.
9.
HeinJ., JingT., WangL.et al.1996. On the complexity of comparing evolutionary trees. Disc. Appl. Math., 71:153–169.
10.
WangJ., WuY.2010. Fast computation of the exact hybridization number of two phylogenetic trees. Proceedings of ISBRA 2010, 203–214.
11.
WhiddenC., BeikoR.G., ZehN.2010. Fast FPT algorithms for computing rooted agreement forest: theory and experiments. LNCS, 6049:141–153.
12.
WuY.2009. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics, 190–196.
13.
WuY.2010. Close lower and upper bounds for the minimum reticulate network of multiple phylogenetic trees. Bioinformatics [ISMB], 26:140–148.