Estimating Optimal Species Trees from Incomplete Gene Trees Under Deep Coalescence

Abstract

The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called “deep coalescence”) is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.

1. Introduction

Gene tree estimation is well known to be a computationally challenging problem, especially for large datasets where nucleotide alignment estimation can be difficult (Liu et al., 2009b, 2010; Wang et al., 2011), and errors in gene sequence alignment can result in errors in the estimated gene trees. However, over the last two decades, there have been dramatic improvements in the mathematical foundations of phylogeny estimation, and new methods for gene tree estimation and multiple sequence alignment have also been developed.

With respect to mathematical foundations, one area of active research is the analysis of the sequence lengths that suffice for topological accuracy with high probability, and the development of methods that are guaranteed to recover the true tree with high probability from polynomial length sequences (Cryan et al., 1998; Erdos et al., 1999a,b; Csürős and Kao, 1999; Huson et al., 1999; Nakhleh et al., 2001a,b, 2002; Warnow et al., 2001; Mossel, 2010; Daskalakis et al., 2011; Gronau et al., 2011). These methods suggest that very large trees can be estimated with high accuracy without needing very long sequences, provided that appropriate methods can be used. The sequence length requirement of maximum likelihood estimation, however, remains an open problem (the current analysis given in Steel and Székely [1999] and Steel and Székely [2002] does not show that polynomial length sequences suffice for accuracy with high probability, but this bound is probably loose, as discussed in Mossel [2010]).

New alignment and tree estimation methods have also been developed and have substantially changed the way gene trees and alignments are estimated in practice. New multiple sequence alignment methods, such as MAFFT (Katoh et al., 2005a,b), Opal (Wheeler and Kececioglu, 2007), PRANK (Loytynoja and Goldman, 2005), and SATé (Liu et al., 2009b, 2011b), produce substantially improved alignments compared to earlier methods, and new maximum likelihood (ML) phylogeny estimation methods, including RAxML (Stamatakis, 2006), FastTree-2 (Price et al., 2010), GARLI (Zwickl, 2006), and PhyML (Guindon et al., 2010), have made very large-scale maximum likelihood analysis much more accessible. Of these methods, RAxML is probably the most popular, but as shown in Liu et al. (2011a), FastTree-2 is much faster than RAxML and may be as accurate as RAxML for very large datasets. Finally, co-estimation of alignments and trees (overviewed in Loytynoja and Goldman [2009]) is also an area of active research. Most of the methods that co-estimate alignments and trees are computationally intensive and not used much in practice, as noted in Lunter et al. (2005); however, the co-estimation methods POY (Varón et al., 2007), BAli-Phy (Redelings and Suchard, 2005), and SATé (Liu et al., 2009b, 2011b) are in use in biological dataset analyses. Of these, BAli-Phy is the only one based on a statistical model of sequence evolution that includes indels, but it is limited to perhaps 200 sequences; POY is an extension of maximum parsimony to include a gap cost and can analyze datasets with a few hundred sequences (though its performance relative to two-phase methods that first align and then estimate a tree is debated [Ogden and Rosenberg, 2007; Lehtonen, 2008; Liu et al., 2009a; Liu and Warnow, 2012]); and finally, SATé can analyze datasets with thousands of sequences, and produces more accurate trees and alignments than standard approaches.

Thus, in the last 10 years or so, there has been a dramatic improvement in methods for estimating gene trees and sequence alignments. However, species tree estimation represents an additional challenge, because no individual gene tree is necessarily a good estimate of the true species tree. That is, for a number of different biological reasons, including incomplete lineage sorting (ILS) (also known as deep coalescence), gene duplication and loss, and horizontal gene transfer, gene trees can differ from the species tree (Maddison, 1997). As a result, species tree estimations need to take causes of discord between gene trees and species trees into consideration, in order to produce highly accurate estimates of the species tree.

In this article, we consider the problem of estimating species trees from estimated gene trees when the true gene trees can differ from the true species trees due to incomplete lineage sorting. Because of the frequency of deep coalescent events in phylogenetic analyses of closely related species, many methods for estimating species trees from gene trees or gene sequence alignments have been developed that explicitly take deep coalescence into account; see Degnan and Rosenberg (2009) for a relatively recent survey of methods and Edwards (2009) for a discussion of the importance of these methods for biological data analysis. Studies evaluating these methods have examined performance with respect to tree error and computational requirements on simulated datasets (Yang and Warnow, 2011), all restricted to datasets in which all gene trees have at least one individual for each species. Most of these studies have shown that methods that explicitly use statistical models to inform the estimation produce the best results; however, Yang and Warnow (2011) showed that some very simple fast methods (in particular, the greedy consensus) came close to the accuracy of a statistically based method, BUCKy (Larget et al., 2010) on tree distributions estimated using MrBayes (Ronquist and Huelsenbeck, 2003).

We focus our attention on the problem of estimating species trees from incomplete estimated gene trees, by which we mean the case where the gene trees might not contain any individuals for some species. In this case, methods that require that all the gene trees have the same set of taxa (such as the greedy consensus and BUCKy) cannot be applied. In addition, results from prior studies that evaluated methods on inputs in which all gene trees have at least one individual from each species are not necessarily applicable, since performance on incomplete gene trees could be different.

We begin with a study of the Minimize Deep Coalescence (MDC) problem introduced in Maddison (1997). This problem takes as input a set of rooted binary gene trees, each on the same set of taxa, and seeks the species tree for which there is a minimum total number of deep coalescences. Although this approach to species tree estimation is not statistically consistent when gene trees can differ from the species tree due to ILS (Than and Rosenberg, 2011), it is one of the most popular techniques for estimating species trees when ILS is suspected. We show how to extend MDC to the case where the gene trees are incomplete, and we prove that Phylonet-MDC (Than et al., 2008) solves this computational problem exactly. We then report on a simulation study we performed to evaluate methods for estimating species trees from incomplete gene trees or alignments for datasets with multiple genes and with 11, 17, or 100 taxa. We compare *BEAST (Heled and Drummond, 2010), a Bayesian method for estimating species trees from gene sequence alignments when genes can differ from species trees due to ILS, to methods based on MDC (iGTP-MDC [Chaudhary et al., 2010] and Phylonet-MDC). We also make comparisons to a heuristic for MRP (matrix representation with parsimony, a standard supertree method) (Baum, 1992; Ragan, 1992) known to be one of the most accurate supertree methods (Swenson et al., 2010; Kupczok et al., 2010; Swenson et al., 2011b) and to heuristics to minimize duplications or duplications + losses in iGTP (Chaudhary et al., 2010), none of which consider ILS when estimating species trees. We compare these methods on datasets simulated on gene trees that can differ from species trees due to ILS and report the missing branch rates of each species tree that we compute.

Although we did not attempt to run *BEAST on the 100-taxon datasets (due to its excessive computational requirements on large datasets), it produced the most accurate trees on the datasets with 11 or 17 taxa. Comparisons between other methods showed that generally MRP gave the most accurate results, and that (when it could be run), the exact version of Phylonet-MDC produced the next most accurate results. In addition, MRP was very fast on these datasets, producing results in under a minute on all datasets. These results suggest that at least for some conditions involving incomplete gene trees, methods that attempt to solve MRP may be computationally tractable ways of producing reasonably accurate species trees, and perhaps better than methods that optimize the MDC criterion. However, for those datasets for which statistical methods (such as *BEAST) can be run, they may be able to produce substantially more accurate trees than all other methods. (See Table 1 for a list of acronyms used in this article and their definitions, and Table 2 for a list of software used in the simulation study.)

Table 1.

Acronyms Used in This Article

Name	Meaning	Comments
ILS	Incomplete lineage sorting	Also called “deep coalescence”
MBMC	Minimizing B-maximal clusters	A computational problem for estimating species trees from complete gene trees, shown to be equivalent to MDC in Yu et al. (2011)
MBMC_inc	MBMC for incomplete gene trees	Extension of MBMC to incomplete gene trees, shown here to be equivalent to MDC_inc
MDC	Minimize deep coalescence	Optimization problem for species tree estimation in the presence of ILS, defined only for complete gene trees
MDC_inc	MDC for incomplete gene trees	MDC_inc seeks completions of all gene trees and a species tree, so that the species tree optimizes MDC with respect to the completed gene trees
MRP	Matrix representation with parsimony	Standard optimization problem for supertree computation, known to be NP-hard

Table 2.

Software Used in This Article

Name	Summary	Reference
*BEAST	Bayesian co-estimation of gene trees and species trees, in the presence of ILS	Heled and Drummond (2010)
FastTree-2 (FT)	Fast maximum likelihood phylogeny estimation. FT-75 refers to the tree obtained by running FastTree-2 and then collapsing all branches with support below 75%.	Price et al. (2010)
iGTP	Gene Tree Parsimony software, implementing a heuristic search to construct species trees from sets of gene trees, under three criteria: MDC, duplications, and duplications plus losses	Chaudhary et al. (2010)
PAUP*	Phylogenetic Analysis using Parsimony (*and Other Methods). We use heuristics in PAUP* for parsimony, applied to an MRP matrix we compute.	Swofford (1996)
Phylonet	Software package that performs several functions related to species phylogeny estimation from sets of gene trees. In this paper we use Phylonet to find solutions (exact or heuristic) to the MDC problem.	Than et al. (2008)

2. Theoretical Results for MDC

We begin by defining the MDC problem in the context of complete rooted, binary gene trees. We then show how to extend MDC to incomplete gene trees.

2.1. MDC for complete gene trees

The MDC problem is as follows:

• Input: A set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of rooted, binary gene trees with each tree t_i on the same set S of taxa.

• Output: a rooted, binary species tree T that minimizes the number of extra lineages with respect to $\mathcal{T}$, denoted by $XL(T, \mathcal{T}) = \sum_i XL(T, t_i)$.

To define the MDC problem, therefore, we need to define XL(T, t_i), i.e., the number of extra lineages of a species tree T with respect to a gene tree t_i. Visually, this is defined by embedding the gene tree t_i into the species tree T, and then counting how many lineages there are on each edge of the species tree; for a given edge, the number of extra lineages is one less than the total number of lineages on the edge (Maddison, 1997). This visual definition of the MDC cost is not necessarily easy to understand.

An alternative definition is given in terms of what are called “B-maximal clusters,” which we now define. A cluster is a subset of the leaf set of a rooted tree, consisting of all the leaves below some internal node. Thus, given a cluster within a tree defined by the node v, we can also define its parent cluster to be the cluster associated with the parent of v. Furthermore, the “parent edge” of a cluster C defined by the node v (i.e., v is the root of the subtree whose leafset is C) is the edge e = (v, w), where w is the parent of v, i.e., the node adjacent to v on the path from v to the root of the tree. Let T and t be rooted binary trees on S, with T denoting the species tree and t denoting the gene tree. Let B be a cluster of T. We will say that cluster A of t is B-maximal if A ⊆ B and the parent cluster for A is not a subset of B.

For a cluster B of T, we define k_B(t) to be the number of B-maximal clusters of t, and we let w_B(t) = k_B(t) − 1. It is now known that the embedding of the gene tree t into the species tree T that maps every node v in t to the MRCA (most recent common ancestor) in T of the leafset below v optimizes the MDC cost. Furthermore, for this embedding, the number of lineages “leaving” the parent edge of the cluster B (i.e., the edge between B and the root of the tree T) is k_B(t); therefore, the number of extra lineages on the parent edge is w_B(t) (one less than the number of lineages) (Yu et al., 2011). Note that w_B(t) ≥ 0 since t and T have the same set of taxa, and that XL(T, t) = ∑_B w_B(t), where the sum is taken over all clusters B in the tree T, is the number of extra lineages implied by the pair t, T. This is what is meant by the MDC cost for T with respect to gene tree t.
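The cluster-based definition can be computed directly. The following Python sketch is illustrative only and is not taken from Phylonet or iGTP. It assumes that trees are written as nested tuples of leaf labels, that each leaf counts as a singleton cluster (one lineage), and that the root cluster, having no parent, is B-maximal whenever it is contained in B. All function names are hypothetical.

```python
def leafset(tree):
    """Leaf labels below a node; a tree is a leaf label (str) or a tuple of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def cluster_pairs(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; parent_cluster is None at the root."""
    here = leafset(tree)
    yield here, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from cluster_pairs(child, here)


def k_B(B, gene_tree):
    """Count the B-maximal clusters of gene_tree."""
    return sum(
        1
        for A, parent in cluster_pairs(gene_tree)
        if A <= B and (parent is None or not parent <= B)
    )


def extra_lineages(species_tree, gene_tree):
    """XL(T, t): sum over clusters B of T of w_B(t) = k_B(t) - 1 (complete gene trees only)."""
    return sum(k_B(B, gene_tree) - 1 for B, _ in cluster_pairs(species_tree))


T = (("a", "b"), ("c", "d"))  # species tree
t = (("a", "c"), ("b", "d"))  # gene tree on the same taxa
print(extra_lineages(T, t))  # 2
print(extra_lineages(T, T))  # 0
```

In this small case the two cherries {a, b} and {c, d} of T each carry two gene lineages on their parent edge, which gives one extra lineage apiece.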

The MDC problem can then be restated as follows:

• Output: binary rooted species tree T on S such that $XL(T, \mathcal{T}) = \sum_i \sum_B w_B(t_i)$ is minimized, where i ranges from 1 to k and B ranges over the clusters in the species tree T.

Than and Nakhleh (2009) noted that this problem could be solved exactly by finding a minimum weight clique of size n − 2 in a graph in which there is a node for every possible cluster in the species tree (i.e., subset of taxa), an edge between nodes where their clusters are compatible (meaning that they can co-exist in a rooted tree), and where the weight of the node for cluster B is ∑_iw_B(t_i). This observation yielded the exact version of Phylonet-MDC (Than et al., 2008). By restricting the set of nodes to those clades that appear in the input set of gene trees, they produced the heuristic version of Phylonet-MDC; this method solves the MDC problem exactly when constrained to species trees whose clades are drawn only from the input gene tree clades. Finally, Yu et al. (2011) showed how to modify the Phylonet-MDC algorithm so that it could work with unrooted, partially resolved gene trees and find optimal rooted refinements and species trees that minimize the MDC score.
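To make the clique formulation concrete, the sketch below enumerates it by brute force for a toy input. It is not the Phylonet implementation: it is an exhaustive search that is only feasible for a handful of taxa, it assumes nested-tuple trees with leaves treated as singleton clusters, and it restricts vertices to proper clusters with at least two taxa (singletons and the full taxon set have weight 0 and occur in every tree). The function names are hypothetical.

```python
from itertools import combinations


def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def cluster_pairs(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; parent_cluster is None at the root."""
    here = leafset(tree)
    yield here, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from cluster_pairs(child, here)


def k_B(B, gene_tree):
    """Number of B-maximal clusters of gene_tree."""
    return sum(
        1
        for A, parent in cluster_pairs(gene_tree)
        if A <= B and (parent is None or not parent <= B)
    )


def compatible(X, Y):
    """Two clusters can co-exist in a rooted tree iff they are disjoint or nested."""
    return not (X & Y) or X <= Y or Y <= X


def min_weight_clique(gene_trees):
    """Exhaustively find a minimum-weight clique of n - 2 pairwise compatible clusters."""
    taxa = sorted(leafset(gene_trees[0]))
    n = len(taxa)
    vertices = [frozenset(c) for size in range(2, n) for c in combinations(taxa, size)]
    weight = {B: sum(k_B(B, t) - 1 for t in gene_trees) for B in vertices}
    best = None
    for clique in combinations(vertices, n - 2):
        if all(compatible(X, Y) for X, Y in combinations(clique, 2)):
            cost = sum(weight[B] for B in clique)
            if best is None or cost < best[0]:
                best = (cost, set(clique))
    return best


t1 = ((("a", "b"), "c"), "d")
t3 = (("a", "b"), ("c", "d"))
cost, clique = min_weight_clique([t1, t1, t3])
print(cost, sorted("".join(sorted(B)) for B in clique))  # 1 ['ab', 'abc']
```

The clusters {a, b} and {a, b, c} returned here correspond to the species tree (((a,b),c),d), which agrees with two of the three input gene trees.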

2.2. Extension to incomplete gene trees

We now discuss how to extend the MDC criterion to handle incomplete gene trees, where the gene tree leaf sets may not contain all the species. We begin with a definition: If S is the full set of taxa and t is a binary rooted tree on a subset of S, then we say that t′ is a completion of t if t′ is a binary rooted tree that contains all the taxa in S and that agrees with t when restricted to the taxa in t. Thus, a completion t′ is obtained by adding additional leaves to t so that it contains all the taxa it is missing. With this, we can now define MDC for incomplete gene trees.
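The definition of a completion can be checked mechanically. The sketch below is a minimal illustration under stated assumptions: trees are nested tuples of leaf labels, both trees are already binary, and two rooted trees on the same leaves are compared through their cluster sets. The helper names `restrict`, `cluster_set`, and `is_completion` are hypothetical.

```python
def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def restrict(tree, keep):
    """Subtree induced by the taxa in keep, suppressing nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def cluster_set(tree):
    """All clusters of a tree; rooted trees on the same leaves are equal iff these sets match."""
    found = {leafset(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= cluster_set(child)
    return found


def is_completion(t_prime, t, taxa):
    """True if t_prime contains every taxon in taxa and agrees with t on the leaves of t."""
    return (
        leafset(t_prime) == frozenset(taxa)
        and cluster_set(restrict(t_prime, leafset(t))) == cluster_set(t)
    )


S = "abcde"
t = (("b", "c"), ("d", "e"))                 # incomplete: taxon a is missing
good = (("a", ("b", "c")), ("d", "e"))       # a attached next to (b, c)
bad = (("a", "b"), ("c", ("d", "e")))        # breaks up the clade (b, c)
print(is_completion(good, t, S), is_completion(bad, t, S))  # True False
```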

MDC for incomplete gene trees (MDC_inc).

• Output: binary rooted species tree T and completions $t'_i$ of t_i so as to minimize $XL(T, \mathcal{T}')$, where $\mathcal{T}' = \{t'_1, t'_2, \ldots, t'_k\}$.

We will refer to this problem as MDC-incomplete, and we will denote a solution to MDC-incomplete on input set $\mathcal{T}$ by $MDC_{inc}(\mathcal{T}) = (T, \mathcal{T}')$.

Recall that k_B(t) is defined for the case where the gene tree t is rooted and has the same set of taxa as the species tree; in this case, it equals the number of B-maximal clusters of t. Furthermore, again for the case where the gene trees all have the same set of taxa, we have defined $XL(T, \mathcal{T}) = \sum_B \sum_i w_B(t_i)$, where B ranges over all clusters of T, i ranges from 1 to k, and w_B(t) = k_B(t) − 1. However, we will modify the definition of w_B(t) to appropriately reflect the possibility that the cluster B may contain taxa that do not appear in t. That is, we set

• w_B(t) = 0 if $B \cap \mathcal{L}(t) = \emptyset$ (where $\mathcal{L}(t)$ denotes the leafset of t), and

• w_B(t) = k_B(t) − 1, otherwise.

In other words, we generally use the same definition for w_B(t), except when B is entirely disjoint from the leafset of t. This definition ensures that w_B(t) ≥ 0 for all clusters B and all gene trees t.
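The effect of the special case is easy to see numerically. In the sketch below (an illustration only, with nested-tuple trees, leaves treated as singleton clusters, and hypothetical function names), a cluster disjoint from the gene tree's leafset has no B-maximal clusters, so the unmodified formula k_B(t) − 1 would give −1.

```python
def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def cluster_pairs(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; parent_cluster is None at the root."""
    here = leafset(tree)
    yield here, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from cluster_pairs(child, here)


def k_B(B, gene_tree):
    """Number of B-maximal clusters of gene_tree."""
    return sum(
        1
        for A, parent in cluster_pairs(gene_tree)
        if A <= B and (parent is None or not parent <= B)
    )


def w_B(B, gene_tree):
    """Modified weight: 0 when B shares no taxa with gene_tree, else k_B - 1."""
    if not (B & leafset(gene_tree)):
        return 0
    return k_B(B, gene_tree) - 1


t = ((("a", "b"), "c"), "d")  # gene tree missing taxon e
print(k_B(frozenset("e"), t) - 1)  # -1 under the unmodified formula
print(w_B(frozenset("e"), t))      # 0 with the special case
print(w_B(frozenset("de"), t))     # 0: d is the only {d, e}-maximal cluster
print(w_B(frozenset("bc"), t))     # 1: b and c are separate {b, c}-maximal clusters
```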

Minimizing B-maximal clusters (MBMC_inc).

• Input: set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of binary rooted trees, with t_i on leafset S_i, for $i = 1, \ldots, k$.

• Output: binary rooted species tree T on S = ∪_iS_i such that ∑_i ∑_Bw_B(t_i) is minimized, where i ranges from 1 to k and B ranges over all clusters in T.

We refer to this problem as MBMC-incomplete, and the optimal tree given input $\mathcal{T}$ is denoted by $MBMC_{inc}(\mathcal{T})$. Note that when S_i = S_j for all i, j, then all the gene trees are complete (on the same set of taxa), and the problem is identical to the MDC problem (optimal solutions to this problem minimize the number of extra lineages).

The main result in this article is the following:

Theorem 1

Let $\mathcal{T}$ be a set of incomplete, rooted, binary gene trees. If $T = MBMC_{inc}(\mathcal{T})$, then there exist completions $t'_i$ for each t_i so that $MDC_{inc}(\mathcal{T}) = (T, \mathcal{T}')$, where $\mathcal{T}' = \{t'_1, t'_2, \ldots, t'_k\}$.
Also, if $MDC_{inc}(\mathcal{T}) = (T, \mathcal{T}')$, then $T = MBMC_{inc}(\mathcal{T})$.

In other words, the species tree T that optimizes the MBMC_inc criterion is the species tree component of the optimal solution to MDC_inc.

Phylonet-MDC and iGTP-MDC

The software packages Phylonet (Than et al., 2008) and iGTP (Chaudhary et al., 2010) handle incomplete gene trees differently when attempting to solve MDC: the two packages can compute different MDC scores. In particular, Phylonet defines the MDC score using the MBMC_inc cost, as described above (i.e., the cost of a species tree T is ∑_i ∑_B w_B(t_i), where B ranges over the clusters in T and i ranges from 1 to k). Theorem 1 thus shows that Phylonet-MDC computes the MDC score correctly. By contrast, there are inputs for which iGTP-MDC does not return this score, indicating that iGTP-MDC defines the MDC score differently for incomplete gene trees. One particular instance in which this occurs is as follows:

• Input gene trees: T₁ = (((a, b), c), d), T₂ = ((b, c), (d, e)), and T₃ = ((a, d), (b, e)).

• Output of Phylonet-MDC exact version: ((d,e),(a,(b,c))), claiming 3 extra lineages.

• Output of iGTP(MDC): ((d,e),(c,(a,b))), claiming 2 extra lineages.

By our calculation and definition for MDC_inc, Phylonet-MDC correctly computes the number of extra lineages, but iGTP does not. We conjecture that iGTP seeks the species tree T that minimizes ∑_i XL(T_i, t_i), where T_i is the subtree of T induced by S_i. Therefore, iGTP-MDC and Phylonet solve different problems when given gene trees that are incomplete.
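The two scoring rules can be compared on this instance with a short script. This is our own illustrative sketch, not code from either package: `mbmc_inc_cost` follows the MBMC_inc definition above, while `restricted_cost` implements the conjectured iGTP criterion by restricting the species tree to each gene tree's taxa. Trees are nested tuples, leaves are treated as singleton clusters, and all function names are hypothetical.

```python
def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def cluster_pairs(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; parent_cluster is None at the root."""
    here = leafset(tree)
    yield here, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from cluster_pairs(child, here)


def w_B(B, gene_tree):
    """Modified weight: 0 if B is disjoint from the gene tree's leafset, else k_B - 1."""
    if not (B & leafset(gene_tree)):
        return 0
    k = sum(
        1
        for A, parent in cluster_pairs(gene_tree)
        if A <= B and (parent is None or not parent <= B)
    )
    return k - 1


def mbmc_inc_cost(species_tree, gene_trees):
    """Sum of w_B(t_i) over all gene trees and all clusters B of the species tree."""
    return sum(w_B(B, t) for t in gene_trees for B, _ in cluster_pairs(species_tree))


def restrict(tree, keep):
    """Subtree induced by the taxa in keep, suppressing nodes left with a single child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def restricted_cost(species_tree, gene_trees):
    """Conjectured alternative: score each gene tree against the induced species subtree."""
    return sum(mbmc_inc_cost(restrict(species_tree, leafset(t)), [t]) for t in gene_trees)


genes = [
    ((("a", "b"), "c"), "d"),
    (("b", "c"), ("d", "e")),
    (("a", "d"), ("b", "e")),
]
phylonet_tree = (("d", "e"), ("a", ("b", "c")))
igtp_tree = (("d", "e"), ("c", ("a", "b")))

print(mbmc_inc_cost(phylonet_tree, genes), mbmc_inc_cost(igtp_tree, genes))      # 3 3
print(restricted_cost(phylonet_tree, genes), restricted_cost(igtp_tree, genes))  # 3 2
```

Under these assumptions, the score of 2 reported by iGTP is reproduced only by the restricted criterion, whereas both species trees cost 3 under MBMC_inc; this is consistent with the conjecture but does not confirm how iGTP is implemented.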

3. Establishing the Relationship Between MDC_inc and MBMC_inc

In this section, we establish the relationship between the MDC_inc and MBMC_inc optimization problems. As a consequence of Theorem 1, it will follow that the exact formulation of Phylonet-MDC solves the MDC_inc problem optimally. That is, given an input of incomplete, binary rooted gene trees, to find an optimal species tree and completions of the gene trees it suffices to find a minimum weight clique containing n − 2 vertices (where n is the number of taxa) in the graph defined by Phylonet. This graph has one vertex for each possible cluster, an edge between two vertices if and only if their clusters are compatible (either disjoint or one contains the other), and weight ∑_i w_B(t_i) on the vertex for cluster B.

We begin with the following lemma.

Lemma 1

Let T and t be rooted binary trees with $\mathcal{L}(t) \subset \mathcal{L}(T)$, and let X be a maximal cluster in T with $X \cap \mathcal{L}(t) = \emptyset$. Let B₀ be the sibling cluster of X in T (i.e., X ∪ B₀ is the smallest cluster in T that properly contains X), and let A₀ be any B₀-maximal cluster in t. Let t′ be the rooted binary tree obtained by modifying t by inserting the clade for X as the sibling to the clade on A₀. Then for all clusters B of T, w_B(t) = w_B(t′).
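As a sanity check of the statement (not a substitute for the proof), the lemma can be tested on one small instance. The sketch below assumes nested-tuple trees with leaves treated as singleton clusters; X, B₀, and A₀ are chosen by hand for this example, and the helper names are hypothetical.

```python
def leafset(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leafset(child) for child in tree))


def cluster_pairs(tree, parent=None):
    """Yield (cluster, parent_cluster) for every node; parent_cluster is None at the root."""
    here = leafset(tree)
    yield here, parent
    if not isinstance(tree, str):
        for child in tree:
            yield from cluster_pairs(child, here)


def b_maximal(B, gene_tree):
    """List the B-maximal clusters of gene_tree."""
    return [
        A
        for A, parent in cluster_pairs(gene_tree)
        if A <= B and (parent is None or not parent <= B)
    ]


def w_B(B, gene_tree):
    """Modified weight: 0 if B is disjoint from the gene tree's leafset, else k_B - 1."""
    if not (B & leafset(gene_tree)):
        return 0
    return len(b_maximal(B, gene_tree)) - 1


def insert_sibling(tree, target, clade):
    """Copy of tree with clade attached as the sibling of the node whose cluster is target."""
    if leafset(tree) == target:
        return (tree, clade)
    if isinstance(tree, str):
        return tree
    return tuple(insert_sibling(child, target, clade) for child in tree)


T = (("d", "e"), ("a", ("b", "c")))  # species tree
t = (("a", "d"), ("b", "e"))         # gene tree missing taxon c
X_clade = "c"                        # X = {c} is disjoint from the leafset of t
B0 = frozenset("b")                  # sibling cluster of X in T
A0 = b_maximal(B0, t)[0]             # a B0-maximal cluster of t, here {b}

t_prime = insert_sibling(t, A0, X_clade)
print(t_prime)  # (('a', 'd'), (('b', 'c'), 'e'))
print(all(w_B(B, t) == w_B(B, t_prime) for B, _ in cluster_pairs(T)))  # True
```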

Proof

We consider the four cases that can occur in a species tree T in which B, B₀ and X are clusters:

• Case 1: B ⊆ X

• Case 2: B ⊆ B₀

• Case 3: B₀ ∪ X ⊆ B

• Case 4: (B₀ ∪ X) ∩ B = ∅

We take each case in turn.

Case 1: B ⊆ X. In this case, B is a cluster in the clade on X, and hence a cluster in t′. Therefore, w_B(t′) = 0. Since B ⊆ X, it follows that $B \cap \mathcal{L}(t) = \emptyset$, and so (by definition) w_B(t) = 0.

Case 2: B ⊆ B₀. First, if $B \cap \mathcal{L}(t) = \emptyset$, then $B \cap \mathcal{L}(t') = \emptyset$ and w_B(t) = w_B(t′) = 0. Hence, assume that $B \cap \mathcal{L}(t) \neq \emptyset$. We will show that A is a B-maximal cluster in t if and only if A is a B-maximal cluster in t′, and so w_B(t) = w_B(t′). Suppose A is B-maximal in t. Then A is a cluster of t and (since A ⊆ B ⊆ B₀) also a cluster of t′. Hence A will be B-maximal for t′ unless the parent cluster in t′ of A is a subset of B. Since A is B-maximal in t, the parent cluster of A in t is not a subset of B. Note that A's parent cluster in t′ is either the same cluster as in t, or else the parent cluster in t′ contains A₀ ∪ X; in either case, the parent cluster of A in t′ is not a subset of B. Therefore, A is also B-maximal in t′.

Conversely, suppose A is a B-maximal cluster in t′. Since A ⊆ B ⊆ B₀, A is a cluster in t. If A is B-maximal in t, then we are done. Else, suppose A is not B-maximal in t. Note that A cannot have the same parent cluster in t and t′, since otherwise A is also B-maximal in t (contradicting our hypothesis), and so A's parent cluster in t′ must contain A₀ ∪ X. Hence, A's parent cluster in t must be defined by an internal node on the path from the root of A₀ to the root of t. Label the nodes on that path $\mathrm{root}(A_0) = v_0, v_1, \ldots, v_t = \mathrm{root}(t)$, and let the “other” child of each $v_i$, $i = 1, 2, \ldots, t$, be w_i, defining cluster A_i. Note that A₁ is the sibling cluster to A₀ in t. Then A = A_i for some i. Note that if A = A₀, then A is B-maximal in t (since A₀ is B₀-maximal in t and B ⊆ B₀). Note also that A ≠ A₁, since otherwise A₁ is B-maximal, and so A₁ ⊆ B ⊆ B₀, contradicting that A₀ is B₀-maximal. Now suppose that A = A_i, for some i ≥ 2. Then the parent cluster of A in t contains X, and so is not a subset of B, establishing that A is B-maximal in t as well. Therefore, w_B(t) = w_B(t′).

Case 3: B₀ ∪ X ⊆ B. Our first observation is that A₀ is B-maximal in t if and only if A₀ ∪ X is B-maximal in t′. Hence, we need only concern ourselves with the B-maximal clusters in t other than A₀, and (equally) with the B-maximal clusters in t′ other than A₀ ∪ X. However, when A ≠ A₀, it is easy to see that A is a B-maximal cluster in t if and only if A is a B-maximal cluster in t′. Hence, w_B(t) = w_B(t′).

Case 4: (B₀ ∪ X) ∩ B = ∅. It is easy to see that for any cluster A, A is B-maximal in t if and only if A is B-maximal in t′, and so w_B(t) = w_B(t′).

The following lemma is obvious and the proof is omitted:

Lemma 2

Let t be an incomplete gene tree, T a species tree, and t′ a completion of t to the taxon set of T. Then w_B(t) ≤ w_B(t′) for all clusters B of T.
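Lemma 2 can be checked by hand on a small case. The following Python sketch assumes the reading of w_B(t) used in the case analysis above: w_B(t) is the number of B-maximal clusters of t minus one, and 0 when B shares no taxa with t. The nested-tuple tree encoding, the function names, and the five-taxon example are illustrative choices, not part of the study.

```python
def leaves(tree):
    """Leaf set of a rooted tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of the tree (leaf sets below each node, singletons included)."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clusters(child)
    return result


def count_maximal(tree, B):
    """Number of B-maximal clusters: clusters contained in B whose parent is not."""
    if leaves(tree) <= B:
        return 1          # this cluster lies inside B; nothing below it is maximal
    if isinstance(tree, str):
        return 0          # a leaf outside B
    return sum(count_maximal(child, B) for child in tree)


def w(B, tree):
    """Assumed definition: 0 if B misses the tree, else (#B-maximal clusters) - 1."""
    if not (B & leaves(tree)):
        return 0
    return count_maximal(tree, B) - 1


def extra_lineages(species_tree, gene_tree):
    """Sum of w_B(gene_tree) over all clusters B of the species tree."""
    return sum(w(B, gene_tree) for B in clusters(species_tree))


species_tree = ((("a", "b"), "c"), ("d", "e"))
incomplete = (("a", "c"), "d")                      # taxa b and e are missing
good_completion = ((("a", "b"), "c"), ("d", "e"))
bad_completion = ((("a", "c"), "d"), ("b", "e"))

print(extra_lineages(species_tree, incomplete))       # 0
print(extra_lineages(species_tree, good_completion))  # 0
print(extra_lineages(species_tree, bad_completion))   # 3
```

In this example the incomplete tree contributes no extra lineages, one completion keeps the total at 0, and the other raises it to 3. No completion lowers any w_B, as Lemma 2 states, and at least one completion leaves every w_B unchanged.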

Theorem 1

Let $\mathcal{T}$ be a set of incomplete, rooted, binary gene trees. If $T = MBMC_{inc}(\mathcal{T})$, then there exist extensions $t'_i$ for each t_i so that $MDC_{inc}(\mathcal{T}) = (T, \mathcal{T}')$, where $\mathcal{T}' = \{t'_1, t'_2, \ldots, t'_k\}$. Also, if $MDC_{inc}(\mathcal{T}) = (T, \mathcal{T}')$, then $T = MBMC_{inc}(\mathcal{T})$.

Proof

Let $t \in \mathcal{T}$ be given, and let $T = MBMC_{inc}(\mathcal{T})$. By Lemma 2, for any completion t′ of t and any cluster B of T, w_B(t′) ≥ w_B(t). By Lemma 1, there is a completion t′ of t that achieves w_B(t) = w_B(t′) for all clusters B of T. Since t was arbitrary, we can let $\mathcal{T}'$ denote the set containing one such completion t′ for each $t \in \mathcal{T}$, so that w_B(t) = w_B(t′) for all clusters B of T. Hence, the number of extra lineages in T with respect to $\mathcal{T}'$ is $\sum_B \sum_i w_B(t_i)$, where B ranges over the clusters of T and i ranges from 1 to k, where $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$. It follows, by Lemma 2, that T has the minimum number of extra lineages with respect to any set of completions of $\mathcal{T}$, and so $(T, \mathcal{T}')$ is a solution to $MDC_{inc}(\mathcal{T})$.

For the converse, let $(T, \mathcal{T}')$ be a solution to $MDC_{inc}(\mathcal{T})$, with $\mathcal{T}' = \{t'_1, t'_2, \ldots, t'_k\}$ (each $t'_i$ a completion of t_i). Then since $\mathcal{T}'$ is a set of rooted, binary, complete gene trees (i.e., all on the same set of taxa as T), it follows that $XL(T, \mathcal{T}') = \sum_i \sum_B w_B(t'_i)$, as B ranges over the clusters of T and i ranges from 1 to k, and that this is the minimum possible among all species trees T and sets $\mathcal{T}'$ of completions of the gene trees. Therefore, for all clusters B in T and for all i, $w_B(t_i) = w_B(t'_i)$, since otherwise we could complete t_i differently. Now suppose the tree T is not an optimal solution to $MBMC_{inc}(\mathcal{T})$. Then, for some other binary rooted species tree T* on the same set of taxa, $\sum_B \sum_i w_B(t_i) < XL(T, \mathcal{T}')$, where B ranges over the clusters of T*. But then there is a completion $\mathcal{T}^*$ of the gene trees in $\mathcal{T}$ so that $XL(T^*, \mathcal{T}^*) < XL(T, \mathcal{T}')$, contradicting our hypothesis.
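
Both directions of the proof rest on one chain of relations, which may be easier to follow when written out once. This is only a restatement in the notation above, for a fixed species tree T with clusters B and arbitrary completions $t'_i$ of the gene trees $t_i$:

```latex
XL(T, \mathcal{T}') \;=\; \sum_{B} \sum_{i=1}^{k} w_B(t'_i) \;\ge\; \sum_{B} \sum_{i=1}^{k} w_B(t_i)
```

The inequality is Lemma 2, and Lemma 1 supplies completions for which it holds with equality. Minimizing the right-hand side over T and minimizing the left-hand side over T and $\mathcal{T}'$ therefore select the same species trees.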

4. Methods

4.1. Overview

The simulation study used gene sequences that evolve down gene trees that can differ from the true species tree due to ILS. To produce these sequence datasets, we used sequences from previous studies, provided to us by the authors of those studies: the 11-taxon datasets from Chung and Ané (2011), the 17-taxon datasets from Than and Nakhleh (2009), and the 100-taxon datasets from Yang and Warnow (2011). We summarize the simulation protocols used in these studies here, and direct the reader to the relevant publications for the details of how the data were generated.

In each case, a model species tree was generated (typically using a birth-death process). Then a set of gene trees within each species tree was produced under a coalescent process, so that for each gene one individual was sampled for each species. This produces gene trees with branch lengths that can differ topologically from their associated species tree due to ILS. DNA sequences were then simulated down each gene tree. For the 11-taxon and 17-taxon datasets, these simulations were done under a substitution-only model, and for the 100-taxon datasets these simulations were done under GTR+Gamma+gap models with varying gap lengths; thus, the 100-taxon datasets evolved with indels as well as with substitutions. Many replicates were generated for each model condition, and each replicate consisted of true sequence alignments for each gene.

For each replicate dataset, we had the true alignment as well as the true tree. We then deleted taxa randomly, varying the number of taxa removed, from each gene sequence alignment, thus producing incomplete gene sequence alignments. On each resultant gene sequence alignment we estimated trees using FastTree-2 (Price et al., 2010); this produces a tree as well as branch support estimations. We produced a 75%-branch support version of each estimated gene tree by contracting all edges with support below 75%.
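
The contraction step described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the `Node` class, the convention that support is stored as a percentage on the node below each edge, and the toy four-taxon tree are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str = ""            # taxon label (used for leaves only)
    support: float = 100.0    # assumed: support, in percent, of the edge above this node
    children: List["Node"] = field(default_factory=list)


def contract(node: Node, threshold: float = 75.0) -> Node:
    """Return a copy of the tree in which internal edges below `threshold` are contracted."""
    kept = []
    for child in node.children:
        child = contract(child, threshold)
        if child.children and child.support < threshold:
            # Weakly supported internal edge: attach the grandchildren directly.
            kept.extend(child.children)
        else:
            kept.append(child)
    return Node(node.name, node.support, kept)


def newick(node: Node) -> str:
    """Topology-only Newick string, without the trailing semicolon."""
    if not node.children:
        return node.name
    return "(" + ",".join(newick(child) for child in node.children) + ")"


ab = Node(support=60.0, children=[Node("a"), Node("b")])
cd = Node(support=90.0, children=[Node("c"), Node("d")])
tree = Node(children=[ab, cd])

contracted = newick(contract(tree))
print(contracted)  # (a,b,(c,d))
```

Contracting the 60%-support edge turns the binary tree ((a,b),(c,d)) into the unresolved tree (a,b,(c,d)), which is the kind of non-binary input the 75%-support datasets contain.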

For each replicate of each model, we thus have three types of datasets (each consisting of a collection of gene sequence alignments and trees): the true gene sequence alignment, the binary trees estimated by FastTree-2 on the true gene sequence alignment, and the 75%-branch support FastTree-2 trees estimated on each true gene sequence alignment.

For each such dataset, we estimated species trees using the following techniques:
• iGTP v. 1.1. We explored all three optimization criteria (deep coalescence, duplications, duplication-loss) available in iGTP. We also ran iGTP on the 75%-support versions of the input binary trees, although iGTP is not guaranteed to give meaningful outputs for non-binary gene trees.

• Phylonet v. 2.4. We explored both the heuristic and exact versions of Phylonet for solving the MDC problem, on both binary and unresolved gene trees. However, the exact version can only be run on small datasets, and so we used it only on the 11-taxon datasets.

• Matrix Representation with Parsimony (MRP). We ran MRP heuristics on the FastTree-2 trees (both binary and 75%-support versions), using a Python script to run a parsimony ratchet analysis using PAUP, with 100 iterations, followed by taking the greedy consensus of the set of trees.

• *BEAST v. 1.6.2. We ran *BEAST on the true alignments for each dataset using its default settings.
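
To make the MRP input concrete, the sketch below builds the standard matrix representation of a set of rooted gene trees: each non-trivial cluster of a gene tree becomes one column, with 1 for taxa inside the cluster, 0 for taxa in that gene tree but outside the cluster, and ? for taxa missing from that gene tree. The nested-tuple encoding and function names are assumptions for illustration, and the all-zero outgroup row that rooted MRP analyses usually add is omitted. The parsimony search itself (run with PAUP in the study) is not reproduced here.

```python
def leaves(tree):
    """Leaf set of a rooted tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of the tree (leaf sets below each node, singletons included)."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clusters(child)
    return result


def mrp_matrix(gene_trees):
    """Map each taxon to its row of the MRP matrix, as a string over {0, 1, ?}."""
    taxa = sorted(set().union(*(leaves(tree) for tree in gene_trees)))
    rows = {taxon: "" for taxon in taxa}
    for tree in gene_trees:
        present = leaves(tree)
        # Singletons and the full leaf set carry no grouping information.
        informative = [c for c in clusters(tree) if 1 < len(c) < len(present)]
        for cluster in sorted(informative, key=sorted):
            for taxon in taxa:
                if taxon in cluster:
                    rows[taxon] += "1"
                elif taxon in present:
                    rows[taxon] += "0"
                else:
                    rows[taxon] += "?"   # taxon missing from this gene tree
    return rows


gene_trees = [(("a", "b"), "c"), (("a", "d"), "b")]
matrix = mrp_matrix(gene_trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

Because missing taxa are coded as ?, incomplete gene trees need no completion step before the matrix is built, which is one reason MRP applies directly to the datasets studied here.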

We recorded the average (over all replicates) missing branch rate and running time for each method. When computing the missing branch rate, we compare the estimated species tree to the subtree of the true species tree induced by those species present in at least one gene tree.
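
The error measure can be sketched as below. The sketch makes two assumptions that the text does not spell out: trees are compared as rooted trees through their non-trivial clusters, and the leaf set of the estimated tree is exactly the set of species present in at least one gene tree, so restricting the true tree to those leaves gives the induced subtree. All names and the example trees are illustrative.

```python
def leaves(tree):
    """Leaf set of a rooted tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of the tree (leaf sets below each node, singletons included)."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clusters(child)
    return result


def restrict(tree, keep):
    """Subtree induced by the taxa in `keep`, suppressing nodes left with one child."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def nontrivial(tree):
    """Clusters that correspond to internal branches."""
    n = len(leaves(tree))
    return {c for c in clusters(tree) if 1 < len(c) < n}


def missing_branch_rate(true_tree, estimated_tree):
    """Fraction of internal branches of the induced true tree absent from the estimate."""
    # Assumes the estimated tree shares at least one taxon with the true tree.
    induced = restrict(true_tree, leaves(estimated_tree))
    true_clusters = nontrivial(induced)
    if not true_clusters:
        return 0.0
    missing = true_clusters - nontrivial(estimated_tree)
    return len(missing) / len(true_clusters)


true_tree = ((("a", "b"), "c"), ("d", "e"))
estimate = (("a", "b"), ("c", "d"))       # taxon e appears in no gene tree

rate = missing_branch_rate(true_tree, estimate)
print(rate)  # 0.5
```

Here the induced true tree is (((a,b),c),d), with internal clusters {a,b} and {a,b,c}. The estimate recovers only {a,b}, so half of the true branches are missing.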

4.2. Datasets

We ran our experiments on datasets that evolve with ILS. We used 11-taxon datasets, each with 10 genes, obtained from Chung and Ané (2011). We also used 17-taxon datasets with 8 genes each, used previously in Than and Nakhleh (2009). Finally, we used 100-taxon datasets with 25 genes each, used in Yang and Warnow (2011).

5. Results

5.1. Missing branch rates

We begin by discussing performance with respect to missing branch rates.

5.1.1. Results on 11-taxon datasets

For these datasets, we were able to run the exact version of Phylonet-MDC, and hence solve the MDC problem exactly. We also ran the heuristic version of Phylonet-MDC, the three iGTP methods (for the MDC score, duplication score, and duplication plus losses score), and MRP. We explored results with two, three, and five missing taxa (Fig. 1).

FIG. 1.
Average missing branch rates of methods on 20 11-taxon 10-gene datasets on true alignments (TA). Gene trees are estimated using FastTree-2 (FT), and in some cases the branches with support less than 75% are contracted (FT-75). From top to bottom, the number of missing taxa is 2, 3, and 5.

The first observation is that *BEAST produced the most accurate species trees, for all percentages of missing taxa. The second best method varied depending on the percentage of missing taxa, with MRP on the 75%-support trees best for 20% missing taxa, Phylonet-exact on the 75%-support trees best for 30% missing taxa, and MRP best for 50% missing taxa. Thus, there was no clear second best method. Furthermore, although these three methods generally gave reasonably good results, they were not always among the next most accurate. Among the iGTP methods, iGTP-dup had the worst results, and iGTP-MDC and iGTP-duploss were sometimes reasonably accurate. A noteworthy trend was that Phylonet-heuristic gave the worst results at all percentages of missing taxa, whether applied to the fully resolved trees or the 75%-support trees. Finally, using the 75%-support trees instead of the fully resolved trees improved MRP and Phylonet (both exact and heuristic) for small numbers of missing taxa, but not when the number of missing taxa was large. Also, using the 75%-support trees did not help the other methods.

5.1.2. Results on 17-taxon datasets

Results on the 17-taxon datasets with 8 genes were similar to those on the 11-taxon datasets (Fig. 2), although the number of taxa prevented us from running Phylonet-exact. As before, *BEAST was the most accurate, for all percentages of missing taxa. The next best methods were MRP and iGTP-MDC (on either binary or 75%-support trees), and sometimes also iGTP-duploss on binary trees, but all had at least 7% higher missing branch rates than *BEAST. The worst results were obtained using Phylonet-heuristic and iGTP-dup on either the binary or 75%-support trees.

FIG. 2.
Average missing branch rates of methods on 20 17-taxon 8-gene datasets on true alignments (TA). Gene trees are estimated using FastTree-2 (FT), and in some cases the branches with support less than 75% are contracted (FT-75). From top to bottom, the number of missing taxa is 1, 5, and 8.

5.1.3. Results on 100-taxon datasets

We now describe results on the 100-taxon datasets. Because of the number of taxa, we ran neither *BEAST (running long enough to reach convergence was infeasible for this experimental study) nor Phylonet-exact. However, these data allow us to compare the other methods, Phylonet-heuristic, the three variants of iGTP, and MRP, on both binary and 75%-support trees (Fig. 3).

FIG. 3.
Average missing branch rates of methods on 10 100-taxon 25-gene datasets on true alignments (TA). Gene trees are estimated using FastTree-2 (FT), and in some cases the branches with support less than 75% are contracted (FT-75). From top to bottom, the number of missing taxa is 10, 30, and 50.

On the estimated gene trees, MRP on the 75%-support trees gives the most accurate trees, but MRP on binary trees comes quite close. The least accurate method is Phylonet-heuristic on binary trees, and Phylonet-heuristic on 75%-support trees is only slightly better (and much less accurate than all the other methods). A comparison among the iGTP methods no longer shows any reliable differences: for example, sometimes iGTP-MDC is the best and sometimes it is the worst of the three.

5.1.4. Overall results

For all levels of missing data, certain trends were clearly seen. Results for all methods improved when given more estimated gene trees rather than fewer; these trends are to be expected, and consistent with prior studies (Yang and Warnow, 2011). In addition, we saw that for each species tree estimation method, the missing branch rate increased with increased levels of taxon deletion, but the increase in error was particularly large for the heuristic version of Phylonet-MDC.

The relative performance of the methods showed clearly that when analyzing estimated gene trees, *BEAST produced the most accurate results (as indicated by the lowest missing branch rate). MRP, especially on the 75%-support trees, typically came in second or close to second, and when Phylonet-exact could be run, it gave results that were close to those of MRP. However, in all the experiments, Phylonet-heuristic gave the least accurate results or tied for last. Comparisons among the iGTP methods depended on the model condition, and no overall trends could be observed.

Using 75%-support trees had a variable effect on the different methods we explored. First, at low taxon deletion levels, Phylonet-MDC (in either the exact or heuristic version) and MRP were improved by using the 75%-support trees, but this changed for the highest level of taxon deletion. Why this happened is unclear, although it could be that the branches on the estimated gene trees had low support when estimated on very sparse taxon sets (such as would be obtained by deleting many taxa), leading to more loss of information when using the 75%-support trees. On the other hand, the iGTP methods did not show any advantage when used with 75%-support trees, and were often hurt.

5.2. Computational issues

We also evaluated the running time and memory usage of the different methods we studied. Phylonet-exact uses time that is exponential in the number of taxa, and so could only be run on the 11-taxon datasets; however, on these datasets it completed in less than 2 seconds. The next most expensive method is *BEAST, which must be run long enough to converge to the stationary distribution. Therefore, we only ran *BEAST on the 11-taxon and 17-taxon datasets. On average, *BEAST finished its analyses in 15 minutes on the 11-taxon datasets and 20 minutes on the 17-taxon datasets. The remaining methods were much faster: all finished in under a second on the 11-taxon and 17-taxon datasets, and in about a minute or less on the 100-taxon datasets. Some differences in running time were evident on the 100-taxon datasets, where Phylonet-heuristic finished in 6 seconds and MRP finished in 20 seconds, while the three iGTP methods took between 20 and 64 seconds. Peak memory usage differed among the methods, but only *BEAST used substantial memory, about 1 GB on the 17-taxon datasets.

6. Discussion

We begin with some observations about methods that attempt to optimize the MDC criterion. First, it is clear that iGTP-MDC generally gives more accurate trees than Phylonet-MDC run in its heuristic mode; however, when the exact version of Phylonet-MDC can be run, it produces more accurate trees than its heuristic version, and also more accurate trees than iGTP-MDC. This is likely due to the improved MDC scores produced by the exact version of Phylonet-MDC (which are mathematically guaranteed to be optimal), compared to the other methods. It is worth noting that the substantial reduction in topological accuracy (and MDC scores, results not shown) from using the heuristic version instead of the exact version of Phylonet-MDC is almost certainly a result of the fact that all the gene trees are incomplete, with randomly deleted taxa. This greatly impairs the ability of Phylonet-MDC's heuristic to score trees that are topologically similar to the true tree, since all the clades in any estimated tree must be drawn from the input gene tree clades in this case. However, the heuristic used in iGTP-MDC explicitly searches through treespace and so is not impaired in the same way. Given that previous research (Yang and Warnow, 2011) has shown very good trees resulting from Phylonet-MDC's heuristic version when the input gene trees are all complete, it seems likely that Phylonet-MDC might give better results when the taxon deletion is not random, or when at least some of the gene trees are based on complete taxon sets. Thus, although this study showed poor accuracy for Phylonet-MDC's heuristic, this trend may not hold under other circumstances, including those that might better represent systematic practice. Future work will investigate this possibility.

We also note that contracting low support branches in estimated gene trees typically (but not always) benefited Phylonet-MDC and MRP, but not the iGTP methods. This difference is likely due to differences in the treatment of unresolved gene trees within the iGTP, Phylonet-MDC, and MRP software. For example, it seems likely that iGTP-MDC and Phylonet-MDC do not score proposed species trees identically when the input gene trees are unresolved (Phylonet-MDC scores species trees with respect to optimal refinements of unresolved gene trees (Yu et al., 2011), a guarantee that may not be true of iGTP-MDC).

This study establishes that there is currently no computationally feasible solution for estimating highly accurate species trees from incomplete gene trees for large numbers of taxa. That is, only *BEAST was able to produce highly accurate species trees; all other methods had much higher error rates. Therefore, for numbers of taxa small enough that *BEAST can be run properly without huge running times, very accurate species trees can be computed. Although this study did not investigate the feasibility of running *BEAST on larger datasets, other studies with Bayesian methods have shown that proper analyses of datasets (even small ones) can take weeks of analysis to reach convergence (Yang and Warnow, 2011). Therefore, the poor results of the other methods on larger datasets suggest that highly accurate species tree estimations from incomplete gene trees and alignments may be beyond what current methods can achieve.

This study also suggests some limitations to analyses based on MDC. Unsurprisingly, we saw that optimizing MDC generally gave better results than optimizing duplications or duplications and losses. We also observed (Yang and Warnow, 2011) that optimizing the total number of duplications and losses produced more accurate trees than optimizing duplications alone.

Finally, and perhaps most interestingly, we noted that optimizing MDC produced generally less accurate trees than optimizing MRP. This is a very interesting result, given that MRP is agnostic about the cause of incongruence between gene trees, and MDC explicitly addresses ILS as the cause for incongruence. However, there is no mathematical explanation for why MRP would perform well, and so this remains only an empirical observation.

This study shows that the standard heuristics (the parsimony ratchet as implemented in PAUP) for the supertree method MRP produce highly accurate species tree estimations, even though MRP does not consider ILS, and can do so reasonably quickly, even on large datasets. These observations, combined with the observation that none of the methods we studied (other than *BEAST) that explicitly take into account events such as ILS or duplication and loss produced trees as accurate as MRP, suggest that optimizing MRP may be a reasonable approach to species tree estimation for large datasets, when statistical methods (such as *BEAST) cannot be run for computational reasons. Therefore, other supertree methods, such as SuperFine (Swenson et al., 2011a; Neves et al., 2012; Nguyen et al., 2012), a new supertree method that has been shown to produce better MRP scores and more accurate trees than standard MRP heuristics (while also being faster than standard MRP heuristics), should also be investigated. Finally, for complete gene trees, the greedy consensus produced highly accurate species trees, despite not being statistically consistent (at least when used on rooted gene trees, as shown in Degnan et al. [2009]). Given this observation, other consensus methods (Bryant, 1991; Kannan et al., 1995; Phillips and Warnow, 1996) might also be useful for estimating species trees for large numbers of taxa.

7. Conclusion

Species tree estimation from incomplete gene trees that can differ from the true species tree due to ILS presents many interesting theoretical and empirical challenges: excellent results can be obtained using *BEAST, a statistical approach that explicitly models the processes that cause incongruence between gene trees and species trees, but *BEAST is too computationally intensive for even moderately large datasets. In contrast, a very simple supertree method, MRP, is able to provide reasonably good results on very large datasets, even though it does not provide statistical guarantees. Thus, while it seems likely that methods based on sound statistical models will produce the most accurate species trees, the current methods that can analyze incomplete gene trees are either limited to small datasets, or are not based upon statistical models of gene tree incongruence. Given the increased use of multi-marker analyses for species tree estimation, methods that are statistically based, can run on large datasets, and can analyze incomplete gene datasets are likely to have high impact. Future work will hopefully produce methods that are scalable and statistically based, and that produce highly accurate trees on datasets with large, incomplete gene trees.

Authors' Contributions

T.W. designed the study; S.B. performed the study; S.B. and T.W. proved the theoretical results, analyzed the data, and wrote the article.

Acknowledgments

We thank the anonymous reviewer for helpful suggestions. This research was supported by the U.S. National Science Foundation (DEB 0733029 and DBI-1062335), the John P. Simon Guggenheim Memorial Foundation Fellowship to T.W., a David Bruton Jr. Centennial Professorship to T.W., the Fundação para a Ciência e a Tecnologia (Portugal), and a 2010 Fulbright International Science and Technology Ph.D. Award to M.S.B.

Disclosure Statement

No competing financial interests exist.

References

Baum B.R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon, 41:3–10.

Bryant. 1991. A Classification of Consensus Methods for Phylogenetics. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. Bioconsensus: DIMACS working group meetings on Bioconsensus, Oct. 25–26, 2003. AMS, 61:163–184.

Chaudhary, Bansal M.S., Wehe, et al. 2010. iGTP: a software package for large-scale gene tree parsimony analysis. BMC Bioinform., 11:574.

Chung, Ané. 2011. Comparing two Bayesian methods for gene tree/species tree reconstruction: a simulation with incomplete lineage sorting and horizontal gene transfer. Syst. Biol., 60:261–275.

Cryan, Goldberg. 1998. Evolutionary trees can be learned in polynomial time in the two-state general Markov model. Proc. IEEE Symp. Found. Comput. Sci. (FOCS98), 436–445.

Csürős, Kao M-Y. 1999. Recovering evolutionary trees through harmonic greedy triplets. Proc. 10th Annu. ACM/SIAM Symp. Discr. Algorithms (SODA99), 261–270.

Daskalakis, Mossel, Roch. 2011. Phylogenies without branch bounds: contracting the short, pruning the deep. SIAM J. Discr. Math., 25:872–893.

Degnan J.H., Rosenberg N.A. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol., 24:332–340.

Degnan J.H., DeGiorgio, Bryant, et al. 2009. Properties of consensus methods for inferring species trees from gene trees. Syst. Biol., 58:35–54.

Edwards S.V. 2009. Is a new and general theory of molecular systematics emerging? Evolution, 63:1–19.

Erdos P.L., Steel M.A., Székely, et al. 1999a. A few logs suffice to build (almost) all trees (i). Random Struct. Algorithms, 14:153–184.

Erdos P.L., Steel M.A., Székely, et al. 1999b. A few logs suffice to build (almost) all trees (ii). Theor. Comput. Sci., 221:77–118.

Gronau, Moran, Snir. 2011. Fast and reliable reconstruction of phylogenetic trees with indistinguishable edges. Random Struct. Algorithms, 40:350–384.

Guindon, Dufayard J.F., Lefort, et al. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol., 59:307–321.

Heled, Drummond A.J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol., 27:570–580.

Huson, Nettles, Warnow. 1999. Disk-covering, a fast converging method for phylogenetic tree reconstruction. J. Comput. Biol., 6:369–386.

Kannan S.K., Warnow, Yooseph. 1995. Computing the local consensus of trees. Proc. 6th Annu. ACM-SIAM Symp. Discr. Algorithms, 68–77.

Katoh, Kuma, Miyata, et al. 2005a. Improvement in the accuracy of multiple sequence alignment MAFFT. Genome Inform., 16:22–33.

Katoh, Kuma, Toh, et al. 2005b. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res., 33:511–518.

Kupczok, Schmidt, von Haeseler. 2010. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms Mol. Biol., 5:37–53.

Larget, Kotha S.K., Dewey C.N., et al. 2010. BUCKy: Gene tree/species tree reconciliation with the Bayesian concordance analysis. Bioinformatics, 26:2910–2911.

Lehtonen. 2008. Phylogeny estimation and alignment via POY versus Clustal+PAUP*: a response to Ogden and Rosenberg. Syst. Biol., 57:653–657.

Liu, Warnow. 2012. Treelength optimization for phylogeny estimation. PLoS ONE, 7:e33104.

Liu, Nelesen, Raghavan, et al. 2009a. Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy. IEEE Trans. Comput. Biol. Bioinform., 6:7–21.

Liu, Raghavan, Nelesen, et al. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324:1561–1564.

Liu, Linder C.R., Warnow. 2010. Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS Curr. Tree of Life. Available at: http://knol.google.com/k/kevin-liu/multiple-sequence-alignment-a-major/ectabesw3uba/9. Accessed April 1, 2012.

Liu, Linder C.R., Warnow. 2011a. RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE, 6:e27731.

Liu, Warnow T.J., Holder M.T., et al. 2011b. SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst. Biol., 61:90–106.

Loytynoja, Goldman. 2005. An algorithm for progressive multiple alignment of sequences with insertions. Proc. Nat. Acad. Sci., 102:10557–10562.

Loytynoja, Goldman. 2009. Uniting alignments and trees. Science, 324:1528–1529.

Lunter, Drummond A.J., Miklós, et al. 2005. Statistical alignment: recent progress, new applications, and challenges, 375–406. In Nielsen, ed. Statistical Methods in Molecular Evolution (Statistics for Biology and Health). Springer: Berlin.

Maddison W.P. 1997. Gene trees in species trees. Syst. Biol., 46:523–536.

Mossel. 2010. Toward extracting all phylogenetic information from matrices of evolutionary distances. Science, 327:1376–1379.

Nakhleh, Roshan, St. John, et al. 2001a. Designing fast converging phylogenetic methods. Bioinformatics, 17:190–198.

Nakhleh, Roshan, St. John, et al. 2001b. The performance of phylogenetic methods on trees of bounded diameter. Lect. Notes Comput. Sci., 2149:189–203.

Nakhleh, Moret B.M.E., Roshan, et al. 2002. The accuracy of fast phylogenetic methods for large datasets. Proc. 7th Pac. Symp. Biocomput. (PSB02), 211–222.

Neves D.T., Warnow, Sobral, et al. 2012. Parallelizing SuperFine. 27th Symp. Appl. Comput. (ACM-SAC).

Nguyen N-P., Mirarab, Warnow. 2012. MRL and SuperFine+MRL: new supertree methods. Algorithms Mol. Biol., 7.

Ogden T.H., Rosenberg. 2007. Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW + PAUP*. Syst. Biol., 56:182–193.

Phillips C.A., Warnow. 1996. The asymmetric median tree: a new model for building consensus trees. Discr. Appl. Math., 71:311–335.

Price M.N., Dehal P.S., Arkin A.P. 2010. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS ONE, 5:e9490.

Ragan M.A. 1992. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol., 1:53–58.

Redelings, Suchard. 2005. Joint Bayesian estimation of alignment and phylogeny. Syst. Biol., 54:401–418.

Ronquist, Huelsenbeck J.P. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19:1572–1574.

Stamatakis. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22:2688–2690.

Steel M.A., Székely L.A. 1999. Inverting random functions. Ann. Comb., 3:103–113.

Steel M.A., Székely L.A. 2002. Inverting random functions. II. Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discr. Math., 15:562–575.

Swenson M.S., Barbançon, Linder C.R., et al. 2010. A simulation study comparing supertree and combined analysis methods using SMIDGen. Algorithms Mol. Biol., 5:8.

Swenson M.S., Suri, Linder C.R., et al. 2011a. SuperFine: fast and accurate supertree estimation. Syst. Biol. Available at: http://sysbio.oxfordjournals.org/content/early/2011/09/16/sysbio.syr092.abstract. Accessed April 1, 2012.

Swenson M.S., Suri, Linder C.R., et al. 2011b. An experimental study of Quartets MaxCut and other supertree methods. Algorithms Mol. Biol., 6:7.

Swofford D.L. 1996. PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods), version 4.0. Sinauer Associates: Sunderland, MA.

Than C.V., Nakhleh. 2009. Species tree inference by minimizing deep coalescences. PLoS Comput. Biol., 5:e1000501.

Than C.V., Rosenberg N.A. 2011. Consistency properties of species tree inference by minimizing deep coalescences. J. Comput. Biol., 18:1–15.

Than C.V., Ruths, Nakhleh. 2008. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinform., 9:322.

Varón, Vinh L.S., Bomash, et al. 2007. POY software. Documentation. Available at: http://research.amnh.org/scicomp/projects/poy.php. Accessed April 1, 2012.

Wang L-S., Leebens-Mack, Wall P.K., et al. 2011. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE Trans. Comput. Biol. Bioinform. (TCBB), 1108–1119.

Warnow, Moret B.M.E., St. John. 2001. Absolute convergence: true trees from short sequences. Proc. 12th Annu. ACM/SIAM Symp. Discr. Algorithms (SODA01), 186–195.

Wheeler, Kececioglu. 2007. Multiple alignment by aligning alignments. Proc. 15th ISCB Conf. Intell. Sys. Mol. Biol., 559–568.

Yang, Warnow. 2011. Fast and accurate methods for phylogenomic analyses. BMC Bioinform., 12.

, Warnow, Nakhleh. 2011. Algorithms for MDC-based multi-locus phylogeny inference: beyond rooted binary gene trees on single alleles. J. Comput. Biol., 18:1543–1559.

Zwickl D.J. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion [Ph.D. dissertation]. University of Texas at Austin.
