Abstract
Abstract
Lateral gene transfer (LGT) is a common mechanism of nonvertical evolution, during which genetic material is transferred between two more or less distantly related organisms. It is particularly common in bacteria where it contributes to adaptive evolution with important medical implications. In evolutionary studies, LGT has been shown to create widespread discordance between gene trees as genomes become mosaics of gene histories. In particular, the Tree of Life has been questioned as an appropriate representation of bacterial evolutionary history. Nevertheless a common hypothesis is that prokaryotic evolution is primarily treelike, but that the underlying trend is obscured by LGT. Extensive empirical work has sought to extract a common treelike signal from conflicting gene trees. Here we give a probabilistic perspective on the problem of recovering the treelike trend despite LGT. Under a model of randomly distributed LGT, we show that the species phylogeny can be reconstructed even in the presence of surprisingly many (almost linear number of) LGT events per gene tree. Our results, which are optimal up to logarithmic factors, are based on the analysis of a robust, computationally efficient reconstruction method and provides insight into the design of such methods. Finally, we show that our results have implications for the discovery of highways of gene sharing.
1. Introduction
Here we study specifically lateral gene transfer (LGT), that is, the nonvertical transfer of genes between more or less distantly related organisms (as opposed to the standard vertical transmission between parent and offspring). Estimates of the fraction of genes that have undergone LGT vary widely—with some as high as 99% (see, e.g., Dagan and Martin, 2006; Galtier and Daubin, 2008; and references therein). LGT is particularly common in bacterial evolution and it has been recognized to play an important role in microbial adaptation, selection, and evolution, with implications in the study of infectious diseases (Smets and Barkay, 2005). As a result, the bacterial phylogeny is usually inferred from genes that are thought to be immune to LGT, typically ribosomal RNA genes. However, there is growing evidence that even such genes have in fact experienced LGT (Yap et al., 1999; van Berkum et al., 2003; Schouls et al., 2003; Dewhirst et al., 2005). In any case, LGT appears to be a major source of conflict between gene trees that must be taken into account appropriately in phylogenomic analyses, in particular when building phylogenies. This is the problem we address in this article.
Despite the confounding effect of LGT, we operate under the prevailing assumption that the evolution of organisms is governed primarily by vertical inheritance. In particular, we ask the following:
1. How much genetic transfer can be handled before the treelike signal is completely erased? 2. What phylogenetic reconstruction methods are most effective under this hypothesis?
These questions, and other related issues, have been the subject of some empirical and simulation-based work (Beiko et al., 2005; Ge et al., 2005; Galtier, 2007; Puigbo et al., 2009, 2010; Koonin et al., 2011). See also Galtier and Daubin (2008) and Ragan and Beiko (2009) for enlightening discussions. In particular, there is ample evidence that a strong treelike signal can be extracted in the presence of extensive LGT [although some debate remains on this question (Gogarten et al., 2002)].
In this article, we provide the first (to our knowledge) mathematical analysis of the issues above. We work under a stochastic model of gene-tree topologies positing that LGT events occur at more or less random locations on the species phylogeny (Galtier, 2007). In our main result, we establish quantitative bounds implying that surprisingly high levels of LGT—almost linear in the number of branches for each gene—can be handled by simple, computationally efficient inference procedures. That amount of genetic transfer appears to be much higher than known empirical estimates of LGT frequency based on genomic datasets in prokaryotes. 1 Hence, our results indicate that an accurate, reliable bacterial phylogeny should be reconstructible if the vertical inheritance hypothesis is correct. We prove that our bound on the achievable rate of LGT is tight up to logarithmic factors. We also show that constraining LGT to closely related species makes the tree reconstruction problem significantly easier.
Our theoretical approach complements simulation-based studies by allowing a broad range of parameters and tree shapes to be considered. Moreover, our analysis provides new insights into the design of effective reconstruction methods in the presence of LGT. More precisely, we focus on methodologies—both distance-based (Kim and Salisbury, 2001) and quartet-based (Zhaxybayeva et al., 2006)—that derive their statistical power from the aggregation of basic topological information across genes.
In addition, we study the effect of so-called highways of gene sharing; roughly, preferred genetic exchanges between specific groups of species. Beiko et al. (2005) provided empirical evidence for the existence of such highways. To identify highways, they inferred LGT events by reconciling gene trees with a trusted species tree. In subsequent work, Bansal et al. (2011) formalized the problem and designed a fast highway detection algorithm that aggregates conflicting signal across genes rather than solving the difficult LGT inference problem on each gene tree. Similarly to Beiko et al. (2005), Bansal et al. (2011) rely on a trusted species tree.
Here we show that a species phylogeny can be reliably estimated in the presence of both random LGT events and highways of LGT, as long as such highways involve a small enough fraction of genes. Under extra assumptions, we also design an algorithm for inferring the location of highways. Because we first recover the species phylogeny, our highway reconstruction algorithm does not require a trusted species tree. In essence, our results on highways indicate that robust phylogeny reconstruction in the presence of random LGT extends to a phylogenetic network setting. For background on phylogenetic networks, see, for example, Huson et al. (2010).
We note that there exist related lines of work in phylogenomics, addressing the issue of incomplete lineage sorting (Degnan and Rosenberg, 2009) in the presence of gene transfers and hybridization events (Than et al., 2007; Joly et al., 2009; Kubatko, 2009; Meng and Kubatko, 2009; Chung and Ane, 2011; Yu et al., 2011) as well as work on probabilistic models involving gene duplications and losses (Arvestad et al., 2009; Csürös and Miklós, 2006).
The rest of the article is organized as follows. In Section 2, we define a stochastic model of LGT and state our main results. A high-level description of our analysis is given in Section 3. Finally, in Section 4, we extend our results to highways of gene sharing. [The results presented here were announced without proof in Roch and Snir (2012).]
2. Model and Main Results
Before stating our main results, we present a stochastic model of LGT. Roughly, following Galtier (2007), we assume that LGT events occur more or less at random along the species phylogeny. Such a model appears to be consistent with empirical evidence (Galtier and Daubin, 2008).
2.1. Stochastic model of LGT
Gene trees and species phylogeny
A species phylogeny (or phylogeny for short) is a graphical representation of the speciation history of a group of organisms. The leaves correspond to extant or extinct species. Each branching indicates a speciation event. Moreover, we associate to each edge a positive value corresponding to the time elapsed along that edge. For a tree
Definition 1 (Phylogeny)
A (species) phylogeny Ts=(Vs, Es, Ls; r, τ) is a rooted tree with vertex set Vs, edge set E
s
and n (labeled) leaves
Although we are ultimately interested in recovering the extant phylogeny, we include extinct species in the model as they can be involved in LGT events that affect the extant restriction of the tree (see, for example, Maddison, 1997).
To infer the species phylogeny, we first reconstruct gene trees, that is, trees of ancestor-descendant relationships for orthologous genes or loci. Phylogenomic studies have revealed extensive discordance between such gene trees (e.g., Bapteste et al., 2005; Doolittle and Bapteste, 2007).
Definition 2 (Gene tree)
A gene tree Tg=(Vg, Eg, Lg; ωg) for gene g is an unrooted tree with vertex set Vg, edge set Eg and 0 < ng ≤ n (labeled) leaves
Remark 1 (Gene trees vs. species phylogeny)
As we will discuss below, gene trees are derived from— or “evolve” on—the species phlyogeny. They may differ from the species phylogeny for various reasons. First, in our model, their branch lengths represent expected numbers of substitutions, instead of time elapsed. Moreover, their topology may differ as a result, in our case, of LGT events. See more details below.
Remark 2 (Rooted vs. unrooted)
Our stochastic model of LGT requires a rooted species phylogeny as time plays an important role in constraining valid LGT events (see, e.g., Jin et al., 2009). In particular, our results rely on the ultrametricity property of the extant phylogeny. In contrast, branch lengths in gene trees correspond to expected numbers of substitutions. As a result, gene trees are typically unrooted and do not satisfy ultrametricity.
Remark 3 (Taxon sampling)
Each leaf in a gene tree corresponds to an extant species in the species phylogeny. However, because of gene loss and taxon sampling, a taxon may not be represented in every gene tree.
Remark 4 (Branch lengths)
Each branch e in a gene tree Tg corresponds to a full or partial edge in the species phylogeny Ts. In particular, we allow internal vertices of degree 2 in a gene tree to potentially delineate between two consecutive species edges. We allow the branch lengths ωg(e) to be arbitrary, but one could easily consider cases where the branch lengths are determined by interspeciation times, lineage-specific rates of substitution, and gene-specific rates of substitution. The branch lengths will play a role in Section 5.
Random LGT
We formalize a stochastic model of LGT similar to Galtier's (Galtier, 2007). See also Kim and Salisbury (2001); Suchard (2005); and Jin et al. (2006) for related models. The model accounts for LGT events originating at random locations on the species phylogeny with LGT rate λ(e) prevailing along edge e.
We will need the following notation. Let Ts=(Vs, Es, Ls; r, τ) be a fixed species phylogeny. By a location in Ts, we mean any position along Ts seen as a continuous object (also called
For R > 0, we let
be the set of locations contemporaneous to x at τ-distance at most 2R from x (or in other words, with MRCA at τ-distance at most R). In particular,
be the total LGT weight of the phylogeny and
be the total LGT weight of the extant phylogeny, where
Our model of LGT is the following. Note first that, from a topological point of view, an LGT transfer is equivalent to a subtree-prune-and-regraft (SPR) operation (Semple and Steel, 2003). The recipient location—that is, the location receiving the genetic transfer—is the point of pruning. Similarly, the donor location is the point of regrafting. In other words, on the gene tree, a new internal node is created at the donor location with two children nodes, one being the original endpoint of the corresponding edge and the other being the node immediately under the recipient location in the species phylogeny. The original edge going to the latter node is removed (Fig. 1).

An LGT event. On the left, the species phylogeny is shown with the donor (D) and recipient (R) locations. On the right, the resulting (unweighted) gene tree is shown after the LGT transfer.
Definition 3 (Random LGT)
Let 0 < R ≤ +∞ possibly depending on n (i.e., not necessarily a constant), and note that we explicitly allow R=+∞. Let Ts=(Vs, Es, Ls; r, τ) be a fixed species phylogeny. Let 0 < p ≤ 1 be a sampling effort probability. A gene tree topology 1. • Recipient locations. Starting from the root, along each branch e of Ts, locations are selected as recipient of a genetic tranfer according to a continuous-time Poisson process with rate λ(e). Equivalently, the total number of LGT events is Poisson with mean • Donor locations. If x is selected as a recipient location, the corresponding donor location y is chosen uniformly at random in 2.
The resulting (random) gene tree topology is denoted by
When R < +∞, a transfer can only occur between sufficiently closely related species. One could also consider more general donor location distributions (see e.g., Puigbo et al., 2010). In Section 4, we consider a different form of preferential exchange, highways of gene sharing.
2.2. Recovering the treelike trend: Main results
Problem statement
Let Ts=(Vs, Es, Ls; r, τ) be an unknown species phylogeny. Using homologous gene sequences for every gene at hand, we generate N independent gene tree topologies
In practice, one estimates gene trees from sequence data. We come back to gene tree estimation issues below.
Bounded-rates model
The following assumption was introduced in Daskalakis and Roch (2010) and is related to a common assumption in the mathematical phylogenetics literature.
Definition 4 (Bounded-rates model)
Let 0 < ρλ < 1 and 0 < ρτ < 1 be constants. Let further
and
Our result in this case follows. We use
Theorem 1 (Main result: Bounded-rates model, R=+∞)
Let R=+∞. Under the Bounded-rates model, it is possible to reconstruct the topology of the extant phylogeny with high probability (w.h.p.) from N=Ω(log n+) gene tree topologies if
In words, we can reconstruct the species phylogeny w.h.p. as long as the expected number of LGT events
We also show that the bound on
Theorem 2 (Non-recoverability)
Under the bounded-rates model, as above, with N=O(log n+), the topology of the extant phylogeny cannot, in general, be reconstructed w.h.p. if
More generally, the species phylogeny cannot be reconstructed from N genes if
for some
An advantage of the Yule model is that, unlike the bounded-rates model, it does not place arbitrary constraints on the interspeciation times. In particular, the following analog of Theorem 1 suggests that our analysis does not rely on such constraints.
Theorem 3 (Main result: Yule process, R=+∞)
Let R=+∞. Under the Yule model, the following holds with probability arbitrarily close to 1. It is possible to reconstruct the topology of the extant phylogeny w.h.p. from N=Ω(log n) gene tree topologies if
Theorem 4 (Preferential LGT)
Let 0 < R < log n+ possibly depending on n+. Under the bounded-rates model, it is possible to reconstruct the topology of the extant phylogeny w.h.p. from N=Ω(log n+) gene tree topologies if
A similar result holds under the Yule model.
Further results
We also obtain results on highways of LGT as well as sequence-length requirements. These results require additional background. See Sections 4 and 5 respectively.
3. Probabilistic Analysis
We assume that we are given N independent gene tree topologies
Different algorithms are possible. A simple approach is to take a majority vote over all gene-tree topologies. But this approach is problematic under taxon sampling and cannot sustain the high levels of LGT we consider below.
Instead, we consider approaches that aggregate partial information over all gene trees. We focus on subtrees over four taxa whose topologies are called quartets (Semple and Steel, 2003). We show that computationally efficient quartet-based approaches can sustain high levels of LGT. Although we prove our results for the specific method described below, our analysis is likely to apply to related methods. In Section 5.1, we also give a similar analysis for a distance-based method of Kim and Salisbury (2001).
3.1. Algorithm
We consider the following approach related to an algorithm of Zhaxybayeva et al. (2006). Let X={a, b, c, d} be a four-tuple of extant species. The topology
and similarly for q2, q3. (We set the frequency to 0 if the denominator is 0.) For each X, we choose the quartet with highest frequency (breaking ties arbitrarily).
Definition 5
A set of quartets Q={qi}, with Lqi the leaf set of qi, is compatible if there is a tree
Quartet compatibility is, in general, NP-hard (Steel, 1992). However, when the set Q covers all possible four-tuple of taxa (that is, exactly
The algorithm, which we call QuartetPlurality (QP), is detailed in Figure 2.

Algorithm QuartetPlurality.
3.2. A general formula
Our asymptotic analysis is based on the following claim. Recall that, for a subset of extant species X, we let
be the total weight of the subtree
Lemma 1 (Probability of a miss)
Let
Recall that
Proof (Lemma 1)
We first note that, by our assumption that the species phylogeny is bifurcating,
Now we observe that if none of the recipient locations lands on
Hence the probability that
3.3. Bounded-rates and Yule models
Next we argue that, under appropriate assumptions on the species phylogeny, the maximum quartet weight is bounded in such a way that the plurality quartet topology for every four-tuple of taxa X={a, b, c, d}, which we denote by
3.3.1. Bounded-rates model
We bound the maximum quartet weight ϒ(4) in the bounded-rates model.
Lemma 2 (Bound on quartet weight: Bounded-rates case)
Under the Bounded-rates model, it holds that
Proof (Lemma 2)
The first part of the proof is taken from Daskalakis and Roch (2010). Let h (respectively H) be the smallest (respectively largest) number of edges on a path between the root and an extant leaf. Because the number of extant leaves is n+ and the extant phylogeny is bifurcating (recall that we suppressed vertices of degree 2 after taking a restriction to the extant species), we must have 2
h
≤ n+ and 2
H
≥ n+. Since all extant leaves are contemporaneous it must be that
Hence
The total number of edges in the extant phylogeny is 2n+ − 3 so that
Using Lemma 2, we prove Theorem 1. First recall the following standard concentration inequality (see, e.g., Motwani and Raghavan, 1995):
Lemma 3 (Azuma-Hoeffding inequality)
Suppose
Proof (Theorem 1)
Consider the quartet-based approach described in Section 3.1. Take
and using Lemmas 1 and 2, we have for any four-tuple X of extant species
and
for C1 small enough. We choose C2 > 0 large enough with
and ɛ < p4 so that, using Lemma 3, the following inequalities hold. Consider the following events
and
By Lemma 3,
and
Hence, for a constant C2 large enough,
Then the plurality vote is correct for every four-tuple of taxa, and the extant phylogeny is correctly reconstructed. ■
3.3.2. Yule process
We now consider the Yule model.
Lemma 4 (Bound on quartet weight: Yule case)
Under the Yule model, it holds that
with probability approaching 1 as n → +∞.
Proof (Lemma 4)
We consider a pure-birth process with birth rate ν starting from two species. For background on branching processes, see Athreya and Ney (1972).
Let Zi be the (i − 1)-th interspeciation time. As a minimum of i independent exponential distributions with mean 1/ν, Zi is an exponential with mean 1/ (iν). Moreover, the Zis are independent. Hence the height of the phylogeny in time units, that is, the total time until n + 1 species are present [recall that we ignore the (n + 1)-st species] is
and we have
and
The total weight of the phylogeny in time units
is a sum of n independent exponential random variables with parameter ν, and we have
and
By Chebyshev's inequality,
and
for appropriately chosen Cs not depending on n. The same holds in the other direction so that
Proof (Theorem 3)
Using Lemma 4, the proof of Theorem 3 follows from the same lines as that of Theorem 1. ■
3.4. Preferential LGT
We now prove Theorem 4.
Proof (Theorem 4)
The proof is similar to that of Theorems 1 and 3. The main difference is in the proof of Lemma 1. In that proof, note that if R < +∞, then for an LGT to affect the quartet on X, it must be that not only 1) the recipient location lands on
The result then follows. ■
3.5. Nonrecoverability
We now prove Theorem 2.
Proof (Theorem 2)
We use a coupling argument (Lindvall, 1992). Fix δ > 0 small. We construct two species phylogenies with different topologies that cannot be distinguished with probability 1−δ from N gene tree topologies when the total expected amount of LGT
We proceed as follows. Consider a complete binary tree
We generate N=Θ(log n) gene trees on each of
Since
For

Good event.
We claim that the probability that the good event does not occur is O(1/ log n). Under the assumption that Λ=Ω(n log log n) and that the LGT weights are equal, the number of LGT events on any edge is Poisson with mean Ω(log log n). Consider the time interval between the nodes at height 1 from the root and the nodes at height 2. Divide this interval into ν=O(log log n) equal subintervals
The subintervals are independent. The probability that the event above does not happen in any of
This gives an upper bound of O(1/ log n) on the probability that the good event does not happen.
Therefore, by a union bound over the genes, the probability that the good event does not occur on at least one gene tree is Θ(log n) · O(1/ log n)=O(1), which is at most δ if the constant in Λ is large enough. If the good event occurs on every gene tree, then both phylogenies output the exact same set of gene tree topologies. That concludes the proof. ■
4. Highways of LGT
In this section, we add highways of gene sharing to the model. Highways are, in essence, nonrandom patterns of LGT (Beiko et al., 2005). These can potentially take different shapes. Following Bansal et al. (2011), we focus on pairs of edges in the phylogeny that undergo an unusually large number of LGT events between them.
We give two results. As long as the frequency of genes affected by highways is low enough, the species phylogeny can be reconstructed using the same approach as in Section 3. Moreover, with extra assumptions on the positions of the highways with respect to each other, the highways themselves can be inferred.
In this section, we assume n−=0.
4.1. Model
We generalize our model of LGT as follows.
Definition 6 (Highways of LGT)
Let Ts=(Vs, Es, Ls; r, τ) be a species phylogeny with LGT rates 0 < λ(e) < +∞,
Remark 5 (Deterministic setting)
Note that the highways and which genes are involved in them are deterministic in this setting. Only the random LGT events are governed by a stochastic process. Note moreover that we allow highway events to go in either direction, that is, from
4.2. Building the species tree in the presence of highways
We first prove that the species phylogeny can still be reconstructed in the presence of highways as long as the fraction of genes involved in highways is low enough. We only discuss the Bounded-rates model with R=+∞.
Theorem 5 (Highways of LGT)
Consider the bounded-rates model with R=+∞and assume that B < +∞is constant. Assume further that there is a constant
If
then it is possible to reconstruct the topology of the extant phylogeny w.h.p. from N=Ω(log n+) gene-tree topologies if
Proof (Theorem 5)
The proof is similar to that of Theorem 1. Note that a quartet tree in the species phylogeny can be affected by a highway in at most a fraction
4.3. Inferring highways
The problem of inferring the highway locations is essentially a network reconstruction problem. Such problems are often computationally intractable (see, e.g., Huson et al., 2010). Therefore, we require some extra assumptions. Our goal here is not to provide the most general result but rather to illustrate that our analysis extends naturally to certain network settings. The following assumption is related to so-called galled trees.
Assumption 1
We assume that no highway connects two edges in Ts separated by less than two edges or edges adjacent to root edges. (Such cases cannot be reconstructed.) Seen as an edge superimposed on Ts, a highway event
We then prove the following. We use a computationally efficient algorithm, which we call RoadRoller, described in Figure 4 and explained in the proof.

Algorithm RoadRoller.
Theorem 6 (Inferring highways)
Consider the Bounded-rates model with R=+∞and assume that B < +∞is constant. Assume further that there are constants
If
and Assumption 1 holds then it is possible to reconstruct the topology of the extant phylogeny as well as the highway edges w.h.p. from N=Ω(log n+) gene tree topologies if
Proof (Theorem 6)
Consider a four-tuple X such that
Further, it follows by the proof of Theorem 5 that, if
For X, X′ with quartets in
By the argument above, quartets sharing an edge along their internal branch are necessarily affected by the same highway. Take the transitive closure ∼* of ∼. Let W be an equivalence class of ∼*. We reconstruct the corresponding highway as follows. The union of all edges in e(X, X′) for some pair X, X′ in W forms a path

Setup in the proof of Theorem 6. The gray arrow indicates a highway. Here
As we argued in the proof of Lemma 1, all quartets affected by the highway corresponding to W contain at least one leaf in a pruned subtree. Because we allow LGT events in both directions along a highway, there are two potential pruned subtrees. Moreover, the other three leaves must be in separate subtrees hanging from the path
Hence, the pruned subtrees can be identified by checking the four-tuples in W and finding the pairs of subtrees with at least one of them present in all of W. If there is a unique such pair, this gives the two highway edges and we are done. Otherwise, the recipient edge is the intersection of the pairs found. To identify the donor edge, one simply needs to use a four-tuple X of leaves in the four adjacent subtrees to the endpoints of
5. Distance Method and Sequence Lengths
In this section, in the highway-free case, we analyze an alternative, distance-based approach that has been considered in the literature, and we provide sequence-length requirements. Although the quartet-based method analyzed in Section 3 can in principle handle arbitrary branch lengths (as only the topology of the gene trees is used), here we need to assume that the gene-tree branch lengths are determined by interspeciation times and lineage-specific rates of substitution. For simplicity, we assume that there is no gene-specific substitution rate. In practice, one could incorporate such rates by using a normalization procedure as detailed in Kim and Salisbury (2001) and Ge et al. (2005).
5.1. A distance-based approach
We analyze a distance-based approach similar to that introduced in Kim and Salisbury (2001) and studied empirically in Ge et al. (2005). Given branch lengths, a gene tree is naturally equipped with a tree metric on the leaves D
g
: Lg × Lg → (0, +∞) defined as follows
where P g (u, v) is the set of edges on the path between u and v in Tg. We will refer to D g (u, v) as the evolutionary distance between u and v under g.
For each pair of extant species {a, b}, we compute the median
We abort if a pair is not included in any of the gene trees. We then use the distance matrix Dm to build a tree using the Short Quartet Method (Erdös et al., 1999a) (or any other statistically consistent, fast-converging distance-based method). We will refer to this method as the MedianTree (MT) method. The algorithm is detailed in Figure 6.

Algorithm MedianTree.
Probabilistic analysis
Define the maximum path weight (MPW)
Then:
Lemma 5 (Probability of a miss: Distance case)
Let Tg=(Vg, Eg, Lg; ωg) be a gene tree distributed according to the random LGT model such that X={a, b} ⊆ Lg. Let D
s
(a, b) be the evolutionary distance between a and b under the topology of the extant phylogeny (that is, under the event that no LGT has occurred). Then
Proof (Lemma 5)
The proof is similar to that of Lemma 1. ■
Lemma 6 (Bound on path weight: Bounded-rates case)
Under the bounded-rates model, it holds that
Proof (Lemma 6)
Note that
Lemma 7 (Bound on path weight: Yule case)
Under the Yule model, it holds that
with probability approaching 1 as n → +∞.
Proof (Lemma 7 )
The proof is similar to that of Lemma 4. ■
Proof (Theorems 1 and 3)
Using MT and Lemmas 6 and 7, the proof of Theorem 1 (and of Theorem 3) follows from the same lines as that of Theorem 1. Note however that our extra assumption on the gene-tree branch lengths is needed here to ensure that evolutionary distances are the same across all genes. ■
5.2. Taking into account sequence length
We have assumed so far that gene tree topologies and evolutionary distances are known perfectly. Of course, this is not the case in practice, and the effect of sequence length must be accounted for. One issue that arises is that LGT events may create very short branches that are difficult to infer. Nevertheless, we can prove the following. We assume that sequence data is generated independently on each gene tree according to a GTR model. Evolutionary distances are estimated using the log-det distance. See, for example, Semple and Steel (2003) for background on GTR models of substitution and the log-det distance. We assume n−=0 for simplicity.
Theorem 7 (Sequence-length requirements)
Under the bounded-rates and Yule models for the species phylogeny and the GTR model for sequences, assuming that substitution rates are bounded between constants, a sequence length per gene polynomial in n suffices for the MT algorithm to succeed if the number of genes is at most polynomial in n.
Proof (Theorem 7)
We only discuss the Yule model. The argument for the bounded-rates model is similar.
In our second proof of Theorem 3, we relied on the fact that, for every pair of taxa w.h.p., a strict majority of the gene-tree evolutionary distances has not been affected by LGT. Hence, if the worst case estimation error on the evolutionary distances is ɛ, then the median of the estimated distances must be in the interval [D s (a, b) − ɛ, D s (a, b) + ɛ] for all pairs of taxa a, b. Further, by the concentration bounds in Erdös et al. (1999b), for the SQM step of our MT algorithm to return the correct topology w.h.p., the sequence length must scale as an exponential of the depth of the tree divided by the square of the shortest branch length.
Under the Yule model, with probability approaching 1, the depth of the tree is O(log n) (by the proof of Lemma 4) and the shortest branch length (the minimum of O(n) exponentials with mean O(1)) is 1/poly(n). Hence the result follows. 2 ■
6. Discussion
We have shown that a species phylogeny or network can be reconstructed despite high levels of random LGT, and we have provided explicit quantitative bounds on tolerable rates of LGT. Moreover, our analysis sheds light on effective approaches for species tree building in the presence of LGT. Several problems remain open:
• Galtier and Daubin (2008) hypothesize that random LGT only becomes a significant hurdle when the rate of LGT greatly exceeds the rate of diversification. In our setting, this would imply that a value of Λ as high as Ω(n) may be achievable. Note that branches close to the leaves are particularly easy to reconstruct because they lie on small quartet trees that are less likely than deep ones to be hit by an LGT event. Is a recursive approach starting from the leaves possible here? See Mossel (2004) and Daskalakis et al. (2011) for recursive approaches in a related context. • In a related problem, we have analyzed distance-based and quartet-based methods. A better understanding of bipartition-based approaches is needed and may lead to a higher threshold for Λ. • What can be proved when a model of extinction is incorporated? • What can be proved when the number of genes is significantly less than log n? • In the presence of highways, dealing with more general network settings would be desirable. Also, our definition of highways as connecting two edges is somewhat restrictive. In general, one is also interested in preferential genetic transfers between clades. • On the practical side, the predictions made here should be further tested on real and simulated datasets. We note that there is extisting work in this direction (Beiko et al., 2005; Ge et al., 2005; Galtier, 2007; Puigbo et al., 2009, 2010; Koonin et al., 2011; Bansal et al., 2011).
Footnotes
Acknowledgments
S.R. was supported by NSF grant DMS-1007144. S.S. was supported by the USA-Israel Binational Science Foundation and by the Israel Science Foundation.
Disclosure Statement
No competing financial interests exist.
