From Small Parsimony to Horizontal Gene Transfer: Inferring Horizontal Transfer and Gene Loss for Single-Origin Characters

Abstract

The simple underlying pattern of presence–absence of a character within a species tree provides useful steps to trace complex evolutionary histories. Character-based models such as perfect transfer networks and their galled variant aim to leverage this information to predict horizontal gene transfers. Under the assumption that characters have a single origin, are rarely lost, and can be transferred horizontally, these models remain an efficient inference method for almost tree-like scenarios. Nevertheless, they can sometimes predict overly complicated scenarios, and their simplest structural variants are too restrictive for practical use. With the goal of extending this model to include loss events, we present a Sankoff–Rousseau-like algorithm that aims to recover the simplest possible scenarios that combine gene transfers and losses using solely the single-character information already contained in a given species tree. We establish a link between the small parsimony problem and the inference of scenarios with a minimum number of losses and transfers, allowing losses and transfers to carry user-defined penalties. We also explore the utility of our model for tracing possible highways of gene transfers by presenting a real case study on a dataset of bacterial species, with Kyoto Encyclopedia of Genes and Genomes functions as characters.

Keywords

character-based methods; homoplasy-free; horizontal gene transfer; LGT networks

1. INTRODUCTION

Horizontal gene transfer (HGT), also known as lateral gene transfer (LGT), is the transmission of genetic material between coexisting organisms and provides an alternative DNA exchange mechanism to the more widely studied parent–offspring relationships. Occurring both between and within all domains of life, HGT constitutes an important contribution to genetic diversity and serves as a source of functional innovation through the introduction of novel genes and metabolic pathways. It is one of the most conspicuous features found in bacterial genomes (Choi and Kim, 2007) and notably eases the adaptation of organisms to new conditions, which are sometimes extreme and life-threatening (Arnold et al., 2022). Transfers are also involved in eukaryotes, but large-scale HGT studies are hampered by the complexity of their genomes. They are nevertheless present, with a recent example including the transfer of genes that degrade cellulose from bacteria to beetles (Zhang et al., 2025).

There are currently two basic approaches to infer HGT (Ravenhall et al., 2015): parametric methods, which look for outliers throughout the genomic composition of single individuals, and phylogenetic methods, which find unusual evolutionary histories among groups of organisms. The majority of current methods rely on sequence-based information as input. However, these methods become unreliable for predicting ancient transfers, as the sequences involved are too divergent to provide clear evolutionary signals (Yang and Rannala, 2012).

An alternative to sequence-based methods is character-based methods. A character is a morphological or molecular trait that a taxon may possess or lack. Character-based approaches may recover phylogenetic signals more accurately when highly divergent sequences are involved (Alexander et al., 2007) and aim to explain the character diversity of a set of taxa in terms of state changes. Several character-based models have been established in the literature, each imposing conditions on how a character may emerge and evolve. For example, a perfect phylogeny requires that characters change their state at most once. Here, we will be concerned with a particular type of characters that have only two states: presence and absence. A character transition from absence to presence is called a gain, while a transition from presence to absence is called a loss. In this setting, a variant of perfect phylogeny requires a character to be gained once and never lost. Although the perfect phylogeny model has been shown to be too restrictive to explain the evolutionary history of a set of characters, its extensions remain an active area of research with promising applications (Bonizzoni et al., 2014).

Relaxations of character state transition models were subsequently developed, including Dollo parsimony (Farris, 1977), where characters can only be gained once but can be lost many times. This assumption has been shown to be more suitable for complex characters such as restriction sites and introns (Felsenstein, 2004). Optimization problems involving the reconstruction of phylogenies under Dollo parsimony were shown to be NP-hard, even for binary characters (Day et al., 1986). More recently, a variant of this model, the Dollo-k model, where characters can be lost at most k times, was studied in Bouckaert et al. (2021).

An appealing alternative is to consider network-like structures instead of trees to explain characters. A representative model called perfect phylogenetic networks (PPNs) was introduced in Nakhleh et al. (2005) and Nakhleh (2004) and asks for a tree displayed by the network to be a perfect phylogeny. Unfortunately, this model is difficult to work with, since even deciding whether a known network explains a set of (multistate) characters is NP-hard (Warnow et al., 2025). In López Sánchez and Lafond (2022), we introduced a special case of PPNs called perfect transfer networks (PTNs). In this model, characters are binary, have a unique origin, and cannot be lost once acquired. The biological motivation behind PTNs is to study transferable characters that are difficult to revert such as organelles in endosymbiotic events (Zachar and Boza, 2020; Anselmetti et al., 2021) and metabolites (Goyal, 2022). In contrast to PPNs, the tree that dictates the vertical inheritance is fixed in PTNs. This affects the complexity of recognition and optimization problems: Deciding whether a network explains a given set of characters and whether we can augment a tree with transfer arcs to explain the evolution of a set of characters are polynomial-time solvable.

Although easy to compute, PTNs can sometimes predict overly complicated evolutionary scenarios that require a larger number of transfers [note that so-called galled PTNs were introduced in López Sánchez and Lafond (2024) to simplify these scenarios, but the structure imposed on the networks was shown to be too restrictive in practice]. This is mainly because these prior models do not allow losses, sometimes requiring the addition of several transfers to explain a single loss. In contrast, the explicit inclusion of transfer and loss parameters would allow modeling a greater diversity of biological scenarios. For example, higher loss rates are often associated with the evolution of pathogenic bacteria and symbionts (Moran, 2002), whereas certain genera, such as Listeria, exhibit reduced rates of both gene gain and loss (den Bakker et al., 2010).

1.1. Our contribution

In this work, we broaden the definition of PTNs by incorporating loss events as in the Dollo parsimony model. We aim to find the most parsimonious evolutionary scenario under the assumption that each character emerges only once (this is known as the “no-homoplasy” condition). We allow transfers and losses to have a different cost, as previously explored in the context of host–parasite reconciliation networks (Charleston, 1998).

Our main problem of interest is the minimum homoplasy-free completion problem, where we must augment a given leaf-labeled tree with internal node labels and transfer arcs in order to achieve a minimum cost scenario. Importantly, the resulting network is required to be time-consistent. We start by establishing a connection between this Dollo parsimony variant and the small parsimony problem and then show that using a modification of the well-known Sankoff–Rousseau algorithm (Sankoff and Rousseau, 1975), we can find an inner labeling of the given tree that minimizes the loss and transfer cost.

We conclude with a case study of inferred scenarios on a bacterial dataset consisting of bacterial species and functional characters obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al., 2016a,b). We show that the phylogenetic signal of some transfer events reported in the literature is contained within special pairs of nodes in the given species tree known as transfer highways (Beiko et al., 2005; Bansal et al., 2011).

This article is an extended version of our earlier RECOMB-CG 2025 conference paper. While the original work focused on single-character algorithms, we now introduce a new section discussing the extension of our model to handle multiple characters simultaneously. The experimental analysis has been expanded with detailed descriptions of the functional characters used in the study, an evaluation of the transfer and loss costs associated with the inferred networks, and a deeper comparison of the transfers identified by using an unmodified two-state Sankoff–Rousseau algorithm and our model. In addition, we provide a step-by-step working example to illustrate the application of our algorithm.

1.2. Related work

From a stochastic perspective, a model that combines the inference of horizontal transfers with Bayesian inference was presented in Kelly and Nicholls (2017). From a combinatorial perspective, besides PPNs and variants (Warnow et al., 2025), the model closest to our work, to our knowledge, was presented in Van Iersel et al. (2010). In this work, the authors provide upper or lower bounds of the number of transfers needed when considering transfer and loss events to explain the evolutionary history of a set of characters, when we require that every node in the resulting network contains at most k characters. Additionally, they show that finding a character assignment for a given network and a set of characters that satisfies these conditions is NP-complete. Furthermore, the problem of penalizing losses in this context was left as an open problem in their conclusion.

1.3. Overview of this article

Section 2 introduces the mathematical framework of our model, including time-consistent LGT networks, homoplasy-free scenarios, and the formal definition of our main problem: the minimum homoplasy-free completion problem. Section 3 presents our main algorithmic contribution, where we show how an optimal internal labeling of the given tree guides the placement of transfer arcs and ensures time-consistency. We also formally connect our labeling problem to the classical small parsimony problem and motivate the need for a modification of the Sankoff–Rousseau algorithm. Section 4 extends our approach from single characters to sets of characters, describing how individual solutions can be combined into a generalized framework. Section 5 provides an experimental evaluation on KEGG functional characters, illustrating the behavior of our model on real data and highlighting applications to the detection of transfer highways.

2. PRELIMINARIES

Unless stated otherwise, all graphs in this work are directed and loopless. For a graph N, $V (N)$ and $E (N)$ denote the sets of nodes and arcs of N, respectively, and $L (N)$ denotes the set of leaves, which are the nodes of outdegree 0. The nodes in $V (N) ∖ L (N)$ are called internal nodes. For a subset of nodes $X \subseteq V (N)$, N[X] denotes the subgraph of N induced by X, which is the graph with node set X and arc set $\{(u, v) : u \in X, v \in X, (u, v) \in E (N)\}$.

A phylogenetic network, or simply a network, is a directed acyclic graph N with a unique node $ρ (N)$ of indegree 0 called the root, in which all leaves have indegree 1. A node of indegree 1 and outdegree 1 is a subdivision node, which we allow. We say that a node $v \in V (N)$ reaches a node $u \in V (N)$ if there exists a directed path from v to u in N.

We now introduce notation that is specific to trees. A tree T is a network whose underlying undirected graph has no cycles. For $v \in V (T)$ , a child of v is a node u such that $(v, u) \in E (T)$ , and v is the parent of u. Thus, edges are oriented from parent to child. A node $v \in V (T)$ is an ancestor of $u \in V (T)$ if v is on the path from $ρ (T)$ to u. In this case, we call u a descendant of v and write $u ⪯_{T} v$ (or $u ≺_{T} v$ if we know that $u \neq v$ ). The ancestor order $⪯_{T}$ is a partial order of $V (T)$ and, in particular, $ρ (T)$ is the unique maximal element. We say that two nodes u and v are comparable if $u ⪯_{T} v$ or $v ⪯_{T} u$ , and we say that they are incomparable otherwise. We will drop the subscript T when T is clear from the context. For $v \in V (T)$ , T[v] denotes the subtree rooted at v.

2.1. LGT networks and time-consistency

An LGT network (where LGT comes from lateral gene transfers; Cardona et al., 2015) is a network $N = (V, E_{S} \cup E_{T})$, where $\{E_{S}, E_{T}\}$ is a specified partition of the arc set of N, such that the subgraph $T_{N} := (V, E_{S})$ is a tree with the same set of nodes as N. The tree $T_{N}$ is called the support tree of N. The arcs in $E_{S}$ are called support arcs, and the arcs in $E_{T}$ are called transfer arcs. For a transfer arc $(u, v) \in E_{T}$, the endpoints u and v are called transfer nodes, and in particular, u is called the donor and v is called the recipient. For example, see Figure 1 (Left): the black edges represent the support arcs $E_{S}$, and the gray edges represent the transfer arcs $E_{T}$.

FIG. 1.

(Left) A network N and a character $γ$ . (Right) A homoplasy-free scenario for $γ$ , where v is the unique node that reaches all 1-nodes.

We assume that transfer nodes have exactly one out-neighbor in $T_{N}$, and so they are subdivision nodes in $T_{N}$. The tree obtained from $T_{N}$ by suppressing its subdivision nodes is called the base tree of N. We emphasize that we have not defined a notion of ancestry or partial order between the nodes of an LGT network, as it is not needed. Instead, we always refer to the ancestry relationships between nodes of N in the support tree $T_{N}$, using the notation $⪯_{T_{N}}$ (which is well defined, as above, since $T_{N}$ is a tree).

We aim to reconstruct LGT networks that are biologically feasible in terms of time. This implies that transfers should occur only between ancestral species that coexisted. We define a time-consistent map over the nodes of an LGT network $N = (V, E_{S} \cup E_{T})$ as a function $T : V \to \mathbb{R}$ such that:

for every $(u, v) \in E_{S}$ , $T (u) > T (v)$ .

for every $(u, v) \in E_{T}$ , $T (u) = T (v)$ .

A network N is time-consistent if there exists a time-consistent map of N. Note that given an LGT network, one can tell in linear time whether there exists a time-consistent map for it. See Corollary 1 in Górecki (2004).
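One way to test for the existence of a time-consistent map is sketched below. It is an illustrative approach and not necessarily the procedure of the cited corollary: the endpoints of each transfer arc must share a time, so they are merged into one class, and a valid map then exists exactly when the support arcs induce an acyclic graph on these classes. The function name and the input format (lists of node names and arc pairs) are assumptions made for this example, and the union-find step makes the sketch near-linear rather than strictly linear.

```python
from collections import defaultdict, deque


def is_time_consistent(nodes, support_arcs, transfer_arcs):
    """Return True if some map t has t(u) > t(v) on support arcs
    and t(u) == t(v) on transfer arcs."""
    # Union-find: both endpoints of a transfer arc must get the same time.
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in transfer_arcs:
        parent[find(u)] = find(v)

    # Quotient graph: one node per class, one arc per support arc.
    succ = defaultdict(list)
    indeg = {find(v): 0 for v in nodes}
    for u, v in support_arcs:
        a, b = find(u), find(v)
        if a == b:  # a support arc inside a class would force t(u) > t(u)
            return False
        succ[a].append(b)
        indeg[b] += 1

    # Kahn's algorithm: the quotient must be acyclic for a time map to exist.
    queue = deque(c for c, d in indeg.items() if d == 0)
    seen = 0
    while queue:
        c = queue.popleft()
        seen += 1
        for d in succ[c]:
            indeg[d] -= 1
            if indeg[d] == 0:
                queue.append(d)
    return seen == len(indeg)


# Two root-to-leaf paths r -> a1 -> a2 -> a and r -> b1 -> b2 -> b.
nodes = ["r", "a1", "a2", "a", "b1", "b2", "b"]
support = [("r", "a1"), ("a1", "a2"), ("a2", "a"),
           ("r", "b1"), ("b1", "b2"), ("b2", "b")]
print(is_time_consistent(nodes, support, [("a1", "b1")]))
print(is_time_consistent(nodes, support, [("a1", "b2"), ("b1", "a2")]))
```

In the second call, the two transfers cross each other in time (each donor lies above the other transfer's recipient), so no time map can exist.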

2.2. Homoplasy-free scenarios

Let $S$ be a set of taxa. A character $γ$ (on $S$) is a function $γ : S \to \{0, 1\}$, where $γ (x) = 1$ indicates that species x possesses the character, while $γ (x) = 0$ indicates that it does not. We focus on the evolution of a single character and discuss later on possible ways to handle multiple characters.

To formalize transfer networks, given an LGT network N, a labeling of N is a function $l : V (N) \to \{0, 1\}$ that indicates which nodes possess the character and which do not.

Note that $l^{- 1} (1)$ denotes the set of nodes of N labeled 1 in l, and $l^{- 1} (0)$ those labeled 0. If N and l are clear from the context, the nodes in $l^{- 1} (1)$ may be called 1-nodes, and those of $l^{- 1} (0)$ may be called 0-nodes.

Definition 1. Let $S$ be a set of taxa, let $γ$ be a character on $S$ , and let $N = (V, E_{S} \cup E_{T})$ be an LGT network with leafset $L (N) = S$ . We say that a labeling l of N is a homoplasy-free fit of $γ$ if the following conditions hold:

1. For each leaf x, $l (x) = γ (x)$ (leaves are labeled by their character);
2. Denote $V_{1} = l^{- 1} (1)$. Then in $N [V_{1}]$, the subgraph induced by the 1-nodes, there exists a unique node $v \in V_{1}$ that reaches every node in $V_{1}$ (single origin).

Furthermore, we call the pair $(N, l)$ a homoplasy-free scenario for $γ$ .
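The two conditions of Definition 1 can be checked mechanically. The sketch below is an illustration under assumed input conventions (arcs as pairs of node names, labelings as dictionaries); the function name is hypothetical. It relies on the fact that, in a directed acyclic graph, a node that reaches all of $V_{1}$ must be the only node of indegree 0 in $N [V_{1}]$.

```python
from collections import deque


def is_homoplasy_free_fit(arcs, leaf_char, label):
    """arcs: all arcs of N (support and transfer);
    leaf_char: the character gamma on the leaves;
    label: the labeling l on every node of N."""
    # Condition 1: leaves carry their observed character state.
    if any(label[x] != state for x, state in leaf_char.items()):
        return False

    # Build the induced subgraph N[V_1] on the nodes labeled 1.
    ones = {v for v, s in label.items() if s == 1}
    succ = {v: [] for v in ones}
    indeg = {v: 0 for v in ones}
    for u, v in arcs:
        if u in ones and v in ones:
            succ[u].append(v)
            indeg[v] += 1

    # Condition 2: exactly one source, and it reaches every 1-node.
    sources = [v for v in ones if indeg[v] == 0]
    if len(sources) != 1:
        return False
    seen, queue = {sources[0]}, deque(sources)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == ones


# Base tree r -> (u, w), u -> (a, b), w -> (c, d), with subdivision
# nodes u2 (donor) and w2 (recipient) added for one transfer arc.
support = [("r", "u"), ("r", "w2"), ("w2", "w"), ("u", "u2"),
           ("u2", "a"), ("u", "b"), ("w", "c"), ("w", "d")]
transfer = [("u2", "w2")]
gamma = {"a": 1, "b": 0, "c": 1, "d": 0}
label = {"r": 0, "u": 1, "u2": 1, "w2": 1, "w": 1, **gamma}
print(is_homoplasy_free_fit(support + transfer, gamma, label))
print(is_homoplasy_free_fit(support, gamma, label))
```

Without the transfer arc, the 1-nodes split into two components with separate sources, so the single origin condition fails.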

The single origin condition states that the first node that acquires a character transmits it to every other species that possess the character through a directed path of 1-nodes. See node v in Figure 1 (Right). This models the “no-homoplasy” condition, that is, that a character cannot emerge independently in two species.

2.2.1. Transfer and loss events

Given a homoplasy-free scenario $(N, l)$ for some character $γ$ , where $N = (V, E_{S} \cup E_{T})$ , the number of transfers of $(N, l)$ is simply $| E_{T} |$ , the number of transfer arcs. A loss is an arc $(u, v) \in E_{S}$ of the support tree such that $l (u) = 1$ and $l (v) = 0$ . This represents failure to transmit the character to a direct vertical descendant.

We emphasize that a loss can only occur on a support tree arc, representing the lack of vertical transmission—a character that is present on the donor of a transfer arc but is not on the recipient is not seen as a loss. Also, note that in previous work, losses were entirely forbidden (López Sánchez and Lafond, 2022), while here we do not restrict the number of losses of a character.

2.3. The minimum homoplasy-free completion problem

We now turn to the problem of predicting transfer and loss locations on a given species tree. As noted above, the species tree depicts the vertical inheritance of the character and serves as a backbone for any homoplasy-free scenario. However, for any given tree T and character $γ$, there always exists a trivial homoplasy-free fit of $γ$: assign each leaf x its character state $l (x) = γ (x)$ and assign $l (v) = 1$ to every internal node v. In this labeling, the root $ρ (T)$ becomes the unique origin of the character that reaches every node in $V_{1}$, yet the scenario is uninformative, since it ignores the possibility of horizontal transfers. For this reason, we focus on the more meaningful optimization variant, where the goal is to find a homoplasy-free scenario that explains the data using as few transfer and loss events as possible.

Let $(N, l)$ be a homoplasy-free scenario for some character $γ$. Given a transfer cost $\mathrm{cost}_{T}$ and a loss cost $\mathrm{cost}_{L}$, we define the transfer–loss cost of $(N, l)$ as:

$$\mathrm{cost}_{TL} (N, l) = \mathrm{cost}_{T} \cdot | E_{T} | + \mathrm{cost}_{L} \cdot | \mathrm{losses} (N, l) |$$

where

$$\mathrm{losses} (N, l) = \{(u, v) \in E_{S} : l (u) = 1, l (v) = 0\}$$

is the set of support arcs on which the character was lost.
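As a small numeric illustration of the transfer–loss cost, the sketch below evaluates the formula on a toy scenario. The function name, the node names, and the chosen weights are assumptions made for the example only.

```python
def transfer_loss_cost(support_arcs, transfer_arcs, label, cost_t, cost_l):
    """Weighted count of transfer arcs and of support arcs going from 1 to 0."""
    losses = [(u, v) for u, v in support_arcs
              if label[u] == 1 and label[v] == 0]
    return cost_t * len(transfer_arcs) + cost_l * len(losses)


# One transfer (u2 -> w2) and two losses (u -> b and w -> d).
support = [("r", "u"), ("r", "w2"), ("w2", "w"), ("u", "u2"),
           ("u2", "a"), ("u", "b"), ("w", "c"), ("w", "d")]
transfer = [("u2", "w2")]
label = {"r": 0, "u": 1, "u2": 1, "w2": 1, "w": 1,
         "a": 1, "b": 0, "c": 1, "d": 0}
print(transfer_loss_cost(support, transfer, label, cost_t=2, cost_l=1))  # 4
```

With a transfer weight of 2 and a loss weight of 1, the scenario costs 2 · 1 + 1 · 2 = 4.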

Our main problem of interest is the following.

The minimum homoplasy-free completion problem

Input. A character $γ$ on a taxa set $S$; a tree T with leaf-set $L (T) = S$; a weight $\mathrm{cost}_{T}$ for transfer events; a weight $\mathrm{cost}_{L}$ for loss events.

Output. A homoplasy-free scenario $(N, l)$ for $γ$, such that N is time-consistent, has T as base tree, and such that $\mathrm{cost}_{TL} (N, l)$ is minimized.

In the following, a homoplasy-free scenario $(N, l)$ for $γ$ that is time-consistent and has T as a base tree will be called a completion of T. Note that minimizing $\mathrm{cost}_{TL} (N, l)$ goes beyond simply minimizing the number of transfers and losses, since the cost function allows us to weight transfers and losses differently. In this way, we can reflect biological assumptions about their relative likelihood but also cover the unweighted case when $\mathrm{cost}_{T} = \mathrm{cost}_{L}$.

3. A POLYNOMIAL-TIME ALGORITHM FOR THE MINIMUM HOMOPLASY-FREE COMPLETION PROBLEM

Our strategy is to infer a labeling $λ$ on the input tree T directly and then let the labeling guide the locations on which to add transfer arcs. We call a labeling $λ$ of T an infra-labeling, as it serves as the labeling underlying an intended completion. We argue that the losses inferred on T by the labeling $λ$ can be extended to a completion $(N, l)$ of T with the same number of losses. In a similar manner, the arcs $(u, v)$ of T such that $λ (u) = 0$ and $λ (v) = 1$ correspond to a gain, which must be acquired by a transfer event—except for one such gain that corresponds to the origin of the character. In other words, a labeling $λ$ of T implies the existence of a completion $(N, l)$ whose number of losses is the number of arcs $(u, v)$ of T that transition from 1 to 0, and the number of transfers is the number of arcs $(u, v)$ of T that transition from 0 to 1, minus one.

3.1. Infra-labelings and completions

Formally, for a tree T, an infra-labeling $λ$ for character $γ$ is a function $λ : V (T) \to \{0, 1\}$ such that leaves are labeled by their characters, that is, $λ (x) = γ (x)$ for each $x \in L (T)$. The pair $(T, λ)$ is called an infra-labeled tree (for $γ$). Given such a pair $(T, λ)$, an arc $(u, v) \in E (T)$ is a gain arc if $λ (u) = 0$ and $λ (v) = 1$, and we call v a gain node, or just a gain for short. For technical reasons, if the root of T satisfies $λ (ρ (T)) = 1$, then $ρ (T)$ is also a gain node. We assume henceforth that at least one gain node exists, as otherwise this implies that no leaf has the character. We next define

$$\begin{aligned} \mathrm{losses} (T, λ) &= \{(u, v) \in E (T) : λ (u) = 1, λ (v) = 0\}, \\ \mathrm{gains} (T, λ) &= \{(u, v) \in E (T) : λ (u) = 0, λ (v) = 1\} . \end{aligned}$$
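The sketch below collects the gain nodes and loss arcs of an infra-labeled tree and evaluates the cost that such a labeling implies (one transfer per gain node except the origin, plus the losses). It is an illustration only: the tree encoding (a dictionary from each internal node to its children) and the function names are assumptions, and the root is counted as a gain node when it is labeled 1, following the convention stated in the text.

```python
def gain_nodes_and_losses(children, root, lam):
    """children: internal node -> list of children; lam: infra-labeling."""
    gain_nodes = [root] if lam[root] == 1 else []  # root convention
    loss_arcs = []
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if lam[u] == 0 and lam[v] == 1:
                gain_nodes.append(v)      # head of a gain arc
            elif lam[u] == 1 and lam[v] == 0:
                loss_arcs.append((u, v))  # loss arc
            stack.append(v)
    return gain_nodes, loss_arcs


def implied_cost(children, root, lam, cost_t, cost_l):
    """cost_t * (#gain nodes - 1) + cost_l * #losses."""
    gains, losses = gain_nodes_and_losses(children, root, lam)
    return cost_t * (len(gains) - 1) + cost_l * len(losses)


children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
lam = {"r": 0, "u": 1, "w": 1, "a": 1, "b": 0, "c": 1, "d": 0}
print(gain_nodes_and_losses(children, "r", lam))
print(implied_cost(children, "r", lam, cost_t=2, cost_l=1))  # 4
```

Here u and w are gain nodes and (u, b), (w, d) are losses, so the implied cost is 2 · (2 − 1) + 1 · 2 = 4.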

An island of $λ$ is an inclusion-maximal subset of nodes $I \subseteq V (T)$ such that: $λ (v) = 1$ for all $v \in I$ ; and between any $u, v \in I$ , there exists a path in the underlying undirected graph of T that consists only of nodes w with label $λ (w) = 1$ . In other words, an island is a connected component of the subgraph of T induced by the nodes labeled 1, where here connected components ignore arc directions. Figure 2 (left) illustrates this notion: for example, the node $g_{1}$ together with its right subtree forms an island, while other islands are also present in the figure. The following follows from this definition.

FIG. 2.

(Left) An example instance $(T, γ, λ)$ , with white nodes and stars representing the nodes whose label is 1, and stars representing gain nodes. (Right) The completion $(N, l)$ returned by Algorithm 1.

Lemma 1. Let $(T, λ)$ be an infra-labeled tree for a character $γ$ , and let I be an island of $λ$ . Then I contains exactly one gain node v. Furthermore, v is the unique node of I that reaches every node in I.

Proof. Since I is a connected subgraph of the underlying undirected graph of a tree, the induced subgraph T[I] must also be a tree (directed). Letting v be the root of that tree T[I], we see that v is the unique node of I that reaches the other nodes in I. Then, v must be a gain of $(T, λ)$: if v is the root of T, this holds by definition, and otherwise if v has a parent w, then w cannot be labeled 1 by the maximality of I. All nodes $u \in I ∖ \{v\}$ have a parent labeled 1, since v reaches u in T[I] through that parent, and thus such a u is not a gain. Therefore, v is the only gain of I. □
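Lemma 1 can be illustrated by computing the islands directly. In the sketch below (an assumed encoding with a dictionary from internal nodes to children; the function name is hypothetical), each island is keyed by its gain node, which makes the one-to-one correspondence between gain nodes and islands explicit.

```python
def islands(children, root, lam):
    """Map each gain node to the set of nodes of its island."""
    result = {}
    # Each stack entry holds a node and the gain node of its parent's
    # island (None when the parent is labeled 0 or the node is the root).
    stack = [(root, None)]
    while stack:
        v, parent_gain = stack.pop()
        if lam[v] == 1:
            # A 1-node either extends its parent's island or starts a new one.
            gain = parent_gain if parent_gain is not None else v
            result.setdefault(gain, set()).add(v)
        else:
            gain = None
        for c in children.get(v, []):
            stack.append((c, gain))
    return result


children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
lam = {"r": 0, "u": 1, "w": 1, "a": 1, "b": 0, "c": 1, "d": 0}
print(islands(children, "r", lam))
```

On this example there are two islands, one rooted at the gain node u and one at the gain node w.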

As a consequence, the gain nodes of T are in one-to-one correspondence with the islands. We can now introduce the main idea of our correspondence between infra-labelings and completions: the islands present in the tree already ensure a “vertical” connectivity; thus, it suffices to find a suitable ordering of the gains to connect them using transfers. The order of connection matters for time-consistency, but we show that if we add transfers from the “highest” gains to the “lowest” gains in order and assign times to transfer nodes in decreasing order, then this can be done. Algorithm 1 shows how this can be achieved.

Algorithm 1 mostly serves as an existential result, that is, it shows that any infra-labeling can be turned into a completion by adding one particular sequence of transfer arcs. In general, there will be many other ways to add such transfers to obtain alternate completions, since different orderings of the gains lead to different completions. Thus, the algorithm should be seen as a tool to make the correspondence between the cost of infra-labelings and completions, not as a direct transfer prediction approach. Nonetheless, our experimental analysis shows how the approach can be used to predict a special type of horizontal transfer. We now proceed to show that the algorithm constructs a time-consistent network with our desired cost.
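The listing of Algorithm 1 is not reproduced in this text, so the following is only a hedged sketch assembled from the description above and from the steps referenced in the proofs of Lemmas 2 and 3. The details are assumptions: gains are ordered by a preorder traversal (ancestors first), the donor is placed on the arc above a leaf gain or on an arc below an internal gain chosen away from the next gain, every internal node of T is assumed to have at least two children, the tree is assumed to have more than one node, and the names given to new subdivision nodes are hypothetical.

```python
def algorithm1_sketch(children, root, lam):
    """Return (support_arcs, transfer_arcs, label, times) for one completion."""
    kids = {v: list(cs) for v, cs in children.items()}
    parent = {c: p for p, cs in kids.items() for c in cs}
    label = dict(lam)
    n = len(lam)  # |V(T)|, starting point for the decreasing transfer times

    # Preorder lists every node before its descendants, so the gains come
    # out ordered from the "highest" to the "lowest".
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(kids.get(v, []))
    gains = [v for v in order
             if lam[v] == 1 and (v == root or lam[parent[v]] == 0)]

    def below(a, d):
        """True if d equals a or descends from a in the current support tree."""
        while d != a and d in parent:
            d = parent[d]
        return d == a

    def subdivide(p, c, new):
        """Replace arc (p, c) by (p, new) and (new, c); new is labeled 1."""
        kids[p][kids[p].index(c)] = new
        kids[new] = [c]
        parent[new], parent[c] = p, new
        label[new] = 1

    transfers, times = [], {}
    for i in range(len(gains) - 1):
        g, g_next = gains[i], gains[i + 1]
        donor, recipient = f"w{i + 1}'", f"g{i + 2}'"  # assumed fresh names
        if kids.get(g):
            # Internal gain: donor just below g, off the path to the next gain.
            w = next(c for c in kids[g] if not below(c, g_next))
            subdivide(g, w, donor)
        else:
            # Leaf gain: donor just above g.
            subdivide(parent[g], g, donor)
        subdivide(parent[g_next], g_next, recipient)
        transfers.append((donor, recipient))
        times[donor] = times[recipient] = n - (i + 1)
    support = [(p, c) for p, cs in kids.items() for c in cs]
    return support, transfers, label, times


children = {"r": ["u", "w"], "u": ["a", "b"], "w": ["c", "d"]}
lam = {"r": 0, "u": 1, "w": 1, "a": 1, "b": 0, "c": 1, "d": 0}
support, transfers, label, times = algorithm1_sketch(children, "r", lam)
print(transfers, times)
```

The sketch adds one transfer per gain after the first, labels every new subdivision node 1, and never relabels a node of T, which mirrors the properties established in the lemmas that follow. Other orderings of the gains would yield other completions.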

Lemma 2. The network N returned by Algorithm 1 is time-consistent.

Proof. Let $N = (V, E_{S} \cup E_{T})$ be the network returned by the algorithm on the input $T, γ, λ$ and let $T_{N}$ be its support tree. Let us first note that the ordering ${g_{1}, \dots, g_{k}}$ of the gain nodes as specified in the algorithm exists, since it can be obtained from a topological sort of T. The algorithm adds transfer arcs one at a time. Note that for each $i \in [k - 1]$ , when the i-th transfer arc is added, the choice of $w_{i}$ in the algorithm ensures that transfer arcs are only between incomparable nodes of N, whether $g_{i}$ is a leaf or not. Also, the partial time map $T^{*}$ assigns an equal time to each endpoint of each transfer arc, and each transfer created gets a time lower than the preceding.

We argue that after every iteration of the main loop of the algorithm, the following invariant holds in the network N obtained after the iteration: for any two transfer nodes u, v that exist after the i-th iteration, we have that $u ≺_{T_{N}} v$ implies $T^{*} (u) < T^{*} (v)$ .

Initially, when no transfer is added, this is vacuously true. Consider the i-th iteration of the algorithm, when transfer $(w_{i}^{'}, g_{i + 1}^{'})$ is added. Here, $w_{i}^{'}$ is either the parent of $g_{i}$ in $T_{N}$ if $g_{i}$ is a leaf or a child of $g_{i}$ in $T_{N}$. Note that either way, before the transfer arc is added, $g_{i}$ has none of $g_{1}, \dots, g_{i - 1}$ as a descendant in $T_{N}$ (by our ordering of the gain nodes). For this reason, $g_{i}$ has none of the transfer nodes $g_{1}^{'}, \dots, g_{i - 1}^{'}, w_{1}^{'}, \dots, w_{i - 1}^{'}$ as a descendant in $T_{N}$. The same is true for $g_{i + 1}$. Thus, when $w_{i}^{'}$ is created as a parent or child of $g_{i}$ in $T_{N}$, $w_{i}^{'}$ has none of the aforementioned transfer nodes as a descendant. The same is true for $g_{i + 1}^{'}$ when it is created as the parent of $g_{i + 1}$. It follows that if $u ≺_{T_{N}} v$ and one of u, v is $w_{i}^{'}$ or $g_{i + 1}^{'}$, then only $u = w_{i}^{'}$ or $u = g_{i + 1}^{'}$ is possible (v cannot be equal to either, since no other transfer node descends from them). At the moment of their creation, $w_{i}^{'}$ and $g_{i + 1}^{'}$ are assigned time $| V (T) | - i$, the smallest time among the transfer nodes so far, and it follows that $T^{*} (u) < T^{*} (v)$ when $u \in \{w_{i}^{'}, g_{i + 1}^{'}\}$. As for pairs of transfer nodes not involving $w_{i}^{'}$ nor $g_{i + 1}^{'}$, their time and ancestry relationship in the support tree have not changed and so our invariant holds.

It follows that after the loop is finished and all transfer arcs are created, for any pair of transfer nodes u, v, we have $u ≺_{T_{N}} v$ implies $T^{*} (u) < T^{*} (v)$ (and endpoints of transfer arcs have an equal time). To argue time-consistency, it remains to extend $T^{*}$ to nontransfer nodes. First assign a large time to $ρ (N)$, say $T^{*} (ρ (N)) = | V (T) |$ (note that the root is not a transfer node). Then, as long as there exists a support arc $(u, v) \in E_{S}$ such that $T^{*}$ assigns a time to u but not to v, put $T^{*} (v) = T^{*} (u) - ϵ$, where $ϵ$ is a very small quantity. One can easily see that this maintains time-consistency, and thus N is time-consistent. □

Lemma 3. Let $N = (V, E_{S} \cup E_{T})$ and l be the network and labeling returned by Algorithm 1, on input $T, γ, λ$. Then, T is the base tree of N. Furthermore, l is a homoplasy-free fit of $γ$ that satisfies:

1. $l (v) = λ (v)$ for every $v \in V (N) \cap V (T)$;
2. $| losses (N, l) | = | losses (T, λ) |$;
3. $| E_{T} | = | gains (T, λ) | - 1$.

Proof. Let $N = (V, E_{S} \cup E_{T})$ and l be the network and labeling returned by the algorithm. It is clear that T is the base tree of N, since the algorithm starts from T and only subdivides some of its arcs to add transfer arcs.

We argue that l is a homoplasy-free fit of $γ$. First, by requirement on the input $λ$, we know that $l (v) = λ (v) = γ (v)$ for every $v \in L (N)$. Next, define $V_{1} = l^{- 1} (1)$, and let us argue that $N [V_{1}]$ has a node that reaches all of $V_{1}$. Consider the node $g_{1}$ of N, which is the first gain considered in Algorithm 1. Define $g^{*}$ as follows: if $g_{1}$ is a leaf, then $g^{*} = w_{1}^{'}$, which is the parent of $g_{1}$ in $T_{N}$; and if $g_{1}$ is not a leaf, then $g^{*} = g_{1}$. We argue that $g^{*}$ is able to reach every node in $N [V_{1}]$.

We show this by induction over the iterations of Algorithm 1. To avoid confusion, for $i \in [k - 1]$, denote by $(N_{i}, l_{i})$ the pair of network and labeling in the algorithm after the i-th iteration has finished. So, $(N_{1}, l_{1})$ is after the first transfer arc is added, $(N_{2}, l_{2})$ after the second, and so on. Note that $(N, l) = (N_{k - 1}, l_{k - 1})$. We show that for any $i \in [k - 1]$, there is a path of 1-nodes in $N_{i}$ from $g^{*}$ to any transfer node $g_{1}^{'}, w_{1}^{'}, \dots, g_{i}^{'}, w_{i}^{'}, g_{i + 1}^{'}$ created so far, and to any node of $V (N_{i}) \cap V (T)$ that is in the same island as one of $g_{1}, \dots, g_{i + 1}$. For $i = 1$, if $g_{1}$ is a leaf of T, then $g^{*} = w_{1}^{'}$ and $g^{*}$ reaches both $w_{1}^{'}$ and $g_{1}$. If $g_{1}$ is not a leaf, then $g^{*} = g_{1}$ and it again reaches both $g_{1}$ and $w_{1}^{'}$ using 1-nodes. Either way, $g_{1}$ reaches all the 1-nodes of $V (T) \cap V (N)$ on the island of $g_{1}$ (by Lemma 1). Also, $g^{*}$ can reach $g_{2}^{'}$ through the arc $(w_{1}^{'}, g_{2}^{'})$, then through $g_{2}^{'}$ it reaches $g_{2}$ and all the 1-nodes of its island. Now consider $i > 1$. We assume that using only 1-nodes in $N_{i - 1}$, $g^{*}$ reaches $g_{i}^{'}$, the parent of $g_{i}$ in $T_{N_{i - 1}}$. Whether $w_{i}^{'}$ is created as a parent of $g_{i}$ or as a child of $g_{i}$ in $T_{N_{i}}$, $g_{i}^{'}$ reaches $w_{i}^{'}$ using 1-nodes. This allows $g^{*}$ to reach the newly created transfer nodes $w_{i}^{'}$ and $g_{i + 1}^{'}$, and then $g_{i + 1}$ and the nodes of its island.

It follows that at the end of the algorithm, in $N [V_{1}]$, $g^{*}$ can reach every 1-node of an island and every transfer node. In T, every node v with $λ (v) = 1$ is part of an island rooted at a gain $g_{i}$, and in $(N, l)$, the set of 1-nodes consists of $λ^{- 1} (1)$ plus the transfer nodes. Therefore, $g^{*}$ reaches every 1-node of $N [V_{1}]$. This shows that l is a homoplasy-free fit of $γ$.

To see that $l (v) = λ (v)$ for every node in the base tree, it suffices to point out that the algorithm never changes the labeling of a node originally present in T. It only assigns characters to the subdivision nodes created for transfers.

We next show that $| losses (N, l) | = | losses (T, λ) |$. It suffices to show that each time an arc is subdivided to create a new node $w_{i}^{'}$ or $g_{i + 1}^{'}$, we do not generate any additional losses in the support tree. If the algorithm subdivides an arc $(p, g_{i})$, for some node p and $i \in [k]$, then it is replaced with two new arcs $(p, q)$ and $(q, g_{i})$, where $q \in \{w_{i}^{'}, g_{i}^{'}\}$. Note that $l (q) = l (g_{i}) = λ (g_{i}) = 1$, so $(q, g_{i})$ cannot be a loss, and $(p, q)$ is a loss if and only if $(p, g_{i})$ was a loss. So overall, the number of losses stays the same. The only other possible subdivision occurs on arcs $(g_{i}, w_{i})$, creating arcs $(g_{i}, q), (q, w_{i})$. Again, $l (q) = l (g_{i}) = 1$ and $(q, w_{i})$ is a loss if and only if $(g_{i}, w_{i})$ was a loss, and we get the same result.

Finally, to see that $| E_{T} | = | gains (T, λ) | - 1$ , observe that the number of transfer arcs added by the algorithm is $k - 1 = | gains (T, λ) | - 1$ , the number of iterations it performs.□

We can finally establish the correspondence between infra-labelings and homoplasy-free scenarios.

Theorem 1. Let $λ$ be an infra-labeling of T on character $γ$ that minimizes the quantity $\mathrm{cost}_{T} \cdot (| gains (T, λ) | - 1) + \mathrm{cost}_{L} \cdot | losses (T, λ) |$. Then, the pair $(N, l)$ returned by Algorithm 1 on input $T, γ, λ$ is a completion of T that minimizes $\mathrm{cost}_{T L}$ among all possible completions of T.

Proof. First note that, for the pair $(N, l)$ returned by the algorithm, we know by Lemma 2 that N is time-consistent. We also know by Lemma 3 that T is the base tree of N and that $(N, l)$ is a valid homoplasy-free scenario for $γ$. Therefore, $(N, l)$ is indeed a completion of T.

Also, by Lemma 3, $(N, l)$ has exactly $| gains (T, λ) | - 1$ transfer arcs and exactly $| losses (T, λ) |$ losses, and so the cost of $(N, l)$ is the same as stated in the theorem.

It remains to argue that the cost of the scenario $(N, l)$ is minimum.

Consider an alternate completion $(N^{'}, l^{'})$ of T, and assume that it has $t - 1$ transfer arcs and q losses, where $t \geq 1$ and $q \geq 0$. Suppose for contradiction that $\mathrm{cost}_{T L} (N^{'}, l^{'}) < \mathrm{cost}_{T L} (N, l)$, that is, that

\begin{array}{rl} \mathrm{cost}_{T L} (N^{'}, l^{'}) & = \mathrm{cost}_{T} \cdot (t - 1) + \mathrm{cost}_{L} \cdot q \\ & < \mathrm{cost}_{T} \cdot (| gains (T, λ) | - 1) + \mathrm{cost}_{L} \cdot | losses (T, λ) | . \end{array}

Consider the infra-labeling $λ^{'}$ of T such that $λ^{'} (v) = l^{'} (v)$ for every node $v \in V (N^{'}) \cap V (T)$ . We show that $λ^{'}$ has at most t gains and at most q losses. This will imply that

\mathrm{cost}_{T} \cdot (| gains (T, λ^{'}) | - 1) + \mathrm{cost}_{L} \cdot | losses (T, λ^{'}) | \leq \mathrm{cost}_{T} \cdot (t - 1) + \mathrm{cost}_{L} \cdot q = \mathrm{cost}_{T L} (N^{'}, l^{'}) < \mathrm{cost}_{T} \cdot (| gains (T, λ) | - 1) + \mathrm{cost}_{L} \cdot | losses (T, λ) |

which leads to a contradiction since $λ$ minimizes the latter quantity.

Let us first show that $λ^{'}$ has at most q losses. We map each loss of $λ^{'}$ to a distinct loss of $(N^{'}, l^{'})$. Suppose that $(u, v)$ is a loss arc of $(T, λ^{'})$, so that $l^{'} (u) = 1$ and $l^{'} (v) = 0$. In $N^{'}$, if $(u, v) \in E (N^{'})$, then $(u, v)$ is a support arc and also a loss in $(N^{'}, l^{'})$ since $l^{'} (u) = λ^{'} (u), l^{'} (v) = λ^{'} (v)$; in this case, let $(u^{'}, v^{'}) = (u, v)$. If $(u, v) \notin E (N^{'})$, then in $N^{'}$ there is a path $u = u_{0}, u_{1}, \dots, u_{k} = v$ consisting only of support arcs. One of these arcs $(u^{'}, v^{'})$ must be a loss, since u is a 1-node and v is a 0-node. In either case, we charge the loss on $(u, v)$ in T to the loss $(u^{'}, v^{'})$ of $(N^{'}, l^{'})$. Note that no two losses of $(T, λ^{'})$ charge to the same loss of $(N^{'}, l^{'})$.

Now consider the gains of $λ^{'}$. Suppose that $(u, v)$ is a gain arc of $(T, λ^{'})$, so that v is a gain node and $λ^{'} (u) = 0, λ^{'} (v) = 1$. Denote $V_{1} = (l^{'})^{- 1} (1)$, and let $g^{*}$ denote the unique node of $V_{1}$ that reaches every node of $N^{'} [V_{1}]$. As before, in $N^{'}$, there is an arc $(u^{'}, v^{'})$ such that $l^{'} (u^{'}) = 0, l^{'} (v^{'}) = 1$, where $u^{'}$ and $v^{'}$ are on the path from u to v in $T_{N^{'}}$ (note, $u = u^{'}$ or $v = v^{'}$ can happen). Unless $v^{'} = g^{*}$, $v^{'}$ must be the receiving end of a transfer arc $(p, v^{'})$ of $N^{'}$, as $v^{'}$ has only one in-neighbor $u^{'}$ in $T_{N^{'}}$ and $l^{'} (u^{'}) = 0$ (so the only possibility for $g^{*}$ to reach $v^{'}$ in $N^{'} [V_{1}]$ is through a transfer arc). In this case, we charge the gain arc $(u, v)$ of $(T, λ^{'})$ to the transfer arc $(p, v^{'})$. If $v^{'} = g^{*}$, the gain arc $(u, v)$ charges to nothing. Note that no two gain arcs of $(T, λ^{'})$ charge to the same transfer arc, as gain arcs all have distinct endpoints. Moreover, at most one gain arc $(u, v)$ charges to nothing, namely the one that has $g^{*}$ on the $u - v$ path of the support tree of $N^{'}$.

We have thus established that each loss of $(T, λ^{'})$ has a distinct corresponding loss in $(N^{'}, l^{'})$ to which it charges, and each gain in $(T, λ^{'})$ has a distinct corresponding transfer arc in $N^{'}$, except perhaps one. It follows that $(T, λ^{'})$ has at most t gains and at most q losses. As we previously established, this contradicts the assumption that $λ$ minimizes the quantity stated in the theorem. We deduce that $(N, l)$ is indeed a minimum cost homoplasy-free scenario.□

3.2. Using the Sankoff–Rousseau algorithm to find an infra-labeling

By Theorem 1, it now suffices to find an infra-labeling $λ$ of T that minimizes the following cost:

\mathrm{cost}_{T L}^{*} (T, λ) = \mathrm{cost}_{T} \cdot (| gains (T, λ) | - 1) + \mathrm{cost}_{L} \cdot | losses (T, λ) | .

Recalling that $| losses (T, λ) |$ counts the number of arcs that transition from 1 to 0, and $| gains (T, λ) |$ the arcs that go from 0 to 1, the problem is very similar to the small parsimony problem. In the latter, given a tree T, an assignment of character-states to its leaves and a predefined cost matrix M where every possible change of state has an assigned weight, the algorithm assigns a label $λ (v)$ to each node of T in a way that minimizes the cost

s (T, λ) = \sum_{(u, v) \in E (T)} M (λ (u), λ (v)) .

Note that in our case, we have only two states $\{0, 1\}$. Since the aim is to penalize the changes between different states, the cost matrix M is defined as $M (1, 0) = \mathrm{cost}_{L}, M (0, 1) = \mathrm{cost}_{T}$ and is 0 otherwise. We show that our problem can be solved by a variant of the Sankoff–Rousseau algorithm (Sankoff and Rousseau, 1975), which we recall below. We then show how the “minus one” applied to gains in our cost function changes the problem.

For binary labels, given a character $γ$ on $S$ and a tree T with $L (T) = S$, we consider the minimum cost $s (T [v], λ)$ of a labeling $λ$ of the T[v] subtree, under the condition that $λ (v) = a$. The Sankoff–Rousseau algorithm uses a dynamic programming table C[v, a] that stores $s (T [v], λ)$ for every $v \in V (T)$ and every label $a \in \{0, 1\}$. If v is a leaf, we have $C [v, λ (v)] = 0$ and $C [v, 1 - λ (v)] = \infty$. For each internal node v with label $a \in \{0, 1\}$, we have

C [v, a] = \sum_{x \in ch (v)} \min_{a^{'} \in \{0, 1\}} \{C [x, a^{'}] + M (a, a^{'})\}

where $ch (v)$ represents the set of all children of the node v. Once C[v, a] is computed for all nodes v of T and all labels $a \in \{0, 1\}$, the minimum cost of a labeling of the nodes of T is $\min_{a \in \{0, 1\}} C [ρ (T), a]$. A standard backtracking procedure can then reconstruct a labeling of minimum $s (T, λ)$ cost.
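The recurrence above can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' implementation: the function name `sankoff_binary`, the dictionary-based tree encoding, and the four-leaf example tree are assumptions introduced here.

```python
import math

INF = math.inf


def sankoff_binary(children, leaf_state, root, cost_t=1.0, cost_l=1.0):
    """Fill the table C[v][a] bottom-up for the two states {0, 1}.

    children   : dict mapping an internal node to the list of its children
    leaf_state : dict mapping each leaf to its observed state (0 or 1)
    """
    # M[a][a2] is the cost of an arc whose parent has state a and child a2.
    M = [[0.0, cost_t], [cost_l, 0.0]]
    C = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:
            # Leaf: only the observed state is allowed.
            C[v] = [0.0 if a == leaf_state[v] else INF for a in (0, 1)]
            return
        for x in kids:
            visit(x)
        # For each state a of v, every child picks its cheapest state a2.
        C[v] = [
            sum(min(C[x][a2] + M[a][a2] for a2 in (0, 1)) for x in kids)
            for a in (0, 1)
        ]

    visit(root)
    return C


# Hypothetical example: two cherries, each with one 1-leaf and one 0-leaf.
children = {"r": ["a", "b"], "a": ["x1", "x2"], "b": ["x3", "x4"]}
leaf_state = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}

unit = sankoff_binary(children, leaf_state, "r")
weighted = sankoff_binary(children, leaf_state, "r", cost_t=3.0, cost_l=2.0)
print(min(unit["r"]), min(weighted["r"]))
```

The recursive traversal is chosen for readability; for very deep trees an explicit postorder stack would avoid Python's recursion limit.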

This cost is almost identical to the cost of a completion $(N, l)$ of T, but not quite, since one $0 - 1$ arc does not need to be penalized as it represents the origin. This actually matters: even under unit costs, there are labelings that are optimal under Sankoff–Rousseau parsimony but whose corresponding number of transfer and loss events is not minimum, and there are scenarios with a minimum number of transfer and loss events that are not minimum under Sankoff–Rousseau, which can therefore not be identified by the Sankoff–Rousseau algorithm. We provide a small example for each of these two problematic situations.

Consider first the tree T depicted in Figure 3a, in which two leaves have the character. Then, both labelings depicted in Figure 3c and d are optimal in the Sankoff–Rousseau sense (ignoring the subdivision nodes and transfer arc), as both contain exactly two arcs whose ends are given different labels (and these are exactly the minimum cost scenarios). However, the labeling in Figure 3c corresponds to a scenario with two loss events and no transfer event, while the labeling in Figure 3d corresponds to a scenario with one transfer event and no loss event. Hence, the second scenario is more parsimonious, in our sense, than the first one. Moreover, if $1 - 0$ arcs have a cost of 2 and $0 - 1$ arcs have a cost of 3, then under Sankoff–Rousseau parsimony, the tree in Figure 3c is preferred since its two $1 - 0$ arcs cost 4, whereas the two $0 - 1$ arcs of the tree in Figure 3d cost 6. But in terms of homoplasy-free scenarios, the tree in Figure 3c has two losses (cost 4), whereas the network in Figure 3d only needs one transfer (cost 3).

FIG. 3.

Examples illustrating that not penalizing one $0 - 1$ arc changes the space of optimal solutions (see text). (a) Shows a given tree and associated cost matrix. (b) The Sankoff algorithm applied to the tree. (c) and (d) Show two possible optimal solutions in the Sankoff sense and their respective networks.
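The gap between the two cost functions can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names, the parent-map encoding, and the four-leaf tree are hypothetical (it is not the tree of Figure 3), and a root labeled 1 is counted as a gain node, so that a labeling with a single origin has exactly one gain. It also assumes that at least one leaf carries the character.

```python
def count_events(parent, labeling):
    """Return (gains, losses) for a 0/1 labeling of a rooted tree.

    parent maps every node to its parent; the root maps to None.
    A gain node is a 1-node that is the root or whose parent is a 0-node.
    A loss is an arc from a 1-node to a 0-node.
    """
    gains = losses = 0
    for v, p in parent.items():
        if p is None:
            gains += int(labeling[v] == 1)
        elif labeling[p] == 0 and labeling[v] == 1:
            gains += 1
        elif labeling[p] == 1 and labeling[v] == 0:
            losses += 1
    return gains, losses


def sankoff_cost(parent, labeling, cost_t, cost_l):
    """s(T, λ): every 0-1 arc and every 1-0 arc is penalized."""
    gains, losses = count_events(parent, labeling)
    root_gain = sum(
        1 for v, p in parent.items() if p is None and labeling[v] == 1
    )
    return cost_t * (gains - root_gain) + cost_l * losses


def transfer_loss_cost(parent, labeling, cost_t, cost_l):
    """cost*_TL(T, λ): one gain (the origin) is free of charge."""
    gains, losses = count_events(parent, labeling)
    return cost_t * (gains - 1) + cost_l * losses


# Hypothetical tree: two cherries, each with one 1-leaf and one 0-leaf.
parent = {"r": None, "a": "r", "b": "r",
          "x1": "a", "x2": "a", "x3": "b", "x4": "b"}
leaves = {"x1": 1, "x2": 0, "x3": 1, "x4": 0}
two_losses = {"r": 1, "a": 1, "b": 1, **leaves}  # origin at the root
two_gains = {"r": 0, "a": 0, "b": 0, **leaves}   # origin at one leaf

for name, lab in (("two losses", two_losses), ("two gains", two_gains)):
    print(name, sankoff_cost(parent, lab, 3, 2),
          transfer_loss_cost(parent, lab, 3, 2))
```

With a transfer cost of 3 and a loss cost of 2, the labeling with two losses is cheaper under the Sankoff–Rousseau cost, while the labeling with two gains is cheaper once one gain is exempted as the origin.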

Now, consider the tree T depicted in Figure 4 (left). Suppose that $\mathrm{cost}_{L} = \mathrm{cost}_{T}$. One can verify that the labeling in Figure 4 (middle) is the only optimal labeling of T in Sankoff–Rousseau’s sense. It contains exactly one arc whose ends are given different labels, and it coincides with a scenario with one loss event and no transfer. The labeling depicted in Figure 4 (right) has two arcs whose ends are given different labels, so it is not optimal in Sankoff–Rousseau’s sense. However, this labeling corresponds to a scenario with one transfer event and no loss event, so the scenario is equally optimal, in our sense, as the first one.

FIG. 4.

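Example illustrating that a labeling that is not optimal in Sankoff–Rousseau’s sense can still correspond to an optimal transfer–loss scenario (see text). Left: the tree T. Middle: the only optimal labeling in Sankoff–Rousseau’s sense. Right: a labeling with two changes that corresponds to one transfer and no loss.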

3.3. The Genesis algorithm: A two-variable Sankoff–Rousseau

The above shows that we cannot use the Sankoff–Rousseau algorithm as is. We adapt the dynamic programming table by adding an additional dimension that keeps track of where a character could have its origin, if encountered in the current subtree (the originator, thus the genesis term). An example of computation of the table is provided afterwards.

To keep track of the location of the genesis within the subtree, we define a mapping $g : V (T) \to \{0, 1\}$. The value $g (v) = 1$ indicates that the character has possibly originated in the subtree T[v]. Conversely, $g (v) = 0$ signifies that the character did not originate in T[v]. This way, we know that the $0 - 1$ arc responsible for the origin should not be penalized. Using the same cost matrix as in the previous subsection, we define our dynamic programming table as follows. For $v \in V (T)$, $a \in \{0, 1\}, b \in \{0, 1\}$:

C [v, a, b] = \text{the minimum cost } \mathrm{cost}^{*} (T [v], λ) \text{ of a labeling } λ \text{ of } T [v], \text{ given that } l (v) = a \text{ and } g (v) = b .

If v is a leaf of T, we set

C [v, 1, b] = \begin{cases} 0, & \text{if } l (v) = 1 \\ \infty, & \text{otherwise} \end{cases} \qquad C [v, 0, b] = \begin{cases} 0, & \text{if } l (v) = 0 \text{ and } b = 0 \\ \infty, & \text{otherwise} \end{cases}

The idea is that if v has the character, it could be the origin or not, so both $b \in \{0, 1\}$ are allowed, but if v does not have the character, it cannot be the origin.

If v is an internal node, then we will have the following cases:

Case C[v, 0, 0]: This corresponds to labeling $l (v) = 0$ and assuming that the origin does not lie in T[v]. This implies that all losses and gains in the subtree below will contribute to the cost, and that no subtree below can have the origin, hence:

C [v, 0, 0] = \sum_{x \in ch (v)} \min_{a \in \{0, 1\}} \{C [x, a, 0] + M (0, a)\}

Case C[v, 1, 0]: This case corresponds to labeling $l (v) = 1$ and assuming that the origin does not lie within T[v]. As in the previous case, all losses and gains in T[v] will contribute to the cost, thus:

C [v, 1, 0] = \sum_{x \in ch (v)} \min_{a \in \{0, 1\}} \{C [x, a, 0] + M (1, a)\}

Case C[v, 1, 1]: This case corresponds to labeling $l (v) = 1$ and assuming that the origin lies in T[v]. In this case, v must be the origin: if that origin was a strict descendant of v, it would need to violate time-consistency to transfer to v. Thus, we may assume that no child of v contains the origin, so we have

C [v, 1, 1] = \sum_{x \in ch (v)} \min_{a \in \{0, 1\}} \{C [x, a, 0] + M (1, a)\}

Note that although the recurrence is identical to the preceding case, the difference lies in the fact that v should be the origin.

Case C[v, 0, 1]: In this case, we assign $l (v) = 0$ and assume that the origin lies within T[v]. Note that this implies that exactly one child w of v must satisfy that $g (w) = 1$ : v itself cannot be an origin since it has label 0, and only one child subtree has the origin. We do need to try every possible w. Note that if the chosen w has label 1, then it will be the origin and no penalty for the state change $M (0, a)$ at the arc $(v, w)$ will be considered. This leads to the following expression:

C [v, 0, 1] = \min_{w \in ch (v)} \left\{ \min_{a \in \{0, 1\}} C [w, a, 1] + \sum_{x \in ch (v) \setminus \{w\}} \min_{a \in \{0, 1\}} \{C [x, a, 0] + M (0, a)\} \right\}

Once the optimal score has been computed for the root, a minimum cost labeling $λ$ can be obtained by backtracking, as is standard for dynamic programming algorithms. We reverse the order of the traversal, that is, we proceed from the root to the leaves, and at each node v we ask, for the selected entry C[v, p, q], which of the alternative values of the child label a and of the child $w \in ch (v)$ achieve the minimum.

To obtain a completion for a given $(T, λ)$ , it suffices to run Algorithm 1 on the resulting infra-labeling. Note that the running time of the Genesis algorithm for a given species tree T on n species is $O (n)$ .

Our algorithm is best understood by an example. Suppose that we want to find an infra-labeling for a one-character tree T on five leaves as shown in Figure 5a with costs $\mathrm{cost}_{L} = M (1, 0) = \mathrm{cost}_{T} = M (0, 1) = 1$ and $M (0, 0) = M (1, 1) = 0$. The full dynamic programming table for this tree is shown in Figure 5b.

FIG. 5.

(a) A tree T on five leaves, each associated with a character state in $\{0, 1\}$. (b) The values $C (\cdot, a, b)$ computed by our algorithm on T with a cost matrix M given by $M (0, 0) = M (1, 1) = 0$ and $M (0, 1) = M (1, 0) = 1$. See text for the details of the computation.

We will now break down each entry as a tuple that represents the four cases $(C [v, 0, 0], C [v, 0, 1], C [v, 1, 0], C [v, 1, 1])$ for every node v in T. For the leftmost and rightmost leaves $x_{1}$ and $x_{5}$, we observe $l (x_{1}) = l (x_{5}) = 0$, so their entries are equal to $(0, \infty, \infty, \infty)$. The only finite value, the 0 in the first position, corresponds to the case $l (v) = 0$ and $g (v) = 0$: a leaf with label 0 cannot be an origin. For the rest of the leaves, $\{x_{2}, x_{3}, x_{4}\}$, we have $(\infty, \infty, 0, 0)$. The reasoning behind the two zero entries is that if v contains the character, then v may or may not be the origin of the character.

Since we process the nodes of the tree in a postorder traversal, after all values on the leaves are computed, we move up to the internal nodes. We start with the root of the left subtree, the node v whose entry is $(1, 0, 1, 1)$. Note that its children $x_{1}$ and $x_{2}$ have the entries $(0, \infty, \infty, \infty)$ and $(\infty, \infty, 0, 0)$, respectively. The reasoning is as follows:

For C[v, 0, 0]. Note that for $x_{1}$ the minimum cost is attained with $C [x_{1}, 0, 0] + M (0, 0) = 0$ and for $x_{2}$ with $C [x_{2}, 1, 0] + M (0, 1) = 1$, thus $C [v, 0, 0] = 1$.

For C[v, 1, 0] and C[v, 1, 1]. We have that the minimum cost is attained on the left subtree with $C [x_{1}, 0, 0] + M (1, 0) = 1$ and on the right subtree with $C [x_{2}, 1, 0] + M (1, 1) = 0$ , thus $C [v, 1, 0] = C [v, 1, 1] = 1$ .

For C[v, 0, 1], we compare the cost of fixing $x_{1}$ as origin versus the cost of fixing $x_{2}$ as origin. Given that $C [x_{1}, 0, 1] = C [x_{1}, 1, 1] = \infty$ , $x_{1}$ is not a good origin candidate. In contrast, $x_{2}$ looks more promising since the minimum cost happens when $C [x_{2}, 1, 1] + C [x_{1}, 0, 0] + M (0, 0) = 0$ . Hence, $C [v, 0, 1] = 0$ .

Note that the situation of node w is similar to v, thus leading to the entry $(1, 0, 1, 1)$ as well. Finally, we disclose the entry of the root of the tree $ρ (T)$ :

For $C [ρ (T), 0, 0]$ , we have on the left subtree $C [u, 0, 0] + M (0, 0) = 2$ and the right subtree $C [w, 0, 0] + M (0, 0) = 1$ . Thus, $C [ρ (T), 0, 0] = 3$ .

For $C [ρ (T), 1, 0]$ and $C [ρ (T), 1, 1]$, we have the minimum cost on the left subtree when $C [u, 1, 0] + M (1, 1) = 1$ and on the right subtree when $C [w, 1, 0] + M (1, 1) = 1$. Thus, $C [ρ (T), 1, 0] = C [ρ (T), 1, 1] = 2$ follows.

For $C [ρ (T), 0, 1]$ , we compare the cost of fixing u as origin, which is attained when $C [u, 0, 1] + C [w, 0, 0] + M (0, 0) = 2$ versus fixing w as origin which happens when $C [w, 0, 1] + C [u, 0, 0] + M (0, 0) = 2$ . Thus, we get $C [ρ (T), 0, 1] = 2$ .
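The four cases of the Genesis table and the worked example can be reproduced with the following Python sketch. It is an illustration only and not the implementation from the authors' repository. The function names and the tree encoding are hypothetical, and the topology used here, with $ρ (T)$ having children u and w, u having children v and $x_{3}$, v having children $x_{1}, x_{2}$, and w having children $x_{4}, x_{5}$, is an assumption inferred from the computations in the text rather than read from Figure 5.

```python
import math

INF = math.inf


def genesis_table(children, leaf_state, root, cost_t=1, cost_l=1):
    """Return C with C[v][(a, b)], following the four cases of the text."""
    M = {(0, 0): 0, (1, 1): 0, (0, 1): cost_t, (1, 0): cost_l}
    C = {}

    def below(x, a):
        # Cheapest way to attach child x under a parent labeled a,
        # given that the origin does not lie in T[x].
        return min(C[x][(a2, 0)] + M[(a, a2)] for a2 in (0, 1))

    def visit(v):
        kids = children.get(v, [])
        if not kids:
            s = leaf_state[v]
            C[v] = {
                (0, 0): 0 if s == 0 else INF,
                (0, 1): INF,  # a 0-leaf cannot be the origin
                (1, 0): 0 if s == 1 else INF,
                (1, 1): 0 if s == 1 else INF,
            }
            return
        for x in kids:
            visit(x)
        c00 = sum(below(x, 0) for x in kids)
        c10 = sum(below(x, 1) for x in kids)
        # Case C[v, 0, 1]: exactly one child w hosts the origin, and the
        # arc (v, w) is not penalized.
        c01 = INF
        for w in kids:
            origin_part = min(C[w][(0, 1)], C[w][(1, 1)])
            rest = sum(below(x, 0) for x in kids if x != w)
            c01 = min(c01, origin_part + rest)
        # Case C[v, 1, 1] uses the same recurrence as C[v, 1, 0].
        C[v] = {(0, 0): c00, (0, 1): c01, (1, 0): c10, (1, 1): c10}

    visit(root)
    return C


def entry(C, v):
    """Entry of v in the order (C[v,0,0], C[v,0,1], C[v,1,0], C[v,1,1])."""
    return tuple(C[v][key] for key in ((0, 0), (0, 1), (1, 0), (1, 1)))


# Assumed topology of the five-leaf example (see the lead-in).
children = {"rho": ["u", "w"], "u": ["v", "x3"],
            "v": ["x1", "x2"], "w": ["x4", "x5"]}
leaf_state = {"x1": 0, "x2": 1, "x3": 1, "x4": 1, "x5": 0}

C = genesis_table(children, leaf_state, "rho")
for node in ("v", "w", "u", "rho"):
    print(node, entry(C, node))
```

For clarity, the loop over candidate children w recomputes the sum over the remaining children each time, which is quadratic in the degree of v. Computing the total once and swapping in the contribution of w would restore linear time, at the price of some care with infinite entries.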

4. ON COMBINING SOLUTIONS FOR MULTIPLE CHARACTERS

In this section, we initiate a discussion on the problem of reconstructing transfer–loss scenarios involving more than one character. Let us first observe that there are several ways of defining a transfer–loss cost in this context. If multiple characters are lost on the same arc, we could either count that as a single block loss event or we could count one event per lost character. Similarly, if multiple characters borrow a transfer arc, we must decide whether it counts as a single block transfer or as independent transfers.

4.1. Independent transfer and loss events

Let us consider the simpler problem where block events are not allowed, that is, if k characters are lost/transferred on the same arc, then this counts for k separate events. In that case, it is tempting to use our Genesis algorithm on each character separately to obtain an infra-labeling for each of them and to add transfer arcs according to Algorithm 1 for each character. However, this may create time inconsistencies if not done carefully. As an example, consider the networks depicted in Figure 6a. These scenarios are irreconcilable, since the two transfer arcs would induce a directed cycle if we added both of them. However, these networks are obtained using the infra-labelings depicted in Figure 6b, and these infra-labelings can be turned into consistent scenarios by incorporating the transfer arcs in a different direction, see Figure 6c.

FIG. 6.

(a) Two evolutionary scenarios for the first (left) and second (right) character, respectively. (b) The infra-labeling corresponding to these two scenarios. (c) Two evolutionary scenarios that can be obtained from the same infra-labeling.

It is nonetheless possible to generalize Algorithm 1 to multiple characters, all while preserving the same number of transfers and losses per character. The details of this extension can be seen in Algorithm 2, which builds a time-consistent evolutionary scenario from multiple infra-labelings.

The input of the algorithm is a triple $(T, C, Λ)$, where $C = (γ_{1}, \dots, γ_{k})$, $k \geq 1$, is the (ordered) set of characters under consideration, and $Λ = (λ_{1}, \dots, λ_{k})$ is the (ordered) set of respective infra-labelings for each of these characters.

We say that a node v of T is a gain node if it is a gain node for at least one character of C. In other words, v is a gain node if there exists $i \in \{1, \dots, k\}$ such that $λ_{i} (v) = 1$ and either v is the root of T or $λ_{i} (u) = 0$ for u the parent of v. In that case, we say that v is a gain node for character $γ_{i}$. Note that v can be a gain node for more than one character. As before, the idea is to order the gain nodes according to $≺_{T}$ and to add transfer arcs between consecutive gains of the same character. The proof that it builds a time-consistent network is similar to that of Lemma 2; we omit the details.

Note that, as in the case of Algorithm 1, this algorithm should be viewed as a proof of concept, since it outputs only one of many solutions. However, it shows that multiple infra-labelings can always give rise to a time-consistent scenario, and that the multicharacter variant where character events are independent can still be solved in polynomial time.

4.2. Block transfer and loss events

We now turn to the cost function that allows block events. Specifically, we assume that:

If a transfer edge transfers more than one character, it is considered as a single transfer event.

If two or more characters are lost along the same arc of the tree, it is considered as a single loss event.

We also assume hereafter that transfers and losses have the same cost. This version of the problem is much more complicated, as it is unclear whether we can still compute an optimal infra-labeling for each character independently.

As it turns out, Algorithm 2 does not necessarily build an optimal scenario, even if the original infra-labelings are optimal for each character. Such a situation can be seen in Figure 7. On the left, one can verify that the infra-labelings are optimal for the first character (1 transfer, 0 loss) and for the second (0 transfer, 1 loss), but the generalized scenario has 1 transfer and 1 loss. However, there exists a scenario for the same input with only one transfer (right).

FIG. 7.

(a) An infra-labeling that is optimal for the first character and for the second, but the generalized scenario has one transfer and one loss. (b) A scenario for the same input with only one transfer.

More generally, an optimal global scenario may not be optimal when restricted to one of the characters, as is illustrated in Figure 8. In other words, we may have to make suboptimal choices in one character to undertake a block event. Note that the dashed transfer in Figure 8b, which was added to explain the first character, is not necessary in Figure 8c because this character can be explained through the two transfers for the rest of the characters, where the star node in the left subtree becomes the origin for this character. In other words, instead of requiring one transfer per character, the leftmost transfer in Figure 8c is used to transfer both first and second characters, while the other transfer is used for the first and third character.

FIG. 8.

An example with three characters shows that combining the individual optimal character solutions for a given tree does not necessarily imply an optimal solution for all characters. (a) Given a tree with three characters. (b) Combination of the individual optimal solutions for each character, which results in a network with three transfer arcs and no losses. (c) Optimal solution with two transfer arcs and no losses.

We leave the complexity of computing an optimal scenario under block events as an open problem. We suspect it is NP-hard due to the interdependence between characters, as we have demonstrated. We also do not know the complexity of the problem with cost functions in which losses are independent, but transfers occur in blocks.

5. EXPERIMENTS ON KEGG CHARACTERS: A PROOF OF CONCEPT

We now illustrate our dynamic programming schemes for the inference of transfers that involve gain nodes shared between a set of characters. The whole datasets and implementations used for this part are available from the following repository: https://github.com/AliLopSan/Genesis-Sankoff.

Recall that our strategy is to find the set of gain nodes on a tree and then infer the connections between them a posteriori. Although there may be an exponential number of ways to connect the gains of a character $γ$ (Algorithm 1 just gives one way to do it), there are only $O (| gains (T, λ) |^{2})$ pairs of gain nodes. Thus, it is possible to look for pairs of gains that characters have in common. We use this to predict transfer highways (Bansal et al., 2011; Beiko et al., 2005), which are horizontal arcs in a network where a significantly large number of gene (character) transfers have taken place. We now look at a set $C$ of multiple characters independently, and we say that a pair of gain nodes $\{a, b\}$ in a given tree T is a transfer highway if it is present in at least $α$ characters of $C$.
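The highway definition above amounts to counting, for every unordered pair of gain nodes, the number of characters in which both nodes are gains, and keeping the pairs whose count reaches $α$. A minimal Python sketch follows; the function name `transfer_highways`, the input format, and the character and node names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def transfer_highways(gains_per_character, alpha):
    """Return {frozenset({a, b}): support} for pairs shared by >= alpha characters.

    gains_per_character maps a character identifier to the collection of
    gain nodes of its infra-labeling.
    """
    support = Counter()
    for gains in gains_per_character.values():
        # Each unordered pair of gain nodes is counted once per character.
        for a, b in combinations(sorted(set(gains)), 2):
            support[frozenset((a, b))] += 1
    return {pair: n for pair, n in support.items() if n >= alpha}


# Hypothetical gain sets for four characters on a tree with nodes a, b, c.
gains_per_character = {
    "char1": ["a", "b", "c"],
    "char2": ["a", "b"],
    "char3": ["b", "c"],
    "char4": ["a"],
}
highways = transfer_highways(gains_per_character, alpha=2)
print(highways)
```

Using `frozenset` keys makes the pairs unordered, in line with the definition of a highway as a pair $\{a, b\}$ rather than a directed arc.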

To build our base species tree T, we took a random subset of 45 species from the bacterial species used in Zhou et al. (2021), which consists of species that were predicted to be involved in interphylum transfers. We obtained the corresponding species tree from NCBI Taxonomy Browser (Schoch et al., 2020), noting that it is not completely resolved and is therefore nonbinary (see Supplementary Data). The whole annotated genomes of these species are contained in the KEGG (Kanehisa et al., 2016a,b) database.

As the set of characters, we chose 180 KEGG Ortholog groups, called KOs, taken from Zhou et al. (2021), which contain functions that are classified as metabolism-related, information processing, and antibiotic resistance, as seen in Figure 9. This choice was to ensure that our methods operate on the same input for consistency. We computed infra-labelings $λ$ using three different approaches:

FIG. 9.

Overview of the functional distribution of the 180 characters used in these experiments. Only the top 10 functions with more than one KO were considered.

The Basic labeling described in López Sánchez and Lafond (2022) maps every leaf to the character it possesses, and an internal node v has a 1 label if and only if all of its children have a 1 label.

The Sankoff labeling is derived from the algorithm presented in Section 3.2, with transfer/loss cost ratios $\mathrm{cost}_{T} / \mathrm{cost}_{L} \in \{0.25, 0.5, 1\}$.

The Genesis labeling is computed using the algorithm from Section 3.3, with transfer and loss costs identical to those in the previous approach.
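Of the three approaches, the Basic labeling is the simplest to state in code: a leaf keeps its observed state, and an internal node receives label 1 exactly when all of its children do. The sketch below is a hedged illustration of that rule only; the function name, the tree encoding, and the example tree are assumptions and are not taken from the repository.

```python
def basic_labeling(children, leaf_state, root):
    """Label internal nodes with 1 if and only if all children are labeled 1."""
    labeling = {}

    def visit(v):
        kids = children.get(v, [])
        if not kids:
            labeling[v] = leaf_state[v]  # leaves keep their observed state
            return
        for x in kids:
            visit(x)
        labeling[v] = int(all(labeling[x] == 1 for x in kids))

    visit(root)
    return labeling


# Hypothetical tree: node a has two 1-leaves, the root also has a 0-leaf.
children = {"r": ["a", "x3"], "a": ["x1", "x2"]}
leaf_state = {"x1": 1, "x2": 1, "x3": 0}
labels = basic_labeling(children, leaf_state, "r")
print(labels)
```

Because the rule quantifies over all children, it applies unchanged to nonbinary trees such as the one used in these experiments.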

For every character $γ$, we computed a minimum cost infra-labeling $λ$ and obtained the gain nodes. To find the transfer highways, we then looked at the pairs of gains that are common to the sets of gain nodes of different characters. We applied different threshold values: $α \in \{9, 18, 27, 36\}$, corresponding to 5%, 10%, 15%, and 20% of the total number of characters, respectively. We observed that when loss costs are lower than transfer costs, almost all characters are explained solely by losses. As a result, all characters appear at the root, and no gain arcs are observed throughout the tree. This outcome has been discussed in the literature (Doolittle et al., 2003). In the following, we report data on unit costs, $\mathrm{cost}_{T} = \mathrm{cost}_{L} = 1$, as they gave the most balanced results.

5.1. HGTs at the species level

We contrasted the inferred highways with different transfers found throughout the literature. In Figure 10, we compare our inference with pairwise interphylum transfers reported in Zhou et al. (2021), shown in (a), which corresponds to a network whose nodes represent the bacterial species and an edge between a pair of species represents a transfer event. We will refer to these predictions as sequence-based predictions. This network was built by finding blocks of nearly identical DNA (i.e., more than 500 nucleotides, more than 99% identity) in distantly related genomes (<97% of 16S ribosomal RNA similarity). We conjectured that interphylum transfers would be more visible to our model than transfers that happen between closely related species. This is because when a transfer happens between closely related species, the parsimony criterion implied by our algorithm will most likely explain it through vertical inheritance, rather than transfers. To compare the outcome of our methods at the species level, we take a transfer highway $(a, b)$ , and we connect all the leaves in the subtrees T[a] and T[b] between them. In this way, we create a graph where nodes represent species and edges represent transfer highways whose weight is proportional to the number of characters that share this highway. We observe that Sankoff and Genesis preserve some of the transfer relationships contained in (a), especially concerning the interphylum transfers between Proteobacteria (purple) and Actinobacteria (green), which are shown to be the most popular highways in Sankoff and Genesis and remain faintly in the Basic labeling. Note that contrary to (a), there is a large number of transfers between (17) Mucispirillum schaedleri (T08201) and Proteobacteria that remained throughout the three models. It has been previously reported in the literature that Epsilon- and Deltaproteobacteria have shaped the evolution of this genome (Loy et al., 2017). 
We emphasize that in (a) the underlying inference is based on sequence similarity and thus follows a fundamentally different paradigm than our character-based approach. As a result, inferred transfers are not expected to match exactly. In particular, our methods may appear to infer a larger number of transfers, which can be explained by the fact that many transfer events occur higher in the species tree, closer to the root, thereby affecting larger sets of taxa. This effect is examined more closely in the following paragraph and motivates a finer-grained analysis across different taxonomic levels.
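The projection of highways onto species described above, in which the leaves of T[a] are connected to the leaves of T[b] for every highway $(a, b)$, can be sketched as follows. The function names and the example tree are hypothetical. In addition, summing the weights when the same pair of species is covered by several highways is an assumption of this sketch, since the text only states that edge weights are proportional to the number of supporting characters.

```python
from collections import defaultdict


def leaves_below(children, v):
    """Return the leaves of the subtree T[v]."""
    kids = children.get(v, [])
    if not kids:
        return [v]
    out = []
    for x in kids:
        out.extend(leaves_below(children, x))
    return out


def species_graph(children, highways):
    """Project highways onto species.

    highways maps a pair (a, b) of tree nodes to the number of characters
    that share it; the result maps unordered species pairs to edge weights.
    """
    weight = defaultdict(int)
    for (a, b), n in highways.items():
        for s in leaves_below(children, a):
            for t in leaves_below(children, b):
                if s != t:  # skip self-pairs when the subtrees overlap
                    weight[frozenset((s, t))] += n
    return dict(weight)


# Hypothetical tree with one highway between the clade a and the leaf s3.
children = {"r": ["a", "s3"], "a": ["s1", "s2"]}
graph = species_graph(children, {("a", "s3"): 4})
print(graph)
```

A highway between two internal nodes produces one edge per pair of descendant species, which is one reason why highways located close to the root translate into many species-level edges.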

FIG. 10.

Transfer highways for different labelings. Every species in T is represented as a segment of the circle, and edges represent transfer arcs. For (b), (c), and (d), the color bar indicates the number of characters that share the transfer arc and $α = 18$. We used $\mathrm{cost}_{T} = \mathrm{cost}_{L}$. Taxa are numbered, and the key can be found in Table 1.

5.3. HGTs at different taxonomic ranges

When looking at higher level taxa, as shown in Figure 11, we see that throughout the different types of labeling and parameters used, the members of the Pseudomonas genus remain as potential gain nodes. This is consistent with the literature, since Pseudomonas are known to be not only ecologically versatile pathogens (Silby et al., 2011) but also implicated in the interphylum transfers of genetic material with members of the Bacteroidales order (Gschwind et al., 2024). This shows that our model is biologically sound. It remains to validate other pairs of ancestral species with highly supported transfers and investigate further the differences between the Sankoff and Genesis outputs.

FIG. 11.

(a) Quantitative differences between the inferred transfer highways using the three methods with $\mathrm{cost}_{T} = \mathrm{cost}_{L}$ and $α = 5$. (b) A Sankey plot of the highways found with the Genesis labeling. The thickness of the lines is proportional to the number of characters that share the pair. Out of the 45 transfer edges contained in the sequence-based predictions, the Basic labeling found 30, Sankoff labeling 38, and Genesis 40.

5.4. Overall differences between Sankoff and Genesis outputs

For each of the 180 characters, we computed the $\mathrm{cost}_{T L} (N, λ)$ (Fig. 12). The left panel shows that Sankoff tends to produce infra-labelings with fewer losses, whereas Genesis yields more varied scenarios. Conversely, the middle panel indicates that Sankoff favors infra-labelings with more gains compared with Genesis. These differences become more pronounced when examining the transfer highways unique to each method (Fig. 13). For example, Sankoff alone predicts highways connecting the Actinomycetota subtree to the leaf descendants of the root (T04453, T08201) and to the Bacilli subtree. In Figure 13a, these highways also appear in one of its direct descendants, Actinomycetes. Another striking difference involves the highways (Pseudomonadota, T06554) and (Actinomycetes, T06794), where one endpoint is a descendant of the other. This pattern could reflect a character that was lost and later regained during evolution. A close-up of the Pseudomonadota subtree for character K01669 (Fig. 14) illustrates this: one leaf (T03770) lacks the character. In such cases, Sankoff explains the pattern via transfer arcs, while Genesis interprets it as a loss.

FIG. 12.

Cost distributions of Sankoff and Genesis labeling for the 180 KOs using $\mathrm{cost}_{T} = \mathrm{cost}_{L}$.

FIG. 13.

Inferred transfer highways using Sankoff and Genesis labeling with $\mathrm{cost}_{T} = \mathrm{cost}_{L}$ and $α = 5$. Gray triangles represent the indicated collapsed clades.

FIG. 14.

The Pseudomonadota subtree for character K01669 under (a) Sankoff and (b) Genesis labeling with $\mathrm{cost}_{T} = \mathrm{cost}_{L}$. Nodes that contain the character are marked as $< * >$.

Table 1.

Information Concerning the 45 Species Used in This Work

Key	NCBI ID	Species	KEGG ID	Phylum
1	74426	Collinsella aerofaciens	T05143	Actinobacteria
2	679935	Alistipes finegoldii	T02133	Bacteroidetes
3	88431	Dorea longicatena	T08069	Firmicutes
4	39486	Dorea formicigenerans	T08666	Firmicutes
5	226186	Bacteroides thetaiotaomicron	T00122	Bacteroidetes
6	479437	Eggerthella lenta	T00984	Actinobacteria
7	1791	Mycolicibacterium aurum	T06794	Actinobacteria
8	869209	Treponema succinifaciens	T01461	Spirochaetes
9	1150423	Bifidobacterium dentium	T04222	Actinobacteria
10	1042403	Bifidobacterium animalis	T01842	Actinobacteria
11	1806905	Arthrobacter sp. ZXY-2	T05201	Actinobacteria
12	52242	Lactobacillus gallinarum	T04115	Firmicutes
13	43765	Corynebacterium amycolatum	T07354	Actinobacteria
14	1979527	Corynebacterium kefirresidentii	T07267	Actinobacteria
15	1254439	Bifidobacterium thermophilum	T02505	Actinobacteria
16	290340	Paenarthrobacter aurescens	T00447	Actinobacteria
17	1379858	Mucispirillum schaedleri	T08201	Deferribacteres
18	1297617	Intestinimonas butyriciproducens	T04182	Firmicutes
19	154288	Turicibacter sanguinis	T06584	Firmicutes
20	360107	Campylobacter hominis	T00577	Proteobacteria
21	1886	Streptomyces albidoflavus	T02545	Actinobacteria
22	1032069	Campylobacter ureolyticus	T04021	Proteobacteria
23	1784719	Leucobacter triazinivorans	T05843	Actinobacteria
24	291645	Bacteroides nordii	T07941	Bacteroidetes
25	1197717	Cloacibacillus porcorum	T04453	Synergistetes
26	411470	[Ruminococcus] gnavus	T06719	Firmicutes
27	378753	Kocuria rhizophila	T00701	Actinobacteria
28	879243	Porphyromonas asaccharolytica	T01485	Bacteroidetes
29	1578720	Helicobacter ailurogastricus	T08780	Proteobacteria
30	1396826	Leisingera aquaemixtae	T06085	Proteobacteria
31	80854	Moritella viscosa	T03770	Proteobacteria
32	411474	Coprococcus eutactus	T07943	Firmicutes
33	33069	Pseudomonas viridiflava	T06554	Proteobacteria
34	695562	Lactobacillus amylovorus	T01954	Firmicutes
35	82633	Cupriavidus pauculus	T05738	Proteobacteria
36	428406	Ralstonia pickettii	T00925	Proteobacteria
37	339670	Burkholderia ambifaria	T00398	Proteobacteria
38	1842533	Acidovorax sp. RAC01	T04481	Proteobacteria
39	1842727	Rhodoferax koreense	T04721	Proteobacteria
40	537007	Blautia hansenii	T05095	Firmicutes
41	29380	Staphylococcus caprae	T06252	Firmicutes
42	72758	Staphylococcus capitis	T03970	Firmicutes
43	1447716	Bifidobacterium kashiwanohense	T03496	Actinobacteria
44	820	Bacteroides uniformis	T06523	Bacteroidetes
45	525919	Anaerococcus prevotii	T00964	Firmicutes

6. CONCLUSION

In this work, we have incorporated losses into character-based models on phylogenetic networks. We have shown that a most parsimonious scenario can be found efficiently for a single character, but much remains to be done. Notably, although the transfer network that results from applying Algorithm 1 is time-consistent in the one-character case, there is no guarantee that it remains time-consistent when more than one character is considered. Because of this, explaining multiple characters with a single scenario while minimizing transfers and losses appears challenging: it probably leads to NP-complete problem formulations, but investigating good heuristics or structural restrictions on the resulting networks is an interesting future direction. Moreover, in López Sánchez and Lafond (2024), the authors provide examples in which incorporating losses can be used to resolve polytomies in species trees, by looking at resolutions that minimize our cost criterion. Given that several species phylogenies are only partially resolved, including the NCBI tree used in our experiments, we plan to investigate algorithms that resolve them using our model. Finally, although our experiments on real data are preliminary, they suggest that the approach can recover transfers close to those reported in the literature. It remains to perform experiments at a larger scale and to see whether our approach can find well-supported and novel transfer highways, especially ancient ones that are difficult to recover using sequence comparisons alone. To make our approach more impactful, one could also consider modeling the adaptability of the transferred genetic material, since factors such as codon usage bias could lead to certain characters being lost more easily than others (Callens et al., 2021).

AUTHORS’ CONTRIBUTIONS

A.L.S.: Conceptualization, methodology, formal analysis, algorithm design, software, investigation, writing—original draft, and visualization. G.E.S.: Methodology, formal analysis, validation, and writing—review and editing. P.F.S. and M.L.: Supervision, conceptualization, resources, and writing—review and editing. All authors have read and approved the final article.

ACKNOWLEDGMENTS

The authors thank the RECOMB-CG 2025 reviewers for their valuable comments.

AUTHOR DISCLOSURE STATEMENT

The authors declare that they have no competing interests or conflicts of interest.

FUNDING INFORMATION

A.L.S. acknowledges financial support from the program de bourses d’excellence en recherche from the faculty of sciences of the University of Sherbrooke. Research in the Stadler Lab was supported by the Federal Ministry of Research, Technology and Space of Germany (BMFTR) through DAAD project 57616814 (SECAI, School of Embedded Composite AI) and jointly with SMWK (Saxony) through the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (SCADS24B).

Supplemental Material

References

1. Alexander, He, Chen, et al. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci U S A, 2007; 104(29):11963–11968; doi: 10.1073/pnas.0700922104

2. Anselmetti, El-Mabrouk, Lafond, et al. Gene tree and species tree reconciliation with endosymbiotic gene transfer. Bioinformatics, 2021; 37(Suppl_1):i120–i132; doi: 10.1093/bioinformatics/btab328

3. Arnold, Huang, Hanage. Horizontal gene transfer and adaptive evolution in bacteria. Nat Rev Microbiol, 2022; 20(4):206–218; doi: 10.1038/s41579-021-00650-4

4. Bansal, Banay, Gogarten, et al. Detecting highways of horizontal gene transfer. J Comput Biol, 2011; 18(9):1087–1114; doi: 10.1089/cmb.2011.0066

5. Beiko, Harlow, Ragan. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A, 2005; 102(40):14332–14337; doi: 10.1073/pnas.0504068102

6. Bonizzoni, Carrieri, Vedova, et al. When and how the perfect phylogeny model explains evolution. In: Discrete and Topological Models in Molecular Biology. 2014; pp. 67–83; doi: 10.1007/978-3-642-40193-0_4

7. Bouckaert, Fischer, Wicke. Combinatorial perspectives on Dollo-k characters in phylogenetics. Adv Appl Math, 2021; 131:102252; doi: 10.1016/j.aam.2021.102252

8. Callens, Scornavacca, Bedhomme. Evolutionary responses to codon usage of horizontally transferred genes in Pseudomonas aeruginosa: Gene retention, amelioration and compensatory evolution. Microb Genom, 2021; 7(6):e000587; doi: 10.1099/mgen.0.000587

9. Cardona, Pons, Rosselló. A reconstruction problem for a class of phylogenetic networks with lateral gene transfers. Algorithms Mol Biol, 2015; 10(1):28; doi: 10.1186/s13015-015-0059-z

10. Charleston. Jungles: A new solution to the host/parasite phylogeny reconciliation problem. Math Biosci, 1998; 149(2):191–223; doi: 10.1016/S0025-5564(97)10012-8

11. Choi, Kim. Global extent of horizontal gene transfer. Proc Natl Acad Sci U S A, 2007; 104(11):4489–4494; doi: 10.1073/pnas.0611557104

12. Day, Johnson, Sankoff. The computational complexity of inferring rooted phylogenies by parsimony. Math Biosci, 1986; 81(1):33–42; doi: 10.1016/0025-5564(86)90161-6

13. den Bakker, Cummings, Ferreira, et al. Comparative genomics of the bacterial genus Listeria: Genome evolution is characterized by limited gene acquisition and limited gene loss. BMC Genomics, 2010; 11(1):688; doi: 10.1186/1471-2164-11-688

14. Doolittle, Boucher, Nesbø, et al. How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Philos Trans R Soc Lond B Biol Sci, 2003; 358(1429):39–58; doi: 10.1098/rstb.2002.1185

15. Farris. Phylogenetic analysis under Dollo’s law. Syst Biol, 1977; 26(1):77–88.

16. Felsenstein. Inferring phylogenies. Sinauer Associates: Sunderland, MA; 2004.

17. Górecki. Reconciliation problems for duplication, loss and horizontal gene transfer. In: Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology. 2004; pp. 316–325; doi: 10.1145/974614.974656

18. Goyal. Horizontal gene transfer drives the evolution of dependencies in bacteria. iScience, 2022; 25(5):104312; doi: 10.1016/j.isci.2022.104312

19. Gschwind, Petitjean, Fournier, et al. Inter-phylum circulation of a beta-lactamase-encoding gene: A rare but observable event. Antimicrob Agents Chemother, 2024; 68(4):e01459–23; doi: 10.1128/aac.01459-23

20. Kanehisa, Sato, Morishima. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol, 2016b; 428(4):726–731; doi: 10.1016/j.jmb.2015.11.006

21. Kanehisa, Sato, Kawashima, et al. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res, 2016a; 44(D1):D457–D462; doi: 10.1093/nar/gkv1070

22. Kelly, Nicholls. Lateral transfer in stochastic Dollo models. Ann Appl Stat, 2017; 11(2):1146–1168; doi: 10.1214/17-AOAS1040

23. López Sánchez, Lafond. Predicting horizontal gene transfers with perfect transfer networks. In: 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Schloss Dagstuhl - Leibniz-Zentrum für Informatik; 2022; doi: 10.4230/LIPICS.WABI.2022.3

24. López Sánchez, Lafond. Galled perfect transfer networks. In: Comparative Genomics. Cham: Springer Nature Switzerland; 2024, pp. 24–43; doi: 10.1007/978-3-031-58072-7_2

25. Loy, Pfann, Steinberger, et al. Lifestyle and horizontal gene transfer-mediated evolution of Mucispirillum schaedleri, a core member of the murine gut microbiota. mSystems, 2017; 2(1):e00171–e00116; doi: 10.1128/msystems.00171-16

26. Moran. Microbial minimalism: Genome reduction in bacterial pathogens. Cell, 2002; 108(5):583–586; doi: 10.1016/S0092-8674(02)00665-7

27. Nakhleh, Ringe, Warnow. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language, 2005; 81(2):382–420.

28. Nakhleh. Phylogenetic networks. PhD Thesis, The University of Texas at Austin; 2004.

29. Ravenhall, Škunca, Lassalle, et al. Inferring horizontal gene transfer. PLoS Comput Biol, 2015; 11(5):e1004095; doi: 10.1371/journal.pcbi.1004095

30. Sankoff, Rousseau. Locating the vertices of a Steiner tree in an arbitrary metric space. Math Program, 1975; 9(1):240–246; doi: 10.1007/BF01681346

31. Schoch, Ciufo, Domrachev, et al. NCBI Taxonomy: A comprehensive update on curation, resources and tools. Database (Oxford), 2020; 2020:baaa062; doi: 10.1093/database/baaa062

32. Silby, Winstanley, Godfrey, et al. Pseudomonas genomes: Diverse and adaptable. FEMS Microbiol Rev, 2011; 35(4):652–680; doi: 10.1111/j.1574-6976.2011.00269.x

33. Van Iersel, Semple, Steel. Quantifying the extent of lateral gene transfer required to avert a ‘Genome of Eden’. Bull Math Biol, 2010; 72(7):1783–1798; doi: 10.1007/s11538-010-9506-7

34. Warnow, Tabatabaee, Evans. Advances in estimating level-1 phylogenetic networks from unrooted SNPs. J Comput Biol, 2025; 32(1):3–27; doi: 10.1089/cmb.2024.0710

35. Yang, Rannala. Molecular phylogenetics: Principles and practice. Nat Rev Genet, 2012; 13(5):303–314; doi: 10.1038/nrg3186

36. Zachar, Boza. Endosymbiosis before eukaryotes: Mitochondrial establishment in protoeukaryotes. Cell Mol Life Sci, 2020; 77(18):3503–3523; doi: 10.1007/s00018-020-03462-6

37. Zhang, Tu, Bai, et al. Metabolic enhancement contributed by horizontal gene transfer is essential for dietary specialization in leaf beetles. Proc Natl Acad Sci U S A, 2025; 122(1); doi: 10.1073/pnas.2415717122

38. Zhou, Beltrán, Brito. Functions predict horizontal gene transfer and the emergence of antibiotic resistance. Sci Adv, 2021; 7(43):eabj5056; doi: 10.1126/sciadv.abj5056
