Abstract
The simple underlying pattern of presence–absence of a character within a species tree provides useful steps to trace complex evolutionary histories. Character-based models such as perfect transfer networks and its galled variant aim to leverage this information to predict horizontal gene transfers. Under the assumption that characters have a single origin, are rarely lost, and can be transferred horizontally, they remain an efficient inference method for almost tree-like scenarios. Nevertheless, they can sometimes predict overly complicated scenarios, and its simplest structural variants are too restrictive for practical uses. With the goal of extending this model to include loss events, we present a Sankoff–Rousseau-like algorithm that aims to recover the simplest possible scenarios that combine gene transfers and losses using solely the single character information already contained in a given species tree. We establish a link between the small parsimony problem and the inference of scenarios with a minimum number of losses and transfers, allowing losses and transfers to have a user-defined penalization for this end. We also explore the utility of our model for tracing possible highways of gene transfers by presenting a real case study on a dataset of bacterial species and Kyoto Encyclopedia of Genes and Genome functions as characters.
INTRODUCTION
Horizontal gene transfer (HGT), also known as lateral gene transfer (LGT), is the transmission of genetic material between coexisting organisms and provides an alternative DNA exchange mechanism to the more widely studied parent–offspring relationships. Occurring both between and within all domains of life, HGT constitutes an important contribution to genetic diversity and serves as a source of functional innovation through the introduction of novel genes and metabolic pathways. It is one of the most conspicuous features found in bacterial genomes (Choi and Kim, 2007) and notably eases the adaptation of organisms to new conditions, which are sometimes extreme and life-threatening (Arnold et al., 2022). Transfers are also involved in eukaryotes, but large-scale HGT studies are hampered by the complexity of their genomes. They are nevertheless present, with a recent example including the transfer of genes that degrade cellulose from bacteria to beetles (Zhang et al., 2025).
There are currently two basic approaches to infer HGT (Ravenhall et al., 2015): parametric methods, which look for outliers throughout the genomic composition of single individuals, and phylogenetic methods, which find unusual evolutionary histories among groups of organisms. The majority of current methods rely on sequence-based information as input. However, these methods become unreliable to predict ancient transfers as sequences are too divergent to provide clear evolutionary signals (Yang and Rannala, 2012).
An alternative to sequence-based methods is character-based methods. A character is a morphological or molecular trait that a taxon may possess or lack. Character-based approaches may recover phylogenetic signals more accurately when highly divergent sequences are involved (Alexander et al., 2007) and aim to explain the character diversity of a set of taxa in terms of state changes. Several character-based models have been established in the literature, each imposing conditions on how a character may emerge and evolve. For example, a perfect phylogeny requires that characters change their state at most once. Here, we will be concerned with a particular type of characters that have only two states: presence and absence. A character transition from absence to presence is called a gain, while a transition from presence to absence is called a loss. In this setting, a variant of perfect phylogeny requires a character to be gained once and never lost. Although the perfect phylogeny model has been shown to be too restrictive to explain the evolutionary history of a set of characters, its extensions remain an active area of research with promising applications (Bonizzoni et al., 2014).
Relaxations of character state transition models were subsequently developed, including Dollo parsimony (Farris, 1977), where characters can only be gained once but can be lost many times. This assumption has been shown to be more suitable for complex characters such as restriction sites and introns (Felsenstein, 2004). Optimization problems involving the reconstruction of phylogenies under Dollo parsimony were shown to be NP-hard, even for binary characters (Day et al., 1986). More recently, a variant of this model, the Dollo-k model, where characters can be lost at most k times, was studied in Bouckaert et al. (2021).
An appealing alternative is to consider network-like structures instead of trees to explain characters. A representative model called perfect phylogenetic networks (PPNs) was introduced in Nakhleh et al. (2005) and Nakhleh (2004) and asks for a tree displayed by the network to be a perfect phylogeny. Unfortunately, this model is difficult to work with, since even deciding whether a known network explains a set of (multistate) characters is NP-hard (Warnow et al., 2025). In López Sánchez and Lafond (2022), we introduced a special case of PPNs called perfect transfer networks (PTNs). In this model, characters are binary, have a unique origin, and cannot be lost once acquired. The biological motivation behind PTNs is to study transferable characters that are difficult to revert such as organelles in endosymbiotic events (Zachar and Boza, 2020; Anselmetti et al., 2021) and metabolites (Goyal, 2022). In contrast to PPNs, the tree that dictates the vertical inheritance is fixed in PTNs. This affects the complexity of recognition and optimization problems: Deciding whether a network explains a given set of characters and whether we can augment a tree with transfer arcs to explain the evolution of a set of characters are polynomial-time solvable.
Although easy to compute, PTNs can sometimes predict overly complicated evolutionary scenarios that require a larger number of transfers [note that so-called galled PTNs were introduced in López Sánchez and Lafond (2024) to simplify these scenarios, but the structure imposed on the networks was shown to be too restrictive in practice]. This is mainly because these prior models do not allow losses, sometimes requiring the addition of several transfers to explain a single loss. In contrast, the explicit inclusion of transfer and loss parameters would allow modeling a greater diversity of biological scenarios. For example, higher loss rates are often associated with the evolution of pathogenic bacteria and symbionts (Moran, 2002), whereas certain genera, such as Listeria, exhibit reduced rates of both gene gain and loss (den Bakker et al., 2010).
Our contribution
In this work, we broaden the definition of PTNs by incorporating loss events as in the Dollo parsimony model. We aim to find the most parsimonious evolutionary scenario under the assumption that each character emerges only once (this is known as the “no-homoplasy” condition). We allow transfers and losses to have a different cost, as previously explored in the context of host–parasite reconciliation networks (Charleston, 1998).
Our main problem of interest is the minimum homoplasy-free completion problem, where we must augment a given leaf-labeled tree with internal node labels and transfer arcs in order to achieve a minimum cost scenario. Importantly, the resulting network is required to be time-consistent. We start by establishing a connection between this Dollo parsimony variant and the small parsimony problem and then show that using a modification of the well-known Sankoff–Rousseau algorithm (Sankoff and Rousseau, 1975), we can find an inner labeling of the given tree that minimizes the loss and transfer cost.
We conclude with a case study of inferred scenarios on a bacterial dataset consisting of bacterial species and functional characters obtained from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al., 2016a,b). We show that the phylogenetic signal of some transfer events reported in the literature is contained within special pairs of nodes in the given species tree known as transfer highways (Beiko et al., 2005; Bansal et al., 2011).
This article is an extended version of our earlier RECOMB-CG 2025 conference paper. While the original work focused on single-character algorithms, we now introduce a new section discussing the extension of our model to handle multiple characters simultaneously. The experimental analysis has been expanded with detailed descriptions of the functional characters used in the study, an evaluation of the transfer and loss costs associated with the inferred networks, and a deeper comparison of the transfers identified by using an unmodified two-state Sankoff–Rousseau algorithm and our model. In addition, we provide a step-by-step working example to illustrate the application of our algorithm.
Related work
From a stochastic perspective, a model that combines the inference of horizontal transfers with Bayesian inference was presented in Kelly and Nicholls (2017). From a combinatorial perspective, besides PPNs and variants (Warnow et al., 2025), the model closest to our work, to our knowledge, was presented in Van Iersel et al. (2010). In this work, the authors provide upper or lower bounds of the number of transfers needed when considering transfer and loss events to explain the evolutionary history of a set of characters, when we require that every node in the resulting network contains at most k characters. Additionally, they show that finding a character assignment for a given network and a set of characters that satisfies these conditions is NP-complete. Furthermore, the problem of penalizing losses in this context was left as an open problem in their conclusion.
Overview of this article
Section 2 introduces the mathematical framework of our model, including time-consistent LGT networks, homoplasy-free scenarios, and the formal definition of our main problem: the minimum homoplasy-free completion problem. Section 3 presents our main algorithmic contribution, where we show how and optimal internal labeling of the given tree guides the placement of transfer arcs and ensures time-consistency. We also formally connect out labeling problem to the classical small parsimony problem and motivate the need for a modification of the Sankoff–Rousseau algorithm. Section 4 extends our approach from single characters to sets of characters, describing how individual solutions can be combined into a generalized framework. Section 5 provides an experimental evaluation on KEGG functional characters, illustrating the behavior of our model on real data and highlighting applications to the detection of transfer highways.
PRELIMINARIES
Unless stated otherwise, all graphs in this work are directed and loopless. For a graph N,
A phylogenetic network, or simply a network, is a directed acyclic graph N with a unique node
We now introduce notation that is specific to trees. A tree T is a network whose underlying undirected graph has no cycles. For
LGT networks and time-consistency
An LGT network (where LGT comes from lateral gene transfers; Cardona et al., 2015) is a network (Left) A network N and a character 
We assume that transfer nodes have exactly one out-neighbor in
We aim to reconstruct LGT networks that are biologically feasible in terms of time. This implies that the transfers that appear should exist only between ancestral species that co-existed. We define a time-consistent map over the nodes of an LGT network for every for every
A network N is time-consistent if there exists a time-consistent map of N. Note that given an LGT network, one can tell in linear time whether there exists a time-consistent map for it. See Corollary 1 in Górecki (2004).
Let
To formalize transfer networks, given an LGT network N, a labeling of N is a function
Note that
For each leaf x, Denote
Furthermore, we call the pair
The single origin condition states that the first node that acquires a character transmits it to every other species that possess the character through a directed path of 1-nodes. See node v in Figure 1 (Right). This models the “no-homoplasy” condition, that is, that a character cannot emerge independently in two species.
Given a homoplasy-free scenario
We emphasize that a loss can only occur on a support tree arc, representing the lack of vertical transmission—a character that is present on the donor of a transfer arc but is not on the recipient is not seen as a loss. Also, note that in previous work, losses were entirely forbidden (López Sánchez and Lafond, 2022), while here we do not restrict the number of losses of a character.
The minimum homoplasy-free completion problem
We now turn to the problem of predicting transfer and loss locations on a given species tree. As noted above, the species tree depicts the vertical inheritance of character, and this serves as a backbone for any homoplasy-free scenario. However, for any given tree T and character
Let
Our main problem of interest is the following.
The minimum homoplasy-free completion problem
In the following, a homoplasy-free scenario
A POLYNOMIAL-TIME ALGORITHM FOR THE MINIMUM HOMOPLASY-FREE COMPLETION PROBLEM
Our strategy is to infer a labeling
Infra-labelings and completions
Formally, for a tree T an infra-labeling
An island of

(Left) An example instance
As a consequence, the gain nodes of T are in one-to-one correspondence with the islands. We can now introduce the main idea of our correspondence between infra-labelings and completions: the islands present in the tree already ensure a “vertical” connectivity; thus, it suffices to find a suitable ordering of the gains to connect them using transfers. The order of connection matters for time-consistency, but we show that if we add transfers from the “highest” gains to the “lowest” gains in order and assign times to transfer nodes in decreasing order, then this can be done. Algorithm 1 shows how this can be achieved.
Algorithm 1 mostly serves as an existential result, that is, it shows that any infra-labeling can be turned into a completion by adding one particular sequence of transfer arcs. In general, there will be many other ways to add such transfers to obtain alternate completions, since different orderings of the gains lead to different completions. Thus, the algorithm should be seen as a tool to make the correspondence between the cost of infra-labelings and completions, not as a direct transfer prediction approach. Nonetheless, our experimental analysis shows how the approach can be used to predict a special type of horizontal transfer. We now proceed to show that the algorithm constructs a time-consistent network with our desired cost.
We argue that after every iteration of the main loop of the algorithm, the following invariant holds in the network N obtained after the iteration: for any two transfer nodes u, v that exist after the i-th iteration, we have that
Initially, when no transfer is added, this is vacuously true. Consider the i-th iteration of the algorithm, when transfer
It follows that after the loop is finished and all transfer arcs are created, for any pair of transfer nodes u, v, we have
We argue that l is a homoplasy-free fit of
We show this by induction over the iterations of Algorithm 1. To avoid confusion, for
It follows that at the end of the algorithm, in
To see that
We next show that
Finally, to see that
We can finally establish the correspondence between infra-labelings and homoplasy-free scenarios.
Also, by Lemma 3,
It remains to argue that the cost of the scenario
Consider an alternate completion
Consider the infra-labeling
Let us first show that
Now consider gains of
We have thus established that each loss of
By Theorem 1, it now suffices to find an infra-labeling
Recalling that
Note that in our case, we have only two states
For binary labels, given a character
This cost is almost identical to the cost of a completion
Consider first the tree T depicted in Figure 3a, in which two leaves have the character. Then, both labelings depicted in Figure 3c and d are optimal in the Sankoff–Rousseau sense (ignoring the subdivision nodes and transfer arc), as both contain exactly two arcs whose ends are given different labels (and these are exactly the minimum cost scenarios). However, the labeling in Figure 3c corresponds to a scenario with two loss events and no transfer event, while the labeling in Figure 3d corresponds to a scenario with one transfer event and no loss event. Hence, the second scenario is more parsimonious, in our sense, than the first one. Moreover, if

Examples illustrating that not penalizing one
Now, consider the tree T depicted in Figure 4 (left). Suppose that

Another example.
The above shows that we cannot use the Sankoff–Rousseau algorithm as is. We adapt the dynamic programming table by adding an additional dimension that keeps track of where a character could have its origin, if encountered in the current subtree (the originator, thus the genesis term). An example of computation of the table is provided afterwards.
To keep track of the location of the genesis within the subtree, we define a mapping
If v is a leaf of T, we set
The idea is that if v has the character, it could be the origin or not, so both
If v is an internal node, then we will have the following cases:
Case C[v, 0, 0]: This corresponds to labeling Case C[v, 1, 0]: This case corresponds to labeling
Case C[v, 1, 1]: This case corresponds to labeling
Note that although the recurrence is identical to the preceding case, the difference lies in the fact that v should be the origin.
Case C[v, 0, 1]: In this case, we assign
Once the optimal score has been computed for the root, the minimum labeling
To obtain a completion for a given
Our algorithm is best understood by an example. Suppose that we want to find an infra-labeling for a one-character tree T on five leaves as shown in Figure 5a with costs 
We will now break down each entry as a tuple that represents the four cases
Since we compute the nodes of the tree in a postorder traversal, after all values on the leaves are computed, we move up the internal nodes. We start with the root of the left subtree, the node v whose entry is For C[v, 0, 0]. Note that for For C[v, 1, 0] and C[v, 1, 1]. We have that the minimum cost is attained on the left subtree with For C[v, 0, 1], we compare the cost of fixing
Note that the situation of node w is similar to v, thus leading to the entry For For C[v, 1, 0] and C[v, 1, 1], we have the minimum cost on the left subtree when For
In this section, we initiate a discussion on the problem of reconstructing transfer–loss scenarios involving more than one character. Let us first observe that there are several ways of defining a transfer–loss cost in this context. If multiple characters are lost on the same arc, we could either count that as a single block loss event or we could count one event per lost character. Similarly, if multiple characters borrow a transfer arc, we must decide whether it counts as a single block transfer or as independent transfers.
Independent transfer and loss events
Let us consider the simpler problem where block events are not allowed, that is, if k characters are lost/transferred on the same arc, then this counts for k separate events. In that case, it is tempting to use our Genesis algorithm on each character separately to obtain an infra-labeling for each of them and to add transfer arcs according to Algorithm 1 for each character. However, this may create time inconsistencies if not done carefully. As an example, consider the networks depicted in Figure 6a. These scenarios are irreconcilable, since the two transfer arcs would induce a directed cycle if we added both of them. However, these networks are obtained using the infra-labelings depicted in Figure 6b, and these infra-labelings can be turned into consistent scenarios by incorporating the transfer arcs in a different direction, see Figure 6c.

It is nonetheless possible to generalize multiple characters, all while preserving the same number of transfers and losses per character. The details of this extension can be seen in Algorithm 2, which builds a time-consistent evolutionary scenario from multiple infra-labelings.
The input of the algorithm is a triple
We say that a node g of T is a gain node if it is a gain node for at least one character of C. In other words, v is a gain node if there exists
Note that, as in the case of Algorithm 1, this algorithm should be viewed as a proof of concept, since it outputs only one of many solutions. However, it shows that multiple infra-labelings can always give rise to a time-consistent scenario, and that the multicharacter variant where character events are independent can still be solved in polynomial time.
We now turn to the cost function that allows block events. Specifically, we assume that:
If a transfer edge transfers more than one character, it is considered as a single transfer event. If two or more characters are lost along the same arc of the tree, it is considered as a single loss event.
We also assume hereafter that transfers and losses have the same cost. This version of the problem is much more complicated, as it is unclear whether we can still compute an optimal infra-labeling for each character independently.
As it turns out, Algorithm 2 does not necessarily build an optimal scenario, even if the original infra-labelings are optimal for each character. Such a situation can be seen in Figure 7. On the left, one can verify that the infra-labelings are optimal for the first character (1 transfer, 0 loss) and for the second (0 transfer, 1 loss), but the generalized scenario has 1 transfer and 1 loss. However, there exists a scenario for the same input with only one transfer (right).

More generally, an optimal global scenario may not be optimal when restricted to one of the characters, as is illustrated in Figure 8. In other words, we may have to make suboptimal choices in one character to undertake a block event. Note that the dashed transfer in Figure 8b, which was added to explain the first character, is not necessary in Figure 8c because this character can be explained through the two transfers for the rest of the characters, where the star node in the left subtree becomes the origin for this character. In other words, instead of requiring one transfer per character, the leftmost transfer in Figure 8c is used to transfer both first and second characters, while the other transfer is used for the first and third character.

An example with three characters shows that combining the individual optimal character solutions for a given tree does not necessarily imply an optimal solution for all characters.
We leave the complexity of computing an optimal scenario under block events as an open problem. We suspect it is NP-hard due to the interdependence between characters, as we have demonstrated. We also do not know the complexity of the problem with cost functions in which losses are independent, but transfers occur in blocks.
We now illustrate our dynamic programming schemes for the inference of transfers that involve gain nodes shared between a set of characters. The whole datasets and implementations used for this part are available from the following repository: https://github.com/AliLopSan/Genesis-Sankoff.
Recall that our strategy is to find the set of gain nodes on a tree and then infer the connections between them a posteriori. Although there may be an exponential number of ways to connect the gains of a character
To build our base species tree T, we took a random subset of 45 species from the bacterial species used in Zhou et al. (2021), which consists of species that were predicted to be involved in interphylum transfers. We obtained the corresponding species tree from NCBI Taxonomy Browser (Schoch et al., 2020), noting that it is not completely resolved and is therefore nonbinary (see Supplementary Data). The whole annotated genomes of these species are contained in the KEGG (Kanehisa et al., 2016a,b) database.
As set of characters, we chose a set of 180 KEGG Ortholog groups, called KOs, taken from Zhou et al. (2021), which contain functions that are classified as metabolism-related, information processing, and antibiotic resistance, as seen in Figure 9. This choice was to ensure that our methods operate on the same input for consistency. We computed infra-labelings

Overview of the functional distribution of the 180 characters used in these experiments. Only the top 10 functions with more than one KO were considered.
The Basic labeling described in López Sánchez and Lafond (2022) maps every leaf to the character it possesses, and an internal node v has a 1 label if and only if all of its children have a 1 label.
The Sankoff labeling is derived from the algorithm presented in Section 3.2, with transfer/loss cost ratios
The Genesis labeling is computed using the algorithm from Section 3.3, with transfer and loss costs identical to those in the previous approach.
For every character
We contrasted the inferred highways with different transfers found throughout the literature. In Figure 10, we compare our inference with pairwise interphylum transfers reported in Zhou et al. (2021), shown in (a), which corresponds to a network whose nodes represent the bacterial species and an edge between a pair of species represents a transfer event. We will refer to these predictions as sequence-based predictions. This network was built by finding blocks of nearly identical DNA (i.e., more than 500 nucleotides, more than 99% identity) in distantly related genomes (<97% of 16S ribosomal RNA similarity). We conjectured that interphylum transfers would be more visible to our model than transfers that happen between closely related species. This is because when a transfer happens between closely related species, the parsimony criterion implied by our algorithm will most likely explain it through vertical inheritance, rather than transfers. To compare the outcome of our methods at the species level, we take a transfer highway

Transfer highways for different labelings. Every species in T is represented as a segment of the circle, and edges represent transfer arcs. For
When looking at higher level taxa, as shown in Figure 11, we see that throughout the different types of labeling and parameters used, the members of the Pseudomonas genus remain as potential gain nodes. This is consistent with the literature, since Pseudomonas are known to be not only ecologically versatile pathogens (Silby et al., 2011) but also implicated in the interphylum transfers of genetic material with members of the Bacteroidales order (Gschwind et al., 2024). This shows that our model is biologically sound. It remains to validate other pairs of ancestral species with highly supported transfers and investigate further the differences between the Sankoff and Genesis outputs.

For each of the 180 characters, we computed the

Cost distributions of Sankoff and Genesis labeling for the 180 KOs using

Inferred transfer highways using Sankoff and Genesis labeling with

The Pseudomonadota subtree for character K01669 under
Information Concerning the 45 Species Used in This Work
In this work, we have incorporated losses to character-based model on phylogenetic networks. We have shown that a most parsimonious scenario can be found efficiently for a single character, but much remains to be done. Notably, although the transfer network that results from applying Algorithm 1 is time-consistent for the one-character case, it does not guarantee that it will remain time-consistent when we have more than one character. Because of this, explaining multiple characters with a single scenario while minimizing transfers and losses appears challenging—it probably leads to NP-complete problem formulations, but investigating good heuristics or structural restrictions in the resulting networks are interesting future directions. Moreover, in López Sánchez and Lafond (2024), the authors provide examples on which incorporating losses can be used to resolve polytomies in species trees, by looking at resolutions that minimize our cost criterion. Given that several species phylogenies are only partially resolved, including the NCBI tree used in our experiments, we will look at possible algorithms to resolve them using our model. Finally, we observe that our experiments on real data are preliminary and that the approach can recover transfers that are not far from those found in the literature. It remains to perform experiments at a larger scale and see if our approach can find well-supported and novel transfer highways, especially the ancient ones that are difficult to recover using only sequence comparisons. To make our approach more impactful, one could also consider modeling the adaptability of the transferred genetic material, since there are factors such as the codon usage bias that could lead to certain characters being lost more easily than others (Callens et al., 2021).
AUTHORS’ CONTRIBUTIONS
A.L.S.: Conceptualization, methodology, formal analysis, algorithm design, software, investigation, writing—original draft, and visualization. G.E.S.: Methodology, formal analysis, validation, and writing—review and editing. P.F.S. and M.L.: Supervision, conceptualization, resources, and writing—review and editing. All authors have read and approved the final article.
Footnotes
ACKNOWLEDGMENTS
The authors thank the RECOMB-CG 2025 reviewers for their valuable comments.
AUTHOR DISCLOSURE STATEMENT
The authors declare that they have no competing interests or conflicts of interest.
FUNDING INFORMATION
A.L.S. acknowledges financial support from the program de bourses d’excellence en recherche from the faculty of sciences of the University of Sherbrooke. Research in the Stadler Lab was supported by the Federal Ministry of Research, Technology and Space of Germany (BMFTR) through DAAD project 57616814 (SECAI, School of Embedded Composite AI) and jointly with SMWK (Saxony) through the Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig (SCADS24B).
Supplemental Material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
