Abstract
In addition to undergoing evolution, members of biological populations may also migrate between locations. Examples include the spread of tumor cells from the primary tumor to distant metastases or the spread of pathogens from one host to another. One may represent migration histories by assigning a location label to each vertex of a given phylogenetic tree such that an edge connecting vertices with distinct locations represents a migration. Some biological populations undergo comigration, a phenomenon where multiple taxa from distinct lineages simultaneously comigrate from one location to another. In this work, we show that a previous problem statement for inferring migration histories that are parsimonious in terms of migrations and comigrations may lead to temporally inconsistent solutions. To remedy this deficiency, we introduce precise definitions of temporal consistency of comigrations in a phylogenetic tree, leading to three successive problems. First, we formulate the temporally consistent comigration problem to check if a set of comigrations is temporally consistent and provide a linear time algorithm for solving this problem. Second, we formulate the parsimonious consistent comigrations (PCC) problem, which aims to find comigrations given a location labeling of a phylogenetic tree. We show that PCC is NP-hard. Third, we formulate the parsimonious consistent comigration history (PCCH) problem, which infers the migration history given a phylogenetic tree and locations of its extant vertices only. We show that PCCH is NP-hard as well. On the positive side, we propose integer linear programming models to solve the PCC and PCCH problems. We demonstrate our algorithms on simulated and real data.
INTRODUCTION
Studying the precise pattern of migration of biological populations holds significant importance in various areas of biology and medical science. For instance, understanding the migration history of metastatic cancer can provide insights into the mechanism of metastasis and aid in the development of novel drugs (Comen et al., 2011; El-Kebir et al., 2018; Faries et al., 2013; Sanborn et al., 2015; Somarelli et al., 2017; Tabassum and Polyak, 2015). Similarly, investigating the transmission of pathogens can help in identifying the source of an outbreak and tracing the patterns of disease spread (Campbell et al., 2019; Dellicour et al., 2018; Faye et al., 2015; Ferguson et al., 2001; Spada et al., 2004).
To successfully trace the migration history of a biological population, one may analyze genomic data as the migrated subpopulations have evolved independently, resulting in genomic differences that are location specific. More specifically, from the genomic data, one may first construct a phylogenetic tree T with each vertex v corresponding to a subpopulation with similar genetic makeup, and then label each vertex v with their location of origin
While the approach used by Slatkin and Maddison (1989) and McPherson et al. (2016) considers each migration in isolation, there are evolutionary processes where multiple migrations between the same pair of locations may occur simultaneously. For instance, cancer cells from distinct clones may comigrate as part of a single cluster (Aceto et al., 2014; Birkbak and McGranahan, 2020; Cheung and Ewald, 2016; Cheung et al., 2016; Dadiani et al., 2006; El-Kebir et al., 2018; Kok et al., 2021; Maddipati and Stanger, 2015; Marrinucci et al., 2012; Yamamoto et al., 2023; Yu et al., 2013). Similarly, many pathogens are subject to a weak transmission bottleneck, where multiple variants of the same pathogen are cotransmitted in a single event, including influenza (Sobel Leonard et al., 2017), SARS-CoV-2 (Rambaut et al., 2004; Sashittal and El-Kebir, 2020; Sashittal and El-Kebir, 2019), HIV (Tonkin-Hill et al., 2021), and hepatitis B (Margeridon-Thermet et al., 2009; Wang et al., 2010).
MACHINA (El-Kebir et al., 2018) was the first method to incorporate comigrations in the analysis of metastatic cancer, defining a comigration as a set of migrations that occur on distinct lineages of the tree and are between the same pair of locations. Using this definition, MACHINA extended Slatkin and Maddison (1989)'s approach by choosing the location labeling that first minimized the number of migrations followed by minimizing the number of comigrations. Two other methods, SharpTNI (Sashittal and El-Kebir, 2019) and TiTUS (Sashittal and El-Kebir, 2020), use a similar definition of comigration to infer transmission histories during pathogen outbreaks.
A key problem with the MACHINA definition of comigration is its failure to adequately capture temporal dependencies between migrations. Note that time moves forward along the directed edges of a phylogeny. Therefore, if a migration
In species phylogenetics, similar temporal restrictions arise concerning lateral gene transfers. Specifically, since gene transfer occurs in coexisting entities, if a transfer occurs from a species X to another species Y in a species tree, there cannot be another transfer from an ancestor of X to a descendant of Y. The temporal consistency of lateral gene transfers has been addressed in studies involving gene tree reconciliation (David and Alm, 2011; Libeskind-Hadas and Charleston, 2009; Merkle and Middendorf, 2005; Nøjgaard et al., 2018; Tofigh et al., 2010), species tree ranking (Chauve et al., 2017), and species tree inference (Lafond and Hellmuth, 2020).
Here, we present a new model that enforces spatial and temporal consistency of comigrations as well as three problems that use this new model. First, the temporally consistent comigration (TCC) problem seeks to assign timestamps to migrations such that migrations in the same comigration have the same timestamp and timestamps increase monotonically along the edges of any root-to-leaf path of the tree (Fig. 1a). We present a linear time algorithm to solve TCC. Second, the parsimonious consistent comigrations (PCC) problem seeks a minimum-cardinality set of spatially and TCC given a rooted tree with locations assigned to all vertices (Fig. 1b). We prove that this problem is NP-hard. Third, we formulate the parsimonious consistent comigration history (PCCH) problem, where, given a rooted tree with locations assigned to only the leaves, we seek a location labeling and comigrations that minimize the number of migrations and subsequently comigrations, while maintaining spatial and temporal consistency (Fig. 1c). We prove that PCCH is also NP-hard.

Outline of the three problems studied in this article.
We formulate integer linear programs (ILPs) for exactly solving PCC and PCCH. We introduce a workflow for checking MACHINA migration histories for temporally consistency, and, if necessary, correcting them using the problems and algorithms introduced in this article. On simulated data, we find that MACHINA may fail to return temporally consistent solutions. On real data of metastatic cancers with relatively small phylogenetic trees, we find that MACHINA returned temporally consistent solutions. In summary, this work addresses a deficiency in a previous mathematical model of comigration, providing precise definitions and conditions for temporal consistency.
We consider directed trees T rooted at a vertex
Migrations are edges of T whose endpoints are assigned different locations by
We say a location s is seeding a location t if there exists a migration
For comigrations
To model temporal consistency, we first introduce a timestamp labeling that labels each migration by a timestamp defined as follows.
We now define temporal consistency as follows.
For the first problem, we focus on finding the chronological order of comigrations. That is, given a set
We say that comigrations
Next, we consider the problem where we are no longer given the set
We note that in practice, we are only given a leaf labeling
To understand why we chose this particular ordering of the two objectives, note that there is a trade-off between the number of migrations and comigrations, where minimizing one objective comes at the expense of the other. Assuming that the location leaf labeling
Location labelings with
COMBINATORIAL CHARACTERIZATION AND COMPLEXITY
This section includes the theoretical results on the combinatorial characteristics and complexity of the three discussed problems. The proofs have been moved to Appendix A (Supplementary Data) because of space constraints.
Combinatorial characterization of the TCC problem
To solve the TCC problem, we define the comigration graph
A comigration graph
Temporally inconsistent and consistent comigrations with comigration graphs. 
We have the following important theorem, stating that comigrations
We show how to solve TCC in
As we have mentioned earlier, MACHINA (El-Kebir et al., 2018) was the first method to incorporate comigrations in its problem formulation. Our notion of comigrations is similar to the one introduced in MACHINA (El-Kebir et al., 2018), but there are significant distinctions. MACHINA requires comigrations
The minimum number
where
While comigrations
Comigrations inferred by MACHINA (El-Kebir et al., 2018) might not be temporally consistent. 
The following corollary directly follows from Lemma 2 and Lemma 3.
Note that MACHINA only computes the number
Note that the greedy approach is not guaranteed to output comigrations that are both compatible with location labeling
Finally, we explore a sufficient condition under which compatible comigrations
This means that versions of MACHINA that restrict location labeling
In conclusion, MACHINA does not guarantee temporal consistency unless the inferred location labeling is reseeding-free. We will make use of this when developing a workflow for solving the PCCH problem in Section 4.4.
The example in Figure 3 and Lemma 1 demonstrates that the smallest set
We prove this by a reduction from shortest common supersequence (SCS) in polynomial time. The SCS problem takes as input a set
Add the root o to the empty tree T and set the label of root o to be For each input sequence Si, attach the path Label each a-vertex
The lower bound of
Reduction from SCS to PCC. 
In the following, let
Next, we show that there exists a mapping between supersequences S of length m and balanced sets
Finally, we prove the following lemma from which Theorem 2 follows.
In this subsection, we prove PCCH to be NP-hard.
We show that PCCH is NP-hard by reduction from PCC. To that end, given a tree T with location labeling
For every vertex For every edge For every leaf For each internal vertex
Clearly, the reduction described above takes polynomial time. Note that the set
Given the constructed tree
The previous lemma means that the number
Finally, we prove the main lemma from which hardness follows.
In this section, we introduce algorithms to solve the three problems we discussed, and also introduce a workflow for inferring a temporally consistent migration history from input trees with leaf labeling.
Linear time algorithm for the TCC problem
The proof of Theorem 1 describes a way of solving TCC by computing a topological ordering of the vertices of the given comigration graph
Thus, by Lemma 10, TCC can be solved in
ILP for the PCC problem
In the previous section, we have shown that PCC is NP-hard. We solve the problem to optimality using an ILP. To solve the problem to optimality, we formulate an ILP, modeling comigrations
Timestamp labeling
First, we begin by noting that the number of unique timestamps is at most the number
For any two migrations
where
Comigrations
For spatiotemporally consistent comigrations
For each migration
Symmetry-breaking constraints
To increase performance, we use symmetry breaking constraints enforcing smaller timestamps to be used first.
Optimization function
Since we require each comigration to have a unique timestamp, the total number of comigrations equals the number of nonzero entries in y.
Note that the objective function will ensure that
Model size
PCC's ILP consists of
variables and
ILP for the PCCH problem
In the previous section, we showed PCCH to be NP-hard. We solve the problem to optimality using an ILP. To do so, we must model (i) a location labeling
Location labeling
To model location labeling
In addition, for the leaves of T, we force location labeling
Timestamp labeling
For efficient ILP formulation, we assign timestamps on nonmigrations and include them in comigrations. This modification does not change the original PCCH algorithm, as the timestamps on nonmigrations can be ignored while still ensuring temporal consistency. Again the number of distinct comigrations and thus timestamps is at most the number
To ensure temporal consistency, for any two consecutive edges
Comigrations
Similar to our ILP for PCC, we again require each comigration to have a unique timestamp and use the timestamps to identify individual comigrations in this ILP. To that end, we introduce binary variables
For each edge
Symmetry-breaking constraints
Like the ILP model for PCC, we eliminate some symmetrical solutions by forcing smaller partition numbers to be used first.
Optimization function
We compute the number of migrations from variables x by counting the number of migrations. Since we ignore the comigrations with nonmigrations, we only count the number of comigrations that contain migrations from variables y. Thus, we define the objective function as
In the optimization function, the factor
Model size
PCCH's ILP consists of
Workflow for inferring temporally consistent migration histories
MACHINA, like PCCH, employs an ILP for migration history inference. While both methods minimize migrations and comigrations lexicographically, MACHINA does not enforce temporal consistency like PCCH, resulting in a simpler ILP with
variables and
The workflow has five steps in total (Fig. 5). In step I, given an input tree T and leaf labeling

Workflow for inferring temporally consistent migration histories. The workflow consists of sequentially running MACHINA and the algorithms discussed in this article, falling back on more complex algorithms whenever necessary. *In case the user is not interested in the specific set
Otherwise if
In this section, we compare the performance of MACHINA with our methods on simulated (Section 5.1) and real data (Section 5.2). All experiments were run on a server with Intel Xeon Gold 5120 dual CPUs with 14 cores each at 2.20 GHz and 512 GB RAM. The code, which uses Gurobi to solve the ILPs, as well as simulation and real data instances are available at https://github.com/elkebir-group/PCCH.
Simulated data
This section aims to evaluate the performance of our algorithms relative to MACHINA. To that end, we generated simulation instances following a three-step process. First, we sampled a comigration graph G resulting in a set
We generated three classes of simulation instances, with increasing complexity in the initially sampled comigration graphs in the form of cycles. The details are provided in Supplementary Section D.1 and Supplementary Figure S3. We ran all five steps of the workflow for each instance without terminating prematurely. Thus, we ran MACHINA on each simulation instance
For our first set of simulations, we sampled five comigration graphs without any cycles, obtaining a total of five instances

Simulation results.
for both solutions, we have
We also executed our workflow on all five instances, which terminated at step III because of
To generate the second set of simulation instances, we picked comigration graphs with
Note that MACHINA's inability to accurately determine the number of comigrations for a specific instance does not necessarily imply that the associated location labeling is incorrect. For example, in 9 out of 20 cases,
In Figure 6, we present a simulation instance with
Although MACHINA was faster in
Finally, we constructed our third set of simulations by sampling comigration graphs with complex, nested cycles. Specifically, we began by sampling a comigration graph with one cycle. Then, we randomly selected pairs of vertices from the comigration graph, ensuring that they do not share an edge with the cycle, and connected them. We generated five such comigration graphs, and for each of these comigration graph, we simulated one tree T with leaf labeling
In terms of running time, we observed MACHINA outpacing PCCH, with a median running time of 31.176 seconds for MACHINA and 37.542 seconds for PCCH (Fig. 6b and Supplementary Table S4). Like the second class of simulations, the workflow terminated at step V, and the running time was dominated by MACHINA and PCCH.
Ovarian cancer
We applied PCCH to infer the migration history of seven patients diagnosed with high-grade serous ovarian cancer from McPherson et al. (2016). McPherson et al. (2016) sequenced 68 tumor samples across seven patients, encompassing samples from various sites such as the ovary, omentum, fallopian tube, peritoneal locations, and distant metastatic sites, using whole genome and targeted sequencing. After identifying the dominant clones from detected SNVs and rearrangement breakpoints, they constructed clone trees T using a probabilistic phylogenetic model based on the stochastic Dollo process. Finally, for each patient, they inferred the migration history by finding the location labeling
For instance, for patient 1, McPherson et al. (2016) originally identified the right ovary (ROv) as the primary tumor location, as their reported optimal location labeling had 13 migrations and 10 comigrations with ROv as the primary site. Also, they reported the occurrence of metastasis-to-metastasis migration for patient 1. In contrast, MACHINA found a more optimal solution with the same number of migrations but only seven comigrations, designating the left ovary as the primary tumor location. Furthermore, MACHINA inferred a simpler migration pattern for patient 1 without reporting any metastasis-to-metastasis migration.
For each of the seven patients, we generated the location labeling with timestamps by solving PCCH. We found that PCCHs location labelings perfectly matched those of MACHINA. Moreover, we found both methods returned the same number of comigrations. As both the location labelings and the number of comigrations matched, MACHINA's solutions are temporally consistent. As an example, we show the PCCH output for patient 1 in Figure 7a with location and timestamp labels. Both MACHINA and PCCH report reseeding in the migration history, which can easily be seen by observing the edges with timestamps 1 and 7. Note that there are other possible timestamp labelings, and PCCH returns only one single solution.

MACHINA and PCCH results for ovarian (McPherson et al., 2016), prostate (Gundem et al., 2015), and breast cancer (Hoadley et al., 2016) datasets.
We show the running time analysis for PCC, PCCH, MACHINA, and the workflow in Figure 7b and Supplementary Table S1. We found that PCCH generally takes slightly longer to finish (median of 0.474 seconds vs. 0.244 seconds for MACHINA). This is expected, as unlike MACHINA, PCCH includes checks for temporal consistency and returns timestamps along with a location labeling. Similar to the findings on simulated data, we found PCC to be significantly faster than PCCH or MACHINA. Since the MACHINA comigrations are temporally consistent, the workflow stops at step III, resulting in the running time of the workflow matching closely with that of MACHINA.
We ran PCCH and inferred the migration history of five androgen-deprived metastaic prostate cancer patients from Gundem et al. (2015). For the five selected patients, Gundem et al. (2015) sequenced both primary (prostate) and metastatic samples using whole-genome sequencing (WGS) technology. For each patient, they constructed a clone tree T by first identifying mutation clusters and calculating cancer cell fractions of each cluster in each sample by using an n-dimensional Bayesian Dirichlet process, and then inferring evolutionary relationships between pairs of mutation clusters by applying the “pigeon-hole” principle to mutation clusters within individual samples. To infer the migration histories, they deduced the location of origin of each mutation cluster by examining cancer cell fractions in each sample and using the “pigeon-hole” principle, and reported metastasis-to-metastasis migration in four (A10, A22, A31, and A32) out of five patients in consideration.
The samples from the same five patients were reanalyzed by MACHINA in El-Kebir et al. (2018), where it found simpler solutions with metastasis-to-metastasis spread only in two patients (A22, A32). MACHINA also did not report reseeding for any of the patients, which implies that the migration histories inferred by MACHINA are temporally consistent by Proposition 1. Indeed, we found that the inferred location labeling and the number of comigrations were identical for both PCCH and MACHINA. In terms of running times, we observed similar trends (Fig. 7b and Supplementary Table S2)—MACHINA was slightly faster than PCCH (median of 27.795 seconds vs. 0.67 seconds for MACHINA), although for patient A22, MACHINA (1702.24 seconds) needed more time than PCCH (185.18 seconds). For PCC, the running time was significantly shorter (median: 0.025 seconds). The workflow stops at step III because of
Breast cancer
We applied our methods to examine the migration history of two triple-negative breast cancer patients from Hoadley et al. (2016). DNA whole-genome sequencing was conducted on matched primary and multiple distant metastasis samples for both patients. The clonal structure was inferred using SciClone (Miller et al., 2014), and the phylogeny was determined using the ClonEvol R package (Dang et al., 2017). For patient A1, ClonEvol reported two potential clone trees due to its inability to accurately determine the evolutionary origin of clone 7. For patient A1, MACHINA recapitulated the findings reported in Hoadley et al. (2016) that all the clones except clones 6 and 9 originated in the primary location for both trees. For patient A7, MACHINA reported a parsimonious solution with eight migrations and six comigrations, and a comigration from primary location to lung for clones 2 and 4, which agreed with Hoadley et al. (2016).
All the results returned by MACHINA were temporally consistent, and so the workflow stopped at step III. Consequently, the migration histories inferred by MACHINA and PCCH were identical. Running times followed the same trend (Fig. 7b and Supplementary Table S3), with MACHINA being slightly faster than PCCH (median of 0.613 seconds vs. 0.074 seconds for MACHINA), and PCC being the fastest (median: 0.004 seconds).
CONCLUSION
In this article, we addressed a flaw in the definition of comigration adopted by MACHINA (El-Kebir et al., 2018). Specifically, we precisely defined spatial and temporal consistency for comigrations, leading to the formulation of three successive problems. The first problem, TCC, determines temporal consistency given a set of comigrations and derives a timestamp labeling for migrations in case the comigrations are temporally consistent. We showed that TCC can be solved in linear time. The second problem, PCC infers the smallest set of TCC given the locations of both leaf and internal vertices. We proved the problem to be NP-hard, indicating that even if the location of origin of every vertex and thus every migration is given as input, it is still computationally hard to deduce which migrations occurred simultaneously under a parsimony criterion.
Our third problem, PCCH, takes as input a leaf labeling, and infers the location labeling that minimizes the number of migrations, and subsequently the number of spatiotemporally consistent comigrations. We proved that PCCH is also NP-hard. In addition, we discussed MACHINA's views on comigrations and its limitations concerning temporal consistency and reported a sufficient condition under which MACHINA accurately computes comigrations. We presented ILP models for PCC and PCCH and proposed a workflow that combines the strengths of MACHINA, PCC, and PCCH—by using TCC to verify MACHINA's results and resorting to PCC and PCCH when needed. Finally, we conducted a comparative analysis of PCCH and MACHINA's performance on simulated and real data.
We generated simulation instances to investigate when MACHINA fails to determine temporally consistent comigrations and showed that MACHINA underestimates comigrations and may yield suboptimal location labeling in the presence of comigration graph cycles. For real data, PCCH returned the same location labeling as MACHINA for all instances.
PCCH offers several promising avenues for future research. While our current study focused on applying PCCH exclusively to cancer data, its versatility extends to inferring migration history in various organisms, including disease pathogens, as discussed earlier. Broadening the application of PCCH to diverse real datasets is crucial for gaining a comprehensive understanding of temporal inconsistency in practical scenarios. Drawing inspiration from MACHINA, which introduced parsimonious migration history with tree refinement, we plan to expand PCCH to incorporate tree refinement, aiming to minimize the number of migrations and comigrations lexicographically across all location labelings for possible tree refinements of the input tree. Furthermore, a captivating challenge lies in exploring the existence of multiple optimal solutions within the PCCH framework. Currently, PCCH provides a single optimal solution, yet, instances may arise where distinct location labelings yield the same number of migrations and TCC. Investigating the solution space within PCCH to detect and characterizing these alternatives represents a promising avenue for future research in this field.
Footnotes
ACKNOWLEDGMENTS
AUTHORs' CONTRIBUTIONS
M.S.R.: Conceptualization, Implementation, Formal analysis, and Writing—Review. S.S.: Conceptualization and Writing—review. M.E.-K.: Conceptualization, Validation, and Writing—review and editing.
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicting financial interests.
FUNDING INFORMATION
Mohammed El-Kebir was supported by the National Science Foundation (CCF-2046488) as well as funding from the Cancer Center at Illinois. Sagi Snir was supported by the Israel Science Foundation (Grant No. ISF 1927/21) and the American/Israeli Binational Science Foundation (Grant No. BSF 2021139).
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
