Abstract
Abstract
The live phylogeny problem generalizes the phylogeny problem while admitting the existence of living ancestors among the taxonomic objects. This problem suits the case of fast-evolving species, like virus, and the construction of phylogenies for nonbiological objects like documents, images, and database records. In this article, we formalize the live phylogeny problem for distances and character states and introduce polynomial-time algorithms for particular versions of the problems. We believe that more general versions of the problems are NP-hard and that many heuristic and approximation approaches may be developed as solution strategies.
1. Introduction
This article introduces the problem of live phylogeny, where a phylogenetic tree must be reconstructed but ancestors are present among the input taxonomic objects. This way, internal nodes in the resulting tree may be either actual objects or hypothetical ancestors. Real-world applications are the analysis of viral populations or other fast-evolving organisms (Castro-Nallar et al., 2012; Gojobori et al., 1990), and the phylogenetic analysis of nonbiological objects, such as documents, images, or relational database entries (Cuadros et al., 2007; Paiva et al., 2011). We present the problem both for distances and characters. For distances, we investigated the case in which the matrices are additive. For characters, we considered absence of convergence and reversals. We give polynomial algorithms for both problems. To our best knowledge, this is the first characterization of these problems in phylogeny.
This article is organized as follows. Section 2 is devoted to the distance-based live phylogen problem, and Section 3 to the character states live phylogeny problem. In Section 4, we present some conclusions.
2. Distance-Based Live Phylogeny
In the distance-based phylogeny problem, one wants to build an unrooted, weighted tree in which the distances among leaves are equal to the distances given in a distance matrix. The input is an n × n matrix M, where Mi,j is the distance between objects oi and oj. The output is a tree in which each leaf represents an object and all internal nodes have degree 3. When it is possible to build such a tree, then the distances in M are said to be additive.
It is known that if M is additive, then a polynomial algorithm solves the problem (Setubal and Meidanis, 1997). It is also known that a distance matrix M is additive if it is a metric space and respects the four-point condition, which states that given any four objects, it is possible to label them as oi, oj, ok, ol, such that
By the other side, minimizing the nonadditivity deviation is an NP-hard problem (Day, 1987).
In distance-based live phylogeny, objects may be represented by internal nodes of the tree as well, in order to reflect living ancestors in the set of objects.
Formally, let Mn be a square matrix of order n, representing objects • each leaf of Tn is labeled with one object, • each object labels exactly one node, and •
Internal nodes are called ancestors. An ancestor is live if it is labeled oi, for some i, and is hypothetical otherwise.
The distance-based live phylogeny problem is, given Mn additive, to build a live phylogeny Tn. Here we provide a constructive proof that a live phylogeny can always be built from an additive matrix Mn.
Suppose k = 2 and let x, y be the only two leaves of T2. Let z = o3 be the new node to be added to T2, obtaining T3. We have the four following possible cases, based on the relationships among x, y, z in T3.

In Case 1, node z is a live ancestor of x and y.

In Case 2, node y became a live ancestor of x and z.

In Case 3, node x becomes a live ancestor of y and z.
Notice that Cases 1, 2, and 3 are exclusive, otherwise z = x or z = y, and we are assuming that all objects are distinct.
In this case, add a new internal node c on the edge (x, y) and connect it to z obtaining T3, such that
(Fig. 4). This completes the basis.

In Case 4, there is a hypothetical ancestor c of x, y, and z.
Suppose that Tk ∼ Mk, k ≥ 3. We will show how to add a new node z to Tk, obtaining Tk+1 ∼ Mk+1. Let x, y be any two leaves of Tk. Again, we have four possibilities.
In favor of a clearer notation, we denote Mk+1 by M, dk+1 by d, and the only path connecting any nodes x and y in a tree by (x, y)-path.
Let us suppose there is no node in the position where z has been added. The case in which there is already such a node will be seen soon.
We need to show that dzw = Mzw for any node w ≠ x, y. Suppose that w is not in (x, y)-path, but it is connected to it by a node c in the (z, y)-path (Fig. 5). The case where c is in (x, z)-path is analogous.

Case i when w is not in (x, y)-path but it is connected to it by a node c and the new z is in (c, y)-path.
From the tree dwx + dwy − dxy = 2dcw = dwz + dwy − dzy, so dwz = dwx − dxy + dzy.
The four-point condition for these points can be verified by the labeling that results in Mxw + Mzy = Mxy + Mzw. Thus, Mwx − Mxy = Mzw − Mzy. Then
Note that this proof works also for the case in which c = w.
Finally, let us see what happens when there is already an internal node c in the position where z should be added. Node c is a hypothetical ancestor, otherwise z would be already in the tree. It is enough to transform this internal node c into the live ancestral z = ok+1. We only need to show that dzw = Mzw, for any live node w connected to z without using edges or nodes in (x, y)-path, because the situations in which there are nodes from (x, y)-path are covered above. (Fig. 6).

Case i when a hypothetical c exists at z's position, and c is replaced by z.
The four-point condition for these points can be verified by the labeling that results in Mzw + Myx = Mxz + Mwy. Then,
This concludes Case i.
We need to show that, dzw = Mzw, for any node w in Tk+1, w ≠ x, y. Let w ≠ x, y be a node of Tk+1 and c the node connecting w to the path that connects x to y. (Fig. 7).

In Case ii, the new node z is connected to y by a new edge.
The four-point condition for these points can be verified by the labeling that results in Mxy + Mwz = Mxz + Myw. Then,
Notice that this proof holds also in the case where w = c.
If none of the cases i, ii, and iii happens, then we try to add the new node z to Tk through an edge connecting z and a node c in the (x, y)-path, as we do in Case 4 of the basis. There are three possibilities to consider.
We need to show that dwz = Mwz for every w ≠ x, y. Let us suppose that there is a path from v to w, where v is a node in the (x, c)-path. The case where v is in the (c, y)-path is analogous (Fig. 8).

In Case iv-a, the new node z is connected to a new hypothetical ancestor c by an edge.
The four-point condition for these points can be verified by the labeling that results in Mxy + Mwz = Mxz + Myw. From the tree, dzw + dzy − dwy = 2dcz = dxz + dzy − dxy. So dzw = dxz − dxy + dwy.
Note that this proof works also for the case in which v = w.
We need to show that dwz = Mwz for every w ≠ x, y. If w is either in the (x, y)-path, or there is a path from d to w, where d is a node in the (x, c)-path, then the proofs are similar to the previous ones. Otherwise w is in a path connected to c, as shown in Figure 9.

In Case iv-c, the new node z is connected to an existing hypothetical ancestor c by an edge.
The four-point condition for these points can be verified by the labeling that results in Mxy + Mwz = Mxz + Myw. From the tree, dxz + dzy − dxy = 2dcz = dzw + dzy − dwy, and the proof follows exactly as the one for Case iv-a. ▪
The constructive proof of Theorem 1 gives us an algorithm to build the live phylogeny given an additive matrix. The algorithm consists of starting with two objects connected by an edge and applying, in each step, one of the described cases. This algorithm clearly has polynomial time in the number of objects, since the test for the correct case can be done in constant time, except for Case iv-b, where we need to find a pair of leaves satisfying any of the other cases. Because there are O(k2) possible pairs of leaves, in each step k, the total time is O(n3).
3. Character States Live Phylogeny
For this problem, one is interested in building a phylogenetic rooted tree that explains the evolutionary relationship among objects, based on states of characters that each object possesses. More formally, the input is an n × m matrix M, where Mi,j is the state of character j for object i.
In the related literature, the character-based phylogeny problem has been approached, considering the number of possible states for each character and whether or not there is an order relation defined for character states. If the number of states is fixed, then the problem can be easily solved. Otherwise, it is NP-hard. When the order is totally defined, then there is a polynomial-time algorithm for the problem. Otherwise, the problem is also NP-hard (Setubal and Meidanis, 1997).
Another issue concerns more complicated evolutionary facts, such as reversal and parallel evolution events. A reversal happens when a character changes back to a previous state. A parallel evolution happens when a character changes to the same state in two distinct lineages. When there are no such events, the problem is known as the perfect phylogeny problem. Other works in the literature concern minimizing the number of reversals and parallel evolution events to obtain a perfect phylogeny or maximizing the number of characters that admit a perfect phylogeny (Setubal and Meidanis, 1997).
In this work, we introduce the perfect live phylogeny with two character states. Let M be an n × m binary matrix whose rows are labeled 1. Each edge in T is labeled with a distinct cj; 2. each object oi labels exactly one node in T; 3. for every node labeled oi, Mi, j = 1 if and only if cj is in the path from the root of T to the node labeled oi.
Observe that this definition allows T to have labeled internal nodes. Then, objects given in the input may be ancestors in the phylogeny.
The polynomial time algorithm to solve this problem is presented in Figure 10. It is a simple adaptation of the algorithm for perfect phylogeny by Waterman et al. (1977). The input is a binary matrix M with columns sorted in nonincreasing order of the number of 1s. An example input and tree appear in Figure 11.

Perfect live phylogeny polynomial-time algorithm.

A binary matrix with columns sorted in nonincreasing order of the number of 1s, and the corresponding live phylogeny tree, in which B is a live ancestor.
For the correctness of the algorithm, let T be the tree whose root is returned by the algorithm. It is easy to see that each oi is used to label exactly one node of T. It is also easy to see that no edge is created without a label. Now, suppose that two edges in T were labeled cj, while processing objects oi and ok, i < k. The ordering of the columns of M, and the fact that each pair of columns of M is either comparable or disjoint, guarantee that Mi,j′ = Mk,j′, 1 ≤ j′ ≤ j − 1. Thus, by construction, the paths between the root and both edges labeled cj must be the same, which is a contradiction.
To see that if Mi,j = 1 then cj is in the path from the root of T to the node labeled oi, it is enough to see that during the processing of row i of M, either the edge cj is created or traversed. By the other side, if edge cj is in the path between the root and the node labeled oi, then oi was labeled after creating or the traversing of edge cj, and that happens only if Mi,j = 1.
4. Conclusions
Live phylogeny generalizes phylogeny while broadening its application to other areas distinct from molecular biology, such as visualization, data mining, and forensics. As with phylogeny, live phylogeny will certainly lead to NP-hard problems when the restrictions we considered here are released, namely the absence of reversals and parallel evolution, and additivity. A broad class of approximation algorithms and heuristics may then be explored for the problem. We conclude our article noting that applications beyond molecular biology deal with very large datasets, and large-scale techniques may be considered as well.
Footnotes
Acknowledgments
G.P.T. and R.M. acknowledge the financial support of CNPq and FAPESP. N.F.A. acknowledges CNPq 305503/2010-3 grant. M.E.M.T.W. acknowledges CNPq 306731/2009-6 and FINEP 01.08.0166.00 grants.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
