Abstract
Abstract
Scoring a given phylogenetic network is the first step that is required in searching for the best evolutionary framework for a given dataset. Using the principle of maximum parsimony, we can score phylogenetic networks based on the minimum number of state changes across a subset of edges of the network for each character that are required for a given set of characters to realize the input states at the leaves of the networks. Two such subsets of edges of networks are interesting in light of studying evolutionary histories of datasets: (i) the set of all edges of the network, and (ii) the set of all edges of a spanning tree that minimizes the score. The problems of finding the parsimony scores under these two criteria define slightly different mathematical problems that are both NP-hard. In this article, we show that both problems, with scores generalized to adding substitution costs between states on the endpoints of the edges, can be solved exactly using dynamic programming. We show that our algorithms require O(mpk) storage at each vertex (per character), where k is the number of states the character can take, p is the number of reticulate vertices in the network, m = k for the problem with edge set (i), and m = 2 for the problem with edge set (ii). This establishes an O(nmpk2) algorithm for both the problems (n is the number of leaves in the network), which are extensions of Sankoff's algorithm for finding the parsimony scores for phylogenetic trees. We will discuss improvements in the complexities and show that for phylogenetic networks whose underlying undirected graphs have disjoint cycles, the storage at each vertex can be reduced to O(mk), thus making the algorithm polynomial for this class of networks. We will present some properties of the two approaches and guidance on choosing between the criteria, as well as traverse through the network space using either of the definitions. We show that our methodology provides an effective means to study a wide variety of datasets.
1. Introduction
P
1.1. Maximum parsimony on networks
Unifying the definitions given earlier in the literature, we state that maximum parsimony is a network inference method that scores networks based on the minimum number of state changes across a subset of edges of the network for each character that are required for a given set of characters to realize the input states at the leaves of the networks. If the entire edge set of the network is chosen, then the network score assumes all possible changes that can take place in the network for the character. If the edges are those of a spanning tree of the network that minimizes the parsimony score, the network score assumes a treelike evolutionary scenario for the character; and reticulate events are explained by different spanning trees in the network that achieve the best parsimony score for various characters. The former is called the hardwired approach, and the latter the softwired approach (Fisher et al., 2013). In this article, we use the same terms when referring to these approaches (see Section 2.3). Both these scenarios are useful depending on the type of input characters. For example, if the character is a single position of the DNA string of various taxa, then one might be interested in the softwired approach; and on the other hand, if each character itself is a set of homologous gene sequences in various taxa, which could have hybridized in the past, the hardwired approach would be better suited.
The optimal solutions for the hardwired approach on two states can be computed in polynomial time; Menger's theorem (Schrijver, 2003, 131–132) provides a variety of algorithms (Schrijver, 2003, 138–140)—one such example is Ford-Fulkerson's method to compute the max-flow min-cut between two vertices of a graph. In general, the hardwired approach can be cast as a multiway-cut problem. This problem has received attention for many years (Schrijver, 2003, 254 for detailed references) and has been shown to be NP-complete (Dahlhaus et al., 1994) for more than two states. Recently, it has been shown to be fixed-parameter tractable with the unknown score as the parameter (Fisher et al., 2013). On the other hand, the softwired problem is a bit more challenging. It is proven to be NP-hard even for two states; and no fixed-parameter tractable algorithm with the unknown score as the parameter can be devised unless P = NP (Fisher et al., 2013).
Parsimony methods give multiple assignment scenarios, and it is of interest to generate all of them in order to analyze the given data completely. Integer linear programming (ILP) formulations for both problems are provided in Fisher et al. (2013), which in turn provide practical methods to compute the scores. With the existing methods to compute all optimal solutions of an ILP, the ILP formulations would be very helpful in compiling the complete solution set. In this article, we provide another method to find all optimal solutions.
Parsimony scores can be further generalized by using substitution cost between different states and using the addition of the costs along edges as the scores. These substitution costs are represented by a cost matrix. While the parsimony score without such costs already gives a means to best test a network, the cost matrices provide greater sensitivity to the type substitutions that the data would have undergone. In this article, we devise a dynamic programming algorithm to find exact scores for general cost matrix for both problems. This approach is the first method to efficiently compute the parsimony score of networks under both the hardwired and softwired approaches, along with giving all possible optimal character assignments on the internal vertices of the network.
1.2. Dynamic programming
Dynamic programming has been used to provide efficient solutions for finding the exact parsimony score when the network is a phylogenetic tree (Sankoff, 1975; Sankoff and Rousseau, 1975), and more generally for cost matrices specific to each edge (Erdös and Székely, 1994). This later approach was used in Tuffley and Steel (1997) to show that that the likelihood approach achieves identical results under some model as the parsimony approach does in searching for trees. In Section 3, we show that the same approach can be generalized to phylogenetic networks for both the hardwired and softwired problems. Sankoff's algorithm traverses the vertices of the tree via postorder while computing the minimum costs of each state at each vertex from the leaves to the root, and then chooses the best assignments on each vertex by backtracking from the root to the leaves by a preorder traversal of the tree. Here, we show that the algorithm can be extended to networks by performing the pre- and postorder traversals on the tree representation of the network that is considered.
In Section 3.3, we discuss the complexity of the algorithms and show that both are fixed parameter tractable in terms of the number of reticulate vertices in the network that, for networks whose underlying undirected graph has disjoint cycles, can further be reduced to polynomial-time algorithms (see Section 3.3.1). The dynamic programming approach also yields all possible minimum cost assignments of the characters to the internal vertices. This enables us to look for “redundant” reticulate vertices by choosing only those optimal solutions that maximize the same set of parents of reticulate vertices that are “utilized” for a most parsimonious representation of all characters. In Section 4.2, we will give a formal definition for this concept. Redundancy in the numbers of reticulate vertices can be used as a criterion while navigating through the network space by increasing or decreasing the number of reticulate vertices. While one can search for better hypotheses on networks with higher numbers of reticulate vertices, redundancy will help bound the number of reticulate vertices that are appropriate for the given dataset. In Section 5, we present an example in which our method identifies a network with reticulate events as a lower-cost evolutionary history than trees for a real dataset from a set of taxa in which hybridizations are common.
2. Preliminaries
2.1. Phylogenetic networks
We follow the definition of the phylogenetic networks as given in Moret et al. (2004, definition 4, page 16). For all other graph-theoretical definitions that are not given here, we follow Schrijver (2003). A rooted phylogenetic network, simply called here a phylogenetic network, is defined in Huson et al. (2011) as a rooted, directed acyclic graph (DAG), whose root has in-degree zero and leaves have out-degree zero. The vertices whose in-degree is greater than one are called reticulate vertices, and the edges with reticulate vertices as head vertices are called reticulate edges. All other edges are termed tree edges. The definition given in Moret et al. (2004) takes care of the so-called “time-consistency” restraint, namely, that the tree edges take place in a positive time and the reticulate vertices have parents that can only “coexist in time.” We recall the formal definition of phylogenetic networks as given in Moret et al. (2004).
Given any directed graph, we say two vertices u and v cannot coexist in time if there exists a sequence 1) pi is a directed path that contains at least one tree edge for every 1 ≤ i ≤ k, 2) u is the tail of p1 and v is the head of pk, and 3) for every 1 ≤ i ≤ k−1, there exists a network vertex whose two parents are the head of pi and the tail of pi+1.
A phylogenetic network N is a rooted DAG obeying the following constraints:
1) Every vertex has indegree and outdegree defined by one of the four combinations (0, 2), (1, 0), (1, 2), or (2, 1)—corresponding to, respectively, root, leaves, internal tree vertices, and reticulate vertices. All vertices other than reticulate vertices are called tree vertices. 2) If two vertices u and v cannot coexist in time, then a network vertex w with edges (u, w) and (v, w) does not exist. 3) Given any edge of the network, at least one of its endpoints must be a tree vertex. Another component of this definition is that for any edge in the phylogenetic network, at least one of its endpoints (either the head or tail) is a tree vertex. We also assume that no internal vertex has two reticulate children. We call this class of phylogenetic networks a phylogenetic network with no sister reticulations. Wherever possible, we point out whether the conditions of the definition are necessary.
Phylogenetic networks can naïvely be thought of as networks that contain, as subgraphs, trees that explain the evolutionary histories of different segments of input terminal sequences. For a phylogenetic network N with no sister reticulations, and r reticulate vertices with leaf set X, we denote
Below are some results about phylogenetic networks. In Lemma 1, we show that the phylogenetic networks are indeed planar, that is, they can be drawn on a plane such that the edges do not cross each other. In fact, Hartvigsen (1998) proved that calculating the hardwired score is polynomial. A graph G is said to be planar if and only if for each subgraph H with at least 3 vertices of G, |E(H)| ≤ 3|V (H)|−6.
Lemma 1
Let N be a phylogenetic network as defined above. Then N is planar.
Proof
Since the vertices of networks that we are considering have degree at most 3, it is straight forward to notice that our networks are all planar.
Let H be any subgraph of N. Then, we have
The above result does not have immediate consequences that are used in this article. However, we point out that the phylogenetic networks as defined here can be drawn on a sheet of paper.
The below result is straightforward, so we omit the proof.
Lemma 2
Let N be a phylogenetic network as defined above. Then N has 2n−2 + 2r vertices and 2n−2 + 3r edges, where r is the number of reticulate vertices.
The following theorem gives a bound on the number of reticulate vertices in a phylogenetic network.
Theorem 3
In any phylogenetic network N, and for any vertex v in N, the number of reticulate descendants, including v is nv−1 if v is not reticulate and nv if v is reticulate, where nv is the number of descendants of v that are leaves (terminal descendants). Then the number of reticulate nodes in N cannot exceed n−2, where n is the number of leaves.
Proof (by induction on nv)
Let v be a vertex in N, such that nv = 3. Let Nv be the network induced by v and its descendants. There is a network with three leaves and with exactly one reticulate vertex up to permutation of the leaf labels. Note that there can be at most one other reticulate vertex that can be added, including v (since no two reticulate vertices can have a common immediate ancestor and no two reticulate vertices are adjacent—one is not a child of another), so it matches with Nv. Thus, the number of reticulate vertices is at most nv−1.
Similarly, if nv = 2, the number of reticulate vertices in Nv is at most 1, including v. Thus, the statement of the theorem holds when nv = 1, 2, 3.
Now assume that v is any internal vertex with nv = n1 + n2, where n1 and n2 are the number of leaf descendants of the left v1 and right v2 child respectively. If n1, n2 > 1, then the number of reticulate vertices descendant to v cannot exceed n1 + (n2−1) = nv−1 (since both v1 and v2 cannot be reticulate vertices).
Suppose v is reticulate and has only one child v1. Then since v1 cannot be a reticulate vertex, the number of reticulate descendants of v1 is at most nv − 1. Thus, counting v, the number of reticulate descendants of v is nv.
Now the children of the root cannot be reticulate vertices. Thus, if v is a root, then there are at most (n1− 1) + (n2−1) = nv−2 = n−2 reticulate vertices in N. ■
In the following result, we give a constructive proof on how reticulate vertices can be added to a phylogenetic tree on n leaves to obtain a phylogenetic network with n−2 reticulate vertices. This establishes the tightness of the bound on the number of reticulate vertices that we allow in a phylogenetic network.
Theorem 4
Let N be a phylogenetic tree with n(≥2) leaves. Then n−2 edges can be added to N to obtain a phylogenetic network with n leaves and n−2 reticulate vertices.
Proof (by induction on n:)
The result holds for n = 2. Now assume n = k for an integer k > 2; and that the theorem holds for all phylogenetic networks with k−1 or fewer leaves. Let T be a phylogenetic tree with k leaves and let r be the root of the tree. Let v1 and v2 be the children of r in T; and let Ti, i = 1, 2 be the subtrees of T induced by vi and the vertices reachable from vi. We have the following two cases to consider.
Case (i): The number of leaves in T1 and T2 are two or greater and are n1 and n2 respectively. By induction hypothesis, ni−2 edges can be added to Ti to obtain a network Ni with ni leaves and ni−2 leaves, i = 1, 2. Now replace Ti by Ni in T to obtain a network N′ on n leaves and (n1−2) + (n2−2) = n1 + n2−4 = k−4 reticulate vertices. Now we will add two more reticulate vertices to N′, and we will be done.
Let wi, i = 1, 2 be a child of vi in N′. Now, remove the edges (r, v2), (v1, w1), (v2, w2) and add the edges (r, u1), (u1, v2), (u1, u2), (v1, u2), (u2, u3), (u2, w), (u3, u4), (v2, u4), (u4, w2) to N′, where u1, u2, u3, u4 are new vertices. This results in a network with n = k leaves and k−2 reticulate vertices.
Case (ii): Without loss of generality, let v1 be a leaf. Then the tree T2 has k−1 leaves. By induction hypothesis, k − 3 edges can be added to T2 to obtain a network N2 on k−1 leaves and k−3 reticulate vertices. Replace T2 by N2 in T to obtain a network N′ with k−3 reticulate vertices. Let w2 be a child of v2 in N′. Now remove the edges (r, v1), (v2, w2) and add the edges (r, u1), (u1, v1), (u1, u2) to N′, where u1 and u2 are new vertices. This results in a network with n = k reticulate vertices. ■
We now state an open problem, which holds for phylogenetic trees.
• Given any phylogenetic network N with n leaves, can reticulate vertices be added until we obtain a phylogenetic network with n−2 reticulate vertices? Given a phylogenetic network, what is the maximum number of reticulate vertices that can be added to obtain a phylogenetic network?
By Theorem 4, the answer for the first question above is yes if N is any phylogenetic tree with n leaves.
2.2. Tree representation of a network
The main idea here is to modify a phylogenetic network N so it becomes a rooted phylogenetic tree TN, constructed as follows: Assume that N has p reticulate vertices
Let v be any vertex in TN. We say that a reticulate vertex rq is reachable from v in TN if there is a directed path from v to either of rq or dq. Note that this is not a standard graph-theoretical definition of “reachability” in TN, but we are defining it this way since the correspondence between rq and dq in TN is necessary here.
We also note that it suffices to fix TN for vertex traversals and for storage in the case of the hardwired approach. For storage purposes in the softwired approach, we need to switch dynamically between the parents of each reticulate vertex. So, we will define trees in
2.3. Parsimony score
Let
Input: A phylogenetic network N with leaf labels [n] and a state assignment function λ over the alphabet Σ for N.
Parsimony criterion: For an extension
and
Output: Given
We note that
2.4. Network traversal
Preorder traversal of a phylogenetic network from a vertex v
1) Visit vertex v.
2) Recursively perform preorder traversal from each child that has not yet been visited.
Postorder traversal of a phylogenetic network from a vertex v
1) Recursively perform postorder traversal from each child that has not yet been visited.
2) Visit vertex v.
Since a phylogenetic network is a DAG, such traversals will visit all the vertices of the network exactly once, thus using a time-complexity of O(n). Refer to Schrijver (2003) for more details on existence of such traversals on DAGs. For the purposes of this article, we assume that the vertices of a network are uniquely labeled by integers. Note that the leaves are already labeled from the set [n]; and so we use other integers for other vertices.
3. Computing Parsimony Score With The Dynamic Programming Approach
In order to decide whether a problem P can be solved by dynamic programming, we must define a collection of subproblems with the following key property that allows them to be solved in a single pass: There is an ordering on the subproblems and a relation that shows how to solve a subproblem given the answers to smaller subproblems, that is, subproblems that appear earlier in the ordering. In other words, we require a recursive expression of the problem in terms of its subproblems.
For the problem of computing the parsimony score on a network N, we define the problem P as that of computing the optimum parsimony score on the tree TN. Let
Note that if N is a phylogenetic tree, then TN = N the problem of computing the optimal parsimony score has been described recursively by defining Sv(i) for any vertex as follows: For a vertex v with children vertices wl and wr,
where
In our previous article, we extended the same recursion for networks. We showed that the dynamic programming approach constructed from that recursion only produces a heuristic lower bound of the optimum. The reason for this is because the assignment of the vertices rq and dq may not necessarily be the same when the equality of states is necessary for any assignment on the network, in particular, the assignment that yields the optimum parsimony score. We also provided a method to fix this conflict and to obtain a heuristic upper bound for the quantity. In the earlier method, the dynamic algorithm yielded us an efficient algorithm albeit only for heuristic solutions. In this article, we provide recursive expressions that guarantee the exact solutions for each of the hardwired and softwired approaches.
3.1. For the hardwired approach
Let N be any phylogenetic network. For each vertex v in TN, we require a storage of a multidimensional array whose dimension is equal to one more than the number of reticulate vertices or tail vertices corresponding to the reticulate vertices reachable from v in TN. For the leaves of TN, we store the quantities Sv(i) for
Let
Let v be a vertex in TN and let
Now, suppose v is any vertex of TN. Then
Suppose v is a reticulate vertex and w is its only descendant. Note that v is reachable from v in TN. Without loss of generality, let vt = v. Then,
We also let
Suppose v is a tree vertex with descendant vertices wl and wr. Then the set
Subcase (i): If both wl and wr are tree vertices in N, then we have
We also let
Subcase (ii): Without loss of generality, if wr were a reticulate vertex rq or the tail vertex dq corresponding to a reticulate vertex rq in TN, then wr appears in the list
In this case,
We give the pseudocodes of the post- and preorder traversals below. The correctness of the algorithm follows from the correctness of the recursive expressions (2), (3), and (4), which can be verified easily.
1: Input: Network TN and the observed states from Σ at the leaves of N, i.e., a state assignment function λ over the alphabet Σ for N. Let
2: For each leaf v in N (and therefore a leaf of TN), let Sv(i) = 0 if λ(v) = i and
3: Repeat in postorder for each vertex v in TN with reachable reticulate vertices
4: Output:
We further define
where the minimum runs over all possible states (
1: Input:
2: Let
3: For each element in
4: Find optimal states by the preorder traversal of TN: For each vertex w with parent v whose optimal assignment is
5: Output: All optimal hardwired assignments.
In Figure 1, we present the subproblems of the hardwired approach for a character on a network.

An example showing one among the at-most two dimensions of storage at each vertex of a phylogenetic network for computing the hardwired score and all optimal assignments on the network. The costs of each state at each node are shown next to the corresponding vertex; the cost matrix considered here is given by cij = 1 if i ≠ j and 0 if i = j for the states
3.2. For softwired approach
Now we are ready to modify the above algorithm for the softwired approach. We need the following changes.
1) For the hardwired approach, TN was fixed, and we used it for both vertex traversal of N and also for cost storage by using the reachability definition. For this approach, although we will still traverse the vertices of N via TN for the storage of costs, we need to specify which parent of each reticulate vertex is used.
2) In the term
The only place we need to make additional changes is in Subcase (ii), where we include a new function depending on whether the vertex v is traversed before the other reticulate parent of wr, and if the left or the right parent is removed to construct the putative spanning tree.
3.2.1. Case 2
Subcase (ii): Without loss of generality, if wr were a reticulate vertex rq or the tail vertex dq corresponding to a reticulate vertex rq in TN, then wr appears in the list
where
and
Note that regardless of whether wr = rq or dq,
When performing the preorder traversal on TN, any conflict assignments are ignored. In Figure 2, we present the subproblems of the hardwired approach for a character on a network. This set of subproblems also give multiple solutions as described in the figure.

An example showing one among the at-most two dimensions of storage at each vertex of a phylogenetic network for computing the softwired score and all optimal assignments on the network. The costs of each state at each node are shown next to the corresponding vertex; the cost matrix considered here is given by cij = 1 if i ≠ j and 0 if i = j for the states
3.3. Problem complexity
Suppose p is the number of reticulate vertices in the network and k is the number of possible states of the character. Then we need O(kp+1) storage at each vertex for the hardwired approach and O(2 p k) for the softwired approach. If n is the number of leaves in N, by Lemma 2 there are O(n + p) vertices. Thus our algorithm requires O(mrk(n + p)) storage, where n is the number of leaves and m = k for the hardwired approach and m = 2 for the softwired approach. Also by Theorem 3, p is at most n−2, thus rendering the total storage to O(mpkn).
The tree traversal takes O(n), and to calculate the storage at a vertex, we need a calculation of O(k) for each of the O(kmpk) entries at each vertex. Thus the time complexity is O(nmpk2). We note however that if all solutions were to be found, the preorder phase can be exponential, as is the case in Sankoff's algorithm.
3.3.1. Reduction step
We include the following clause to reduce storage and time complexity:
In case 2 of the algorithms, if y = z = 0, we have a special scenario in which both children wl and wr have exactly the same set of reticulate vertices that are reachable in TN. When this occurs, we note that there is no subsequent vertex u in the postorder that is an ancestor of v, such that w is reachable from u. Suppose such a u exists, then one of the following holds, y = 1, z = 1, a contradiction. When this happens at a vertex v, the superscripts in
4. Comparing Networks Based On Costs And Redundant Reticulate Vertices
In this section, we deal with comparing scores between networks. Although this is not part of the problem of finding the optimal score of a network, we show that there are some features in network score that can be directly relevant in comparing networks based on scores, which will be useful later for the problem of searching for networks with best scores.
4.1. Comparable scores across all networks
We first note that the cost of the softwired approach of a network is comparable with the softwired approach of any other network that has possibly different numbers of reticulate vertices compared to the original if the cost matrix is a metric by the argument below. For the softwired approach, the optimal score corresponds to a spanning tree in the network, which has 2n−2+2r edges by Lemma 2. Although, the number of edges in a phylogenetic tree is only 2n−2 since the costs along r edges are effectively omitted in the softwired cost, and by the triangle inequality, we notice that the states at the reticulate vertices must match either the states of the parents being used or the states of their child. Thus the network score has to be equal to that of the corresponding phylogenetic tree that can be constructed by contracting the edges incident to the reticulate vertices in the optimal spanning tree.
On the other hand, the hardwired score is not comparable across networks because of the differing quantity of edges in the network depending on the number of reticulate vertices present (namely, 2n− 2+3r). To address this fact, we adjust the score by subtracting, for each reticulate vertex, the maximum substitution cost across either of its incident reticulate edges. By doing this, we convert the score to that across only 2n−2+2r edges. Using the triangle inequality of the cost matrix, the network score is higher than that of the phylogenetic tree. Note that the network assignment might also not be an optimal assignment on the phylogenetic tree. Thus, we note that networks are penalized in this approach. We also present an interesting result below, namely, for binary states and unweighted costs (i.e., with cost matrix with diagonal entries zero and one elsewhere), the adjusted net score is the same as that of the softwired approach as shown in the result below.
Proof:
We will prove the theorem for N with exactly one reticulate vertex. The general case can be proven by induction on the number of ret nodes.
The following are immediate:
i) adj(on) ≥ ot ii) on−1 ≤ adj(on) ≤ on, iii) on−1 ≤ ot ≤ on.
Case A: Suppose ot = on.
By ii), adj(on) ≤ on = ot. Thus, using i) and the above equation, we have adj(on) = ot.
Case B: Suppose ot = on−1.
Let α be an assignment on the nodes of N corresponding to softwired cost ot. Let w be the reticulate node with parents p1 and p2, with p2 not the parent of w in the optimum tree. Let c be the child of w.
Then without loss of generality (with 0 and 1 interchangeable), we have the following cases:
• α(p1) = 0; α(r) = 0; α(c) = 0 or 1. Then α(p2) = 1, for if α(p2) = 0, then on = ot, a contradiction to the assumption of the case. Thus α is also the optimal hardwired assignment on. Thus, adj(on) = on−1 = ot. • α(p1) = 0 or 1; α(r) = 1; α(c) = 1. Then α(p2) = 0, for if α(p2) = 1, then on = ot, a contradiction to the assumption of the case. Thus α is also a optimum hardwired assignment, and thus adj(on) = on−1 = ot.
Note that α(p1) = 0; α(r) = 1; α(c) = 0 is not possible, since α is an optimal softwired assignment. ■
Note that the above theorem is not extendable to general cost matrices or for nonbinary characters since we obtained counter-examples for these cases (that are not reported here). Thus in general, the adjusted hardwired score of a network is equal to or greater than the softwired score, and when they are equal, the number of solutions achieved by the adjusted hardwired approach is smaller than the number of solutions given by the softwired approach. For example, the adjusted hardwired cost of the network in Figure 1 is the same as its softwired cost of 1 (see Fig. 2). However, among the optimal softwired assignments α1(.), α2(.), and α3(.), only α1(.) and α2(.) are optimal adjusted hardwired assignments.
4.2. Redundancy in reticulate nodes
Another ingredient in comparing networks is to be able to decide if the reticulate vertices are indeed needed. To check this, we make use of all solutions for the given problem. For a given optimal solution for a set of characters, we call a set of reticulate vertices R to be redundant if the solution does not use any of the reticulate vertices—that is, only a same parent of each vertex in R was used in the optimal spanning tree of all the characters. Thus the maximum number of redundant reticulate vertices can be computed by looking at all solutions of the characters formed by considering all combinations of solutions of the individual characters and computing the number of reticulate vertices that are redundant under each solution. Now, if the reticulate edge from the unused parent of a redundant vertex is removed, and any vertex of degree two is contracted, we will obtain a network with the same softwired cost and the same or lower hardwired cost. The list of unused parents can be computed immediately after the postorder traversal for the softwired approach; but only after the adjusted cost is computed (after the preorder traversal of finding all assignments) for the hardwired approach. Nevertheless, we need this computation only when we want to decrease or increase the number of reticulate vertices present.
In Figures 3 and 4, we give simple examples of when a tree is preferred and when a network with reticulate vertices is preferred for different datasets.

Example of a character dataset that displays a tree that models better than networks with reticulate vertices. The input set of three characters on two states “A” and “C” are given in the leaves. It can be seen that there is perfect phylogeny for this set of characters, namely, there is a tree (on the left) such that each of its internal edges splits a unique character into the two states (the edges that split the different characters are shown in dotted lines with the corresponding character site written next to each of them). Thus the tree is an optimal tree with parsimony cost of three, which equals the number of characters. Note that this is the minimum cost that can be attained for this dataset. Shown on the right is a network with the same input dataset. The softwired and adjusted hardwired parsimony costs of the network is three. However, it can be noted that the reticulate vertex 14 is redundant since the edge (13, 14) may be removed to render the same cost. In general, the network cost of this dataset on any network with at least one reticulate vertex will either have a cost of four or higher or if the score is three, then the set of all its reticulate vertices will be redundant. Thus the score on the left hand tree is better than the score on the right.

Example of two binary characters that are not compatible on a phylogenetic tree. Thus the parsimony score of the characters on any phylogenetic tree of six leaves has a cost of three or more. Shown in the figure is a network with one reticulate vertex with softwired and adjusted hardwired scores equal to two. This can be seen by the two dotted edges that allow one substitution each in different characters. While the first character has an additional substitution on the reticulate edge (11, 10), the second character has an additional substitution on the other reticulate edge (7, 10). Thus the two characters choose different optimal spanning trees (whose reticulate edges are shown in different colors with the corresponding character written next to them) inside the network, and the reticulate vertex 10 is not redundant. Hence, the score on this network is better than the score on any tree.
5. Testing A Real Dataset
We implemented our algorithm in OCAML, and it will be made available in the upcoming versions of the POY software (Varón et al., 2010). In order to see how our algorithm performs, we tested it on real sequence data collected from ant species complex in which hybridization could be common. Hybridization events have been observed in the evolutionary histories of ants belonging to various genera, such as Formica (Seifert et al., 2010) and Dorylus (Kronauer et al., 2011). We chose a fraction of a dataset taken from the Camponotus maculatus species complex, the full version of which will appear in Clouse et al. (2010). The dataset consisted of a marker each from the nucleus (LR, long wavelength rhodopsin) and mitochondrion (COI, cytochrome c oxidase I) from 82 specimens collected throughout the Indo-Pacific. The data gathered was reduced to a manageable size by first conducting tree searches in POY (Varón et al., 2010) for the two markers separately. The sequences of 57 taxa showed clear conflict in phylogenetic placement or were needed to establish the main lineages in the complex retained in the dataset. The mitocondrial and the nuclear sequences were aligned and concatenated. As shown in Table 1, the softwired parsimony cost of the network shown in Figure 5 that was formed by adding two reticulate vertices to the most parsimonious tree formed from nuclear sequences was lower than the sum of the best parsimony scores of the individual mitochondrial and nuclear sequences. This network was compared against other random networks with two reticulate vertices for costs and by far has the lowest hardwired and softwired cost. While it is expected that with the addition of more and more reticulate vertices, the softwired cost will not be lesser compared to the tree cost, we did not see any decrease in cost as we added more than two reticulate vertices. Additionally, we identified redundant reticulate vertices in all the random networks we generated with three or more reticulate vertices. Also shown in the table are the hardwired costs with the adjusted scores in parenthesis. Although the hardwired scores and their adjusted scores are higher than the softwired costs for the network, the given network has the best hardwired cost among all the phylogenetic networks with two or more reticulate vertices that we considered. In both the approaches, the network did not have any redundant vertices.

Experimental real data showing hybridization events in the Camponotus maculatus species complex. The blue and red edges correspond to the spanning trees that are most parsimonious to most characters in the LR and COI markers respectively.
Aligned mitocondrial and nuclear sequences of 57 taxa taken from Camponotus maculatus species: (i) Sequence type; (ii) length of sequence; (iii) tree cost; (iv) softwired network cost; and (v) hardwired network cost (adjusted cost).
While the softwired score can be used to compare networks directly, the hardwired scores can only be used to compare networks with the same number of reticulate vertices. In both the approaches, the number of reticulate vertices must be checked for redundancy. This will help place a bound on the number of reticulate vertices that we allow in the networks.
6. Conclusion
In this article, we provided a dynamic programming solution to compute exactly the hardwired and softwired parsimony scores on phylogenetic networks. As opposed to the hardwired approach, the subproblems that we consider for the softwired approach are finding the costs of each state at each vertex for each subgraph by removing one reticulate edge each of the reticulate vertices reachable from the vertex. We showed that although the subproblems are different, they can both be solved by the dynamic programming approach using the same vertex traversal order. We also gave a reduction step that reduces both the storage and time complexities of the algorithms.
We noticed that the two approaches have some properties that suggest the existence of simpler algorithms. First, for binary states and binary cost-matrix, the softwired and hardwired scores are the same; although, the number of solutions can differ. It is well known that the hardwired problem is polynomially solvable. It would be interesting to derive a polynomial algorithm for the softwired approach as well. Second, for planar graph and binary cost-matrix, it has been proven in Hartvigsen (1998) that the hardwired problem is polynomially solvable. It would thus be possible to obtain an algorithm for computing the hardwired score on phylogenetic networks, perhaps similar to the Fitch algorithm (Fitch, 1971) for phylogenetic trees. Note that by Lemma 1, the networks that we consider are planar.
The algorithms provide us with the states at the reticulate vertices that fetch the optimal assignments. For the softwired approach, this readily gives us a method to realize if a set of vertices are redundant. Maximizing the size of the redundant networks helps us to realize a network with a minimum number of reticulate vertices. For the hardwired approach, we provide a method to adjust the cost by removing a reticulate edge for each reticulate vertex that has the highest traversal cost. This gives a score that is comparable across different networks with different vertices, a feature that is readily available in the softwired approach. Additionally, redundancy can also be incorporated in the hardwired approach after the adjusted scores are found. Thus, we provide a method to compare network cost to navigate through the space of networks. We also note that although the softwired approach in theory is more challenging to tackle, we have a faster exact algorithm for this approach that readily gives a method to compare networks based on the cost and also the number of redundant reticulate vertices. Methods such as these will help search for the best networks with an optimum number of reticulate vertices, the next major challenge that needs to be addressed to find the “Net of Life.”
Footnotes
Acknowledgments
We thank Nicolas Lucaroni for suggestions provided while implementing the algorithms. We sincerely thank Ronald Clouse and Milan Janda for Camponotus sequence data, which were collected with the following support: DARPA (W911NF-05-1-0271); Marie Currie Fellowship (PIOFGA2009-25448); Czech Science Foundation (P505/10/0673; 206/09/0115); Czech Ministry of Education Grants (LC06073; 6007665801); and Putnam Expedition Grants (Museum of Comparative Zoology). We also thank Anupam Kumar and Sameer Siddiqi, who filtered the dataset to a manageable size. This material is based upon work supported by, or in part by, the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number W911NF- 05-1-0271.
Author Disclosure Statement
The authors declare that no competing financial interests exist.
