Abstract
Abstract
We give an algorithm for discovering co-evolution in biosequences from a dataset consisting of aligned data and a phylogeny. The method correlates vectors of parsimony scores on the edges of a graph, averaged over all optimally parsimonious reconstructions of the data. We describe an efficient data structure, and a preprocessing step that allows for rapid, interactive computation of many correlation scores, at the expense of storage space.
1. Introduction
U
One approach to discovering such interacting pairs is based on the observation that two molecules which interact tend also to evolve together, a process called co-evolution. Any method which allows us to quantify evolution also allows us to measure co-evolution, and therefore make conjectures about interaction. This has been done, for example in Ramani and Marcotte (2003), using similarity matrices, in Kim et al. (2006), using mutual information, and in Jothi et al. (2005), using tree topology, respectively, to quantify evolution and co-evolution.
In the present paper, we use parsimony scores to quantify evolution, and correlation of vectors of scores to quantify co-evolution. We also use a pre-processing step to allow for rapid computation of correlations.
2. The Parsimony Graph
We begin with a precise description of the problem, which may be thought of as an abstract problem on strings. Let
Let

A rooted binary tree with five leaves, and the strings on those leaves. Two domains in these sequences, S[1, 3] of size 3 and S[8, 8] of size 1, are shown to the right.
2.1. Parsimony
The art of inferring the evolutionary histories of species that we see in the present is called phylogenetic reconstruction, and consists of constructing a hypothetical tree describing the speciation events that took place as they evolved from a common ancestor, and constructing hypothetical sequences on the non-leaf nodes in the tree representing best guesses as to what the molecules might have been in those ancestors. In this paper, we will assume that a tree has been found already, and will concentrate solely on reconstructing and studying the ancestral sequences. Of course, any such reconstruction represents nothing more than an educated guess. So we use the notion of parsimony to decide which guesses are better than others, and then select the best.
Notice that, in the computation of pc(T), we compare the j th character of any string only with the jth characters in other strings, so that, if we let Δ (x, y) = 0 if x = y and 1 otherwise, then
and our computations may be done on one character at a time. Let us therefore assume that the strings on the leaves are all of length one, and that we seek to find a single character for each internal node to minimize the total parsimony cost. Such an assignment will be called a reconstruction. This problem was solved by Fitch in 1971, but we will not use his method here (Fitch, 1971). Rather, we will take a dynamic programming approach similar to that used by Blanchette (2002).
2.2. Dynamic programming
The quantity we wish to consider is the following: For each node v of the tree and each x ∈ A, what is the minimal cost we could realize on the subtree consisting of v and its descendents (relative to the root of T) if we placed x on node v? We denote this cost by c(vx). For example, for the alphabet {a, c, g, t} and the tree shown in Figure 2, we have c(ua) = c(uc) = 1 and c(ug) = c(ut) = 2, while c(wa) = 1, c(wg) = 2 and c(wc) = c(wt) = 3.

The labels on the leaves of the tree, and the names of the internal nodes, are shown.
Let us introduce a data structure called a parsimony graph, whose vertices are {vx|v ∈ V (T), x ∈ A}, and whose edges we will define shortly. The shape of this graph is roughly that of the shape of T, but with each vertex of T replaced by k “character vertices.” Thus, the node vx would represent the placement of character x ∈ A on node v, and c(vx) would be the cost defined above. We observe that the value c(vx) depends on the values of c(wy) for each child w of v in T and each y ∈ A. In particular, if wL and wR are the left and right children of v in T, then
where Δ
xy
= 0 if x = y, and 1 otherwise. (This is 1 − δxy, where δ is the usual Kronecker delta function.) To see why equation (2) holds consider the best cost that can be obtained by placing character x on node v. For each label y that could be used on the left child wL we obtain a cost of 0 on the edge (v,wL) if y = x, and a cost of 1 otherwise. Additionally, we consider the minimal cost of the subtree of wL when wL is labeled with y. Finally, we take the minimum over all choices for y. Independently we make the same computation on the right subtree, and add together the two minimal costs. The “base cases” on the leaves are treated as follows:
We use ∞ to signify that we may not use a different character than that given.
2.3. Adding the edges
The parsimony graph is simply the memoization step of the dynamic program described above. In particular, if v has children wL and wR in T and

The parsimony graph for the tree shown in Figure 2, over the alphabet {a, c, g, t}. The circled characters show the labels given on the leaves, and the numbers show the costs c(vx) for placing character x on node v in the tree.
A glance at the parsimony graph reveals much about the optimal reconstructions. First of all, the only optimal choice for the root of the tree is “a,” as that yields cost 2 while all other character choices yield higher cost. Furthermore, following the edges down from character “a” at the root, we find that its left child must be labeled with “a” also, since no edge goes to any other character there. Similarly, the right child must be labeled with “a,” and its right child must be labeled with “a.” In this fashion, the parsimony graph encodes all optimal reconstructions. In our example, there was just a single optimal selection. In general, the number of optimal reconstructions may be exponential in the number of leaves, but the parsimony graph will always have |A||V(T)| nodes and no more than |A|2 · (|V(T)| − 1) edges. That is, for any fixed alphabet A, the graph size will be linear in the number of leaves of the tree.
2.4. Average parsimony vectors
The parsimony graph allows us to compute the number of optimal reconstructions as well. Again we use dynamic programming, but we re-use the memos (edges) already computed above. For each character node vx, let us define n(vx) to be the number of optimal reconstructions of the subtree of T rooted at v, given that character x is used on node v. Then
where wL and wR are the children of v in T. Equation (4) says that the number of trees is the product of the number of optimal left subtrees times the number of optimal right subtrees, since they may be selected independently. The number of optimal reconstructions for the whole tree is then given by
where r is the root of the tree. This computation can be done in time linear in the number of leaves, even including the step of building the parsimony graph in the first place. Finding this number in linear time was also done in Rinsma et al. (1990) using generating functions.
Given an optimal reconstruction, we may place on each edge (v,w) of T a value 0 or 1 according as v and w get the same or different characters, respectively, in that reconstruction. If we order the edges of T, then this placement becomes a binary vector. The ordering we use is a “post-ordering” of the edges, where each edge (v,w) gets a number only after the left and right subtrees of w have been numbered, in that order. We call this ordering of the edges a “post-ordering” because it arises from a traditional post-ordering of the vertices by moving the number on each vertex to the edge above it. (We use a post-ordering because it places the edge from the root to its right child last. As described in Section 2.3, this edge will always have cost 0, and will therefore contribute nothing to our correlations. We may therefore ignore the last component of the vector, retaining only the initial 2k − 3 components.) From an evolution point of view, this vector may be thought of as a cross-section of the evolutionary history of the domain. It indicates those times when a change took place in the sequence comprising that molecule. If we find another molecule whose changes correlate with the changes in this molecule, we may see that as evidence that the molecules have co-evolved.
Prior to correlating, however, we have two considerations: First, there may be more than one optimal reconstruction, and second, that each domain may have length greater than one. Longer domains are easily handled by our previous observation that the parsimony cost of a domain is simply the sum of the parsimony costs of the individual positions in that domain. But handling the plurality of optimal reconstructions presents a whole research area in itself. For example, a tree with 15 leaves and 200 optimal reconstructions can be thought of as a set of 200 binary vectors in 27-dimensional space, and we need some way to correlate such a point set with another point set of the same dimension but perhaps of a different size. In this paper we represent the set of vectors by the average of the vectors, giving us just a single vector of floating point numbers representing a domain, so that correlating two domains can be done by correlating two vectors. But one could use Tukey depth (Tukey, 1975) or any of the other notions of data depth to identify some central point to use as a representative of the domain. We have not made any explorations in this direction yet.
To find the average optimal reconstruction we do not want to actually find all optimal reconstructions and take their average, as the number of optimal reconstructions may be very large. Instead, for each edge (v,w) of T we count the number of optimal reconstructions that place different characters on v and w, and divide by the total number of reconstructions. This is done in linear time for each edge by removing from the parsimony graph all edges (vx,wy) for which x = y, making use of Equations (4) and (5) to get the desired count, and then restoring the parsimony graph. The total time to find the average vector for one character in the string is therefore quadratic in the number of leaves of T. Let us denote by
3. Correlation
Now that we have quantified evolution, we may infer co-evolution of two domains by correlating the average vectors associated with those domains, where the average vector associated with a domain is the sum of the average vectors associated with each character location in that domain. We will use the Pearson correlation coefficient which, for two vectors
In our case, however, a and b are sums of average vectors. Suppose that we wished to correlate the average vectors for the domains D1 = S[k1, l1] and D2 = S[k2, l2]. Then our correlation is given by:
where the notation (·) i selects the ith component of the vector sum.
If D1 and D2 have lengths L1 and L2 respectively, and k is the length of the vectors, then the time to compute this correlation would be O(k · (L1 + L2)). Since we wish to search the input strings for highly correlated pairs of domains, we will be performing O((n1 − L1) · (n2 − L2)) of these correlations, for strings of length n1 and n2, which for long strings could become prohibitively expensive. If we knew the size of the domains that we were concerned about correlating, then at the start of our search we could pre-compute the sums of the average vectors for each domain of that length, reducing the time to compute each correlation to O(k).
3.1. Precomputation
When the lengths of the domains are also parameters which the user may wish to vary in real time, recomputing the correlations each time a domain length changes becomes prohibitively time-consuming, particularly when they get very long. Here we describe a bit of precomputation which, when complete, allows us to correlate any pair of domains, regardless of length, in time O(k).
For any character position p, 1 ≤ p ≤ n and any edge e of T, let us define the functions:
We further define E (0, e) = 0 for all edges e in T, and T(0) = 0. These may be computed in time O(k n), where k is the number of leaves of the tree and n is the length of the strings.
Once this computation has been performed once, say, when a file containing the biosequences has been loaded by some computer program, it becomes possible to compute the correlation between any two domains in time O(k), regardless of the size of the domains. Indeed, one loop over the edges of the tree computes in time O(k):
We may then compute the correlation coefficient:
The storage space for the precomputed table is O(n k), which may be prohibitively great for long sequences. In such cases, one may instead store E(p, e) and T(p) only for values of p which are multiples of M, using a value for M which is large enough to enable the table to fit into memory, but still as small as possible. Then quantities such as E(l1, e) − E(k1 − 1, e) may be computed using the identity:
where M · k′1 is the least multiple of M greater than or equal to k1, and M · l1′ is the greatest multiple of M less than or equal to l1. This gives constant time for the two table lookups, and time O(M · k) to compute the two sums.
3.2. Java implementation
This algorithm has been implemented in Java, as shown in Figure 4. The program allows the user to adjust the sizes of the domains interactively, as well as the threshold of correlation, and gives a graphical representation of those portions of the string, of the user-set lengths, which have correlation above the threshold. Source code is freely available by writing to the corresponding author.

Screenshot of a Java implementation.
The algorithm has been tested on simulated data, as follows: In each experiment we generate a random binary tree with 20 leaves, and place a random DNA string (alphabet = {a, c, g, t}) of length 2000 on the root. Ten disjoint “domains” of length 12 are selected at random in this string, and a directed acyclic graph G is generated at random, having these ten domains as vertices, with arcs between two domains that are to coevolve in this simulation. (Seven was a typical number of arcs.) Next we repeatedly mutate the string from parent to child until each leaf has a string on it. (We have no insertions, deletions or rearrangements in this model of co-evolution.) For mutating we select two probabilities: pin for mutating characters in domains and pout for mutating characters which are not in any domain. If the i th character in the parent is not in any domain, then the ith character in the child's string will be the same with probability 1 − pout, and each of the other three characters with probability pout /3. If the ith character is in a domain, then we distinguish two cases: If the domain has no directed arc to another domain, then we mutate it just as we did for non-domain characters, but with probability pin in place of pout. If the domain has directed arcs to other domains, say D1 … Dk, then we first recursively mutate those domains and afterward, look at what happened. For each Di we compare the domain in the parent to the domain in the child, count the number ci of changes (mutations) that occur, and let pi = ci /12 represent the observed probability of mutation inside the ith domain. Finally, we let P = min i pi, and mutate the characters inside this domain as we did for characters not in any domain, but using P in place of pout. This has the effect of suppressing mutation in domains when mutation does not occur in domains with which they are co-evolving, and encouraging mutation when many mutations happen in their co-evolving partners. C++ source code for generating such test cases (but with commandline parameters to vary the constants mentioned above) is also freely available from the author.
When the correlation algorithm is run and highly correlated pairs of domains of length 12 were output, it was typical to find such things as domain S (100, 111) highly correlated with S (350, 361), and also highly correlated with regions S (351, 362) and S (352, 363). So we put in a post-processing clustering step so that, within each cluster, only the most highly-correlated pair of domains would be output.
Tests on the simulated data indicate that approximately one-third of co-evolving pairs are discovered by the correlation algorithm. We tested as follows: Once the digraph was generated, we knew how many co-evolving pairs we should expect to find. Then in the Java program, we set the threshold so that twice that many pairs would be presented to the user, and we counted a success if among them we found a pair of domains both of which overlapped at least 9 of the 12 characters of the domains in some pair that was present in our simulated data. For example, in one set of runs we generated five sets of data (each consisting of 20 strings of length 2000) using probability pin = pout = 1/10. Each had ten domains, the numbers of arcs in the diagraph were 7, 7, 8, 5, and 9, and the Java program found 3, 2, 2, 3, and 3 of the pairs, respectively. These results get worse once the value of pin drops below approximately 1/20 (for domains of length 12) but don't change appreciably for larger values of pin.
3.3. Contrast with “footprinting” and further uses
In phylogenetic footprinting, one searches for regions of the strings which have mutated less than the surrounding regions, and identify these conserved regions as potentially significant. With the algorithm herein proposed we can find regions whose mutation rate is the same as the background string, or, indeed, higher. What we're measuring is not the amount of change, but the relative rates of change across the phylogenetic tree. Where footprinting is looking for a single region with low elevation, we're looking for pairs of regions having similar terrain, regardless of elevation. Which raises the possibility of using this algorithm as a data mining technique, wherein one wishes to find features which individually may not stand out from the general population, but which can be discerned by virtue of their correlation to other, equally unnoticeable features.
Footnotes
Disclosure Statement
No competing financial interests exist.
