Abstract
Minimum contradiction matrices are a useful complement to distance-based phylogenies. A minimum contradiction matrix represents phylogenetic information under the form of an ordered distance matrix
Introduction
The discovery of the importance of lateral transfers, losses and duplications events in the evolution of genetic sequences has motivated the development of new approaches to graphically represent phylogenies. Methods like NeighborNet (Bryant and Moulton, 2004), T-Rex (Makarenkov et al. 2006), SplitTrees (Bandelt and Dress, 1992; Dress and Huson, 2004; Huson, 1998), Qnet (Grünewald et al. 2006), Pyramids (Bertrand and Diday, 1985), Tree of Life (Kunin et al. 2005a) allow visualizing deviations from a tree topology. All these methods have in common that they summarize the information in the form of a planar network. Deviations from an X-tree are often represented by supplementary edges (Makarenkov et al. 2006; Nakhleh et al. 2004) that create cycles in the graph.
Phylogenetic information can be represented by a distance matrix
With the availability of complete genomes, many methods have been proposed to determine the evolution of whole genomes (For reviews see Galperin et al. 2006; Delsuc et al. 2005; Henz et al. 2005). The construction of trees from whole genomes has proved over recent years to be a quite difficult task. This is mainly because of the very limited number of genes shared by Archaea, Eukaryota and Bacteria. Furthermore, gene evolution can sometimes be very different from species evolution. The main difficulty consists in finding a good operator to estimate the distance between genomes. Distances have been estimated with measures based on gene order or arrangement (Wolf et al. 2002; Wang et al. 2006), gene content (Fitz-Gibbon and House, 1999; Snel et al. 1999; Korbel et al. 2002), protein domain organization (Fukami-Kobayashi et al. 2007; Yang et al. 2005), folds (Lin and Gerstein, 2007), combining the information from many genes in a supertree or a superdistance (Dutihl et al. 2007 for a comparative study) or using a local alignment search tool such as Blast (Kunin et al. 2005b; Clarke et al. 2002). Among genome distances obtained with Blast, the genome conservation (Kunin et al. 2005b) has furnished some of the best trees up to date, if the quality of a whole genome phylogeny is measured by its concordance to broadly accepted classifications. The genome conservation estimates the distance between two taxa using the sum of BlastP reciprocal best hits between two genomes. The method is capable of quite correctly recovering all main phyla. At the phylum level, the evolution of the different genes is sufficiently similar to form a distinct cluster. The main uncertainties in whole genome phylogenies are on the relationships between phyla. Different evolution rates of the genes, gene losses or duplications, lateral gene transfer may result into large deviations of the distance matrix from a tree topology. In this context, minimum contradiction matrices can furnish information not contained in a single tree or a split network.
The paper is organized as follows. After introducing minimum contradiction matrices in section 2 and their connection to Robinson matrices and Kalmanson inequalities, section 3 explains why the identification of deviations from perfect order is a useful complement to phylogenetic studies. Section 4 presents an algorithm to search for the order minimizing a measure of the deviation from perfect order over all taxa. This order can be interpreted as an average best order over all reference taxa
Circular Order and the Minimum Contradiction Approach
Definitions
Let us start by recalling a number of definitions that are necessary to introduce the notion of circular order. A graph G is denned by a set of vertices V(G) and a set of edges E(G). Let us write e(x, y), the edge between the two vertices x and y. In a graph G, a path P between two vertices x and y is a sequence of non-repeating edges e(x1, z1), e(z1, z2),…, e(z i ,y) connecting x to y. The degree of a vertex x is the number of edges e ∊ E(G) to which x belongs. A leaf x of a graph is a vertex of degree one. A vertex of degree larger than one is called an internal vertex.
A valued X-tree T is a graph with X as its set of leaves and a unique path between any two distinct vertices x and y, with internal vertices of at most degree 3. The distance d between leaves satisfies the classical triangular inequality
A central problem in phylogeny is to determine if there is an X-tree T and a real-valued weighting of the edges of T that fits a dissimilarity matrix δ. Typically, a dissimilarity matrix δ corresponds to an estimation of the pairwise distance d(x
i
,x
j
) between all elements in X. A necessary and satisfactory condition for the existence of a unique tree is that the dissimilarity matrix δ satisfies the so-called 4-point condition (Bunemann, 1971). For any four elements in X, the 4-point condition requires that
Consider a planar representation of a tree T or a split network S. A circular order corresponds to an indexing of the n leaves according to a circular (clockwise or anti-clockwise) scanning of the leaves (Barthélémy and Guénoche, 1991; Makarenkov and Leclerc, 1997, 2000; Yushmanov, 1984).
In an X-tree, a circular order has the property that for any integer k (modulo n), all the branches on the path P(x
k
, xk+1) between x
k
and xk+1 correspond to the left branch (or right branch if anti-clockwise). A circular order can be obtained by considering the distance matrix

The distance matrix
The above inequalities characterize also a Robinson matrix (Christopher et al. 1996; Thuillard, 2007). Using the definition of
These inequalities have a similar form to the 4-point condition (2) and are known as the Kalmanson inequalities.
In real applications, the distance matrix
The best order of a distance matrix is, per definition, the order minimizing the contradiction. The ordered matrix
For a perfectly ordered X-tree, the contradiction C is zero. A tree with a low contradiction value C is a tree that can be trusted, while a high contradiction value C is the indication of a distance matrix deviating significantly from an X-tree.
Kalmanson inequalities are at the center of a number of important results relating convexity (Kalmanson, 1975), the Traveling Salesman Problem (TSP) (Deineko et al. 1995; Korostensky and Gonnet, 2000), phylogenetic trees and networks (Christopher et al. 1996; Dress and Huson, 2004). Let us explain why perfect order is an important property.
If the error on the distance in an X-tree is not greater than xmin/2 with xminthe shortest edge on the tree, then the Neighbor-Joining algorithm will recover the correct tree topology and Kalmanson inequalities hold (Atteson, 1999; Korostensky and Gonnet, 2000).
If a distance matrix d fulfills Kalmanson inequalities, then the distance matrix can be exactly represented by a split network (Bandelt and Dress, 1992).
If Kalmanson inequalities are fulfilled, then the tour (1,2,…,n) corresponds to a solution of the Traveling Salesman Problem (Christopher et al. 1996).
The last result can be demonstrated starting from the sum
The solution to the TSP has the Master Tour property (Deineko et al. 1995). A Master Tour is a solution of the TSP with the property that the optimal tour restricted to a subset of points is also a solution of the reduced TSP. This result follows directly from the inequalities for perfect order
Fast algorithm to search for the best order
The choice of the reference taxon n in
The contradiction over all n reference taxa is given by
The best order is the order (1, …, i0, …, j0, …, n0) minimizing the contradiction. The computation of the contradiction requires O(n4) operations. For a large ensemble of taxa, the computational cost may become quite high. We will therefore introduce below an algorithm requiring only O(n3) operations to compute a (slightly different) measure of the contradiction.
Let us start by considering an X-tree and the 3 vertices i j, k as in Figure 2. The distance matrix fulfills the inequalities for perfect order. The order between the vertices i, j, k is preserved for any reference vertex not in the interval (i, k) and the inequalities

The inequalities
If the contradiction c
i, j
between the vertices i, j is defined as the sum of two terms
The quantities Sa and Sb in Eq. (6) can be related to the NJ algorithm. For 3 consecutive vertices (i, j = i + 1, k = i + 2), Eq. (6a) can be written, assuming perfect order, as
The value S
i, j
is central to the NJ algorithm (Saitou and Nei, 1987; Gascuel and Steel, 2006). Two vertices i, j are joined by the NJ algorithm, if they maximize S (i.e. max(S) = S
i, j
). From the above discussion, it seems natural to initialize the search for the best order on the NJ tree. The search for the best order of
For whole genome phylogenies, the search for appropriate measures to estimate the evolutionary distance between taxa is still the subject of significant research efforts (Korbel et al. 2002; Kunin et al. 2005b; Yang et al. 2005; Fukami-Kobayashi, 2007). Distance matrices obtained from BlastP scores have been quite successful to generate good trees. The similarity score obtained with BlastP programs can be given a probabilistic interpretation. The statistics of high scoring segments in the absence of gaps tends to an extreme value distribution (Karlin and Altschul, 1990). The probability P of finding at least a high scoring segment is well approximated, for small values of P, by the formula P = m1·m2·2−Score with m1, m2 the length of the 2 sequences. It follows that Score = –log2 P + log2(m1·m2). Denning the distance d between two sequences as d = –Score and assuming equal lengths one has d = log2(P/m2). Using that definition, the distance matrix
The log term has the form of a mutual information and furnishes a measure of the similarity of the genomes i and j in reference to the genome n.
Different approaches have been proposed to normalize the distance matrix using the marginal entropy (Kraskov et al. 2005), the self-score (Kunin et al. 2005b), Korbel normalization (Korbel et al. 2002) or the average score. The normalization by the self-score in the genome conservation gives some of the best results. It is based on a nonlinear weighted sum of the BlastP scores. The gene conservation method computes the distance between two taxa by normalizing the sum of reciprocal best hits between genome i and j by the self-score. The effect of duplication is limited by using only reciprocal best hits. The normalization by the self-score is important to correct, at least partially, the effect of different genome sizes. The genome conservation similarity matrix is given by
Search for the best average order
The algorithms described in section 4 have been used to search for the best order. The distance matrix was computed using the data furnished by the genome phylogeny server (Kunin et al. 2005b) obtained with an e-value cut-off set to 10−10. The contradiction is significantly lower with the score (1 – S i, j ) than with the logarithm of the score. Figure 3 shows the best order after optimization with the algorithms described in section 4 followed by 5000 steps of the multiresolution search algorithm using Eq. (7) to compute the contradiction.

Table 1 gives the order of the different taxa corresponding to the best order. Archaea and Eukaryota are grouped into two adjacent clusters of taxa. One observes, for Bacteria, that all the members of a class or a phylum are neighbors. All proteobacteria (together with Aquifex?) are grouped together. The best order obtained with the minimum contraction approach differs from the NJ tree on the following aspect: all spirochetes and δ-proteobacteria form a cluster. This is not the case of the NJ tree.
(see annex for detailed list of taxa).
This article focus on the mathematical aspects of Minimum Contradiction Matrices. We will limit the discussion to 3 examples showing how to interpret Minimum Contradiction Matrices. The matrix

Distance matrix
The best order in Figure 3 is obtained by minimizing the contradiction using all taxa as reference vertex at least once. The best order is therefore a kind of “average” best order. The matrix
Many contradictions in Figure 5 can be associated to well accepted endosymbiotic events (Chloroplasts in plants or mitochondria in Eukaryota). Figure 5a shows

Distance matrix
Figure 5b shows the distance matrix using all Cyanobacteria as reference taxa. The elements associated to Arabidopsis and Cyanidioschyzon have lower values than both adjacent lines (resp. columns). The observed contradictions for Arabidopsis and Cyanidioschyzon merolae (a plant and a red alga) may be explained by the many genes that are found in both Cyanobacteria and plants/red alga but absent in other Eukaryota, a hypothesis that is supported by the small value of the distance between Cyanobacteria and (Arabidopsis, Cyanidioschyzon). Chloroplasts in plants and red alga are generally considered to have originated as endosymbiotic Cyanobacteria. The low values of
Conclusions
For an X-tree or a split network the minimum contradiction matrix
An average best order can be obtained by searching for the best circular order over
Footnotes
Disclosure
The authors report no conflicts of interest.
