On the Complexity of Rearrangement Problems under the Breakpoint Distance

Abstract

We study the complexity of rearrangement problems in the generalized breakpoint model of Tannier et al. and settle several open questions. We improve the algorithm for the median problem and show that it is equivalent to the problem of finding maximum cardinality nonbipartite matching (under linear reduction). On the other hand, we prove that the more general small phylogeny problem is NP-hard. Surprisingly, we show that it is already NP-hard (or even APX-hard) for a quartet phylogeny. We also show that in the unichromosomal and the multilinear breakpoint model the halving problem is NP-hard, refuting the conjecture of Tannier et al. Interestingly, this is the first problem that is harder in the breakpoint model than in the double cut and join or reversal models.

1. Introduction

In this article, we study several rearrangement problems in different variants of the breakpoint model. Throughout evolution, large-scale rearrangement mutations, such as inversions or translocations, affect the order of genes in a genome. When two genes (or conserved segments or markers) are adjacent in one genome, but not in the other, we call this position a breakpoint. We can then define the breakpoint distance simply by counting the number of breakpoints.

The breakpoint model was originally introduced by Sankoff and Blanchette (1998) who tried to reconstruct the ancestral gene orders, given a phylogenetic tree and gene orders of the extant species, based on the parsimony criterion, minimizing the sum of distances along the branches of the tree. This is known as the small phylogeny problem.¹ Unfortunatelly, the problem is NP-hard already when we have three species—an important special case known as the median problem. In fact, the median problem turns out to be NP-hard for almost all rearrangement distances [breakpoint (Bryant, 1998; Pe'er and Shamir, 1998; Tannier et al., 2009), reversal (Caprara, 2003), and double cut and join (DCJ) (Tannier et al., 2009)].

One notable exception is the general breakpoint model. Tannier et al. (2009) observed that if we drop the condition that genomes are unichromosomal and that all chromosomes are linear, we get a simple model where the median problem is solvable in polynomial time. Even though this model is not very biologically plausible and more realistic models exist, the breakpoint model may still be useful for upper and lower bounds, and solutions in this model may serve as good starting points for the more elaborate and complicated models.

1.1. Previous results and our contribution

There are several variants of the breakpoint model depending on what karyotypes we allow. In the unichromosomal (linear or circular) model, the genome may only consist of one chromosome. In the multilinear model, the genome may consist of multiple linear chromosomes, and finally, the mixed model allows for any number of linear and circular chromosomes (even though this is not biologically plausible).

For the unichromosomal model, Pe'er and Shamir (1998) and Bryant (1998) showed that the median problem is NP-hard. This result was extended to the multilinear model by Tannier et al. (2009), and Zheng et al. (2008) showed the NP-hardness for a related problem called guided halving (see Preliminaries).

Curiously, the ordinary halving problem was not studied before in the breakpoint model, and Tannier et al. (2009) also leave it open. Moreover, they conjecture that the problem is polynomially solvable; this might perhaps be attributed to the fact that the halving problem is polynomially solvable in far more complicated models such as reversal/translocation (RT) (El-Mabrouk and Sankoff, 2003) or DCJ (Alekseyev and Pezner, 2007; Mixtacki, 2008; Warren and Sankoff, 2009; Kováč et al., 2011). Nevertheless, we refute this conjecture (unless P = NP) by proving that the halving problem is NP-complete in the unichromosomal and multilinear models.

In the mixed model, Tannier et al. (2009) showed that median, halving, and guided halving problems are solvable in polynomial time. Two interesting open questions remained in their work. These are also articulated in the monograph by Fertin et al. (2009):

1. The best time complexity for the median and guided halving problems under the breakpoint distance on multichromosomal genomes (with circular chromosomes allowed) is O(n³), using a reduction to the maximum weight perfect matching problem. It is an open problem to devise an ad-hoc algorithm with better complexity.

2. The small phylogeny and large phylogeny problems under the breakpoint distance are open regarding multichromosomal signed genomes where linear and circular chromosomes are allowed.

We resolve the first question in a positive way by showing a more efficient algorithm running in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt n )$$ \end{document} time. In fact, we show that maximum cardinality matching and our problem have the same complexity. The same technique also improves the algorithms for halving and guided halving.

The second question is resolved negatively: small and large phylogeny problems are NP-hard under the breakpoint distance. Surprisingly, the problems are NP-hard (and APX-hard) even for four species, i.e., a quartet phylogeny. In other words, while finding an ancestor for three species is easy, finding two ancestors for four species is already hard.

The previous work and our new results are summarized in Table 1.

Table 1.

Our New Results in Context of the Previously Known Results

Breakpoint model	Median	Halving	Guided halving	Small phylogeny
Unichromosomal (linear or circular)	NP-hard^a	NP-hard^*	NP-hard^b	NP-hard^c
Multilinear	NP-hard^d	NP-hard^*	NP-hard^b	NP-hard^c
Multichromosomal (circular or mixed)	O(n³)^d	O(n³)^d	O(n³)^d	NP-hard^*
	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt{n} )$$ \end{document} ^*	O(n)^*	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt{n} )$$ \end{document} ^*

New.

Bryant, 1998; Pe'er and Shamir, 1998. ^bZheng et al., 2008. ^cTrivial. ^dTannier et al., 2009.

1.2. Road map

In the next section, we define the different variants of the breakpoint model and state the rearrangement problems. In Section 3, we refute the conjecture of Tannier et al. (2009) and prove that the halving problem is NP-hard for the unichromosomal and multilinear breakpoint model. In the following two sections, we study the general breakpoint model. In Section 4, we improve upon the algorithm of Tannier et al. (2009) and show that it is equivalent to the maximum matching problem. The hardness of the small phylogeny problem is studied in Section 5, and we conclude in Section 6.

2. Preliminaries

2.1. Genome models and the breakpoint distance

We assume that all of the studied genomes have the same gene content \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal G}$$ \end{document} . Each gene \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$g \in{ \cal G}$$ \end{document} is an oriented segment of DNA having two extremities: a head g_h and a tail g_t. We represent a circular genome π by a set of edges: An edge between extremities x and y, called adjacency, indicates that x and y are adjacent in the genome. Every extremity is adjacent to exactly one other extremity, so we can identify genomes with perfect matchings over the set of extremities.

Let us define an auxiliary base matching \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B = \{ g_h g_t : g \in { \cal G} \} $$ \end{document} where each edge connects the two ends of some gene. Then all vertices have degree 2 in the union² \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} decomposes into a set of cycles, which naturally correspond to the circular chromosomes of our genome (see Fig. 1, left).

FIG. 1.

Example of a circular genome (left) and a mixed genome (right) and their representations by matchings. (a) The order of genes in a genome. Each arrow corresponds to a single gene with known orientation. (b) A genome with two linear and one circular chromosome. Such genomes are not found in nature; however, the model is motivated by tractability of the rearrangement problems such as median. (c) Representation of the genome above by a perfect matching. The green edges are the adjacencies of π; the gray edges form the base matching B. The Hamiltonian cycle \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} corresponds to the single chromosome. (d) The genome above represented as a set of adjacencies (green matching). Gray edges form the base matching B. Components of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} are paths and cycles, corresponding to linear and circular chromosomes, respectively.

In the general (multichromosomal circular) model, genomes can have multiple circular chromosomes and any perfect matching π corresponds to a genome. In the unichromosomal circular model, we require that the genome only consists of a single chromosome, so \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} is a Hamiltonian cycle. Such a matching π is sometimes called a Hamiltonian matching.

Let π₁ and π₂ be two genomes—two perfect matchings. Then the breakpoint distance between π₁ and π₂ is defined as d(π₁, π₂) = n − sim(π₁, π₂), where n is the number of genes and sim(π₁, π₂) is the number of common adjacencies. The breakpoint distance satisfies all the properties of a metric and is used in the literature; however, we find it easier to work directly with the similarity measure sim(π₁, π₂).

In models involving linear chromosomes, we add a vertex T_x for each extremity x. These vertices are called telomeres, and a telomeric adjacency xT_x indicates that x is an end of a linear chromosome (see Fig. 1, right). Genomes will again correspond to matchings with an added condition that T_x may only be adjacent to x. If π is such a matching, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi\, \dot {\rm \cup} B$$ \end{document} consists of cycles and paths ending in telomeres, which correspond to circular and linear chromosomes, respectively. In the mixed model, any such matching π represents a genome; in the multilinear model, we require that every chromosome is linear; and in the linear model, we only allow a single linear chromosome.

We can write the breakpoint distance again in the form d(π₁, π₂) = n − sim(π₁, π₂), where this time, sim(π₁, π₂) is the number of common adjacencies plus half the number of common telomeric adjacencies (as introduced by Tannier et al., 2009).

2.2. Duplicated genomes

We will also work with duplicated genomes that have exactly two copies of each gene. For each gene g, let us label the first copy g¹ and the second copy g². Then we can represent a duplicated genome by an ordinary genome δ over the gene set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ g^1 , g^2 : g \in { \cal G} \} $$ \end{document} . However, note that the labels were introduced arbitrarily and we consider two genomes that differ only in the subscripts of some genes as equivalent. A duplicated genome corresponds to the equivalence class [δ]. A breakpoint distance (similarity) between two duplicated genomes [γ] and [δ] is the minimum distance (maximum similarity) between ordinary genomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\gamma^{ \prime} \in [ \gamma ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\delta^{ \prime} \in [ \delta ]$$ \end{document} .

Let us write \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta = \pi \oplus \pi$$ \end{document} for a perfectly duplicated genome. For each linear chromosome in π, θ contains two copies of the chromosome, and for each circular chromosome in π, θ contains either two copies of the chromosome or one chromosome consisting of the two copies consecutively. The distance between an ordinary genome π and a duplicated genome [δ], also called double distance and denoted dd(π, δ), is then the distance between \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi \oplus \pi$$ \end{document} and [δ].

We say that π and [δ] have adjacency xy in common, if x, y are adjacent in π and xⁱ, y^j are adjacent in δ for some i and j. We say that they have the adjacency xy twice in common, if either x¹y¹ and x²y², or x¹y² and x²y¹ are adjacent in δ. Tannier et al. (2009) showed that the double distance dd(π, δ) can be computed simply as dd(π, δ) = 2n − sim(π, δ), where sim(π, δ) is the number of adjacencies in common plus half the number of telomeric adjacencies in common (adjacencies twice in common are counted as 2).

2.3. Rearrangement problems

Once we have defined a genome model and a distance measure, we can define the following parsimony problems of interest. Assume that we have two genomes π₁ and π₂, and we would like to reconstruct their common ancestor α. Using a third, outgroup genome π₃, we can formulate the task as the median problem: Given π₁, π₂, and π₃, find genome α (called median) that minimizes the total distance from π₁, π₂, and π₃. In the breakpoint median problem, we are minimizing the breakpoint distance, which is the same as maximizing the median score S(α) = sim(α, π₁) + sim(α, π₂) + sim(α, π₃). Note that the genome model imposes further constraints on the solution: the number and the type of chromosomes.

We can generalize the median problem to the median of k genomes problem, where given genomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\pi_1 , \ldots , \pi_k$$ \end{document} , we should find genome α that maximizes the score \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S ( \alpha ) = \sum_i { \rm sim} ( \alpha , \pi_i )$$ \end{document} . However, an even more important generalization is the small phylogeny problem, where we are given a phylogenetic tree and gene orders of the extant species (leaves of the tree). The task is to reconstruct all the ancestral genomes, i.e., to find gene orders for each internal vertex, while minimizing the sum of breakpoint distances along the edges of the phylogenetic tree. The median problem is a special case of the small phylogeny problem with just three species.

Another classical problem is the halving problem. Imagine a genome π that underwent a whole genome duplication. The perfectly duplicated genome \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\theta = \pi \oplus \pi$$ \end{document} was then rearranged to its present-day form γ. In the halving problem, we would like to reconstruct the preduplication ancestor π given the present-day genome γ. More precisely, we would like to find an ordinary genome α that minimizes the double distance from γ. The halving problem has usually many equivalent solutions. For better results, we can use an ordinary outgroup genome ρ (such that the speciation happened before the whole genome duplication) and search for genome α that minimizes the sum dd(α, γ) + d(α, ρ). This is called the guided genome halving problem.

3. Halving Problem

Bryant (1998) showed that the median problem is NP-hard in the circular breakpoint model by reduction from the directed hamiltonian cycle problem. The halving problem was not studied previously in the breakpoint model, but we show that it suffers the same “Hamiltonian” curse as the median problem. As the halving problem is polynomially solvable in more realistic models, such as the RT model (El-Mabrouk and Sankoff, 2003) or the DCJ model (Alekseyev and Pezner, 2007; Mixtacki, 2008; Warren and Sankoff, 2009; Kováč et al., 2011), the halving problem under the breakpoint distance will remain a mere curiosity: It is the first problem that is easier in the DCJ or even in the RT model than in the breakpoint model. Furthermore, it is the only known case where halving is NP-hard, even though the double distance is computable in polynomial time [in the DCJ model, the opposite is true: halving is easy, whereas the double distance is NP-hard (Tannier et al., 2009)].

Theorem 1.

Halving problem is NP-hard in the circular, linear, and multilinear breakpoint models.

Proof. The proof is by reduction from the directed hamiltonian cycle problem. Plesník (1979) proved that this problem is still NP-hard for graphs with maximum degree 2, and the construction implies the problem is also NP-hard if all in-degrees and out-degrees are equal to 2. Note that such graphs have an Eulerian cycle.

Let G = (V, E) be such a directed graph; the corresponding doubled genome δ will have two copies of a gene for each vertex in G, and an Eulerian cycle in G traversing each vertex twice will be the order of genes in δ. More precisely, let G′ = (V′, E′), where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V^{ \prime} = \{ x_h^1 , x_t^1 , x_h^2 , x_t^2 : x \in V \} $$ \end{document} and the edges in E′ are defined as follows: traverse the Eulerian walk and, for each edge \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xy \in E$$ \end{document} , include edge \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_h^i y_t^j$$ \end{document} in E′, where i and j are 1 if we are visiting the vertex for the first time, and 2 if we are visiting the vertex for the second time. Note that all edges go from head to tail, E′ is a perfect matching, and G′ defines the doubled genome δ consisting of a single circular chromosome.

Let α be a circular genome that is a solution to the halving problem. Note that δ has no double adjacencies, so α can have at most n adjacencies in common (none twice in common). This maximum can be attained if and only if all the adjacencies in α are of the form x_hy_t (from head to tail), and for each such adjacency, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_h^iy_t^j$$ \end{document} is an adjacency in δ for some i, j. This is true if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xy \in E$$ \end{document} . So by contracting the base matching (each head and tail of a gene into a single vertex) and orienting the edges (from head to tail), we get a directed Hamiltonian cycle in G.

For the linear and multilinear models, remove one edge xy from G and consider the problem of deciding whether G contains a directed Hamiltonian path. This problem is still NP-hard and can be reduced to the halving problem in the linear models: G now has an Eulerian path starting in y and ending in x. We replace the last adjacency \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_h^2 y_t^1$$ \end{document} in δ (corresponding to the removed edge) by two telomeric adjacencies \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_h^2 { \rm T}_{x_h^2}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$y_t^1 { \rm T}_{y_t^1}$$ \end{document} to get a linear genome. If α is a linear or multilinear solution to the halving problem, it can reach the maximum similarity if and only if all of its adjacencies (including the telomeric adjacencies) are in common with δ, and this is true if and only if contraction of α is a directed Hamiltonian path in G. ■

4. Median and Halving Problems in the General Model

From now on, we will study the general breakpoint model, i.e., the multichromosomal circular model where genomes are perfect matchings. We will also show how to extend the results to the mixed model and use the same techniques for halving and guided halving problem.

4.1. Breakpoint median

Tannier et al. (2009) noticed that finding a breakpoint median can be reduced to finding a maximum weight perfect matching. This can be done in O(n³) time by the algorithm of Gabow (1973) and Lawler (1976). We improve on this by showing an \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt{n} )$$ \end{document} algorithm.

The solution by Tannier et al. (2009) (if we rephrase it using the similarity measure instead of the breakpoint distance) was to create a complete weighted graph G where vertices are extremities and weight w(xy) of edge xy is the number of genomes that contain the adjacency xy. Any perfect matching α corresponds to some genome, and the weight of the matching is equal to its median score S(α).

Notice that instead of finding a maximum weight perfect matching, we can remove all the zero-weight edges from G and find an ordinary (not necessarily perfect) matching. We can then complete the genome by joining the free vertices arbitrarily. As the number of edges in G is now linear, maximum weight matching can be found in O(n² log n) time by the algorithm of Gabow (1990) or even in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\tilde O ( n \sqrt{n} )$$ \end{document} time by the state-of-the art algorithm of Gabow and Tarjan (1991) using the fact that the weights are small integers.

Theorem 2.

The breakpoint median problem for k genomes can be solved in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( kn \sqrt{n} \cdot \log ( kn ) \sqrt{ \mathop{ \alpha} ( kn , n ) \log n} )$$ \end{document} time in the general model. [Here, α(m, n) is the inverse Ackermann function.]

We further improve the algorithm for the most important special case, k = 3: Notice that when xy is an edge with weight 3, there is no other edge incident to x or y. Therefore, xy must belong to the maximum weight matching. Moreover, if xy has weight 2, there is a maximum weight matching that contains xy. Suppose to the contrary that xu and yv were matched in α instead. Then w(xu) and w(yv) are at most 1, and by exchanging these edges for xy and uv with weights w(xy) = 2 and w(uv) ≥ 0, we get a matching with the same or even higher weight.

Thus, we can include all edges of weight 2 and 3 in the matching and remove matched vertices together with their incident edges. The remaining graph has only unit edge weights, so it suffices to find maximum cardinality matching. This can be done in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( m \sqrt{n} )$$ \end{document} time by the algorithm of Micali and Vazirani (1980). Thus, we have the following claim.

Theorem 3.

The breakpoint median problem for three genomes can be solved in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt n )$$ \end{document} time (in the general model).

One might still wonder whether there is an even better algorithm for the median problem, which perhaps can avoid computation of maximum matching. Alas, we show that improving upon our result would be very hard, because it would immediately imply a better algorithm for the matching problem, beating the result of Micali and Vazirani (1980) (at least on cubic graphs), which has been an open problem for more than 30 years.

Biedl (2001) showed that the maximum matching problem is reducible to the maximum matching problem in cubic graphs by a linear reduction. This means that we can transform any given graph G with m edges to a cubic graph G′ with O(m) edges such that the maximum matching in G can be recovered from one in G′ in O(m) time. Thus, any O(f (m)) algorithm for finding maximum matching in cubic graphs implies an O(f (m) + m) algorithm for arbitrary graphs.

We say that a reduction is strongly linear if it is linear, and both the number of vertices and the number of edges increase at most linearly. Such a reduction preserves the running time O(f (m, n)) depending on both the number of vertices and the number of edges.

We prove that the breakpoint median problem is equivalent to matching under linear reduction and to cubic matching under strongly linear reduction. If we write ≤_ℓ for linear and ≤_sℓ for strongly linear reduction, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}{\rm matching} \leq_ \ell {\rm cubic} \ {\rm matching} \leq_{s \ell} {\rm breakpoint \ median} \leq_{s \ell} {\rm matching}.\end{align*} \end{document}

The first reduction is by Biedl (2001) and the last one was shown in Theorem 3 [in fact, a reduction to subcubic matching, where the degrees are at most 3, was shown; this is equivalent to cubic matching under the strongly linear reduction (Biedl, 2001)]. We now prove the middle reduction.

Let G be a cubic graph, an instance of the cubic matching problem. In breakpoint median, the input multigraph consists of three perfect matchings, i.e., is edge 3-colorable. However, not all cubic graphs are edge 3-colorable. The solution is to color edges arbitrarily and resolve conflicts as shown in Figure 2a. We can, for example, color the ends of edges at each vertex randomly with three different colors. When both ends of an edge are assigned the same color, we color the edge appropriately. When the ends have different colors, we subdivide the edge into three parts and use the third color for the middle edge (see Fig. 2a). Note that the size of a maximum matching in the modified graph is exactly one more than the size in the original graph: If xy is matched in the original, xu and vy can be matched in the modified graph. If xy is not matched, we can still match uv.

FIG. 2.

Linear reduction of maximum matching in cubic graphs to breakpoint median problem. (a) Edge xy (top) should be colored green (this is the only missing color at x) and red at the same time (this is the missing color at y). We resolve this conflict by subdividing edge xy by two new vertices (bottom); we color xu green, vy red, and uv blue. (b) In the second phase, we duplicate graph G and connect the corresponding vertices with degree 2 as shown in the figure.

Now, the modified graph is edge 3-colorable but not cubic. We remedy this by duplicating the whole graph and connecting the corresponding vertices of low degree as shown in Figure 2b. As noted above, we may suppose that the auxiliary double edges \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a_ua_u^{ \prime}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a_va_v^{ \prime}$$ \end{document} are matched, so ua_u, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u^{ \prime}a_u^{ \prime}$$ \end{document} , va_v, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$v^{ \prime}a_v^{ \prime}$$ \end{document} are not matched, and given the solution for the breakpoint median problem, we can recover the maximum matching of G in O(n) time. The reduction is obviously linear, so we have the following claim.

Theorem 4.

The breakpoint median problem (in the general model) has the same complexity as finding maximum cardinality matching in cubic graphs.

4.2. Median in the mixed model

In the mixed model, the weight of a telomeric adjacency xT_x equals to half the number of genomes that contain xT_x. If we multiply all weights by 2, we can use the algorithm by Gabow and Tarjan (1991) for integer weights, so the result of Theorem 2 remains valid also in the mixed model.

For the median of three genomes, an \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt n )$$ \end{document} algorithm exists: We can include all the double and triple adjacencies in the matching, as well as double and triple telomeric adjacencies (edges of weight 1 and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$1 {1\over2}$$ \end{document} ). If \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$w ( x { \rm T } _x ) = 1 \frac { 1 } { 2 } , x { \rm T } _x$$ \end{document} is a triple adjacency and no other edge is incident to either x or T_x in G. If w(xT_x) = 1 but the median α contains adjacency xy instead, then w(xy) ≤ 1, and as T_x can only be incident to x, it must be unmatched (or matched by a zero-weight edge), and so we can replace xy by xT_x in α.

The remaining graph consists of edges with unit weight and weight \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${1\over2}$$ \end{document} . Note, however, that all the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${1\over2}$$ \end{document} -weight edges are of the form xT_x, and there is no other edge incident to T_x. We use the doubling trick again: we take two copies of graph G, and replace all pairs \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x{ \rm T}_x , x^{ \prime}{ \rm T}_x^{ \prime}$$ \end{document} by a single edge xx′ of unit weight. We can then remove all the telomere vertices. The resulting graph will have only unit weight edges, and the maximum matching will be exactly twice the size of the maximum matching in the original graph.

4.3. Halving problems in the general model

The same tricks can be used for halving and guided halving problems. Recall that in the halving problem, we are given a duplicated genome γ, and we are searching for genome α that minimizes the double distance dd(α, γ); in the guided halving problem, we are also given genome ρ and we are minimizing the sum dd(α, γ) + d(α, ρ).

Again, we construct graph G, where this time, the weight of edge xy is the number of adjacencies among x¹y¹, x¹y², x²y¹, x²y² in γ and possibly xy in ρ (in the case of the guided halving problem). The rest of the solution is identical, leading to an \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$O ( n \sqrt n )$$ \end{document} algorithm for the guided halving problem. In the halving problem, the degrees of vertices in G are at most 2, and after including all the double edges in the solution, the remaining graph consists only of cycles and the maximum matching can be found trivially in linear time.

5. Breakpoint Phylogeny

In the small phylogeny problem, we reconstruct ancestral genomes given a phylogenetic tree and gene orders of the extant species while minimizing the sum of distances along the edges of the tree. This problem is NP-hard for most rearrangement distances and for most models; this follows trivially from the NP-hardness of the median problem. However, as we have seen in the previous section, this is not the case in the general breakpoint model, and the complexity of the small-phylogeny problem remained open (Fertin et al., 2009; Tannier et al., 2009).

We prove that the small phylogeny problem is NP-hard even for four species. More precisely, given four genomes π₁,π₂,π₃,π₄, the breakpoint quartet problem is to find ancestral genomes α₁, α₂ that maximize the sum of similarities S(α₁, α₂) = sim(π₁, α₁) + sim(π₂, α₁) + sim(α₁, α₂) + sim(α₂, π₃) + sim(α₂, π₄).

Theorem 5.

The breakpoint quartet problem is NP-hard and even APX-hard in the general breakpoint model.

The proof is inspired by the work of Dees (2009), who showed that the following problem is NP-hard: Given two graphs G₁ = (V, E₁), G₂ = (V, E₂), find two perfect matchings M₁ ⊆ E₁ and M₂ ⊆ E₂ with the maximum overlap M₁ ∩ M₂. The problem is NP-hard even when the components in G₁ and G₂ are just cycles. In our proof, π₁ ∪ π₂ will correspond to E₁, π₃ ∪ π₄ will correspond to E₂, and the unknown ancestors α₁, α₂ will correspond to the unknown perfect matchings M₁, M₂.

Our proof is, however, much more involved, and there are two reasons for this. First, the problem formulation does not guarantee that α₁ ⊆ π₁ ∪ π₂ and α₂ ⊆ π₃ ∪ π₄. We will say that solution α₁,α₂ that satisfies this condition is in a normal form. The hard part of the proof is showing that we can transform any solution α₁, α₂ into a solution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_1^{ \prime} , \alpha_2^{ \prime}$$ \end{document} that has the same score and, moreover, it is in the normal form. The second major difficulty is that we are maximizing the sum S(α₁, α₂) instead of just the size of the intersection sim(α₁, α₂). To overcome these difficulties, we had to modify the edge gadget from the original proof and use a more restricted problem for the reduction.

5.1. Overview of the proof

The proof is by reduction from the cubic max-cut problem. Given a graph G, the max-cut problem is to find a cut of maximum size. We may rephrase this as a problem of coloring all vertices in G with two colors, red or green, while maximizing the number of red-green edges.

(Partition of V into the red part and the green part defines a cut, and its size is the number of edges with endpoints of different color.) In the cubic max-cut problem, the instances are cubic graphs; this variant is still NP-hard and APX-hard (Alimonti and Kann, 1997).

Let G = (V, E) be a given cubic graph, an instance of the cubic max-cut problem. We will construct genomes π₁, π₂, π₃, and π₄ such that the maximum cut in G can be recovered from the solution α₁, α₂ of the breakpoint quartet problem in polynomial time.

For each vertex of G, there will be a vertex gadget (see Fig. 3a) made of adjacencies of π₁ and π₂. Let π₁ be the red matching and π₂ the green matching. As we will prove later, we may suppose that α₁ ⊆ π₁ ∪ π₂, so within each vertex gadget, α₁ will contain either the red edges of π₁ or the green edges of π₂. This naturally corresponds to a red/green vertex coloring in the cubic max-cut problem.

FIG. 3.

The vertex and edge gadgets used in our reduction and the terminology used for different types of vertices and edges. The red and green edges are the adjacencies of π₁ and π₂, respectively. The cycles made of blue edges can be decomposed into two matchings: the adjacencies of π₃ and π₄. (a) Vertex gadget. (b) Edge gadget.

The framed vertices in Figure 3a are called “ports”; this is where the edge gadgets for the three incident edges are attached. For each edge of G, an edge gadget connecting the ports of the corresponding vertex gadgets is constructed as shown in Figure 3b. The blue cycles consist of two matchings: the adjacencies of π₃ and π₄. Again, as we will prove later, we may suppose that α₂ ⊆ π₃ ∪ π₄, i.e., the second ancestor consists only of the blue edges.

For future reference, let us state here again the claims to be proved in the form of a lemma:

Lemma 1

(Normal form). Let π₁, π₂, π₃, π₄ be an instance of the breakpoint quartet problem constructed from a cubic max-cut instance as described above. Then any solution α₁,α₂ can be transformed in polynomial time into a solution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_1^{ \prime} , \alpha_2^{ \prime}$$ \end{document} such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S ( \alpha_1^{ \prime} , \alpha_2^{ \prime} ) \geq S ( \alpha_1 , \alpha_2 ) , \alpha_1^{ \prime} \subseteq \pi_1 \cup \pi_2$$ \end{document} , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_2^{ \prime} \subseteq \pi_3 \cup \pi_4$$ \end{document} .

Once we prove the normal form lemma, the rest of the proof is easy: If α₁, α₂ is a solution in the normal form, term sim(π₁, α₁) + sim(π₂, α₁) is always the same: we get +6 for each vertex gadget and +6 for each edge gadget. Similarly, term sim(α₂, π₃) + sim(α₂, π₄) is always the same: we get +9 for each edge gadget. So the score S(α₁, α₂) is maximized, when sim(α₁, α₂) = |α₁ ∩ α₂| is maximized. Let uv be an edge in our graph G from the cubic max-cut problem. If we choose matchings of the same color for both vertex gadgets u and v, then α₁ and α₂ can only have one edge in common within the edge gadget uv (see Fig. 4a). However, if u and v have matchings of different color, we can set adjacencies of α₂ so that α₁ and α₂ have two edges in common (see Fig. 4b). When we sum up all these contributions, we get S(α₁, α₂) = 20m + c, where m is the number of edges in G, and c is the size of the cut corresponding to the matching α₁, so a polynomial algorithm for breakpoint quartet would imply a polynomial algorithm for cubic max-cut.

FIG. 4.

The dashed edges indicate the underlying vertex and edge gadgets, the blue edges are adjacencies of α₂, and the red, green, and yellow edges are adjacencies of α₁. Here, we assume that α₁ and α₂ are in the normal form. (a) Adjacencies of the first ancestor α₁ (red edges) agree with the adjacencies of π₁ at both vertex gadgets. This corresponds to coloring both vertices red in the cubic max-cut problem. Note that α₁ and α₂ can only have one adjacency in common. (b) In the first vertex gadget, α₁ agrees with π₁ (red edges), and in the second gadget, α₁ agrees with π₂ (green edges). This corresponds to coloring the first vertex red and the second vertex green in the cubic max-cut problem. In this case, α₁ and α₂ have two adjacencies in common.

For the APX-hardness, note that for any graph with m edges, we can easily find a cut of size c ≥ m/2. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_1^* , \alpha_2^*$$ \end{document} be an optimal solution for an instance of the breakpoint quartet problem and α₁, α₂ a solution such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S ( \alpha_1^* , \alpha_2^* ) \leq ( 1 + \epsilon ) S ( \alpha_1 , \alpha_2 )$$ \end{document} . Let both solutions be in the normal form, and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$c^*$$ \end{document} and c ≥ m/2 be the sizes of the corresponding cuts. Then 20m + \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$c^*$$ \end{document} ≤ (1 + ɛ)(20m + c), and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$c^*$$ \end{document} ≤ (1 + ɛ)c + 20ɛm ≤ (1 + 41ɛ)c. So a (1 + ɛ)-approximation algorithm for the breakpoint quartet problem would lead to a (1 + 41ɛ)-approximation algorithm for the cubic max-cut problem.

It can also be proved that the phylogenetic tree ((π₁, π₂), (π₃, π₄)) is the most parsimonious. The alternative quartets ((π₁, π₃), (π₂, π₄)) and ((π₁, π₄), (π₂, π₃)) yield score ≤20m, so this result also implies the NP- and APX-hardness of the large phylogeny problem. It remains an open problem whether computing the correct quartet (without reconstructing the ancestors) is hard.

5.2. Notation, terminology, and other conventions

We say that an adjacency \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \alpha_1$$ \end{document} is supported if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \pi_1 \cup \pi_2$$ \end{document} . Similarly, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \alpha_2$$ \end{document} is supported if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \pi_3 \cup \pi_4$$ \end{document} . An adjacency that is not supported is unsupported. Furthermore, let Π = π₁ ∪ π₂ ∪ π₃ ∪ π₄ be the set of adjacencies present in at least one extant species. We will say that an adjacency \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \alpha_i$$ \end{document} is weakly supported, if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e \in \prod$$ \end{document} .

Let us name the different types of vertices (extremities) and edges (adjacencies) in the following manner. The framed vertices in Figure 3a are called ports, and edges from π₁ ∪ π₂ that connect them are called port edges. We use the same names also for other (extant or ancestral) adjacencies that are parallel to these.

Each port consists of two outer extremities called corners, and the middle vertex between them. The sets of all ports, corners, and middle vertices are denoted by P, C, and M, respectively (P = C ∪ M). The set of intermediate extremities located between ports of vertex gadgets is denoted by I.

The double edges and the two vertices at the top of Figure 3b are auxiliary: they just complete the matchings into perfect matchings.

As the edge gadget without auxiliary and port edges is reminiscent of a ladder, we use the following terminology (see Fig. 3b). The red-green double adjacencies are the rungs and the blue adjacencies are the rails of the ladder. Again, we use the same name for parallel adjacencies. The set of auxiliary extremities is denoted by A, and the set of ladder extremities is denoted by L.

We say that uv is an X–Y-edge if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u \in X$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$v \in Y$$ \end{document} (X and Y do not have to be disjoint); an X-edge is any edge uv such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$u \in X$$ \end{document} or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$v \in X$$ \end{document} .

In the proof of the normal form lemma, we will gradually transform a given solution α₁, α₂ by exchanging some of the adjacencies in the solution for other adjacencies. The method is analogous to improving a given matching by an augmenting path. An α_i-alternating cycle is a cycle where edges belonging to α_i and edges not belonging to α_i alternate. We will say that C₁, C₂ is a non-negative pair of cycles for the solution α₁, α₂, if C_i is an α_i-alternating cycle and exchanging the matched and the unmatched edges of C_i in α_i (for i = 1, 2) does not decrease the score: S(α₁ ⨁ C₁,α₂ ⨁ C₂) ≥ S(α₁, α₂). One of the cycles may be empty, in which case we simply say that C₁ or C₂ is a non-negative cycle, and if the exchange in fact increases the score, we may speak of an augmenting pair of cycles (or an augmenting cycle).

In the figures that follow, we will color adjacencies of α₂ blue and adjacencies of α₁ red, green, or yellow. We use red or green for edges in the vertex gadgets that are shared with π₁ or π₂, respectively (this corresponds to choosing the red or green color in the cubic max-cut problem). We use yellow for the other edges. We use straight lines for the actual adjacencies and wavy lines for the suggested adjacencies in non-negative cycles that should be included instead.

In the proof, we will often say “we may suppose that the solution has property \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal P}$$ \end{document} ” as a shorthand for a more precise (and longer) statement:

Given any solution α₁, α₂, we can transform it to a solution \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_1^{ \prime} , \alpha_2^{ \prime}$$ \end{document} with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$S ( \alpha_1^{ \prime} , \alpha_2^{ \prime} ) \geq S ( \alpha_1 , \alpha_2 )$$ \end{document} having property \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal P}$$ \end{document} in polynomial time; in particular, if α₁, α₂ is an optimal solution, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\alpha_1^{ \prime} , \alpha_2^{ \prime}$$ \end{document} is also optimal, with property \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal P}$$ \end{document} . From now on, we will assume that the solution has property \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${ \cal P}$$ \end{document} .

With this terminology, we may rephrase the normal form lemma more succinctly as follows: We may suppose that solutions of the instances obtained by reduction from cubic max-cut as described above have all adjacencies supported.

5.3. Proof of the normal form lemma

First, we focus on the adjacencies that the ancestors α₁ and α₂ have in common. We will show that these may be assumed to be at least weakly supported.

Proposition 1.

We may suppose that all red-green double edges (auxiliary adjacencies and rungs) are matched in α₁ and all blue double edges (auxiliary adjacencies) are matched in α₂, i.e., π₁ ∩ π₂ ⊆ α₁ and π₃ ∩ π₄ ⊆ α₂.

Proof. We can alternately replace genome α₁ or α₂ by the median of its neighbors in the phylogenetic tree until we converge to a local optimum. As we have already proved in the previous section, we may assume that a median contains all adjacencies occurring at least twice. ■

Proposition 2.

We may suppose that α ₁ and α ₂ do not contain unsupported M-edges. In other words, we may suppose that in both α ₁ and α ₂ , one of the edges in each port is chosen.

Proof. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x \in M$$ \end{document} . First, consider the case that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xy_1 \in \alpha_1$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xy_2 \in \alpha_2$$ \end{document} are both unsupported. Let p be a neighboring corner vertex. Whereas xy₁ and xy₂ contribute at most +1 to the score (if y₁ = y₂), common adjacency xp would contribute +3. Let pz₁ and pz₂ be the actual adjacencies in α₁ and α₂; either z₁ ≠ z₂, or z₁ = z₂ and one of the adjacencies is unsupported. Either way, these two edges contribute at most +2 to the score; so xpz₁y₁x and xpz₂y₂x is a non-negative pair of cycles and we can exchange the edges.

Similarly, if one ancestor contains a port edge xp and the other one adjacencies pz and unsupported xy, then xpzyx is a non-negative cycle. ■

Proposition 3.

We may suppose that all L-edges are weakly supported—they are ladder edges.

Proof. In α₁, all L-edges are the rung edges by Proposition 1 and are supported. Consequently, the contribution of any L-edge in α₂ that is not even weakly supported is zero. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\ell_1 x \in \alpha_2$$ \end{document} be such an edge. Let ℓ₁ℓ₂ be the middle rail edge and let ℓ₂y be the adjacency in α₂. If ℓ₂y is not weakly supported, ℓ₁ℓ₂yxℓ₁ is an augmenting cycle. Otherwise, if ℓ₂y is a rail edge, it contributes +1 to the score and ℓ₁ℓ₂yxℓ₁ is a non-negative cycle.

The last case is that ℓ₂y is a rung edge contributing +1 to the score. Let ℓ₃ = y, let ℓ₃ℓ₄ be the other middle rail edge, and let ℓ₄z be the adjacency in α₂. Again, if ℓ₄z is unsupported, ℓ₁ℓ₂ℓ₃ℓ₄zxℓ₁ is an augmenting cycle; otherwise it is a rail edge and the cycle is non-negative.

It is easy to check that with each non-negative pair of cycles, we get rid of an L-edge that is not weakly supported, unless we improve the score, which may be done only O(n) times. In the process, we may introduce unsupported C-edges, which is acceptable and we will deal with them next. ■

Proposition 4.

We may suppose that there are no common C-edges other than port edges.

Proof. Let xb be a common C–C-edge in α₁ ∩ α₂. In the proof, we will use the notation introduced in Figure 5. From what we have proved so far, we may assume that α₁ contains the rung edges ℓ_aℓ_b and ℓ_cℓ_d (Proposition 1), am₁ is a common adjacency of α₁ and α₂, and either m₂c or m₂d is included in α₂ (Proposition 2).

FIG. 5.

Different cases that arise when disposing of unsupported common C–C-edges. The dashed edges represent the underlying edge gadgets; adjacencies of α₂ are blue, and adjacencies of α₁ are yellow, red, and green. Wavy lines are the new suggested adjacencies that should be exchanged for the present ones in the non-negative cycles. (a) Case 1: α₂ contains m₂d and cz ∉ α₁. (b) Case 2: α₂ contains m₂d and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$cz \in \alpha_1 \cap \alpha_2$$ \end{document} . (c) Case 3: α₂ contains m₂c and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$dz \,\notin\, \alpha_1$$ \end{document} . (d) Case 4: α₂ contains m₂c and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$dz \in \alpha_1 \cap \alpha_2$$ \end{document} .

First, assume the latter case that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$m_2d \in \alpha_2$$ \end{document} (Fig. 5a,b). As the L-edges are weakly supported, either \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\ell_b \ell_c \in \alpha_2$$ \end{document} (Fig. 5a) or both ℓ_aℓ_b and ℓ_cℓ_d belong to α₂ (Fig. 5b). In either case, we can add ladder edges to form an alternating b–c-path with score +1 that will be a part of our non-negative pair of cycles.

Let cz be an adjacency in α₂. As m₂ and ℓ_c are already matched to different vertices, cz is unsupported. Now, either \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$cz \,\notin \alpha_1$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xb \ldots czx$$ \end{document} is a non-negative cycle (see Fig. 5a), or cz is a common edge and we will also have to exchange some edges in α₁. In particular, xbczx and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$xb \ldots czx$$ \end{document} are a non-negative pair of cycles (see Fig. 5b).

Similarly, we can prove the other case when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$m_2c \in \alpha_2$$ \end{document} ; the non-negative cycle pairs are depicted in Fig. 5c,d. It can be easily checked that the proof also works when extremities x and b belong to the same edge gadget (in this case, x coincides with c or d, and b coincides with z). A C–C-edge connecting two corners of a single port is ruled out by Proposition 2.

Note that if α₁ and α₂ have a common z-edge (Fig. 5b,d), we may create a new common unsupported C–C-edge xz. However, the number of common unsupported C–C-edges is decreased by 1 in all cases. ■

Corollary 1.

We may suppose that all the common adjacencies of the ancestors α₁ and α₂ are weakly supported: α₁ ∩ α₂ ⊆ Π. More specifically, we may suppose that the only common adjacencies are port edges and rung edges. Consequently, each unsupported adjacency except for rung edges in α₂ contributes zero to the score.

We say that α₁ is uniform at a vertex gadget, if all the port edges in the gadget have the same color (they all agree with either the π₁ edges or the π₂ edges). Next, we prove that α₁ may be assumed uniform at all gadgets. Such an ancestor α₁ directly corresponds to a cut in G.

Here, we use the fact that G is cubic: Imagine that G was a complete bipartite graph K_n_,n with one more vertex connected to all the other vertices. Then our reduction would not work, because the optimal ancestors would color one bipartition red, the other green, and the extra vertex half green, half red (i.e., half of the ports would be green and the other half red).

First, let us characterize what the non-uniform gadgets look like.

Proposition 5.

We may suppose that the following statements are equivalent:

• Genome α ₁ is not uniform at a vertex gadget.

• There is one unsupported I-edge in α ₁ incident to the vertex gadget.

• There is one unsupported C-edge in α ₁ incident to the vertex gadget.

Proof. Let α₁ be nonuniform at a vertex gadget. Without loss of generality, let two of the port edges be green and one be red (see Fig. 6a). Denote r the red and g₁ and g₂ the green edges, such that g₁ is closer to r (as in Fig. 6a). The edge incident to the intermediate extremity between r and g₁ is an unsupported I-edge.

FIG. 6.

Nonuniform ancestors at a vertex and a way how to remedy them. (a) A nonuniform ancestor α₁ at a vertex gadget. (b) The nonuniform gadgets are connected by unsupported I-edges and C-edges. (c) A non-negative cycle used for disposing of nonuniform gadgets and unsupported edges in α₁.

Obviously, if two neighboring extremities in a vertex gadget are incident with unsupported edges, there is an augmenting cycle, so we may suppose that the intermediate edge between g₁ and g₂ is green and one of the intermediate edges e or f in Figure 6a belongs to α₁; the other corner has an unsupported C-edge.

Conversely, if there is an unsupported I-edge or C-edge, the neighboring ports cannot have edges of the same color (this would imply two neighboring extremities with unsupported edges in α₁). ■

Now we are ready to prove the normal form lemma.

Proposition 6.

We may suppose that in each vertex gadget, the port edges of α₁ are either all red or all green. Thus, we may suppose that all adjacencies in α₁ are supported: α₁ ⊆ π₁ ∪ π₂.

Proof. We prove that for each vertex gadget, we may simply look at the three port edges and choose the color by majority vote. In the previous proposition, we have shown that nonuniform gadgets have exactly two unsupported edges so they form cycles as in Figure 6b. Figure 6c shows the non-negative cycle that we get by including the edges decided by majority vote. In each vertex gadget, we may lose 1 point for switching the port edge (if this was a common edge), but we get 1 extra point for increasing the number of supported edges. ■

Proposition 7.

We may suppose that all adjacencies in α₂ are supported: α₂ ⊆ π₃ ∪ π₄.

Proof. The only remaining unsupported edges in α₂ are the rung edges and C–C-edges. If α₂ contains a rung edge, it must in fact contain both rung edges, and in the adjacent ports, only one corner is covered by a port edge. Thus, the edge gadget is incident to two unsupported C–C-edges.

Conversely, it is easy to see that if α₂ contains a C–C-edge, in the incident edge gadgets, α₂ contains either both rung or both middle rail edges and there are C–C-edges incident to the corners of the opposite ports. So the edge gadgets together with the unsupported C–C-edges form cycles, and all the rung edges are in these edge gadgets (see Fig. 7).

FIG. 7.

Example of three edge gadgets connected in a cycle by unsupported C–C-edges. We can join two corners with unsupported C–C-edges in an edge gadget by a non-negative path. Note that we also get rid of the blue rung edges in the top and right edge gadgets at the same time.

In each edge gadget, we can join the two corners by a non-negative alternating path (see Fig. 7); we can lose 1 point for destroying a common adjacency of α₁ and α₂, but we gain 1 point for increasing the number of supported edges in α₂. By exchanging edges along these cycles, we fix both the unsupported C–C-edges and rung edges. ■

This concludes the proof of the normal form lemma, and thus also the proof of NP-hardness and APX-hardness of the breakpoint quartet problem.

6. Conclusion

In this article, we have settled several open problems concerning the computational complexity of different rearrangement problems in the breakpoint models. There are at least three intriguing questions in this area that remain open. The first two are of theoretical interest and are related to approximability of the small phylogeny problem; the third question is more practical:

1. How well can we approximate small phylogeny? For example, the breakpoint quartet problem can be easily formulated as an integer linear program (we can use different variables for the edges present only in α₁, only in α₂, and in the intersection α₁ ∩ α₂). Its relaxation might lead to an algorithm with a good approximation ratio.

2. In the Steinerization approach to ancestral reconstruction, we repeatedly replace the ancestral genomes by medians of genomes in the neighboring nodes of the tree until we converge to a local optimum. Despite the fact that this is the most common approach to ancestral reconstruction (also in the other models) and that preliminary experiments with simulated data suggest that this heuristic performs very well, no guarantees are known for the method (in any model).

3. Finally, the motivation behind the general breakpoint model is that we can solve the median problem in polynomial time. Using the Steinerization method, we can also get very good solutions of the small phylogeny problem rapidly. The question is: Are these solutions useful in practice? Are they biologically plausible? Or can we adjust them and use them as starting points in more complicated models?

Footnotes

Acknowledgments

The author would like to thank Broňa Brejová and Tomáš Vinař for many constructive comments. The research of Jakub Kováč is supported by VEGA grant 1/1085/12 and Comenius University grant UK/310/2013. A preliminary version of this article was presented on the RECOMB-CG 2011 satellite workshop.

Disclosure Statement

The author declares that no competing financial interests exist.

¹As opposed to the large-phylogeny problem, where the phylogenetic tree is not given and is part of the solution.

²Technically, this is a disjoint or multiset union; we allow parallel edges forming 2-cycles.

References

Alekseyev

M.A.

, and Pevzner

P.A.

2007. Colored de Bruijn graphs and the genome halving problem. IEEE/ACM Trans. Comput. Biol. Bioinform., 4, 98–107.

Alimonti

, and Kann

1997. Hardness of approximating problems on cubic graphs. Proc. CIAC, 288–298.

Biedl

T.C.

2001. Linear reductions of maximum matching. Proc. SODA, 825–826.

Bryant

1998. The complexity of the breakpoint median problem. Technical Report CRM-2579. Centre de Recherches Mathématiques, Université de Montréal.

Caprara

2003. The reversal median problem. INFORMS J. Comput., 15, 93.

Dees

2009. Simultaneous matchings in dynamic graphs student research project. Universität Karlsruhe, Karlsruhe, Germany.

El-Mabrouk

, and Sankoff

2003. The reconstruction of doubled genomes. SIAM J. Comput., 32, 754–792.

Fertin

, Labarre

, and Rusu

2009. Combinatorics of Genome Rearrangements. The MIT Press, Cambridge, MA.

Gabow

1973. Implementation of algorithms for maximum matching on nonbipartite graphs Ph.D. thesis. Stanford University, Stanford, CA.

10.

Gabow

1990. Data structures for weighted matching and nearest common ancestors with linking. Proc. SODA, 434–443. Society for Industrial and Applied Mathematics.

11.

Gabow

, and Tarjan

1991. Faster scaling algorithms for general graph matching problems. J. ACM, 38, 815–853.

12.

Kováč

, Warren

, Braga

M.D.V.

, and Stoye

2011. Restricted DCJ model: rearrangement problems with chromosome reincorporation. J. Comput. Biol., 18, 1231–1241.

13.

Lawler

1976. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, New York.

14.

Micali

, and Vazirani

V.V.

1980. An O(

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\sqrt{ \mid V \mid } \mid E \mid )$$ \end{document}

) algorithm for finding maximum matching in general graphs. Proc. FOCS, 17–27. IEEE Computer Society.

15.

Mixtacki

2008. Genome halving under DCJ revisited. Computing and Combinatorics, 276–286.

16.

Pe'er

, and Shamir

1998. The median problems for breakpoints are NP-complete. Electronic Colloquium on Computational Complexity (ECCC), Report 71.

17.

Plesník

1979. The NP-completeness of the Hamiltonian cycle problem in planar digraphs with degree bound two. Inf. Process. Lett., 8, 199–201.

18.

Sankoff

, and Blanchette

1998. Multiple genome rearrangement and breakpoint phylogeny. J. Comput. Biol., 5, 555–570.

19.

Tannier

, Zheng

, and Sankoff

2009. Multichromosomal median and halving problems under different genomic distances. BMC Bioinformatics, 10, 120.

20.

Warren

, and Sankoff

2009. Genome halving with double cut and join. J. Bioinform. Comput. Biol., 7, 357–371.

21.

Zheng

, Zhu

, Adam

, and Sankoff

2008. Guided genome halving: hardness, heuristics and the history of the Hemiascomycetes. Proc. ISMB, 96–104.