The Zero Exemplar Distance Problem

Abstract

Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this article, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order, and prove the analogous result that it is also NP-hard even if each gene appears at most two times in each genome. For the positive direction, we show that both variants of Zero Exemplar Distance admit polynomial-time algorithms if each gene appears exactly once in one genome and at least once in the other genome. In addition, we present a polynomial-time algorithm for the related problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence. This answers an open question of Bonizzoni et al. We also show that Zero Exemplar Distance for multichromosomal genomes without gene order is fixed-parameter tractable in the general case if the parameter is the maximum number of chromosomes in each genome.

1. Introduction

Given two genomes with duplicate genes, Genome Rearrangement with Gene Families is the problem of deleting all but one copy of each gene in each genome, so as to minimize some rearrangement distance between the two reduced genomes (Sankoff, 1999). The minimum rearrangement distance thus attained is called the exemplar distance between the two genomes. For example, each of the following two monochromosomal genomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} G_1 : & \quad {- 4} \ {+ 1} \ {+ 2} \ {+ 3} \ {- 5} \ {+ 1} \ {+ 2} \ {+ 3} \ {- 6} \\ G_2 : & \quad {- 1} \ {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 2} \ {- 6} \ {+ 3} \end{align*} \end{document}

has at most two copies of each gene, and each of the following two reduced genomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} G^{ \prime}_1 & : \quad {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 6} \\ G^{ \prime}_2 & : \quad {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 6} \end{align*} \end{document}

has exactly one copy of each gene. Recall that, in the study of genome rearrangement, a gene is usually represented by a signed integer: the absolute value of the integer (the unsigned integer) denotes the gene family to which the gene belongs; the sign of the integer denotes the orientation of the gene in its chromosome. Then a chromosome is a sequence of signed integers, and a genome is a collection of chromosomes. In the simpliest case, a monochromosomal genome, which contains only one chromosome, can be represented by a sequence of signed integers.

Genome Rearrangement with Gene Families is not a single problem but a whole class of related problems, because the choice of rearrangement distance is not unique. This choice becomes irrelevant, however, when we ask the fundamental question: Is the distance zero? In the example above, the two reduced genomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$G_1^{ \prime}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$G_2^{ \prime}$$ \end{document} are identical, thus the exemplar distance between the two original genomes G₁ and G₂ is zero for any reasonable choice of rearrangement distance.

In this article, we study the most basic version of the problem Genome Rearrangement with Gene Families: Given two sequences of signed integers, Zero Exemplar Distance (for monochromosomal genomes) is the problem of deciding whether the two sequences have a common subsequence including each unsigned integer exactly once in either positive or negative form.

Due to its generic nature, the problem Zero Exemplar Distance has been extensively studied by several groups of researchers focusing on different rearrangement distances, and, not surprisingly, has acquired several different names. Except for trivial distinctions, Zero Exemplar Distance is essentially the same problem as Zero Exemplar Conserved Interval Distance by Chen et al. (2008), Exemplar Longest Common Subsequence (deciding whether a feasible solution exists) by Bonizzoni et al. (2007), and Zero Exemplar Breakpoint Distance by Angibaud et al. (2009).

It is easy to check that if only one of the two genomes has duplicate genes, then Zero Exemplar Distance can be solved in linear time: we simply need to decide whether the genome without duplicates is a subsequence of the genome with duplicates. In sharp contrast, if both genomes contain duplicate genes, then even if each gene appears at most three times in each genome, the problem Zero Exemplar Distance is already NP-hard, as shown independently in three articles (Chen et al., 2008; Bonizzoni et al., 2007; Angibaud et al., 2009. The quest for the exact boundary between polynomial solvability and NP-hardness led to the following open question first raised by Chen et al. (2008):

Question 1

(Chen et al., 2008). Is the problem Zero Exemplar Distance for monochromosomal genomes still NP-hard if each gene appears at most two times in each genome?

This question was finally settled in the affirmative by Blin et al. (2009):

Theorem 1

(Blin et al., 2009). Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome.

In Section 2, we give a very simple alternative proof of this theorem.

Both the previous proof of Theorem 1 and our alternative proof depend crucially on the order of the genes in the chromosomes. One may naturally wonder whether the complexity of Zero Exemplar Distance would change if gene order is not known. Now model each chromosome as a set of unsigned integers instead of a sequence of signed integers. Then Zero Exemplar Distance for multichromosomal genomes without gene order is the following problem: Given two collections G₁ and G₂ of subsets of the same ground set S of unsigned integers, decide whether both G₁ and G₂ can be reduced, by deleting elements from subsets and deleting subsets from collections, to the same collection G′ of subsets of S such that each unsigned integer in S is contained in exactly one subset in G′, i.e., G′ is a partition of S. For example, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}S & : \quad \{ 1, 2, 3, 4, 5 \} \\ G_1 & : \quad \{ 1, 2, 3 \} \quad \{ 2, 3, 4 \} \quad \{ 4, 5 \} \\ G_2 & : \quad \{ 1, 2 \} \quad \{ 2, 3, 4 \} \quad \{ 3, 4, 5 \} \quad \{ 1, 5 \} \\ G^{ \prime} & : \quad \{ 1, 2 \} \quad \{ 3 \} \quad \{ 4, 5 \} \end{align} \end{document}

We note that, in the absence of gene order, genome rearrangement distances such as the syntenic distance can be defined for multichromosomal genomes represented by collections of sets (Ferretti et al., 1996). The computational complexity and combinatorial properties of syntenic distance have been extensively studied (DasGupta et al., 1998; Liben-Nowell, 2002; Chauve and Fertin, 2004).

In Section 3, we prove the following theorem analogous to Theorem 1:

Theorem 2

Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard even if each gene appears at most two times in each genome.

As decision problems, both variants of Zero Exemplar Distance, for monochromosomal genomes and for multichromosomal genomes without gene order, are in NP. Thus, following the NP-hardness results in Theorem 1 and Theorem 2, these two decision problems are both NP-complete. Moreover, the NP-hardness results in Theorem 1 and Theorem 2 imply that unless NP = P, the corresponding minimization problems of computing the exemplar distance between two genomes do not admit any approximation. For related results, refer to Chen et al. (2006, 2008), Bonizzoni et al. (2007), Angibaud et al. (2009), and Adi et al. (2010).

The problem Zero Exemplar Distance for monochromosomal genomes, as mentioned earlier, has been studied under several different names. Given two sequences A and B over an alphabet Σ = Σ₁ ∪ Σ₂, where Σ₁ is a set of mandatory symbols and Σ₂ is a set of optional symbols, Exemplar Longest Common Subsequence is the problem of finding a longest common subsequence of A and B that contains each mandatory symbol in Σ₁ exactly once (Bonizzoni et al., 2007). For example, if Σ₁ = {1, 2, 3} and Σ₂ = {4, 5}, then both 12355 and 12354 are exemplar longest common subsequences of the two sequences A = 125523545 and B = 51142443554.

Due to the strict requirement on mandatory symbols, Exemplar Longest Common Subsequence does not always have a feasible solution. It is not difficult to see that simply deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences A and B is the same as the problem Zero Exemplar Distance for two monochromosomal genomes A′ and B′ obtained from A and B by deleting all optional symbols. Recall that the problem Zero Exemplar Distance for monochromosomal genomes becomes trivial when only one of the two genomes has duplicate genes. For the equivalent problem of deciding whether a feasible solution to Exemplar Longest Common Subsequence exists, Bonizzoni et al. (2007) showed another tractable special case: If each mandatory symbol appears a total of at most three times in A and B, then there is a polynomial-time algorithm, based on 2SAT, that decides whether A and B have a common subsequence containing all mandatory symbols. This algorithm does not solve the maximization problem, however, and the following question was left open:

Question 2

(Bonizzoni et al., 2007). Is there a polynomial-time algorithm for Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears a total of at most three times in the two input sequences?

Without loss of generality, we assume that each input sequence contains each symbol in the alphabet at least once. If each mandatory symbol appears a total of at most three times in the two input sequences, then it must appear exactly once in one sequence, and at least once in the other sequence, as in the example shown earlier. In Section 4, we prove the following theorem that complements Theorem 1 and answers the open question of Bonizzoni et al. (2007) in the affirmative:

Theorem 3

Zero Exemplar Distance for monochromosomal genomes admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome. Exemplar Longest Common Subsequence admits a polynomial-time algorithm in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence.

We note that the special case in Theorem 3 is a generalization of the trivial case that only one of the two genomes has duplicate genes. Here, in Theorem 3, both genomes may contain duplicate genes, but each gene can be duplicated in only one of the two genomes.

Finally, in Section 5, we prove the following theorem that complements Theorem 2:

Theorem 4

Zero Exemplar Distance for multichromosomal genomes without gene order admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome, and is fixed-parameter tractable in the general case if the parameter is the maximum number of chromosomes in each genome.

2. Alternative Proof of Theorem 1

We prove that Zero Exemplar Distance for monochromosomal genomes is NP-hard by a reduction from the well-known NP-complete problem 3SAT (Garey and Johnson, 1979). Let (V, E) be a 3SAT instance, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V = \{ v_1, \ldots, v_n \} $$ \end{document} is a set of n boolean variables, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$E = \{ e_1, \ldots, e_m \} $$ \end{document} is a conjunctive boolean formula of m clauses, and each clause in E is a disjunction of exactly three literals of the variables in V. We will construct two sequences (genomes) G₁ and G₂ over 2n + 6m + 1 distinct unsigned integers (genes):
• Two variable genes x_i, y_i for each variable v_i, 1 ≤ i ≤ n;

• Three clause genes a_j, b_j, c_j for each clause e_j, 1 ≤ j ≤ m;

• Three literal genes r_j, s_j, t_j for the three literals of each clause e_j, 1 ≤ j ≤ m;

• One separator gene z.

In our construction, all genes appear in the positive orientation in the two genomes, so we will omit the signs in our description. The two genomes G₁ and G₂ are represented schematically as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1 & : \quad \langle v_1 \rangle \ldots \langle v_n \rangle \ z \ \langle e_1 \rangle \ldots \langle e_m \rangle \\ G_2 & : \quad \langle v_1 \rangle \ldots \langle v_n \rangle \ z \ \langle e_1 \rangle \ldots \langle e_m \rangle\end{align} \end{document}

For each variable v_i, the variable gadget 〈v_i〉 consists of one copy of x_i and two copies of y_i in G₁, two copies of x_i and one copy of y_i in G₂, and, for each literal of the variable in the clauses, one copy of the corresponding literal gene (r_j, s_j, or t_j for some clause e_j) in each genome. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$p_{i, 1}, \ldots, p_{i, k_i}$$ \end{document} be the literal genes for the positive literals of v_i, and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q_{i, 1}, \ldots, q_{i, l_i}$$ \end{document} be the literal genes for the negative literals of v_i. The genes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$x_i, y_i, p_{i, 1}, \ldots, p_{i, k_i}, q_{i, 1}, \ldots, q_{i, l_i}$$ \end{document} in the variable gadget 〈v_i〉 are arranged in the following pattern in the two genomes: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1 \langle v_i \rangle & : \quad y_i \ p_{i, 1} \ldots p_{i, k_i} \ x_i \ q_{i, 1} \ldots q_{i, l_i} \ y_i \\ G_2 \langle v_i \rangle & : \quad p_{i, 1} \ldots p_{i, k_i} \ x_i \ y_i \ x_i \ q_{i, 1} \ldots q_{i, l_i}\end{align} \end{document}

For each clause e_j, the clause gadget 〈e_j〉 consists of two copies of each clause gene a_j, b_j, c_j and one copy of each literal gene r_j, s_j, t_j. These genes in 〈e_j〉 are arranged in the following pattern in the two genomes: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1 \langle e_j \rangle & : \quad r_j \ a_j \ b_j \ c_j \ s_j \ a_j \ b_j \ c_j \ t_j \\ G_2 \langle e_j \rangle & : \quad a_j \ r_j \ b_j \ a_j \ s_j \ c_j \ b_j \ t_j \ c_j\end{align} \end{document}

This completes the construction. It is easy to check that each gene appears at most two times in each genome, and that each genome includes exactly 3n + 12m + 1 genes including duplicates. We give an example:

Example 1

For a 3SAT instance of 4 variables and 2 clauses \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e_1 = \{ r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3 \} $$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e_2 = \{ r_2 = \neg v_1, s_2 = v_3, t_2 = v_4 \} $$ \end{document} , the reduction constructs the following two genomes: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1: & \quad y_1 r_1 x_1 r_2 y_1 \quad y_2 x_2 s_1 y_2 \quad y_3 s_2 x_3 t_1 y_3 \quad y_4 t_2 x_4 y_4 \\ & \quad z \quad r_1 a_1 b_1 c_1 s_1 a_1 b_1 c_1 t_1 \quad r_2 a_2 b_2 c_2 s_2 a_2 b_2 c_2 t_2 \\ G_2: & \quad r_1 x_1 y_1 x_1 r_2 \quad x_2 y_2 x_2 s_1 \quad s_2 x_3 y_3 x_3 t_1 \quad t_2 x_4 y_4 x_4 \\ & \quad z \quad a_1 r_1 b_1 a_1 s_1 c_1 b_1 t_1 c_1 \quad a_2 r_2 b_2 a_2 s_2 c_2 b_2 t_2 c_2\end{align} \end{document}

The assignment v₁ = true, v₂ = false, v₃ = false, v₄ = true satisfies the 3SAT instance and corresponds to the following common reduced genome: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G^{ \prime}: \quad r_1 x_1 y_1 \quad y_2 x_2 s_1 \quad y_3 x_3 t_1 \quad t_2 x_4 y_4 \quad z \quad a_1 b_1 c_1 \quad r_2 a_2 s_2 b_2 c_2\end{align} \end{document}

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 1

The 3SAT instance (V,E) is satisfiable if and only if the two genomes G₁ and G₂ have a common subsequence G′ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance (V,E) is satisfiable. We will compose a common subsequence G′ of the two genomes G₁ and G₂ from a common subsequence of each variable gadget 〈v_i〉, the separator gene z in the middle, and a common subsequence of each clause gadget 〈e_j〉. Consider a truth assignment that satisfies the 3SAT instance. For each variable v_i, take the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$p_{i, 1} \ldots p_{i, k_i} \ x_i y_i$$ \end{document} if v_i is set to true, and take the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$y_i x_i \ q_{i, 1} \ldots q_{i, l_i}$$ \end{document} if v_i is set to false. For each clause e_j, at least one of its three literals is true; correspondingly, at least one of the three literal genes r_j, s_j, t_j has been taken from some variable gadget 〈v_i〉. Now take a subsequence from the clause gadget 〈e_j〉 following one of three cases (if two or more cases are applicable because e_j has more than one true literal, pick any one):
1. If r_j has been taken, then take the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a_j b_j \underline{s_j} c_j \underline{t_j}$$ \end{document} .

2. If s_j has been taken, then take either the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\underline{r_j} b_j a_j c_j \underline{t_j}$$ \end{document} or the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\underline{r_j} a_j c_j b_j \underline{t_j}$$ \end{document} .

3. If t_j has been taken, then take the subsequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\underline{r_j} a_j \underline{s_j} b_j c_j$$ \end{document} .

Here an underlined literal gene is omitted from the subsequence taken from the clause gadget 〈e_j〉 if its other copy has already been taken from some variable gadget 〈v_i〉 (this happens when e_j has more than one true literal). The common subsequence G′ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes G₁ and G₂ have a common subsequence G′ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance (V, E) as follows. Due to the strategic location of the separator gene z in the two genomes, each literal gene must appear in the common subsequence either before z in both genomes, in some variable gadget 〈v_i〉, or after z in both genomes, in some clause gadget 〈e_j〉. The crucial property of the clause gadget 〈e_j〉 is that it cannot have a common subsequence including exactly one copy of each clause gene a_j, b_j, c_j unless at least one of the three literal genes r_j, s_j, t_j is omitted. A literal gene omitted from the common subsequence of the clause gadget 〈e_j〉 has to appear in the common subsequence of some variable gadget 〈v_i〉, where the two variable genes x_i and y_i must appear in the order x_iy_i if the literal is positive and appear in the order y_ix_i if the literal is negative. Now set each variable v_i to true if the two variable genes x_i and y_i appear in the common subsequence G′ in the order x_iy_i, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 1.

3. Proof of Theorem 2

We prove that Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard by a reduction again from 3SAT. Let (V, E) be a 3SAT instance, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V = \{ v_1, \ldots, v_n \} $$ \end{document} is a set of n boolean variables, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$E = \{ e_1, \ldots, e_m \} $$ \end{document} is a conjunctive boolean formula of m clauses, and each clause in E is a disjunction of exactly three literals of the variables in V. Without loss of generality, assume that no clause in E contains two literals of the same variable in V. We will construct two genomes G₁ and G₂ over n + 9m distinct genes:
• One variable gene x_i for each variable v_i, 1 ≤ i ≤ n;

• Six clause genes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a_j, b_j, c_j, a^{ \prime}_j, b^{ \prime}_j, c^{ \prime}_j$$ \end{document} for each clause e_j, 1 ≤ j ≤ m;

• Three literal genes r_j, s_j, t_j for the three literals of each clause e_j, 1 ≤ j ≤ m.

For each variable v_i, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$p_{i, 1}, \ldots, p_{i, k_i}$$ \end{document} be the literal genes for the positive literals of v_i, and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$q_{i, 1}, \ldots, q_{i, l_i}$$ \end{document} be the literal genes for the negative literals of v_i. G₁ includes one subset and G₂ includes two subsets of genes including x_i: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1 \langle v_i \rangle & : \quad \{ p_{i, 1}, \ldots, p_{i, k_i}, x_i, q_{i, 1}, \ldots, q_{i, l_i} \} \\ G_2 \langle v_i \rangle & : \quad \{ p_{i, 1}, \ldots, p_{i, k_i}, x_i \} \quad \{ x_i, q_{i, 1}, \ldots, q_{i, l_i} \} \end{align} \end{document}

For each clause e_j, G₁ includes six subsets and G₂ includes seven subsets of clause/literal genes: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1 \langle e_j \rangle & : \quad \{ a_j, b_j \} \ \{ b_j, c_j \} \ \{ c_j, a_j \} \quad \{ a^{ \prime}_j, r_j \} \ \{ b^{ \prime}_j, s_j \} \ \{ c^{ \prime}_j, t_j \} \\ G_2 \langle e_j \rangle & : \quad \{ a_j, b_j, c_j \} \quad \{ a_j, a^{ \prime}_j, r_j \} \ \{ b_j, b^{ \prime}_j, s_j \} \ \{ c_j, c^{ \prime}_j, t_j \} \quad \{ a^{ \prime}_j \} \ \{ b^{ \prime}_j \} \ \{ c^{ \prime}_j \} \end{align} \end{document}

This completes the construction. It is easy to check that each gene appears at most two times in each genome, G₁ includes exactly n + 15m genes including duplicates, and G₂ includes exactly 2n + 18m genes including duplicates. We give an example:

Example 2

For a 3SAT instance of 4 variables and 2 clauses \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e_1 = \{ r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3 \} $$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$e_2 = \{ r_2 = \neg v_1, s_2 = v_3, t_2 = v_4 \} $$ \end{document} , the reduction constructs the following two genomes: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G_1: & \quad \{ r_1, x_1, r_2 \} \quad \{ x_2, s_1 \} \quad \{ s_2, x_3, t_1 \} \quad \{ t_2, x_4 \} \\ & \quad \{ a_1, b_1 \} \ \{ b_1, c_1 \} \ \{ c_1, a_1 \} \quad \{ a^{ \prime}_1, r_1 \} \ \{ b^{ \prime}_1, s_1 \} \ \{ c^{ \prime}_1, t_1 \} \\ & \quad \{ a_2, b_2 \} \ \{ b_2, c_2 \} \ \{ c_2, a_2 \} \quad \{ a^{ \prime}_2, r_2 \} \ \{ b^{ \prime}_2, s_2 \} \ \{ c^{ \prime}_2, t_2 \} \\ G_2: & \quad \{ r_1, x_1 \} \ \{ x_1, r_2 \} \quad \{ x_2 \} \ \{ x_2, s_1 \} \quad \{ s_2, x_3 \} \ \{ x_3, t_1 \} \quad \{ t_2, x_4 \} \ \{ x_4 \} \\ & \quad \{ a_1, b_1, c_1 \} \quad \{ a_1, a^{ \prime}_1, r_1 \} \ \{ b_1, b^{ \prime}_1, s_1 \} \ \{ c_1, c^{ \prime}_1, t_1 \} \quad \{ a^{ \prime}_1 \} \ \{ b^{ \prime}_1 \} \ \{ c^{ \prime}_1 \} \\ & \quad \{ a_2, b_2, c_2 \} \quad \{ a_2, a^{ \prime}_2, r_2 \} \ \{ b_2, b^{ \prime}_2, s_2 \} \ \{ c_2, c^{ \prime}_2, t_2 \} \quad \{ a^{ \prime}_2 \} \ \{ b^{ \prime}_2 \} \ \{ c^{ \prime}_2 \} \end{align} \end{document}

The assignment v₁ = true, v₂ = false, v₃ = false, v₄ = true satisfies the 3SAT instance and corresponds to the following common reduced genome: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}G^{ \prime}: & \quad \{ r_1, x_1 \} \quad \{ x_2, s_1 \} \quad \{ x_3, t_1 \} \quad \{ t_2, x_4 \} \\ & \quad \{ a_1 \} \ \{ b_1, c_1 \} \quad \{ a^{ \prime}_1 \} \ \{ b^{ \prime}_1 \} \ \{ c^{ \prime}_1 \} \\ & \quad \{ c_2 \} \ \{ a_2, b_2 \} \quad \{ a^{ \prime}_2, r_2 \} \ \{ b^{ \prime}_2, s_2 \} \ \{ c^{ \prime}_2 \} \end{align} \end{document}

The reduction clearly runs in polynomial time. It remains to prove the following lemma:

Lemma 2

The 3SAT instance (V,E) is satisfiable if and only if the two genomes G₁ and G₂ have a common reduced genome G′ including exactly one copy of each gene.

We first prove the direct implication. Suppose that the 3SAT instance (V,E) is satisfiable. We will compose a common reduced genome G′ of the two genomes G₁ and G₂ as follows. Consider a truth assignment that satisfies the 3SAT instance. For each variable v_i, take the subset \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ p_{i, 1}, \ldots, p_{i, k_i}, x_i \} $$ \end{document} if v_i is set to true, and take the subset \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ x_i, q_{i, 1}, \ldots, q_{i, l_i} \} $$ \end{document} if v_i is set to false. For each clause e_j, at least one of its three literals is true; correspondingly, at least one of the three literal genes r_j, s_j, t_j has been taken from some variable gadget 〈v_i〉. Now take some subsets of clause/literal genes following one of three cases:
1. If r_j has been taken, then take the subsets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ a_j \}, \{ b_j, c_j \}, \{ a^{ \prime}_j \}, \{ b^{ \prime}_j, \underline{s_j} \}, \{ c^{ \prime}_j, \underline{t_j} \} $$ \end{document} .

2. If s_j has been taken, then take the subsets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ b_j \}, \{ c_j, a_j \}, \{ a^{ \prime}_j, \underline{r_j} \}, \{ b^{ \prime}_j \}, \{ c^{ \prime}_j, \underline{t_j} \} $$ \end{document} .

3. If t_j has been taken, then take the subsets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\{ c_j \}, \{ a_j, b_j \}, \{ a^{ \prime}_j, \underline{r_j} \}, \{ b^{ \prime}_j, \underline{s_j} \}, \{ c^{ \prime}_j \} $$ \end{document} .

Here an underlined literal gene is omitted from the subset taken from the clause gadget 〈e_j〉 if its other copy has already been taken from some variable gadget 〈v_i〉. The reduced genome G′ thus composed clearly includes exactly one copy of each gene.

We next prove the reverse implication. Suppose that the two genomes G₁ and G₂ have a common reduced genome G′ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance (V, E) as follows. The crucial property of the clause gadget 〈e_j〉 is that it cannot have a common reduced genome including exactly one copy of each clause gene \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$a_j, b_j, c_j, a^{ \prime}_j, b^{ \prime}_j, c^{ \prime}_j$$ \end{document} unless at least one of the three literal genes r_j, s_j, t_j is omitted. A literal gene omitted from the clause gadget 〈e_j〉 has to appear in a subset in G′ that contains some variable gene x_i. By the construction of the variable gadgets, this subset contains, besides x_i, either literal genes for positive literals, or literal genes for negative literals. Now set each variable v_i to true if the subset in G′ that contains x_i also contains at least one literal gene for a positive literal, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 2.

4. Proof of Theorem 3

Let A and B be two sequences of lengths n and m, respectively, over an alphabet Σ = Σ₁ ∪ Σ₂, where Σ₁ is a set of mandatory symbols and Σ₂ is a set of optional symbols. In the special case that each mandatory symbol in Σ₁ appears exactly once in one sequence and at least once in the other sequence, we have the obvious but important property that any common subsequence of the two sequences can contain each mandatory symbol at most once. This property leads to a very simple algorithm that decides whether a feasible solution to Exemplar Longest Common Subsequence exists in this special case:

The time complexity of Algorithm 1 is O(nm) by using a standard dynamic programming algorithm for longest common subsequence (Gusfield, 1997). The correctness of Algorithm 1 is justified by the following lemma:

Algorithm 1

1. Obtain two sequences A′ and B′ from A and B by deleting all optional symbols in Σ₂.

2. Compute a longest common subsequence C* of A′ and B′.

3. If C* contains all mandatory symbols in Σ₁, return yes. Otherwise, return no.

Lemma 3

A and B have a common subsequence containing all mandatory symbols in Σ₁ *if and only if the longest common subsequence C** of A′ and B′ contains all mandatory symbols in Σ₁.

Proof

The reduction from A and B to A′ and B′ preserves the mandatory symbols. Thus A and B have a common subsequence containing all mandatory symbols in Σ₁ if and only if A′ and B′ have a common subsequence containing all mandatory symbols in Σ₁. It remains to prove the equivalent claim that A′ and B′ have a common subsequence containing all mandatory symbols in Σ₁ if and only if C* contains all mandatory symbols in Σ₁.

The “if” direction of the claim is trivial because C* is a common subsequence of A′ and B′. To prove the “only if” direction, recall that in any common subsequence of A′ and B′, each mandatory symbol can appear at most once. Thus the length of any common subsequence of A′ and B′ is at most the size of Σ₁. Moreover, if the length of some common subsequence of A′ and B′ is equal to the size of Σ₁, then this common subsequence must contain all mandatory symbols in Σ₁, and vice versa. Now suppose that A′ and B′ have a common subsequence C′ containing all mandatory symbols in Σ₁. Then the length of C′ must be equal to the size of Σ₁. Since the length of C* is at least the length of C′, the length of C* must also be equal to the size of Σ₁. Then C* must contain all mandatory symbols in Σ₁ too. This completes the proof. ▪

Since deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences A and B is the same as the problem Zero Exemplar Distance for two monochromosomal genomes A′ and B′ obtained from A and B by deleting all optional symbols, we also have an O(nm) algorithm for Zero Exemplar Distance for monochromosomal genomes in the special case that each gene appears exactly once in one genome and at least once in the other genome.

We next present an algorithm for the maximization problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence:

If A and B have no common subsequence containing all mandatory symbols in Σ₁, then clearly the maximum-weight common subsequence C* of A and B cannot contain all mandatory symbols in Σ₁, and hence the algorithm correctly reports that no feasible solution exists. Otherwise, the correctness of Algorithm 2 is justified by the following lemma:

Algorithm 2

1. Assign each mandatory symbol in Σ₁ a weight of w = min{n, m} + 1, and assign each optional symbol in Σ₂ a weight of 1. Compute a common subsequence C* of A and B of the maximum total weight.

2. If C* contains all mandatory symbols in Σ₁, return C. Otherwise, report that no feasible solution exists.

Lemma 4

If A and B have a common subsequence containing all mandatory symbols in* Σ₁*, then the maximum-weight common subsequence C** of A and B is a longest common subsequence of A and B that contains all mandatory symbols in Σ₁.

Proof

Suppose that A and B have a common subsequence C containing all mandatory symbols in Σ₁. We first show that the maximum-weight common subsequence C* of A and B contains all mandatory symbols in Σ₁. Note that the number of optional symbols in C* is at most the length of C, which is at most min{n, m}. Also recall that any common subsequence of A and B can contain each mandatory symbol at most once. If C does not contain all mandatory symbols in Σ₁, then by our choice of w = min{n, m} + 1, the total weight of C* would be at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align}( | \Sigma_1 | - 1 ) \cdot w + \min \{ n, m \} \cdot 1 < ( | \Sigma_1 | - 1 ) \cdot w + w \cdot 1 = | \Sigma_1 | \cdot w.\end{align} \end{document}

On the other hand, since C contains all mandatory symbols in Σ₁, the weight of C is at least |Σ₁| · w. This contradicts the assumption that C* is a maximum-weight common subsequence of A and B.

Now, since C* contains all mandatory symbols and can contain each mandatory symbol at most once, C* must contain each mandatory symbol exactly once. Then, to have the maximum total weight, C* must be a longest common subsequence of A and B that contains all mandatory symbols in Σ₁. ▪

Again, the overall time complexity of Algorithm 2 is clearly O(nm) because the standard dynamic programming algorithm for longest common subsequence can be easily adapted to the weighted case within the same running time. This completes the proof of Theorem 3.

5. Proof of Theorem 4

We present two algorithms for Zero Exemplar Distance for multichromosomal genomes without gene order. Let k₁ and k₂, respectively, be the numbers of chromosomes in G₁ and G₂. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A_1, \ldots, A_{k_1}$$ \end{document} be the k₁ chromosomes in G₁. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B_1, \ldots, B_{k_2}$$ \end{document} be the k₂ chromosomes in G₂. Let k = max{k₁, k₂}. Let n be the total number of genes in G₁ and G₂, i.e., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$n = \sum\nolimits_{i = 1}^{k_1} | A_i | + \sum\nolimits_{j = 1}^{k_2} |B_j|$$ \end{document} .

We first present a polynomial-time algorithm for Zero Exemplar Distance for multichromosomal genomes without gene order in the special case that each gene appears exactly once in one genome and at least once in the other genome. Our algorithm is based on maximum-weight matching in bipartite graphs:

To see the correctness of Algorithm 3, note that each reduced chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome, and corresponds to an edge of a matching in the complete bipartite graph. In the special case that each gene appears exactly once in one genome and at least once in the other genome, no gene can appear more than once in the reduced chromosomes corresponding to the edges of a matching. Thus, the maximum possible weight of a matching is equal to the number of distinct genes, and a common reduced genome that includes all the genes corresponds to a matching of the maximum weight.

Algorithm 3

1. Construct a complete bipartite graph G = (V₁ ∪ V₂, V₁ × V₂) with vertices \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V_1 = \{ A_1, \ldots, A_{k_1} \} $$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$V_2 = \{ B_1, \ldots, B_{k_2} \} $$ \end{document} . Associate with each edge between \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A_i \in V_1$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B_j \in V_2$$ \end{document} a reduced chromosome C_ij = A_i ∩ B_j and a weight equal to its size.

2. Compute a maximum-weight matching M in the graph G.

3. If the set of reduced chromosomes corresponding to the edges in M includes all the genes, return yes. Otherwise, return no.

We now analyze the time complexity of Algorithm 3. Steps 1 and 3 can be easily implemented in O(n²) time. Step 2 can be implemented in O(k³) time using a standard algorithm for weight bipartite matching (Tarjan, 1983). Thus the overall time complexity is O(n² + k³).

We next present a fixed-parameter tractable algorithm for this problem without any assumption on the distribution of duplicate genes. Refer to Downey and Fellows (1999) for basic concepts in parameterized complexity theory. The parameter of our algorithm is k = max{k₁, k₂}:

To see the correctness of Algorithm 4, note again that each chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome. All other chromosomes of the two input genomes that do not contribute to the common reduced genome are deleted. To handle the matching and the deletion of the chromosomes in a uniform way, we can think of each chromosome deleted from one genome as matched to a chromosome deleted from the other genome or to an empty chromosome. Thus by padding the two genomes to the same number of chromosomes, we only need to consider perfect matchings as permutations. The time complexity of Algorithm 4 is O(k!n²), with O(n²) time for each of the k! permutations. This completes the proof of Theorem 4.

Algorithm 4

1. Add k − k₁ empty chromosomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$A_{k_1 + 1}, \ldots, A_k$$ \end{document} to G₁, or add k − k₂ empty chromosomes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$B_{k_2 + 1}, \ldots, B_k$$ \end{document} to G₂, such that G₁ and G₂ have the same number k of chromosomes.

2. For each permutation π of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\langle 1, \ldots, k \rangle$$ \end{document} , compute \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$C_{ \pi} = \cup_{i = 1}^k ( A_i \cap B_{ \pi ( i ) } )$$ \end{document} .

3. If for some permutation π the set C_π includes all the genes, return yes. Otherwise return no.

We remark that the problem Zero Exemplar Distance for multichromosomal genomes without gene order is unlikely to have a fixed-parameter tractable algorithm if the parameter is the maximum number of genes in any single chromosome. This is because 3SAT remains NP-hard even if for each variable there are at most five clauses that contain its literals (Garey and Johnson, 1979). As a result, in our reduction from 3SAT in Section 3, the number of genes in each chromosome need not be more than some constant.

Footnotes

Acknowledgments

This work is supported in part by NSF grant DBI-0743670. A preliminary version of this work (including Theorem 1, Theorem 2, and a weaker version of Theorem 4) was presented at the 8th Annual RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG’10) and appeared in the workshop proceedings. The author would like to thank Binhai Zhu for a brief discussion on Question 1 during a visit to University of Texas–Pan American in May 2008, and Guillaume Fertin for communicating the recent result of Blin et al. (2009). The alternative proof of Theorem 1 was obtained independently by the author in February 2010 without knowing this recent progress. The author also thanks Pedro J. Tejada for bringing Question 2, an open question of Bonizzoni et al. (), to his attention.

Disclosure Statement

No competing financial interests exist.

References

Adi

S.S.

, Braga

M.D.V.

, Fernandes

C.G.

et al. 2010. Repetition-free longest common subsequence. Discrete Appl. Math., 158:1315–1324.

Angibaud

, Fertin

, Rusu

et al. 2009. On the approximability of comparing genomes with duplicates. J. Graph Algorithms Appl., 13:19–53.

Blin

, Fertin

, Sikora

et al. 2009. The Exemplar Breakpoint Distance for non-trivial genomes cannot be approximated. Proc. WALCOM’09, 357–368.

Bonizzoni

, Della Vedova

, Dondi

et al. 2007. Exemplar longest common subsequence. IEEE-ACM Trans. Comput. Biol. Bioinform., 4:535–543.

Chauve

, Fertin

2004. On maximal instances for the original syntenic distance. Theor. Comput. Sci., 326:29–43.

Chen

, Fowler

R.H.

, Fu

et al. 2008. On the inapproximability of the exemplar conserved interval distance problem of genomes. J. Comb. Optim., 15:201–221.

Chen

, Fu

, Zhu

2006. The approximability of the exemplar breakpoint distance problem. Proc. AAIM’06, 291–302.

DasGupta

, Jiang

, Kannan

et al. 1998. On the complexity and approximation of syntenic distance. Discrete Appl. Math., 88:59–82.

Downey

R.G.

, Fellows

M.R.

1999. Parameterized Complexity. Springer-Verlag: New York.

10.

Ferretti

, Nadeau

J.H.

, Sankoff

1996. Original synteny. Proc. CPM’96, 159–167.

11.

Garey

M.R.

, Johnson

D.S.

1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company: New York.

12.

Gusfield

1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press: Cambridge, UK.

13.

Liben-Nowell

2002. Gossip is synteny: incomplete gossip and the syntenic distance between genomes. J. Algorithms, 43:264–283.

14.

Sankoff

1999. Genome rearrangement with gene families. Bioinformatics, 15:909–917.

15.

Tarjan

R.E.

1983. Data Structures and Network Algorithms. SIAM: Washington, DC.