Abstract
Given two genomes with duplicate genes, Zero Exemplar Distance is the problem of deciding whether the two genomes can be reduced to the same genome without duplicate genes by deleting all but one copy of each gene in each genome. Blin, Fertin, Sikora, and Vialette recently proved that Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome, thereby settling an important open question on genome rearrangement in the exemplar model. In this article, we give a very simple alternative proof of this result. We also study the problem Zero Exemplar Distance for multichromosomal genomes without gene order, and prove the analogous result that it is also NP-hard even if each gene appears at most two times in each genome. For the positive direction, we show that both variants of Zero Exemplar Distance admit polynomial-time algorithms if each gene appears exactly once in one genome and at least once in the other genome. In addition, we present a polynomial-time algorithm for the related problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence. This answers an open question of Bonizzoni et al. We also show that Zero Exemplar Distance for multichromosomal genomes without gene order is fixed-parameter tractable in the general case if the parameter is the maximum number of chromosomes in each genome.
1. Introduction
Given two genomes with duplicate genes, Genome Rearrangement with Gene Families is the problem of deleting all but one copy of each gene in each genome, so as to minimize some rearrangement distance between the two reduced genomes (Sankoff, 1999). The minimum rearrangement distance thus attained is called the exemplar distance between the two genomes. For example, each of the following two monochromosomal genomes
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
G_1 : & \quad {- 4} \ {+ 1} \ {+ 2} \ {+ 3} \ {- 5} \ {+ 1} \ {+ 2} \ {+ 3} \ {- 6} \\ G_2 : & \quad {- 1} \ {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 2} \ {- 6} \ {+ 3}
\end{align*}
\end{document}
has at most two copies of each gene, and each of the following two reduced genomes
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
G^{ \prime}_1 & : \quad {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 6} \\ G^{ \prime}_2 & : \quad {- 4} \ {+ 1} \ {+ 2} \ {- 5} \ {+ 3} \ {- 6}
\end{align*}
\end{document}
has exactly one copy of each gene. Recall that, in the study of genome rearrangement, a gene is usually represented by a signed integer: the absolute value of the integer (the unsigned integer) denotes the gene family to which the gene belongs; the sign of the integer denotes the orientation of the gene in its chromosome. Then a chromosome is a sequence of signed integers, and a genome is a collection of chromosomes. In the simpliest case, a monochromosomal genome, which contains only one chromosome, can be represented by a sequence of signed integers.
Genome Rearrangement with Gene Families is not a single problem but a whole class of related problems, because the choice of rearrangement distance is not unique. This choice becomes irrelevant, however, when we ask the fundamental question: Is the distance zero? In the example above, the two reduced genomes
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$G_1^{ \prime}$$
\end{document}
and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$G_2^{ \prime}$$
\end{document}
are identical, thus the exemplar distance between the two original genomes G1 and G2 is zero for any reasonable choice of rearrangement distance.
In this article, we study the most basic version of the problem Genome Rearrangement with Gene Families: Given two sequences of signed integers, Zero Exemplar Distance (for monochromosomal genomes) is the problem of deciding whether the two sequences have a common subsequence including each unsigned integer exactly once in either positive or negative form.
Due to its generic nature, the problem Zero Exemplar Distance has been extensively studied by several groups of researchers focusing on different rearrangement distances, and, not surprisingly, has acquired several different names. Except for trivial distinctions, Zero Exemplar Distance is essentially the same problem as Zero Exemplar Conserved Interval Distance by Chen et al. (2008), Exemplar Longest Common Subsequence (deciding whether a feasible solution exists) by Bonizzoni et al. (2007), and Zero Exemplar Breakpoint Distance by Angibaud et al. (2009).
It is easy to check that if only one of the two genomes has duplicate genes, then Zero Exemplar Distance can be solved in linear time: we simply need to decide whether the genome without duplicates is a subsequence of the genome with duplicates. In sharp contrast, if both genomes contain duplicate genes, then even if each gene appears at most three times in each genome, the problem Zero Exemplar Distance is already NP-hard, as shown independently in three articles (Chen et al., 2008; Bonizzoni et al., 2007; Angibaud et al., 2009. The quest for the exact boundary between polynomial solvability and NP-hardness led to the following open question first raised by Chen et al. (2008):
Question 1
(Chen et al., 2008). Is the problem Zero Exemplar Distance for monochromosomal genomes still NP-hard if each gene appears at most two times in each genome?
This question was finally settled in the affirmative by Blin et al. (2009):
Theorem 1
(Blin et al., 2009). Zero Exemplar Distance for monochromosomal genomes is NP-hard even if each gene appears at most two times in each genome.
In Section 2, we give a very simple alternative proof of this theorem.
Both the previous proof of Theorem 1 and our alternative proof depend crucially on the order of the genes in the chromosomes. One may naturally wonder whether the complexity of Zero Exemplar Distance would change if gene order is not known. Now model each chromosome as a set of unsigned integers instead of a sequence of signed integers. Then Zero Exemplar Distance for multichromosomal genomes without gene order is the following problem: Given two collections G1 and G2 of subsets of the same ground set S of unsigned integers, decide whether both G1 and G2 can be reduced, by deleting elements from subsets and deleting subsets from collections, to the same collection G′ of subsets of S such that each unsigned integer in S is contained in exactly one subset in G′, i.e., G′ is a partition of S. For example,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}S & : \quad \{ 1, 2, 3, 4, 5 \} \\ G_1 & : \quad \{ 1, 2, 3 \} \quad \{ 2, 3, 4 \} \quad \{ 4, 5 \} \\ G_2 & : \quad \{ 1, 2 \} \quad \{ 2, 3, 4 \} \quad \{ 3, 4, 5 \} \quad \{ 1, 5 \} \\ G^{ \prime} & : \quad \{ 1, 2 \} \quad \{ 3 \} \quad \{ 4, 5 \} \end{align*}
\end{document}
We note that, in the absence of gene order, genome rearrangement distances such as the syntenic distance can be defined for multichromosomal genomes represented by collections of sets (Ferretti et al., 1996). The computational complexity and combinatorial properties of syntenic distance have been extensively studied (DasGupta et al., 1998; Liben-Nowell, 2002; Chauve and Fertin, 2004).
In Section 3, we prove the following theorem analogous to Theorem 1:
Theorem 2
Zero Exemplar Distance
for multichromosomal genomes without gene order is NP-hard even if each gene appears at most two times in each genome.
As decision problems, both variants of Zero Exemplar Distance, for monochromosomal genomes and for multichromosomal genomes without gene order, are in NP. Thus, following the NP-hardness results in Theorem 1 and Theorem 2, these two decision problems are both NP-complete. Moreover, the NP-hardness results in Theorem 1 and Theorem 2 imply that unless NP = P, the corresponding minimization problems of computing the exemplar distance between two genomes do not admit any approximation. For related results, refer to Chen et al. (2006, 2008), Bonizzoni et al. (2007), Angibaud et al. (2009), and Adi et al. (2010).
The problem Zero Exemplar Distance for monochromosomal genomes, as mentioned earlier, has been studied under several different names. Given two sequences A and B over an alphabet Σ = Σ1 ∪ Σ2, where Σ1 is a set of mandatory symbols and Σ2 is a set of optional symbols, Exemplar Longest Common Subsequence is the problem of finding a longest common subsequence of A and B that contains each mandatory symbol in Σ1 exactly once (Bonizzoni et al., 2007). For example, if Σ1 = {1, 2, 3} and Σ2 = {4, 5}, then both 12355 and 12354 are exemplar longest common subsequences of the two sequences A = 125523545 and B = 51142443554.
Due to the strict requirement on mandatory symbols, Exemplar Longest Common Subsequence does not always have a feasible solution. It is not difficult to see that simply deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences A and B is the same as the problem Zero Exemplar Distance for two monochromosomal genomes A′ and B′ obtained from A and B by deleting all optional symbols. Recall that the problem Zero Exemplar Distance for monochromosomal genomes becomes trivial when only one of the two genomes has duplicate genes. For the equivalent problem of deciding whether a feasible solution to Exemplar Longest Common Subsequence exists, Bonizzoni et al. (2007) showed another tractable special case: If each mandatory symbol appears a total of at most three times in A and B, then there is a polynomial-time algorithm, based on 2SAT, that decides whether A and B have a common subsequence containing all mandatory symbols. This algorithm does not solve the maximization problem, however, and the following question was left open:
Question 2
(Bonizzoni et al., 2007). Is there a polynomial-time algorithm for Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears a total of at most three times in the two input sequences?
Without loss of generality, we assume that each input sequence contains each symbol in the alphabet at least once. If each mandatory symbol appears a total of at most three times in the two input sequences, then it must appear exactly once in one sequence, and at least once in the other sequence, as in the example shown earlier. In Section 4, we prove the following theorem that complements Theorem 1 and answers the open question of Bonizzoni et al. (2007) in the affirmative:
Theorem 3
Zero Exemplar Distance
for monochromosomal genomes admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome.
Exemplar Longest Common Subsequence
admits a polynomial-time algorithm in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence.
We note that the special case in Theorem 3 is a generalization of the trivial case that only one of the two genomes has duplicate genes. Here, in Theorem 3, both genomes may contain duplicate genes, but each gene can be duplicated in only one of the two genomes.
Finally, in Section 5, we prove the following theorem that complements Theorem 2:
Theorem 4
Zero Exemplar Distance
for multichromosomal genomes without gene order admits a polynomial-time algorithm in the special case that each gene appears exactly once in one genome and at least once in the other genome, and is fixed-parameter tractable in the general case if the parameter is the maximum number of chromosomes in each genome.
2. Alternative Proof of Theorem 1
We prove that Zero Exemplar Distance for monochromosomal genomes is NP-hard by a reduction from the well-known NP-complete problem 3SAT (Garey and Johnson, 1979). Let (V, E) be a 3SAT instance, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$V = \{ v_1, \ldots, v_n \} $$
\end{document}
is a set of n boolean variables,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$E = \{ e_1, \ldots, e_m \} $$
\end{document}
is a conjunctive boolean formula of m clauses, and each clause in E is a disjunction of exactly three literals of the variables in V. We will construct two sequences (genomes) G1 and G2 over 2n + 6m + 1 distinct unsigned integers (genes):
• Two variable genes xi, yi for each variable vi, 1 ≤ i ≤ n;
• Three clause genes aj, bj, cj for each clause ej, 1 ≤ j ≤ m;
• Three literal genes rj, sj, tj for the three literals of each clause ej, 1 ≤ j ≤ m;
• One separator gene z.
In our construction, all genes appear in the positive orientation in the two genomes, so we will omit the signs in our description. The two genomes G1 and G2 are represented schematically as follows:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1 & : \quad \langle v_1 \rangle \ldots \langle v_n \rangle \ z \ \langle e_1 \rangle \ldots \langle e_m \rangle \\ G_2 & : \quad \langle v_1 \rangle \ldots \langle v_n \rangle \ z \ \langle e_1 \rangle \ldots \langle e_m \rangle\end{align*}
\end{document}
For each variable vi, the variable gadget 〈vi〉 consists of one copy of xi and two copies of yi in G1, two copies of xi and one copy of yi in G2, and, for each literal of the variable in the clauses, one copy of the corresponding literal gene (rj, sj, or tj for some clause ej) in each genome. Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p_{i, 1}, \ldots, p_{i, k_i}$$
\end{document}
be the literal genes for the positive literals of vi, and let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$q_{i, 1}, \ldots, q_{i, l_i}$$
\end{document}
be the literal genes for the negative literals of vi. The genes
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$x_i, y_i, p_{i, 1}, \ldots, p_{i, k_i}, q_{i, 1}, \ldots, q_{i, l_i}$$
\end{document}
in the variable gadget 〈vi〉 are arranged in the following pattern in the two genomes:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1 \langle v_i \rangle & : \quad y_i \ p_{i, 1} \ldots p_{i, k_i} \ x_i \ q_{i, 1} \ldots q_{i, l_i} \ y_i \\ G_2 \langle v_i \rangle & : \quad p_{i, 1} \ldots p_{i, k_i} \ x_i \ y_i \ x_i \ q_{i, 1} \ldots q_{i, l_i}\end{align*}
\end{document}
For each clause ej, the clause gadget 〈ej〉 consists of two copies of each clause gene aj, bj, cj and one copy of each literal gene rj, sj, tj. These genes in 〈ej〉 are arranged in the following pattern in the two genomes:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1 \langle e_j \rangle & : \quad r_j \ a_j \ b_j \ c_j \ s_j \ a_j \ b_j \ c_j \ t_j \\ G_2 \langle e_j \rangle & : \quad a_j \ r_j \ b_j \ a_j \ s_j \ c_j \ b_j \ t_j \ c_j\end{align*}
\end{document}
This completes the construction. It is easy to check that each gene appears at most two times in each genome, and that each genome includes exactly 3n + 12m + 1 genes including duplicates. We give an example:
Example 1
For a 3SAT instance of 4 variables and 2 clauses
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$e_1 = \{ r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3 \} $$
\end{document}
and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$e_2 = \{ r_2 = \neg v_1, s_2 = v_3, t_2 = v_4 \} $$
\end{document}
, the reduction constructs the following two genomes:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1: & \quad y_1 r_1 x_1 r_2 y_1 \quad y_2 x_2 s_1 y_2 \quad y_3 s_2 x_3 t_1 y_3 \quad y_4 t_2 x_4 y_4 \\ & \quad z \quad r_1 a_1 b_1 c_1 s_1 a_1 b_1 c_1 t_1 \quad r_2 a_2 b_2 c_2 s_2 a_2 b_2 c_2 t_2 \\ G_2: & \quad r_1 x_1 y_1 x_1 r_2 \quad x_2 y_2 x_2 s_1 \quad s_2 x_3 y_3 x_3 t_1 \quad t_2 x_4 y_4 x_4 \\ & \quad z \quad a_1 r_1 b_1 a_1 s_1 c_1 b_1 t_1 c_1 \quad a_2 r_2 b_2 a_2 s_2 c_2 b_2 t_2 c_2\end{align*}
\end{document}
The assignment v1 = true, v2 = false, v3 = false, v4 = true satisfies the 3SAT instance and corresponds to the following common reduced genome:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G^{ \prime}: \quad r_1 x_1 y_1 \quad y_2 x_2 s_1 \quad y_3 x_3 t_1 \quad t_2 x_4 y_4 \quad z \quad a_1 b_1 c_1 \quad r_2 a_2 s_2 b_2 c_2\end{align*}
\end{document}
The reduction clearly runs in polynomial time. It remains to prove the following lemma:
Lemma 1
The 3SAT instance (V,E) is satisfiable if and only if the two genomes G1 and G2 have a common subsequence G′ including exactly one copy of each gene.
We first prove the direct implication. Suppose that the 3SAT instance (V,E) is satisfiable. We will compose a common subsequence G′ of the two genomes G1 and G2 from a common subsequence of each variable gadget 〈vi〉, the separator gene z in the middle, and a common subsequence of each clause gadget 〈ej〉. Consider a truth assignment that satisfies the 3SAT instance. For each variable vi, take the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p_{i, 1} \ldots p_{i, k_i} \ x_i y_i$$
\end{document}
if vi is set to true, and take the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$y_i x_i \ q_{i, 1} \ldots q_{i, l_i}$$
\end{document}
if vi is set to false. For each clause ej, at least one of its three literals is true; correspondingly, at least one of the three literal genes rj, sj, tj has been taken from some variable gadget 〈vi〉. Now take a subsequence from the clause gadget 〈ej〉 following one of three cases (if two or more cases are applicable because ej has more than one true literal, pick any one):
1. If rj has been taken, then take the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_j b_j \underline{s_j} c_j \underline{t_j}$$
\end{document}
.
2. If sj has been taken, then take either the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\underline{r_j} b_j a_j c_j \underline{t_j}$$
\end{document}
or the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\underline{r_j} a_j c_j b_j \underline{t_j}$$
\end{document}
.
3. If tj has been taken, then take the subsequence
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\underline{r_j} a_j \underline{s_j} b_j c_j$$
\end{document}
.
Here an underlined literal gene is omitted from the subsequence taken from the clause gadget 〈ej〉 if its other copy has already been taken from some variable gadget 〈vi〉 (this happens when ej has more than one true literal). The common subsequence G′ thus composed clearly includes exactly one copy of each gene.
We next prove the reverse implication. Suppose that the two genomes G1 and G2 have a common subsequence G′ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance (V, E) as follows. Due to the strategic location of the separator gene z in the two genomes, each literal gene must appear in the common subsequence either before z in both genomes, in some variable gadget 〈vi〉, or after z in both genomes, in some clause gadget 〈ej〉. The crucial property of the clause gadget 〈ej〉 is that it cannot have a common subsequence including exactly one copy of each clause gene aj, bj, cj unless at least one of the three literal genes rj, sj, tj is omitted. A literal gene omitted from the common subsequence of the clause gadget 〈ej〉 has to appear in the common subsequence of some variable gadget 〈vi〉, where the two variable genes xi and yi must appear in the order xiyi if the literal is positive and appear in the order yixi if the literal is negative. Now set each variable vi to true if the two variable genes xi and yi appear in the common subsequence G′ in the order xiyi, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 1.
We prove that Zero Exemplar Distance for multichromosomal genomes without gene order is NP-hard by a reduction again from 3SAT. Let (V, E) be a 3SAT instance, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$V = \{ v_1, \ldots, v_n \} $$
\end{document}
is a set of n boolean variables,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$E = \{ e_1, \ldots, e_m \} $$
\end{document}
is a conjunctive boolean formula of m clauses, and each clause in E is a disjunction of exactly three literals of the variables in V. Without loss of generality, assume that no clause in E contains two literals of the same variable in V. We will construct two genomes G1 and G2 over n + 9m distinct genes:
• One variable gene xi for each variable vi, 1 ≤ i ≤ n;
• Six clause genes
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_j, b_j, c_j, a^{ \prime}_j, b^{ \prime}_j, c^{ \prime}_j$$
\end{document}
for each clause ej, 1 ≤ j ≤ m;
• Three literal genes rj, sj, tj for the three literals of each clause ej, 1 ≤ j ≤ m.
For each variable vi, let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$p_{i, 1}, \ldots, p_{i, k_i}$$
\end{document}
be the literal genes for the positive literals of vi, and let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$q_{i, 1}, \ldots, q_{i, l_i}$$
\end{document}
be the literal genes for the negative literals of vi. G1 includes one subset and G2 includes two subsets of genes including xi:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1 \langle v_i \rangle & : \quad \{ p_{i, 1}, \ldots, p_{i, k_i}, x_i, q_{i, 1}, \ldots, q_{i, l_i} \} \\ G_2 \langle v_i \rangle & : \quad \{ p_{i, 1}, \ldots, p_{i, k_i}, x_i \} \quad \{ x_i, q_{i, 1}, \ldots, q_{i, l_i} \} \end{align*}
\end{document}
For each clause ej, G1 includes six subsets and G2 includes seven subsets of clause/literal genes:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1 \langle e_j \rangle & : \quad \{ a_j, b_j \} \ \{ b_j, c_j \} \ \{ c_j, a_j \} \quad \{ a^{ \prime}_j, r_j \} \ \{ b^{ \prime}_j, s_j \} \ \{ c^{ \prime}_j, t_j \} \\ G_2 \langle e_j \rangle & : \quad \{ a_j, b_j, c_j \} \quad \{ a_j, a^{ \prime}_j, r_j \} \ \{ b_j, b^{ \prime}_j, s_j \} \ \{ c_j, c^{ \prime}_j, t_j \} \quad \{ a^{ \prime}_j \} \ \{ b^{ \prime}_j \} \ \{ c^{ \prime}_j \} \end{align*}
\end{document}
This completes the construction. It is easy to check that each gene appears at most two times in each genome, G1 includes exactly n + 15m genes including duplicates, and G2 includes exactly 2n + 18m genes including duplicates. We give an example:
Example 2
For a 3SAT instance of 4 variables and 2 clauses
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$e_1 = \{ r_1 = v_1, s_1 = \neg v_2, t_1 = \neg v_3 \} $$
\end{document}
and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$e_2 = \{ r_2 = \neg v_1, s_2 = v_3, t_2 = v_4 \} $$
\end{document}
, the reduction constructs the following two genomes:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G_1: & \quad \{ r_1, x_1, r_2 \} \quad \{ x_2, s_1 \} \quad \{ s_2, x_3, t_1 \} \quad \{ t_2, x_4 \} \\ & \quad \{ a_1, b_1 \} \ \{ b_1, c_1 \} \ \{ c_1, a_1 \} \quad \{ a^{ \prime}_1, r_1 \} \ \{ b^{ \prime}_1, s_1 \} \ \{ c^{ \prime}_1, t_1 \} \\ & \quad \{ a_2, b_2 \} \ \{ b_2, c_2 \} \ \{ c_2, a_2 \} \quad \{ a^{ \prime}_2, r_2 \} \ \{ b^{ \prime}_2, s_2 \} \ \{ c^{ \prime}_2, t_2 \} \\ G_2: & \quad \{ r_1, x_1 \} \ \{ x_1, r_2 \} \quad \{ x_2 \} \ \{ x_2, s_1 \} \quad \{ s_2, x_3 \} \ \{ x_3, t_1 \} \quad \{ t_2, x_4 \} \ \{ x_4 \} \\ & \quad \{ a_1, b_1, c_1 \} \quad \{ a_1, a^{ \prime}_1, r_1 \} \ \{ b_1, b^{ \prime}_1, s_1 \} \ \{ c_1, c^{ \prime}_1, t_1 \} \quad \{ a^{ \prime}_1 \} \ \{ b^{ \prime}_1 \} \ \{ c^{ \prime}_1 \} \\ & \quad \{ a_2, b_2, c_2 \} \quad \{ a_2, a^{ \prime}_2, r_2 \} \ \{ b_2, b^{ \prime}_2, s_2 \} \ \{ c_2, c^{ \prime}_2, t_2 \} \quad \{ a^{ \prime}_2 \} \ \{ b^{ \prime}_2 \} \ \{ c^{ \prime}_2 \} \end{align*}
\end{document}
The assignment v1 = true, v2 = false, v3 = false, v4 = true satisfies the 3SAT instance and corresponds to the following common reduced genome:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}G^{ \prime}: & \quad \{ r_1, x_1 \} \quad \{ x_2, s_1 \} \quad \{ x_3, t_1 \} \quad \{ t_2, x_4 \} \\ & \quad \{ a_1 \} \ \{ b_1, c_1 \} \quad \{ a^{ \prime}_1 \} \ \{ b^{ \prime}_1 \} \ \{ c^{ \prime}_1 \} \\ & \quad \{ c_2 \} \ \{ a_2, b_2 \} \quad \{ a^{ \prime}_2, r_2 \} \ \{ b^{ \prime}_2, s_2 \} \ \{ c^{ \prime}_2 \} \end{align*}
\end{document}
The reduction clearly runs in polynomial time. It remains to prove the following lemma:
Lemma 2
The 3SAT instance (V,E) is satisfiable if and only if the two genomes G1 and G2 have a common reduced genome G′ including exactly one copy of each gene.
We first prove the direct implication. Suppose that the 3SAT instance (V,E) is satisfiable. We will compose a common reduced genome G′ of the two genomes G1 and G2 as follows. Consider a truth assignment that satisfies the 3SAT instance. For each variable vi, take the subset
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\{ p_{i, 1}, \ldots, p_{i, k_i}, x_i \} $$
\end{document}
if vi is set to true, and take the subset
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\{ x_i, q_{i, 1}, \ldots, q_{i, l_i} \} $$
\end{document}
if vi is set to false. For each clause ej, at least one of its three literals is true; correspondingly, at least one of the three literal genes rj, sj, tj has been taken from some variable gadget 〈vi〉. Now take some subsets of clause/literal genes following one of three cases:
1. If rj has been taken, then take the subsets
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\{ a_j \}, \{ b_j, c_j \}, \{ a^{ \prime}_j \}, \{ b^{ \prime}_j, \underline{s_j} \}, \{ c^{ \prime}_j, \underline{t_j} \} $$
\end{document}
.
2. If sj has been taken, then take the subsets
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\{ b_j \}, \{ c_j, a_j \}, \{ a^{ \prime}_j, \underline{r_j} \}, \{ b^{ \prime}_j \}, \{ c^{ \prime}_j, \underline{t_j} \} $$
\end{document}
.
3. If tj has been taken, then take the subsets
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\{ c_j \}, \{ a_j, b_j \}, \{ a^{ \prime}_j, \underline{r_j} \}, \{ b^{ \prime}_j, \underline{s_j} \}, \{ c^{ \prime}_j \} $$
\end{document}
.
Here an underlined literal gene is omitted from the subset taken from the clause gadget 〈ej〉 if its other copy has already been taken from some variable gadget 〈vi〉. The reduced genome G′ thus composed clearly includes exactly one copy of each gene.
We next prove the reverse implication. Suppose that the two genomes G1 and G2 have a common reduced genome G′ including exactly one copy of each gene. We will find a satisfying assignment for the 3SAT instance (V, E) as follows. The crucial property of the clause gadget 〈ej〉 is that it cannot have a common reduced genome including exactly one copy of each clause gene
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_j, b_j, c_j, a^{ \prime}_j, b^{ \prime}_j, c^{ \prime}_j$$
\end{document}
unless at least one of the three literal genes rj, sj, tj is omitted. A literal gene omitted from the clause gadget 〈ej〉 has to appear in a subset in G′ that contains some variable gene xi. By the construction of the variable gadgets, this subset contains, besides xi, either literal genes for positive literals, or literal genes for negative literals. Now set each variable vi to true if the subset in G′ that contains xi also contains at least one literal gene for a positive literal, and set it to false otherwise. Then each clause gets at least one true literal. This completes the proof of Theorem 2.
Let A and B be two sequences of lengths n and m, respectively, over an alphabet Σ = Σ1 ∪ Σ2, where Σ1 is a set of mandatory symbols and Σ2 is a set of optional symbols. In the special case that each mandatory symbol in Σ1 appears exactly once in one sequence and at least once in the other sequence, we have the obvious but important property that any common subsequence of the two sequences can contain each mandatory symbol at most once. This property leads to a very simple algorithm that decides whether a feasible solution to Exemplar Longest Common Subsequence exists in this special case:
The time complexity of Algorithm 1 is O(nm) by using a standard dynamic programming algorithm for longest common subsequence (Gusfield, 1997). The correctness of Algorithm 1 is justified by the following lemma:
Lemma 3
A and B have a common subsequence containing all mandatory symbols in Σ1 if and only if the longest common subsequence C* of A′ and B′ contains all mandatory symbols in Σ1.
Proof
The reduction from A and B to A′ and B′ preserves the mandatory symbols. Thus A and B have a common subsequence containing all mandatory symbols in Σ1 if and only if A′ and B′ have a common subsequence containing all mandatory symbols in Σ1. It remains to prove the equivalent claim that A′ and B′ have a common subsequence containing all mandatory symbols in Σ1 if and only if C* contains all mandatory symbols in Σ1.
The “if” direction of the claim is trivial because C* is a common subsequence of A′ and B′. To prove the “only if” direction, recall that in any common subsequence of A′ and B′, each mandatory symbol can appear at most once. Thus the length of any common subsequence of A′ and B′ is at most the size of Σ1. Moreover, if the length of some common subsequence of A′ and B′ is equal to the size of Σ1, then this common subsequence must contain all mandatory symbols in Σ1, and vice versa. Now suppose that A′ and B′ have a common subsequence C′ containing all mandatory symbols in Σ1. Then the length of C′ must be equal to the size of Σ1. Since the length of C* is at least the length of C′, the length of C* must also be equal to the size of Σ1. Then C* must contain all mandatory symbols in Σ1 too. This completes the proof. ▪
Since deciding whether a feasible solution to Exemplar Longest Common Subsequence exists for two sequences A and B is the same as the problem Zero Exemplar Distance for two monochromosomal genomes A′ and B′ obtained from A and B by deleting all optional symbols, we also have an O(nm) algorithm for Zero Exemplar Distance for monochromosomal genomes in the special case that each gene appears exactly once in one genome and at least once in the other genome.
We next present an algorithm for the maximization problem Exemplar Longest Common Subsequence in the special case that each mandatory symbol appears exactly once in one input sequence and at least once in the other input sequence:
If A and B have no common subsequence containing all mandatory symbols in Σ1, then clearly the maximum-weight common subsequence C* of A and B cannot contain all mandatory symbols in Σ1, and hence the algorithm correctly reports that no feasible solution exists. Otherwise, the correctness of Algorithm 2 is justified by the following lemma:
Lemma 4
If A and B have a common subsequence containing all mandatory symbols in Σ1, then the maximum-weight common subsequence C* of A and B is a longest common subsequence of A and B that contains all mandatory symbols in Σ1.
Proof
Suppose that A and B have a common subsequence C containing all mandatory symbols in Σ1. We first show that the maximum-weight common subsequence C* of A and B contains all mandatory symbols in Σ1. Note that the number of optional symbols in C* is at most the length of C*, which is at most min{n, m}. Also recall that any common subsequence of A and B can contain each mandatory symbol at most once. If C* does not contain all mandatory symbols in Σ1, then by our choice of w = min{n, m} + 1, the total weight of C* would be at most
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}( | \Sigma_1 | - 1 ) \cdot w + \min \{ n, m \} \cdot 1 < ( | \Sigma_1 | - 1 ) \cdot w + w \cdot 1 = | \Sigma_1 | \cdot w.\end{align*}
\end{document}
On the other hand, since C contains all mandatory symbols in Σ1, the weight of C is at least |Σ1| · w. This contradicts the assumption that C* is a maximum-weight common subsequence of A and B.
Now, since C* contains all mandatory symbols and can contain each mandatory symbol at most once, C* must contain each mandatory symbol exactly once. Then, to have the maximum total weight, C* must be a longest common subsequence of A and B that contains all mandatory symbols in Σ1. ▪
Again, the overall time complexity of Algorithm 2 is clearly O(nm) because the standard dynamic programming algorithm for longest common subsequence can be easily adapted to the weighted case within the same running time. This completes the proof of Theorem 3.
We present two algorithms for Zero Exemplar Distance for multichromosomal genomes without gene order. Let k1 and k2, respectively, be the numbers of chromosomes in G1 and G2. Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$A_1, \ldots, A_{k_1}$$
\end{document}
be the k1 chromosomes in G1. Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$B_1, \ldots, B_{k_2}$$
\end{document}
be the k2 chromosomes in G2. Let k = max{k1, k2}. Let n be the total number of genes in G1 and G2, i.e.,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n = \sum\nolimits_{i = 1}^{k_1} | A_i | + \sum\nolimits_{j = 1}^{k_2} |B_j|$$
\end{document}
.
We first present a polynomial-time algorithm for Zero Exemplar Distance for multichromosomal genomes without gene order in the special case that each gene appears exactly once in one genome and at least once in the other genome. Our algorithm is based on maximum-weight matching in bipartite graphs:
To see the correctness of Algorithm 3, note that each reduced chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome, and corresponds to an edge of a matching in the complete bipartite graph. In the special case that each gene appears exactly once in one genome and at least once in the other genome, no gene can appear more than once in the reduced chromosomes corresponding to the edges of a matching. Thus, the maximum possible weight of a matching is equal to the number of distinct genes, and a common reduced genome that includes all the genes corresponds to a matching of the maximum weight.
We now analyze the time complexity of Algorithm 3. Steps 1 and 3 can be easily implemented in O(n2) time. Step 2 can be implemented in O(k3) time using a standard algorithm for weight bipartite matching (Tarjan, 1983). Thus the overall time complexity is O(n2 + k3).
We next present a fixed-parameter tractable algorithm for this problem without any assumption on the distribution of duplicate genes. Refer to Downey and Fellows (1999) for basic concepts in parameterized complexity theory. The parameter of our algorithm is k = max{k1, k2}:
To see the correctness of Algorithm 4, note again that each chromosome of a common reduced genome is a common subset of two distinct chromosomes, one from each input genome. All other chromosomes of the two input genomes that do not contribute to the common reduced genome are deleted. To handle the matching and the deletion of the chromosomes in a uniform way, we can think of each chromosome deleted from one genome as matched to a chromosome deleted from the other genome or to an empty chromosome. Thus by padding the two genomes to the same number of chromosomes, we only need to consider perfect matchings as permutations. The time complexity of Algorithm 4 is O(k!n2), with O(n2) time for each of the k! permutations. This completes the proof of Theorem 4.
We remark that the problem Zero Exemplar Distance for multichromosomal genomes without gene order is unlikely to have a fixed-parameter tractable algorithm if the parameter is the maximum number of genes in any single chromosome. This is because 3SAT remains NP-hard even if for each variable there are at most five clauses that contain its literals (Garey and Johnson, 1979). As a result, in our reduction from 3SAT in Section 3, the number of genes in each chromosome need not be more than some constant.