Haplotype Inferring via Galled-Tree Networks Is NP-Complete

Abstract

The problem of determining haplotypes from genotypes has gained considerable prominence in the research community since the beginning of the HapMap project. Here the focus is on determining the sets of SNP values of individual chromosomes (haplotypes), since such information better captures the genetic causes of diseases. One of the main algorithmic tools for haplotyping is based on the assumption that the evolutionary history for the original haplotypes satisfies perfect phylogeny. This tool can be applied only on individual blocks of chromosomes, in which it is assumed that recombinations do not happen. However, exact determination of blocks is usually not possible. It would be desirable to develop a method for haplotyping which can account for recombinations, and thus can be applied on multiblock sections of chromosomes. A natural candidate for such a method is haplotyping via phylogenetic networks (which model recombinations) or their simplified version: galled-tree networks. However, even haplotyping via galled-tree networks appears hard, as the efficient algorithms exist only for very special cases: the galled-tree network has either a single gall or only small galls with two mutations each. Building on our previous results, we show that, in general, haplotyping via galled-tree networks is NP-complete, and it remains NP-complete when galls are allowed to have at most k mutations, for any k ≥ 3.

1. Introduction

With the completion of the human genome project, research has focused on the problem of determining variations in chromosomes across the whole human population. This body of work is now encompassed in the international HapMap project (International HapMap Consortium, 2005; Pennisi, 2007; Thorisson et al., 2005). Genetic variations, in particular SNPs (single nucleotide polymorphisms), are already playing a central role in determining the genetic causes of diseases and in designing individualized medicine (Daly et al., 2001; Gabriel et al., 2002; Helmuth, 2001; Patil et al., 2001). For complex diseases (those affected by more than a single gene), it is much more informative to have haplotype data (a set of SNP values on an individual chromosome) than the individual SNPs. However, experimental methods only allow for cost-effective determination of genotype information (the combined information of haplotypes for pairs of chromosomes) (Mitra et al., 2003), and so the problem of computationally determining haplotypes from genotypes arises.

Various methods can be used to infer haplotypes from genotypes for population data. The first heuristic algorithm for computational haplotype inference was designed by Clark (1990). The exact version of Clark's problem was shown to be NP-hard (Gusfield, 2001). Another approach, called pure-parsimony haplotyping, which asks for a solution with the minimum number of distinct haplotypes, was shown to be NP-hard as well (Gusfield, 2003; Lancia et al., 2002). Gusfield (2002) developed the first exact polynomial algorithm based on the assumption that no recombinations happened during the evolutionary history of the haplotypes under consideration, which allowed him to make effective use of phylogenetic trees. This assumption was justified by experimental results showing that many chromosomes are blocky, with a strong correlation between sites on the same block (Daly et al., 2001; Patil et al., 2001). However, these experiments do not exclude recombinations within a block, so models that allow for recombinations are needed.

A first attempt in haplotyping via models which allow a limited number of biological events that violate the perfect phylogeny model was made by Song et al. (2005). In their article, a polynomial algorithm for haplotyping via imperfect phylogenies with a single homoplasy was presented, as well as a practical algorithm for haplotyping via galled-tree networks with one recombination cycle (gall). Galled-tree networks are special instances of phylogenetic networks, which in turn generalize phylogenetic trees by incorporating recombinations in the model (Wang et al., 2001). There is always a phylogenetic network for any set of haplotypes, while finding such phylogenetic networks with the smallest number of recombinations is NP-hard (Bordewich and Semple, 2004; Wang et al., 2001), and hence, haplotyping via phylogenetic networks is either easy and meaningless (any inferring is good) or intractable, depending on whether the minimum number of recombinations is required.

A galled-tree network is a special type of phylogenetic network in which recombination cycles do not intersect. Similar to phylogenetic trees, not every set of haplotypes admits a galled-tree network; however, it can be decided in polynomial time whether this is the case (Gusfield et al., 2004b). In addition, if there is a galled-tree network, it is easy to find the one (the reduced GTN) with the smallest number of recombinations, and no phylogenetic network for the same set of haplotypes has fewer recombinations. In earlier work (Gupta et al., 2006), we found a characterization of the existence of galled-tree networks. A similar characterization was independently discovered in Song (2006). Building on this characterization, in Gupta et al. (2007) we developed a polynomial algorithm for haplotype inference via galled-tree networks with simple galls (galls having two mutations each), based on a reduction of the haplotyping problem to a hypergraph covering problem. It is natural to ask whether the assumption on the number of galls or the size of galls can be dropped while still allowing a polynomial algorithm. In Gupta et al. (2009), we reduced the haplotype inferring problem to a hypergraph covering problem for genotype matrices satisfying a combinatorial condition.

Building on our previous work, here we show that the problem of inferring haplotypes via galled-tree networks is NP-complete by reduction from 3-SAT. Moreover, the problem remains NP-complete under any restriction on the galls (without restricting their number) that is weaker than requiring simple galls. In fact, for any k > 2, haplotyping via galled-tree networks with galls having at most k mutations is NP-complete.

2. Definitions

2.1. Haplotype inferring from population data

Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variations. A set of SNP values (e.g., SNPs that sit on a gene) on a single chromosome is called a haplotype. SNPs usually take two values among all the human population. Therefore, haplotypes are commonly represented as sequences of 0 and 1, by fixing a mapping of {0, 1} to two possible states in {A, C, G, T} at each SNP position. A combined information from two haplotypes for a matching pair of chromosomes is called a genotype. Here, the information about which value comes from the first and which from the second copy of the chromosome is lost. Genotype sequence is usually represented as a sequence of {0, 1, 2}, where value 0 or 1 at certain position i represents the fact that both haplotypes have this value at i (homozygous), while value 2 means that the values on two haplotypes at position i differ (heterozygous). The haplotype inference problem, or simply haplotyping, asks for determining of haplotype sequences based on genotype sequences of a set of individuals:

Definition 1 (Haplotyping)

Given a genotype n × m matrix A with values in {0, 1, 2}, we say that a haplotype 2n × m matrix B with values in {0, 1} is inferred from A if and only if for every SNP c ∈ {1, …, m},

if A(i, c) ∈ {0, 1}, then B(2i − 1, c) = B(2i, c) = A(i, c); and

if A(i, c) = 2, then B(2i − 1, c) ≠ B(2i, c).
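The two conditions of Definition 1 can be checked mechanically. The following is a minimal Python sketch, assuming matrices are given as lists of rows; the function name `is_inferred` and the 0-based indexing are illustrative choices, not notation from the article.

```python
def is_inferred(A, B):
    """Check Definition 1: is haplotype matrix B (2n x m) inferred from
    genotype matrix A (n x m)?

    Rows are 0-based here, so the article's haplotype rows 2i-1 and 2i
    for genotype row i correspond to B[2*i] and B[2*i + 1].
    """
    if len(B) != 2 * len(A):
        return False
    for i, genotype in enumerate(A):
        h1, h2 = B[2 * i], B[2 * i + 1]
        if len(h1) != len(genotype) or len(h2) != len(genotype):
            return False
        for c, g in enumerate(genotype):
            a, b = h1[c], h2[c]
            if a not in (0, 1) or b not in (0, 1):
                return False
            if g in (0, 1):
                # Homozygous site: both haplotypes carry the genotype value.
                if a != g or b != g:
                    return False
            elif g == 2:
                # Heterozygous site: the two haplotypes must differ.
                if a == b:
                    return False
            else:
                return False
    return True


A = [[0, 2, 1],
     [2, 2, 0]]
B = [[0, 0, 1], [0, 1, 1],
     [0, 1, 0], [1, 0, 0]]
print(is_inferred(A, B))
```

In this toy example each genotype row is followed by its pair of haplotype rows, so the check succeeds.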

The number of ways to infer two haplotypes from a row is exponential in the number of 2's in that row. Therefore, various parsimony criteria are used to choose the most plausible inference for the whole set of genomes, including Clark's maximum resolution problem, the pure parsimony criterion, haplotyping via perfect phylogeny, and several statistical methods; for an overview, see Gusfield and Orzack (2006). In this article, we are interested in haplotyping via galled-tree networks, which allow for recombination events and are defined in the next subsection.
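To make the exponential growth concrete, the sketch below enumerates all unordered haplotype pairs consistent with a single genotype row. It is an illustration only; the helper name `resolutions` is hypothetical. A row with t ≥ 1 heterozygous sites yields 2^(t − 1) unordered pairs, because swapping the two haplotypes gives the same pair.

```python
from itertools import product


def resolutions(genotype):
    """Yield every unordered pair of haplotypes consistent with one genotype row."""
    twos = [c for c, g in enumerate(genotype) if g == 2]
    if not twos:
        # No heterozygous site: both haplotypes equal the genotype.
        yield tuple(genotype), tuple(genotype)
        return
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # each unordered pair is produced exactly once.
    for bits in product((0, 1), repeat=len(twos) - 1):
        h1, h2 = list(genotype), list(genotype)
        for c, b in zip(twos, (0,) + bits):
            h1[c], h2[c] = b, 1 - b
        yield tuple(h1), tuple(h2)


for pair in resolutions([2, 0, 2, 2]):
    print(pair)
```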

2.2. Phylogenetic and galled-tree networks

In phylogenetic trees, each vertex is labeled by a sequence of states of characters (e.g., SNPs) and is connected by a mutation edge to its parent along which one character changes its state. Phylogenetic networks introduced in Wang et al. (2001), sometimes called “recombination networks,” are an extension of phylogenetic trees in which a vertex can be connected by two recombination edges to two parents, and the label sequence for this recombination vertex is formed by a recombination of sequences of its two parents.

Definition 2 (Phylogenetic network)

A phylogenetic network N on m characters is a directed acyclic graph containing exactly one vertex (the root) with no incoming edges, and each other vertex has either one incoming (mutation) edge or two incoming (recombination) edges. A vertex x with two incoming edges is called a recombination vertex.

Each integer (character) from 1 to m is assigned to exactly one mutation edge in N, and each mutation edge is assigned one character. Each vertex in N is labeled by a binary sequence of length m, starting with the root vertex which is labeled with the all-0 sequence. Since N is acyclic, all other vertices in N can be recursively labeled, and the vertices in N can be topologically sorted into a list, where every vertex occurs in the list only after its parent(s). Using that list, we can define the labels of the non-root vertices, in order of their appearance in the list, as follows:

For a non-recombination vertex v, let e be the mutation edge labeled c coming into v. The label of v is obtained from the label of v's parent by changing the value at position c from 0 to 1.

Each recombination vertex x is associated with an integer r_x ∈ {2, …, m}, called the recombination point for x. Label the two recombination edges coming to x, P and S, respectively. Let P(x) (S(x)) be the sequence of the parent of x on the edge labeled P (S). Then the label of x is a recombination of labels of its parents: concatenation of the first r_x − 1 characters of P(x) (prefix), followed by the last m − r_x + 1 characters of S(x) (suffix). Hence, P(x) contributes a prefix and S(x) contributes a suffix to x's sequence.
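The two labeling rules can be written as small functions. This is a sketch under the assumption that labels are lists of 0/1 values and that characters and recombination points are 1-based as in the text; the names `mutate` and `recombine` are illustrative.

```python
def mutate(parent_label, c):
    """Label of a non-recombination vertex: character c (1-based) changes from 0 to 1."""
    if parent_label[c - 1] != 0:
        raise ValueError("the labeling rule changes position c from 0 to 1")
    child = list(parent_label)
    child[c - 1] = 1
    return child


def recombine(p_label, s_label, r):
    """Label of a recombination vertex with recombination point r in {2, ..., m}.

    Takes the first r - 1 characters of the P-parent and the last
    m - r + 1 characters of the S-parent.
    """
    m = len(p_label)
    if len(s_label) != m or not 2 <= r <= m:
        raise ValueError("labels must have equal length m and 2 <= r <= m")
    return list(p_label[:r - 1]) + list(s_label[r - 1:])


print(recombine([1, 1, 0, 0], [0, 0, 1, 1], 3))
```

With m = 4 and r = 3, the prefix has length 2 and the suffix has length 4 − 3 + 1 = 2.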

In this article, the sequence at the root of the phylogenetic network is always the all-0 sequence, and all results are relative to that assumption. More general phylogenetic networks with unknown root were studied recently by Gusfield (2005). A phylogenetic network for a given binary matrix M is illustrated in Figure 1.

FIG. 1.

A phylogenetic network for matrix M. In the network, each mutation edge is labeled by a character c_i, i ∈ {1, …, 6}; recombination edges are labeled by P and S, respectively; the integer label above each recombination vertex represents the recombination point.

Definition 3

Given an n × m matrix A with values in {0, 1}, we say that a phylogenetic network N with m characters explains A if each sequence of A is a label of some vertex in N.

Finding a phylogenetic network with the minimum number of recombination vertices for a given haplotype matrix is NP-hard (Bordewich and Semple, 2004; Wang et al., 2001). Hence, a more restricted version of phylogenetic networks was studied in several articles (Gusfield et al., 2004a,b; Wang et al., 2001). The restricted version can be defined as follows.

Definition 4 (Galled-tree network)

In a phylogenetic network N, let v be a vertex that has two paths out of it that meet at a recombination vertex x (v is the least common ancestor of the parents of x). The two paths together form a recombination cycle C. The vertex v is called the coalescent vertex. We say that recombination cycle C contains a character i, if i labels one of the mutation edges of C.

A phylogenetic network is called a galled-tree network if no two recombination cycles share an edge. A recombination cycle of a galled-tree network is sometimes referred to as a gall.

We now define the conflict graph, introduced in Gusfield et al. (2004b).

Definition 5 (Conflict graph)

We say that the SNPs c₁ and c₂ conflict in a haplotype matrix B if B contains all three pairs [0, 1], [1, 0], and [1, 1] in SNPs c₁ and c₂. An SNP is called conflicted if it is involved in at least one conflict, and is otherwise called unconflicted. The conflict graph G_B has the vertex set {1, …, m} (one entry for every SNP) and for every two SNPs c₁ and c₂, (c₁, c₂) is an (undirected) edge of G_B if they conflict.
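Definition 5 translates directly into a pairwise scan of the columns. The sketch below is illustrative; it assumes B is a list of 0/1 rows and returns edges as pairs of 1-based SNP indices, and the function name `conflict_graph` is not from the article.

```python
from itertools import combinations


def conflict_graph(B):
    """Return the edge set of the conflict graph G_B as pairs (c1, c2), c1 < c2."""
    m = len(B[0])
    required = {(0, 1), (1, 0), (1, 1)}
    edges = set()
    for c1, c2 in combinations(range(m), 2):
        # All value pairs that occur in these two columns.
        seen = {(row[c1], row[c2]) for row in B}
        if required <= seen:
            edges.add((c1 + 1, c2 + 1))
    return edges


B = [[0, 1, 0],
     [1, 0, 0],
     [1, 1, 1]]
print(conflict_graph(B))
```

In this small matrix only SNPs 1 and 2 conflict; SNP 3 is unconflicted.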

The following definition defines a special type of galled-tree networks which are important in characterizing matrices explainable by a galled-tree network based on their conflict graph.

Definition 6 (Reduced galled-tree network)

A galled-tree network is called a reduced galled-tree network if every gall contains only conflicted SNPs.

Note that the example in Figure 1 is not a galled-tree network as two galls share an edge. A galled-tree network for a binary matrix M is illustrated in Figure 2. The two recombination cycles in this network do not share any edges. This galled-tree is reduced, since the following pairs of SNPs conflict: c₂ and c₆, c₃ and c₆, c₇ and c₈.

FIG. 2.

A galled-tree network for matrix M. Two galls labeled by the set of vertices {s₅, s₆, s₉, s₁₀} and {r, s₂, s₄, s₃, s₇} do not share any edges in the network.

2.3. Known properties of galled-tree networks

In this section we state some known facts about galled-tree networks proved in Gusfield et al. (2004b) which we utilize to prove our results.

Theorem 1 (Gusfield et al., 2004b)

Let N be a galled-tree network that explains a haplotype matrix B. Then there is a one-one correspondence between the non-trivial connected components (having at least two vertices) of the conflict graph G_B and the galls in N containing conflicted SNPs. Each such gall in N contains all conflicted SNPs of one non-trivial connected component of the conflict graph G_B, and contains no conflicted SNPs from a different non-trivial connected component of G_B.

The following corollary easily follows from the definition of reduced galled-tree networks and Theorem 1.

Corollary 1

Let N be a “reduced” galled-tree network that explains a haplotype matrix B. Then there is a one-one correspondence between the non-trivial connected components of the conflict graph G_B and the galls in N. Each gall in N contains all conflicted SNPs of one non-trivial connected component of the conflict graph G_B, and contains no other SNPs.

It follows by the algorithm for constructing galled-tree networks presented in Gusfield et al. (2004b) or in Gupta et al. (2006) that if a haplotype matrix B can be explained by a galled-tree network, then it can be explained by a reduced galled-tree network as well. In addition, it follows by Theorem 1 and Corollary 1, that a reduced galled-tree network explaining B will have the smallest number of galls and its galls will contain the smallest possible number of SNPs. Hence, we have the following corollary.

Corollary 2

If there exists a galled-tree network explaining a haplotype matrix B with at most k SNPs on each gall, then there is a reduced galled-tree network explaining B with at most k SNPs on each gall.

2.4. Inferring haplotypes via galled-tree network and the extended hypergraph covering problem

In this section, we will recall the characterization for the galled-tree network haplotyping (GTNH) problem using a hypergraph covering problem developed in Gupta et al. (2009). This characterization works only for genotype matrices satisfying special combinatorial properties. To state the result from Gupta et al. (2009), we need the following definitions.

Definition 7

Given a genotype matrix A, we say that A can be explained by a galled-tree network if there exists a haplotype matrix B inferred from A such that B can be explained by a galled-tree network.

Problem 1 (Galled-Tree Network Haplotyping [GTNH] Problem)

Given a genotype matrix A, decide if A can be explained by a galled-tree network.

Let k be a fixed integer. The k-GTNH problem is defined as follows:

Problem 2 (k-Galled-Tree Network Haplotyping [k-GTNH] Problem)

Given a genotype matrix A, decide if A can be explained by a galled-tree network in which each gall contains at most k mutation edges.

Note that for genotype matrices satisfying a weak property (every column that contains a 2 also contains a 1), the 2-GTNH problem is polynomial (Gupta et al., 2007).

Next, we will give the definitions of the combinatorial properties of genotype matrices used in Gupta et al. (2009).

Definition 8 (Simple genotype matrix)

We say that a genotype matrix is simple if every row contains either zero or three 2's.

Definition 9 (Inducing)

Given a genotype matrix A, for every x, y ∈ {0, 1}, we say that a pair of SNPs c₁, c₂ induces [x, y] in A, if A contains at least one of the pairs [x, y], [2, y] and [x, 2] in SNPs c₁ and c₂. Similarly, a triple of SNPs c₁, c₂, c₃ induces [x, y, z] in A, if A contains at least one of the triples [x, y, z], [2, y, z], [x, 2, z] and [x, y, 2] in SNPs c₁, c₂, c₃.
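Both cases of Definition 9 follow one pattern: the target tuple itself, or the target with exactly one position replaced by 2. A minimal sketch, assuming 1-based SNP indices and a hypothetical helper name `induces` that handles pairs and triples alike:

```python
def induces(A, snps, values):
    """Return True if the SNPs in `snps` (1-based) induce the tuple `values` in A.

    A row witnesses the tuple if it matches `values` exactly, or matches it
    after replacing exactly one position of `values` by 2.
    """
    values = tuple(values)
    targets = {values}
    for k in range(len(values)):
        targets.add(values[:k] + (2,) + values[k + 1:])
    return any(tuple(row[c - 1] for c in snps) in targets for row in A)


A = [[2, 1, 0],
     [0, 0, 2]]
print(induces(A, (1, 2), (0, 1)))        # witnessed by [2, 1] in the first row
print(induces(A, (1, 2, 3), (0, 1, 0)))  # witnessed by [2, 1, 0] in the first row
```

Note that a row with two 2's in the chosen SNPs, such as [2, 2], is not a witness under this definition.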

Definition 10 (Weak diagonal [WD] property)

Given a genotype matrix A, we say that a pair of SNPs is active if it contains [2, 2], or it induces all three pairs [1, 1], [0, 1] and [1, 0]. Further, we say that a pair c₁, c₂ is weakly active if either it is active, or if there is an SNP c₃ such that c₁, c₃ and c₂, c₃ are both active pairs. We say that A has the weak diagonal property if every weakly active pair of SNPs induces both [0, 1] and [1, 0].
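The weak diagonal property can be tested by first collecting the active pairs and then examining every weakly active pair. The following is an unoptimized sketch with 0-based column indices; all function names are illustrative assumptions.

```python
from itertools import combinations


def induces_pair(A, c1, c2, x, y):
    """Definition 9 for pairs, with 0-based column indices."""
    targets = {(x, y), (2, y), (x, 2)}
    return any((row[c1], row[c2]) in targets for row in A)


def is_active(A, c1, c2):
    """A pair is active if it contains [2, 2] or induces [1, 1], [0, 1] and [1, 0]."""
    contains_22 = any(row[c1] == 2 and row[c2] == 2 for row in A)
    return contains_22 or all(
        induces_pair(A, c1, c2, x, y) for x, y in ((1, 1), (0, 1), (1, 0))
    )


def has_weak_diagonal_property(A):
    m = len(A[0])
    active = {p for p in combinations(range(m), 2) if is_active(A, *p)}

    def act(a, b):
        return (min(a, b), max(a, b)) in active

    for c1, c2 in combinations(range(m), 2):
        # Weakly active: active itself, or linked through a common SNP c3.
        weakly_active = act(c1, c2) or any(
            act(c1, c3) and act(c2, c3) for c3 in range(m) if c3 not in (c1, c2)
        )
        if weakly_active and not (
            induces_pair(A, c1, c2, 0, 1) and induces_pair(A, c1, c2, 1, 0)
        ):
            return False
    return True


print(has_weak_diagonal_property([[2, 2, 0], [0, 1, 0], [1, 0, 0]]))
```

The example matrix is chosen only to exercise the definition; it is not a simple genotype matrix in the sense of Definition 8.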

For purposes of this article and clarity, we will consider the following (stripped) version of the extended genotype hypergraph.

Definition 11 (Extended genotype hypergraph [EGH])

An extended genotype hypergraph is a hypergraph with hyperedges containing either two or three vertices, and a list of switches, which are ordered triples of vertices.

Given a simple n × m genotype matrix A, the extended genotype hypergraph H_A of A has the set of SNPs {1, …, m} as a vertex set. Hypergraph H_A contains only the following hyperedges:

for every row r of A containing three 2's, say in SNPs c₁, c₂, c₃ there is a hyperedge e_r = {c₁, c₂, c₃}; and

for every two SNPs c₁ and c₂ inducing [0, 1], [1, 0] and [1, 1] in A, there is a hyperedge {c₁, c₂} in H_A.

Furthermore, for every triple of SNPs c₁, c₂, c₃ such that there are distinct hyperedges e and e′ with c₁, c₂ ∈ e and c₂, c₃ ∈ e′, and such that the triple induces [0, 1, 0] in A, H_A contains a switch [c₁, c₂, c₃].
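The construction of H_A can be sketched as follows. This is an illustrative, unoptimized reading of Definition 11 and not the authors' implementation. It makes two assumptions that are not in the text: hyperedges are stored as sets, so two rows with 2's in the same three SNPs give a single hyperedge, and each switch [c₁, c₂, c₃] is stored once with c₁ < c₃, since the defining condition is symmetric in c₁ and c₃.

```python
from itertools import combinations


def build_egh(A):
    """Return (hyperedges, switches) of the extended genotype hypergraph of A.

    SNPs are numbered 1..m. Hyperedges are frozensets of size 2 or 3;
    switches are triples (c1, c2, c3) with c2 the middle vertex and c1 < c3.
    """
    m = len(A[0])

    def induces(snps, values):
        values = tuple(values)
        targets = {values} | {
            values[:k] + (2,) + values[k + 1:] for k in range(len(values))
        }
        return any(tuple(row[c - 1] for c in snps) in targets for row in A)

    hyperedges = set()
    for row in A:
        twos = [c + 1 for c, g in enumerate(row) if g == 2]
        if len(twos) == 3:
            hyperedges.add(frozenset(twos))          # 3-edge from a row
    for c1, c2 in combinations(range(1, m + 1), 2):
        if all(induces((c1, c2), v) for v in ((0, 1), (1, 0), (1, 1))):
            hyperedges.add(frozenset((c1, c2)))      # 2-edge from induced pairs

    switches = set()
    for c1, c3 in combinations(range(1, m + 1), 2):
        for c2 in range(1, m + 1):
            if c2 in (c1, c3):
                continue
            left = [e for e in hyperedges if {c1, c2} <= e]
            right = [e for e in hyperedges if {c2, c3} <= e]
            if any(e != f for e in left for f in right) and induces(
                (c1, c2, c3), (0, 1, 0)
            ):
                switches.add((c1, c2, c3))
    return hyperedges, switches


A = [[2, 2, 2, 0, 0],
     [0, 0, 2, 2, 2],
     [0, 0, 1, 0, 0]]
edges, switches = build_egh(A)
print(sorted(sorted(e) for e in edges))
print(sorted(switches))
```

In the example, the two 3-edges {1, 2, 3} and {3, 4, 5} meet at SNP 3, and the third row makes every triple through SNP 3 induce [0, 1, 0].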

The following definition defines a graph covering of an extended genotype hypergraph.

Definition 12 (Covering of EGH)

Consider an extended genotype hypergraph H. We say that a graph G with the same vertex set as H covers H if G can be obtained as follows:

for every 2-edge {c₁, c₂} of H, add the edge (c₁, c₂) in G;

for every 3-edge {c₁, c₂, c₃} of H, add exactly one of the edges (c₁, c₂), (c₂, c₃) and (c₁, c₃) to G; and

for every switch [c₁, c₂, c₃], at most one of the edges (c₁, c₂) and (c₂, c₃) is in G. A graph G that covers H is called a covering of H.
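A candidate graph can be verified against the three covering conditions as sketched below. The check assumes that any two hyperedges share at most one vertex (as in the natural hypergraphs considered later), so that each edge of G lies inside exactly one hyperedge; for general hypergraphs one would have to search for a consistent assignment of edges to hyperedges instead. The function name `is_covering` is illustrative.

```python
from itertools import combinations


def is_covering(graph_edges, hyperedges, switches):
    """Check whether the graph with edge list `graph_edges` covers the EGH.

    Assumes distinct hyperedges share at most one vertex.
    """
    G = {frozenset(e) for e in graph_edges}
    allowed = set()
    for h in hyperedges:
        inside = [frozenset(p) for p in combinations(sorted(h), 2)]
        allowed.update(inside)
        # A 2-edge contributes its single edge; a 3-edge exactly one of three.
        if sum(p in G for p in inside) != 1:
            return False
    if not G <= allowed:
        return False  # G contains an edge that no hyperedge accounts for
    for c1, c2, c3 in switches:
        if frozenset((c1, c2)) in G and frozenset((c2, c3)) in G:
            return False  # both sides of a switch are present
    return True


H = [{1, 2, 3}, {3, 4, 5}]
S = [(1, 3, 4), (1, 3, 5), (2, 3, 4), (2, 3, 5)]
print(is_covering([(1, 2), (4, 5)], H, S))
print(is_covering([(1, 3), (3, 4)], H, S))
```

The second candidate picks one edge from each 3-edge but violates the switch [1, 3, 4].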

Let k be a fixed integer. The k-EHC problem can be formulated as follows:

Problem 3 (k-Extended Hypergraph Covering [k-EHC] Problem)

Given an extended genotype hypergraph H, determine whether there is a covering G of H such that each connected component C of G is a path of length at most k satisfying the ordered component property:

C is bipartite with partitions L and R such that all vertices in L are smaller than all vertices in R. Recall vertices of G and H are integers from 1 to m.
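The component conditions of the k-EHC problem can be checked independently of the covering conditions. In the sketch below, the length of a path is taken to be its number of edges (an assumption consistent with Corollary 4, where a path of length k corresponds to k + 1 SNPs), the graph is assumed to have no self-loops, and isolated vertices are ignored since they satisfy the conditions trivially. The function name is illustrative.

```python
from collections import defaultdict


def components_are_ordered_paths(edges, k):
    """Check that every component is a path with at most k edges whose
    bipartition (L, R) has all vertices of one side below all of the other."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    for start in list(adj):
        if start in seen:
            continue
        side = {start: 0}          # 2-coloring of the component
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in side:
                    side[v] = 1 - side[u]
                    stack.append(v)
        seen.update(side)
        num_edges = sum(len(adj[u]) for u in side) // 2
        # A connected graph is a path iff it is a tree with maximum degree 2.
        is_path = num_edges == len(side) - 1 and all(len(adj[u]) <= 2 for u in side)
        if not is_path or num_edges > k:
            return False
        left = [u for u in side if side[u] == 0]
        right = [u for u in side if side[u] == 1]
        if not (max(left) < min(right) or max(right) < min(left)):
            return False
    return True


print(components_are_ordered_paths([(1, 4), (4, 2), (2, 5)], 3))
print(components_are_ordered_paths([(1, 2), (2, 3)], 3))
```

The path 1, 4, 2, 5 has partitions {1, 2} and {4, 5} and is accepted; the path 1, 2, 3 has partitions {1, 3} and {2} and is rejected.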

The following result was shown in Gupta et al. (2009).

Theorem 2 (Gupta et al., 2009)

Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network N. Then each component of G_B is a path of length at most 3.

The following two corollaries follow by Corollaries 1 and 2, and Theorem 2.

Corollary 3

Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network. Let N be a reduced galled-tree network explaining B. Then each gall in N contains at most four SNPs.

Corollary 4

Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network. Let N be a reduced galled-tree network explaining B. For any 1 ≤ k ≤ 3, each gall in N contains at most k + 1 SNPs if and only if each component of G_B is a path of length at most k.

The next result from Gupta et al. (2009) characterizes when a genotype matrix can be explained by a galled-tree network in terms of hypergraph coverings.

Theorem 3 (Gupta et al., 2009)

Consider a simple genotype matrix A with the WD property. Let B be a haplotype matrix inferred from A and G_B its conflict graph. Then B can be explained by a galled-tree network if and only if G_B is a covering of extended genotype hypergraph H_A and satisfies the ordered-component property.

The following theorem is a characterization that we will use in the next section to prove the NP-hardness of restricted versions of the GTNH problem.

Theorem 4

Let A be a simple genotype matrix with the WD property and let k ≥ 1 be a fixed integer. Then A can be explained by a galled-tree network N in which each gall contains at most k + 1 SNPs if and only if there exists a covering G of H_A such that every component of G is a path of length at most min(3, k) and has the ordered-component property.

Proof. Suppose that A can be explained by a galled-tree network in which each gall contains at most k + 1 SNPs. By Corollary 2, there is a haplotype matrix B inferred from A where B can be explained by a reduced galled-tree network N. If k ≤ 3, by Corollary 4, each component of the conflict graph G_B is a path of length at most k. If k > 3, by Theorem 2, each component of G_B is a path of length at most 3. By Theorem 3, G_B is a covering of H_A and satisfies the ordered-component property, and thus there is a covering with required properties.

Conversely, suppose that G is a covering of H_A such that every component of G is a path of length at most min(3, k) and has the ordered-component property. Then by Theorem 3, there is a haplotype matrix B inferred from A with the conflict graph G_B = G which can be explained by a reduced galled-tree network N. Moreover, by Corollary 4, each gall in N contains at most min(3, k) + 1 SNPs. Therefore, A can be explained by a galled-tree network N where each gall contains at most k + 1 SNPs. ▪

Note that the WD property of the genotype matrix forces the components in the conflict graphs of inferred haplotype matrices to be small. In Gupta et al. (2009), it was also shown that the 3-EHC problem is NP-complete, hence this characterization fails to provide a polynomial solution for the GTNH problem even for such special genotype matrices. On the other hand, since not every extended genotype hypergraph has a corresponding genotype matrix—in particular, the gadgets used to show NP-completeness of the 3-EHC problem in Gupta et al. (2009) do not have a corresponding genotype matrix—this result does not imply that the GTNH problem is NP-complete. In the next section, we consider special instances of extended genotype hypergraphs for which, as we will see later, it is possible to construct a corresponding genotype matrix and show that the EHC problem for them remains NP-complete.

3. GTNH Problem Is NP-Complete

The proof of NP-completeness is done in two steps. First, we define special instances of extended genotype hypergraphs, and show that the 2-EHC and 3-EHC problems for them are NP-complete. Then we show that it is possible to construct a genotype matrix for each such instance. The NP-completeness of the GTNH problem then follows by the characterization obtained in Gupta et al. (2009) (Theorem 4).

Definition 13 (Natural EGH)

We say that an extended genotype hypergraph H is natural if for any two hyperedges e, e′ of H, |e ∩ e′| ≤ 1 and the list of switches contains all and only the following switches: for every vertex c of H with degree at least 3, and for every two hyperedges e₁ and e₂ containing c, and for every two vertices c₁ ∈ e₁ and c₂ ∈ e₂ such that c, c₁, c₂ are all distinct, there is a switch [c₁, c, c₂].
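The switch list of a natural EGH is determined by its hyperedges, so it can be generated from them. The sketch below follows Definition 13 literally under two assumptions that the text leaves implicit: the degree of a vertex is the number of hyperedges containing it, and the "two hyperedges" e₁ and e₂ are distinct. The function names are illustrative.

```python
from itertools import combinations, permutations


def has_small_intersections(hyperedges):
    """First condition of Definition 13: any two hyperedges share at most one vertex."""
    return all(len(set(e) & set(f)) <= 1 for e, f in combinations(hyperedges, 2))


def natural_switches(hyperedges, min_degree=3):
    """Generate the switches [c1, c, c2] required by Definition 13."""
    edges = [frozenset(e) for e in hyperedges]
    vertices = set().union(*edges) if edges else set()
    switches = set()
    for c in vertices:
        incident = [e for e in edges if c in e]
        if len(incident) < min_degree:
            continue
        # Ordered pairs of distinct hyperedges through c.
        for e1, e2 in permutations(incident, 2):
            for c1 in e1 - {c}:
                for c2 in e2 - {c}:
                    if c1 != c2:
                        switches.add((c1, c, c2))
    return switches


H = [{1, 2}, {1, 3}, {1, 4, 5}]
print(has_small_intersections(H))
print(sorted(natural_switches(H)))
```

Vertex 1 lies in three hyperedges, so every pair of its neighbors from different hyperedges forms a switch through it; the pair 4, 5 does not, because both lie in the same hyperedge.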

We will show that the extended genotype hypergraph covering problem for natural EGHs is NP-complete by reduction from 3-SAT. The proof follows the idea of the proof of NP-completeness of the EHC problem in Gupta et al. (2009). However, the proof in Gupta et al. (2009) assumed that there are no switches in the EGH, which was the main reason why there was no corresponding genotype matrix for the constructed EGH. In the following proof, the gadgets had to be redesigned to take into account the existence of the switches.

Theorem 5

The 3-EHC problem for natural extended genotype hypergraphs is NP-complete.

Proof. The proof is done by a reduction from a special instance of the 3-SAT problem in which each clause contains two or three literals and every variable occurs in exactly three clauses, once positive and twice negated (Papadimitriou, 1994). Let f(x_1, x_2, …, x_m) = C_1 ∧ … ∧ C_k be such a formula in conjunctive normal form, where C_1, …, C_k are the clauses. Let p_1, …, p_{3m} be the list of all occurrences of literals in f such that p_{3i−2} = x_i and p_{3i−1} = p_{3i} = ¬x_i. Now every clause C_i can be written as

C_i = p_{s_{i,1}} ∨ p_{s_{i,2}}   or   C_i = p_{s_{i,1}} ∨ p_{s_{i,2}} ∨ p_{s_{i,3}}

depending on whether C_i contains two or three literals.
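The occurrence indexing can be illustrated on a small formula. The sketch below assumes clauses are given as lists of signed integers (a positive integer i for x_i, a negative one for ¬x_i), which is an input convention chosen here and not taken from the article; it returns, for each clause C_i, the indices s_{i,1}, s_{i,2}, … of its literal occurrences.

```python
def literal_occurrences(clauses, num_vars):
    """Map each literal occurrence to its index in p_1, ..., p_{3m}.

    p_{3i-2} is the single positive occurrence of x_i; p_{3i-1} and p_{3i}
    are its two negated occurrences, in order of appearance.
    """
    next_negative = {i: 3 * i - 1 for i in range(1, num_vars + 1)}
    positive_used = set()
    indexed = []
    for clause in clauses:
        if len(clause) not in (2, 3):
            raise ValueError("each clause must contain two or three literals")
        indices = []
        for literal in clause:
            i = abs(literal)
            if literal > 0:
                if i in positive_used:
                    raise ValueError(f"x_{i} occurs positively more than once")
                positive_used.add(i)
                indices.append(3 * i - 2)
            else:
                if next_negative[i] > 3 * i:
                    raise ValueError(f"x_{i} occurs negated more than twice")
                indices.append(next_negative[i])
                next_negative[i] += 1
        indexed.append(indices)
    return indexed


# (x1 or not x2) and (not x1 or x2) and (not x1 or not x2)
print(literal_occurrences([[1, -2], [-1, 2], [-1, -2]], 2))
```

For this two-variable formula, p_1 = x_1, p_2 = p_3 = ¬x_1, p_4 = x_2 and p_5 = p_6 = ¬x_2.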

Next, we construct a natural extended genotype hypergraph H(f) for f which has a covering if and only if the formula f is satisfiable. The hypergraph H(f) will be an edge-disjoint union of several gadgets, one for each clause and one for each variable. The only vertices in common among gadgets will be the literal vertices; in particular, each literal vertex will be shared between one clause gadget and one variable gadget. Furthermore, in each gadget we will mark every vertex either with a dot or a cross such that every literal vertex is marked with a dot. This guarantees that our marking is consistent across the whole of H(f). Using this marking, we order the vertices of H(f) such that every vertex marked with a dot precedes every vertex marked with a cross. This ordering implies that, to verify the ordered component property of a covering, it is enough to check that in such a covering every path of length two or three alternates between vertices with crosses and dots.

Now we define the gadgets; we start with the clause gadgets. Consider a covering c of H(f). We say that a literal p_j has value 1 in this covering, if c restricted to the clause gadget containing p_j contains an edge incident to the vertex p_j. Note that this is well defined since for every literal vertex p_j there is a unique clause gadget containing it.

For every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}}$ with two literals, we construct a gadget consisting of one 3-edge as depicted in Figure 3a. Figure 3b–d shows all possible coverings of the gadget. Note that in each such covering, at least one of the literals $p_{s_{i,1}}$ and $p_{s_{i,2}}$ has value 1.

FIG. 3.
NP-completeness for 3-EHC problem. (a–d) The clause gadget for the 2-clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}}$ and all its possible coverings. The coverings are depicted as solid edges joining particular pairs of vertices. In (b), $p_{s_{i,1}}$ has value 1 and $p_{s_{i,2}}$ has value 0; in (c), $p_{s_{i,1}}$ has value 0 and $p_{s_{i,2}}$ has value 1; and in (d), both have value 1.
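
To see why every covering of the 2-clause gadget sets at least one literal to 1, the following sketch enumerates the coverings of a single 3-edge. It assumes, in line with the caption of Figure 3, that covering a 3-edge means selecting an edge between one pair of its vertices; the name `w` for the third (non-literal) vertex and the function name `two_clause_coverings` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def two_clause_coverings(
    p1: str, p2: str, aux: str = "w"
) -> List[Tuple[Tuple[str, str], Dict[str, int]]]:
    """Enumerate the coverings of the 3-edge {p1, p2, aux}.

    Each covering is one pair of vertices of the 3-edge. A literal vertex has
    value 1 exactly when the chosen pair is incident to it.
    """
    coverings = []
    for edge in combinations((p1, p2, aux), 2):
        values = {p: int(p in edge) for p in (p1, p2)}
        coverings.append((edge, values))
    return coverings


for edge, values in two_clause_coverings("p1", "p2"):
    print(edge, values)
```

Since the 3-edge has only one vertex that is not a literal vertex, every pair chosen from it touches at least one literal vertex, which is the property used in the proof.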

For every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}} \vee p_{s_{i,3}}$ with three literals, we construct a gadget consisting of one 2-edge and nine 3-edges as depicted in Figure 4a. Figure 4b–d shows three possible coverings of the gadget, in which exactly one of the literals is set to 1. Note that, in our proof, the restriction to the gadget of a covering corresponding to a satisfying assignment of the formula f will be one of these three coverings. The important property of any other covering of the gadget is that at least one of the literals has value 1. Indeed, assume that all three values are set to 0. Then every graph covering the gadget in which the condition for switches is satisfied contains a path of length 4 (Fig. 4e); hence, it is not a valid covering. This guarantees that in any covering of H(f), in the corresponding assignment, every clause of f will be satisfied.

FIG. 4.
NP-completeness for 3-EHC problem. (a) The part of hypergraph $\bar{H}(f)$ for the 3-clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}} \vee p_{s_{i,3}}$. (b–d) Three possible graph coverings, each representing one way in which the clause can be satisfied. (e) Setting all three literals $p_{s_{i,1}}, p_{s_{i,2}}, p_{s_{i,3}}$ to 0 leads to a path of length 4 (dashed).

In the second part of the construction, for each variable x_i, we add a variable gadget which will guarantee that the three occurrences $p_{3i-2}, p_{3i-1}, p_{3i}$ of the variable x_i are assigned consistent values. That is, if $p_{3i-2}$ (the positive occurrence) has value 1, then both $p_{3i-1}$ and $p_{3i}$ (the negated occurrences) must have value 0, and if at least one of $p_{3i-1}$ and $p_{3i}$ has value 1, then $p_{3i-2}$ must have value 0. This is achieved by a gadget consisting of three 2-edges and thirteen 3-edges depicted in Figure 5a. Figure 5b,c shows two possible coverings of the gadget. In these figures, a literal p_j has value 1 if no edge in the gadget is incident to p_j, which is in agreement with the interpretation of the values of the p_j's in the gadgets of the first part of the construction.

FIG. 5.
NP-completeness for 3-EHC problem. (a) The part of hypergraph H(f) verifying the values of the three occurrences of a variable x_i. (b, c) Two possible coverings. In (b), we have depicted the unique covering under the assumption that $p_{3i-2}$ has value 1 (set by its clause gadget). As can be seen, this forces the values of $p_{3i-1}$ and $p_{3i}$ to 0 (in their clause gadgets). In (c), $p_{3i-2}$ is forced to have value 0 (in its clause gadget) and $p_{3i-1}$ and $p_{3i}$ can have arbitrary values. (d) Setting both $p_{3i-2}$ and $p_{3i-1}$ to 1 leads to a path of length 4 or 5 (depending on which of the dashed edges is chosen).

Let us verify the claimed property of the gadget. Assume, for instance, that both $p_{3i-2}$ and $p_{3i-1}$ have value 1. No edge covering the variable gadget can be incident to either of these two vertices; otherwise the condition on switches (crossing from variable to clause gadgets) would be violated. Hence, for each 3-edge containing $p_{3i-2}$ or $p_{3i-1}$, the edge connecting its other two vertices has to be in the covering. Similarly, edges e₁, e₂ in Figure 5d have to be in the covering. Now, there is no edge that can be selected to cover the 3-edge in the middle of the gadget, as the selection of any of the dashed edges would produce a path of length 4 or 5. The other cases can be proved using similar arguments.

Finally, we have to check that it is possible to find a covering of H(f) that satisfies the conditions of the 3-EHC problem (a solution to the 3-EHC problem) if and only if f is satisfiable. First, consider a covering G that is a solution to the 3-EHC problem for H(f). For every clause C_i, at least one of $p_{s_{i,1}}, p_{s_{i,2}}$ (respectively, $p_{s_{i,1}}, p_{s_{i,2}}, p_{s_{i,3}}$) has value 1 in G. Let it be $p_{q_i}$ (if there are several literals in C_i with value 1 in G, pick any of them). We form a satisfying assignment as follows. For every x_j, if there is an i with $p_{q_i} = x_j$, set x_j = 1; if there is an i with $p_{q_i} = \neg x_j$, set x_j = 0; otherwise set x_j to any value. As long as we guarantee that there are no i, i′ such that $p_{q_i} = x_j$ and $p_{q_{i'}} = \neg x_j$, the above assignment is well defined and obviously satisfies f. Assume, on the contrary, that $p_{q_i} = x_j$ and $p_{q_{i'}} = \neg x_j$. Obviously, $p_{q_i} = p_{3j-2}$ and $p_{q_{i'}}$ is either $p_{3j}$ or $p_{3j-1}$. Now, since $p_{3j-2}$ has value 1 and one of $p_{3j}$, $p_{3j-1}$ has value 1 in G, there is no valid covering of the variable gadget for x_j, a contradiction.
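
The way an assignment is read off from the chosen literals can be sketched as follows. This is an illustration only: the function name `assignment_from_choices` and the representation of the choices as a list `q` of occurrence indices (one per clause) are assumptions made here. The sketch relies on the indexing used in the proof, where occurrence 3j − 2 is the positive occurrence of x_j and occurrences 3j − 1 and 3j are the negated ones, and it reports the conflict that the variable gadget is designed to rule out.

```python
from typing import Dict, List


def assignment_from_choices(q: List[int], m: int) -> Dict[int, bool]:
    """Build a truth assignment from one chosen occurrence index per clause.

    q[i] is the index of the occurrence chosen for clause i (a literal with
    value 1 in the covering). Raises ValueError if some variable is chosen
    both positively and negated.
    """
    assignment: Dict[int, bool] = {}
    for occ in q:
        var = (occ + 2) // 3      # occurrences 3j-2, 3j-1, 3j belong to x_j
        value = occ % 3 == 1      # only 3j-2 is the positive occurrence
        if assignment.get(var, value) != value:
            raise ValueError(f"x_{var} is chosen both positively and negated")
        assignment[var] = value
    for var in range(1, m + 1):
        # Variables not chosen by any clause may take any value; False is arbitrary.
        assignment.setdefault(var, False)
    return assignment


print(assignment_from_choices([5, 2, 3], 2))
```

In the proof, the case that raises the error here cannot occur for a valid covering, because the variable gadget admits no covering in which the positive occurrence and a negated occurrence both have value 1.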

For the converse, consider a satisfying assignment for f. For every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}}$ with two literals, there is at least one index $q_i \in \{s_{i,1}, s_{i,2}\}$ such that the literal $p_{q_i}$ has value 1 in this assignment. If it is $p_{s_{i,1}}$ (respectively, $p_{s_{i,2}}$), pick the covering of the clause gadget for C_i as depicted in Figure 3b (respectively, Fig. 3c). Similarly, for every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}} \vee p_{s_{i,3}}$ with three literals, pick the covering as depicted in Figure 4b–d, depending on which literal has value 1 (if there are several, pick one). For the variable gadgets, we select the coverings as follows. For every x_i, if the value of x_i is 1, pick the covering of the gadget for x_i depicted in Figure 5b, and if the value of $\neg x_i$ is 1, pick the one depicted in Figure 5c. Let G be the union of the graphs that cover all gadgets. Now, we need to show that G satisfies the conditions of the 3-EHC problem. The covering of each gadget satisfies these conditions. Since the only common vertices among gadgets are literal vertices, which have degree 3, the switches with literal vertices in the middle prevent the connected components of the coverings of different gadgets from connecting. Hence, G satisfies the conditions of the 3-EHC problem as well. ▪

Theorem 6

The 2-EHC problem for natural extended genotype hypergraphs is NP-complete.

Proof. The proof is similar to the proof of Theorem 5 except that we use slightly different clause and variable gadgets. Here, for every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}}$ with two literals, we use the same gadget as in the proof of Theorem 5 (Fig. 3a). For every clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}} \vee p_{s_{i,3}}$ with three literals, we construct a gadget consisting of nine 3-edges as depicted in Figure 6a. Figure 6b–d shows three possible coverings of the gadget, in which exactly one of the literals is set to 1. Figure 6e shows that if all three literals $p_{s_{i,1}}$, $p_{s_{i,2}}$, and $p_{s_{i,3}}$ are set to 0, then every graph covering the gadget in which the condition for switches is satisfied contains a path of length 3. Hence, it is not a valid covering. This guarantees that, in any covering of H(f), in the corresponding assignment, every clause of f will be satisfied.

FIG. 6.
NP-completeness for 2-EHC problem. (a) The part of hypergraph $\bar{H}(f)$ for the 3-clause $C_i = p_{s_{i,1}} \vee p_{s_{i,2}} \vee p_{s_{i,3}}$. (b–d) Three possible graph coverings, each representing one way in which the clause can be satisfied. (e) Setting all three literals $p_{s_{i,1}}, p_{s_{i,2}}, p_{s_{i,3}}$ to 0 leads to a path of length 3 (dashed).

Figure 7a depicts a variable gadget which will guarantee that the three occurrences $p_{3i-2}, p_{3i-1}, p_{3i}$ of a variable x_i are assigned consistent values. Figure 7b,c shows two possible coverings of the gadget. Figure 7b depicts a covering of the gadget in which $p_{3i-2}$ is set to 1 and $p_{3i-1}$, $p_{3i}$ are set to 0. In Figure 7c, $p_{3i-2}$ is set to 0 and $p_{3i-1}$ and $p_{3i}$ can have arbitrary values. To prove that the occurrences of a variable x_i must be assigned consistent values, it remains to prove that it is impossible to set both $p_{3i-2}$ and $p_{3i-1}$, or both $p_{3i-2}$ and $p_{3i}$, to value 1. Without loss of generality, assume that both $p_{3i-2}$ and $p_{3i-1}$ have value 1. No edge covering the variable gadget can be incident to either of these two vertices; otherwise the condition on switches (crossing from variable to clause gadgets) would be violated. Hence, for each 3-edge containing $p_{3i-2}$ or $p_{3i-1}$, the edge connecting its other two vertices has to be in the covering. Similarly, edges e₁, e₂ in Figure 7d have to be in the covering. Now, there is no edge that can be selected to cover the 3-edge in the middle of the gadget, as the selection of any of the dashed edges would produce a path of length 3 or 4. The other case can be proved using a similar argument.

FIG. 7.
NP-completeness for 2-EHC problem. (a) The part of hypergraph H(f) verifying the values of the three occurrences of a variable x_i. (b, c) Two possible coverings. In (b), we have depicted the unique covering under the assumption that $p_{3i-2}$ has value 1 (set by its clause gadget). As can be seen, this forces the values of $p_{3i-1}$ and $p_{3i}$ to 0 (in their clause gadgets). In (c), $p_{3i-2}$ is forced to have value 0 (in its clause gadget) and $p_{3i-1}$ and $p_{3i}$ can have arbitrary values. (d) Setting both $p_{3i-2}$ and $p_{3i-1}$ to 1 leads to a path of length 3 or 4 (depending on which of the dashed edges is chosen).

The rest of the proof is similar to the proof of Theorem 5. ▪

The following lemma shows that, for every natural EGH, there is a corresponding simple genotype matrix with the WD property.

Lemma 7

For every natural EGH H, there is a simple genotype matrix A with the WD property such that H_A = H.

Proof. Let H be a natural EGH. Construct a genotype matrix A(H) using the following steps:
(1) Let A(H) have ∣V(H)∣ columns, with each vertex $c \in V(H)$ corresponding to a column c of A(H).

(2) For each 3-edge {c₁, c₂, c₃} of H, add a new row r to A(H) such that r[c] = 2 if $c \in \{c_1, c_2, c_3\}$, and r[c] = 0 otherwise.

(3) For each 2-edge {c₁, c₂} of H, add a new row r to A(H) such that r[c] = 1 if $c \in \{c_1, c_2\}$, and r[c] = 0 otherwise.

(4) For every vertex c of H with degree 1, add a new row r to A(H) such that r[c] = 1, and r[c′] = 0 for any c′ ≠ c.
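
Steps (1)–(4) can be sketched in a few lines of Python. This is a minimal illustration under assumptions introduced here: the function name `genotype_matrix`, the representation of the hypergraph as a vertex list plus a list of vertex sets, and the small example hypergraph are hypothetical (the example is not the one shown in Fig. 8). Rows for 3-edges and 2-edges are emitted in the order the hyperedges are given rather than grouped by step, which does not affect the resulting set of rows.

```python
from typing import Hashable, List, Sequence, Set


def genotype_matrix(
    vertices: Sequence[Hashable], hyperedges: Sequence[Set[Hashable]]
) -> List[List[int]]:
    """Build the rows of A(H) for a hypergraph with 2-edges and 3-edges."""
    col = {v: k for k, v in enumerate(vertices)}  # step (1): one column per vertex
    degree = {v: 0 for v in vertices}
    rows: List[List[int]] = []
    for edge in hyperedges:
        if len(edge) == 3:
            fill = 2                              # step (2): 3-edge row uses 2s
        elif len(edge) == 2:
            fill = 1                              # step (3): 2-edge row uses 1s
        else:
            raise ValueError("only 2-edges and 3-edges are allowed")
        row = [0] * len(vertices)
        for v in edge:
            row[col[v]] = fill
            degree[v] += 1
        rows.append(row)
    for v in vertices:                            # step (4): degree-1 vertices
        if degree[v] == 1:
            row = [0] * len(vertices)
            row[col[v]] = 1
            rows.append(row)
    return rows


# Hypothetical hypergraph: one 3-edge {a, b, c} and one 2-edge {c, d}.
for row in genotype_matrix(["a", "b", "c", "d"], [{"a", "b", "c"}, {"c", "d"}]):
    print(row)
```

In this example, vertices a, b, and d have degree 1 and therefore each receive an extra row from step (4), while c, which lies in two hyperedges, does not.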

Let A = A(H). Obviously, H_A and H have the same sets of 3-edges. Consider a 2-edge {c₁, c₂} of H. By the definition of A, c₁, c₂ induce [1, 1] in A. Let us show, for instance, that they also induce [1, 0]. First, if the degree of c₁ in H is one, then there is a special row in A for c₁ which induces [1, 0]. Second, assume that there is another hyperedge in H containing c₁. Then the row in A for this hyperedge induces [1, 0]. Hence, H_A contains the hyperedge {c₁, c₂}. Finally, observe that H_A cannot contain any other 2-edge, as the pair [1, 1] can be induced only in the rows added in step (3).

Next, we need to show that H_A and H have the same lists of switches. Assume that H contains a switch [c₁, c, c₂]. Then c has degree at least 3 in H, and there are two distinct hyperedges e₁ and e₂ containing c, with $c_1 \in e_1$ and $c_2 \in e_2$, such that c, c₁, c₂ are all distinct. Since d(c) ≥ 3, there is another hyperedge e₃ containing c. Since $c_1, c_2 \notin e_3$, the triple c₁, c, c₂ induces [0, 1, 0]. Hence, by definition, H_A contains the switch [c₁, c, c₂].

Assume now that H_A contains a switch [c₁, c₂, c₃]. By definition, there are two distinct hyperedges e and e′ such that $c_1, c_2 \in e$ and $c_2, c_3 \in e'$. Since the triple c₁, c₂, c₃ induces [0, 1, 0] and d(c₂) ≥ 2, there must be another hyperedge e″ containing c₂ but containing neither c₁ nor c₃. Hence, d(c₂) ≥ 3, and [c₁, c₂, c₃] is a switch in H.

Example 1. In this example, we build a genotype matrix A for a natural extended hypergraph H (Fig. 8).

FIG. 8.
Example of a genotype matrix A(H) constructed for a natural extended hypergraph H.

Finally, we will show that the matrix A has the WD property. We will show the stronger result that any two columns $c_i, c_j \in A(H)$ induce both [0, 1] and [1, 0]. First, suppose a vertex $c \in H$ has degree more than 1. For any other vertex c′, there must exist a hyperedge e such that c is incident to e while c′ is not; indeed, any two hyperedges in H share at most one vertex. From the construction of A(H), we know that there must be a row r with r[c] = 2 (or r[c] = 1) and r[c′] = 0. In other words, columns c and c′ induce [1, 0].

Now, for any pair of vertices c and c′, if each of them has degree more than 1, then the corresponding pair of columns $c, c' \in A(H)$ induce both [1, 0] and [0, 1]. If one of them, say c, has degree 1, then according to the last step of the construction of A(H), the two columns induce both [0, 1] and [1, 0] as well. ▪
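
The stronger property used in this proof can be checked mechanically on a given matrix. The sketch below is illustrative: the function names are hypothetical, and it adopts the reading of "induce" used in the argument above, namely that a row r with r[c] = 2 or r[c] = 1 and r[c′] = 0 makes columns c and c′ induce [1, 0]. The example matrix is a small hypothetical one (rows for a 3-edge, a 2-edge, and three degree-1 vertices), not the matrix of Fig. 8.

```python
from itertools import permutations
from typing import List


def induces_10(matrix: List[List[int]], c: int, c_prime: int) -> bool:
    """True if some row has 1 or 2 in column c and 0 in column c_prime."""
    return any(row[c] in (1, 2) and row[c_prime] == 0 for row in matrix)


def all_pairs_induce_both(matrix: List[List[int]]) -> bool:
    """True if every pair of columns induces both [1, 0] and [0, 1].

    Checking [1, 0] for every ordered pair (c, c') also covers [0, 1],
    since [0, 1] on (c, c') is [1, 0] on (c', c).
    """
    n = len(matrix[0]) if matrix else 0
    return all(induces_10(matrix, c, d) for c, d in permutations(range(n), 2))


A = [
    [2, 2, 2, 0],  # 3-edge on columns 0, 1, 2
    [0, 0, 1, 1],  # 2-edge on columns 2, 3
    [1, 0, 0, 0],  # extra row for degree-1 vertex 0
    [0, 1, 0, 0],  # extra row for degree-1 vertex 1
    [0, 0, 0, 1],  # extra row for degree-1 vertex 3
]
print(all_pairs_induce_both(A))
print(all_pairs_induce_both(A[:2]))
```

Dropping the three extra rows makes the check fail for columns 0 and 1, which illustrates why the last step of the construction is needed for vertices of degree 1.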

The main result follows by Theorems 4, 5, and 6, and Lemma 7.

Theorem 8

The galled-tree network haplotyping problem is NP-complete. In addition, the problem remains NP-complete even if we require that each gall contains at most k mutation edges, for any fixed k ≥ 3.

Note that for k = 1, the galls must contain only unconflicted SNPs, and hence they can be omitted (the reduced galled-tree network explaining the same haplotype matrix would not contain them); i.e., the case k = 1 is equivalent to haplotyping via a perfect phylogenetic tree, and hence can be solved in polynomial time (Bafna et al., 2003). For k = 2, it was shown in Gupta et al. (2007) that the problem is solvable in polynomial time under the assumption that the input genotype matrix has the weak property (every column containing a 2 also contains a 1).

4. Conclusion

We have shown that the GTNH problem is NP-hard in general. Furthermore, we have characterized the complexity of the problem depending on the maximal size k of the galls (the number of mutation edges, i.e., SNPs, on a gall): the problem remains NP-hard for any k ≥ 3. For k = 1, the problem is equivalent to haplotyping via perfect phylogeny, and thus polynomial. For k = 2, the problem is polynomial for genotype matrices with the weak property. To complete the characterization, it would be interesting to determine whether the 2-GTNH problem remains polynomial without the weak property, or whether it becomes NP-hard. The techniques used in this paper to prove NP-completeness for k ≥ 3 cannot be directly used to show NP-hardness of the case k = 2, since they are based on the assumption that the genotype matrix has the weak diagonal property, which implies the weak property.

Another possible direction of future research is to characterize the complexity of the problem based on the number of galls. The proofs presented in this article assumed that the number of galls is not restricted. In Song et al. (2005), a practical algorithm for the case with a single gall was presented; however, it is not clear whether it works in polynomial time. Thus, it would be interesting to see whether the problem is polytime solvable for a constant number of galls.

Acknowledgments

Research was supported in part by NSERC (grant to A.G. and L.S.) and MITACS (M.K. and J.M.). A preliminary version of this paper appeared in Gupta et al. (2008).

Disclosure Statement

No conflicting financial interests exist.

References

Bafna

, Gusfield

, Lancia

et al. 2003. Haplotyping as perfect phylogeny: a direct approach. J. Comput. Biol., 10:323–340.

Bordewich

, Semple

2004. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Combin., 8:409–423.

Clark

1990. Inference of haplotypes from PCR-amplified samples of dipoid populations. Mol. Biol. Evol., 7:111–122.

Daly, Rioux, Schaffner, et al. 2001. High-resolution haplotype structure in the human genome. Nat. Genet., 29:229–232.

Gabriel, Schaffner, Nguyen, et al. 2002. The structure of haplotype blocks in the human genome. Science, 296:2225–2229.

Gupta, Maňuch, Stacho, et al. 2006. Characterization of the existence of galled-tree networks. J. Bioinform. Comput. Biol., 4:1309–1328.

Gupta, Maňuch, Stacho, et al. 2007. Algorithm for haplotype inferring via galled-tree networks with simple galls [extended abstract]. Lect. Notes Bioinform., 4463:121–132.

Gupta, Maňuch, Stacho, et al. 2008. Haplotype inferring via galled-tree networks is NP-complete. Lect. Notes Comput. Sci., 5902:287–298.

Gupta, Maňuch, Stacho, et al. 2009. Haplotype inferring via galled-tree networks using a hypergraph covering problem for special genotype matrices. Discr. Appl. Math., 157:2310–2324.

Gusfield. 2001. Inference of haplotypes from samples of diploid populations: complexity and algorithms. J. Comput. Biol., 8:305–323.

Gusfield. 2002. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. Proc. RECOMB 2002, 166–175.

Gusfield. 2003. Haplotype inference by pure parsimony. Lect. Notes Comput. Sci., 2676:144–155.

Gusfield. 2005. Optimal, efficient reconstruction of root-unknown phylogenetic networks with constrained and structured recombination. J. Comput. Syst. Sci., 70:381–398.

Gusfield, Orzack, S.H. 2006. Haplotype inference, 18-1–18-28. In Aluru, ed. Handbook of Computational Molecular Biology. CRC Computer and Information Science Series. Chapman & Hall, New York.

Gusfield, Eddhu, Langley. 2004a. The fine structure of galls in phylogenetic networks. INFORMS J. Comput., 16:459–469.

Gusfield, Eddhu, Langley. 2004b. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinform. Comput. Biol., 2:173–213.

Helmuth. 2001. Genome research: map of the human genome 3.0. Science, 293:583–585.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature, 437:1299–1320.

Lancia, Pinotti, Rizzi. 2002. Haplotyping populations: complexity and approximations [Dit-02-082]. University of Trento.

Mitra, R.D., Butty, V.L., Shendure, et al. 2003. Digital genotyping and haplotyping with polymerase colonies. Proc. Natl. Acad. Sci. USA, 100:5926–5931.

Papadimitriou, C.H. 1994. Computational Complexity. Addison-Wesley, New York.

Patil, Berno, Hinds, et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294:1719–1723.

Pennisi. 2007. Breakthrough of the year: human genetic variation. Science, 318:1842–1843.

Song, Y.S. 2006. A concise, necessary and sufficient condition for the existence of a galled-tree. IEEE/ACM Trans. Comput. Biol. Bioinform., 3:186–191.

Song, Y.S., Wu, Gusfield. 2005. Algorithms for imperfect phylogeny haplotyping (IPPH) with a single homoplasy or recombination event. Lect. Notes Comput. Sci., 3692:152–164.

Thorisson, Smith, Krishnan, et al. 2005. The international HapMap project web site. Genome Res., 15:1591–1593.

Wang, Zhang. 2001. Perfect phylogenetic networks with recombination. J. Comput. Biol., 8:69–78.