Abstract
Abstract
The problem of determining haplotypes from genotypes has gained considerable prominence in the research community since the beginning of the HapMap project. Here the focus is on determining the sets of SNP values of individual chromosomes (haplotypes), since such information better captures the genetic causes of diseases. One of the main algorithmic tools for haplotyping is based on the assumption that the evolutionary history for the original haplotypes satisfies perfect phylogeny. This tool can be applied only on individual blocks of chromosomes, in which it is assumed that recombinations do not happen. However, exact determination of blocks is usually not possible. It would be desirable to develop a method for haplotyping which can account for recombinations, and thus can be applied on multiblock sections of chromosomes. A natural candidate for such a method is haplotyping via phylogenetic networks (which model recombinations) or their simplified version: galled-tree networks. However, even haplotyping via galled-tree networks appears hard, as the efficient algorithms exist only for very special cases: the galled-tree network has either a single gall or only small galls with two mutations each. Building on our previous results, we show that, in general, haplotyping via galled-tree networks is NP-complete, and it remains NP-complete when galls are allowed to have at most k mutations, for any k ≥ 3.
1. Introduction
Various methods can be used to infer haplotypes from genotypes for population data. The first heuristic algorithm for computational haplotype inference was designed by Clark (1990). The exact version of Clark's problem was shown to be NP-hard (Gusfield, 2001). Another approach, called pure-parsimony haplotyping, asking for a solution with the minimum number of distinct haplotypes, was shown to be NP-hard as well (Gusfield, 2003; Lancia et al., 2002). Gusfield (2002) developed the first exact polynomial algorithm based on the assumption of no recombinations happened during the evolutionary history of the haplotypes in consideration, which allowed him to make effective use of phylogenetic trees. This assumption was justified by experimental results that show many chromosomes are blocky, with a strong correlation between sites on the same block (Daly et al., 2001; Patil et al., 2001). As such, these experiments do not exclude recombinations within a block; models that allow for recombinations are needed.
A first attempt in haplotyping via models which allow a limited number of biological events that violate the perfect phylogeny model was made by Song et al. (2005). In their article, a polynomial algorithm for haplotyping via imperfect phylogenies with a single homoplasy was presented, as well as a practical algorithm for haplotyping via galled-tree networks with one recombination cycle (gall). Galled-tree networks are special instances of phylogenetic networks, which in turn generalize phylogenetic trees by incorporating recombinations in the model (Wang et al., 2001). There is always a phylogenetic network for any set of haplotypes, while finding such phylogenetic networks with the smallest number of recombinations is NP-hard (Bordewich and Semple, 2004; Wang et al., 2001), and hence, haplotyping via phylogenetic networks is either easy and meaningless (any inferring is good) or intractable, depending on whether the minimum number of recombinations is required.
A galled-tree network is a special type of phylogenetic network in which recombination cycles do not intersect. Similar to phylogenetic trees, not every set of haplotypes admits a galled-tree network; however, it can be decided in polynomial time whether it is the case (Gusfield et al., 2004b). In addition, if there is a galled-tree network, it is easy to find the one (reduced GTN) with the smallest number of recombinations, and no phylogenetic network for the same set of haplotypes has fever recombinations. In earlier work (Gupta et al., 2006), we found a characterization of the existence of galled-tree networks. A similar characterization was independently discovered in Song (2006). Building on this characterization, we developed a polynomial algorithm for haplotype inference via galled-tree networks with simple galls, having two mutations each based on reduction of haplotyping problem to a hypergraph covering problem in Gupta et al. (2007). It is very natural to ask whether the assumption on the number of galls or the size of galls can be dropped and still hope for a polynomial algorithm. In Gupta et al. (2009), we reduced the haplotype inferring problem to a hypergraph covering problem for genotype matrices satisfying a combinatorial condition.
Building on our previous work, here we show that the problem of inferring haplotypes via galled-tree networks is NP-complete by reduction from 3-SAT. Moreover, the problem remains NP-complete even if we put any weaker condition on the galls (without restricting their number) than single galls. In fact, for any k > 2, haplotyping via gall-tree networks with galls, having at most k mutations, is NP-complete.
2. Definitions
2.1. Haplotype inferring from population data
Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variations. A set of SNP values (e.g., SNPs that sit on a gene) on a single chromosome is called a haplotype. SNPs usually take two values among all the human population. Therefore, haplotypes are commonly represented as sequences of 0 and 1, by fixing a mapping of {0, 1} to two possible states in {A, C, G, T} at each SNP position. A combined information from two haplotypes for a matching pair of chromosomes is called a genotype. Here, the information about which value comes from the first and which from the second copy of the chromosome is lost. Genotype sequence is usually represented as a sequence of {0, 1, 2}, where value 0 or 1 at certain position i represents the fact that both haplotypes have this value at i (homozygous), while value 2 means that the values on two haplotypes at position i differ (heterozygous). The haplotype inference problem, or simply haplotyping, asks for determining of haplotype sequences based on genotype sequences of a set of individuals:
Definition 1 (Haplotyping)
Given a genotype n × m matrix A with values {0, 1, 2}, we say that a haplotype 2n × m matrix B with values in {0, 1} is inferred from A if and only if for every SNP if if A(i, c) = 2, then B(2i − 1, c) ≠ B(2i, c).
Obviously, there are exponentially many ways in the number of 2's in a row how to infer two haplotypes from this row. Therefore, various types of parsimonious criteria are used to choose the most plausible inferring of the whole set of genomes, including maximum resolution problem of Clark, pure parsimony criteria, haplotyping via perfect phylogeny and several statistical methods; for an overview, see Gusfield and Orzack (2006). In this article, we are interested in haplotyping via galled-tree networks which allow for recombination events, defined in the next subsection.
2.2. Phylogenetic and galled-tree networks
In phylogenetic trees, each vertex is labeled by a sequence of states of characters (e.g., SNPs) and is connected by a mutation edge to its parent along which one character changes its state. Phylogenetic networks introduced in Wang et al. (2001), sometimes called “recombination networks,” are an extension of phylogenetic trees in which a vertex can be connected by two recombination edges to two parents, and the label sequence for this recombination vertex is formed by a recombination of sequences of its two parents.
Definition 2 (Phylogenetic network)
A phylogenetic network N on m characters is a directed acyclic graph containing exactly one vertex (the root) with no incoming edges, and each other vertex has either one incoming (mutation) edge or two incoming (mutation) edges. A vertex x with two incoming edges is called a recombination vertex.
Each integer (character) from 1 to m is assigned to exactly one mutation edge in N, and each mutation edge is assigned one character. Each vertex in N is labeled by a binary sequence of length m, starting with the root vertex which is labeled with the all-0 sequence. Since N is acyclic, all other vertices in N can be recursively labeled, and the vertices in N can be topologically sorted into a list, where every vertex occurs in the list only after its parent(s). Using that list, we can define the labels of the non-root vertices, in order of their appearance in the list, as follows:
For a non-recombination vertex v, let e be the mutation edge labeled c coming into v. The label of v is obtained from the label of v's parent by changing the value at position c from 0 to 1. Each recombination vertex x is associated with an integer
In this article, the sequence at the root of the phylogenetic network is always the all-0 sequence, and all results are relative to that assumption. More general phylogenetic networks with unknown root were studied recently by Gusfield (2005). A phylogenetic network for a given binary matrix M is illustrated in Figure 1.

A phylogenetic network for matrix M. In the network, each mutation edge is labeled by a character ci,
Definition 3
Given an n × m matrix A with values in {0, 1}, we say that a phylogenetic network N with m characters explains A if each sequence of A is a label of some vertex in N.
Finding a phylogenetic network with the minimum number of recombination vertices for a given haplotype matrix is NP-hard (Bordewich and Semple, 2004; Wang et al., 2001). Hence, a more restricted version of phylogenetic networks was studied in several articles (Gusfield et al., 2004a,b; Wang et al., 2001). The restricted version can be defined as follows.
Definition 4 (Galled-tree network)
In a phylogenetic network N, let v be a vertex that has two paths out of it that meet at a recombination vertex x (v is the least common ancestor of the parents of x). The two paths together form a recombination cycle C. The vertex v is called the coalescent vertex. We say that recombination cycle C contains a character i, if i labels one of the mutation edges of C.
A phylogenetic network is called a galled-tree network if no two recombination cycles share an edge. A recombination cycle of a galled-tree network is sometimes referred to as a gall.
We now define the conflict graph, introduced in Gusfield et al. (2004b).
Definition 5 (Conflict graph)
We say that the SNPs c1 and c2 conflict in a haplotype matrix B if B contains all three pairs [0, 1], [1, 0], and [1, 1] in SNPs c1 and c2. An SNP is called conflicted if it is involved in at least one conflict, and is otherwise called unconflicted. The conflict graph GB has the vertex set
The following definition defines a special type of galled-tree networks which are important in characterizing matrices explainable by a galled-tree network based on their conflict graph.
Definition 6 (Reduced galled-tree network)
A galled-tree is called a reduced galled-tree if every gall only contains conflicted SNPs.
Note that the example in Figure 1 is not a galled-tree network as two galls share an edge. A galled-tree network for a binary matrix M is illustrated in Figure 2. The two recombination cycles in this network do not share any edges. This galled-tree is reduced, since the following pairs of SNPs conflict: c2 and c6, c3 and c6, c7 and c8.

A galled-tree network for matrix M. Two galls labeled by the set of vertices {s5, s6, s9, s10} and {r, s2, s4, s3, s7} do not share any edges in the network.
2.3. Known properties of galled-tree networks
In this section we state some known facts about galled-tree networks proved in Gusfield et al. (2004b) which we utilize to prove our results.
Theorem 1 (Gusfield et al., 2004b)
Let N be a galled-tree network that explains a haplotype matrix B. Then there is a one-one correspondence between the non-trivial connected components (having at least two vertices) of the conflict graph GB and the galls in N containing conflicted SNPs. Each such gall in N contains all conflicted SNPs of one non-trivial connected component of the conflict graph GB, and contains no conflicted SNPs from a different non-trivial connected component of GB.
The following corollary easily follows from the definition of reduced galled-tree networks and Theorem 1.
Corollary 1
Let N be a “reduced” galled-tree network that explains a haplotype matrix B. Then there is a one-one correspondence between the non-trivial connected components of the conflict graph GB and the galls in N. Each gall in N contains all conflicted SNPs of one non-trivial connected component of the conflict graph GB, and contains no other SNPs.
It follows by the algorithm for constructing galled-tree networks presented in Gusfield et al. (2004b) or in Gupta et al. (2006) that if a haplotype matrix B can be explained by a galled-tree network, then it can be explained by a reduced galled-tree network as well. In addition, it follows by Theorem 1 and Corollary 1, that a reduced galled-tree network explaining B will have the smallest number of galls and its galls will contain the smallest possible number of SNPs. Hence, we have the following corollary.
Corollary 2
If there exists a galled-tree network explaining a haplotype matrix B with at most k SNPs on each gall, then there is a reduced galled-tree network explaining B with at most k SNPs on each gall.
2.4. Inferring haplotypes via galled-tree network and the extended hypergraph covering problem
In this section, we will recall the characterization for the galled-tree network haplotyping (GTNH) problem using a hypergraph covering problem developed in Gupta et al. (2009). This characterization works only for genotype matrices satisfying special combinatorial properties. To state the result from Gupta et al. (2009), we need the following definitions.
Definition 7
Given a genotype matrix A, we say that A can be explained by a galled-tree network if there exists a haplotype matrix B inferred from A such that B can be explained by a galled-tree network.
Problem 1 (Galled-Tree Network Haplotyping [GTNH] Problem)
Given a genotype matrix A, decide if A can be explained by a galled-tree network.
Let k be a fixed integer. The k-GTNH problem is defined as follows:
Problem 2 (k-Galled-Tree Network Haplotyping [k-GTNH] Problem)
Given a genotype matrix A, decide if A can be explained by a galled-tree network in which each gall contains at most k mutation edges.
Note that for a matrix satisfying weak property (for every column that contains 2, also contains 1), 2-GTNH Problem is polynomial (Gupta et al., 2007).
Next, we will give the definitions of the combinatorial properties of genotype matrices used in Gupta et al. (2009).
Definition 8 (Simple genotype matrix)
We say that a genotype matrix is simple if every row contains either zero or three 2's.
Definition 9 (Inducing)
Given a genotype matrix A, for every
Definition 10 (Weak diagonal [WD] property)
Given a genotype matrix A, we say that a pair of SNPs is active if it contains [2, 2], or it induces all three pairs [1, 1], [0, 1] and [1, 0]. Further, we say that a pair c1, c2 is weakly active if either it is active, or if there is an SNP c3 such that c1, c3 and c2, c3 are both active pairs. We say that A has the weak diagonal property if every weakly active pair of SNPs induces both [0, 1] and [1, 0].
For purposes of this article and clarity, we will consider the following (stripped) version of the extended genotype hypergraph.
Definition 11 (Extended genotype hypergraph [EGH])
An extended genotype hypergraph is a hypergraph with hyperedges containing either two or three vertices, and a list of switches, which are ordered triples of vertices.
Given a simple n × m genotype matrix A, the extended genotype hypergraph HA of A has the set of SNPs for every row r of A containing three 2's, say in SNPs c1, c2, c3 there is a hyperedge er = {c1, c2, c3}; and for every two SNPs c1 and c2 inducing [0, 1], [1, 0] and [1, 1] in A, there is a hyperedge {c1, c2} in HA.
Furthermore, for every triple of SNPs c1, c2, c3 such that there are distinct hyperedges e and e′ such that
The following definition defines a graph covering of an extended genotype hypergraph.
Definition 12 (Covering of EGH)
Consider an extended genotype hypergraph H. We say that a graph G with the same vertex set as H covers H if G can be obtained as follows:
for every 2-edge {c1, c2} of HA, add the edge (c1, c2) in G; for every 3-edge {c1, c2, c3} of HA, add exactly one of the edges (c1, c2), (c2, c3) and (c1, c3) to G; and
and for every switch [c1, c2, c3], at most one of the edges (c1, c2) and (c2, c3) is in G. A graph G that covers H is called a covering of H.
Let k be a fixed integer. The k-EHC problem can be formulated as follows:
Problem 3 (k-Extended Hypergraph Covering [k-EHC] Problem)
Given an extended genotype hypergraph H, determine whether there is a covering G of H such that each connected component C of G is a path of length at most k satisfying the ordered component property:
C is bipartite with partitions L and R such that all vertices in L are smaller than all vertices in R. Recall vertices of G and H are integers from 1 to m.
The following result was shown in Gupta et al. (2009).
Theorem 2 (Gupta et al., 2009)
Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network N. Then each component of GB is a path of length at most 3.
The following two corollaries follow by Corollaries 1 and 2, and Theorem 2.
Corollary 3
Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network. Let N be a reduced galled-tree network explaining B. Then each gall in N contains at most four SNPs.
Corollary 4
Given a simple genotype matrix A with the WD property, let B be a matrix inferred from A which can be explained by a galled-tree network. Let N be a reduced galled-tree network explaining B. For any 1 ≤ k ≤ 3, each gall in N contains at most k + 1 SNPs if and only if each component of GB is a path of length at most k.
The next result from Gupta et al. (2009) characterizes when a genotype matrix can be explained by a galled-tree network in terms of hypergraph coverings.
Theorem 3 (Gupta et al., 2009)
Consider a simple genotype matrix A with the WD property. Let B be a haplotype matrix inferred from A and GB its conflict graph. Then B can be explained by a galled-tree network if and only if GB is a covering of extended genotype hypergraph HA and satisfies the ordered-component property.
The following theorem is a characterization that we will use in the next section to prove the NP-hardness of restricted versions of the GTNH problem.
Theorem 4
Let A be a simple genotype matrix with the WD property and let k ≥ 1 be a fixed integer. Then A can be explained by a galled-tree network N in which each gall contains at most k + 1 SNPs if and only if there exists a covering G of HA such that every component of G is a path of length at most min(3, k) and has the ordered-component property.
Conversely, suppose that G is a covering of HA such that every component of G is a path of length at most min(3, k) and has the ordered-component property. Then by Theorem 3, there is a haplotype matrix B inferred from A with the conflict graph GB = G which can be explained by a reduced galled-tree network N. Moreover, by Corollary 4, each gall in N contains at most min(3, k) + 1 SNPs. Therefore, A can be explained by a galled-tree network N where each gall contains at most k + 1 SNPs. ▪
Note that the WD property of the genotype matrix forces the components in the conflict graphs of inferred haplotype matrices to be small. In Gupta et al. (2009), it was also shown that the 3-EHC problem is NP-complete, hence this characterization fails to provide a polynomial solution for the GTNH problem even for such special genotype matrices. On the other hand, since not every extended genotype hypergraph has a corresponding genotype matrix—in particular, the gadgets used to show NP-completeness of the 3-EHC problem in Gupta et al. (2009) do not have a corresponding genotype matrix—this result does not imply that the GTNH problem is NP-complete. In the next section, we consider special instances of extended genotype hypergraphs for which, as we will see later, it is possible to construct a corresponding genotype matrix and show that the EHC problem for them remains NP-complete.
3. Gtnh Problem is Np-Complete
The proof of NP-completeness is done in two steps. First, we define special instances of extended genotype hypergraphs, and show that the 2-EHC and 3-EHC problems for them are NP-complete. Then we show that it is possible to construct a genotype matrix for each such instance. The NP-completeness of the GTNH problem then follows by the characterization obtained in Gupta et al. (2009) (Theorem 4).
Definition 13 (Natural EGH)
We say that an extended genotype hypergraph H is natural if for any two hyperedges e,e′ of H,
We will show that the extended genotype hypergraph covering problem for natural EGHs is NP-complete by reduction from 3-SAT. The proof follows the idea of the proof of NP-completeness of the EHC problem in Gupta et al. (2009). However, the proof in Gupta et al. (2009) assumed that there are no switches in the EGH, which was the main reason why there was no corresponding genotype matrix for the constructed EGH. In the following proof, the gadgets had to be redesigned to take into account the existence of the switches.
Theorem 5
The 3-EHC problem for natural extended genotype hypergraphs is NP-complete.
depending on whether Ci contains two or three literals.
Next, we construct a natural extended genotype hypergraph H(f) for f which has a covering if and only if the formula f is satisfiable. The hypergraph H(f) will be an edge-disjoint union of several gadgets, one for each clause and one for each variable. The only vertices in common among gadgets will be the literal vertices; in particular, each literal vertex will be shared between one clause gadget and one variable gadget. Furthermore, in each gadget we will mark every vertex either with a dot or a cross such that every literal vertex will be marked with a dot. This will guarantee that our marking will be consistent in whole H(f). Using this marking, we order the vertices of H(f) such that every vertex marked with a dot precedes every vertex marked with a cross. This ordering implies that, to verify the ordered component property of a covering, it is enough to check that in such a covering every path of length two or three alternates between vertices with crosses and dots.
Now we define the gadgets; we start with the clause gadgets. Consider a covering c of H(f). We say that a literal pj has value 1 in this covering, if c restricted to the clause gadget containing pj contains an edge incident to the vertex pj. Note that this is well defined since for every literal vertex pj there is a unique clause gadget containing it.
For every clause

NP-completeness for 3-EHC problem.
For every clause

NP-completeness for 3-EHC problem.
In the second part of the construction, for each variable xi, we add a variable gadget which will guarantee that three occurrences of a variable xi: p3i − 2,p3i − 1,p3i must be assigned consistent values. That is if p3i − 2 (positive occurrence) has value 1 then both p3i − 1 and p3i (negated occurrences) should have values 0, and if at least one of p3i − 1 and p3i has value 1 then p3i − 2 should have value 0. This is achieved by a gadget consisting of three 2-edges and thirteen 3-edges depicted in Figure 5a. Figure 5b,c shows two possible coverings of the gadget. In these figures, a variable pj has value 1 if no edge in the gadget joins pj, which is in agreement with interpretation of values of pi's in gadgets of the first part of construction.

NP-completeness for 3-EHC problem.
Let us verify the claimed property of the gadget. Assume for instance that both p3i − 2 and p3i − 1 have value 1. No edge covering the variable gadget can be adjacent to any of these two vertices, otherwise the condition on switches (crossing from variable to clause gadgets) would be violated. Hence, the edges connecting the other two vertices of the 3-edges containing p3i − 2 or p3i − 1 have to be in the covering. Similarly, edges e1, e2 in Figure 5e have to be in the covering. Now, there is no edge to be selected to cover the 3-edge in the middle of the gadget, as the selection of any of the dashed edges would produce a path of length 4 or 5. The other cases can be proved using similar arguments.
Finally, we have to check that it is possible to find a covering of H(f) that satisfies the conditions of the 3-EHC problem (a solution to the 3-EHC problem) if and only if f is satisfiable. First, consider a covering G that is the solution to the 3-EHC problem for H(f). For every clause Ci, at least one of
For the converse, consider a true assignment for f. For every clause
Theorem 6
The 2-EHC problem for natural extended genotype hypergraphs is NP-complete.

NP-completeness for 2-EHC problem.
Figure 7a depicts a variable gadget which will guarantee that three occurrences of a variable xi: p3i − 2, p3i − 1, p3i must be assigned consistent values. Figure 5b,c shows two possible coverings of the gadget. Figure 5b depicts a covering of the gadget in which p3i − 2 is set to one and p3i − 1, p3i are set to zero. In Figure 5c, p3i − 2 is set to zero and p3i − 1 and p3i can have arbitrary values. To prove that the occurrences of a variable xi must be assigned consistent values, it remains to prove that it is impossible to set both p3i − 2 and p3i − 1 or both p3i − 2 and p3i to value 1. Without loss of generality, assume that both p3i − 2 and p3i − 1 have value 1. No edge covering the variable gadget can be adjacent to any of these two vertices, otherwise the condition on switches (crossing from variable to clause gadgets) would be violated. Hence, the edges connecting the other two vertices of the 3-edges containing p3i − 2 or p3i − 1 have to be in the covering. Similarly, edges e1, e2 in Figure 7e have to be in the covering. Now, there is no edge to be selected to cover the 3-edge in the middle of the gadget, as the selection of any of the dashed edges would produce a path of length 3 or 4. The other case can be proved using similar argument.

NP-completeness for 2-EHC problem.
The rest of the proof is similar to the proof of Theorem 5. ▪
The following lemma shows that, for every natural EGH, there is a corresponding simple genotype matrix with the WD property.
Lemma 7
For every natural EGH H, there is a simple genotype matrix A with the WD property such that HA = H.
(1) Let A(H) have ∣V (H)∣ columns and each vertex (2) For each 3-edge {c1, c2, c3} of H, add a new row r to A(H) such that r[c] = 2, if (3) For each 2-edge {c1, c2} of H, add a new row r to A(H) such that r[c] = 1, if (4) For every vertex c of H with degree 1, add a new row r to A(H) such that r[c] = 1, and r[c′] = 0 for any c′ ≠ c.
Let A = A(H). Obviously, HA and H have the same sets of 3-edges. Consider a 2-edge {c1, c2} of H. By definition of A, c1, c2 induce [1, 1] in A. Let us show, for instance, that they also induce [1, 0]. First, if the degree of c1 in H is one, then there is a special row in A for c1 which induces [1, 0]. Second, assume that there is other hyperedge in H containing c1. Then the row in A for this hyperedge induces [1, 0]. Hence, HA contains hyperedge {c1, c2}. Finally, observe that HA cannot contain any other 2-edge, as pair [1, 1] can be induced only in the rows added in the step (3).
Next, we need to show that HA and H have the same lists of switches. Assume that H contains a switch [c1, c, c2]. Then c has a degree at least 3 in H, and there are two distinct hyperedges e1 and e2 containing c and
Assume now that HA contains a switch [c1, c2, c3]. By definition, there are two distinct hyperedges e and e′ such that

Example of a genotype matrix A(H) constructed for a natural extended hypergraph H.
Finally, we will show that the matrix A has the WD property. We will show a stronger result that any two columns
Now, for any pair of vertices c and c′, if each of them has degree more than 1, then the corresponding pair of columns
The main result follows by Theorems 4, 5, and 6, and Lemma 7.
Theorem 8
The galled-tree network haplotyping problem is NP-complete. In addition, the problem remains NP-complete even if we require that each gall contains at most k mutation edges, for any fixed k ≥ 3.
Note that for k = 1, the galls must contain only unconflicted SNPs, and hence they can be omitted (the reduced galled-tree network explaining the same haplotype matrix would not contain them), i.e., k = 1 is equivalent to haplotyping via perfect phylogenetic tree, and hence, can be solved in polynomial time (Bafna et al., 2003). For k = 2, it was shown in Gupta et al. (2007) that the problem is polytime solvable under assumption that the input genotype matrix has the weak property (every column containing 2 contains also 1).
4. Conclusion
We have shown that the GTNH problem is NP-hard in general. Furthermore, we have characterized the complexity of the problem depending on the maximal size of the galls (the number of mutations edges—SNPs on it): The problem is still NP-hard for any k ≥ 3. For k = 1, the problem is equivalent to the haplotyping via perfect phylogeny, and thus polynomial. For k = 2, the problem is polynomial for genotype matrices with the weak property. To complete the characterization, it would be interesting to determine whether 2-GTNH Problem remains polynomial without the weak property, or whether it becomes NP-hard. The techniques used to prove NP-completeness for k ≥ 3 presented in this paper, cannot be directly used to show NP-hardness of the case k = 2, since they are based on assumption that the genotype matrix has the weak diagonal property which implies the weak property.
Another possible direction of future research is to characterize the complexity of the problem based on the number of galls. The proofs presented in this article assumed that the number of galls is not restricted. In Song et al. (2005), a practical algorithm for the case with a single gall was presented; however, it is not clear whether it works in polynomial time. Thus, it would be interesting to see whether the problem is polytime solvable for a constant number of galls.
