Abstract
Abstract
Protein/peptide microarrays are rapidly gaining momentum in the diagnosis of cancer. High-density and high-throughput peptide arrays are being extensively used to detect tumor biomarkers, examine kinase activity, identify antibodies having low serum titers, and locate antibody signatures. Improving the yield of microarray fabrication involves solving a hard combinatorial optimization problem called the border length minimization problem (BLMP). An important question that remained open for the past 7 years is if the BLMP is tractable or not. We settle this open problem by proving that the BLMP is
1. Introduction
C
The BLMP can be stated as follows. Given N2 strings of the same length, how do we place them in a grid of size N×N such that the Hamming distance summed over all the pairs of neighbors in the grid is minimized? The BLMP has received a lot of attention from many researchers. The earliest algorithm suggested by Hannenhalli et al. (2002) reduces BLMP to TSP (traveling salesman problem) by computing a tour of the strings and then threading the tour on the grid. Kahng et al. (2003) have proposed several other heuristic algorithms that are considered the best performing algorithms in practice. De Carvalho and Rahmann (2006) introduced a quadratic program formulation of the BLMP but unfortunately the quadratic program is an intractable problem. Later, Kundeti and Rajasekaran (2009) formulated the problem as an integer linear program, which performs better than the quadratic program in practice.
Certain generalizations of the BLMP are known to be
Despite many studies on the BLMP, the question of whether BLMP is tractable or not remained open for the past seven years. In this article, we show that the BLMP is
Our article is organized as follows. Section 2 formally defines the BLMP and HGPMP. Section 3 provides the
2. Problem Definition
Let S be a set of strings of the same length with
Obviously, if G is a ring graph, then HGPMP is the same as the well-known Hamming traveling salesman problem (HTSP). If G is a grid graph of size N×N (where N2=n), then HGPMP becomes the border length minimization problem (BLMP), which is the main study of our article.
3.
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${ \cal NP}$$\end{document}
-Hardness Of The Blmp And Hgpmp
Theorem 1
The BLMP is
We will show that the Hamming traveling salesperson problem (HTSP) for strings (with the Hamming distance metric) polynomially reduces to the BLMP. The HTSP is already defined in Section 2.
The idea of the proof is that given 4N strings for the HTSP, we construct (N+1)2 strings for the BLMP such that from an optimal solution to this BLMP, we can easily obtain an optimal solution for the HTSP. So we need to consider the variant of the HTSP in which the number of strings is divisible by 4. The proof will be presented in stages. The next three subsections present some preliminaries needed for the proof of the theorem. Followed by these subsections, the proof is presented.
3.1. 4N-strings traveling salesperson problem
Define an instance of the HTSP as a 4N-strings HTSP if the number of strings in the input is 4N (for some integer N). In this section, we show that the 4N-strings HTSP is
Theorem 2
4N-strings HTSP is
Proof
We will show that the HTSP polynomially reduces to the 4N-strings HTSP. Let
It is easy to see that in an optimal tour for the above 4N-strings HTSP instance, all the copies of
3.2. A special instance of the BLMP
Consider the following (N+1)2 strings as an input for the BLMP:
Lemma 1
In any optimal solution to the above BLMP instance,

An illustration for Lemma 1 with N = 4. Each ti lies on a dark vertex in the grid.
Proof
This can be proven by contradiction. Let T be the collection of the strings
Let u be the number of neighbors of q from T. Let v be the number of neighbors of r from T. Note that 0 ≤ u ≤ 4 and 0 ≤ v ≤ 3. In the current solution, the total cost incurred by q and r is at least
We thus conclude that all the strings of T lie on the boundary of the grid in any optimal solution. ■
3.3. A special set of strings and some operations on strings
We denote the (ordered) concatenation of two strings x and y as x+y. If x and x′ (respectively y and y′) have the same length then, clearly, δ(x+y, x′+y′)=δ(x, x′)+δ(y, y′).
Given a string
Given an integer n, we can construct a set of n strings of length n each,
3.4. Proof of the main theorem
Now we are ready to present the proof of Theorem 1. Let
The input for the BLMP instance that we generate will be
We choose h so that T satisfies the condition in Lemma 1. Particularly, choose h=8l. Now we will show that OPTBLMP(T)=4(N − 1)(h+l)+8Nh+2OPTHTSP(S), which in turn means that an optimal solution for the BLMP on T will yield an optimal solution for the 4N-strings HTSP on S.
Let
On the other hand, let B be an optimal solution for the BLMP on T. By Lemma 1, ti's lie on the border of the grid, and the copies of t lie on the center of the grid. Assume that ti's lie in the order
This completes the proof of Theorem 1. ■
3.5.
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${ \cal NP}$$\end{document}
-hardness of the HGPMP for other special cases
We can generalize the result in Theorem 1 for other special cases of the HGPMP. We say graph G is “bordered-ring” if G is undirected and G has a ring of size Ω(nα) for some constant α > 0 such that every vertex in the ring has degree no greater than d and every vertex outside the ring has degree greater than d for some d ≥ 3. For grid graphs,
Theorem 3
The HGPMP is
Proof
By a similar reduction to that of the BLMP above, the theorem follows. ■
3.6. An alternate
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${ \cal NP}$$\end{document}
-hardness proof for the BLMP
In this section, we give an alternate
3.6.1. k-Segments traveling salesperson problem
We define the k-segments HTSP and show that it is NP-hard. Consider an input of n strings:
Theorem 4
The k-segments HTSP for strings is
Proof
We will prove this for k=4 (since this is the instance that will be useful for us to prove the main result), and the theorem will then be obvious.
We will show that the HTSP polynomially reduces to the 4-segments HTSP. Let
Consider the 4 strings:
The input strings for the 4-segments HTSP are
Clearly, in an optimal solution for the 4-segments HTSP instance, the four parts have to be
3.6.2. A special instance of the BLMP
Consider the following n2 strings as an input for the BLMP:
Lemma 2
In an optimal solution to the above BLMP instance,
Proof
Let T be the collection of strings
Let S1 and S2 be two segments such that S1 and S2 consist of strings from T, strings in S1 are in successive nodes, strings in S2 are in successive nodes, and these two segments are not successive. Consider the case in which none of these strings are in a corner of the grid. Let
Thus it follows that all the strings of T will be on the boundary, and they will be found in successive nodes in any optimal solution. Also it helps to utilize the corners of the grid since each use of a corner will reduce the total cost by nine. Therefore, in an optimal solution there will be four segments such that all the segments are in the boundary of the grid, each segment has strings from T in successive nodes, and one string of each segment occupies a corner of the grid. In other words, an optimal solution for the BLMP instance contains an optimal solution for the 4-segments TSP corresponding to T. The optimal cost for this BLMP instance is 25n − 28. ■
3.6.3. Construction of strings for the above BLMP instance
We can construct n2 strings that have the same properties as the ones in the above BLMP instance. To begin with, we construct (n+1) binary strings of length n each. The string ti has all
Now, in each ti (for 1 ≤ i ≤ (n+1)), replace every
Finally, append a
3.6.4. The alternate proof of the main theorem
Let
We will use as the basis the (n+1) strings generated in the above section. Recall that these strings
Replace each
Replace each
The input for the BLMP instance that we generate will be
Note that strings of this BLMP instance are comparable to the strings we had for Lemma 2. This is because the interstring distances are very nearly in the same ratios for the two cases. As a result, using a proof similar to that of Lemma 2, we can show that the strings
Let |Si|=ni for 1 ≤ i ≤ 4. The total cost (i.e., the border length) for the above BLMP solution can be computed as follows. Consider S1 alone. The cost because of this segment is C1+2(9nl+l)+(n1 − 1)(9nl+l). The cost 2(9nl+l) is due to the two end points of the segment S1. The cost (n1 − 1)(9nl+l) is because each string of S1 (except for the one in a corner of the grid) is a neighbor of a t. Upon simplification, the cost for S1 is C1+(n1+1)(9nl+l). Summing over all the four segments, the total cost for the BLMP solution is C1+C2+C3+C4+(n+4)(9nl+l). The minimum value of this is obtained when S1, S2, S3, and S4 form a solution to the 4-segments HTSP on T.
Clearly, an optimal solution for the 4-segments HTSP on T will also yield an optimal solution for the 4-segments HTSP on S. This can be seen as follows. Consider the strings in Si and let
This completes the proof of Theorem 1. ■
4. Algorithms For The Blmp
4.1. An O(N)-approximation algorithm
In this section, we will show that a simple version of the algorithm suggested by Hannenhalli et al. (2002) is actually an O(N)-approximation algorithm. This algorithm can be described as follows. Assume that the input is the set of strings

The thick dark line corresponds to an optimal tour on the input strings.
Lemma 3
OPTHTSP(S) ≤ 2OPTBLMP(S).
Proof
Let A be an optimal solution for the BLMP on S. Consider the path P′ drawn as the thick dark line in Figure 2. Obviously, Cost(P′) ≤ Cost(A)=OPTBLMP(S). Let
Theorem 5
The above algorithm yields an O(N)-approximate solution.
Proof
First, we see that
4.2. A hierarchical refinement algorithm
Several heuristics such as the epitaxial growth have been proposed to solve the BLMP problem earlier. However, most of these heuristics do not improve the cost monotonically. Local search-based algorithms are often employed to solve hard combinatorial problems. We now introduce a hierarchical refinement algorithm (HRA). This refinement technique can be applied to any heuristic placement to refine the cost and obtain a better placement. Let N be the number of probes in the placement, d a positive integer such that dx=N, and x ≥ 1 is called the degree of refinement. The refinement algorithm starts with a given placement, then it divides the placement into

Illustration of the hierarchical refinement algorithm with degree of refinement 3. This shows the possible optimal solutions (i.e., permutation among subproblems) at the top-most and penultimate levels.
We should remark that while solving a subproblem optimally, we also consider the cost contributed from the neighboring subproblems. This ensures the monotonic improvement in the placement cost. The refinement algorithm asymptotically runs in Θ(d2!N) time. If d=O(1), the refinement algorithm runs in linear time. For small values of d, the algorithm performs well in practice. HRA is a deterministic refinement algorithm. We further extend this by introducing randomness. The randomized hierarchical refinement algorithm (RHRA) is similar to the HRA algorithm. RHRA randomly selects a subsquare within the given placement and applies the HRA technique to the selected subsquare. Similar to local search algorithms, repeating RHRA algorithm several times improves the placement cost monotonically. We study the performance of both these algorithms in Section 5.
4.3. Quad epitaxial algorithm
The epitaxial (EPX) placement suggested in Kahng et al. (2003) places a randomly selected probe at the center of the array, it continues placing the probes greedily around the locations adjacent to the placed probes to minimize the cost (i.e., the algorithm almost spends O(N2) time to place each probe). The epitaxial algorithm gives good results for small arrays, but for larger arrays the epitaxial algorithm is impractical and extremely slow. We propose the quad epitaxial (QEPX) algorithm as a simple extension to the epitaxial algorithm. QEPX yields good performance and is very fast compared to the EPX algorithm. The basic idea behind the QEPX algorithm is to divide the array into four parts, apply EPX algorithm for each of the four parts, and finally find an optimal arrangement among the four parts. In Section 5, we compare the QEPX algorithm with EPX algorithm.
5. Experimental Study
5.1. Performance of the QEPX algorithm
In this section, we compare the performance of QEPX algorithm introduced earlier. We use randomly generated probe arrays of size 322, 642, 1282, and 2562. In all of our experimental studies we compute a lower bound on the solution by picking the smallest 2N(N − 1) edges from the complete Hamming distance graph. Column 4 (Init Cost) in Table 1 indicates the placement cost obtained by placing the probes in the row-major order as given by the input. Column 5 indicates the final placement cost obtained by the epitaxial (quad) algorithm. As we can see from columns 7 and 10, the refinement obtained by the QEPX algorithm is very close to the EPX algorithm. On the other hand, QEPX runs 3.6×faster than the EPX algorithm. As we can see from Table1, as the chip size increases, EPX algorithm becomes very slow. We ran both EPX and QEPX algorithms on a chip size of 243×243 with a time limit of 60 minutes. The QEPX algorithm took around 12 minutes to complete and improved the input placement cost by 36%. On the other hand, the EPX algorithm did not complete the placement. From our experiments we conclude that the QEPX can provide a good placement, which we can use as an input for refinement/local search algorithms such as RHRA. In the next subsection we provide our experimental study of HRA and RHRA algorithms on various placement heuristics.
5.2. Performance of refinement algorithms
We have applied our HRA and RHRA refining algorithms on the following placement heuristics.
— (RAND) Random placement: In this placement, we use the order in which the probes are provided to our algorithm.
— (SORT) Sort placement: In this placement, the input probes are sorted lexicographically.
— (SWM) Sliding window matching placement is obtained by running the SWM (Kahng et al., 2003) algorithm with parameters (6, 3).
— (REPX) Row epitaxial placement is obtained by running the row-epitaxial algorithm with three look-ahead rows.
— (EPX) Epitaxial placement is obtained by running the EPX algorithm.
— (QEPX) Quad epitaxial placement is obtained by our quad-epitaxial algorithm.
The cost of the placement obtained by running the HRA algorithm exactly once is given in column 5 (HRA). Column 6 (RHRA) indicates the placement cost obtained by running our randomized refinement algorithm RHRA for 350 iterations. From Table 2 we can see that as initial placement moves closer and closer toward the lower bound, the refinement percentage decreases, which is logical. For test cases with 729, 6561 (1024, 4096) probes we use a refinement degree d=3 (d=2). Choosing a bigger refinement degree gives better refinements, however, takes more time. Finally we conclude that our refinement algorithms would be very useful when applied in conjunction with fast initial placement heuristics.
6. Conclusions
In this article, we have studied the border length minimization problem (BLMP), which has numerous applications in biology and medicine. We solved a seven-year-old open problem in this area by showing that the BLMP is
One of the best-performing algorithms for the BLMP is the epitaxial algorithm (EPX). This algorithm takes too much time, especially when the number of probes is large. In this article we present a variant called the quad-epitaxial algorithm (QEPX), which is much faster than EPX while yielding a solution that is very close to that of EPX in quality. QEPX partitions the input into four parts, works on each part separately, and finally combines these solutions. This idea can be extended further to partition the input into more parts and hence this algorithm is ideal for parallelism.
Some of the additional goals are: 1) We have used a simple lower bound on the quality of solution for the BLMP, and it would be nice to develop tighter lower bounds; 2) to develop more efficient algorithms than EPX; and 3) to design parallel algorithms for the BLMP.
Footnotes
Acknowledgments
This work has been supported in part by the following grants: NSF 0326155, NSF 0829916, NIH 1R01GM079689-01A1, and NIH R01-LM010101.
Author Disclosure StatEment
The authors declare that no competing financial interests exist.
