Protein Structure Prediction (PSP) is the problem of predicting the three-dimensional native structure of a protein given its primary structure, i.e., the corresponding sequence of amino acids. Different approaches have been proposed to model this problem, and this research explores the prediction of optimal structures using the well studied simplified lattice Hydrophobic and Polar (HP) model—in particular, on the 2D square lattice.
We present a twofold result. First, we devise a new upper bound for the number of contacts achievable by an HP sequence, and show that it is in several cases more stringent than the upper bound previously known in literature. Then, we present an innovative algorithm that outperforms the state of the art in exact approaches for the prediction of optimal structures in lattice protein model, for 2D square lattices. The algorithm, called minwalk and based on a heavily pruned exhaustive search, also outperforms the state of the art in non-exact approaches in several cases.
Due to this algorithm, it is now possible to prove optimal results in the square 2D lattice, for standard HP sequences of size up to 80 elements, which were only best-known-results previously. Furthermore, we provide the degeneracy (i.e. all optimal solutions) of such benchmark sequences, which was unknown in literature. These results can be a useful tool to foster advances in further research.
1. Introduction
Protein structure prediction (PSP) is the problem of predicting the three-dimensional native structure of a protein given its primary structure, that is, the corresponding sequence of amino acids. It is among the most challenging problems in computational biology and, despite many advancements, it is still an open challenge. In the denatured state, a protein is a linear chain of amino acids. The physical process by which a protein achieves its final state, also known as tertiary structure, is called protein folding. In particular, this state corresponds to the folded state with minimum free energy.
Different approaches have been proposed to model this problem. This research aims at exploring the prediction of optimal structures using the simplified lattice model. In this model, each amino acid is represented as a point in an integer lattice, and the PSP problem is reduced to that of finding the self-avoiding walk(s) of minimum energy on the lattice. The energy model used here is the HP model, due to Lau and Dill (1989), where one considers the hydrophobic effect as the main interaction in the folding process. Because of this effect, the hydrophobic amino acids present in the folded proteins tend to form a compact core on the inside, while the polar amino acids tend to group on the surface. In this model, each amino acid is encoded as a bead, which can be of two different types: Hydrophobic (H) and polar (or hydrophilic) (P). Two beads of type H that are topologically first-neighbors and are not connected in the sequence contribute negatively to the energy. All the other combinations contribute zero energy. The PSP problem on two-dimensional (2D) and three-dimensional (3D) lattices is NP complete (Crescenzi et al., 1998; Berger and Leighton, 1998).
In this article, we present new properties of HP sequences on 2D square lattices, and an innovative algorithm that outperforms the state of the art in exact approaches for 2D square lattices. In particular, we describe a new relationship between the energy and the area of a specific type of configuration that we call connected and, on top of this result, a new upper bound on the energy achievable by an HP sequence. The algorithm, called minwalk, is based on a pruned exhaustive search of the input space. Given an HP sequence, our algorithm returns all the optimal configurations of the sequence.
Concerning a comparison with the state of the art in the field, an interesting survey of combinatorial algorithms on lattice protein folding models is Istrail and Lam (2009). A useful characterization is to divide past efforts into the following categories: exact and nonexact approaches on 2D and/or 3D lattices. In the remainder of this work, whenever we mention 2D or 3D, we mean 2D square lattices and 3D cubic lattices.
For exact approaches, the state of the art in 2D is represented by Irbäck and Troein (2002) and Holzgräfe et al. (2011). There have been other, earlier approaches (Shahrezaei and Ejtehadi, 2000), which, however, can only deal with smaller instances of the problem. An exact algorithm (Irbäck and Troein, 2002) was proposed that scales up to an instant size equal to 25. This algorithm enumerates, for a given length, all nondegenerate sequences (i.e., the ones that admit a single optimal configuration) of that length, together with the optimal configuration itself. This result was later improved by the same group (Holzgräfe et al., 2011) so as to handle sequences of length up to 30. In contrast, the algorithm presented in this article finds all optimal (possibly degenerate) configurations when given an input HP sequence, but it scales for rather larger sizes: It can find optimal solutions to random sequences of length up to around 45 and to specific sequences of length up to 85.
Several nonexact approaches also exist to find good structures of a given HP sequence (Unger and Moult, 1993; Liang and Wong, 2001; Lesh et al., 2003; Hsu et al., 2003; Shmygelska and Hoos, 2005; Thalheim et al., 2008; Rego et al., 2011). The articles (Hsu et al., 2003; Shmygelska and Hoos, 2005; Thalheim et al., 2008; Rego et al., 2011) illustrate, to the best of our knowledge, the best-performing nonexact approaches in the literature. Thalheim et al. (2008) contains an extensive comparison of them. In this article, we compare our exact algorithm with the best performing nonexact approaches, and we show that, for most benchmarks of sizes up to 85, we exhibit similar or even better running times while guaranteeing the optimality of the solutions found.
To sum up, we raise the maximum size of sequences that can be dealt with by an exact algorithm to return all its optimal configurations in 2D. At present, our algorithm returns optimal structures for sequences up to size around 45 in hundreds of seconds. We also present running times for several benchmark sequences of size 30, mostly in fractions of seconds. There is no exact approach for the 2D square lattice that can return optimal configurations for these sizes. We also show by experimental evaluation that our algorithm is competitive if compared with the best nonexact methods (Hsu et al., 2003; Shmygelska and Hoos, 2005; Thalheim et al., 2008; Rego et al., 2011). In all but one case, it achieves running times comparable with, or better than, the ones obtained by these algorithms, with the additional advantage that it guarantees optimality.
Despite its simplicity, the HP model has been successfully used to study general principles of protein folding. For example, the HP model on 2D lattice is used in Chen et al. (2010) to study protein evolution. Also, several studies in structural biology look at the designability of a structure, that is, the number of different sequences that have that structure as its unique minimum, both at simple exact models level, on the 2D lattice (Irbäck and Troein, 2002) and on the 3D lattice (Mann et al., 2008), and also at the atomic level. Only exact approaches can be used to study degeneracy—nonexact approaches cannot guarantee the optimality of the solution given, and therefore they cannot, as a consequence, give the degeneracy of optimal solutions. The algorithm here proposed can be a useful tool to give new insights into these issues in the 2D lattice, and the improved performance of exact methods that this article provides, can foster research in various contexts including the ones mentioned above.
For the 3D lattice case, the exact method proposed in Backofen and Will (2006) represents the state of the art. The authors introduced a high-performance algorithm, based on constraint programming, that can find optimal structures for sequences of length higher than 100. While the algorithm works on both the 3D cubic and FCC lattices, it does not extend to 2D. Most likely, this is because it depends on enumeration of all possible cores, and it is therefore particularly suited to lattices where optimal configurations result in extremely compact cores. In the 2D lattice, this is not always true. The algorithm proposed here can in principle also be extended to the 3D lattice, but this has not been done yet, and therefore, a comparison with Backofen and Will (2006) is not feasible.
2. Methods
2.1. Preliminary definitions
Let Σ be a finite alphabet of symbols and Σ* the Kleene star of Σ, that is, the set of all possible finite sequences of symbols belonging to Σ. An amino acid sequence of length n is a sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_1 , \ldots , a_n \in (\{A , \ldots , Z \} \setminus \{B , J , O , U , X , Z \}) ^*$$
\end{document}. The HP encoding of an amino acid sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$a_1 , \ldots , a_n$$
\end{document} is the sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\hat{a}_1 , \ldots , \hat{a}_n \in \{H , P \} ^*$$
\end{document}, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\hat{a}_i = \begin{cases}H \qquad {\rm if} \ a_i \in \{A , V , L , I , P , F , M , W , G , C \} \\ P \qquad{\rm otherwise}\end{cases}\end{align*}
\end{document}
for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$i = 1 , \ldots , n$$
\end{document}. We encode a self-avoiding walk on a 2D lattice of an HP sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\hat{a}_1 , \ldots , \hat{a}_n$$
\end{document} as the sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\langle d_1 , \ldots , d_{n - 1} \rangle$$
\end{document}, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$d_i \in \{L , R , U , D \} $$
\end{document} is the direction of the link that connects beads âi and âi+1. Observe that, for i ≥ 2, each di can be chosen in at most three ways, because there are at most three empty cells that are first-neighbors to bead âi-1. Moreover, the second bead, and its associated direction d1, can be fixed at any of the four first-neighbor sites that are all equivalent by symmetry. Given a configuration \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\langle d_1 , \ldots , d_{n - 1} \rangle$$
\end{document}, we define the vector (xi, yi) as the coordinate of bead âi in the lattice, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}(x_i , y_i) = \begin{cases}(0 , 0) \quad\quad\quad\quad\quad
{\rm if} \ i = 1 \\ (x_{i - 1} - 1 , y_{i - 1}) \
{\rm if} \ i > 1 \ {\rm and} \ d_{i - 1} = L \\ (x_{i - 1} + 1 ,
y_{i - 1}) \ {\rm if} \ i > 1 \ {\rm and} \ d_{i - 1} = R
\\ (x_{i - 1} , y_{i - 1} - 1) \ {\rm if} \ i > 1 \ {\rm and
} \ d_{i - 1} = U \\ (x_{i - 1} , y_{i - 1} + 1) \ {\rm if} \ i
> 1 \ {\rm and} \ d_{i - 1} = D \end{cases}\end{align*}
\end{document}
Given two distinct beads âi and âj, i ≠ j, we define the energy of the corresponding contact, if any, as ε(i, j) = C(i, j)E(âi,â j), where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}C (i , j) = \begin{cases}1 \qquad {\rm if} \ \mid i - j \mid \neq 1 \ {\rm and} \ d ((x_i , y_i) , (x_j , y_j)) = 1 \\ 0 \qquad {\rm otherwise}\end{cases}\end{align*}
\end{document}
indicates whether the beads are topologically first-neighbors (d(·) is the euclidean distance) and are not adjacent in the original sequence and
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}E (\hat{a}_1 , \hat{a}_2) = \begin{cases} - 1 \ \quad {\rm if} \ \hat{a}_1 = \hat{a}_2 = H \\ 0 \qquad {\rm otherwise}\end{cases}\end{align*}
\end{document}
defines the energy of a contact depending on the bead types. In the HP model, an HH contact provides a unitary contribution while all the other combinations yield zero energy. Finally, we define the energy of a configuration as
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\sum_{i < j} \epsilon (i , j)\end{align*}
\end{document}
In this model, the problem of finding the configuration(s) with minimal energy is equivalent to finding the configuration(s) with maximum number of HH contacts.
Given an HP sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$S = \hat{a}_1 , \ldots , \hat{a}_n$$
\end{document}, we partition the set of H beads into the following two subsets:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} & {\cal H}_E = \{i \mid \hat{a}_i = H \ {\rm and} \ i \ {\rm mod} \ 2 = 0 \} \\ & {\cal H}_O = \{i \mid \hat{a}_i = H \ {\rm and} \ i \ {\rm mod} \ 2 = 1 \} \end{align*}
\end{document}
We say that an H bead is of type E(O) if it belongs to the subset \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E ({\cal H}_O)$$
\end{document}. The \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E$$
\end{document} set contains all H beads of even index, and the \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_O$$
\end{document} set contains all H beads of odd index. It is easy to see that on a 2D lattice an HH contact is possible only between two beads that do not belong to the same subset, as no pair of beads in either subset can be first-neighbors. This property is known as the parity problem and holds also in cubic lattices. Given a bead âi, for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$i = 1 , \ldots , n$$
\end{document}, we say that the bead is internal if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$i\, {\notin} \{1 , n \} $$
\end{document}. Note that, in any configuration that is a valid self-avoiding walk, all internal beads have exactly two sequence-adjacent elements, while the first and the last bead have exactly one sequence-adjacent element. The following elementary lemma holds:
Lemma 1.Given an HP sequence of length n, the maximum number of nonadjacent first neighbors of a given bead, in any self-avoiding walk covering the whole sequence, is equal to 2 if the bead is internal and to 3 otherwise, that is,\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\sum_j C (i , j) \le \begin{cases}2 \qquad if \ i\ \notin\ \{1 , n \} \\ 3 \qquad otherwise\end{cases}\end{align*}
\end{document}
for all i = 1, … ,n.
Hence, the maximum number of contacts in which a bead can participate is equal to 2 if the bead is internal and to 3 otherwise. The maximum number of contacts achievable by the beads of type O and E is then
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}
\begin{split} & C_E = 2 \mid {\cal H}_E \mid + e_E \\ & C_O = 2 \mid {\cal H}_O \mid + e_1 + e_O\end{split}
\tag{1}\end{align*}
\end{document}
where the ex terms take care to consider extra contacts made by sequence extremities: eE is 1 if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\hat{a}_n \in {\cal H}_E$$
\end{document} and 0 otherwise, eO is defined analogously to eE, and e1 is 1 if â1 = H and 0 otherwise. The maximum number of contacts achievable by the HP sequence S is then bounded with the number
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}C_{\rm parity} = \min (C_E , C_O) \tag{2}\end{align*}
\end{document}
However, in the next section we present a new upper bound on the number of contacts achievable by an HP sequence.
2.2. New results on the structures of HP sequences
In this section, we present some new results on the configurations of HP sequences on the 2D lattice model. Given a configuration of an HP sequence on a 2D lattice, we define the H-core as the smallest rectangle containing all the H beads. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$S = \hat{a}_1 , \ldots , \hat{a}_n$$
\end{document} be an HP sequence with m H beads and let also
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}{\cal P}_S = \{i \mid 2 \le i < n \wedge \hat{a}_i = P \wedge \hat{a}_{i - 1} = \hat{a}_{i + 1} = H \} \end{align*}
\end{document}
be the set of P beads whose previous and next neighbors in the sequence are H beads. We call such beads P singlets. Observe that the minimum area that the H-core must span is equal to
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}A_{\min} = m + \mid {\cal P}_S \mid\end{align*}
\end{document}
because it must contain, by definition, all the H beads, as well as all the P beads in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal P}_S$$
\end{document}, since it is not possible to place them outside the core.
We now introduce the concept of a connected H-core, which in turn will lead to the definition of a new upper bound on the number of contacts achievable by an HP sequence. We say that an H-core is connected if it satisfies the property that each row and column contains at least one H bead or one P singlet. We devise an interesting relationship between the size of a connected H-core and the maximum number of contacts that can be achieved within it. Let (L1,L2) be the dimensions of an H-core. The number of cells available in such a rectangle is L1L2 and the number of segments in such a rectangle, which represents the maximum number of contacts that can be obtained by placing an H bead in each cell, is
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}(L_1 - 1) L_2 + (L_2 - 1) L_1.\end{align*}
\end{document}
This can be observed in Figure 1a, depicting an example H-core of dimensions L1 = 3 and L2 = 4, correspondingly containing 12 cells (3 * 4) and 17 segments (2 * 4 + 3 * 3) linking these cells. These segments represent the potential contacts that a given configuration, within such H-core, can realize.
(a) Number of cells and segments for an H-core of dimensions L1 = 3 and L2 = 4. (b) A configuration for sequence HHHPHHHHPHP. (c) Analysis of such configuration, in terms of segment types: 4 segments are realized into contacts, 5 are internal H-H links, and 8 segments are wasted. (d) A different configuration for the same sequence. (e) Analysis of such configuration (now only 1 contact is realized, and 11 are wasted).
Observe that in an H-core with dimensions (L1, L2), m cells will be occupied by H beads, and therefore L1L2 −m cells must be either empty or occupied with P beads. The segments adjacent to these cells cannot contribute any contact. Figure 1b shows a sequence (HHHPHHHHPHP) configured into the example H-core; and for such configuration, Figure 1c analyzes the nature of the available 17 segments. Four (in red) are realized into contacts, while the remaining 13 are not. Of the remaining 13: five (in black) are H-H links internal to the sequence, and eight (in light gray) are adjacent to non–H-cell segments. We call these “wasted.” Figure 1d and e show a different configuration for the same sequence, which realizes only one contact: for the remaining segments, five are H-H links internal to the sequence (this number is of course a property of the sequence and does not depend on the configuration) and as many as 11 segments are wasted.
In order to estimate the minimum number of segments that must be wasted (i.e., they are not realized into contacts) because of non–H-cells, we introduce the following simple result:
Lemma 2.Let\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{P}$$
\end{document}denote either a P bead or an empty cell on the lattice. Given an H-core and an associated row (column) with\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar{P}} > {\ 0}$$
\end{document}beads of type\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{P}$$
\end{document}, the number of\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$H \bar{P}$$
\end{document}and\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{P} \bar{P}$$
\end{document}pairs of adjacent beads in the row (column) is equal to\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar {P}} - 1 + n_H$$
\end{document}, where nH is the number of H beads after collapsing all the sequences of adjacent H beads into a single H bead.
Proof. Consider the row (column) obtained by collapsing all the sequences of adjacent H beads into a single H bead; this normalization removes all the HH pairs, making the analysis simpler, and does not change the number of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\bar{\rm P}} \bar{\rm P}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\hbox{H}\bar{\rm P}$$
\end{document} pairs. Let he be the number of H beads that are adjacent, in the normalized sequence, to one \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P}$$
\end{document} bead, that is, the number of H beads that are the first or last element of the row (column); nH−he is thus the number of H beads that are adjacent to two \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P}$$
\end{document} beads. Then, it is easy to verify that the number of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P} \bar{\rm P}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\rm H} \bar{\rm P}$$
\end{document} pairs is equal to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar{P}} - 1 - (n_H - h_e)$$
\end{document} and 2(nH−he) + he, respectively. Hence, the total number of pairs is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar{P}} - 1 + n_H$$
\end{document} and the claim follows.
Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar{P}} (i , *)$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$n_{\bar{P}} (* , \ j)$$
\end{document} denote the number of beads of type \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P}$$
\end{document} in the i-th row and in the j-th column, respectively; likewise for nH(i,* ) and nH(*, j). Let also \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$R_P = \{1 \le i \le L_1 \mid n_{\bar{P}} (i , *) > 0 \} $$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$C_P = \{1 \le j \le L_2 \mid n_{\bar{P}} (* ,\ j) > 0 \} $$
\end{document} be the set of row and column indexes with at least one P bead. Based on Lemma 2, we are now able to provide a lower bound on the number of wasted contacts in a connected H-core:
Theorem 1.Given a connected H-core with dimensions (L1, L2) and containing m H beads, the number of\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\rm H} \bar{\rm P}$$
\end{document}and\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P} \bar{\rm P}$$
\end{document}pairs is at least 2(L1L2 −m).
Proof. By Lemma 2, the total number of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\rm H} \bar{\rm P}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P} \bar{\rm P}$$
\end{document} pairs is equal to
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\sum_{i \in R_P} n_{\bar{P}} (i , *) + \sum_{j \in C_P} n_{\bar{P}} (* ,\ j) + \sum_{i \in R_P} (n_H (i , *) - 1) + \sum_{j \in C_P} (n_H (* ,\ j) - 1) .\end{align*}
\end{document}
It is easy to see that the first part of the expression, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sum_{i \in R_P} n_{\bar{P}} (i , *) + \sum_{j \in C_P} n_{\bar{P}} (* ,\ j)$$
\end{document}, is equal to 2(L1L2 −m), as each \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\bar{\rm P}$$
\end{document} bead occurs once in all the rows and once in all the columns. We now show that, in a connected H-core, the second part, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sum_{i \in R_P} (n_H (i , *) - 1) + \sum_{j \in C_P} (n_H (* ,\ j) - 1)$$
\end{document}, is ≥0. Consider a generic row with index i (the reasoning is analogous in the case of a column). If nH(i,* ) > 0, then its contribution to the sum is at least 0. Instead, if nH(i,* ) = 0, by definition of connected H-core, the row must contain a P singlet. Let j be the column coordinate of the P singlet. Since row i does not contain any H bead, the two H beads adjacent (in the sequence) to the P singlet must belong to column j, that is, nH(*, j) ≥ 2. Hence, the total contribution of row i and column j to the sum is equal to nH(i,* ) + nH(*, j) − 2 ≥ 0.
Therefore, by Theorem 1, in a connected H-core, the number of wasted contacts is at least 2(L1L2−m). Moreover, we have also one contact less for each H-H link in the walk because these links must lie inside the H-core. Let
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}{\cal H}_I = \{i \mid i < n \wedge \hat{a}_i = \hat{a}_{i + 1} = H \} \end{align*}
\end{document}
be the set of H beads such that the next adjacent bead in the sequence is of type H. The number of H-H links only depends on the sequence and is equal to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mid {\cal H}_I \mid$$
\end{document} (this value is 5 for the example in Fig. 1). For the sequence S, an upper bound of the number of contacts achievable in a connected H-core with dimensions (L1, L2) is then
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}(L_1 - 1) L_2 + (L_2 - 1) L_1 - 2 (L_1L_2 - m) - \mid {\cal H}_I \mid = 2m - \mid {\cal H}_I \mid - L_1 - L_2.\end{align*}
\end{document}
be the maximum area of a rectangle with semi-perimeter t. By coalescing the two variables L1 and L2 into the corresponding sum, denoted as t, we can then define the function
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\phi (t) = 2m - \mid {\cal H}_I \mid - t ,\end{align*}
\end{document}
that maps a semiperimeter onto the maximum number of contacts achievable in an H-core with area at most equal to area(t). The φ function is strictly decreasing in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\mathbb R}^ {+}$$
\end{document} and injective; hence, area(t) is an upper bound for the area of a connected configuration of S with φ(t) contacts. Observe that the minimum value for t is equal to the smallest semiperimeter of a rectangle with area equal or greater than Amin. The following lemma defines the formula to compute such a value:
Lemma 3.Given A ∈ N, the smallest semiperimeter of a rectangle with integer dimensions and area equal or greater than A is\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\lceil{2 \sqrt A} \,\rceil\end{align*}
\end{document}
Proof. Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sqrt A = n + \epsilon$$
\end{document}, for 0 ≤ ε < 1. If ε = 0, the smallest semiperimeter of a rectangle with area equal or greater than n2 is 2n, and the claim follows. Otherwise, if ε > 0, we have that the smallest semiperimeter is equal or greater than 2n + 1, because A > n2 = area(2n). Moreover, observe that area(2n + 1) =n(n + 1). Hence, we have that for n2 < A ≤ n(n + 1) the smallest semi-perimeter is 2n + 1. With a similar reasoning it can be shown that, for n(n + 1) < A ≤ (n + 1)2, the smallest semiperimeter is 2n + 2. Now, observe that for 0 < ε ≤ 0.5, we must have n2 < A ≤ n(n + 1), since A is an integer and (n + ε)2< n(n + 1) + 1. Hence, the function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\lceil{2 \sqrt A} \,\rceil$$
\end{document} maps correctly any n2 < A ≤ n(n + 1) onto the value 2n + 1. Similarly, for 0.5 < ε < 1 we have n(n + 1) < A < (n + 1),2 and the function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\lceil{2 \sqrt A} \,\rceil$$
\end{document} maps correctly any such area onto the value 2n + 2.
From Lemma 3, it follows that the maximum number of contacts achievable in a connected configuration by a sequence with minimum area Amin is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\phi (\lceil{2 \sqrt{A_{\min}}} \rceil)$$
\end{document}.
We now show that no nonconnected configuration may exist with at least \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\phi (\lceil{2 \sqrt{A_{\min}}} \rceil)$$
\end{document} contacts—this will conclude the proof on the new upper bound. Note that, by definition, a nonconnected H-core must have at least one internal row or column, containing only nonsinglet P beads or empty cells. Obviously, we do not consider any border row or column, because they would not belong to an H-core, by definition. Moreover, such a core can be always decomposed into two or more connected H-cores such that no HH contact spans two cores (as, otherwise, we could merge the two cores into a bigger connected core).
Lemma 4.An HP sequence with minimum area Amindoes not admit nonconnected configurations with at least\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\phi (\lceil{2 \sqrt{A_{\min}}} \rceil)$$
\end{document}contacts.
Proof. We prove the result by contradiction. Assume that there exists one nonconnected configuration with at least \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\phi (\lceil{2 \sqrt{A_{\min}}} \rceil)$$
\end{document} contacts. Clearly, we must have Amin ≥ 2 for such a configuration to exist. We first prove it in the case when the configuration can be decomposed into two maximal connected cores. Let P1 and P2 be the semiperimeters of such cores. By definition, the HH contacts must be fully contained in the two cores. Hence, it must hold that
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\phi (P_1) + \phi (P_2) \ge \phi (\lceil{2 \sqrt{A_{\min}}} \rceil) ,\end{align*}
\end{document}
which can be easily proven to be equivalent to the condition
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}P_1 + P_2 \le \lceil{2 \sqrt{A_{\min}}} \rceil.\end{align*}
\end{document}
Observe that the sum of the areas of the two connected cores must be at least Amin. Hence, in the best-case scenario, the two areas are of the form B; Amin−B and their semiperimeters are equal to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$2 \sqrt B$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$2 \sqrt{A_{\min} - B}$$
\end{document}, respectively. The function \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sqrt x + \sqrt{A_{\min} - x}$$
\end{document}, where 1 ≤ x < Amin, has two minima at x = 1 and x = Amin−1. Summing up, we obtain the following inequality
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}2 \sqrt{1} + 2 \sqrt{A_{\min} - 1} \le \lceil{2 \sqrt{A_{\min}}} \rceil < 2 \sqrt{A_{\min}} + 1 ,\end{align*}\end{document}
which is true only for Amin < 2, thus contradicting the hypothesis. Clearly, the proof also holds if the number of connected cores is greater than two.
Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$C_{{\rm area}} = \phi (\lceil{2 \sqrt{A_{\min}} \rceil})$$
\end{document}. By Lemmas 3 and 4 and the definition of φ it follows that Carea is a new upper bound for the number of contacts achievable by an HP sequence with minimum area Amin. Indeed, no connected configuration with more than Carea contacts may exist, as the corresponding area would be smaller than Amin, while, by Lemma 4, no nonconnected configuration may exist with at least Carea contacts. Summing up, and remembering that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$A_{\min} = m + \mid {\cal P}_S \mid$$
\end{document}, we provide the following theorem stating the new bound:
Theorem 2.The number of contacts achievable by an HP sequence with m H beads and corresponding sets\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal P}_S$$
\end{document}(P singlets) and\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_I$$
\end{document}(internal H-H links) is at most equal to\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}C_{area} = \phi (\lceil{2 \sqrt{m + \mid {\cal P}_S \mid} \rceil}) = 2m - \mid {\cal H}_I \mid - \lceil{2 \sqrt{m + \mid {\cal P}_S \mid} \rceil}\end{align*}
\end{document}
This represents a new upper bound on the number of contacts achievable by an HP sequence, in addition to the known upper bound presented in equation 3, and due to the parity issue.
Note that either of the two upper bounds may be the most stringent, depending on the sequence. Therefore, the new upper bound is the minimum of the two:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}C_{\rm {upper - bound}} - \min (C_{\rm
area} , C_{\rm parity})\end{align*}
\end{document}
In Section 3, Table 1 shows that the new bound outperforms the existing one in several cases.
Various Sequences from the Standard HP Benchmarks and Corresponding Length, Contact Bounds, and Best Energy Values
ID
Length
CNEW
CLP
Cparity
Carea
E*
Sequence
S1-1
20
10
11
11
10
9
(HP)2PH2PHP2HPH2P2HPH
S1-2
24
11
11
11
11
9
H2(P2H)7H
S1-3
25
8
8
8
8
8
P2HP2(H2P4)3H2
S1-4
36
14
16
16
14
14
P3H2P2H2P5H7P2H2P4H2P2HP2
S1-5
48
23
25
25
23
23
P2H(P2H2)2P5H10P6(H2P2)2HP2H5
S1-6
50
25
25
25
28
21
H2(PH)3PH4PH(P3H)2P4H(P3H)2PH4P(HP)3H2
S1-7
60
37
40
40
37
36
P2H3PH8P3H10PHP3H12P4H6PH2PHP
S1-8
64
42
43
43
42
39
H12(PH)2(P2H2)2P2HP2H2PPH2P2HP2(H2P2)2(HP)2H12
S1-9
85
53
-
58
53
53
H4P4H12P6(H12P3)3HP2(H2P2)2HPH
The new bound CNEW, equal to the minimum between Cparity and Carea, improves in several cases on the state of the art CLP (which is, at least for these sequences, effectively equal to Cparity). The best values among CNEW and CLP are highlighted in bold.
2.3. A new algorithm for finding optimal structures on the 2D lattice
We propose a new algorithm for the PSP problem based on the HP model on 2D lattices. Our algorithm is based on a pruned exhaustive search. Given an HP sequence, the algorithm explores the corresponding tree representation of all its possible configurations, the size of the tree being exponential in the length of the sequence in a top-down fashion. However, we add the ability to detect whether a partial configuration, corresponding to an internal node, may yield an optimal configuration. When this assertion is false, we say that the partial configuration is prunable and the algorithm can safely stop exploring the current branch of the tree, that is, all the configurations that can be derived from (that start with) the current partial configuration are discarded.
Formally, the number of possible configurations on a 2D lattice for an HP sequence of length n is at most 3n−2, because, as explained in the previous section, we can assign a fixed position to the first two beads and place each of the remaining n−2 beads in at most three different ways. Hence, the set of all possible configurations of a sequence of length n can be represented using a complete 3-ary tree with height n−2. The number of iterations of the algorithm is therefore bounded by (3n−1 − 1)/2 (the number of nodes of a 3-ary tree with height n−2). For each partial configuration of length k that is recognized as prunable, the number of iterations is reduced by a number bounded by (3n−2−k−1)/2. The detection of prunable partial configurations is achieved using a number of pruning criteria, which exploit various properties of the state associated to a given partial configuration.
An example of a search tree is depicted in Figure 2. Leaf nodes represent configurations; some of them are invalid as they are not self-avoiding walks, such as configuration URDL; others are valid but do not achieve the highest possible number of contacts, such as configuration UUUU; finally, some are valid and optimal, such as configurations URDD and URDR. Without actually reaching all leaves, the algorithm is able to compute if a partial configuration, such as for example UU, cannot possibly lead to an optimal configuration and can prune computation early.
Search tree for sequence HPPHH. Level 0 is not represented, and the value of level 1, that is, the direction of the second bead is fixed at U, because the four branches of the root node are all equivalent up to rotations. Observe, furthermore, that all L-branches from the center, backbone of the tree are equivalent (specular) to their corresponding R-branches. Therefore, all corresponding calls are skipped. The left part of the tree is therefore not actually explored and is shown here only for the sake of clarity.
In Figure 3, we depict an example sequence of length 25, Se = PHPH2(PPH)2P5(HP)3PHP, which will guide us, as a running example, through the algorithm. In our pictures H beads are depicted as squares, and they are colored blue if they belong to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E$$
\end{document} and red if they belong to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_O$$
\end{document}. P beads are not depicted, that is, they appear as a straight line that connects the sequence. Above the sequence, the index of each bead is shown, from 1 to 25. There are 4 Hs of type blue, and 5 Hs of type red. Hence, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mid {\cal H}_E \mid = 4 , \mid {\cal H}_O \mid = 5$$
\end{document}, and Cparity = CE = 8. Moreover, we have that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mid {\cal H}_I \mid = 1$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mid {\cal P}_S \mid = 3$$
\end{document}, so Carea = 10 and the stricter upper bound is given by Cparity. Consequently, in order to achieve eight contacts, all four H beads belonging to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E$$
\end{document}must achieve two contacts each. Observe that the energy of the optimal configuration(s) can be less than Cparity, that is, there might not be a configuration with such energy. In the case of the example sequence mentioned above, an optimal configuration with Cparity contacts (eight, in this case) does exist and is pictured in Figure 4.
(a) The sequence Se = PHPH2(PPH)2P5(HP)3PHP, of length 25, and the corresponding indexes of each bead; (b) graphical representation of Se. The link between sequence-adjacent beads is shown as a black segment. H beads are depicted as squares, and they are colored in blue if they belong to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E$$
\end{document}, and in red if they belong to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_O$$
\end{document}. P beads are not depicted, that is, they appear as a straight line that connects the sequence. There are four blue Hs in this sequence (at indexes 2, 4, 8, 24), and five red Hs (at indexes 5, 11, 17, 19, 21).
The single, optimal configuration for the example sequence Se. It has Cparity contacts (eight in this case: two for each of the blue beads).
The main idea that we exploit in the proposed algorithm is based on the concept of exposed site. When a bead is placed on the lattice, it exposes sites where other beads must be located in order to create contacts. Hence, the placement of a bead imposes some constraints on a configuration, that is, it reduces the degrees of freedom and, consequently, makes it possible to predict if a partial configuration may or may not lead to an optimal configuration. In particular, we define the notion of weight of an exposed site and show how to use the information about the number and weight of exposed sites to state whether a configuration cannot yield a given, sought-after number of contacts. Figure 5a shows a partial configuration of an HP sequence, where three sites are exposed by O beads and two sites are exposed by E beads.
Partial configurations of the example sequence Se: (a) Partial configuration UUURD. The H beads of types O and E are colored in red and blue, respectively. The link between sequence-adjacent beads is shown as a black segment. Empty squares are exposed sites. (b) Partial configuration UUURDD. The last placed P bead occupies a site exposed by the second bead and thus wastes a contact.
Formally, we define exposed site as an empty cell on the lattice that has at least one H bead as first-neighbor. We define the weight of an exposed site as the number of H beads that are first-neighbors of the site. The weight corresponds to the number of contacts that can be obtained by placing an H bead on the site. We say that an exposed site is of type O(E) if its first-neighbor(s) belong to the subset \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_O ({\cal H}_E)$$
\end{document}. Note that, analogously to the beads case, two exposed sites of the same type cannot be first-neighbors.
We define the state of a partial configuration as the tuple
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}(C , (x_{\min} , y_{\min}) , (x_{\max} , y_{\max}) , W_O , W_E , S_O , S_E) ,\end{align*}
\end{document}
where
• C is the number of contacts achieved by the partial configuration.
• (xmin,ymin) and (xmax,ymax) are the coordinates of the lower-left and upper-right corner, respectively, of the smallest rectangle, including all the H beads placed so far.
• WO (WE) is the number of wasted contacts, that is, the sum of the weights of the exposed sites of type O (E) that have been occupied by a P bead.
• SO (SE) is the number of exposed sites of type O (E).
We denote as hE(k) the number of beads in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$${\cal H}_E$$
\end{document} that occur in the suffix of length n−k of the sequence, that is,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}h_E (k) = \mid \{i \mid i > k \wedge i \in {\cal H}_E \} \mid ,\tag{4}\end{align*}
\end{document}
and hO(k) is defined analogously. For the example sequence, hE(15) = 1, hO(15) = 3 (there is one remaining blue bead among the last 10 (25–15) beads, there are 3 remaining red beads among the last 10).
Our algorithm takes two input parameters in addition to the HP sequence. The first parameter is the number CI ≤ min(Cparity,Carea) of expected contacts. Since the maximum number of contacts of the optimal configuration(s) is not known a priori, we parametrize it. The second parameter is the number AreaI, which is the maximum area of the H-core. The algorithm finds all the configurations of the sequence with a number of contacts equal at least to CI and with H-core area equal at most to AreaI.
To find all the optimal configurations of a given sequence we proceed as follows. Given a sequence with contact bounds Cparity and Carea, we run the algorithm for each CI value in the range 1, 2, … , min(Cparity,Carea) in decreasing order, in such a way that we can stop as soon as at least one configuration with the expected energy is found. Concerning the area parameter, given the parameter CI we have to find a bound on the maximum area of any configuration of the sequence with CI contacts. Values of AreaI below the maximum area are not safe as the algorithm could miss some or even all the configurations. Based on the results presented in Section 2.2, if CI = Carea we use Equation 3 to compute the parameter AreaI from the semiperimeter \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\lceil{\sqrt{2 A_{\min}} \rceil}$$
\end{document}, otherwise we set AreaI to ∞.
While traversing the search tree, the algorithm recursively computes the state associated with the partial configuration corresponding to the current node in the tree. It then evaluates various pruning criteria that allow one to establish whether the current partial configuration cannot yield a configuration with CI contacts.
The pruning criteria that we discuss first are based on the parameter CI. We say that a waste of w contacts occurs if an exposed site, with weight w, is occupied by a P bead. The maximum number of contacts that are exposed by the H beads of type O is CO. Since we want to find all the configurations with at least CI contacts, we can then waste at most
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}W_O^{\max} = C_O - C_I\end{align*}
\end{document}
contacts exposed by H beads of type O. The same consideration holds for the H beads of type E. Note that, since CI ≤ Cparity = min(CE, CO), both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_O^{\max}$$
\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_E^{\max}$$
\end{document} are non-negative and thus well defined. The pruning criterion is then:
Criterion 1.If WE > \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_E^{\max}$$
\end{document}or WO > \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_O^{max}$$
\end{document}, the current configuration is prunable.
Figure 5b shows a partial configuration of the example sequence Se in which one site of type E with unitary weight is wasted, that is, WE = 1. If CI = Cparity, it follows that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_E^{\max} = 0$$
\end{document}, and thus the partial configuration is prunable, as it cannot possibly lead to an optimal configuration of Cparity contacts.
The next pruning criterion relates to exposed sites. Observe that the minimum number of exposed sites of type O that must be occupied, in order to not waste more than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_O^{\max}$$
\end{document}−WO contacts, is equal to
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}S_O - (W_O^{\max} - W_O)\end{align*}
\end{document}
because, in the most conservative case, each site has unitary weight, and thus we can waste at most a number of sites equal to the number of wastable contacts; the same reasoning holds for the sites of type E. It follows that the number of H beads in the remaining suffix of the sequence (defined by the map hO(·) in Eq. 4) must be equal at least to the number of exposed sites that have to be occupied. Given a partial configuration of length k, the pruning criterion is then:
Criterion 2.If hO(k) < SE−(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_E^{\max} - W_E$$
\end{document}) or hE(k) < SO−(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$W_O^{\max} - W_O$$
\end{document}), the current configuration is prunable.
Figure 6a shows a partial configuration of the example sequence Se in which there are not enough remaining beads of type O that may occupy the exposed sites of type E. Specifically, we have hO(8) = 4, SE = 5, and WE = 0. Hence, if CI = Cparity, the configuration is prunable.
Partial configurations of the example sequence Se: (a) Partial configuration URUUUUUU. There are four beads of type O left; hence, it is not possible to occupy all the sites exposed by the beads of type E (there are five of them). This situation will result necessarily in wasting at least a contact, and therefore Cparity cannot be achieved. Hence, this configuration is prunable. (b) Partial configuration UURRDDLD showing an unreachable exposed site with weight 3.
We now focus on the number of contacts of a given configuration. In particular, we try to estimate the maximum number of remaining contacts achievable by a given partial configuration. If this number is smaller than CI−C, (C is the current number of contacts of the partial configuration), we may conclude that the current configuration cannot yield CI contacts. A pruning criterion follows:
Criterion 3.Let Cleft be the maximum number of remaining contacts that can be achieved by the current configuration. If Cleft < (CI−C), the current configuration is prunable.
We show next how to estimate Cleft. The first observation concerns the number of contacts that an H bead generates when placed on the lattice. By Lemma 1, any internal bead can make at most two contacts while the last bead can make at most three contacts. Moreover, note that the last bead can make three contacts only if it is an H bead. Obviously, the placement of the first bead does not give rise to any contact. This fact leads to a simple loose estimate of the maximum number of contacts achievable by a partial configuration of length k:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}2 (h_E (k) + h_O (k)) + e_n\end{align*}
\end{document}
where en is 1 if ân = H and 0 otherwise. However, it is possible to obtain a closer (i.e., correct but less conservative) estimate by exploiting the information on the number and weight of the existing exposed sites. As explained above, an H bead, when placed on the lattice, exposes either two sites if it is internal or three if it is the first or the last bead; to generate contacts, the exposed sites must be occupied by a bead of the opposite type. There are three main conditions under which an exposed site cannot be occupied by such a bead with a resulting waste equal to the corresponding weight:
1. the site is occupied by a P bead;
2. there are not as many beads of opposite type as the available sites;
3. the site is unreachable.
Figure 6a and 6b shows examples of Conditions 2 and 3. Let 1 ≤ k < n be the length of a given partial configuration. We focus on the beads of type O, but the reasoning is analogous for the other case. A crucial observation is that the beads cannot make more than \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$2 \mid {\cal H}_O \mid + e_1 + e_O$$
\end{document} contacts (see Eq. 1), which implies that we can estimate the number of potential residual contacts by considering the H beads and the exposed sites of type O only. We can write the maximum number of remaining contacts as C1+C2, where C1 and C2 are the contacts achievable by the remaining and placed beads, respectively. As for the beads that have not been placed yet, we assume that the most favorable case applies, that is, C1 = 2hO(k)+eO. Let sO(i) be the number of exposed sites of type O with weight i, for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$i = 1 , \ldots , 4$$
\end{document}; sE is defined analogously. Concerning the placed beads, the respective number of potential contacts is 2(\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\mid {\cal H}_O \mid - h_O (k)$$
\end{document})+e1. However, the following inequality holds:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}C_2 \le \sum_{i = 1}^{4}i \cdot s_O (i) \le 2 (\mid {\cal H}_O \mid - h_O (k)) + e_1 - C ,\end{align*}
\end{document}
that is, the weighted sum of the exposed sites is the maximum number of contacts still achievable by the placed beads, and it cannot exceed the number of potential contacts minus the existing contacts. Hence, we can use \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$C_2 = \sum_{i = 1}^{4}i \cdot s_O (i)$$
\end{document}. Observe that the number of contacts wasted because of Condition 1, that is, WO, is equal to the third term minus the second term. When such two quantities are equal, no site is wasted; instead, if the difference is positive, one or more sites are wasted, because when a site is occupied by a P bead, the number of exposed sites decreases while C does not increase. To take into account Conditions 2 and 3, we need a tighter bound for the term \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sum \nolimits _{i = 1}^{4}i \cdot s_O (i)$$
\end{document}. Observe that the sites with weight 4 cannot be reached and only one site with weight 3 can be occupied (by the last bead, provided that it is an H bead of type E). Moreover, note that we cannot occupy more than hE(k) sites, as we would not have enough beads available. In order to account for the most conservative scenario, we assume that the sites with highest weights are occupied. The number of sites that can be effectively occupied is then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$M_O = M_O^3 + M_O^2 + M_O^1 \le S_O$$
\end{document}, where
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*} & M_O^3 = \min (s_O (3) , e_E) \\ & M_O^2 = \min (s_O (2) , h_E (k) - M_O^3) \\ & M_O^1 = \min (s_O (1) , h_E (k) - M_O^3 - M_O^2) \end{align*}
\end{document}
Hence, if MO < SO, we waste \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\sum_{i = 0}^4 i (s_O (i) - M_O^i)$$
\end{document} contacts because of Conditions 2 and/or 3. Summing up, the resulting formula is:
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}2h_O (k) + e_O + 3M_O^3 + 2M_O^2 + M_O^1\end{align*}
\end{document}
The next pruning criterion concerns the H-core, that is, the smallest rectangle containing all the H beads placed in the current configuration:
Criterion 4.If the area of the H-core encoded by the coordinates (xmin, ymin),(xmax, ymax) is bigger than AreaI, the current configuration is prunable.
We recall that AreaI is the parameter of the algorithm that specifies the maximum area of the H-core. Observe that the parameter AreaI is constrained by the inequality AreaI ≥ Amin. Moreover, note that the maximum number of nonsinglet P beads in the H-core is bounded by AreaI−Amin; otherwise, the H-core could not contain all the H beads. This observation leads to another pruning criterion:
Criterion 5.If the number of nonsinglet P beads in the current H-core is greater than AreaI−Amin, the current configuration is prunable.
When CI = Carea, we can also devise another pruning criterion, based on Theorem 1. By Theorem 1 and the definition of φ, it follows that in any connected H-core with dimensions (L1, L2), m H beads and φ(L1 + L2) contacts the waste must be exactly 2(L1L2−m), that is,
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
\begin{align*}\sum_{i \in R_P} (n_H (i , *) - 1) + \sum_{j \in C_P} (n_H (* ,\ j) - 1) = 0 ,\end{align*}
\end{document}
where we recall that nH is the number of H beads in a row (column) after collapsing all the sequences of adjacent H beads into a single H bead. Moreover, it is a simple observation that in a configuration with semiperimeter \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}
$$\lceil{2 \sqrt{A_{\min}} \rceil}$$
\end{document}, which is the smallest semiperimeter possible to contain the whole sequence, we have nH ≥ 1 for any row and column, that is, each row and column contains at least one H bead. This observation leads to our final pruning criterion:
Criterion 6.If CI = Carea and there is one row or column with nH > 1, the current configuration is prunable.
The criterion in practice says that in a configuration with Carea contacts, there cannot be in any row or in any column a segment of Ps delimited by Hs at each end. Figure 7 exemplifies this concept: a benchmark sequence is pictured, S1 - 9, with one of its optimal configurations. The picture then shows three partial configurations that are pruned because of Criterion 6. In fact, in all three cases we have a segment of Ps delimited by Hs: in the first and second case this happens along a column, while in the third case it happens along a row.
Example of how early in the search Criterion 6 is able to prune. The picture shows an optimal configuration for sequence S1-9 (reversed), and three partial configurations that are pruned because of Criterion 6. The first partial configuration is pruned as early as the third bead, and the second and third configuration as early as the fifth bead.
One important thing to notice is how early in the search this criterion is able to prune: already in the third or fifth bead, in a sequence of length 64. Notice that, instead, the criteria based on exposed sites would not be effective, in this case, this early in the search. Criterion 6, in combination with Criterion 5, is indeed the main reason minwalk performs so effectively in those benchmark sequences where the optimal number of contacts is equal to the upper bound Carea.
3. Experimental Results
In this section, we present some experimental results on the new devised bound and on our algorithm, named minwalk.
The algorithm was implemented in the C programming language, and running times have been measured using the clock function. The tests were performed on a 2.53 GHz Intel Xeon E5630 machine with Ubuntu 10.04 and GNU C Compiler 4.4.
We tested minwalk on various benchmarks. The first experiment consisted of running minwalk on the first nine sequences of the standard HP benchmarks (Hart and Istrail, 1996).1 The sequences are shown in Table 1, together with the associated contact bounds Cparity, Carea, and CLP, which corresponds to the value presented in Ahn and Park (2010). As can be seen from the reported values, our new bound Carea is better than the best-known one in six cases, and in three cases it is also equal to the real best energy value. Moreover, observe that in these benchmarks, the contact bounds reported by Ahn and Park (2010) are always equal to Cparity.
Table 2 shows the experimental results for these benchmarks. In particular, we reported for each sequence the running time, the best energy value found, and the corresponding number of optimal configurations (the actual configurations for all optimal solutions are also provided, but in a separate file2). This table also includes the running times, taken from the benchmark in Thalheim et al. (2008), of the best nonexact algorithms, namely PERM (Hsu et al., 2003), F&F (Rego et al., 2011), HPPFP-3 (Shmygelska and Hoos, 2005), and PFold-P (Thalheim et al., 2008). Note that these algorithms find a single configuration, as opposed to our algorithm, which finds all the optimal configurations of a given sequence. To provide a fair comparison, we also show in this table the running time for minwalk to compute the first optimal configuration found. The experimental results show that our algorithm is fast and competitive if compared with the best nonexact methods. The only exception is sequence S1-6, which requires a large amount of time to be solved with our method. One possible explanation for this fact is the large difference, equal to 4, between the optimal energy for this sequence and its bound Carea. Therefore, in this case, we perform five runs of minwalk (with CI from 25 to 21) on an input of length 50 with the H-core pruning criteria disabled. Concerning the other sequences, our algorithm finds the first optimal configuration in the order of seconds for all of them, and it is always as fast or faster than all other algorithms in the state of the art. In normal mode, our algorithm finds all optimal configurations in the order of seconds for all benchmark sequences, with the exception of S1-7 and S1-9, which require more time. For example, for sequence S1-5, all 492 optimal configurations (which we know to be optimal by the results in Table 1) are found in a fraction of a second.
Experimental Results of the Algorithm on Various Biological Sequences from the Standard HP Benchmarks of Lengths Up to 85
Sequence
minwalk
minwalk (single)
PERM
F&F
HPPFP-3
PFold-P
Energy
Degeneracy
S1-1
0.01 s
0.01 s
<1 s
0 s
<1 s
0.06 s
9
2
S1-2
0.03 s
0.01 s
<1 s
2 s
<1 s
0.4 s
9
19
S1-3
0.01 s
0.01 s
2 s
0.5 s
<1 s
0.2 s
8
16
S1-4
0.01 s
0.01 s
<1 s
4 s
4 s
1.1 s
14
192
S1-5
0.5 s
0.01 s
2 s
10 s
1 m
13.3 s
23
492
S1-6
85 h
38 h
3 s
22 s
15 s
15.4 s
21
897
S1-7
3.5 m
0.4 s
4 s
56 s
20 m
4 m
36
242
S1-8
0.3 s
0.01 s
78 h
24 s
1.5 h
35 m
42
39
S1-9
14 m
30 s
60 s
1.3 m
24 h
4.5 h
53
7710
The results obtained for the long sequences S1-7 and S1-9 are also good. Our running times are in the order of minutes and are better than the ones obtained by the ACO algorithms HPPFP-3 and PFold-P, while the PERM and F&F algorithms are faster. However, we recall that minwalk in this mode finds all the optimal configurations of the sequence, which in the case of S1-9 are a large number (7710). Sequence S1-8 represents an interesting case. The optimal energy value of −42 is found, with its 39 degenerate versions, in less than 1 second. Only the F&F algorithm is able to find a solution in the order seconds; the running time of the other nonexact algorithms on this sequence is in the order of hours and even days in the worst cases.
Figure 8 depicts sample optimal configurations for S1-6, S1-7, S1-8, and S1-9. Note that minwalk is able to guarantee the optimality and the degeneracy of the solutions found.
Two optimal configurations for each of the sequences S1-6, S1-7, S1-8, and S1-9.
Table 3 shows the the experimental results for various biological and random sequences of length ∼30 taken from the Supplementary Material of Shmygelska and Hoos (2005). Our algorithm performed well also in this experiment; the running time is in the order of seconds on average, and it is less than 2 minutes in the worst case.
Experimental Results of the Algorithm on Various Biological and Random Sequences from Shmygelska and Hoos (2005)
Sequence
Time (sec)
Energy
Degeneracy
B30-1
105.08
12
104
B30-2
1.22
10
860083
B30-3
0.01
14
6
B30-4
4.88
10
265
B30-5
3.55
13
243
B30-6
3.59
13
7
B30-7
61.23
16
84
B30-8
2.53
8
494
B30-9
0.39
18
401
B30-10
31.48
16
307
R30-1
0.31
9
1595
R30-2
0.12
13
137
R30-3
0.13
10
1964
R30-4
5.72
9
2606
R30-5
1.04
13
10
R30-6
0.11
11
6
R30-7
1.69
8
15914
R30-8
0.13
12
50
R30-9
10.85
9
7167
R30-10
55.41
12
74
All these sequences have a length of around 30.
We also tested minwalk on random sequences of length 45; Table 4 again shows the running time, the best energy value found, and the corresponding number of optimal configurations. The sequences are reported in Table 5. There is no algorithm in the state of the art that can return optimal configurations on the 2D lattice for input sequences of this size. minwalk does so in a matter of hundreds of seconds.
Experimental Results of the Algorithm on Random Sequences of Length 45
Sequence
Time (sec)
Energy
Degeneracy
R45-1
95.1
24
8
R45-2
68.68
17
76
R45-3
113.02
16
161
R45-4
28.52
18
4
R45-5
2.45
22
2
R45-6
563.78
17
259675
R45-7
767.35
25
190
R45-8
54.84
22
120
R45-9
158.94
21
77
R45-10
248.39
22
4
Random HP Sequences of Length 45
ID
Sequence
R45-1
HPHPHHHHHHPHPPHHPPHHPHHHHPPHHHHPPHPHHPPPPPHHH
R45-2
PHHPPPPHPPHHHPHPHHPPHHHHHPPHPPPHPPHPHPPPPHPPH
R45-3
HHPHHPHPHPPPHHPPPHHPHHPHPHPHPPPPPPPPPHPPPPHHH
R45-4
PPHHHPPHPPPPPPHPHPHHHHHHHHPHPHPHPPPPPPPHPPPHH
R45-5
HPHHHHHHHHHHPHPHPPPPHHHPPHPHPPHPHPPPPHPPHPHHP
R45-6
PPPHHHPPHPHHPPHHHHPHHHPPHPHHPHHPHHHHPPPPPPPPP
R45-7
HHPPPHHHPHPPHHHHHHHHPHHHHPHPHPHPPHPPHHPHPHHHH
R45-8
HHPPHPPHHHPHHHHPPHHHHHHHHPPPPPPPPHPHPPHHPHPPH
R45-9
HHPPHHPHHHPHHHPHHHHPPHPHPHPPHHPPPPHPHHHPHPPPH
R45-10
HHPPPPHHPPHHHHHPHPHHPHPHHPPHPPPPHHHHPHHHHPPHP
Last, we noted that the optimal configurations of a consistent number of standard benchmark sequences have an H-core area equal, or very close to, Amin(recall that Amin is the number of Hs + P-singlets in the sequence, and achieving Amin basically corresponds to packing all Hs and P singlets in a compact H-core).
4. Conclusions
We have presented new results on HP sequences and configurations on the 2D lattice and a novel algorithm to solve the protein structure prediction problem in the HP model on the 2D lattice. In particular, we discovered a new relationship between the energy and the area of a specific type of native configurations and, on top of this result, a new upper bound on the energy achievable by an HP sequence. The presented algorithm is fast and competitive even if compared with the speed of nonexact methods and guarantees the optimality of the solutions found. It is also the first exact algorithm able to solve random instances of length up to 45 and specific benchmark instances of length up to 85. In the future, we aim to extend our algorithm to the 3D lattice case and compare it in performance with Backofen and Will (2006).
Footnotes
Disclosure Statement
No competing financial interests exist.
1
2
References
1.
AhnN., ParkS.2010. Finding an upper bound for the number of contacts in hydrophobic-hydrophilic protein structure prediction model. J. of Comp. Biol., 17:647–656.
2.
BackofenR., WillS.2006. A constraint-based approach to fast and exact structure prediction in three-dimensional protein models. Journal of Constraints, 11:5–30.
3.
BergerB., LeightonF.T.1998. Protein folding in the hydrophobic–hydrophilic (HP) model is NP-complete. J. of Compu. Biol., 5:27–40.
4.
ChenT., VernazobresD., YomoT.et al.2010. Evolvability and single-genotype fluctuation in phenotypic properties: A simple heteropolymer model. Biophysical Journal, 98:2487–2496.
5.
CrescenziP., GoldmanD., PapadimitriouC.H.et al.1998. On the complexity of protein folding. J. of Comp. Biol., 5:423–466.
6.
HartW.E., IstrailS. 1996. Fast protein folding in the hydrophobichydrophilic model within three-eighths of optimal. J. of Comp. Biol., 3:53–96.
7.
HolzgräfeC., IrbäckA., TroeinC.2011. Mutation-induced fold switching among lattice proteins. J. of Chem. Phys., 135:195101.
8.
HsuH.-P., MehraV., NadlerW., GrassbergerP.2003. Growth algorithms for lattice heteropolymers at low temperatures. J, of Chem. Phys, 118:444–451.
9.
IrbäckA., TroeinC.2002. Enumerating designing sequences in the HP model. J. of Biol. Phys., 28:1–15.
10.
IstrailS, LamF.2009. Combinatorial algorithms for protein folding in lattice models: A survey of mathematical results. Communications in Information and Systems, 9:303–346.
11.
LauK.F., DillK.A.1989. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 22:3986–3997.
12.
LeshN, MitzenmacherM., WhitesidesS.2003. A complete and effective move set for simplified protein folding. Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology. RECOMB ’, 03:188–195New York.
13.
LiangF., WongW.H.2001. Evolutionary monte carlo for protein folding simulations. J. of Chem. Phys., 115:3374–3380.
14.
MannM., MaticzkaD., SaundersR., BackofenR.2008. Classifying protein-like sequences in arbitrary lattice protein models using LatPack. HFSP Journal, 2:396–404.
15.
RegoC., LiH., GloverF.2011. A filter-and-fan approach to the 2D HP model of the protein folding problem. Annals of Operations Research, 188:389–414.
16.
ShahrezaeiV., EjtehadiM.R.2000. Geometry selects highly designable structures. J. of Chem. Phys., 113:6437–6442.
17.
ShmygelskaA., HoosH.2005. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics, 6:30.
18.
ThalheimT., MerkleD., MiddendorfM.2008. Protein folding in the HP-model solved with a hybrid population based aco algorithm. International Journal of Computer Science, 35:291–300.
19.
UngerR., MoultJ.1993. A genetic algorithm for three dimensional protein folding simulations. Proceedings of the 5th International Conference on Genetic Algorithms, 581–588Morgan Kaufmann, Burlington, MA.