A New Heuristic Algorithm for Protein Folding in the HP Model

Abstract

This article presents an efficient heuristic for protein folding. The protein folding problem is to predict the compact three-dimensional structure of a protein based on its amino acid sequence. The focus is on an original integer programming model derived from a platform used for the Contact Map Overlap problem.

1. Introduction

A protein is a complex biological macromolecule that consists of a sequence of amino acids. Proteins have a key role in living organisms. They can act either as structural components (hair, skin, and muscle), or as active agents (enzymes, or transport of oxygen to tissues). In standard terms, proteins always fold to the same unique native structure. This form is determined by the amino acid sequence. This was proven by Christian Anfinsen, who won the Nobel Prize in chemistry in 1972 (Berger and Leighton, 1998; Ahn and Park, 2010). For this reason, it is considered that the functional properties of the protein are dependent on its tertiary structure.

In this study, we consider the hydrophobic–hydrophilic (HP) model on a two-dimensional (2D) square lattice. The HP model was introduced by Dill (Dill, 1985; Dill et al., 1995; Yoon, 2006). The 20 types of amino acids that make up proteins are classified as hydrophobic (H) or polar (P) according to their degree of hydrophobicity, so the HP model simplifies the protein-folding problem by considering only two types of amino acids: H and P. Lattice protein models place each amino acid on a point of a lattice (for example, a 3-D cubic lattice), and, in order to maintain the connectivity of the protein, amino acids that are adjacent in the protein's sequence must also occupy adjacent lattice points.

Proteins fold in three dimensions. However, scientists often use a 2-D model instead of the 3-D model to test their algorithms (i.e., they use the square lattice instead of the cubic lattice), and attempt to extend the 2-D model into 3-D. In this report, we will focus on one specific type of protein-folding model that can be described in the following manner:

Maximize the number of H–H contacts subject to:

1. Assignment: Each amino acid must occupy one lattice point,

2. Non-overlapping: No two amino acids may share the same lattice point,

3. Connectivity: Every two amino acids that are consecutive in the protein's sequence must also occupy adjacent lattice points.

The objective of this model is to maximize the number of H–H contacts, that is, the number of pairs of hydrophobic amino acids that are adjacent in the lattice but not consecutive in the sequence (Chandru et al., 2004).
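The three constraints and the objective can be checked directly for any candidate fold. The following is a minimal sketch, assuming a fold is given as a list of integer (x, y) lattice coordinates, one per amino acid, and the sequence as a string over {H, P}; the function names are illustrative and do not come from the original implementation.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def is_valid_fold(coords: List[Point]) -> bool:
    """Check the non-overlapping and connectivity constraints on the square lattice."""
    # Non-overlapping: no lattice point is used twice.
    if len(set(coords)) != len(coords):
        return False
    # Connectivity: consecutive amino acids sit on adjacent lattice points.
    return all(abs(x1 - x2) + abs(y1 - y2) == 1
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def count_hh_contacts(seq: str, coords: List[Point]) -> int:
    """Count H-H pairs that are lattice neighbors but not sequence neighbors."""
    index_at = {p: i for i, p in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        # Look only right and up so that every lattice edge is visited once.
        for q in ((x + 1, y), (x, y + 1)):
            j = index_at.get(q)
            if j is not None and seq[j] == "H" and abs(i - j) >= 2:
                contacts += 1
    return contacts
```

For example, the four-letter sequence HPPH folded around a unit square, [(0, 0), (1, 0), (1, 1), (0, 1)], is a valid fold with one H–H contact (between the first and last amino acids).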

This problem (even for a square lattice) is proven NP-complete (Alberts et al., 1998; Jiang and Zhu, 2005; Ahn and Park, 2010). Therefore, finding a good heuristic is a challenge partially overcome by the algorithm described below (Duan and Kollman, 2001; Michalewicz and Fogel, 2004).

2. Mathematical Model

The mathematical model is based on an upright square lattice of fixed size (Carr et al., 2003; Istrail et al., 2000). Such a lattice is conveniently represented as an m × m checkerboard in which the neighborhood of each white (black) square consists of the four surrounding black (white) squares. Let G_c = (V_c, E_c) be a graph with V_c = {1, 2, …, m²}, where node i corresponds to the i-th square (under an arbitrary numbering of the squares) and the edge (i, j) ∈ E_c if i and j are neighbors. For convenience, the four edges incident with a given vertex i are labeled u, d, l, r for the up, down, left, and right neighboring squares. (The border squares, and hence the corresponding nodes, have degree less than 4.) The simple paths in G_c (paths in which each node is visited at most once) are called self-avoiding paths.

Let S be a sequence of n letters over the alphabet {0, 1} (0 for P and 1 for H). Let G_S = (V_S, E_S) be the graph associated with S, with node set V_S = {1, 2, …, n} and (i, j) ∈ E_S if and only if |i − j| ≥ 2 and S[i] = S[j] = 1. Let G be the complete bipartite graph on the node set V_S ∪ V_c, with V_S and V_c as its two sides. A matching (in this case, a one-to-one mapping of V_S into V_c) M = {e₁, e₂, …, e_n} with |M| = n is feasible if the covered nodes in V_c define a self-avoiding path (Fig. 1). For e_i = (i, k) and e_j = (j, l), define the function z(e_i, e_j) = z_ikjl = 1 if (i, j) ∈ E_S and (k, l) ∈ E_c, and z_ikjl = 0 otherwise. Finally, define

$$v(M) = \sum_{e_i, e_j \in M} z(e_i, e_j),$$

and let $\widetilde{M}$ be the set of feasible matchings.

FIG. 1.

One-to-one mapping of V_S to V_c.

Then the problem of finding the optimal folding over the upright square lattice is:

Square folding problem (SFP). For $M \in \widetilde{M}$, find $v = \max v(M)$.

Converting the HP problem into an optimization problem on graphs allows various integer programming models to be built. Most of them introduce binary variables, say x_ik, that model the feasible matchings above as the 0-1 solutions of simple linear constraints:

(assignment) $$\sum_{i=1}^{m^2} x_{ik} = 1, \qquad k = 1, \ldots, n. \tag{1}$$

(non-overlapping) $$\sum_{k=1}^{n} x_{ik} \le 1, \qquad i = 1, \ldots, m^2. \tag{2}$$

The objective function can be expressed by linearizing z_ijkl = x_ik x_jl and/or by partitioning the sum of the z terms into sub-sums in different ways. We do not present complete integer programming models here, since our goals do not involve solving such models, but for the sake of completeness we add the following constraints to finish modeling the self-avoiding paths:

$$x_{ik} \le \sum_{j \in n(i)} x_{j,k+1}, \qquad i = 1, \ldots, m^2; \quad k = 1, \ldots, n-1, \tag{3}$$

where n(i) ≔ {i − 1, i + 1, i − m, i + m} is the set of neighbors of the i-th square under the row-wise numbering of the checkerboard squares.
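The neighbor set and constraints (1)–(3) can be illustrated with a short sketch. Taken literally, the formula for n(i) also yields squares outside the board, or across a row boundary, when i lies on the border, so the code below filters those out. The encoding of a 0-1 solution as a list square_of, where square_of[k] is the square hosting amino acid k + 1, is an assumption made for this illustration only.

```python
from typing import List, Sequence


def neighbors(i: int, m: int) -> List[int]:
    """Neighbors n(i) of square i (1-based, row-wise numbering) on an m x m board."""
    result = []
    if (i - 1) % m != 0:      # not in the first column: a left neighbor exists
        result.append(i - 1)
    if i % m != 0:            # not in the last column: a right neighbor exists
        result.append(i + 1)
    if i - m >= 1:            # not in the first row: a neighbor one row back exists
        result.append(i - m)
    if i + m <= m * m:        # not in the last row: a neighbor one row ahead exists
        result.append(i + m)
    return result


def satisfies_constraints(square_of: Sequence[int], m: int) -> bool:
    """Check (1)-(3) for a placement given as one square per amino acid."""
    # (1) assignment: every amino acid occupies exactly one square of the board.
    if any(not 1 <= i <= m * m for i in square_of):
        return False
    # (2) non-overlapping: no square hosts two amino acids.
    if len(set(square_of)) != len(square_of):
        return False
    # (3) connectivity: amino acid k + 1 lies on a neighbor of amino acid k.
    return all(b in neighbors(a, m) for a, b in zip(square_of, square_of[1:]))
```

On a 3 × 3 board, square 3 ends the first row, so neighbors(3, 3) returns [2, 6] and the placement [3, 4] is rejected even though 4 = 3 + 1.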

Remark: In the next section, the notation SFP(i, j), resp. v_ij, is used for the restriction of SFP to a prefix S_j of S with x_.k fixed for k ≤ i < j ≤ n. Also, to simplify the notation, a self-avoiding path of length n will be given as a sequence of moves (a₂, a₃, …, a_{n−1}), a_i ∈ {s, r, l}, with s for straight, r for right turn, and l for left turn (with respect to the previous move), provided that the first two letters of the sequence are fixed arbitrarily on two neighboring nodes of the grid.
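The move encoding from the Remark can be decoded into lattice coordinates as follows. This is a sketch under stated assumptions: the first two letters are fixed on (0, 0) and (0, 1), coordinates are read as (x, y) with y pointing up, and a right turn is taken clockwise; the function names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def moves_to_coords(moves: str) -> List[Point]:
    """Decode a string over {s, r, l} into lattice points, starting (0, 0) -> (0, 1)."""
    coords = [(0, 0), (0, 1)]
    dx, dy = 0, 1                      # current heading
    for a in moves:
        if a == "r":                   # right turn: rotate the heading clockwise
            dx, dy = dy, -dx
        elif a == "l":                 # left turn: rotate the heading counterclockwise
            dx, dy = -dy, dx
        elif a != "s":                 # 's' keeps the heading unchanged
            raise ValueError(f"unknown move {a!r}")
        x, y = coords[-1]
        coords.append((x + dx, y + dy))
    return coords


def is_self_avoiding(moves: str) -> bool:
    """A move string is feasible if no lattice point is visited twice."""
    coords = moves_to_coords(moves)
    return len(set(coords)) == len(coords)
```

With these conventions, the string rr walks around a unit square, while rrr returns to the starting point and is therefore infeasible.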

An easily derived bound on v is obtained from the following observation. Let ODD be the set of odd i such that S[i] = 1 and EVEN be the set of even i such that S[i] = 1. Since, w.l.o.g., even elements of S are assigned to black squares of the checkerboard and odd elements to white ones, z_ikjl can equal 1 only for even-odd pairs i, j. Moreover, two of the four lattice neighbors of every interior element are taken by its sequence neighbors, so it can take part in at most two contacts. Hence C_2D(S) = 2·min{|ODD|, |EVEN|} is a sharp upper bound on v. (If S starts or ends with 1, this bound should be increased by 1 or 2.)
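The bound is cheap to compute. The sketch below takes the sequence as a string over {H, P} with 1-based positions, as in the text. The optional endpoint correction reflects one reading of the parenthetical remark (add 1 for each end of the sequence that is an H); that reading, like the function name, is an assumption.

```python
def parity_upper_bound(seq: str, correct_ends: bool = False) -> int:
    """Return C_2D(S) = 2 * min(|ODD|, |EVEN|) for an HP string."""
    odd = sum(1 for i, c in enumerate(seq, start=1) if c == "H" and i % 2 == 1)
    even = sum(1 for i, c in enumerate(seq, start=1) if c == "H" and i % 2 == 0)
    bound = 2 * min(odd, even)
    if correct_ends and seq:
        # Assumed reading of the remark: each H at an end of the chain has
        # one more free lattice neighbor than an interior element.
        bound += (seq[0] == "H") + (seq[-1] == "H")
    return bound
```

For HPPH, the H's sit at positions 1 (odd) and 4 (even), so the uncorrected bound is 2.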

Returning to the HP folding problem and its conversion into the problem of finding a matching that maximizes the number of overlapping edges, one finds considerable similarity with another problem known as Contact Map Overlap (CMO). Problems in this class have been solved successfully by applying Lagrangian relaxation techniques to the corresponding integer programming models (Malod-Dognin et al., 2008; Yanev et al., 2008). Unfortunately, applying this technique to the SFP is hampered by the poor pruning properties of the bounds obtained from the LP relaxation and from the Lagrangian dual, both of which coincide with the trivially computable C_2D(S). This bound cannot be improved even deep in the branch-and-bound tree, and thus the direct application of integer programming solvers remains questionable. The heuristic given below is a reasonable candidate to overcome these difficulties, and, owing to the new view of HP problems as a kind of contact map overlap, it could be applied to a broader class of optimization problems.

3. Heuristic Algorithm

Let I = {2, 3, …, n} = $\bigcup_{i=1}^{k} S_i$ with S_i ∩ S_{i+1} = ∅, that is, the index set is split into k disjoint segments, and let $S_i^*$ be a string of length len(S_i) over {s, r, l}. Given a square lattice SL = {(i, j) : −n ≤ i, j ≤ n} with a[0] placed on SL[0, 0] and a[1] on SL[0, 1], a string is feasible if its embedding on SL is a self-avoiding path. Let $T_{i-1} = \bigcup_{l=1}^{i-1} S_l^*$ and $S_{ij} = T_{i-1} \oplus st[j]$ (⊕ denotes concatenation), where st[j] is a string of length j. Finally, if j = len(S_i), define cont(S_ij) to be the number of contacts in the corresponding self-avoiding path.

3.1. Algorithm GREEDYLIKE

for i in {1, 2, …, k} do
    rec := −1                        // best contact count found so far for segment i
    PATHFINDER(i, 0)                 // the saved string becomes T_i
end for

def PATHFINDER(i, m)                 // enumerate all move strings of length len(S_i)
    if m = len(S_i) then             // positions st[0], …, st[len(S_i) − 1] are filled
        if cont(S_im) > rec then     /* cont() counts the contacts of the obtained conformation */
            rec := cont(S_im)
            save S_im
        end if
        return
    end if
    for a in {s, r, l} do
        st[m] := a
        if check(S_im) then          /* check() tests whether the obtained conformation is a self-avoiding path */
            PATHFINDER(i, m + 1)
        end if
    end for
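For illustration, the scheme above can be written in Python roughly as follows. This is a sketch, not the original implementation, and it makes several assumptions: the sequence is a string over {H, P}; the first two letters are fixed on (0, 0) and (0, 1); all segments have the same length segment_size except possibly the last; the record starts at −1 so that a segment creating no new contact is still saved; ties are broken in favor of the first string found; and a segment with no self-avoiding extension raises an error. Contacts are accumulated incrementally rather than recounted for each complete string.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def turn(heading: Point, a: str) -> Point:
    """New heading after move a: 's' straight, 'r' right turn, 'l' left turn."""
    dx, dy = heading
    return {"s": (dx, dy), "r": (dy, -dx), "l": (-dy, dx)}[a]


def gained_contacts(seq: str, occupied: Dict[Point, int], p: Point, k: int) -> int:
    """H-H contacts created when letter k of seq is placed on lattice point p."""
    if seq[k] != "H":
        return 0
    x, y = p
    gain = 0
    for q in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        j = occupied.get(q)
        if j is not None and seq[j] == "H" and abs(k - j) >= 2:
            gain += 1
    return gain


def pathfinder(seq, occupied, pos, heading, k, remaining, score, chosen, best):
    """Enumerate all feasible move strings of one segment; keep the best in `best`."""
    if remaining == 0:
        if score > best["score"]:
            best["score"], best["moves"] = score, "".join(chosen)
        return
    for a in "srl":
        h = turn(heading, a)
        p = (pos[0] + h[0], pos[1] + h[1])
        if p in occupied:              # check(): the path must stay self-avoiding
            continue
        gain = gained_contacts(seq, occupied, p, k)
        occupied[p] = k
        chosen.append(a)
        pathfinder(seq, occupied, p, h, k + 1, remaining - 1, score + gain, chosen, best)
        chosen.pop()                   # undo the move before trying the next one
        del occupied[p]


def greedylike(seq: str, segment_size: int) -> Tuple[str, List[Point], int]:
    """Fold seq segment by segment; returns (moves, coordinates, contacts)."""
    if len(seq) < 2 or segment_size < 1:
        raise ValueError("need at least two letters and a positive segment size")
    path = [(0, 0), (0, 1)]
    occupied = {(0, 0): 0, (0, 1): 1}
    heading = (0, 1)
    moves, total = "", 0
    while len(path) < len(seq):
        seg_len = min(segment_size, len(seq) - len(path))
        best = {"score": -1, "moves": None}
        pathfinder(seq, occupied, path[-1], heading, len(path), seg_len, 0, [], best)
        if best["moves"] is None:
            raise RuntimeError("no self-avoiding extension exists for this segment")
        for a in best["moves"]:        # fix the best segment before moving on
            heading = turn(heading, a)
            p = (path[-1][0] + heading[0], path[-1][1] + heading[1])
            occupied[p] = len(path)
            path.append(p)
        moves += best["moves"]
        total += best["score"]
    return moves, path, total
```

A call such as greedylike("HHHH", 2) returns the moves rr with one contact, whereas a segment size of 1 yields the straight fold ss with no contacts, which shows how the segment size limits what the greedy choice can see.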

4. Computational Experiments

The computational runs reported below were obtained with a Python implementation of the algorithm on a laptop with an Intel Core i5 430M (2.26 GHz, 3 MB L3 cache) and 4 GB RAM. As a first demonstration of its capabilities, we borrow some figures from Istrail and Lam (2009) for a small (36 amino acids) human protein: PPPHHPPHHPPPPPHHHHHHHPPHHPPPPHHPPHPP. Figure 2 shows the fold obtained by our algorithm. This fold is close to the optimal one shown in Figure 3c and much better than the folds obtained by the approximation algorithms, shown in Figure 3a and b.

FIG. 2.

Protein folding with length 36 amino acids. The number of obtained contacts is 13.

FIG. 3.

Protein folding with length 36 amino acids constructed by: (a) Hart-Istrail algorithm (Istrail and Lam, 2009), (b) Newman's algorithm (Istrail and Lam, 2009), and (c) Optimal protein folding.

More extensive results for proteins of different lengths are listed in Table 1. The comparative results for the first eight benchmarks are shown in Table 2 (Toma and Toma, 1996; Chen and Huang, 2005; Ahn and Park, 2010). The column "Segment size" in Table 1 gives the value of the only parameter of the algorithm (see len(S_i) in the previous section). Since the function PATHFINDER is a kind of total enumeration, increasing the segment size above 12 can lead to prohibitively long running times.
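The growth behind this limit is easy to quantify. PATHFINDER tries three moves per position, so for a segment of length L it examines at most 3^L move strings; the self-avoidance check prunes some of them, so this is an upper bound rather than an exact count:

```latex
\[
3^{10} = 59\,049, \qquad 3^{12} = 531\,441, \qquad 3^{14} = 4\,782\,969 .
\]
```

Each additional position in the segment therefore triples the worst-case work per segment.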

Table 1.

Computational Results Obtained for 11 HP Sequences

Length	Sequence	Contacts	Segment size	Time (sec)
20	HPHP2H2PHP2HPH2P2HPH	9	5	0.48
24	H2P2HP2HP2HP2HP2HP2HP2H2	9	10	11
25	P2HP2H2P4H2P4H2P4H2	8	4	42
36	P3H2P2H2P5H7P2H2P4H2P2HP2	13	10	51
48	P2HP2H2P2H2P5H10P6H2P2H2P2HP2H5	23	12	143
50	H2PHPHPHPH4PHP3HP3HP4HP3HP3HPH4PHPHPHPH2	20	7	43
60	P2H3PH8P3H10PHP3H12P4H6PH2PHP	35	7	4
64	H12PHPHP2H2P2H2P2HP2H2P2H2P2HP2H2P2H2P2HPHPH12	35	7	4
102	PH2P5H2P2H2PHP2HP7HP3H2PH2P6HP2HPHP2HP5H3P4H2PH2P5H2P4H4PHP8H5P2HP2	28	7	7
123	P2H3PHP4HP5H2P4H2P2H2P4HP4HP2HP2H2P3H2PHPH3P4H3P6H2P2HP2HPHP2HP7HP2H3P4HP3H5P4H2PHPHPHPH	35	10	100
136	HP5HP4HPH2PH2P4HPH3P4HPHPH4P11HP2HP3HPH2P3H2P2HP2HPHPHP8HP3H6P3H2P2H3P3H2PH5P9HP4HPHP4	37	10	68
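The sequences in Table 1 are written in run-length form, where a letter followed by a number stands for that many repetitions (P3H2 means PPPHH). A small helper for expanding this notation might look as follows; the function name is illustrative.

```python
import re


def expand_hp(compact: str) -> str:
    """Expand run-length HP notation, e.g. 'P3H2' -> 'PPPHH'."""
    if not re.fullmatch(r"(?:[HP]\d*)+", compact):
        raise ValueError(f"not a run-length HP string: {compact!r}")
    # Each match is (letter, count); a missing count means a single letter.
    return "".join(letter * int(count or 1)
                   for letter, count in re.findall(r"([HP])(\d*)", compact))
```

Applied to the length-36 entry of Table 1, it reproduces the 36-letter sequence quoted at the beginning of this section.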

Table 2.

Computational Results Obtained for 8 HP Sequences ^a

	Contacts
Length	Optimal	Greedylike	MC1	MC2	GA	MS
20	9	9	9	9	9	9
24	9	9	9	9	9	9
25	8	8	8	7	8	8
36	14	13	14	12	14	14
48	23	23	23	18	22	22
50	21	20	21	19	21	21
60	36	35	36	31	34	34
64	42	35	42	31	37	38

^a MC1, Monte Carlo algorithm (Thachuk et al., 2007); MC2, Monte Carlo algorithm (Chen and Huang, 2005); MS, mixed search algorithm (Chen and Huang, 2005); GA, genetic algorithm (Toma and Toma, 1996; Chen and Huang, 2005).

As Table 2 shows, the GREEDYLIKE algorithm finds the optimal solution for four of the eight sequences (lengths 20, 24, 25, and 48) and comes within one contact of the optimum for three more (lengths 36, 50, and 60); only for length 64 is the gap larger (35 versus 42 contacts). The output for much longer proteins is not given here, since there is nothing with which to compare it.

The resulting folds for the sequences of length 24, 36, and 60 amino acids are shown in Figure 4.

FIG. 4.

Folding of proteins with lengths: (a) 24 amino acids—9 contacts, (b) 36 amino acids—13 contacts, and (c) 60 amino acids—35 contacts.

5. Conclusion

Computational experiments show that the idea of decomposing the problem into tractable subproblems works well for arbitrarily long protein sequences. The idea could be extended by enlarging the subproblems and replacing the PATHFINDER function with stronger solvers. The challenge ahead is to run the algorithm with a larger segment size by replacing PATHFINDER with an exact algorithm based on an integer programming approach. In fact, by using advanced integer programming models, we were able to solve to optimality all instances from Table 1, but there are still difficulties in getting past the length-100 barrier. We could also improve the quality of the folds obtained by the proposed method by adapting it to other lattice models (including the 3D lattice) or by incorporating other techniques for the analysis of protein structure.

We note that the algorithm is straightforward to implement and can therefore be easily applied in practice.

Acknowledgments

This work is partially supported by the project of the Bulgarian National Science Fund, entitled: “Bioinformatics research: Protein folding, docking and prediction of biological activity,” code NSF I02/16, 12.12.14.

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Ahn and Park. 2010. Finding an upper bound for the number of contacts in hydrophobic-hydrophilic protein structure prediction model. J. Comput. Biol., 17, 647–656.

Alberts, Bray, Johnson, et al. 1998. Essential Cell Biology: An Introduction to the Molecular Biology of the Cell. Garland Science Publishing, New York.

Berger and Leighton. 1998. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J. Comput. Biol., 5, 27–40.

Carr, Hart, and Newman. 2003. Discrete optimization models for protein folding. Technical Report SAND2002. Sandia National Laboratories.

Chen and Huang. 2005. A branch and bound algorithm for the protein folding problem in the HP lattice model. Genomics Proteomics Bioinformatics, 3, 225–230.

Chandru, Rao, and Swaminathan. 2004. Protein folding on lattices: An integer programming approach. IIM Bangalore Research Paper No. 199.

Dill, K.A. 1985. Theory for the folding and stability of globular proteins. Biochemistry, 24, 1501–1509.

Dill, K.A., Bromberg, Yue, et al. 1995. Principles of protein folding. A perspective from simple exact models. Protein Sci., 4, 561–602.

Duan and Kollman. 2001. Computational protein folding: From lattice to all-atom. IBM Syst. J., 40, 297–309.

Istrail, Hurd, Lippert, Walenz, et al. 2000. Prediction of self-assembly of energetic tiles and dominoes: Experiments, mathematics, and software. Technical Report SAND2002. Sandia National Laboratories.

Istrail and Lam. 2009. Combinatorial algorithms for protein folding in lattice models: A survey of mathematical results. Commun. Inf. Syst., 9, 303–346.

Jiang and Zhu. 2005. Protein folding in the hexagonal lattice in the HP model. J. Bioinform. Comput. Biol., 3, 19–34.

Malod-Dognin, Andonov, and Yanev. 2008. Maximum cliques in protein structure comparison. 106–117. In: Experimental Algorithms, Lect. Notes Comput. Sc. 6049. Springer-Verlag, Berlin.

Michalewicz and Fogel. 2004. How to Solve It: Modern Heuristics. Springer-Verlag, Berlin.

Thachuk, Shmygelska, and Hoos, H.H. 2007. A replica exchange Monte Carlo algorithm for protein folding in the HP model. BMC Bioinformatics, 8, 342–362.

Toma and Toma. 1996. Contact interactions method: A new algorithm for protein folding simulations. Protein Sci., 5, 147–153.

Yanev, Andonov, Veber, Ph., et al. 2008. Lagrangian approaches for a class of matching problems in computational biology. Comput. Math. Appl., 55, 1054–1067.

Yoon. 2006. Optimization approaches to protein folding [PhD Thesis]. School of Industrial and Systems Engineering, Georgia Institute of Technology.