A Particle Swarm Optimization-Based Approach with Local Search for Predicting Protein Folding

Abstract

The hydrophobic-polar (HP) model is commonly used for predicting protein folding structures and hydrophobic interactions. This study developed a particle swarm optimization (PSO)-based algorithm combined with local search algorithms; specifically, the high exploration PSO (HEPSO) algorithm (which can execute global search processes) was combined with three local search algorithms (hill-climbing algorithm, greedy algorithm, and Tabu table), yielding the proposed HE-L-PSO algorithm. By using 20 known protein structures, we evaluated the performance of the HE-L-PSO algorithm in predicting protein folding in the HP model. The proposed HE-L-PSO algorithm exhibited favorable performance in predicting both short and long amino acid sequences with high reproducibility and stability, compared with seven reported algorithms. The HE-L-PSO algorithm yielded optimal solutions for all predicted protein folding structures. All HE-L-PSO-predicted protein folding structures possessed a hydrophobic core that is similar to normal protein folding.

1. Introduction

Amino acid sequences constitute the primary structure of proteins. Stable folding of amino acids results in a tertiary structure of proteins that determines protein function (Mount, 2004). How to effectively and accurately predict protein folding is a crucial issue in studying the molecular biology of proteins. In contrast to comparative modeling and fold recognition, ab initio methods may enable directly predicting the folding process on the basis of an amino acid sequence without knowing structural information.

Regarding ab initio methods, the hydrophobic-polar (HP) protein folding model (Dill, 1985) was proposed for predicting protein folding. In general, the HP model entails classifying all amino acids into hydrophobic (H) and polar (P) types and simulating the folding of amino acid sequences on lattice models, such as triangular lattices (Bechini, 2013). However, the number of possible folds to explore is extremely large, and deriving optimal solutions is a nondeterministic polynomial-time hard (NP-hard) problem that remains a challenge (Crescenzi et al., 1998; Qian et al., 2007).

According to Anfinsen's thermodynamic hypothesis, an amino acid sequence may fold into a tertiary structure with a lower free energy level and may thus more closely resemble real protein folding (Anfinsen, 1973). This concept is also used for evaluation in HP model-based protein folding predictions. When two hydrophobic amino acids are adjacent to each other, a hydrophobic–hydrophobic (H–H) interaction may be generated. When the number of H–H interactions is increased, the predicted structure becomes more stable and highly similar to natural protein folding. Concerning two-dimensional (2D) lattice models, both square and triangular HP models have been proposed. A triangular lattice model was reported to have looser topological constraints, a higher number of simulated hydrophobic interactions, and higher efficiency in predicting protein folding, compared with a square lattice model (Bechini, 2013). Accordingly, the current study applied a triangular lattice model.

Because protein folding involves integration of the movement of each connecting amino acid, its prediction is a complex and time-consuming process even when the computation strategy is properly applied. Accordingly, using an algorithm with a convergence property to predict protein folding may prove helpful. Recently, a high exploration particle swarm optimization (HEPSO) algorithm was reported to perform effectively for convergence speed, global optimality, and solution accuracy (Mahmoodabadi et al., 2014). Therefore, we hypothesize that the HEPSO algorithm may improve the prediction of protein folding. However, the HEPSO algorithm may only focus on optimization for predicting protein folding because of its global search property. Recently, an NP-hard problem (Thilagavathi and Amudha, 2014) was reported to have been solved using local search methods such as the hill-climbing algorithm, greedy algorithm (DeRonne and Karypis, 2006), and Tabu table (Lin et al., 2014). Therefore, the combination of the HEPSO algorithm with local search algorithms for predicting protein folding warrants further investigation.

To improve the performance of protein folding predictions, we propose an HE-L-PSO algorithm that combines three local search algorithms with the HEPSO algorithm for providing higher quality solutions. In this study, we evaluated a simulation of protein folding on a triangular lattice by using three local search algorithms (hill-climbing algorithm, greedy algorithm, and Tabu table) as well as 20 amino acid sequences with known structures as examples.

2. Methods

We propose an HE-L-PSO algorithm that combines the HEPSO algorithm with the hill-climbing algorithm, the greedy algorithm, and the Tabu table. The HEPSO algorithm executes a global search in the search space, whereas the hill-climbing algorithm exchanges part of the contents of two particles that encode valid structures in an attempt to determine a more effective structure. The greedy algorithm enhances the prediction of protein structures by folding amino acids in all directions, evaluating the resulting fitness values, and retaining the optimal result. However, the greedy algorithm easily falls into a local optimum; thus, the Tabu table stores the current optimal solution every n iterations, and the contents of the particles are subsequently reset to prevent the same protein structure from being repeatedly predicted.

2.1. Particle swarm optimization

The particle swarm optimization (PSO) algorithm (Kennedy and Eberhart, 1995) is based on the foraging behavior of fish and birds. When the algorithm is executing a search process, each bird is regarded as a particle, and the birds seek food. For every period (iteration), all the birds share their experiences with other individuals. Each bird uses the shared information to improve the direction and speed of its search for food to increase the probability of finding food or to find higher quality food.

2.2. High exploration PSO

Mahmoodabadi et al. (2014) proposed the HEPSO algorithm, which implements two methods for updating particles. The flowchart of the HEPSO algorithm is shown in the HEPSO part (without local search) of Figure 1. P_B and P_C affect the probability of using the artificial bee colony algorithm and the genetic algorithm (GA), respectively. The usage rate of the artificial bee colony algorithm is gradually reduced as the iterations proceed, whereas the probability that particles are updated by the GA and the original PSO algorithm increases.

FIG. 1.

Pseudocode of the HE-L-PSO algorithm. GA, genetic algorithm; HEPSO, high exploration particle swarm optimization; PSO, particle swarm optimization; TBP, HE-L-PSO.

2.2.1. Encoding

In the HP model prediction, each amino acid is treated as a dimension, and the content data represent the direction in which the next amino acid is folded. If the data value for a specific amino acid is 0, the next amino acid is folded to the upper left; if the value is 1, the next amino acid is to the upper right; if the value is 2, the next amino acid is to the right; if the value is 3, the next amino acid is to the lower right; if the value is 4, the next amino acid is to the lower left; and if the value is 5, the next amino acid is to the left. For example, if the data value for the third amino acid is 0, the fourth amino acid is to the upper left of the third amino acid.
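The encoding above can be illustrated with a minimal Python sketch that converts a list of direction codes into lattice coordinates. The axial coordinate convention (first axis pointing right, second axis pointing to the upper right), the function names, and the use of one code per amino acid that has a successor are assumptions made for this illustration; they are not taken from the original implementation.

```python
# Axial offsets on a triangular lattice for the six direction codes.
# Assumed convention: q increases to the right, r increases to the upper right.
DIRECTION_OFFSETS = {
    0: (-1, 1),   # upper left
    1: (0, 1),    # upper right
    2: (1, 0),    # right
    3: (1, -1),   # lower right
    4: (0, -1),   # lower left
    5: (-1, 0),   # left
}


def decode_directions(directions, start=(0, 0)):
    """Return lattice coordinates for a chain folded by the given codes.

    directions[k] places the amino acid that follows amino acid k
    (0-based), so a chain of n amino acids uses n - 1 codes here.
    """
    coords = [start]
    for code in directions:
        dq, dr = DIRECTION_OFFSETS[code]
        q, r = coords[-1]
        coords.append((q + dq, r + dr))
    return coords


def is_self_avoiding(coords):
    """A fold is valid only if no two amino acids share a lattice site."""
    return len(set(coords)) == len(coords)


if __name__ == "__main__":
    fold = decode_directions([2, 0])
    print(fold, is_self_avoiding(fold))
```

In this sketch, the codes `[2, 0]` place the second amino acid to the right of the first and the third to the upper left of the second.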

2.2.2. Fitness calculation

In the HP model, the number of H–H interactions is regarded as an indicator of the stability of a predicted protein structure; a structure with more interactions is expected to be closer to the natural folded state. Generally, fitness is represented by a negative value. An H–H interaction is counted for each pair of hydrophobic amino acids that occupy neighboring positions on the lattice. For example, the fitness values are −1, −4, and −2 for Figure 2A–C, respectively.
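The fitness calculation can be sketched in Python as follows. This is a minimal illustration rather than the original implementation: it assumes axial coordinates on the triangular lattice, and it assumes the standard HP-model convention that a contact is counted only between hydrophobic amino acids that are lattice neighbors but not consecutive in the chain. The function name `hp_fitness` is hypothetical.

```python
# The six neighbor offsets of a site on a triangular lattice (axial coordinates).
NEIGHBOR_OFFSETS = [(-1, 1), (0, 1), (1, 0), (1, -1), (0, -1), (-1, 0)]


def hp_fitness(sequence, coords):
    """Return minus the number of H-H contacts of a fold.

    sequence is a string of 'H' and 'P'; coords lists one lattice site
    per amino acid. Assumption: a contact requires two H amino acids on
    neighboring sites that are not consecutive in the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("fold is not self-avoiding")

    site_to_index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (q, r) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dq, dr in NEIGHBOR_OFFSETS:
            j = site_to_index.get((q + dq, r + dr))
            # j > i + 1 counts each pair once and skips chain neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


if __name__ == "__main__":
    # Three amino acids arranged in a triangle: the first and third touch.
    print(hp_fitness("HHH", [(0, 0), (1, 0), (0, 1)]))
```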

FIG. 2.

Example of protein folding prediction using the greedy algorithm. The seventh point is used as the example of the mutation point, and all possible folding directions are displayed, namely (A) left down, (B) left, and (C) right down. Green represents polar amino acids and blue represents hydrophobic amino acids. Dashed and solid lines indicate hydrophobic interactions and a common link between amino acids listed in sequence, respectively. The outcome of the fitness calculation is provided in parentheses. Among these directions, (B) displays strong fitness (hydrophobic interaction) and is selected for the optimal folding result.

2.2.3. Updating gbest and pbest

A local search algorithm is used after updating each particle in the HEPSO algorithm. The optimal solution for each particle (pbest) is updated before the end of the iteration. Specifically, the best solution that the particle has found so far is set to pbest. When the updating process is completed, the global best solution (gbest) is selected from all pbest solutions. Therefore, gbest and pbest guide the search directions for each particle in the next iteration of the HEPSO algorithm.
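The pbest and gbest bookkeeping described above can be sketched as follows. The function name and the list-based data layout are assumptions for illustration; the only convention taken from the text is that fitness is negative, so a lower value is better.

```python
def update_bests(positions, fitnesses, pbest_positions, pbest_fitnesses):
    """Update personal bests in place and return (gbest_position, gbest_fitness).

    Lower (more negative) fitness is treated as better.
    """
    for i, (position, fit) in enumerate(zip(positions, fitnesses)):
        if fit < pbest_fitnesses[i]:
            pbest_fitnesses[i] = fit
            pbest_positions[i] = list(position)

    # gbest is selected from all pbest solutions.
    best = min(range(len(pbest_fitnesses)), key=pbest_fitnesses.__getitem__)
    return list(pbest_positions[best]), pbest_fitnesses[best]


if __name__ == "__main__":
    pbest_pos = [[5, 5], [4, 4]]
    pbest_fit = [-2, -2]
    gbest = update_bests([[0, 1], [2, 3]], [-3, -1], pbest_pos, pbest_fit)
    print(gbest, pbest_pos, pbest_fit)
```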

2.2.4. Updating particles

In the HEPSO algorithm, the optimal swarm experience (gbest) is used along with the previous optimal experience (pbest) to influence the next search direction (velocity), according to the following equations:

\begin{align*} v_{id}^{new} = w \times v_{id}^{old} + c_1 \times r_1 \times \left( pbest_{id} - x_{id}^{old} \right) + c_2 \times r_2 \times \left( gbest_d - x_{id}^{old} \right) \tag{1} \end{align*}

\begin{align*} x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \tag{2} \end{align*}

where v is the velocity, x is the content of the particle, w is the weighting value, c₁ and c₂ are learning factors, and r₁ and r₂ are random numbers between 0 and 1; moreover, i represents the index of the particle in the population, d represents the updated dimension, old represents the current iteration, new represents the newly calculated results (next iteration), pbest represents the optimal result of the particle in previous iterations, and gbest represents the optimal result for all particles. Regarding calculation of the new velocity in Equation (1), the original velocity is first multiplied by the weighting value; subsequently, the contents of the original particle are subtracted from pbest and from gbest, and each difference is multiplied by the corresponding random number and learning factor (r and c), after which the two products are added to the product of the original velocity and the weighting value. In Equation (2), the contents of the original particle are updated by adding the new velocity, thus resulting in a new particle. The parameters are calculated according to Equations (3)–(7).

\begin{align*} c_1 = C_{1i} - \left( C_{1i} - C_{1f} \right) \times \left( \frac{t}{\mathrm{max\ iteration}} \right). \tag{3} \end{align*}

\begin{align*} c_2 = C_{2i} - \left( C_{2i} - C_{2f} \right) \times \left( \frac{t}{\mathrm{max\ iteration}} \right). \tag{4} \end{align*}

The values of c₁ and c₂ can be calculated using Equations (3) and (4), where the subscripts i and f denote the initial and final values of the learning factor, respectively, and t is the current iteration number. As the number of iterations increases, c₁ decreases and c₂ increases. Therefore, the early iterations focus on a global search, whereas the later iterations focus on local search.

\begin{align*} w \left( f \right) = \frac{1}{1 + 1.5 e^{-2.6f}} \in \left[ 0.4, 0.9 \right]. \tag{5} \end{align*}

\begin{align*} f = \frac{d_g - d_{min}}{d_{max} - d_{min}} \in \left[ 0, 1 \right]. \tag{6} \end{align*}

\begin{align*} d_i = \frac{1}{N - 1} \sum_{j = 1, j \ne i}^{N} \sqrt{\sum_{k = 1}^{D} \left( x_i^k - x_j^k \right)^2} \tag{7} \end{align*}

The value of w in Equation (1) can be obtained using Equations (5)–(7). Equation (7) can be used to calculate the value (d) for each particle. When the values for all the particles are calculated, Equation (6) can be used to obtain f, which is subsequently applied to Equation (5) to derive the weighting value (w); the weighting values inherently range between 0.4 and 0.9. In Equation (7), Euclidean distance is applied to calculate the average distance between the current particle and the other particles, where k is the index of the dimension, D is the number of dimensions of each particle, i is the current particle, j is another particle, and N is the total number of particles. Among all calculated d_i values, the value for the particle with the optimal fitness is considered to be d_g. Moreover, d_max and d_min are the maximum and minimum values of all calculated d_i, respectively.

\begin{align*} x_{id}^{new} = x_{id}^{old} + \left( 2r - 1 \right) \left( x_{id}^{old} - x_{jd}^{old} \right). \tag{8} \end{align*}

Equation (8) presents the update rule of the artificial bee colony algorithm, where old represents the current iteration, new represents the next iteration, i represents the current particle, j represents another particle, d represents the dimension to be calculated, x represents the particle itself, and r represents a random number between 0 and 1.

\begin{align*} v_{id}^{new} = r \left( \frac{c_2}{2} \right) \times gbest_d - pbest_{id} - x_{id}^{old}. \tag{9} \end{align*}

Equation (9) is based on the GA concept, where old denotes the current iteration; new denotes the next iteration; pbest denotes the optimal result of the particle in previous iterations; gbest denotes the optimal result for all particles; c₂ denotes a learning factor, which is calculated according to Equation (4); r denotes a random number between 0 and 1; v denotes the new velocity, which is used in Equation (2) to update the contents of the particle; and i and d denote the current particle and the dimension, respectively.
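The parameter schedules and update rules in Equations (1)–(8) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original implementation: the function names are hypothetical, positions are kept as continuous values (how they are mapped to the integer direction codes is not specified here), velocities and positions are clamped to the limits given in the parameter settings, and a swarm whose d_i values are all equal is assigned f = 0.

```python
import math
import random


def learning_factors(t, max_iter, c1i=2.5, c1f=0.5, c2i=0.5, c2f=2.5):
    """Equations (3) and (4): linear schedules for c1 and c2."""
    frac = t / max_iter
    return c1i - (c1i - c1f) * frac, c2i - (c2i - c2f) * frac


def mean_distances(swarm):
    """Equation (7): mean Euclidean distance from each particle to the others."""
    n = len(swarm)
    return [
        sum(math.dist(swarm[i], swarm[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def inertia_weight(swarm, best_index):
    """Equations (5) and (6); best_index is the particle with the best fitness."""
    d = mean_distances(swarm)
    d_min, d_max = min(d), max(d)
    # Assumption: if all distances are equal, use f = 0 to avoid dividing by zero.
    f = 0.0 if d_max == d_min else (d[best_index] - d_min) / (d_max - d_min)
    return 1.0 / (1.0 + 1.5 * math.exp(-2.6 * f))


def clamp(value, low, high):
    return max(low, min(high, value))


def pso_step(x, v, pbest, gbest, w, c1, c2, rng,
             v_lim=(-5.0, 5.0), x_lim=(0.0, 5.0)):
    """Equations (1) and (2) for one particle, clamped to the given limits."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vel = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        vel = clamp(vel, *v_lim)
        new_v.append(vel)
        new_x.append(clamp(x[d] + vel, *x_lim))
    return new_x, new_v


def abc_step(x_i, x_j, rng, x_lim=(0.0, 5.0)):
    """Equation (8): move particle i relative to another particle j."""
    return [
        clamp(a + (2 * rng.random() - 1) * (a - b), *x_lim)
        for a, b in zip(x_i, x_j)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    swarm = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
    c1, c2 = learning_factors(t=100, max_iter=200)
    w = inertia_weight(swarm, best_index=1)
    print(c1, c2, w)
    print(pso_step(swarm[0], [0.0, 0.0], swarm[0], swarm[1], w, c1, c2, rng))
```

A seeded `random.Random` instance is passed in explicitly so that runs of the sketch are repeatable.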

2.2.5. Local search

(1) Hill climbing. The hill-climbing algorithm (information exchange) is applied to derive two new predictions of protein structures. If an updated prediction is superior to the original, the original information is replaced. The pseudocode of the hill-climbing algorithm is presented in Figure 1 (center). This algorithm (Lim et al., 2006) entails using current information to continue searching for superior results. In this study, we designed the hill-climbing step around information exchange. As shown in Figure 3, a particle's optimal solution from previous iterations (pbest) is randomly selected for information exchange with a randomly selected current particle. Information exchange is performed through a two-point crossover process. Once two random dimensions are selected, all information between these two selected dimensions is exchanged. For example, when dimensions 3 and 6 are selected, all information between dimensions 3 and 6 is exchanged to obtain two new results.

(2) Greedy algorithm. If the hill-climbing algorithm produces two structures that are inferior to the original, the greedy algorithm can be used to strengthen the predictions. The pseudocode of the greedy algorithm is shown in Figure 1 (center). The greedy algorithm (DeVore and Temlyakov, 1996) selects the most favorable option available at each step, as in the minimum coin change problem. In this study, a selected amino acid was folded in all directions, and only the optimal predicted structure was retained. As illustrated in Figure 2, the number presented in each circle is the position of the amino acid in the sequence. The dotted lines represent H–H interactions; more dotted lines represent higher stability of the predicted protein structure, which is closer to the natural folded state of proteins. In this example, we assumed that the seventh amino acid is mutated; therefore, we simulated the folding of this dimension in various directions such as left down (Fig. 2A), left (Fig. 2B), and right down (Fig. 2C). Among the predicted structures, the optimal structure with the highest number of H–H interactions is used to replace the original information. The fitness calculation is described in Section 2.2.2.

(3) Tabu table. The hill-climbing and greedy algorithms are executed several times, after which pbest and gbest are updated. After a set number of iterations, the optimal structure is stored in the Tabu table, and the contents of all particles are reset. Subsequent prediction processes are designed such that prediction of the same or similar structures is avoided. The pseudocode of the HE-L-PSO algorithm is displayed in Figure 1 (lower part).
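The greedy step in item (2) can be sketched in Python as follows: one dimension is set to each of the six direction codes in turn, and the best valid fold is kept. This is a minimal sketch under assumptions rather than the original implementation. It assumes axial coordinates on the triangular lattice, one direction code per amino acid that has a successor, rejection of folds that overlap themselves, and contacts counted only between hydrophobic amino acids that are not consecutive in the chain. All function names are hypothetical.

```python
# Axial offsets for direction codes 0..5: upper left, upper right, right,
# lower right, lower left, left (coordinate convention is an assumption).
OFFSETS = [(-1, 1), (0, 1), (1, 0), (1, -1), (0, -1), (-1, 0)]


def fold(directions):
    """Convert direction codes into lattice coordinates, starting at (0, 0)."""
    coords = [(0, 0)]
    for code in directions:
        dq, dr = OFFSETS[code]
        coords.append((coords[-1][0] + dq, coords[-1][1] + dr))
    return coords


def fitness(sequence, directions):
    """Minus the H-H contact count, or None if the fold overlaps itself."""
    coords = fold(directions)
    if len(set(coords)) != len(coords):
        return None
    index = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (q, r) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dq, dr in OFFSETS:
            j = index.get((q + dq, r + dr))
            # j > i + 1 counts each pair once and skips chain neighbors.
            if j is not None and j > i + 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


def greedy_move(sequence, directions, position):
    """Try all six codes at one position and keep the best valid fold."""
    best_dirs, best_fit = list(directions), fitness(sequence, directions)
    for code in range(6):
        trial = list(directions)
        trial[position] = code
        fit = fitness(sequence, trial)
        if fit is not None and (best_fit is None or fit < best_fit):
            best_dirs, best_fit = trial, fit
    return best_dirs, best_fit


if __name__ == "__main__":
    # A straight three-residue chain has no contact; bending it creates one.
    print(greedy_move("HHH", [2, 2], position=1))
```

Ties are resolved in favor of the first code tried, which is an arbitrary choice in this sketch.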

FIG. 3.

Example of information exchange in the hill-climbing algorithm. The information between dimensions 3 and 6 is exchanged for pbest_rand and particle_rand, thus yielding solution_new1 and solution_new2, respectively.
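The information exchange shown in Figure 3 can be sketched in Python as follows. This is an illustrative sketch, not the original implementation; in particular, treating the two selected dimensions as 1-based and inclusive at both ends is an assumption, and the function names are hypothetical.

```python
import random


def exchange_segment(a, b, start, end):
    """Swap dimensions start..end (1-based, inclusive) between two particles."""
    lo, hi = start - 1, end
    new_a = a[:lo] + b[lo:hi] + a[hi:]
    new_b = b[:lo] + a[lo:hi] + b[hi:]
    return new_a, new_b


def random_exchange(pbest_rand, particle_rand, rng):
    """Pick two distinct dimensions at random and exchange the segment between them."""
    start, end = sorted(rng.sample(range(1, len(pbest_rand) + 1), 2))
    return exchange_segment(pbest_rand, particle_rand, start, end)


if __name__ == "__main__":
    pbest_rand = [0, 0, 0, 0, 0, 0, 0, 0]
    particle_rand = [5, 5, 5, 5, 5, 5, 5, 5]
    solution_new1, solution_new2 = exchange_segment(pbest_rand, particle_rand, 3, 6)
    print(solution_new1)
    print(solution_new2)
```

Each of the two new solutions would then be evaluated, and a new solution replaces the original only if its fitness is superior.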

The Tabu table was proposed by Glover (1989). The Tabu table records short-term results to prevent duplicate search processes in algorithms. This feature can, to some degree, prevent the search for the optimal solution from becoming trapped in a local optimum. Moreover, the feature can effectively compensate for the shortcomings of the greedy algorithm in this study because we can record optimal results after every few iterations and then reset the contents of all particles, thus preventing the previously searched direction from being searched again.
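A minimal Python sketch of this bookkeeping is given below. It is an assumption-laden illustration rather than the original implementation: the class and function names are hypothetical, the table rejects only exact repeats of a stored structure (the handling of merely similar structures is not reproduced), and particles are reset to uniformly random direction codes.

```python
import random


class TabuTable:
    """Stores the best structures found so far and flags exact repeats."""

    def __init__(self, reset_every=20):
        self.reset_every = reset_every
        self._stored = set()

    def should_reset(self, iteration):
        """True on iterations 20, 40, ... when iterations are counted from 1."""
        return iteration > 0 and iteration % self.reset_every == 0

    def store(self, directions):
        self._stored.add(tuple(directions))

    def is_tabu(self, directions):
        return tuple(directions) in self._stored


def reset_swarm(n_particles, n_dims, rng):
    """Reinitialize every particle with random direction codes in 0..5."""
    return [[rng.randrange(6) for _ in range(n_dims)] for _ in range(n_particles)]


if __name__ == "__main__":
    rng = random.Random(1)
    table = TabuTable(reset_every=20)
    gbest = [2, 0, 4, 3]
    for iteration in range(1, 41):
        if table.should_reset(iteration):
            table.store(gbest)
            swarm = reset_swarm(n_particles=3, n_dims=4, rng=rng)
            print(iteration, table.is_tabu(gbest), swarm)
```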

2.3. Parameter settings

The parameter settings of the HE-L-PSO algorithm are described as follows. In this algorithm, P_A represents the probability of particles [Eq. (8)], which is updated by the GA; the value of P_A was set to 0.05 in this study. In addition, P_B represents the probability of particles [Eq. (9)], which is updated by the artificial bee colony algorithm; in this study, the value of P_B was set to 0.95. The numbers of particles and iterations for the PSO algorithm were set to 100 and 200, respectively. For the learning factors c₁ and c₂, whose initial and final values are denoted by the subscripts i and f, respectively, C_1i, C_1f, C_2i, and C_2f were set to 2.5, 0.5, 0.5, and 2.5, respectively. The velocity limits V_min and V_max were set to −5 and 5, respectively, because the triangular lattice model used for the prediction involved only six directions. In addition, the minimum and maximum search ranges were set to 0 and 5, respectively. The number of local search processes by the hill-climbing and greedy algorithms was set to 64. The Tabu table was set to save gbest and reset the contents of all particles every 20 iterations.
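As a worked check of Equations (3) and (4) under these settings (C_1i = 2.5, C_1f = 0.5, C_2i = 0.5, C_2f = 2.5, and 200 iterations), the learning factors at the midpoint of a run, t = 100, evaluate to:

\begin{align*} c_1 &= 2.5 - \left( 2.5 - 0.5 \right) \times \frac{100}{200} = 1.5, \\ c_2 &= 0.5 - \left( 0.5 - 2.5 \right) \times \frac{100}{200} = 1.5. \end{align*}

Thus, c₁ falls from 2.5 to 0.5 and c₂ rises from 0.5 to 2.5 over the run, and the two factors are equal at the midpoint.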

3. Results

3.1. Data set

As presented in Table 1, 20 sequences were used to predict the protein structure. The accuracy of prediction of the 2D triangular lattice model was evaluated and compared with a known optimal structure (Krasnogor and Smith, 2000). Regarding the contents of sequences, H and P indicate hydrophobic and polar amino acids, respectively. The prediction process assigns a folding direction to each amino acid, and the next amino acid is folded in that direction. The first amino acid does not affect the folding structure in any direction, and the last amino acid does not have a next amino acid. Therefore, in the encoding process, the number of dimensions for each particle is equal to the length of the amino acid sequence minus 2.

Table 1.

Example of Amino Acid Sequences in the Hydrophobic-Polar Model

Sequences	Length	Amino acid sequences (Krasnogor and Smith, 2000)
1	12	HHPHPHPHPHPH
2	14	HHPPHPHPHPHPHP
3	14	HHPPHPPHPHPHPH
4	16	HHPHPPHPPHPPHPPH
5	16	HHPPHPPHPHPHPPHP
6	17	HHPPHPPHPPHPPHPPH
7	17	HHPHPHPHPHPHPHPHH
8	20	HHPPHPPHPHPHPPHPHPHH
9	20	HHPHPHPHPHPPHPPHPPHH
10	21	HHPPHPPHPHPPHPHPPHPHH
11	21	HHPHPPHPPHPHPHPPHPPHH
12	21	HHPPHPHPHPPHPHPPHPPHH
13	22	HHPPHPPHPHPHPPHPPHPPHH
14	23	HHHPHPHPHPHPHPHPHPHPHHH
15	24	HHPPHPPHPPHPPHPPHPPHPPHH
16	24	HHHPHPHPPHPHPHPHPHPHPHHH
17	24	HHHPHPHPHPPHPHPHPHPHPHHH
18	30	HHHPPHPPHPPHPPHPHPPHPHPPHPPHHH
19	30	HHHPPHPPHPPHPHPPHPHPPHPPHPPHHH
20	37	HHHPPHPPHPPHPHPHPPHPPHPPHPPPPPHPHPHHH

H, hydrophobic; P, polar.

3.2. Comparison of optimal prediction results among several algorithms

The prediction performance of the multimeme memetic algorithm (MMA) (Krasnogor et al., 2002) for the 20 sequences (Table 1) is presented in Table 2. We compared the optimal prediction results of the proposed HE-L-PSO algorithm with those of the MMA. No MMA result was available for the first sequence. Among the sequences with a length equal to or greater than 24, the MMA did not determine the optimal structure for sequence 15 or for sequences 18–20.

Table 2.

Comparison of Optimal Predictions of the HE-L-PSO and Multimeme Memetic Algorithms

Sequences ^a	Length	Best ^b	MMA	HEPSO	Three local searches ^c	HE-L-PSO ^d
1	12	−11	ND	−11	−11	−11
2	14	−11	−11	−11	−11	−11
3	14	−11	−11	−11	−11	−11
4	16	−11	−11	−11	−11	−11
5	16	−11	−11	−11	−11	−11
6	17	−11	−11	−10	−11	−11
7	17	−17	−17	−15	−17	−17
8	20	−17	−17	−16	−17	−17
9	20	−17	−17	−15	−17	−17
10	21	−17	−17	−13	−17	−17
11	21	−17	−17	−15	−17	−17
12	21	−17	−17	−14	−17	−17
13	22	−17	−17	−14	−17	−17
14	23	−25	−25	−21	−25	−25
15	24	−17	−16	−14	−17	−17
16	24	−25	−25	−21	−25	−25
17	24	−25	−25	−21	−25	−25
18	30	−25	−24	−16	−24	−25
19	30	−25	−24	−16	−24	−25
20	37	−29	−26	−18	−27	−29

^a Sequences from Table 1.

^b The optimal values are reported in the literature (Krasnogor and Smith, 2000). Values listed in this table represent optimal prediction results from different algorithms in 25 experiments.

^c The three local search algorithms comprise the hill-climbing algorithm, greedy algorithm, and Tabu table.

^d HE-L-PSO is the combination of the HEPSO algorithm and three local search algorithms. HE-L-PSO is visualized in Figure 6.

HEPSO, high exploration particle swarm optimization; MMA, multimeme memetic algorithm (Krasnogor et al., 2002); ND, not shown in the literature.

Notably, the HEPSO algorithm alone was ineffective in predicting most of the 20 sequences. Three local search algorithms alone demonstrated more effective performance in protein folding prediction than did the HEPSO algorithm alone; nevertheless, the local search algorithms alone still did not attain optimal performance for sequences 18–20. By contrast, the proposed HE-L-PSO algorithm, which combines the HEPSO algorithm with three local search algorithms, could execute optimal folding prediction for the 20 amino acid sequences (Table 2).

3.3. Success rate regarding optimal structure prediction

The number of optimal solutions derived by the HEPSO algorithm, three local search algorithms, and HE-L-PSO algorithm was recorded for comparison with other methods (GComa, SComa, GRand, SRand, simple memetic algorithm [SMA], and GA) (Smith, 2005) by testing each sequence 25 times (Table 3). The optimal performance regarding folding prediction was defined as 100% reproducibility (25/25). For the shortest sequence (i.e., sequence 1), three methods (SComa, SMA, and HE-L-PSO) determined the optimal solution in all 25 tests. For the longest sequence (i.e., sequence 20), only the HE-L-PSO algorithm determined the optimal structure within 25 predictions. Regarding the overall optimal solutions, the HE-L-PSO algorithm could derive a total of 357 optimal solutions in all experiments, which is higher than those observed for the other algorithms as well as the HEPSO algorithm or the three local search algorithms alone.

Table 3.

Reproducibility of Optimal Prediction Results Among 25 Experiments

		Algorithms ^b
Seq ^a	Length	G-Coma (Smith, 2005)	S-Coma (Smith, 2005)	G-Rand (Smith, 2005)	S-Rand (Smith, 2005)	SMA (Smith, 2005)	GA (Smith, 2005)	HEPSO	Three local searches ^c	HE-L-PSO ^d
1	12	13	25	16	16	25	13	5	24	25
2	14	14	25	15	15	23	13	2	25	25
3	14	15	24	10	10	22	7	4	24	25
4	16	19	25	17	17	24	13	1	24	25
5	16	13	25	13	13	22	9	1	25	25
6	17	10	24	11	11	20	9	0	23	25
7	17	9	24	5	5	14	3	0	16	25
8	20	7	25	6	6	11	2	0	11	25
9	20	4	22	5	5	4	2	0	9	22
10	21	4	21	4	4	10	2	0	11	24
11	21	5	21	7	7	7	2	0	17	21
12	21	7	22	7	7	12	4	0	12	24
13	22	6	21	3	3	7	2	0	10	23
14	23	0	7	0	0	0	0	0	4	12
15	24	0	9	1	1	0	2	0	3	7
16	24	1	7	0	0	1	0	0	6	12
17	24	0	8	0	0	0	0	0	3	9
18	30	0	1	0	0	0	0	0	0	1
19	30	0	1	0	0	0	0	0	0	2
20	37	0	0	0	0	0	0	0	0	1
Total		127	337	120	44	202	83	13	247	357

^a Sequences from Table 1.

^c The three local search algorithms comprise the hill-climbing algorithm, greedy algorithm, and Tabu table.

^d HE-L-PSO is the combination of HEPSO and three local search algorithms.

^b Abbreviations (Smith, 2005): COMA, coevolving memetic algorithm; adaptive versions of the COMA were used with the two pivot rules (GComa and SComa), where GComa is the greedy version of the COMA. GRand and SRand, versions of the COMA applied using a randomly created rule in each application (i.e., learning was disabled), with one iteration of greedy (GRand) or steepest (SRand) ascent local search. SMA, simple memetic algorithm using a bit-flipping neighborhood, with one iteration of greedy ascent. GA, genetic algorithm without local search.

3.4. Mean of optimal fitness

In general, the prediction performance for sequences of 22 or fewer amino acids (sequences 1–13) was superior to that for sequences of 23 or more amino acids (sequences 14–20; Table 3). Figure 4 presents statistics for the optimal performance of the algorithms in predicting these longer sequences in the 25 experimental executions. Except for sequence 17, the HE-L-PSO algorithm yielded the highest means and the lowest standard deviations for all test sequences. In addition, for sequences with a length of 60 or more, the HE-L-PSO algorithm performed more effectively than did the other tested algorithms such as SGA (Tamjidul Hoque et al., 2006), HGA (Tamjidul Hoque et al., 2006), ERS-GA (Su et al., 2011), and HHGA (Su et al., 2011) (Table 4).

FIG. 4.

Optimal values analyzed by seven algorithms in the example of sequences 14–20 after 25 experiments. Sequence information is shown in Table 1. Data = mean ± SD (n = 25). SMA, simple memetic algorithm.

Table 4.

Optimal Prediction Results for Long Sequences

Sequences	Length	SGA (Tamjidul Hoque et al., 2006)	HGA (Tamjidul Hoque et al., 2006)	ERS-GA (Su et al., 2011)	HHGA (Su et al., 2011)	HE-L-PSO
1^a	60	−40^c	−46^c	−55/−49^d	−66/−62^d	−70/−66^d
2^b	64	−33^c	−46^c	−47/−42^d	−63/−55^d	−73/−70^d

H³ is HHH; (PH)² is PHPH.

^a P(PH³)²H⁵P³H¹⁰PHP³H¹²P⁴H⁶PH²PHP.

^b H¹²(PH)²((P²H²)²P²H)³(PH)²H¹¹.

^c Only the optimal prediction result was reported in the literature without the average value for repetitive experiments.

^d The values indicate the optimal prediction result and the average value for 25 repetitive experiments.
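The compact notation used for the two long sequences in Table 4 can be expanded with a short Python sketch, which also serves as a check that the expanded sequences have the lengths listed in the table. The function name `expand_hp` is hypothetical and is introduced only for this illustration.

```python
# Map Unicode superscript digits to ordinary digits so they can be parsed.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")


def expand_hp(compact):
    """Expand compact HP notation, e.g. 'H³(PH)²' becomes 'HHHPHPH'."""
    text = compact.translate(SUPERSCRIPTS)
    pos = 0

    def parse_group():
        nonlocal pos
        parts = []
        while pos < len(text) and text[pos] != ")":
            if text[pos] == "(":
                pos += 1                  # skip '('
                unit = parse_group()
                pos += 1                  # skip ')'
            elif text[pos] in "HP":
                unit = text[pos]
                pos += 1
            else:
                raise ValueError(f"unexpected character {text[pos]!r}")
            # An optional repeat count follows a residue or a group.
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            count = int(text[start:pos]) if pos > start else 1
            parts.append(unit * count)
        return "".join(parts)

    result = parse_group()
    if pos != len(text):
        raise ValueError("unbalanced parentheses")
    return result


if __name__ == "__main__":
    long_1 = expand_hp("P(PH³)²H⁵P³H¹⁰PHP³H¹²P⁴H⁶PH²PHP")
    long_2 = expand_hp("H¹²(PH)²((P²H²)²P²H)³(PH)²H¹¹")
    print(len(long_1), len(long_2))
```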

3.5. Visualization of optimal prediction results

For the longest sequence (sequence 20 in Table 1), the optimal prediction (Table 2) is visualized in Figure 5. The number of hydrophobic interactions in the protein folding predicted by the HE-L-PSO algorithm was higher than that in the protein folding predicted by the HEPSO algorithm or three local search algorithms alone. The optimal HE-L-PSO-based prediction for the 20 amino acid sequences (Table 2) is visualized in Figure 6. The high numbers of hydrophobic interactions indicate that the predicted structures were closer to the natural folded state. All the HE-L-PSO-predicted protein folding structures possessed a hydrophobic core that is similar to that of normal protein folding.

FIG. 5.

Visualization and comparison of the optimal prediction outcomes for the long sequence predicted using the HEPSO algorithm, three local search algorithms, and HE-L-PSO algorithm. The long sequence is sequence 20 in Table 1 (the longest among all sample sequences). The results are optimal outcomes among 25 tests. (A) HEPSO-based protein folding; the number of corresponding hydrophobic interactions (H interactions) is 18. (B) Protein folding predicted using three local search algorithms (hill-climbing algorithm, greedy algorithm, and Tabu table); the number of H interactions is 27. (C) HE-L-PSO-based protein folding; the number of H interactions is 29. The proposed HE-L-PSO algorithm was constructed by combining HEPSO with the three local search algorithms. Black and white points represent hydrophobic and polar amino acids, respectively. All amino acids in the sequence order are connected by solid lines, where H interactions are marked with dashed lines and the total number of H interactions is shown in the upper left corner. “First” is the starting amino acid for simulation.

FIG. 6.

Visualization of the optimal prediction outcome for the 20 amino acid sequences. These results are optimal outcomes of the 25 tests. Twenty sequence samples (Table 1) are marked with S, namely S1–S20. The length (L) of the amino acid sequence ranges from 12 to 37. Black and white points represent hydrophobic and polar amino acids, respectively. All amino acids in the sequence order are connected by solid lines, where the hydrophobic interactions (H inter) are marked with dashed lines. “First” is the starting amino acid for simulation. A higher value of the H inter indicates that the predicted structures are closer to the natural folded state.

4. Discussion

Numerous machine learning algorithms (Gromiha and Huang, 2011), including Monte Carlo algorithms (Mirny and Shakhnovich, 2001; Li et al., 2011), GA (Huang et al., 2005, 2010), ant colony optimization (Shmygelska et al., 2002), the memetic algorithm (Islam and Chetty, 2009), branch and bound (Chen and Huang, 2005; Hsieh and Lai, 2011), the filter-and-fan solution (Rego et al., 2011), and other algorithms listed in Tables 2 and 3, have been proposed for predicting protein folding structures. Optimization algorithms usually converge to an adequate structure that resembles the optimum; nevertheless, the structure they converge to is not guaranteed to be the true optimal solution. Moreover, local search algorithms such as the hill-climbing algorithm, greedy algorithm (DeRonne and Karypis, 2007), and Tabu table have each been implemented individually to enhance the prediction of protein folding (Lin et al., 2014). The HE-L-PSO algorithm proposed in the current study demonstrates reliable and effective performance in predicting protein folding.

The protein folding problem in the HP model entails determining the structure that maximizes the number of adjacencies between hydrophobic amino acids, under the assumption that hydrophobic interactions are the main contribution to the free energy of protein folding (Fig. 6). This study applied the HEPSO algorithm to predict protein folding and, to improve its search ability, used the greedy algorithm, hill-climbing algorithm, and Tabu table in the early, medium, and later iterations, respectively. The hill-climbing algorithm may be insufficient for improving protein folding structures in the early iterations because the particles still exhibit poor fitness (Fig. 7, left block). The greedy algorithm is effective at improving protein folding structures when the structures are not dense (Fig. 7, right block). Therefore, when the hill-climbing algorithm does not improve the particle fitness, the greedy algorithm can be applied to improve protein folding structures.
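
The objective described above can be made concrete with a short sketch. The following Python example counts H interactions for a fold on a two-dimensional triangular lattice. The axial-coordinate encoding, the absolute direction indices 0 to 5, and the names MOVES, place, and h_interactions are assumptions introduced for illustration; they are not taken from the HE-L-PSO implementation.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int]

# Six unit steps of a 2D triangular lattice in axial coordinates
# (an assumed encoding, chosen only for this illustration).
MOVES: List[Coord] = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]


def place(directions: List[int]) -> Optional[List[Coord]]:
    """Return the lattice coordinates of a fold, or None if the chain overlaps itself."""
    pos = (0, 0)
    coords = [pos]
    occupied = {pos}
    for d in directions:
        dx, dy = MOVES[d]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in occupied:
            return None  # not a self-avoiding walk, so not a valid fold
        occupied.add(pos)
        coords.append(pos)
    return coords


def h_interactions(sequence: str, directions: List[int]) -> Optional[int]:
    """Count H-H pairs that are lattice neighbours but not consecutive in the chain."""
    if len(directions) != len(sequence) - 1:
        raise ValueError("a chain of n residues needs n - 1 directions")
    coords = place(directions)
    if coords is None:
        return None
    index_at = {c: i for i, c in enumerate(coords)}
    count = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 skips chain neighbours and counts each pair only once
            if j is not None and j > i + 1 and sequence[j] == "H":
                count += 1
    return count
```

Under this encoding, h_interactions("HHHH", [0, 2, 0]) returns 2, and a self-intersecting fold returns None so that a caller can treat it as invalid.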

FIG. 7.

Example of hill climbing performed at a later iteration of the PSO algorithm. Green line represents results derived from a random pbest. Yellow line represents results derived from random particles in the current iteration.

The number of adjacencies between hydrophobic amino acids in the protein structure predicted by a specific particle can thus increase with the number of iterations. In the medium iterations, the hill-climbing algorithm can effectively help particles find superior protein structures through an operation that exchanges substructures between two dense protein folding structures (Fig. 8). Over numerous generations, the particles can converge to a local region of the search space, a phenomenon known as the problem of local optima. The particles have difficulty escaping from a local optimum because a structure that already contains numerous adjacencies between hydrophobic amino acids is difficult to improve further (Fig. 9, left). The renewed technique is a powerful evolutionary algorithm that enables particles to escape the local optimum by jumping a longer distance in the search space. However, the particles may still move toward the same search path because of the PSO convergence property. The Tabu table is used to prevent the particles from moving toward the same search path by changing gbest. Therefore, superior results could be obtained for the particles in the new local region (Fig. 9, right).
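
The role of the Tabu table in changing gbest can be illustrated with a minimal sketch. It assumes that a fold is stored as a tuple of direction indices, that the table has a fixed capacity, and that a new gbest is taken from the fittest pbest that is not tabu. The names TabuTable and choose_gbest and the default capacity are hypothetical and are not details of the HE-L-PSO implementation.

```python
from collections import deque
from typing import Deque, Optional, Sequence, Tuple

Fold = Tuple[int, ...]


class TabuTable:
    """Fixed-size memory of folds that were recently used as gbest (assumed design)."""

    def __init__(self, capacity: int = 10) -> None:
        # A bounded deque drops its oldest entry when a new one is appended.
        self._entries: Deque[Fold] = deque(maxlen=capacity)

    def add(self, fold: Fold) -> None:
        if fold not in self._entries:
            self._entries.append(fold)

    def __contains__(self, fold: Fold) -> bool:
        return fold in self._entries


def choose_gbest(
    pbests: Sequence[Fold], fitness: Sequence[int], tabu: TabuTable
) -> Optional[Fold]:
    """Return the fittest pbest that is not tabu, or None if every pbest is tabu."""
    candidates = [(f, p) for p, f in zip(pbests, fitness) if p not in tabu]
    if not candidates:
        return None
    _, best = max(candidates, key=lambda pair: pair[0])
    return best
```

In this sketch, a stagnating swarm would add its current gbest to the table and call choose_gbest again, so that the particles are steered toward a different region instead of retracing the same search path.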

FIG. 8.

Example of hill climbing performed at an early iteration of the PSO algorithm and example of the greedy algorithm. Green line represents results derived from a random pbest. Yellow line represents results derived from random particles in the current iteration. Green circle represents the amino acids executing greedy.

FIG. 9.

A satisfactory structure may not be similar to an optimal structure. The visualization of the structure underneath is the particle content.

The advantages of the proposed HE-L-PSO algorithm are as follows. (1) High reproducibility: Compared with other algorithms, the HE-L-PSO algorithm found the optimal solution in a higher number of the 25 executions of the same test (Table 2). (2) High performance in predicting protein folding for long sequences: The HE-L-PSO algorithm is more effective in predicting optimal structures for longer sequences compared with other algorithms (Tables 3 and 4). (3) High stability in predicting protein folding: The HE-L-PSO algorithm was also determined to be stable when predicting optimal protein structures (Fig. 4) because of its small standard deviation for the test sequences.
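
The reproducibility and stability measures mentioned above can be computed from repeated runs as in the following sketch. The function name summarize_runs, the use of the population standard deviation, and the run values in the usage note are assumptions made for illustration; they are not results from this study.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Union

Number = Union[int, float]


def summarize_runs(h_counts: Sequence[int], known_optimum: int) -> Dict[str, Number]:
    """Summarize the H-interaction counts obtained from repeated runs on one sequence."""
    return {
        "best": max(h_counts),  # optimal prediction among the runs
        "mean": mean(h_counts),  # average over all runs
        "std": pstdev(h_counts),  # population standard deviation (assumed choice)
        "hits": sum(1 for h in h_counts if h == known_optimum),  # runs reaching the optimum
    }
```

For a hypothetical set of 25 runs in which 20 runs reach 10 H interactions and 5 runs reach 9, summarize_runs reports a best value of 10, a mean of 9.8, a standard deviation of 0.4, and 20 hits.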

In general, most algorithms for predicting protein folding exhibit lower success rates in long sequences than they do in short sequences. The reason is simple: increasing the sequence length raises the complexity of escaping from the local minima in the protein folding space. Similarly, HE-L-PSO and other algorithms perform comparatively more effectively for short sequences than they do for longer ones. However, the HE-L-PSO algorithm performs more effectively than do the other algorithms tested in Table 3. These results suggest that the HE-L-PSO algorithm is an improved algorithm for predicting protein folding in a triangular lattice model.
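
The growth of the search space with sequence length can be illustrated with a simple bound. On a triangular lattice each site has six neighbours, so the second residue has at most six positions and every later residue has at most five positions that do not step straight back. The sketch below computes this upper bound; the function name walk_upper_bound is hypothetical, and the bound overcounts because it ignores self-avoidance and lattice symmetry.

```python
def walk_upper_bound(length: int, neighbours: int = 6) -> int:
    """Upper bound on the number of folds of a chain with `length` residues.

    Counts non-reversing walks only; valid (self-avoiding) folds are fewer.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if length == 1:
        return 1
    return neighbours * (neighbours - 1) ** (length - 2)
```

For the shortest test sequence (length 12) the bound is 58,593,750, whereas for the longest (length 37) it is on the order of 10^25, which illustrates why long sequences are markedly harder to optimize.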

In conclusion, this study combined an optimization algorithm (HEPSO) with three local search algorithms to predict protein folding. When tested on 20 amino acid sequences with 25 repetitions each, the proposed HE-L-PSO algorithm demonstrated high performance in predicting both short and long sequences with high reproducibility and stability.

Acknowledgments

This work was supported by funds from the Ministry of Science and Technology, Taiwan (MOST 105-2221-E-151-053-MY2, MOST 103-2221-E-151-029-MY3, MOST 104-2221-E-214-035-MY2, and MOST 104-2320-B-037-013-MY3), and the National Sun Yat-sen University-KMU Joint Research Project (#NSYSU-KMU 105-p022).

Author Disclosure Statement

No competing financial interests exist.

References

Anfinsen, C.B. 1973. Principles that govern the folding of protein chains. Science, 181, 223–230.

Bechini. 2013. On the characterization and software implementation of general protein lattice models. PLoS One, 8, e59504.

Chen, and Huang, W.Q. 2005. A branch and bound algorithm for the protein folding problem in the HP lattice model. Genomics Proteomics Bioinformatics, 3, 225–230.

Crescenzi, Goldman, Papadimitriou, Piccolboni, and Yannakakis. 1998. On the complexity of protein folding. J. Comput. Biol., 5, 423–465.

DeRonne, K.W., and Karypis. 2007. Effective optimization algorithms for fragment-assembly based protein structure prediction. J. Bioinform. Comput. Biol., 5, 335–352.

DeVore, R.A., and Temlyakov, V.N. 1996. Some remarks on greedy algorithms. Adv. Comput. Math., 5, 173–187.

Dill, K.A. 1985. Theory for the folding and stability of globular proteins. Biochemistry, 24, 1501–1509.

Glover. 1989. Tabu search-part I. ORSA J. Comput., 1, 190–206.

Gromiha, M.M., and Huang, L.T. 2011. Machine learning algorithms for predicting protein folding rates and stability of mutant proteins: Comparison with statistical methods. Curr. Protein Pept. Sci., 12, 490–502.

Hsieh, S.Y., and Lai, D.W. 2011. A new branch and bound method for the protein folding problem under the 2D-HP model. IEEE Trans. Nanobioscience, 10, 69–75.

Huang, Yang, and He. 2010. Protein folding simulations of 2D HP model by the genetic algorithm based on optimal secondary structures. Comput. Biol. Chem., 34, 137–142.

Huang, Y.Y., Yang, C.B., Tseng, K.T., and Yang, C.N. 2005. Protein folding prediction with genetic algorithms, 130–139. In The 4th Conference on Information Technology and Applications in Outlying Islands. Penghu, Taiwan.

Islam, M.K., and Chetty. 2009. Novel memetic algorithm for protein structure prediction, 412–421. In Nicholson, and Li, eds. AI 2009: Advances in Artificial Intelligence: 22nd Australasian Joint Conference, Melbourne, Australia, December 1–4, 2009. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg.

Kennedy, and Eberhart, R.C. 1995. Particle swarm optimization, 1942–1948. In Proceedings IEEE International Conference on Neural Networks. Perth, Western Australia.

Krasnogor, Blackburne, Burke, E.K., and Hirst, J.D. 2002. Multimeme algorithms for protein structure prediction, 769–778. In Guervós, J.J.M., Adamidis, Beyer, H.-G., Schwefel, H.-P., and Fernández-Villacañas, J.-L., eds. Parallel Problem Solving from Nature—PPSN VII. Springer, Berlin.

Krasnogor, and Smith. 2000. A memetic algorithm with self-adaptive local search: TSP as a case study, 987–994. In GECCO.

Li, Y.W., Wust, and Landau, D.P. 2011. Monte Carlo simulations of the HP model (the “Ising model” of protein folding). Comput. Phys. Commun., 182, 1896–1899.

Lim, Rodrigues, and Zhang. 2006. A simulated annealing and hill-climbing algorithm for the traveling tournament problem. Eur. J. Oper. Res., 174, 1459–1478.

Lin, Zhang, and Zhou. 2014. Protein structure prediction with local adjust tabu search algorithm. BMC Bioinformatics, 15 Suppl 15, S1.

Mahmoodabadi, M.J., Mottaghi, Z.S., and Bagheri. 2014. HEPSO: High exploration particle swarm optimization. Inf. Sci., 273, 101–111.

Mirny, and Shakhnovich. 2001. Protein folding theory: From lattice to all-atom models. Annu. Rev. Biophys. Biomol. Struct., 30, 361–396.

Mount, D.W. 2004. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.

Qian, Raman, Das, Bradley, McCoy, A.J., Read, R.J., and Baker. 2007. High-resolution structure prediction and the crystallographic phase problem. Nature, 450, 259–264.

Rego, Li, and Glover. 2011. A filter-and-fan approach to the 2D HP model of the protein folding problem. Ann. Oper. Res., 188, 389–414.

Shmygelska, Aguirre-Hernández, and Hoos, H.H. 2002. An ant colony optimization algorithm for the 2D HP protein folding problem, 40–52. In Dorigo, Caro, G.D., and Sampels, eds. Lecture Notes in Computer Science. Springer, Berlin.

Smith. 2005. The co-evolution of memetic algorithms for protein structure prediction, 105–128. In Recent Advances in Memetic Algorithms. Springer.

S.C., Lin, C.J., and Ting, C.K. 2011. An effective hybrid of hill climbing and genetic algorithm for 2D triangular protein structure prediction. Proteome Sci., 9, S19.

Tamjidul Hoque, M., Chetty, and Dooley, L.S. 2006. A hybrid genetic algorithm for 2D FCC hydrophobic-hydrophilic lattice model to predict protein folding, 867–876. In Sattar, and Kang, B.-H., eds. AI 2006: Advances in Artificial Intelligence. Springer, Berlin.