Abstract
Abstract
1. Introduction
The most commonly used side chain energy function was obtained by Dunbrack et al. using the knowledge-based potential (KBP) approach (Dunbrack and Karplus, 1993; Dunbrack and Cohen, 1997). KBPs are energy functions derived from databases of proteins with known structures by Boltzmann inverting the observed interaction frequencies (i.e., inverting the Boltzmann probability distribution). Side chain dihedral angles are represented by a discrete set of rotamers (typically 3), and involve correlations among these angles and between these angles and the backbone dihedral angles. These potentials were originally derived for predicting and refining the side chain structure given the backbone conformation. Several other improvements have appeared since, mainly for optimizing the rotamer library search (Canutescu et al., 2003; Wang et al., 2005; Santana et al., 2007; Yanover et al., 2008).
An alternate approach for obtaining coarse-grained peptide energies has recently emerged. It consists of extracting effective potentials from molecular dynamic (MD) simulations of residues and peptides modeled by atomic force fields (Basdevant et al., 2007; Ayton et al., 2007; Zhou et al., 2007; Sherwood et al., 2008; Betancourt and Omovie, 2009). In particular, statistical potentials for all 20 amino acid residue-pairs, including distance and orientation dependencies, have been obtained using the Boltzmann inversion method and used to successfully fold a selected group of proteins (Betancourt, 2009). Energy functions derived from atomic force fields have the advantage of a well-defined theoretical basis and can be derived for different solvent environments. However, they can only be as good as the atomic force field from which they are derived, and are known to have some difficulties in describing local structures (Gnanakaran and Garcia, 2003; Yoda et al., 2004; Jang et al., 2007). The objective of this work is to derive side chain coarse-grained potentials from MD simulations and to compare them to those derived from KBPs.
The procedure for determining the energy function for the side chains as a function of the dihedral angles is described in the following section. KBPs are obtained from a selected group of protein structures, and molecular mechanics–based potentials (MBPs) are obtained from equilibrium simulations of single tripeptides in solution. Correlations between KBPs and MBPs are obtained, and evaluations of the side chain potential on a selected group of proteins are carried out for the KBPs and MBPs.
2. Energy Calculation Method
Following the work of Dunbrack and Karplus (1993), side chain energies depend on the side chain and backbone dihedral angles. In general, these energies are computed from the dihedral angle probability distributions for each amino acid α by the Boltzmann inversion approach, i.e.,
where m is the number of side chain dihedral angles and the probabilities are conditional on the backbone dihedral angles. Here, the assumption is made that the backbone dihedral energies, which are dependent on φ, ψ and ω, are considered separately. Therefore, to minimize redundancy in energy contributions as well as non-additivity, the dependence in the backbone dihedral angles is only conditional, that is, only correlations with backbone dihedral angles are included.
In the work of Dunbrack, as in many others, the side chain angles are coarse-grained in terms of a few (normally 3) rotamers. Here, a finer resolution of 18 values (or 20° intervals) is used for both the side chain and the backbone dihedral angles. Because of this, the probability dependencies on the angles are approximated by pair-wise correlations for all residues, except for Gly and Ala, which have no side chain energies, and Asn and Asp, for which higher correlations are included. More specifically, the general side chain probability is selected as
Notice that if the angles were independent, the right hand side of the equation would reduce to a product of the probabilities for each of the side chain angles. For Asn and Asp, the probability is
Unlike most amino acids, these two amino acids exhibit significant correlations between these angles due to the residue self-interactions between the end of the side chain and its backbone (Dunbrack and Karplus, 1993).
The probabilities are computed separately from equilibrium MD simulations in the case of MBPs or from protein structure databases in the case of KBPs. In either case, there are angles for which events may not be observed. For this reason, if Nα(χi, χ
i
+1) is the number of events for the corresponding angles, then the conditional probability is computed as
where b = 18 is the number of bins, i.e., 1/b2 = uniform probability. A similar equation is used for Asn and Asp.
Probability densities built from protein structures (KBPs) were obtained from a database of more than 22,000 proteins with less than 98% sequence identity among them. These proteins were grouped in nearly one thousand clusters of less than 20% sequence identity between sequences of different clusters. Rather than using only one sequence per cluster, all sequences were considered in building the density histograms with weights inversely proportional to their cluster size. This database was limited to structures determined by crystallographic methods with better than 3Å resolution and deposited in the Protein Data Bank (PDB). Residues at the chain termini, or with missing atoms, were excluded from the calculation.
To determine the probability densities from atomic force fields, molecular dynamics simulations were carried out for the 18 residues with side chain dihedral angles. Tripeptides with the sequence AXA, where X is the amino acid being sampled, were equilibrated at 300 K. Surrounding the residue by alanines allows the inclusion of the backbone dihedral angles and provides the neighboring atoms that most consistently surrounds the residue. Simulations in explicit water were carried out using the Gromacs software (Berendsen et al., 1995; Van Der Spoel et al., 2005) and three different force fields: the united atom Gromos G43A1 and G43A2 and the all-atom OPLS-AA/L. Each triplet was placed in a dodecahedron box with 25 Å diameter and solvated using the SPC water model (SPCE for OPLS-AA/L). For each triplet, 50 simulations of 11 ns starting from random initial conditions were carried out. The first 1 ns of each trajectory was eliminated to remove transient effects.
Additionally, implicit solvent simulations were carried out using the Amber 9 software package (Pearlman et al., 1995; Case et al., 2005). The force fields ff99 and ff03 (improved φ and ψ dihedrals) were used, with the generalized Born approximations GBHCT (igb = 1) and GBOBC, model II (igb = 5, with PBradii set to mbondi2), respectively. This selection was made to compare the prediction dependence on the force field and solvent model optimizations. The following are some of the calculation details. A triplet structure was initially minimized using 500 steps of steepest descent followed by another 500 steps of conjugate gradient. The system was then heated to the equilibrium temperature in four stages of random but sequentially increasing temperatures, approaching 325 K from 0 K and ending back at 300 K. Gradual heating was used to obtain a stable relaxation, and random temperatures were used to randomize the structure for each simulation trajectory. Each heating trajectory was made in 0.5 fs steps and lasted 50 ps. Simulations in equilibrium were carried out for 10 ns in 2 fs steps at 300 K, and structures were obtained every pico second. A total of 50 independent trajectories were calculated for each triplet. Non-periodic conditions with a potential cutoff at 999 Å were used in the simulations.
3. Results and Discussion
Substitution of Eq. 2 into Eq. 1 results in a sum of m + 1 energy functions in two variables for each residue. Similarly, substituting Eq. 3 into Eq. 1 results in a sum of two energy functions in three variables. To measure the similarity between the MBP and the KBP, the Pearson correlation coefficient was computed between all corresponding values of the energy functions. The highest correlation was obtained with the G43A2 force field, and the results are shown in Table 1. Correlation values range from 0.425 for Glu UE(χ2, χ3) to 0.921 for Pro UP (ψ, χ1). Overall, there is a significant correlation between the energy functions obtained from both approaches. On average, a correlation coefficient of 0.77 was obtained. There is no evident dependence of the correlation on distance from the backbone or on the residue properties, such as hydrophobicity or charge.
Generated using the Gromos G43A2 force field.
A one-letter code for the amino acids is used.
Correlations are actually for (φ, χ1, χ2) and (ψ, χ1, χ2).
Figure 1 shows examples of KBPs and MBPs (G43A2) with high and low correlations. The conditional angles are plotted in the horizontal axis. Figures 1a and 1b correspond to the tyrosine potential UY (χ1, χ2) for KBP and MBP, respectively. This case has one of the highest correlations between any two energy terms (excluding proline), with a correlation coefficient of r = 0.87. Both potentials show similar features, with the KBP being somewhat noisier. Notice that in both KBPs and MBPs there are noticeable dependencies on the conditional angle χ1. In particular, the energies for χ1 < −90° are lower than for χ1 > 90°. Figures 1c and 1d correspond to the potential for glutamic acid UE(χ2, χ3) for KBP and MBP, respectively. This potential has a correlation coefficient of r = 0.43, which is significantly lower than any of the others shown in Table 1. The UE(χ2, χ3) potential is actually well correlated in the region corresponding to −90° < χ2 < 90°. Outside this region, the KBP shows a lower energy around the cis state as compared to the MBP. The MBP is highly symmetrical around the cis and trans state but not the KBP. This difference is caused by the symmetric representation of the side chain oxygens in the simulation model and not in the protein structures. In particular, the Gromos G43A2 force field assigns the same charge (−0.635 e) to both oxygens denominated as OE1 and OE2. However, in protein structures, the hydrogen HE2 is usually attached to OE2, resulting in an asymmetry in the potential. This difference may be diminished with the explicit inclusion of the HE2 hydrogen.

Example of potentials with high and low correlations. High correlation between the last two angles of phenyalanine: (
Interestingly, the all-atom simulations using with the OPLS-AA/L force field did not improve the correlations between the KBPs and the MDPs in general. This suggests that the parameterization of the force field may be more important to increase the correlations than replacing the united atoms by all atoms. Other possible factors for the differences between KBPs and MBPs may be the effect of neighboring-residue side chains and the polarization effects in the presence of other residues and molecules, both neglected in the simulations. Evidently, noise due to sparse data in KBPs lowers the correlation. In addition, the derivation of the KBPs assumes that the non-bonded interactions do not affect the side chain potential, but is not entirely clear that this may be the case.
Side chain KBPs are regularly used to predict the side chain conformations given the backbone dihedral angles (Dunbrack and Karplus, 1993; Dunbrack and Cohen, 1997; Canutescu et al., 2003; Wang et al., 2005; Santana et al., 2007; Yanover et al., 2008). In reality, the side chain potential alone cannot predict the side chain conformation precisely without including non-bonded interactions. Nonetheless, the KBPs can be compared to the MBPs by predicting side chain conformations using each potential. Here, side chain predictions were made by minimizing the side chain energy using simulated annealing and the Monte Carlo method while keeping the backbone fixed to its native conformation. In addition to the side chain potential, a hard-core excluded volume potential between side chains was included. This consists of adding high-energy values when the center of mass of the side chains is below the sum of their radii. These radii were fixed and approximated according to the size of the side chains. In addition, local excluded volume effects were included by requiring that the side chain conformations, generated a priori by the Monte Carlo method, did not cause overlaps between the side chain atoms and the backbone atoms of the residue and neighboring residues. For computational efficiency, only the side chain atoms required to define the side chain dihedral angles χ were considered.
In the first test, a set of medium resolution proteins were selected from a group of globular proteins listed by Ivankov et al. (2003). Only those with a clearly identifiable sequence and no missing atoms were used, resulting in a total of 45 proteins. The PDB codes for this protein set are shown in Table 2, which includes 32 proteins determined by x-ray crystallography with an average resolution of 1.81Å, and 13 proteins determined by nuclear magnetic resonance (NMR). As a precaution, a special version of the KBP was derived by eliminating all the proteins used in this test from the calculation of the potential. Nevertheless, this special KBP is practically indistinguishable from the original given that each amino acid contributes no more than one data point for each dihedral angle histogram, which containts thousands of data points. A summary of the results of the minimization using the KBP and the MBPs are presented in Table 3. Two quantities are shown, one is the average absolute difference between the side chain angles with native and with minimum energy 〈|Δχ|〉, and the other is the percentage of resulting angles within 40° from native. The values are also shown for buried and for solvent exposed amino acids. This was done by calculating the total number of atoms from other residues that are at a distance of 4.5Å or less from the atoms of the residue, divided by the number atoms in the residue. When this number was above 11.5 (a value close to the mid point between fully exposed and fully buried residues) the residue was defined to be buried.
The average resolution for the structures determined by x-ray crystallography is 1.81Å.
The PDB code and chain ID are shown.
Residues less exposed to the solvent.
Residues more exposed to the solvent.
All residues, buried and exposed.
Averaged absolute difference between native and minimized angles.
Percentage of angles less than 40° from native.
The results show that the KBP gives angles closer to the native values. The average angle difference resulting from the KBP was 〈|Δχ|〉 = 42°, which is more than 10° lower than the MBPs, whose values range between 53° and 59°. Among all force fields, the united atom force field G43A2 gave results closer to native. The results for OPLS-AA/L and Amber ff03 were comparable to G43A2. However, as expected, both G43A1 and Amber ff99 gave inferior results when compared to their improved counterparts (G43A2 and ff03).
Separating the results into buried and exposed residues reveals that, for both KBP and MBP, the buried residue conformations are closer to native than the exposed residue conformations. This is in agreement with pervious observations done by other groups. It could be expected that for surface residues, the simulations using MBPs would yield comparable results to those using KBPs because the simulations are carried out with fully solvated residues. However, the results show that the KBPs yields conformations much closer to the native ones even in this case. It is not entirely clear if this is due to the force field parameterization or due to differences in the experimental conditions for the protein structures as compared to the prediction simulations. Notice that overall prediction rates reported in other studies typically consider the χ1 and χ2 values for the buried residues only. Under these constraints, the present KBP value yields an average 80% prediction rate, which is comparable to the values reported by Yanover et al. (2008) (82.6%), using a different protein test set and buried residue definition.
Table 4 shows the prediction results, averaged for each side chain angle of each amino acid. The MBP results correspond to the simulations with the G43A2 force field. The table also shows the difference between the angle native distance and the prediction percentages. When the differences for the angle distances are negative and for the percentages are positive, the KBP results in better predictions than the MBP. Predictions for almost every single angle determined by the KBP are better than the ones determined by the MBP. Only for Arg, Glu, and His are the angles closer to native using the MBP. Significant differences can be observed for residues such as Asn, Asp, Glu, Phe, and Trp. Notice that the results for Tyr are also closer to native using the KBP in spite of the high correlation of the KBP to the MBP for this residue. In the case of Glu, the low correlation of the UE(χ2, χ3) potential results in a higher difference of the χ3 angle using the MBP.
Obtained using the Gromos G43A2 force field.
In spite of the permissive range used to define the native angles in the predictions (40° from native), it is instructive to study the effects of the protein resolution on the prediction quality for the different potentials. For this, a set of high resolution structures compiled by Davis et al. (2006) was used. The protein PDB codes are shown in Table 5. A total of 19 proteins were used (two from the original list were eliminated because of missing residues) with an average resolution of 0.80Å and a maximum of 0.90Å. A KBP not including these proteins was calculated and used in the prediction of their side chain angles. An identical procedure used in the analysis of the medium resolution proteins was followed. The prediction results are shown in Table 6. Once again, the KBP performs better than all the MBPs and by approximately the same margin as when using the medium resolution proteins. That is, for the medium resolution structures the correctly predicted angles using KBP for all residues was 9% higher than using G43A2, while for the high resolution structures this difference was 13%. Furthermore, for all potentials, the percentage of correctly predicted structures increased for the high resolution structures. In particular, 88% percent of the native angles were recovered for the buried residues using the KBP.
The average resolution is 0.80Å.
The PDB code and chain ID are shown.
Residues less exposed to the solvent.
Residues more exposed to the solvent.
All residues, buried and exposed.
Averaged absolute difference between native and minimized angles.
Percentage of angles less than 40∘ from native.
It was also observed (data not shown) that, when combining both set of proteins, the structure resolution R and the percentage of correctly predicted angles % show a significant linear correlation for either KBPs and MBPs. For KBPs, the linear regression yields % = 85 − 9.5R, with a correlation coefficient of r = −0.71. Two outliers showing lower prediction percentage than expected were eliminated from the regression fit. These were 3al1_A and 3al1_B, both which are 12 residue designed peptides with 0.75Å resolution but with only about 60% of correctly predicted angles. For G43A2, the linear regression for all the angles yields % = 69 − 7.0R, with a correlation coefficient of r = −0.59. Extrapolating these results to a zero resolution error (R = 0) yields 85% for KBP and 69% for G43A2. These observations indicate that at least some of the prediction inaccuracy of the side chain potentials arise from inaccuracies in the structure determination, and that predictions using KBPs are significantly better than using any of the MBPs independent of resolution error. The prediction results using MBPs for the high resolution structures are clustered between 58% and 66%, which is significantly below the 79% resulting from KBPs. These results indicate that any systematic error that may exist in the least-squares refinement of the protein crystal structures from diffraction data is not responsible for the difference in prediction accuracy between the KBP and the MBPs.
4. Conclusion
Statistical potentials for amino acid side chains were derived using two approaches, one from known protein structures and another from MD equilibrium simulations. Molecular dynamics simulations were carried out using two Gromos united atom force fields (G43A1, G43A2), the all-atom OPLS-AA/L force field, and two Amber force fields (ff99, ff03) with implicit solvent. A comparison between the knowledge-based and the MD-based potentials showed significant correlations between them, with correlation coefficients near 0.8. Nevertheless, when using the potentials for predicting the side chain dihedral angles, with no other energy contributions other than excluded volume interactions, the KBP performed significantly better for either buried or solvent exposed residues. It was also found that the united atom force field G43A2 perform better than all other force field tested. The KBP performed as well as any of the state of the art side-chain prediction methods in spite of only including pair correlations of dihedral angles in the energy terms. Both KBPs and MBPs energy functions were derived with a resolution of 20°, which is higher than typical rotamer libraries.
In spite of the MBPs' lower ability to predict side chain conformations, they are ultimately desirable over the KBPs, because in principle MBPs can be adapted to various solvent conditions. However, KBPs are currently superior in predicting the side chain conformation, and their prediction advantage over the tested MBPs is not due to the proteins resolution but due to force field properties not included in the MBPs. It was not possible to determine conclusively if further atomic force field improvements for bonded interactions are needed to increase the MBP prediction ability or if the KBP is capturing non-bonded interactions not modeled in the MBP calculations. Differences in the residue environment between both approaches were controlled in several ways. First, by classifying the residues as buried or exposed, it was observed that the environment surrounding a residue was not mainly responsible for the superiority of the KBP, given that it also predicted better the solvent exposed residues. Also, by maintaining the backbone dihedral angles in their native values during the side chain angles minimization, possible preferences in the KBP for native backbone dihedral angles were ruled out. However, other effects such as the types of neighboring residues along the main chain, which in the MBP determination consisted of alanines, could be affecting the MBPs. In spite of these difficulties, the results suggest that the inclusion of other properties in the atomic force fields, such as electronic polarization effects, or the modification of dihedral angle potentials may be needed.
Footnotes
Disclosure Statement
No competing financial interests exist.
