Abstract
Abstract
1. Introduction
P
Computational docking methods aim to compute the correct bound form of two or more molecules. This is highly challenging because even for rigid body docking, the solution space spans the three translational and three rotational degrees of freedom for the other protein, and it grows exponentially with the size of the input proteins (Cherfils and Janin, 1993; Moreira et al., 2010). Most docking algorithms apply a geometric search for the correctly bound complex, followed by a ranking/scoring stage where a scoring function aims to distinguish native-like candidates from false positives. Several modern docking algorithms are successful in predicting the correctly bound complex of their input proteins, but many of the highest ranking docking candidates are still false positives (Halperin et al., 2002; Janin, 2010).
It is generally agreed that there is a relationship between various scoring terms (e.g., Van der Waals [VdW], electrostatic, and desolvation forces) and the similarity of a docking output complex to its native structure (Moal et al., 2013). However, the exact form of this relationship is unknown. Therefore, docking algorithms often formulate this relationship as a weighted sum of selected energetic, biochemical, or geometric terms and adjust their weights against a training set (Moal et al., 2013). However, the general inaccuracy of the rankings may suggest that this relationship may be more complex than a weighted sum (Kastritis and Bonvin, 2010). For this reason, many docking algorithms provide an additional, sometimes optional, refinement stage.
In this work, we describe three methods to predict the RMSD of a set of docking refinement candidates with respect to their native conformation. We train the models using different data sets with the goal of determining the training data set that gives the highest prediction power. By building three models to predict the RMSD of a given structure, we provide the experimental setting for comparing the performance of these models. These experiments can be used as a guiding tool for building the right data set and designing the best model in the studies that rely on selecting the best docking refinement candidates.
1.1. Scoring functions for protein–protein docking
Over the past 20 years, several scoring functions have been developed for ranking putative docked complexes. These functions combine geometric complementarity with physicochemical interactions (Halperin et al., 2002). Most scoring functions use a combination of VdW energy, electrostatic interactions, and desolvation terms (Dominguez et al., 2003; Comeau et al., 2004; Cheng et al., 2007; Pierce and Weng, 2007; Lyskov and Gray, 2008; Vries and Zacharias, 2013). The combination and weighing of the terms vary among methods. A recent docking refinement method (Akbal-Delibas et al., 2012; Akbal-Delibas and Haspel, 2013) uses a scoring function, including an evolutionary traces (ET) term (Mihalek et al., 2006; Wilkins et al., 2012), in addition to the VdW and electrostatic components.
While it is known that proteins often change their conformations upon binding, modeling flexibility is challenging due to the additional computational cost to an already difficult problem. Flexible docking methods use normal mode analysis (Li et al., 2010) and side-chain flexibility combined with soft rigid body optimization (Lyskov and Gray, 2008; Mashiach et al., 2008). Despite recent development in scoring functions, predicting the correct binding conformation is still a largely unsolved problem. A recent large-scale benchmarking of current docking methods revealed that most current physics-based scoring functions still fail to accurately predict the binding affinity of complexes (Kastritis and Bonvin, 2010).
1.2. Neural networks
Neural networks are widely used to approximate complex functions. In previous work (Akbal-Delibas et al., 2014, 2015b), we used a backpropagation network (Rumelhart et al., 1986; Werbos, 1990; Mehrotra et al., 1997) to formulate the relationship between a wide set of scoring function terms and RMSD of a docked structure. The tool, denoted AccuRMSD, not only ranks the decoys relative to each other but also indicates how similar each decoy is to the native conformation.
Recently, more sophisticated types of networks, called deep-learning networks, have shown great success in the domain of image processing, speech recognition, and bioinformatics. In deep architectures, multiple layers of nonlinear mappings are stacked on top of one another to better capture the nonlinearities in the representation of training data. These multiple hidden layers build a hierarchy that transforms the raw input to a feature space that is often initially unobserved. The rapid advances in computing hardware as well as availability of vast volumes of data at low cost have enabled the efficient training of deep networks. We used deep learning in the past to aid in docking refinement (Akbal-Delibas et al., 2015a, 2016). In this work, we have used a multilayer neural network (MLNN) as well as a network of restricted Boltzmann machines (RBMs) (Yu and Deng, 2011).
2. Methods
2.1. Training data sets
The first set of data sets, used in the conference version of this article (Farhoodi et al., 2015), includes 41 protein structures with RMSD values that range between 0 and 7 Å and divided into training and testing sets of 35 and 6 proteins, respectively. The second set of data sets was added to this extended version to test a more diverse set of putative complexes. It includes 54 proteins where the RMSD values span a much wider range, between 0 and 27 Å. We have divided these into training and testing data sets of 48 and 6 proteins, respectively. Table 1 provides information of all the data sets used in this article. In what follows, we provide more details about the data sets.
Ref, refinement.
2.1.1. Protein structures with limited RMSD range
We initially trained our models with three extensive data sets that included the following 35 unbound dimer proteins listed in the unbound Protein–Protein Docking Benchmark 4.0 (Hwang et al., 2010): 1B6C, 1EFN, 1EWY, 1FFW, 1GL1, 1GLA, 1GPW, 1GXD, 1H9D, 1US7, 1J2J, 1JTG, 1OC0, 1OYV, 1PVH, 1S1Q, 1T6B, 1XD3, 1YVB, 1Z0K, 1Z5Y, 1ZHH, 1ZHI, 2AST, 2AJF, 2B42, 2FJU, 2HLE, 2HQS, 2J0T, 2O8 V, 2OOB, 2VDB, 3DSS, and 4CPA. We focused on the dimers from the rigid body category in this benchmark and selected the proteins for which the corresponding ET files were available in the ET Server (Mihalek et al., 2006). Our methods were initially designed to accurately discriminate docking refinement candidates generated from putative docked protein complexes. Therefore, in addition to the docked structures, we generated a set of refinement candidates. We grouped these samples into three data sets based on the method used to generate the corresponding protein structures:
• Docked structures data set: For each protein, we produced 1000 docked structures by RosettaDock (Lyskov and Gray, 2008). We used the coarse grained module (without refinement), followed by 500 minimization steps using NAMD (Phillips et al., 2005) to resolve clashes without significantly changing the structures. We then analyzed the interfaces of each structure and calculated the values of the network features. This data set consists of 35,000 samples. • Refinement candidates data set: For each protein, we generated 1000 refinement candidates from coarsely docked complexes by applying small rigid body rotations around an arbitrary axis as described in Akbal-Delibas et al., 2015a. This initially resulted in 35,000 refinement candidates, of which ∼6000 were discarded due to high RMSD values (above 7Å), with the aim of obtaining an approximately normal distribution. • Docked and refinement structures data set: We combined the samples in the docked structures data set and the refinement candidates data set. This data set consisted of ∼64,000 samples.
Our motivation of building these data sets was to investigate how adding the refinement candidates to the docking training set affects the accuracy of the model. Figure 1a–c depicts the RMSD (Å) distribution of samples in these data sets.

RMSD (Å) distribution of
2.1.2. Protein structures with higher RMSD range
To evaluate the accuracy of methods in cases where structures have higher RMSD values, we generated two new training data sets of the following 48 proteins: 2A5T, 3D5S, 1S1Q, 1Z5Y, 2AJF, 2GAF, 1GLA, 1GPW, 1XD3, 2A1A, 1FFW, 1JTD, 1YVB, 2GTP, 1EWY, 3A4S, 1J2J, 2J0T, 1T6B, 1US7, 1OC0, 1ZHI, 1OYV, 1H9D, 2I25, 2VDB, 4M76, 1ZHH, 2HLE, 1EFN, 1B6C, 2OOB, 2O8V, 4CPA, 1Z0K, 1PVH, 4H03, 3BIW, 3VLB, 1GL1, 2YVJ, 1GXD, 2B42, 3K75, 3PC8, 2HQS, 1JTG, and 2FJU. This set includes the proteins in the previous data sets and 13 extra proteins that we have newly added to the collection after the release of the Protein–Protein benchmark v.5 (Vreven et al., 2015). Also, in addition to RosettaDock, we have used pyDock (Jiménez-García et al., 2013) to generate samples. Based on the method used to generate these data sets, we have divided the samples into two classes of structures:
• RosettaDock structures: For each protein named above, we have generated 1000 refinement candidates from coarsely docked protein structures produced by RosettaDock, which resulted in 48,000 samples. The RMSD values of the structures range between 0.92 and 24.46 Å. • pyDock structures: Similarly, we generated 1000 refinement candidates from complexes produced by pyDock (Jiménez-García et al., 2013) without any distance restraints. These samples have RMSD values that vary between 0.82 and 26.88 Å.
The distribution of RMSD values for these data sets is shown in Figure 2, and more statistics are presented in Table 1.

RMSD (Å) distribution of
2.2. Test data sets
The models were tested on three data sets of refinement candidates produced from coarsely docked protein complexes generated by RosettaDock and pyDock. The sets include six proteins named 1R0R, 2A9K, 2AYO, 2G77, 2SNI, and 7CEI. For each protein, we selected five docked complexes, and for each of these complexes, 200 refinement candidates were generated by applying small-scale rigid body rotations resulting in a total of 6000 samples in each data set. The test data sets are divided into three classes based on the method that was used to generate them and the samples' RMSD range:
• Refinement candidates 1: generated by RosettaDock, with RMSD range between 0.92 and 7 Å. Figure 1d displays the RMSD distribution of these structures. • Refinement candidates 2: generated by RosettaDock, with RMSD between 0.92 and 15.16 Å. The RMSD distribution is shown in Figure 3a. • Refinement candidates 3: generated by pyDock program, with RMSD between 1.26 and 16.40 Å. The RMSD distribution is shown in Figure 3b.

RMSD (Å) distribution of
Table 1 shows the summary of the statistics for all training and test data sets.
2.3. Features
Our methods approximate the relationship between 16 different features and the RMSD of a protein complex with respect to its native structure. The majority of these features are used as scoring function terms by a number of docking and refinement methods and were used by us in the past (Akbal-Delibas et al., 2012).
• VdW: Computed for interface atoms (atoms within at most 6 Å to the adjacent chain atoms).
• Electrostatic: Computed for interface atoms.
• Interface conserved atom ratio: The ratio of the evolutionarily conserved interface atoms to the total interface size.
• Protein Category: 1: Antibody Antigen, 2: Antigen-Bound Antibody, 3: Enzyme Inhibitor, 4: Enzyme complex with a regulatory or accessory chain, 5: Enzyme Substrate, 6: G-protein containing, 7: Receptor containing, 8: miscellaneous. The categories are taken from the Protein–Protein Docking Benchmark v.5 (Vreven et al., 2015).
• The ratios of interface atoms belonging to residue types to the total interface size: Hydrophobic (A, C, G, I, L, M, P, V); Positively Charged (H, K, R); Negatively Charged (D, E); Polar (N, Q, S, T); and Aromatic (F, H, W, Y).
2.4. Prediction methods
We developed three models to predict the RMSD of the tested candidates. We now describe the configuration of each model.
2.4.1. Two-layer neural network
In Akbal-Delibas et al., 2014, 2015a,b, we introduced a neural network trained using the backpropagation algorithm. In this study, we use the same network with 16 features described above. The network has 16 input neurons, which receive input data, including 16 features that characterize a protein structure. Eight of these features consist of continuous values and were initially normalized to the range of
2.4.2. Multilayer neural network
This model has one input layer, three hidden layers, and one output layer. For comparison purposes, this network uses the same 16 input neurons to receive the input data. After parameter tuning, the network had 50 neurons in the first hidden layer and 70 and 100 neurons in the next two hidden layers. Similarly, the output layer consisted of one neuron that produces the predicted RMSD value. This network was also trained 300 epochs.
2.4.3. RBMs network
This network consists of three layers. The input layer is similar to the input layers of the two previous models. The hidden layer, which is fully connected to the input layer, includes 20 RBMs (Yu and Deng, 2011), and finally, the output layer consists of one output neuron. The RBMs were trained in an unsupervised manner by utilizing the feature values only, using a contrastive divergence (Hinton, 2012) algorithm and then weights between RMB layer and output layer were updated using the backpropagation learning method.
3. Results and Discussion
We conducted two set of experiments with our two data set collections of different RMSD ranges. We divide each subsection into an experiment with refinement candidates and cross-validation.
3.1. Experiments with data sets of smaller RMSD range
For the first set of experiments, we used the smaller data sets with RMSD ranges limited to a maximum value of 7 Å. First, we describe our prediction results after training the models with our data sets of smaller RMSD range. We then present our model comparison results after conducting 10-fold cross-validation.
3.1.1. Experiments with refinement candidates
First, we trained the models with the data set of RosettaDock docked structures. Then, we tested our test set of 6000 refinement candidates for six different proteins. We compared the predicted RMSD and the actual RMSD values through the root-mean-square (RMS) of the error made by the models. The Pearson correlation coefficients of the predicted and actual RMSD values and the error are listed in Table 2. The smallest prediction error was achieved by the MLNN and was 1.45 Å. The two-layer neural network (TLNN) performed slightly worse with 1.48 Å prediction error. In addition, the Pearson correlation coefficient between the predicted and actual RMSD values predicted by MLNN was higher (0.41). Next, we used the data set of refinement candidates for training. The error and correlation coefficients are listed in Table 3. As shown in the table, the prediction accuracy of the TLNN and MLNN deteriorated, while the restricted Boltzmann machine network (RBMN) showed a smaller error compared to the case where it was trained with the docked structures. Finally, we trained the models using the data set with both docked structures and refinement candidates. The results are presented in Table 4. Again, the lowest error (1.3 Å) was obtained using the MLNN. The accuracy of TLNN and RBMN was worse than the previous experiment. Here is the summary of our observations with respect to the training data set impact on the prediction accuracy of the models:
MLNN, multilayer neural network; RBMN, restricted Boltzmann machine network; PDB, protein data bank; RMSD, root mean squared deviation; TLNN, two-layer neural network.
• The TLNN performed best when trained with the data set of only docked structures. Its prediction error increased by 33% and 16% when the refinement candidates data set and the combined data set, respectively, were used for the training.
• The MLNN showed the lowest prediction error when trained using both docked structures and refinement candidates. The prediction error in this case was 18% lower than the case where it was trained using the refinement candidates data set. Furthermore, we observed a 9% increase in error, when we trained this model using docked structures only.
• The RBMN achieved the lowest error with the refinement candidates training data set.
The correlation coefficients and the prediction errors vary significantly from one protein to another. This is mainly due to the considerable diversity in the feature values and RMSD distributions of the test proteins. Furthermore, it indicates that each of the models was able to capture certain characteristics in the existing relation among the features and RMSD values of the test complexes and suggests that the prediction accuracy of the models can be improved by increasing the diversity among the training samples.
3.1.2. Model comparison by cross-validation
To analyze the performance of the models, we conducted a set of 10-fold cross-validation experiments. We randomly divided the data sets into two partitions for training and testing. The training consisted of 90% of the samples that were randomly selected and the remaining samples were used for testing. The average prediction errors of the models trained using the data sets are listed in Table 5. The cross-validation results show that the MLNN is outperforming the other two models in all the cases.
The best result in each category is shown in bold font.
Combined, docked and refinement structures data set; Docked, docked structures data set; Refined, refinement candidates data set.
3.2. Experiments with data sets of higher RMSD range
For the second set of experiments, we used our newer data sets with higher RMSD ranges of up to 27 Å. We present the results in two parts as before.
3.2.1. Experiments with refinement candidates
We trained the models with the data set that included refinement candidate structures produced by RosettaDock. Then, we tested the refinement candidates produced by RosettaDock. We compared the predicted and the actual RMSD values by calculating the RMS of the error made by the models. The Pearson correlation coefficients of the predicted and actual RMSD values and error are listed in Table 6. The smallest average error, 2.21 Å, and highest correlation coefficient (0.54) were achieved by the MLNN. The TLNN and RBMN performed very close to each other, being slightly worse than MLNN with 2.30 and 2.29 Å RMS errors, respectively. RBMN had a higher correlation coefficient on average.
Next, we used the data set of refinement candidates generated by pyDock for training. The results are listed in Table 7. The prediction accuracy of MLNN was the highest among the models with an average error of 2.84 Å and correlation coefficient of 0.43. The TLNN performed slightly worse with 2.88 Å average error and the RBMN performed the worst with 3.34 Å average prediction error. Generally, the errors of the models trained by pyDock complexes were larger and the correlation coefficients were lower compared to the models trained with samples generated by RosettaDock. We attribute this to having considerably more samples with lower RMSD values in the data sets generated by RosettaDock, while the complexes obtained by pyDock had a much wider range of RMSD values.
As seen, the MLNN outperforms the other two models independent of the training data set. As before, the correlation coefficients and the error of the models vary for different proteins. For instance, the TLNN and MLNN trained with the structures generated by RosettaDock showed the smallest error and highest correlation coefficient for 7CEI, while the RBMN had the lowest error in the experiments with the refinement candidates that belong to 2G77.
3.2.2. Model comparison by cross-validation
We conducted 10-fold cross-validation experiments with the two new data sets. We divided the samples into training and testing in an iterative manner, where no samples generated for a particular protein could fall in both training and testing sets. The errors and correlation coefficients of the models trained using RosettaDock data set are listed in Table 8. The results show that both the TLNN and MLNN are performing very closely with average errors of 1.91 and 1.92 Å, respectively, while the MLNN is showing slightly higher average correlation coefficients of 0.54. The cross-validation results with pyDock data set are shown in Table 9. The TLNN is showing the least error of 3.39 Å and highest correlation coefficients of 0.40. Once again the RBMN is the worst performing model.
4. Conclusions
We presented three models to accurately discriminate native-like structures during protein–protein docking and refinement: a TLNN, a MLNN, and a RBMN. These methods enabled us to approximate the nonlinear relationship between a set of selected features and a structure's similarity to its native conformation. We trained the models using several data sets with the primary motivation of investigating their predictive power for ranking of the refinement candidates. We tested the models with a group of refinement candidates generated from six other proteins. The RBMN produced the lowest error when trained with the refinement candidates data set of a smaller RMSD range. In the rest of the experiments, the MLNN outperformed the other models. Based on our experiments, MLNN is the method that shows the highest prediction accuracy in the majority of cases.
It is worth mentioning that despite training the models with samples of higher RMSD values, the average prediction error is still relatively small, which demonstrates the robustness of the methods to large RMSD changes. Furthermore, with the data sets of wide RMSD range, the correlation between the predicted and actual RMSD values was higher than the correlation obtained previously with data sets of lower RMSD ranges, being close to 0.5 on average and as high as 0.85 in some cases. This confirms that the models could benefit from a training data set with a diverse and wide range of samples. Future work includes the following directions. First, we plan to study other prediction methods. Second, we intend to boost our feature set by introducing new features that represent the protein complexes more accurately. Third, we are interested in testing the accuracy of these models on proteins under the medium and difficult categories of the Protein–Protein Docking Benchmark 5.0, as well as multimers. Finally, we will use these models as the ranking tool in a new method for refining coarsely docked protein complexes.
Footnotes
Acknowledgment
The research was funded in part by an NSF grant CCF-1421871 (N.H.).
Author Disclosure Statement
No competing financial interests exist.
