PEM-fitter: A Coarse-Grained Method to Validate Protein Candidate Models

Abstract

The volumetric images produced by the cryo-electron microscopy (cryo-EM) technique are used to model macromolecular assemblies and machines. De novo protein modeling uses these images to computationally model the structure of the molecules. Many candidate conformations are usually generated during the intermediate step. Conventionally, each of these candidates is evaluated by time-consuming approaches such as potential energy calculations. We introduce an initial version of a geometrical screening method that uses the skeleton of the cryo-EM images to evaluate candidate structures. The aim of this method is to reduce the candidate conformations to a smaller set of native-like ones and, therefore, reduce the time required for structural evaluation by energy calculations. Two datasets were tested. The first dataset contains 10 proteins and shows that our method can successfully detect the correct native structure for the given skeleton among a set of different protein structures. The second dataset contains 12 proteins and shows that our method can filter slightly modified decoy conformations of the same protein. The efficiency of the method is also reported.

1. Introduction

Three-dimensional (3D) structural information of biological systems such as proteins, macromolecular assemblies, and protein inhibitor complexes is crucial to understanding structure-function relationships. The knowledge of the structure of the target biological compound (i.e., protein macromolecules) is required, for example, for structure-based drug design. To date, the gap between the number of known protein sequences and the number of determined structures is huge. Therefore, it is advantageous to apply current hardware power and computational tools to minimize this gap.

The three main biophysical techniques used to determine the structure of proteins are X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). Although X-ray crystallography and NMR are the dominant techniques for protein structure determination, they are subject to numerous limitations (Pearson and Mozzarelli, 2011; Emwas, 2015; Zheng et al., 2015). Some of these limitations are the required quantity and purity of the sample, the requirement that the protein should be crystallizable, and the stability of the structure. These limitations are particularly troublesome for macromolecular machines and some protein molecules such as viral capsids. In contrast, cryo-EM has been shown to be a powerful biophysical technique that is capable of examining macromolecular machines in their native environment. Further, it does not suffer from crystallization problems. Therefore, it can determine the structure of types of proteins that are hard to resolve by using conventional methods, for example, membrane-bound proteins. In addition, cryo-EM is expected to be the main workhorse to capture the molecular interaction between large complexes within cells (Mitra and Frank, 2006; Frank, 2009). Cryo-EM produces volumetric images that are used to determine the structure of the protein.

Most of the images are produced at sub/nanometer resolutions (>5 Å). It is challenging to determine atomic structures from images generated at sub/nanometer resolution by using current technologies. Therefore, improved computational methods are critically important for addressing this problem. The number of prospective sub/nanometer cryo-EM images is rapidly growing with the greatly improved detectors that are now available. The development of powerful and automatic computational methods to derive the atomic structure underlying a cryo-EM volume would greatly advance the role of cryo-EM as a complement to traditional methods. At resolution worse than 5 Å, the image becomes unclear, and, therefore, the atomic model cannot be constructed directly. However, computational methods are capable of extracting features, analyzing, and utilizing the image to derive the structure of the protein. These methods can be classified into either fitting or de novo modeling.

When a high-resolution atomic structure is available for small proteins or for a part of large proteins, fitting and refinement tools have shown the ability to derive the atomic structure of a protein from cryo-EM images (Topf et al., 2005, 2006, 2008; Lu et al., 2007, 2008; DiMaio et al., 2009). Given an initial structural model, the volumetric image is used to refine and fit the model structure and to construct an accurate high-resolution all-atom protein model. The refinement process uses a fitting score that measures how well the model fits into the volumetric image and identifies mismatched regions between the model and the image. Two types of fitting can be used: rigid fitting and flexible fitting (Volkmann and Hanein, 1999; Rossmann, 2000; Jiang et al., 2001; Wriggers and Chacón, 2001; Chacón and Wriggers, 2002; Topf et al., 2008; Pintilie et al., 2010; Brown et al., 2015; Gydo and Alexandre, 2015). In rigid fitting techniques, the structure being fitted is aligned with the image as one component. When the atomic structure of the imaged molecule is not the same as in the assembly, rigid fitting cannot be used. To overcome this problem, flexible fitting should be considered, where the conformation of the structure being fitted can be changed to improve the correspondence to the cryo-EM image (Wriggers et al., 1999, 2000; Volkmann et al., 2000; Wriggers and Birmanns, 2001; Ming et al., 2002a, 2002b; Tama et al., 2004; Wells et al., 2005; Suhre et al., 2006; Velazquez-Muriel et al., 2006; Schröder et al., 2007; Jolley et al., 2008).

In the absence of a high-resolution structure corresponding to the image or part of it, which is the case for most macromolecules, it is not feasible to fold the sequence to the image (Lindert et al., 2009). Therefore, instead of rigid or flexible fitting and refinement using cryo-EM images, de novo modeling of the protein structure is used. De novo protein modeling is the set of computational methods and algorithms used to construct the 3D structure of a protein by using its primary sequence and cryo-EM image. Cryo-EM is superior when working on nanostructure biological complexes such as viruses, small organelles, and macromolecular biological complexes. Therefore, de novo modeling is able to model certain types of proteins when traditional computational methods fail. For instance, ab initio modeling is not suited for relatively large protein macromolecules due to the complexity of their folding and the huge size of the search space. Likewise, comparative modeling fails with membrane proteins due to the diversity of their structures and functions. It is hard to find a template from the pre-determined structures for the target protein.

Many de novo modeling approaches have been proposed (He et al., 2004; Wu et al., 2005; Dal Palu et al., 2006; Al Nasr and He, 2009; Lindert et al., 2009, 2012; Baker et al., 2011; Al Nasr et al., 2012). In general, de novo modeling techniques generate thousands of candidate structure configurations in an intermediate step and evaluate them to select the most suitable, native-like ones. Wu et al. (2005) used a geometry filter followed by an energetics-based evaluation. The energy evaluation is carried out by a knowledge-based pair-wise potential function to evaluate fold candidates. One drawback of their work is the amount of computation time required to accomplish the search process. Because of this cost, this method is not suitable for medium or large proteins. Baker et al. (Abeysinghe and Ju, 2009; Baker et al., 2011) have used Gorgon in a semi-automatic approach to build the structures of the molecules; however, this modeling process relies greatly on user-system interaction. Lindert et al. (2009, 2012) proposed a de novo folding approach called EM-Fold. EM-Fold uses a Monte Carlo refinement process to improve the placement of structural SSEs (secondary structure elements). In a subsequent step, the loops and side chains are added by Rosetta's iterative side-chain repacking and backbone reconstruction protocols that generate a model at atomic resolution (DiMaio et al., 2009). Although this method can work with a large solution space, the stochastic nature of the approach may miss the correct topology.

Several de novo techniques use the skeleton of the given volume image to reduce the search space and help in modeling (Lindert et al., 2009, 2012; Al Nasr et al., 2014). A skeleton reveals useful information that highlights the connections between secondary structure elements and, therefore, speeds up the modeling process.

In this article, we introduce a skeleton-based coarse-grained approach that uses the cryo-EM skeleton to validate the candidate structures in the intermediate step in de novo modeling. The method uses the high-quality skeletons that can be extracted from cryo-EM volumetric images by using a tool developed recently (Al Nasr et al., 2013a, 2013b). The aim is to quickly rank the candidates to determine which are suitable for further evaluation by using potential energy calculations. Thus, instead of evaluating all of the candidates using energy methods, we will reduce the number of candidates drastically to speed up the process. The initial version of the method was presented at the IEEE Bioinformatics and Biomedicine (BIBM) conference (Al Nasr et al., 2016). Since our initial version, our method has been subject to further tests. Initially, the method was examined for its ability to distinguish the candidate structure that represented the protein in a cryo-EM volumetric image from other, dissimilar, protein candidate structures. In this revision, we expanded the range of testing; the method is examined for its ability to differentiate the experimentally determined structure of a protein from a set of slightly modified structures of the protein. The results of this expanded testing are presented herein. The overview of the method is depicted in Figure 1. The candidates are rigidly aligned with the skeleton, and a “fitness score” is calculated (i.e., Root Mean Square Deviation [RMSD] between a skeleton and a candidate structure). The candidates with the best scores are marked as good models and are appropriate for further evaluation and refinement.

FIG. 1.

The overview of our method. The skeleton of the cryo-EM volume image is extracted and used as a reference object to geometrically screen the candidate structures constructed in de novo modeling. Each structure is aligned with the skeleton by using ICP, and a score (i.e., RMSD) is calculated. cryo-EM, cryo-electron microscopy; ICP, Iterative Closest Point; RMSD, root mean square deviation.

2. Methods and Approaches

In this article, we propose a fast and coarse-grained geometrical screening method to speed up de novo modeling of protein structures. Our method is based on the idea that rapid rejection of non-viable candidate structures offers an inexpensive approach for reducing the amount of computational power that must be expended to validate thousands of candidate structures modeled during the intermediate step in de novo modeling. The candidate structures with the best scores in our method (i.e., the lowest RMSD to the skeleton) are expected to be native-like and are appropriate for further investigation such as energy validation and structure refinement.

The method uses a skeleton extracted from the target cryo-EM volume image as a reference or target shape (Al Nasr et al., 2013a). The advantage of using a skeleton and not a cryo-EM volume image is that a skeleton is a simplified and thin version (usually one voxel in width) of the original volume that is topologically comparable to the object and highlights its geometrical and structural features. Thus, most of the structural features and geometrical properties of the object such as the backbone structure of the protein are carried by the skeleton. Therefore, the complexity of processing is reduced. The candidate structure will be aligned with the skeleton by using the Iterative Closest Point (ICP) algorithm (Besl and McKay, 1992; Zhang, 1994; Pomerleau et al., 2013). However, before the actual alignment, the candidate structure is preprocessed to simplify its representation and to roughly align it with the skeleton. This rough alignment is needed since a requirement for ICP to converge is that the two objects being aligned have a close initial orientation and position. Although both sets of data represent complexly folded, linearly connected, 3D structures, while applying our method we treat each as a disconnected point cloud and allow the two structures to move freely through each other. No transformations are performed that alter their shapes in the procedure (i.e., a rigid alignment is performed).

Since the skeleton most likely holds the shape of the backbone of a molecule, the all-atom structure of a candidate model is not necessary. Therefore, spatial data for the candidate structure is simplified by removing all information except the location information for the Cα atoms on the backbone. This reduced structure is faster to process and preserves the geometrical shape of a candidate structure.
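The reduction to Cα atoms can be illustrated with a short Python sketch. It is not taken from PEM-fitter: the names `read_ca_coordinates` and `format_atom_line` are hypothetical, and the parser assumes standard fixed-column PDB ATOM records (atom name in columns 13-16, coordinates in columns 31-54).

```python
def read_ca_coordinates(pdb_lines):
    """Return a list of (x, y, z) tuples, one per C-alpha atom.

    Assumes fixed-column PDB ATOM records: atom name in columns 13-16,
    x, y, z in columns 31-38, 39-46, and 47-54.
    """
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords


def format_atom_line(serial, name, residue, res_seq, x, y, z):
    """Hypothetical helper: build a minimal ATOM record for the demo below."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f" % (
        serial, name, residue, res_seq, x, y, z)


# Made-up coordinates, used only to exercise the parser.
demo = [
    format_atom_line(1, " N  ", "MET", 1, 1.0, 2.0, 3.0),
    format_atom_line(2, " CA ", "MET", 1, 4.5, 5.5, 6.5),
    format_atom_line(3, " C  ", "MET", 1, 7.0, 8.0, 9.0),
    format_atom_line(4, " CA ", "LYS", 2, -1.25, 0.0, 10.125),
]
ca_points = read_ca_coordinates(demo)  # keeps only the two CA records
```

Keeping one point per residue means the candidate cloud has as many points as the protein has residues, which is what makes the later nearest-neighbor queries inexpensive.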

To make the orientation of the two point clouds close, we use Principal Component Analysis (PCA) to determine their long axes. PCA is a statistical technique that is used in various fields of image processing such as facial recognition and image compression. PCA is used to find the rough shape of the voxels in the objects being compared. The three axes of each object are found and used to change the initial orientation of the point clouds being compared. In PCA, the coordinates of voxels are transformed to a new coordinate system. Let the input structure (i.e., skeleton or candidate model) c consist of N voxels and let $\overline{c} = (\overline{x}, \overline{y}, \overline{z})$ be the centroid of c. Let X_c be an N × 3 matrix that represents the difference of each voxel in c with $\overline{c}$ as follows:

$$X_c = \begin{bmatrix} x_1 - \overline{x} & y_1 - \overline{y} & z_1 - \overline{z} \\ x_2 - \overline{x} & y_2 - \overline{y} & z_2 - \overline{z} \\ \vdots & \vdots & \vdots \\ x_N - \overline{x} & y_N - \overline{y} & z_N - \overline{z} \end{bmatrix} \tag{1}$$

and

$$CovM = \frac{1}{N - 1} X_c^T \cdot X_c \tag{2}$$

where CovM is the 3 × 3 covariance matrix, and $X_c^T$ is the transpose of X_c. Let $e_1$, $e_2$, and $e_3$ be the three eigenvectors of CovM and $\lambda_1$, $\lambda_2$, and $\lambda_3$ be the corresponding eigenvalues, respectively. In this preprocessing step, we align the two objects by using their longest axis (i.e., the eigenvector with the largest eigenvalue) to ensure that the matching objects are similarly oriented. In addition, by translating the candidate structure in this step, we ensure that the matching objects are similarly positioned (Fig. 2).
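The preprocessing step can be sketched with NumPy as follows. This is a minimal illustration of Equations (1) and (2) followed by a rotation of the candidate's longest axis onto the skeleton's, not the authors' implementation. The function names are hypothetical, and two choices are assumptions that the text does not specify: the translation is done by matching centroids, and the arbitrary sign of an eigenvector is resolved by picking the smaller of the two possible rotations.

```python
import numpy as np


def principal_axes(points):
    """Centroid, eigenvalues, and eigenvectors (columns), largest eigenvalue first."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    x_c = pts - centroid                      # Equation (1)
    cov = x_c.T @ x_c / (len(pts) - 1)        # Equation (2)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return centroid, eigvals[order], eigvecs[:, order]


def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues' formula).

    Requires that a and b are not antiparallel.
    """
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    k = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + k + k @ k / (1.0 + c)


def pre_align(candidate, skeleton):
    """Move the candidate so its centroid and longest axis match the skeleton's."""
    cand = np.asarray(candidate, dtype=float)
    c_cand, _, axes_cand = principal_axes(cand)
    c_skel, _, axes_skel = principal_axes(skeleton)
    a, b = axes_cand[:, 0], axes_skel[:, 0]
    if np.dot(a, b) < 0.0:   # assumption: resolve the eigenvector sign ambiguity
        b = -b               # by choosing the smaller of the two rotations
    rot = rotation_between(a, b)
    return (cand - c_cand) @ rot.T + c_skel
```

Because only the first principal axis is matched, the result is a rough starting pose; the remaining rotation about that axis is left for ICP to resolve.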

FIG. 2.

PCA used to align the two structures for initial orientation. The PCA $(\overrightarrow{e_1}, \overrightarrow{e_2}, \overrightarrow{e_3})$ is calculated for each structure and then $\overrightarrow{e_1}$ is used to align them for the initial guess as input to ICP. PCA, principal component analysis.

Once the two point clouds are aligned along the same general axis, we use the ICP algorithm to refine the alignment. There are many variants of ICP, some of which allow for deformation of the point clouds. This deformation is useful if one is trying to match visual data where the apparent shape of an object is altered due to foreshortening. Since we are trying to rigidly register points in two clouds, we have chosen to use an approach that holds the relative locations of the points within each cloud static.

ICP is an efficient algorithm used in many applications to find the transformation between two point clouds by minimizing the squared error between the clouds (Besl and McKay, 1992). It is an example of a gradient descent approach in which the correspondence of point clouds is reevaluated as the solution comes closer to the local minimum of the error.

Starting from the initial alignment guess that we generated during the preprocessing, ICP iteratively finds the closest set of points in the two clouds and calculates the rotation R and translation t to be applied to the candidate structure to minimize the squared error E (Fig. 3). Let the skeleton point cloud be $S = \{s_1, s_2, \ldots, s_m\}$ and the candidate model be $M = \{c_1, c_2, \ldots, c_n\}$; then in every iteration E is calculated as

$$E = \frac{1}{n} \sum_{i=1}^{n} \left\| s_i - (R c_i + t) \right\|^2, \tag{3}$$

where, within the sum, $s_i$ denotes the skeleton point currently closest to $c_i$.

FIG. 3.

Illustration of the ICP algorithm. (a) The initial configuration is the starting orientation (i.e., guess) where the candidate structure is moved into close proximity to the skeleton. In each iteration (b–d), a list of closest points from the skeleton is marked and the candidate structure is moved to minimize the distance between these points.

To accelerate the calculations of E, the location data for the voxels making up the skeleton are placed into a k-dimensional tree (kd-tree) (Bentley, 1975). The nearest neighbor in the cryo-EM skeleton of each point in the candidate structure is found by querying the kd-tree. The use of the kd-tree in this step reduces the running time of calculating E from $O(mn)$, where m is the number of points in the cryo-EM skeleton cloud and n is the number in the candidate structure cloud, to $O(n \log m)$. Note that $m \gg n$; therefore, the computational cost of building the kd-tree seems justified. To find the best rotation and translation, we use the Singular Value Decomposition (SVD). In each iteration of ICP, we apply the rotation and translation to the candidate structure points and repeat until the iteration cutoff $\sigma$ is reached. Our current implementation uses $\sigma = 50$. However, more investigation is required to find the best value for $\sigma$. The pseudo code of our approach is shown in Figure 4.
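The nearest-neighbor lookup described above can be illustrated with a small pure-Python kd-tree. This sketch only shows the idea of building the tree once from the skeleton points and then querying it for each candidate point. It is not the nanoflann-based implementation used by the method, and the names `build_kdtree`, `sq_dist`, and `nearest` are hypothetical.

```python
def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through x, y, z."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return the stored point closest to the query point."""
    if node is None:
        return best
    if best is None or sq_dist(query, node["point"]) < sq_dist(query, best):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, best)
    return best
```

The pruning test in the last branch is what avoids visiting most of the tree for a typical query, in contrast to a brute-force scan over all m skeleton points.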

FIG. 4.

The pseudo code for our method (Protein-EM fitter).
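The alignment loop described above can be sketched with NumPy as follows. This is an illustrative rigid point-to-point ICP under stated assumptions, not a transcription of the pseudo code in the figure: the function names are hypothetical, a brute-force search stands in for the kd-tree, and the SVD step follows the standard Kabsch construction.

```python
import numpy as np


def nearest_points(src, ref):
    """For each row of src, return the closest row of ref (brute force)."""
    d2 = ((src[:, None, :] - ref[None, :, :]) ** 2).sum(axis=2)
    return ref[d2.argmin(axis=1)]


def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~ R @ src + t (SVD)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    h = (src - c_src).T @ (dst - c_dst)       # 3 x 3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))    # guard against a reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, c_dst - rot @ c_src


def icp(candidate, skeleton, iterations=50):
    """Rigidly move the candidate toward the skeleton for a fixed number of steps."""
    cand = np.asarray(candidate, dtype=float).copy()
    skel = np.asarray(skeleton, dtype=float)
    for _ in range(iterations):
        matched = nearest_points(cand, skel)  # closest skeleton point per candidate point
        rot, t = best_rigid_transform(cand, matched)
        cand = cand @ rot.T + t               # apply R and t to every candidate point
    return cand
```

The default of 50 iterations mirrors the cutoff reported in the text; a convergence test on the change in error could end the loop earlier, but that is a design choice outside what the article describes.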

After ICP aligns the two point clouds over the course of $\sigma$ iterations, our method calculates the RMSD between the candidate structure and the skeleton. The Cα atoms of the backbone and the corresponding closest points from the skeleton are used to calculate the RMSD. The candidate structures with the lowest RMSD scores are expected to be native-like and good structures for refinement in further steps in de novo modeling.
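As a minimal sketch of this scoring step, the RMSD can be computed from each Cα point and its closest skeleton point. The function name `skeleton_rmsd` is hypothetical, and a plain linear scan is used here for the closest-point search in place of the kd-tree.

```python
import math


def skeleton_rmsd(candidate, skeleton):
    """RMSD between candidate points and their closest skeleton points."""
    total = 0.0
    for c in candidate:
        # Squared distance from this candidate point to its nearest skeleton point.
        total += min((c[0] - s[0]) ** 2 + (c[1] - s[1]) ** 2 + (c[2] - s[2]) ** 2
                     for s in skeleton)
    return math.sqrt(total / len(candidate))


# Toy coordinates, for illustration only.
toy_skeleton = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_candidate = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
toy_score = skeleton_rmsd(toy_candidate, toy_skeleton)
```

Note that the score is asymmetric: it averages over the candidate points only, so skeleton points with no nearby Cα atom do not contribute.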

3. Experimental Results

To examine the accuracy of our geometrical screening method (PEM-fitter), a random set of 10 protein structures from the Protein Data Bank (PDB) was used. The cryo-EM volume images were synthesized by using the Chimera software package (Pettersen et al., 2004) at 10 Å resolution. We used several open source packages to support the implementation of our method. The nanoflann ver. 1.2.2 (https://github.com/jlblancoc/nanoflann; accessed May 2, 2017) implementation of the kd-tree is used. The Eigen ver. 3.2.8 (http://eigen.tuxfamily.org; accessed May 2, 2017) package is used to implement the SVD. Skel-EM (Al Nasr et al., 2013a) is used to extract the skeletons of the synthesized cryo-EM images. The Boost C++ Libraries ver. 1.63 (www.boost.org; accessed May 2, 2017) were used to simplify file access. The number of iterations used in this test is 50.

Given a set of candidate models, our method is expected to be able to mark the best candidate model that is most structurally similar to the native protein structure based on the given skeleton. To do so, we aligned each skeleton against each of the 10 protein models in the data set that are used as candidate structures, including the correct native structure of the skeleton. In this test, the models used are from different proteins. The results of the alignments were ranked in ascending order based on RMSD calculations against the skeleton by using our method of ICP. The best structure candidate is the candidate with the minimum score. Table 1 summarizes the results of our test. The test was performed on a Dell OptiPlex 790 desktop personal computer with Intel i5-2400 3.1 GHz processor and 8 GB of memory.

Table 1.

Ranking of Protein Structures Against Skeleton

Cryo-EM skeleton ^a	Best candidate ^b	Best RMSD (Å) ^c	Time (ms) ^d
1FLP	1FLP	2.63	12.3
1HZ4	1HZ4	2.30	14.0
1NG6	1NG6	2.24	12.2
2PSR	2PSR	5.41	11.1
2XVV	2XVV	2.36	14.2
3HJL	3HJL	2.19	16.1
3LTJ	3LTJ	2.07	7.8
3ODS	3ODS	2.63	12.4
4X5W	4X5W	2.29	15.0
5FJ6	5FJ6	2.09	18.8

^a The protein structure (PDB ID) used to generate the volume image and then the skeleton.

^b The model (PDB ID) picked by our method as the most structurally similar to the skeleton.

^c The calculated RMSD between the best candidate and the skeleton.

^d Average processing time for the method to align the 10 structures with the skeleton.

cryo-EM, cryo-electron microscopy; RMSD, root mean square deviation.

As expected, the method successfully detected the observed protein structure for the given skeleton as the best candidate in every test. Table 1 reports the best RMSD reached by the alignment of the 10 models against each skeleton. As shown, for each skeleton, the candidate with the lowest RMSD was the true model for that skeleton (columns 2 and 3). In addition, the method was time efficient in processing the 10 candidates against the skeleton (column 4). The time reported is the average time, in milliseconds, taken to align the skeleton with the 10 structures. As an example, Figure 5 shows the skeleton 4X5W aligned with protein models 1FLP, 1NG6, 2PSR, 3LTJ, and 4X5W. The RMSD scores are 4.7, 3.8, 4.0, 5.0, and 2.3 Å, respectively. The average time taken for this example is 15 ms.
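The ranking step itself is a sort in ascending order of RMSD. The short Python sketch below applies it to the five scores quoted above for the 4X5W skeleton; the function name `rank_candidates` is hypothetical.

```python
def rank_candidates(scores):
    """Sort (PDB ID, RMSD) pairs in ascending order of RMSD."""
    return sorted(scores.items(), key=lambda item: item[1])


# RMSD (in angstroms) of each model against the 4X5W skeleton, as quoted in the text.
scores_4x5w = {"1FLP": 4.7, "1NG6": 3.8, "2PSR": 4.0, "3LTJ": 5.0, "4X5W": 2.3}

ranking = rank_candidates(scores_4x5w)
best_id, best_rmsd = ranking[0]   # the candidate with the minimum score
```

In a screening setting, the first few entries of `ranking` rather than only the top one would be passed on to the energy-based evaluation.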

FIG. 5.

The alignment of the skeleton of 4X5W cryo-EM image. (a) 1FLP. (b) 1NG6. (c) 2PSR. (d) 3LTJ. (e) 4X5W.

To test the ability of PEM-fitter to rank structures of the same protein, we have built a random set of 12 protein structures. This represents the case when models/decoys are constructed for a given protein in de novo modeling techniques. Conventional de novo modeling uses potential energy calculations to distinguish between the candidate models. This approach is time consuming. Therefore, our method aims at speeding up the process by identifying a set of models that are most native like. These models could then be subjected to conventional potential energy comparisons. A number of conformational models were generated for each protein (Table 2, column 3) by using the method described in Al Nasr et al. (2012) and Al Nasr and He (2014). Cryo-EM volume images were synthesized for the native protein of these models at 10 Å resolution. Skel-EM (Al Nasr et al., 2013a) was used to extract the skeletons of the cryo-EM volume images.

Table 2.

Ranking of Structures for the Same Protein

No.	ID ^a	No. of models ^b	Rank of native ^c	RMSD of native (Å) ^d	Min RMSD (Å) ^e	Max RMSD (Å) ^f	Avg RMSD (Å) ^g	SD RMSD (Å) ^h	%RMSD <6 Å/<10 Å ^i
1	1A7D	28	1	2.20	3.54811	7.32637	5.580727037	0.832532572	67/100
2	1BZ4	101	14	2.25	3.18005	19.408	4.9582514	2.200525593	95/96
3	1HZ4	101	1	2.29	4.7393	12.5332	5.9971952	1.127301539	68/99
4	1JMW	101	1	2.33	3.96325	15.8762	5.9384874	1.654145758	67/97
5	1NG6	101	1	2.24	3.80425	15.7516	5.2362289	1.49561688	91/98
6	1Z1L	88	1	2.84	5.78036	14.7157	7.441232759	1.54860547	7/93
7	2XB5	101	1	2.26	4.82086	11.2412	6.4399178	0.922919418	31/99
8	3ACW	135	1	2.11	4.29041	17.0723	6.772001418	2.847445231	67/90
9	3FIN	101	1	2.35	4.71787	14.5322	6.5484252	1.747697231	48/93
10	3HJL	101	1	2.38	4.32212	25.8416	5.910091	2.869142449	78/98
11	3LTJ	101	1	2.07	4.30302	14.598	6.4091343	1.687379438	54/98
12	3ODS	100	1	2.11	4.99044	13.7121	6.874364545	1.430442306	25/97

^a The protein model (PDB ID) used in the test.

^b Number of candidate conformations used to evaluate against the skeleton in addition to the native.

^c The rank of the native structure among all candidates used.

^d The RMSD between the native structure and the skeleton.

^e The minimum RMSD between the native structure and the candidates.

^f The maximum RMSD between the native structure and the candidates.

^g The average RMSD between the native structure and the candidates.

^h The standard deviation in the RMSD values between the native structure and the candidates.

^i The percentage of candidates within 6/10 Å RMSD from the native structure.

Table 2 shows the structural difference of the candidates with the native structure. The minimum, maximum, and average RMSD of the candidates with the native structure are reported in columns 6, 7, and 8, respectively. The standard deviation is reported in column 9. Due to the errors and imperfections in the skeleton or the original cryo-EM image, the native structure's RMSD with the skeleton is not equal to zero. Column 5 reports the RMSD of the native structure. Ideally, the RMSD of the native structure is the minimum that can be reached. Most of the candidate conformations were structurally similar to the native model. As shown in column 10, 9 out of 12 proteins have at least 95% of the candidate conformations within 10 Å of the native model. Figure 6 shows the 95% confidence interval for the 12 protein structures using error bars. This imitates a realistic scenario for any de novo modeling technique. Figure 6 provides a simple statistical picture of the RMSD for the candidate structures from the protein native model. The box plots show that our data were approximately normally distributed for 1JMW, 1NG6, 1Z1L, 2XB5, and 3ODS, with a skewness of (0.025) and a kurtosis of (−0.405) for 1JMW, with a skewness of (−0.219) and a kurtosis of (−0.266) for 1NG6, with a skewness of (0.256) and a kurtosis of (−0.629) for 1Z1L, with a skewness of (0.082) and a kurtosis of (−1.000) for 2XB5, and with a skewness of (0.115) and a kurtosis of (−0.552) for 3ODS (data not shown). It is noteworthy that the statistical analyses were performed without outliers.
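The per-protein summary columns of Table 2 (minimum, maximum, average, standard deviation, and the percentages below 6 Å and 10 Å) can be computed from a list of candidate-to-native RMSD values as sketched below. The function name `summarize_rmsd` and the toy values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def summarize_rmsd(values, thresholds=(6.0, 10.0)):
    """Summary statistics for a list of candidate-to-native RMSD values."""
    summary = {
        "min": min(values),
        "max": max(values),
        "avg": statistics.mean(values),
        "sd": statistics.stdev(values),   # sample standard deviation (assumed)
    }
    for limit in thresholds:
        below = sum(1 for v in values if v < limit)
        summary["pct_below_%g" % limit] = 100.0 * below / len(values)
    return summary


# Toy RMSD values in angstroms, for illustration only (not taken from Table 2).
toy_summary = summarize_rmsd([4.0, 5.0, 7.0, 12.0])
```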

FIG. 6.

Schematic representation of the RMSD for the candidate structures for the set of proteins in test two. In the inset, the red horizontal lines indicate the data median and the error bars indicate the 95% confidence interval of the data.

Including the native structure, the models were evaluated and ranked by using PEM-fitter. The models were ranked in ascending order of their RMSD against the skeleton, so the best candidate is the one with the minimum score. Table 2 summarizes the results of our test and shows the rank of the native structure among the model candidates. For 11 proteins, PEM-fitter picked the native structure as the best structure when aligned with the skeleton. Only the native structure of 1BZ4 (PDB ID) was not ranked as the top model. This is due to the helical nature of 1BZ4: the protein is mostly helical and contains minimal loop structures. Therefore, the generated models are structurally similar, and it was difficult for PEM-fitter to correctly rank the native structure. This is supported by the small differences in the calculated RMSD of the models with the skeleton. Of the 100 models, 95 are within 6 Å RMSD of the native structure, and the RMSD of the native structure with the skeleton is 2.25 Å. In addition, although the native structure is supposed to have the minimum RMSD with the skeleton, some of the candidate conformations have a lower RMSD with the skeleton due to the loop regions. The structural similarity of the tested models is illustrated by the examples in Figure 7. The figure shows three models ranked by PEM-fitter as: best model (rank 1), native structure (rank 14), and a random model (rank 50). The superimposition of the models in Figure 7d shows their structural similarity.
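The ranking step can be sketched as below. The sketch assumes each model has already been aligned with the skeleton, and it scores a model by the RMSD of every skeleton point to its nearest model point using a brute-force search. That scoring definition, the function names, and the toy coordinates are assumptions made for illustration and may differ from the actual PEM-fitter score.

```python
import math


def rmsd_to_skeleton(skeleton_points, model_points):
    """RMSD of each skeleton point to its nearest model point (brute force)."""
    total = 0.0
    for s in skeleton_points:
        nearest = min(math.dist(s, m) for m in model_points)
        total += nearest ** 2
    return math.sqrt(total / len(skeleton_points))


def rank_models(scores):
    """Map each model name to its rank, where rank 1 is the lowest RMSD."""
    ordered = sorted(scores, key=scores.get)
    return {name: position + 1 for position, name in enumerate(ordered)}


# Toy, pre-aligned coordinates (hypothetical).
skeleton = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
models = {
    "model_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "model_b": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
}
scores = {name: rmsd_to_skeleton(skeleton, pts) for name, pts in models.items()}
ranks = rank_models(scores)
```

When several models score within a fraction of an angstrom of each other, as with 1BZ4, the resulting order is sensitive to small errors in the skeleton, which is consistent with the native structure dropping to rank 14.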

FIG. 7.

The alignment of three models of 1BZ4 with the skeleton of its cryo-EM image. (a) The model ranked as top 1 by PEM-fitter (RMSD: 2.14 Å). (b) The native structure of 1BZ4, ranked at position 14 (RMSD: 2.25 Å). (c) The model ranked at position 50 (RMSD: 2.4 Å). (d) The three structures aligned with the skeleton.

4. Conclusion

This article explores a potential method to reduce the complexity of a daunting problem in de novo protein modeling. Most de novo modeling techniques generate thousands of candidate structures in an intermediate step. These structures must then be validated to select only the best candidates for further refinement. Conventionally, the candidates are validated either energetically, through pairwise contact calculations, or by aligning the candidates against the cryo-EM volume images with some other technique. These conventional methods are both computationally expensive and time consuming. We have developed an efficient method to geometrically screen the candidate structures. Our method aims to quickly eliminate inaccurate structures and so reduce the number of candidate structures that must be investigated, thus reducing the time and computational cost of the validation process.

We demonstrated that our method was able to detect the correct candidates for the 10 skeletons we used in our initial test. Further testing of the method on 12 skeletons showed that it is able to differentiate between the experimentally determined underlying structure and numerous morphologically close decoys. The running time of the method suggests that it can be a good alternative to more conventional methods. However, since this is an early version of the approach, more investigation is required. For example, many variants of ICP are available, and some, such as the point-to-plane variant, are expected to produce a more accurate alignment. A method for avoiding inaccurate ranking like the one shown for 1BZ4 should also be developed. The current method uses only the longest PCA axis as the initial guess for positioning the skeleton and the candidate model. Using all three principal axes for the initial guess should give a better initial alignment, which may alleviate the inaccurate matching issue. Finally, in this version, the method performs a rigid alignment between the skeleton and the candidate structure. A flexible alignment could be tried to account for structures that are globally similar but differ in certain local regions.
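The proposed three-axis initial guess can be sketched as follows. This is an illustration under stated assumptions, not the PEM-fitter implementation: the principal axes are obtained with a small Jacobi eigen-solver written in plain Python, all function names are hypothetical, and the point clouds are synthetic. Each principal axis is defined only up to sign, so a practical version would presumably try the sign-flipped frames as well and keep the best one before running ICP.

```python
import math


def matmul(x, y):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def covariance(points):
    """3x3 covariance matrix of a point cloud about its centroid."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    return [[sum((p[i] - c[i]) * (p[j] - c[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def right_handed(axes):
    """Flip the shortest axis if needed so the frame has determinant +1."""
    a, b, c = axes
    cross = [a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0]]
    if sum(cross[i] * c[i] for i in range(3)) < 0:
        c = [-x for x in c]
    return [a, b, c]


def principal_axes(points, sweeps=50, tol=1e-18):
    """Unit principal axes, longest first, via cyclic Jacobi rotations."""
    a = covariance(points)
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(3) for j in range(3) if i != j)
        if off < tol:
            break
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            diff = a[q][q] - a[p][p]
            if diff == 0.0:
                theta = math.copysign(math.pi / 4.0, a[p][q])
            else:
                theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
            rot = [[float(i == j) for j in range(3)] for i in range(3)]
            rot[p][p] = rot[q][q] = math.cos(theta)
            rot[p][q] = math.sin(theta)
            rot[q][p] = -math.sin(theta)
            rot_t = [list(row) for row in zip(*rot)]
            a = matmul(rot_t, matmul(a, rot))  # zeroes the (p, q) entry
            v = matmul(v, rot)                 # accumulates eigenvectors
    order = sorted(range(3), key=lambda k: a[k][k], reverse=True)
    return right_handed([[v[i][k] for i in range(3)] for k in order])


def frame_rotation(model_axes, skeleton_axes):
    """Rotation R such that R applied to model axis k gives skeleton axis k."""
    return [[sum(s[i] * m[j] for m, s in zip(model_axes, skeleton_axes))
             for j in range(3)] for i in range(3)]


# Synthetic example: a box-shaped cloud and a copy rotated 45 degrees about z.
model = [(float(x), float(y), float(z))
         for x in range(-5, 6) for y in range(-2, 3) for z in range(-1, 2)]
h = math.sqrt(0.5)
skeleton = [(h * (x - y), h * (x + y), z) for x, y, z in model]
model_axes = principal_axes(model)
skeleton_axes = principal_axes(skeleton)
rotation = frame_rotation(model_axes, skeleton_axes)
```

Matching only the longest axis leaves the rotation about that axis undetermined, whereas matching the full frame fixes the orientation up to the sign ambiguity noted above.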

Acknowledgment

This work is funded by NSF grant: HBCU-UP RIA 1600919.

Author Disclosure Statement

No competing financial interests exist.
