Abstract
Abstract
The volumetric images produced by Cryo-Electron Microscopy (cryo-EM) technique are used to model macromolecular assemblies and machines. De novo protein modeling uses these images to computationally model the structure of the molecules. Many candidate conformations are usually generated during the intermediate step. Conventionally, each of these candidates is evaluated by time-consuming approaches such as potential energy. We introduce an initial version of a geometrical screening method that uses the skeleton of the cryo-EM images to evaluate candidate structures. The aim of this method is to reduce the number of native-like candidate conformations and, therefore, reduce the time required for structural evaluation by energy calculations. A test of two datasets was performed. The first dataset contains 10 proteins and shows that our method can successfully detect the correct native structure for the given skeleton among a set of different protein structures. The second dataset contains 12 proteins and shows that our method can filter slightly modified decoy conformations of the same protein. The efficiency of the method is also reported.
1. Introduction
T
The main three biophysics techniques used to determine the structure of proteins are X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM). Although X-ray crystallography and NMR are the dominant techniques for protein structure determination, they are subject to numerous limitations (Pearson and Mozzarelli, 2011; Emwas, 2015; Zheng et al., 2015). Some of these limitations are the required quantity and purity of the sample, the requirement that the protein should be crystallizable, and the stability of the structure. These limitations are particularly troublesome for macromolecule machines and some protein molecules such as viral capsids. In contrast, cryo-EM has been shown to be a powerful biophysical technique that is capable of examining the macromolecular machines in their native environment. Further, it does not suffer from crystallization problems. Therefore, it can determine the structure of types of proteins that are hard to resolve by using conventional methods, for example, membrane-bound proteins. In addition, cryo-EM is expected to be the main workhorse to capture the molecular interaction between large complexes within cells (Mitra and Frank, 2006; Frank, 2009). Cryo-EM produces volumetric images that are used to determine the structure of the protein.
Most of the images are produced at sub/nanometer resolutions (>5 Å). It is challenging to determine atomic structures from images generated at sub/nanometer resolution by using current technologies. Therefore, improved computational methods are critically important for addressing this problem. The number of prospective sub/nanometer cryo-EM images is rapidly growing with the greatly improved detectors that are now available. The development of powerful and automatic computational methods to derive the atomic structure underlying a cryo-EM volume would greatly advance the role of cryo-EM as a complement to traditional methods. At resolution worse than 5 Å, the image becomes unclear, and, therefore, the atomic model cannot be constructed directly. However, computational methods are capable of extracting features, analyzing, and utilizing the image to derive the structure of the protein. These methods can be classified into either fitting or de novo modeling.
When a high-resolution atomic structure is available for small proteins or for a part of large proteins, fitting and refinement tools have shown the ability to derive the atomic structure of a protein from cryo-EM images (Topf et al., 2005, 2006, 2008; Lu et al., 2007, 2008; DiMaio et al., 2009). Given an initial structural model, the volumetric image is used to refine and fit the model structure and to construct an accurate high-resolution all-atom protein model. The refinement process uses a fitting scoring that measures how well the model fits into the volumetric image and identifies mismatched regions between the model and the image. Two types of fitting can be used: rigid fitting and flexible fitting (Volkmannb and Hanein, 1999; Rossmann, 2000; Jiang et al., 2001; Wriggers and Chacón, 2001; Chacón and Wriggers, 2002; Topf et al., 2008; Pintilie et al., 2010; Brown et al., 2015; Gydo and Alexandre, 2015). In rigid fitting techniques, the structure being fitted is aligned with the image as one component. When the atomic structure of the imaged molecule is not the same as in the assembly, rigid fitting cannot be used. To overcome this problem, flexible fitting should be considered where the conformation of the structure being fit can be changed to improve the correspondence to the cryo-EM image (Wriggers et al., 1999, 2000; Volkmann et al., 2000; Wriggers and Birmanns, 2001; Ming et al., 2002a, 2002b; Tama et al., 2004; Wells et al., 2005; Suhre et al., 2006; Velazquez-Muriel et al., 2006; Schröder et al., 2007; Jolley et al., 2008).
In the absence of a high-resolution structure corresponding to the image or part of it, which is the case for most macromolecules, it is not feasible to fold the sequence to the image (Lindert et al., 2009). Therefore, instead of rigid or flexible fitting and refinement using cryo-EM images, de novo modeling of the protein structure is used. De novo protein modeling is the set of computational methods and algorithms used to construct the 3D structure of a protein by using its primary sequence and cryo-EM image. Cryo-EM is superior when working on nanostructure biological complexes such as viruses, small organelles, and macromolecular biological complexes. Therefore, de novo modeling is able to model certain types of proteins when traditional computational methods fail. For instance, ab initio modeling is not suited for relatively large protein macromolecules due to the complexity of their folding and the huge size of the search space. Likewise, comparative modeling fails with membrane proteins due to the diversity of their structures and functions. It is hard to find a template from the pre-determined structures for the target protein.
Many de novo modeling approaches have been proposed (He et al., 2004; Wu et al., 2005; Dal Palu et al., 2006; Al Nasr and He, 2009; Lindert et al., 2009, 2012; Baker et al., 2011; Al Nasr et al., 2012). In general, de novo modeling techniques generate thousands of candidate structure configurations in an intermediate step and evaluate them to select the most suitable, native-like ones. Wu et al. (2005) used a geometry filter followed by an energetics-based evaluation. The energy evaluation is carried out by a knowledge-based pair-wise potential function to evaluate fold candidates. One drawback of their work is the amount of computation time required to accomplish the search process. Because of this cost, this method is not suitable for medium or large proteins. Baker et al. (Abeysinghe and Ju, 2009; Baker et al., 2011) have used Gorgon in a semi-automatic approach to build the structures of the molecules; however, this modeling process relies greatly on user-system interaction. Lindert et al. (2009, 2012) proposed a de novo folding approach called EM-Fold. EM-Fold uses a Monte Carlo refinement process to improve the placement of structural SSEs (secondary structure elements). In a subsequent step, the loops and side chains are added by Rosetta's iterative side-chain repacking and backbone reconstruction protocols that generate a model at atomic resolution (DiMaio et al., 2009). Although this method can work with a large solution space, the stochastic nature of the approach may miss the correct topology.
Several de novo techniques use the skeleton of the given volume image to reduce the search space and help in modeling (Lindert et al., 2009, 2012; Al Nasr et al., 2014). A skeleton reveals useful information that highlights the connections between secondary structure elements and, therefore, speeds up the modeling process.
In this article, we introduce a skeleton-based coarse-grained approach that uses cryo-EM skeleton to validate the candidate structures in the intermediate step in de novo modeling. The method uses the high-quality skeletons that can be extracted from cryo-EM volumetric images by using a tool developed recently (Al Nasr et al., 2013a, 2013b). The aim is to quickly rank the candidates to determine which are suitable for further evaluation by using potential energy calculations. Thus, instead of evaluating all of the candidates using energy methods, we will reduce the number of candidates drastically to speed up the process. The initial version of the method was presented at the IEEE Bioinformatics and Biomedicine (BIBM) (Al Nasr et al., 2016). Since our initial version, our method has been subject to further tests. Initially, the method was examined for its ability to distinguish the candidate structure that represented the protein in a cryo-EM volumetric image from other, dissimilar, protein candidate structures. In this revision, we expanded the range of testing; the method is examined for its ability to differentiate the experimentally determined structure of a protein from a set of slightly modified structures of the protein. The results of this expanded testing are present herein. The overview of the method is depicted in Figure 1. The candidates are rigidly aligned with the skeleton, and a “fitness score” is calculated (i.e., Root Mean Square Deviation [RMSD] between a skeleton and a candidate structure). The candidates with the best scores are marked as good models and are appropriate for further evaluation and refinement.

The overview of our method. The skeleton of the cryo-EM volume image is extracted and used as a reference object to geometrically screen the candidate structures constructed in de novo modeling. Each structure is aligned with the skeleton by using ICP, and a score (i.e., RMSD) is calculated. cryo-EM, cryo-electron microscopy; ICP, Iterative Closest Point; RMSD, root mean square deviation.
2. Methods and Approaches
In this article, we propose a fast and coarse-grained geometrical screening method to speed up de novo modeling of protein structures. Our method is based on the idea that rapid rejection of non-viable candidate structures offers an inexpensive approach for reducing the amount of computational power that must be expended to validate thousands of candidate structures modeled during the intermediate step in de novo modeling. The candidate structures with the highest scores in our method are expected to be native like and are appropriate for more investigations such as energy validation and structure refinement.
The method uses a skeleton extracted from the target cryo-EM volume image as a reference or target shape (Al Nasr et al., 2013a). The advantage of using a skeleton and not a cryo-EM volume image is that a skeleton is a simplified and thin version (usually one voxel in width) of the original volume that is topologically comparable to the object and highlights its geometrical and structural features. Thus, most of the structural features and geometrical properties of the object such as the backbone structure of the protein are carried by the skeleton. Therefore, the complexity of processing is reduced. The candidate structure will be aligned with the skeleton by using the Iterative Closest Point (ICP) algorithm (Besl and McKay, 1992; Zhang, 1994; Pomerleau et al., 2013). However, before the actual alignment, the candidate structure is preprocessed to simplify its representation and to roughly align it with the skeleton. This rough alignment is needed since a requirement for ICP to converge is that the two objects being aligned have a close initial orientation and position. Although both sets of data represent complexly folded, linearly connected, 3D structures, while applying our method we treat each as disconnected point clouds and allow the two structures to move freely through each other. No transformations are performed that alter their shapes in the procedure (i.e., a rigid alignment is performed).
Since the skeleton most likely holds the shape of the backbone of a molecule, the all-atom structure of a candidate model is not necessary. Therefore, spatial data for the candidate structure is simplified by removing all information except the location information for the Cα atoms on the backbone. This reduced structure is faster to process and preserves the geometrical shape of a candidate structure.
To make the orientation of the two-point clouds close, we use the Principle Component Analysis (PCA) to determine their long axes. PCA is a statistical technique that is used in various fields of image processing such as facial recognition and image compression. The PCA is used to find the rough shape of the voxels in the objects being compared. The three axes of each object are found and used to change the initial orientation of the comparable point clouds. In PCA, the coordinates of voxels are transformed to a new coordinate system. Let the input structure (i.e., skeleton or candidate model) c consist of N voxels and let
where CovM is the 3 × 3 covariance matrix, and

PCA used to align the two structures for initial orientation. The PCA
Once the two-point clouds are aligned along the same general axis, we use the ICP algorithm to perfect the alignment. There are many variants of ICP, some of which allow for deformation of the point clouds. This deformation is useful if one is trying to match visual data where the apparent shape of an object is altered due to foreshortening. Since we are trying to rigidly register points in two clouds, we have chosen to use an approach that holds the relative locations of the points within each cloud static.
ICP is an efficient algorithm used in many applications to find the transformation between two-point clouds by minimizing the square errors between the clouds (Besl and McKay, 1992). It is an example of a gradient descent approach in which the correspondence of point clouds is reevaluated as the solution comes closer to the error local minimum.
Starting from initial alignment guess that we generated during the preprocessing, ICP, iteratively, finds the closest set of points in the two clouds and calculates the rotation R and translation t to be applied to candidate structure to minimize the squared error E (Fig. 3). Let skeleton point cloud

Illustration of ICP algorithm.
To accelerate the calculations of E, the location data for the voxels making up the skeleton are placed into a k-dimensional tree (kd-tree) (Bentley, 1975). The nearest neighbor in the cryo-EM structure of each point in the candidate structure is generated by querying the kd-tree. The use of the kd-tree in this step reduces the running time of calculating E from

The pseudo code for our method (Protein-EM fitter).
After ICP aligns the two-point clouds over the course of
3. Experimental Results
To examine the accuracy of our geometrical screening method (PEM-fitter), a random set of 10 protein structures from PDB database were used. The cryo-EM volume images were synthesized by using Chimera software package (Pettersen et al., 2004) at 10 Å resolution. We used some open source packages to support the implementation of our method. The nanoflann ver. 1.2.2 (https://github.com/jlblancoc/nanoflann; accessed May 2, 2017) implementation of the kd-tree is used. The Eigen ver. 3.2.8 (http://eigen.tuxfamily.org; accessed May 2, 2017) package is used to implement the SVD. Skel-EM (Al Nasr et al., 2013a) is used to extract the skeletons of the synthesized cryo-EM images. The Boost C++ Libraries ver. 1.63 (www.boost.org; accessed May 2, 2017) was used to simplify file access. The number of iterations used in this test is 50.
Given a set of candidate models, our method is expected to be able to mark the best candidate model that is most structurally similar to the native protein structure based on the given skeleton. To do so, we aligned each skeleton against each of the 10 protein models in the data set that are used as candidate structures, including the correct native structure of the skeleton. In this test, the models used are from different proteins. The results of the alignments were ranked in ascending order based on RMSD calculations against the skeleton by using our method of ICP. The best structure candidate is the candidate with the minimum score. Table 1 summarizes the results of our test. The test was performed on a Dell OptiPlex 790 desktop personal computer with Intel i5-2400 3.1 GHz processor and 8 GB of memory.
The protein structure (PDB ID) used to generate the volume image and then the skeleton.
The model (PDB ID) picked by our method as the most structurally similar to the skeleton.
The calculated RMSD between the best candidate and the skeleton.
Average processing time for the method to align the 10 structures with the skeleton.
cryo-EM, cryo-electron microscopy; RMSD, root mean square deviation.
As expected, the method has successfully detected the observed protein structure for the given skeleton as the best candidate for every test. Table 1 reports the best RMSD reached by the alignment of the 10 models against each skeleton. As shown, for each skeleton, the candidate with the lowest RMSD was the true model for that skeleton (columns 2 and 3). In addition, the method was time efficient in processing the 10 candidates against the skeleton (column 4). The time reported is for the average time taken for the skeleton to be aligned for the 10 structures in milliseconds. As an example, Figure 5 shows the skeleton 4X5W aligned with protein models 1FLP, 1NG6, 2PSR, 3LTJ, and 4X5W. The RMSD scores are 4.7, 3.8, 4.0, 5.0, and 2.3 Å, respectively. The average time taken for this example is 15 ms.

The alignment of the skeleton of 4X5W cryo-EM image.
To test the ability of PEM-fitter to rank structures of the same protein, we have built a random set of 12 protein structures. This represents the case when models/decoys are constructed for a given protein in de novo modeling techniques. Conventional de novo modeling uses potential energy calculations to distinguish between the candidate models. This approach is time consuming. Therefore, our method aims at speeding up the process by identifying a set of models that are most native like. These models could then be subjected to conventional potential energy comparisons. A number of conformational models were generated for each protein (Table 2, column 3) by using the method described in Al Nasr et al. (2012) and Al Nasr and He (2014). Cryo-EM volume images were synthesized for the native protein of these models at 10 Å resolution. Skel-EM (Al Nasr et al., 2013a) was used to extract the skeletons of the cryo-EM volume images.
The protein model (PDB ID) used in the test.
Number of candidate conformations used to evaluate against the skeleton in addition to the native.
The rank of the native structure among all candidates used.
The RMSD between the native structure and the skeleton.
The minimum RMSD between the native structure and the candidates.
The maximum RMSD between the native structure and the candidates.
The average RMSD between the native structure and the candidates.
The standard deviation in the RMSD values between the native structure and the candidates.
The percentage of candidates within 6/10 Å RMSD from the native structure.
Table 2 shows the structural difference of the candidates with the native structure. The minimum, maximum, and average RMSD of the candidates with the native structure is reported in columns 6, 7, and 8, respectively. The standard deviation is reported in column 9. Due to the errors and imperfections in the skeleton or the original cryo-EM image, the native structure's RMSD with the skeleton is not equal to zero. Column 5 reports the RMSD of the native structure. Ideally, the RMSD of the native structure is the minimum that can be reached. Most of the candidate conformations were structurally similar to the native model. As shown in column 10, 9 out of 12 proteins have 95% of the candidate conformations within less than 10 Å from the native model. Figure 6 shows the 95% confidence interval for the 12 protein structures using error bars. This imitates a realistic scenario for any de novo modeling technique. Figure 6 provides a simple statistical picture of the RMSD for the candidate structures from the protein native model. The box plots show that our data were approximately normally distributed for IJMW, 1NG6, 1Z1L, 2XB5 and 3ODS, with a skewness of (0.025) and a kurtosis of (−0.405) for IJMW, with a skewness of (−0.219) and a kurtosis of (−0.266) for 1NG6, with a skewness of (0.256) and a kurtosis of (−0.629) for 1Z1L, with a skewness of (0.082) and a kurtosis of (−1.000) for 2XB5, and with a skewness of (0.115) and a kurtosis of (−0.552) for 3ODS (data not shown). It is noteworthy that the statistical analyses were performed without outliers.

Schematic representation of the RMSD for the candidate structures for the set of proteins in test two. In the inset, the red horizontal lines indicate the data median and the error bars indicate the data 95% confidence ritual.
Including the native structure, the models were evaluated and ranked by using PEM-fitter. The results of the test were ranked in ascending order based on RMSD calculations against the skeleton. The best structure candidate is the candidate with the minimum score. Table 2 summarizes the results of our test and shows the rank of the native structure among the model candidates. PEM-fitter was able to pick the native structures of 11 proteins as the best structure when aligned with the skeleton. Only one native structure was not ranked as the top model for 1BZ4 (PDB ID). This is due to the helical property of the structure of 1BZ4. The protein's amino acids are mostly helical, and it contains minimal loop structures. Therefore, the generated models are structurally similar and it was difficult for PEM-fitter to correctly rank the native structure. This is supported by the difference in the calculated RMSD of the models with the skeleton. The RMSD values of 95 out of the 100 models have less than 6 Å RMSD with the native, and the RMSD of the native with the skeleton is 2.25 Å. In addition, although the native structure is supposed to have the minimum RMSD with the skeleton, some of the candidate conformations have less RMSD with the skeleton due to the loop regions. The structural similarity of the tested models is depicted by the examples in Figure 6. The figure shows three models ranked by PEM-fitter as: best model (rank 1), native structure (rank 14), and a random model (rank 50). The superimposition of the models in Figure 7d shows the structural similarity of the models.

The alignment of the skeleton of 1BZ4 cryo-EM image (RSMD: 2.14 Å).
4. Conclusion
This article explores a potential method to reduce the complexity of a daunting problem in de novo protein modeling. Most of de novo modeling techniques generate thousands of candidate structures in an intermediate step. These structures must then be validated to select only the best candidates for further refinement. Conventionally, the candidates are validated energetically either through pairwise contact calculations or through the alignment of the candidates against cryo-EM volume images by using some other technique. These conventional methods are both computationally expensive and time consuming. We have developed an efficient method to geometrically screen the candidate structures. Our method aims at quickly eliminating inaccurate structures to reduce the number of candidate structures that must be investigated, thus reducing the time and computational cost of the validation process.
We demonstrated that our method was able to detect the correct candidates for the 10 skeletons we used in our initial test. Further testing of the method on 12 skeletons showed that it is able to differentiate between the experimentally determined underlying structure and numerous close morphological decoys. The running time of the method suggests that it can be a good alternative compared with more conventional methods. However, since this is an early version of this approach, more investigation is required. For example, many variants of ICP are available and some are expected to produce more accurate alignment such as point-to-plane variant. A method for avoiding inaccurate ranking like the one shown for 1BZ4 should also be developed. The current method uses the longest axis of PCA for the initial guess for positioning the skeleton and the candidate model. The use of all three spatial axes for the initial guess should ensure a better initial alignment, which may alleviate the inaccurate matching issue. Finally, in this version, the method performs a rigid alignment between the skeleton and the candidate structure. A flexible alignment can be tried to account for structures that are globally similar but differ in certain local regions.
Footnotes
Acknowledgment
This work is funded by NSF grant: HBCU-UP RIA 1600919.
Author Disclosure Statement
No competing financial interests exist.
