Abstract
Abstract
Searching for protein structure-function relationships using three-dimensional (3D) structural coordinates represents a fundamental approach for determining the function of proteins with unknown functions. Since protein structure databases are rapidly growing in size, the development of a fast search method to find similar protein substructures by comparison of protein 3D structures is essential. In this article, we present a novel protein 3D structure search method to find all substructures with root mean square deviations (RMSDs) to the query structure that are lower than a given threshold value. Our new algorithm runs in O(m + N/m0.5) time, after O(N log N) preprocessing, where N is the database size and m is the query length. The new method is 1.8–41.6 times faster than the practically best known O(N) algorithm, according to computational experiments using a huge database (i.e., >20,000,000 C-alpha coordinates).
1 Introduction
Many computational approaches have been developed for structural comparison, including FAST (Zhu and Weng, 2005), CE (Shindyalov and Bourne, 1998), DALI (Holm and Sander, 1993), MAMMOTH (Ortiz et al., 2002), TM-Align (Zhang and Skolnick, 2005), and PSIST (Gao and Zaki, 2005). The majority of these algorithms are generally based on two steps. The first step involves a local geometric comparison of small regions for identifying structurally similar regions and residue pairs using the C-alpha atoms between two proteins. These alignments obtained by structural comparisons between small fragments are known as aligned fragment pairs (AFPs). AFPs are identified by sliding a window, which has a fixed size, between protein structures. The second step is the structural alignment of the global folds using a particular algorithm, such as dynamic programming. For example, in the first step that defines the AFPs, FAST compares two five-residue segments according to the similarity function based on Euclidian distances between C-alpha atoms. CE and DALI use a C-alpha distance matrix of local structures (eight and six residues, respectively). MAMMOTH uses the unit vector of small substructures (eight residues) and the unit-vector root mean square (URMS) distance between all pairs of heptapeptides as a similarity function. TM-Align employs secondary structure elements and the TM-Score to obtain the initial alignment. PSIST uses a distance and a bond angle as a feature vector within each window (three residues) to describe local features.
The structural alignment problem of proteins with insertions or deletions is known as the NP-hard problem (Lathrop, 1994). There are various heuristic approaches to solve the NP-hard problem, but no exact solution. In this article, we focus on an easier, but essential problem: finding all substructures of a 3D structure from a protein database whose root mean square deviation (RMSD) (Kabsch, 1976) to the query structure are given a threshold and contain no insertions or deletions. There is a simple solution to this problem involving the comparison between the query structure with all structures within a database; however, this method costs O(Nm) time, where N is the database size and m is the query length (Diamond, 1998; Umeyama, 1991). The best known algorithm is a filtering based algorithm (Shibuya, 2010a) that uses a lower bound RMSD to eliminate unrelated structures before computing the final RMSD. Shibuya proved that the filtering based method runs in O(N) time in the average case, though the worst-case time complexity is still O(Nm), which is the same as the above Naive algorithm. Here, we propose a significantly faster algorithm in practice that uses new lower bounds and runs in O(m + N/m0.5) time in the average case, after O(N log N) preprocessing. Note that the worst-case time complexity of our algorithm is still O(Nm), as in the case of the previous best-known algorithm. The O(N log N) preprocessing time is theoretically worse than the previously best algorithm, but it is the time complexity for very simple sorting, and the actual computing time for it can be dismissed. Note also that there has been proposed an algorithm with better expected time complexity, i.e., O(m + N/m1-ɛ) (Shibuya, 2010b), where is ɛ is an arbitrary small constant, but the algorithm is not a practical algorithm but a theoretical one. It is apparently not effective for ordinary query sizes (which are at most around 1,000), and moreover it is a very complicated algorithm and is extremely difficult to implement. Thus, we do not compare our algorithm with the theoretical best algorithm.
1.1 Problem definition
In the case that we are given a 3D structure database
2 Methods
2.1 Notation and definitions
In this article, to describe the protein 3D structure, 3D coordinates of each residue were approximated by the C-alpha coordinates. A chain molecule S is represented as S = (s1, s2, … , sn), where si denotes the 3D coordinates of the i-th C-alpha atom and n denotes the number of amino acids of S. |S| means the length of S. A structure S[i.j] = (si, si+1, … , sj) is called a substructure of S from the i-th residue to the j-th residue. |S[i.j]| is equal to j−i + 1.
2.2 Shibuya's lower bound for the RMSD
The RMSD between two chain molecules S and T is defined as the minimum value of
over all the possible rotation matrices R and translation vectors v. RMSD(S, T) denotes this minimum value.
Shibuya proposed a lower bound for the RMSD between any two structures with the same length. When the chain molecule
The algorithm for searching for similar substructures with Shibuya's lower bound for the RMSD is simply described by two steps: (1) searching through all the substructures where the lower bound (D(S,T)) for a RMSD is lower than the given threshold c, and is therefore a candidate; and (2) the RMSD values of each candidate are then checked using the C-alpha coordinates. Even though the D(S,T) is not representative of the RMSD, it can be used as a filter to remove dissimilar substructures in simple calculations, because of the inequality such that D(S,T) is smaller than or equal to RMSD(S,T). If the D(S,T) is larger than the given threshold c, RMSD(S,T) must be larger than c.
Shibuya also proposed that the following lower bound for the RMSD using the combination of three segments is very efficient in removing dissimilar substructures before computing the actual RMSD.
where
Shibuya reported that the algorithm (denoted as A3: Algorithm 3) with this lower bound (expression (3)) for the RMSD is much faster than other algorithms.
2.3 New lower bounds for the RMSD
Now we consider P[i.i + m−1] and Q as a chain molecule consisting of six segments which have the same length

A: Illustration of the six segments for position i in database (Pi,1 − Pi,6) and query (Q1 − Q6). B: Example of LBi,15 (expression(23)). LBi,15 consists three lower bounds of combined segments (D(Pi,1 + Pi,6, Q1 + Q6), (D(Pi,2 + Pi,5, Q2 + Q5) and (D(Pi,3 + Pi,4, Q3 + Q4)).
The expression (6) shows that the lower bound can be described as a comparison between two chain molecules consisting of six segments (Pi,1 − Pi,6 and Q1 − Q6). We then consider the combination of two segments among the six segments as:
Even if the two segments are sequentially connected or not in the protein molecule, the lower bound of the combined segments is calculated as the following inequality:
where j and k represent the j-th and k-th segment among the six segments of Pi and Q. According to expression (7) and (8), the following 15 inequalities also can be used as lower bounds for the RMSD(P[i.i + m−1], Q). (See the inequalities (9)–(23).)
For example, LBi,15 (expression (23)) is corresponding to
Let D* i represent the lower bound given in expression (24). Thus, the substructures whose D* i is larger than c can be detected as dissimilar substructures before computing the actual RMSD.
2.4 Search algorithm
Our 3D substructure searching algorithm with D* i has three steps (Fig. 2): (1) The preprocessing is the indexing of protein substructures for all positions in the database P; it is an essential process to gain speed and to reduce the file size of the database, and it is only required once. (2) The second step is a calculation of the lower bounds. (3) The third step is a calculation of the actual RMSD of all substructures that have passed the second step. A detailed description of each step is presented next.

Flow chart of LB3D.
2.5 Preprocessing (Step 1) > Indexing the protein substructures
The process of reading the large number of PDB text files is known to be time consuming and significantly reduces the speed of searching. Moreover, the value of F(Pi,j + Pi,k), where
2.6 Step 2. Computing the lower bounds
Although computing D*i is much faster than computing the actual RMSDs, the computational costs of the calculation D*i for all positions using a huge database (such as the PDB or SCOP) is not possible. By using the binary search algorithm on the sorted list L, we can reduce the computational cost before calculating D*i for all positions. The binary search algorithm is widely used to find the position which has the key value in a sorted array. According to expression (23), the lower bound for the RMSD is described by only the first and sixth segments as:
where D(Pi,1 + Pi,6, Q1 + Q6) represents the |F(Pi,1 + Pi,6) − F(Q1 + Q6)|. Let E(P[i.i + m − 1], Q) denote this lower bound.
If
Therefore, before completing the calculation of all D* i values, we can find all positions i corresponding to expression (26) by using a binary search algorithm on the sorted list L. Then, after computing D* i , where i corresponds to expression (26), we filter out dissimilar substructures where the D* i is greater than the given threshold c before the calculation of actual the RMSD.
2.7 Step 3. Compute the RMSD
The calculation of RMSD(P[i.i + m − 1], Q) is performed for all candidates that passed Step 2. We can then list all similar substructures whose RMSD values are lower than given threshold c.
2.8 Computational time analysis
To analyze the time complexity of the algorithm, we assumed that the structures in the database follow a model called the freely-jointed chain (FJC) model. In the FJC model, we assume that the structures of the chain molecules (in the database) can be considered as random walks in 3D space. This model is a basic and simple model used in molecular physics for chain molecules such as proteins, and is also used for analyzing the average-case (i.e., expected) time complexity of algorithms that deal with chain molecules. Though there are of course many repeated substructures or common substructures in protein databases, the experimental results showed that the behavior of database searching algorithms on PDB reflects the FJC model (Shibuya, 2010a).
According to Shibuya (2010a), the following lemma holds for the lower bound D:
Lemma 1 (Shibuya, 2010a)
For any protein structures P and Q with the same length, the probability
We obtained a similar result on the lower bound E(P, Q) as follows:
Theorem 1
For any protein structures P and Q of the same length, the probability
Let bi denote
where δ is some positive constant. According to Lyapunov's central limit theorem (Kallenberg, 1997), the distribution of
This means that
This means that the distribution of
holds is only in
According to Shibuya (2010a), the following lemma also holds for the lower bound LBi,1:
Lemma 2 (Shibuya, 2010a)
For any protein structures P and Q of the same length, the probability
It is easy to see that the following corollary holds for our lower bound D*i:
Corollary 1
For any protein structures P and Q, the probability
The time complexity of our algorithm can be analyzed as follows (based on the above discussion). In the preprocessing step, we need only O(N) time for obtaining F() values, but need O(N log N) time to sort them to compute the sorted list L. In Step 2, we require O(log N) time for the binary search. After binary searching, we obtain O(N/m0.5) candidates, according to Theorem 1. For these candidates, we compute the D*i values, which takes only O(1) for each. After computing the D*i values, the number of remaining candidates is only in O(N/m1.5), according to Corollary 1. In Step 3, we compute the RMSD for all the remaining candidates, which requires O(m) time for each. Overall, the presented algorithm achieves O(m + N/m0.5) average-case time complexity, after O(N log N) preprocessing, because the term O(log N) can be ignored. Note that the worst-case query time complexity is still O(Nm).
3 Results and Discussion
3.1 Computational experiments on the SCOP1.75 database
To test the performance of the new algorithm, we used the SCOP database (Andreeva et al., 2008) release 1.75 (denoted as SCOP1.75), which contains 110,799 domains with a total of 20,429,263 C-alpha coordinates. We randomly selected 100 domains from SCOP1.75 as query domains for this experiment. These 100 domains contained a total of 6,710–22,692 substructures depending on the length of substructures (10–200). All the 3D coordinates are taken from the ASTRAL database (Chandonia et al., 2004).
The goal is to execute the exhaustive search of similar substructures much faster than known algorithms. We compared three algorithms (Naive, A3, and LB3D). Naive performs an exhaustive calculation of the RMSD against all substructures of the database without any filtering. The computational time of Naive is known as O(Nm). A3 is the best known algorithm, reported as an O(N)-time algorithm by Shibuya. In this experiment, to reduce computational costs, A3 uses the indexed data which consists of a sorted list of all substructures according to the value of F(Pi,1 + Pi,2) before computing all the lower bounds for the RMSD (expression (3)). LB3D denotes our new algorithm described above and is a faster O(m + N/m0.5) algorithm. There are no parameters that need optimization in the algorithms. LB3D was implemented in ANSI C and runs on standard Linux OS. We used a Linux computing system (Intel Xeon E5506 CPU of 2.13 GHz and 12 GByte memory) for the following experiments.
We compared these three algorithms among various query lengths and a fixed threshold (c = 1.0 Å). Table 1 summarizes the results of the experiments on SCOP1.75. All three algorithms performed an exhaustive search. Therefore, the average number of similar substructures that were found by each algorithm (“#Hits”) are always the same. This indicates that there was no requirement to compare the qualities of these search methods. As shown in the row of “Time” and “Time/1M,” LB3D clearly performed faster than the other two algorithms: about 1.8–41 times faster than A3 and 2.3–1536 times faster than Naive. In particular, for the medium-long query (over 40 residues), LB3D is at least 18.4 and 514 times faster than A3 and Naive, respectively.
The number of query substructures which were obtained from the randomly selected 100 domains.
The total number of substructures in the database corresponding to the query length.
The average number of substructures whose RMSDs to queries consisted of 100 random domains, and were lower than 1.0 Å.
The number of “Hits” per one million (1M) substructures.
Algorithm 3, O(N) algorithm, using expression (9).
The average number of substructures whose actual RMSDs were computed. It also means that the average number of substructures whose lower bounds for RMSDs are lower than the given threshold (1.0 Å).
The number of “Checked” per 1M substructures.
The ratio of the average number of “#Checked” against “#Hits.” This represents the average number of substructures whose actual RMSDs were computed per one actual similar substructure.
The average computation time, excluding the preprocessing step.
The average computation time per 1M substructures.
Our new O(m + N/m0.5) algorithm, using the new lower bound (expression (24)).
O(Nm) algorithm; computing the RMSDs for all substructures.
The rows of “#Checked” and “#Checked/1M” indicated the number of RMSDs of the substructures that were not filtered by lower bounds and were computed per one query substructure. It is equivalent to the efficiency of the filtering with the lower bound used in the algorithm. The calculation of the RMSDs is computationally more expensive than that of the lower bounds for the RMSD. The time to find all similar substructures whose RMSDs to the query are at most given threshold is directly dependent on the number of checked substructures; therefore, lowering the number of RMSDs will accelerate the search. Examination of the speed (“Time” and “Time/1M”) by comparison of “#Check/1M” in LB3D and A3 indicated that the use of D* i provides a significant advantage in the filtering of dissimilar substructures.
Interesting results were observed when we focused on the limitation of the efficiency of the filtering with the lower bound. Here, the average number of checked substructures per one substructure whose RMSD is lower than the given threshold (1.0 Å) was denoted as “#Checked/Hits.” For example, the “#Checked/Hits” of LB3D indicates only 4.1–9.6 when the query lengths were 60–200. This indicated that, if the computational cost of filtering is not improved, there is modest room (at most 4.1–9.6 times faster than LB3D) for improving LB3D-like algorithms. In contrast, “#Checked/Hits” of LB3D for a small-to-medium query (especially when the lengths were 20 and 40) showed a significantly poorer performance than the results obtained using the other query lengths. For these small queries associated with two or three secondary structure elements (SSEs), we are considering ways to improve the performance by using other methods such as comparing each SSE or using the orientation of the SSEs.
Moreover, as shown in Table 1, the query length does not affect the “Time/1M” of both A3 and LB3D. On the other hand, in the Naive O(Nm) algorithm, computational cost (“Time/1M”) clearly increased as the length of the query increased.
As mentioned above, the performance of the speed is directly dependent on the performance of the filtering approach, which corresponds to Step 2 of Figure 2. To evaluate the ability of filtering dissimilar substructures, a number of checked substructures at various thresholds (0.2–5.0Å) and query lengths (10–200) were analyzed by comparison of LB3D and A3. Figure 3 shows the results of A3 and LB3D using different query lengths and thresholds. LB3D can filter out over 90% of the substructures from the database when the query length is >60 and the threshold is lower than 4.0 Å. In comparison to A3, LB3D performed significantly better at filtering queries of any length and threshold values. Surprisingly, even when examining small queries (length = 10, 20), LB3D computed actual RMSDs of only 38% and 2% of the database when the threshold c was set to 1.0 Å.

Average number of checked RMSDs of substructures per 1M substructures of the database (SCOP1.75). (A) Results of Algorithm 3. (B) Results of LB3D.
The advantages of the new algorithm suggest that the algorithm has further applications in various fields: (1) LB3D can find all AFPs among a huge database and filter out dissimilar protein structures before computing all structural alignments. Therefore, the speed of searching for a protein structure which has the same fold could be significantly increased using the new algorithm. (2) For protein-protein docking, similar 3D patterns of protein-protein interactions can be found by the algorithm. Even if the substructures of the query or database are not sequentially connected, the new lower bound (expression (24)) can be applied. (3) For protein structure prediction, the algorithm can detect structural conservations by searching all AFPs.
4 Conclusion
In this article, we have developed a new substructure search algorithm, LB3D, which is based on a new lower bound for the RMSD value (expression (24)). We proved that the new algorithm is an O(m + N/m0.5) average-case time query algorithm after O(N log N)-time preprocessing. We showed that the new algorithm is significantly faster than the best-known O(N) algorithm (Algorithm 3) and the most common O(Nm) algorithm (Naive). We attribute the search speed of LB3D to the number of substructures that are filtered as dissimilar substructures by using a lower bound for the RMSD. LB3D can efficiently eliminate dissimilar substructures before computing the actual RMSDs, even when small queries (10–40) or large thresholds (∼5 Å) are used. Thus, LB3D is a useful tool for performing an exhaustive search using a huge database to find all similar substructures.
Footnotes
Acknowledgments
We thank Mr. M. Oosawa for valuable cooperation. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B), 22700314, 2010–2011.
Disclosure Statement
No competing financial interests exist.
