Using Variable-Length Aligned Fragment Pairs and an Improved Transition Function for Flexible Protein Structure Alignment

Abstract

As the number of known protein 3D structures grows rapidly, comparing protein structures efficiently has become an essential and challenging problem in computational structural biology. Many protein structure alignment methods have been developed. Among them, flexible structure alignment methods have been shown to be superior to rigid structure alignment methods in identifying structural similarities between proteins that have undergone conformational changes. Methods based on aligned fragment pairs (AFPs) also have a special advantage over other approaches in balancing global and local structural similarities. Accordingly, we propose a new flexible protein structure alignment method based on variable-length AFPs. Compared with other methods, the proposed method has three main advantages. First, it is based on variable-length AFPs: the length of each AFP is determined separately so that it maximally represents a locally similar structure fragment, which reduces the number of AFPs. Second, it uses local coordinate systems, which simplify the computation at each step of AFP expansion during AFP identification. Third, it decreases the number of twists by rewarding the situation where nonconsecutive AFPs share the same transformation in the alignment, which is realized by dynamic programming with an improved transition function. The experimental data show that, compared with FlexProt, FATCAT, and FlexSnap, the proposed method achieves comparable results while introducing fewer twists. Meanwhile, it generates results similar to those of FATCAT in much less running time because of the reduced number of AFPs.

1. Introduction

The number of known protein structures in the Protein Data Bank (PDB) is growing fast. Analyzing these structures requires efficient structure comparison, which has become an essential problem in computational structural biology (Taylor and Orengo, 1989; Wang et al., 2013). Protein structure alignment methods help biologists to compare protein structures (Mayr et al., 2007; Hasegawa and Holm, 2009), to classify proteins (Ma and Wang, 2014), and to predict the functions of unknown proteins (Baker and Sali, 2001; Brylinski and Skolnick, 2008).

In the past three decades, many protein structure alignment methods have been proposed (Altschul et al., 1990; Ortiz et al., 2002; Krissinel and Henrick, 2004; Razmara et al., 2012). Based on different ways in finding optimal alignment, protein structure alignment methods can be divided into three groups: aligned fragment pair (AFP)-based methods, distance matrix-based methods, and other methods. The AFP-based methods first divide protein structures into different fragments and then identify the similar structure fragment pairs coming from two different protein structures, which form the set of AFPs. Then, the AFPs are chained with defined constraints to produce the optimal alignment result. For example, CE (Shindyalov and Bourne, 1998) and FATCAT (Ye and Godzik, 2003) are both AFP-based methods. The distance matrix-based methods first calculate distance matrices of substructures according to 3D coordinates of each protein, and then match similar substructures according to their distance matrices. A typical distance matrix-based method is DALI (Holm and Sander, 1993). The third group of structure alignment methods uses different approaches from the former two. For example, DeepAlign (Wang et al., 2013) aligns two protein structures using not only spatial proximity of equivalent residues but also evolutionary relationship and hydrogen-bonding similarities. The LGA (Zemla, 2003) method detects the regions of local and global structure similarities between proteins according to LCS (longest continuous segments) and GDT (global distance test).

Generally, the AFP-based structure alignment methods have a special advantage over the other methods: the AFP-based methods can balance between global structure similarities and local structure similarities, while other methods mainly focus on the global structure similarities during the alignment. According to whether twists are introduced, protein structure alignment can be divided into two categories: rigid structure alignment and flexible structure alignment. CE is a typical example of rigid structure alignment method, which treats protein structures as rigid bodies (Shindyalov and Bourne, 1998). It aligns protein structures by chaining the consecutive AFPs without twists. Considering recent evidence supporting the idea that some proteins exist in different forms in different environments (Holmes, 2009; Godshall et al., 2013; Wang et al., 2014), rigid structure alignment methods may fail to identify the structure similarities that have gone through conformational changes, while flexible structure alignment methods can remove the above limitation by introducing twists in the structure alignment. Until now, many flexible structure alignment methods, such as FlexProt (Shatsky et al., 2002, 2004), FATCAT, and FlexSnap (Salem et al., 2010), have been developed.

In this article, we propose a method for protein structure alignment based on variable-length AFPs. The method is called Flexible protein structure alignment using Variable-length Aligned Fragment Pairs (FlexVAFP). Compared with other flexible structure alignment methods such as FATCAT, the proposed method has three main advantages. First, it automatically adjusts the sizes of AFPs according to local structure similarities, which both improves the representation of the local structure similarities and reduces the total number of AFPs. Second, it uses local coordinate systems in identifying and expanding AFPs, which simplify the computation at each step of AFP expansion during AFP identification. Third, it decreases the number of twists by rewarding the situation where nonconsecutive AFPs share the same transformation in the alignment, which is realized by dynamic programming with an improved transition function.

The initial results of the proposed method are already reported in a conference article (Hu and Yonggang, 2015). In this study, the method is further evaluated with more structure alignment results; the efficiency of the method is shown by deriving a relationship between the running time and the number of AFPs; and more discussions about the method and its main features are included.

The rest of this article is organized as follows. First, we will describe how to implement the method in detail in Section 2. Second, we will assess and discuss the performance of the proposed method by comparing our method against other structure alignment methods in Section 3. Finally, we will conclude the article in Section 4.

2. Methods

This section describes the proposed method and its implementation in detail.

2.1. Definition of AFP

In the algorithm proposed in this article, an AFP represents a local match between two fragments taken from the two proteins under consideration. Given two proteins, protein A and protein B, AFP_k is an aligned match of two fragments of consecutive residues, one from protein A and one from protein B. The starting positions of AFP_k in the two proteins are b_A(k) and b_B(k), and the ending positions are e_A(k) and e_B(k).

To identify high-quality AFPs, a length constraint and a root mean square deviation (RMSD) constraint are imposed on AFPs. First, the length of an AFP must not be less than the minimum length L_min, which is set empirically. Second, the RMSD of an AFP is restricted to at most a threshold parameter, THRMSD.
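The two constraints can be expressed as a simple filter. The following Python sketch is illustrative only: the `AFP` class, its field names, and the inclusive RMSD comparison are assumptions made for this example, and the default values are taken from Table 1.

```python
from dataclasses import dataclass

L_MIN = 6       # minimum AFP length (Table 1)
TH_RMSD = 2.0   # RMSD threshold for an AFP (Table 1)


@dataclass(frozen=True)
class AFP:
    """One aligned fragment pair; positions are 0-based and inclusive (assumed convention)."""
    b_a: int      # starting position in protein A
    b_b: int      # starting position in protein B
    length: int   # number of aligned residue pairs
    rmsd: float   # RMSD of the two fragments after superposition

    @property
    def e_a(self) -> int:
        """Ending position in protein A."""
        return self.b_a + self.length - 1

    @property
    def e_b(self) -> int:
        """Ending position in protein B."""
        return self.b_b + self.length - 1


def is_valid_afp(afp: AFP, l_min: int = L_MIN, th_rmsd: float = TH_RMSD) -> bool:
    """Apply the length constraint and the RMSD constraint."""
    return afp.length >= l_min and afp.rmsd <= th_rmsd
```

With these defaults, a fragment pair of length 6 and RMSD 1.5 passes the filter, while one of length 5 or one with RMSD 2.5 is rejected.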

The quality of AFP is measured by the score derived from the following equation:

where L_k is the length of AFP_k; R_s is a coefficient; rmsd_k is the RMSD of AFP_k; THRMSD is the RMSD threshold; and ResA(b_A(k) + i) and ResB(b_B(k) + i) are residues of the (b_A(k) + i)th and the (b_B(k) + i)th in protein A and protein B, respectively; BLOSUM is the widely used amino acid substitution matrix, BLOSUM62 (Henikoff and Henikoff, 1992).

2.2. Process of producing AFPs

Before producing AFPs, the local coordinates of the C_α atoms of residues starting from the 4th residue in protein A and protein B are generated (Fig. 1). According to the locations of the C_αⁱ⁻³, C_αⁱ⁻², and C_αⁱ⁻¹ atoms, the local coordinate system centered at C_αⁱ⁻¹ is established. Then, the local coordinates $\vec{L}(i)$ of the C_αⁱ atom in the local coordinate system are computed by the following formulas:

FIG. 1.

The generation of a local coordinate system for C_αⁱ atom.
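As a hedged illustration of the idea, the sketch below builds one possible right-handed orthonormal frame from three preceding C_α atoms and expresses the next C_α atom in it. The axis convention (x along the last virtual bond, z normal to the plane of the three atoms) and all function names are assumptions made for this example and are not taken from the method's formulas. The sketch only demonstrates that such coordinates are unchanged by a rigid motion of the chain, so they can be compared between proteins without superposition.

```python
import math


def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    n = math.sqrt(dot(u, u))  # zero if the atoms coincide or are collinear
    return (u[0] / n, u[1] / n, u[2] / n)


def local_coordinates(ca, i):
    """Coordinates of ca[i] in a frame built from ca[i-3], ca[i-2], ca[i-1].

    ca is a list of (x, y, z) C-alpha positions; i is 0-based and must be >= 3.
    The frame is centered at ca[i-1]. The axis choice is an assumed convention.
    """
    if i < 3:
        raise ValueError("a local frame needs three preceding C-alpha atoms")
    p3, p2, p1 = ca[i - 3], ca[i - 2], ca[i - 1]
    x = unit(sub(p1, p2))                       # along the last virtual bond
    z = unit(cross(sub(p2, p3), sub(p1, p2)))   # normal to the plane of the three atoms
    y = cross(z, x)                             # completes a right-handed frame
    d = sub(ca[i], p1)                          # vector from the origin to ca[i]
    return (dot(d, x), dot(d, y), dot(d, z))


# Toy chain and the same chain translated by (5, -2, 3).
chain = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
moved = [(x + 5, y - 2, z + 3) for x, y, z in chain]
same = all(math.isclose(a, b, abs_tol=1e-12)
           for a, b in zip(local_coordinates(chain, 3), local_coordinates(moved, 3)))
```

In this toy example `same` is true: translating the whole chain leaves the local coordinates of the fourth atom unchanged.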

The process for producing an AFP in the proposed method, FlexVAFP, is shown in Figure 2. It can be divided into four steps. First, it computes the local coordinates of every C_α atom (Equation 2). Second, it looks for the starting residue pair (called the seed) of a possible AFP. To avoid overlap between AFPs, the starting residue pair cannot belong to another AFP. The conditions $\| \vec{L}(b_A(k) + 3) - \vec{L}(b_B(k) + 3) \| < THD$ and $| \theta(b_A(k) + 3) - \theta(b_B(k) + 3) | < DEG$ are used to determine whether the residue pair consisting of the b_A(k)th residue and the b_B(k)th residue can become a seed. Third, when the residue pair is identified as a seed, a local similar fragment pair (called a possible AFP) is identified and its length is expanded under the constraint $\| \vec{L}(p_A(k)) - \vec{L}(p_B(k)) \| < THD$. The parameters THD and DEG are set empirically in the experiments. By using the local coordinate systems, the method avoids computing the RMSD of the possible AFP at each step of the expansion. Finally, when the expansion of the possible AFP terminates, the local similar fragment pair forms an AFP if both its length and its RMSD meet the requirements. The program iterates through all possible combinations of the b_A(k)th residue and the b_B(k)th residue to find all AFPs.

FIG. 2.

The process of identifying an AFP in the FlexVAFP method. AFP, aligned fragment pair.
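The seed-and-expand loop can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: local coordinates are assumed to be precomputed (with `None` for the first three residues), the angle condition on θ and the final RMSD check are omitted, and the names `grow_from_seed` and `find_afps` are hypothetical.

```python
import math

THD = 3.5    # distance cutoff for the expansion of an AFP (Table 1)
L_MIN = 6    # minimum length of an AFP (Table 1)


def grow_from_seed(loc_a, loc_b, b_a, b_b, thd=THD):
    """Return the length of the fragment pair grown from (b_a, b_b), or 0 if it is not a seed.

    loc_a[i] and loc_b[i] hold the local coordinates of residue i (0-based);
    entries 0..2 are None because a frame needs three preceding residues.
    """
    p_a, p_b = b_a + 3, b_b + 3   # first residues that have local coordinates
    length = 0
    while (p_a < len(loc_a) and p_b < len(loc_b)
           and math.dist(loc_a[p_a], loc_b[p_b]) < thd):
        length = p_a - b_a + 1    # fragment now covers residues b_a..p_a
        p_a += 1
        p_b += 1
    return length


def find_afps(loc_a, loc_b, thd=THD, l_min=L_MIN):
    """Scan all start pairs and keep fragment pairs that are long enough."""
    used = set()   # residue pairs already covered by an accepted AFP
    afps = []
    for b_a in range(len(loc_a) - 3):
        for b_b in range(len(loc_b) - 3):
            if (b_a, b_b) in used:
                continue          # a seed may not lie inside another AFP
            length = grow_from_seed(loc_a, loc_b, b_a, b_b, thd)
            if length >= l_min:
                afps.append((b_a, b_b, length))
                used.update((b_a + k, b_b + k) for k in range(length))
    return afps


# Toy data: ten residues whose local coordinates match only on the diagonal.
loc = [None] * 3 + [(10.0 * i, 0.0, 0.0) for i in range(3, 10)]
toy_afps = find_afps(loc, loc)
```

Because each expansion step is a single distance comparison, no superposition or RMSD computation is needed until a candidate fragment pair has stopped growing.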

2.3. Connecting AFPs by dynamic programming with an improved transition function

To avoid overlap, AFP_m and AFP_n can be connected if the following conditions are satisfied:

$$ e_A(m) < b_A(n) \ \mathrm{and} \ e_B(m) < b_B(n) \tag{3} $$

where e_A(m) and e_B(m) represent the ending positions of AFP_m in protein A and protein B, respectively; b_A(n) and b_B(n) represent the starting positions of AFP_n, respectively.

In the proposed method, FlexVAFP, AFPs meeting the condition (Equation 3) are chained by dynamic programming. The connection score from AFP_m to AFP_n is given by

$$ c(m \to n) = W(D_{mn}) \times P_t + F(p, q) \tag{4} $$

$$ F(p, q) = M_s \times p + M_g \times q \tag{6} $$

where P_t is the maximum penalty for the connection; D_mn is the RMSD for aligning both AFP_m and AFP_n; W(D_mn) is a measure to evaluate the quality of the connection from AFP_m to AFP_n; F(p,q) does the same work by examining the gaps between the two AFPs; TH is a threshold to determine whether AFP_m and AFP_n can be linked by introducing a twist; TL is a threshold for penalizing a connection from AFP_m to AFP_n; p is the number of mismatched residues; q is the number of gaps; M_s is the penalty involved with mismatched residues; and M_g is used to penalize gaps.
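Equations 3, 4, and 6 can be written as three small Python functions. This is a minimal sketch: the function names are hypothetical, the parameter values come from Table 1, and W(D_mn) is passed in as a precomputed value `w` because its definition is not restated here; treating `w` as a weight between 0 and 1 is an assumption suggested by P_t being the maximum penalty.

```python
import math

P_T = -40.0   # penalty coefficient for a connection (Table 1)
M_S = -0.4    # penalty per mismatched residue (Table 1)
M_G = -0.4    # penalty per gap (Table 1)


def can_connect(e_a_m, e_b_m, b_a_n, b_b_n):
    """Equation 3: AFP_m must end before AFP_n starts in both proteins."""
    return e_a_m < b_a_n and e_b_m < b_b_n


def gap_term(p, q, m_s=M_S, m_g=M_G):
    """Equation 6: F(p, q) = M_s * p + M_g * q."""
    return m_s * p + m_g * q


def connection_score(w, p, q, p_t=P_T):
    """Equation 4, with w standing in for W(D_mn) supplied by the caller."""
    return w * p_t + gap_term(p, q)


# Worked example: 3 mismatched residues, 2 gaps, and an assumed weight w = 0.5.
f_value = gap_term(3, 2)               # -0.4 * 3 + -0.4 * 2, about -2.0
c_value = connection_score(0.5, 3, 2)  # 0.5 * -40 + F, about -22.0
```

Both terms are non-positive with the Table 1 settings, so a connection can only lower the score of a chain; the AFP scores are what make extending a chain worthwhile.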

To produce the optimal alignment result, a dynamic programming algorithm is used. To reward the situation where nonconsecutive AFPs share the same transformation in the alignment, the transition function of dynamic programming is improved. The transition function S(n) denotes the best score ending at AFP_n:

where a(n) is the score of AFP_n (Equation 1); OPT_m represents the best partial alignment path formed by the AFPs ending at AFP_m, which has the highest score; and P(m→n) represents the best connection score found from an AFP in the OPT_m to AFP_n.

The number of twists required to form OPT_n is

$$ T(n) = T(m) + t(m \to n) \tag{9} $$

where T(n) and T(m) are number of twists introduced to form OPT_n and OPT_m, respectively; t(m→n) is 1 if a new twist is introduced to produce OPT_n by adding AFP_n to OPT_m, otherwise t(m→n) is 0.

By using the improved transition function defined in Equations 7 and 8, the FlexVAFP method can reduce the number of twists by rewarding nonconsecutive AFPs that share the same transformation in the alignment. For example, suppose that the best partial alignment ending at AFP_m is OPT_m = {…, AFP_i, …, AFP_m} and that AFP_n is the next candidate AFP for concatenation, satisfying Equation 3 with D_mn > TH and D_in ≤ TH. A traditional AFP-based method such as FATCAT needs a twist to connect AFP_n with AFP_m, whereas our method needs none, because AFP_i and AFP_n share the same transformation.
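The chaining idea and the twist bookkeeping of Equation 9 can be sketched as follows. The recurrence used here, S(n) = a(n) + max(0, max over m of [S(m) + P(m→n)]), is an assumed form chosen for illustration and is not a transcription of Equations 7 and 8. The callables `conn` and `shares` are hypothetical stand-ins for the connection score and for the test D_in ≤ TH.

```python
def precedes(m, n):
    """Equation 3 for AFPs given as (b_a, b_b, length) tuples."""
    return m[0] + m[2] - 1 < n[0] and m[1] + m[2] - 1 < n[1]


def chain_afps(afps, a, conn, shares):
    """Chain AFPs by dynamic programming and count twists.

    afps   : list of (b_a, b_b, length)
    a      : a[n] is the score of AFP_n
    conn   : conn(i, n) -> connection score from AFP_i to AFP_n (hypothetical)
    shares : shares(i, n) -> True if AFP_i and AFP_n fit one transformation (hypothetical)
    Returns (path, S, T) for the best-scoring chain.
    """
    if not afps:
        return [], 0.0, 0
    # Process AFPs by end position in A so every possible predecessor comes first.
    order = sorted(range(len(afps)), key=lambda k: afps[k][0] + afps[k][2])
    S, T, prev = {}, {}, {}

    def path(k):
        """Rebuild OPT_k by following predecessor links."""
        out = []
        while k is not None:
            out.append(k)
            k = prev[k]
        return out[::-1]

    for pos, n in enumerate(order):
        S[n], T[n], prev[n] = a[n], 0, None   # a chain may start at AFP_n
        for m in order[:pos]:
            if not precedes(afps[m], afps[n]):
                continue
            opt_m = path(m)
            # Best connection from any AFP in OPT_m, not only from AFP_m itself.
            p = max(conn(i, n) for i in opt_m)
            cand = a[n] + S[m] + p
            if cand > S[n]:
                S[n], prev[n] = cand, m
                reuse = any(shares(i, n) for i in opt_m)
                T[n] = T[m] + (0 if reuse else 1)   # Equation 9
    end = max(S, key=S.get)
    return path(end), S[end], T[end]


# Toy case: AFP 0 and AFP 2 fit one transformation, AFP 1 needs another.
afps = [(0, 0, 10), (12, 12, 10), (24, 24, 10)]
group = [0, 1, 0]
shares = lambda i, n: group[i] == group[n]
conn = lambda i, n: 0.0 if shares(i, n) else -10.0
result = chain_afps(afps, [30.0, 30.0, 30.0], conn, shares)
```

In the toy case the chain contains all three AFPs but only one twist is counted, because the third AFP reuses the transformation of the first; a scheme that looks only at the immediately preceding AFP would count two. Rebuilding OPT_m inside the inner loop keeps the sketch short at the cost of efficiency.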

2.4. Postprocessing

Postprocessing is executed after the optimal correspondence of C_α atoms between the two proteins has been obtained by dynamic programming. The FlexVAFP method produces a different number of transformation matrices depending on the number of twists introduced. To produce a high-quality global alignment, the RMSD of the aligned fragment pairs sharing the same transformation matrix is required to be less than RMSD_max. If the RMSD is larger than RMSD_max, the FlexVAFP method introduces further twists until the condition is satisfied. The iterative refinement process is similar to that used by ProSup (Lackner et al., 2000).

3. Results

To evaluate the proposed method, FlexVAFP, thoroughly, it is compared with both rigid and flexible structure alignment methods. A rigid version of FlexVAFP is also produced so that FlexVAFP can be compared fairly with rigid structure alignment methods. The rigid version (FlexVAFP^r) works in essentially the same way as FlexVAFP, except that no twist is introduced while chaining AFPs. All the parameters used in our experiments are listed in Table 1.

Table 1.

The Setting of the Parameters Used in FlexVAFP

Symbol	Value	Meaning
THD	3.5	The distance cutoff for the expansion of an AFP.
DEG	π/6	The angle cutoff for the seed of a possible AFP.
L_min	6	The minimum length of an AFP.
THRMSD	2	The RMSD threshold for an AFP.
R_s	3.0	A coefficient for rewarding an AFP.
P_t	−40	The penalty coefficient for a connection.
TH	4.0	The highest RMSD for penalizing a connection.
TL	1.0	The lowest RMSD for penalizing a connection.
M_s	−0.4	The penalty for mismatched residues.
M_g	−0.4	The penalty for gaps.
RMSD_max	4.0	The cutoff to determine whether a section needs to be split.

3.1. Comparison with other rigid structure alignment methods

FlexVAFP^r is compared with the rigid structure alignment methods DALI, CE, and ProSup on several pairs of protein structures described as difficult in the literature (Ye and Godzik, 2003). FlexVAFP^r performs well in comparing distantly similar protein structures relative to CE, DALI, and ProSup (Table 2). For instance, for proteins 1BGE:B and 2GMF:A, FlexVAFP^r obtains 95 aligned residues with an RMSD of 3.10, while CE gets 94 aligned residues with an RMSD of 4.1, DALI produces 98 aligned residues with an RMSD of 3.5, and ProSup obtains 87 aligned residues with an RMSD of 2.4; that is, FlexVAFP^r aligns more residues than CE and ProSup, and its RMSD is lower than that of CE and DALI. In a few cases, FlexVAFP^r aligns more residues than all three other methods. For example, for proteins 1FXI:A and 1UBQ:_, FlexVAFP^r finds 61 aligned residues with an RMSD of 2.85, whereas ProSup gets 54 aligned residues with an RMSD of 2.6 and no values are listed for CE and DALI. These results suggest that the way of identifying AFPs in FlexVAFP is reasonable and effective.

Table 2.

Comparison of Structure Alignments of 10 Difficult Pairs of Structures from Fischer et al. (1996) by Four Different Methods

		CE	DALI	ProSup	FlexVAFP
Protein A	Protein B	S^a/R^b	S/R	S/R	S/R
1FXI:A	1UBQ:_	−/−	−/−	54/2.6	61/2.85
1TEN:_	3HHR:B	87/1.9	86/1.9	85/1.7	83/1.86
3HLA:B	2RHE:_	85/3.5	63/2.5	71/2.7	80/3.30
2AZA:A	1PAZ:_	85/2.9	−/−	82/2.6	81/2.63
1CEW:I	1MOL:A	69/1.9	81/2.3	76/1.9	81/2.36
1CID:_	2RHE:_	94/2.7	95/3.3	84/2.3	93/2.50
1CRL:_	1EDE:_	187/3.2	211/3.4	161/2.6	185/3.14
2SIM:_	1NSB:A	264/3.0	286/3.8	248/2.6	272/3.16
1BGE:B	2GMF:A	94/4.1	98/3.5	87/2.4	95/3.10
1TIE:_	4FGF:_	116/2.9	108/2.0	101/2.4	110/2.90

The data for CE, DALI, and ProSup are from literature (Lackner et al., 2000).

^a S represents the number of aligned residues.

^b R represents the RMSD of all aligned residues, in Å.

3.2. Comparison with other flexible structure alignment methods

To demonstrate that the FlexVAFP method usually introduces fewer twists, it is compared with three different flexible structure alignment methods, FlexProt, FATCAT, and FlexSnap, on FlexProt dataset (Shatsky et al., 2004). The results are shown in Table 3.

Table 3.

Comparison of FlexProt, FATCAT, FlexSnap, and FlexVAFP

		FlexProt		FATCAT		FlexSnap		FlexVAFP
Protein A	Protein B	S^a/R^b	T^c	S/R	T	S/R	T	S/R	T
1WDN:A	1GGG:A	218/0.94	2	220/1.01	2	220/0.96	2	206/1.16	2
1HPB:P	1GGG:A	220/2.34	2	213/1.59	2	211/1.67	2	205/2.24	2
2BBM:A	1CLL:_	139/2.22	1	144/2.28	1	138/1.8	1	141/1.94	1
2BBM:A	1TOP:_	147/2.40	3	145/2.28	3	137/1.78	3	143/2.05	2
1AKE:A	2AK3:A	200/2.44	2	202/1.54	2	207/2.05	2	206/1.94	1
2AK3:A	1UKE:_	182/2.90	2	188/2.97	0	184/2.36	1	180/2.36	0
1MCP:L	4FAB:L	218/1.93	1	217/1.40	1	217/1.49	1	216/1.2	1
1MCP:L	1TCR:B	212/2.33	1	213/2.20	1	202/2.3	1	212/2.12	1
1LFH:_	1LFG:_	691/1.41	2	686/0.89	2	688/0.99	2	678/1.10	1
1TFD:_	1LFH:_	291/1.98	2	290/1.37	2	287/1.89	2	282/1.53	1
1B9W:A	1DAN:L	75/2.78	1	80/2.39	2	82/2.25	2	80/2.42	2
1QF6:A	1ADJ:A	323/4.43	1	351/2.68	1	326/2.45	3	343/2.43	1
2CLR:A	3FRU:A	253/2.71	2	245/3.06	0	254/2.57	3	237/2.84	0
1FMK:_	1QCF:A	424/1.25	2	433/2.27	0	413/2.71	0	429/2.21	0
1FMK:_	1TKI:A	231/3.28	2	238/3.07	0	241/2.58	3	234/3.01	0
1A21:A	1HWG:C	163/2.75	4	153/3.16	1	156/2.35	3	157/2.26	1

The data for FlexProt, FATCAT, and FlexSnap are from literature (Salem et al., 2010).

^a S represents the number of aligned residues.

^b R represents the RMSD of all aligned residues.

^c T is the number of twists.

According to the experimental results in Table 3, FlexVAFP achieves results competitive with those of the three other methods in terms of the number of aligned residues and the RMSD. For instance, for proteins 1MCP:L and 4FAB:L, FlexVAFP detects a hinge and produces a structure alignment of 216 aligned residues with an RMSD of 1.2, while the other three methods produce 218 aligned residues with an RMSD of 1.93 (FlexProt), 217 aligned residues with an RMSD of 1.40 (FATCAT), and 217 aligned residues with an RMSD of 1.49 (FlexSnap), all with the same number of twists. Over the sixteen protein pairs in Table 3, the total numbers of twists introduced by the four methods are 30 (FlexProt), 20 (FATCAT), 31 (FlexSnap), and 16 (FlexVAFP). For every pair except 1B9W:A and 1DAN:L, FlexVAFP introduces no more twists than any of the other three methods.
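The twist totals quoted above can be checked directly from the T columns of Table 3. The short Python script below is only a bookkeeping aid; the lists are transcribed from the table in row order, and the variable names are arbitrary.

```python
# T columns of Table 3, one entry per protein pair, in row order.
twists = {
    "FlexProt": [2, 2, 1, 3, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 4],
    "FATCAT":   [2, 2, 1, 3, 2, 0, 1, 1, 2, 2, 2, 1, 0, 0, 0, 1],
    "FlexSnap": [2, 2, 1, 3, 2, 1, 1, 1, 2, 2, 2, 3, 3, 0, 3, 3],
    "FlexVAFP": [2, 2, 1, 2, 1, 0, 1, 1, 1, 1, 2, 1, 0, 0, 0, 1],
}

# Total number of twists per method.
totals = {method: sum(values) for method, values in twists.items()}

# Rows (0-based) where FlexVAFP uses more twists than at least one other method.
others = ["FlexProt", "FATCAT", "FlexSnap"]
exceptions = [
    row for row in range(16)
    if any(twists["FlexVAFP"][row] > twists[m][row] for m in others)
]

print(totals)
print(exceptions)
```

The only row reported in `exceptions` is index 10, which corresponds to the pair 1B9W:A and 1DAN:L.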

To further illustrate that the FlexVAFP method can produce comparable alignment results with fewer twists, 25 pairs of protein structures are selected from the database of macromolecular motions (Gerstein and Krebs, 1998) as a test dataset. The performance of FlexVAFP on this dataset is compared with that of the three other methods, and the results are listed in Table 4.

Table 4.

Comparison of FlexProt, FATCAT, FlexSnap, and FlexVAFP on 25 Pairs of Protein Structures

		FlexProt		FATCAT		FlexSnap		FlexVAFP
Protein A	Protein B	S/R	T	S/R	T	S/R	T	S/R	T
1CNP:A	1A03:A	80/2.76	3	83/3.08	0	47/1.54	2	80/2.9	0
1A67:A	1CEW:I	81/2.56	3	106/2.99	2	94/2.54	2	91/2.87	0
1G96:A	1A67:A	81/2.46	3	100/3.25	1	63/1.98	3	97/2.96	1
2G96:A	1CEW:I	93/2.85	4	65/3.09	2	37/1.63	2	59/2.91	1
1KMO:A	1KMP:A	512/2.95	2	647/1.79	0	644/1.92	0	627/0.8	1
1CRL:A	1THG:A	496/2.96	1	525/3.16	0	509/2.4	2	512/1.63	1
1BYU:A	1RRP:A	188/2.87	1	186/2.01	1	177/1.26	3	196/2.14	2
1FSS:A	1VOT:A	529/2.86	1	527/0.49	0	527/0.49	0	527/0.49	0
1ACL:A	1FSS:A	527/2.88	1	527/0.55	0	527/0.55	0	527/0.55	0
3ADK:A	1AKE:A	182/2.63	2	179/3.06	0	173/2.6	3	172/2.18	1
1I2D:A	1M8P:A	572/1.9	1	572/4.07	0	572/1.9	1	566/1.23	1
1HNF:A	1HNG:A	175/2.45	0	174/2.17	0	171/2.52	0	171/2.01	0
1GTM:A	1HRD:A	382/2.96	2	411/1.89	0	390/2.49	0	408/1.83	0
4CLN:A	2BBM:A	143/2.22	1	143/2.24	1	138/2.27	1	141/2.05	1
1L5B:A	1L5E:A	101/1.19	1	101/0.96	1	100/1.57	1	101/0.7	1
1DDT:A	1MDT:A	523/1.58	1	520/0.91	1	523/1.8	1	521/0.93	1
1DPE:A	1DPP:A	507/0.58	1	498/2.14	1	507/0.81	2	503/0.71	1
1N0U:A	1N0V:D	812/2.9	2	812/2.2	1	819/1.1	2	782/1.37	1
1EPS:A	1G6S:A	420/2.31	1	420/1.71	1	422/2.2	1	424/1.31	1
1ERK:A	1KOB:A	212/2.95	3	276/3.00	1	281/2.63	3	279/2.42	2
1E8B:A	1E88:A	160/2.41	1	153/3.11	0	160/2.43	1	155/3.19	0
1JBV:A	1JBW:A	397/2.69	1	397/1.72	0	397/1.92	0	396/1.72	0
1EX6:A	1EX7:A	176/2.97	2	183/3.32	0	186/0.95	2	169/2.34	0
1EX6:A	1GKY:A	175/2.96	2	182/3.23	0	186/0.96	2	169/2.34	0
8OHM:A	1CU1:A	423/2.18	2	433/3.02	1	434/1.76	2	403/2.82	1

The data for FlexProt and FATCAT are from their online servers; the data for FlexSnap are from its stand-alone version.

The results in Table 4 also demonstrate that the method, FlexVAFP, can achieve comparable results compared with the other three different methods, FlexProt, FATCAT, and FlexSnap, by introducing fewer twists. In all 25 alignment results, FlexProt finds 7947 aligned residues in total with the sum of RMSD 62.03; FATCAT obtains 8220 aligned residues in total with the sum of RMSD 59.16; FlexSnap gets 8084 aligned residues in total with the sum of RMSD 44.22; and FlexVAFP matches 8076 aligned residues in total with the sum of RMSD 46.4. Compared with FlexProt and FATCAT, both FlexSnap and FlexVAFP achieve comparable aligned residues with lower RMSD. The total numbers of twists introduced by the four methods are 42 (FlexProt), 14 (FATCAT), 36 (FlexSnap), and 17 (FlexVAFP), respectively. Accordingly, compared with FlexSnap, FlexVAFP produces alignment with comparable length and RMSD while introducing fewer twists.

Tables 3 and 4 show that the total number of twists introduced by FlexVAFP is lower than that of FlexProt and FlexSnap on both datasets. It is also lower than that of FATCAT in Table 3 (16 vs. 20), although FATCAT introduces slightly fewer twists in Table 4 (14 vs. 17). FlexVAFP can obtain competitive results with a low number of twists because it allows nonconsecutive AFPs to share the same transformation during the concatenation (Fig. 3).

FIG. 3.

The alignment results between proteins, 1AKE:A and 2AK3:A, generated by (a) FlexProt, (b) FATCAT, and (c) FlexVAFP. Different sections of 1AKE:A are marked with green, purple, and brown, and different sections of 2AK3:A are marked with red, yellow, and blue in (a) and (b). Different sections of 1AKE:A are marked with green and purple, and different sections of 2AK3:A are marked with red and yellow in (c). The sections in the same color share the same transformation.

Figure 3 shows that both FlexProt and FATCAT introduce two twists in the alignment of proteins 1AKE:A and 2AK3:A, while FlexVAFP introduces only one twist for the same protein pair. Sections that FlexProt and FATCAT treat as separate alignment sections in Figure 3a and b are treated by FlexVAFP as a single alignment section sharing one transformation.

3.3. Comparison of FlexVAFP and FATCAT about the number of AFPs

To show that the running time of the AFP-based structure alignment method strongly depends on the number of AFPs, the AFP identification process in FlexVAFP is modified to produce nine different variations of the method based on constant-length AFPs (with the length of AFP required to be 6, 7, 8, 9, 10, 11, 12, 13, and 14, respectively). Then, these nine methods are compared with the original FlexVAFP method. The relationship between the running time and the number of AFPs generated by the methods is shown in Figure 4.

FIG. 4.

The relationship between the running time and the number of AFPs for three protein pairs: 2BBM:A and 1CLL:_, 2BBM:A and 1TOP:_, and 1A21:A and 1HWG:C. The unfilled marks represent the data produced using constant-length AFPs. For each protein pair, nine different numbers of AFPs are produced by setting the length of the AFPs to nine different values: 6, 7, 8, 9, 10, 11, 12, 13, and 14. The filled marks represent the data produced by FlexVAFP. All the data can be approximately fitted by a single curve: y = ax², where a ≈ 2.3 × 10⁻⁴.

It can be seen from Figure 4 that the running time is approximately proportional to the square of the number of AFPs generated by the methods. The number of AFPs generated by an AFP-based method therefore reflects its efficiency. As seen from Table 5, the number of AFPs produced by FATCAT is roughly 9 to 25 times greater than that of FlexVAFP. If the same quadratic relationship holds for FATCAT, the running time of FlexVAFP would be expected to be on the order of 100 times shorter than that of FATCAT. The use of variable-length AFPs in FlexVAFP reduces the number of AFPs, which greatly improves the efficiency of the structure alignment.

Table 5.

Comparison of FlexVAFP and FATCAT About Number of AFPs

Protein A	Protein B	FATCAT number of AFPs	FlexVAFP number of AFPs
1WDN:A	1GGG:A	14,565	970
1HPB:P	1GGG:A	15,383	1140
2BBM:A	1CLL:_	8737	922
2BBM:A	1TOP:_	10,340	1024
1AKE:A	2AK3:A	14,798	1455
2AK3:A	1UKE:_	14,660	1333
1MCP:L	4FAB:L	18,910	1625
1MCP:L	1TCR:B	20,918	1611
1LFH:_	1LFG:_	148,736	9049
1TFD:_	1LFH:_	64,259	3330
1B9W:A	1DAN:L	3118	126
1QF6:A	1ADJ:A	83,383	6673
2CLR:A	3FRU:A	19,528	1880
1FMK:_	1QCF:A	57,747	3270
1FMK:_	1TKI:A	42,060	2915
1A21:A	1HWG:C	12,696	914
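The AFP counts in Table 5 can be turned into a rough estimate of the relative running time. The Python sketch below assumes that the quadratic fit y = ax² from Figure 4, which was obtained from FlexVAFP and its constant-length variants, also applies to FATCAT; under that assumption the coefficient a cancels and the time ratio is simply the squared ratio of AFP counts.

```python
# (FATCAT number of AFPs, FlexVAFP number of AFPs) for the 16 pairs in Table 5.
afp_counts = [
    (14565, 970), (15383, 1140), (8737, 922), (10340, 1024),
    (14798, 1455), (14660, 1333), (18910, 1625), (20918, 1611),
    (148736, 9049), (64259, 3330), (3118, 126), (83383, 6673),
    (19528, 1880), (57747, 3270), (42060, 2915), (12696, 914),
]

# Ratio of AFP counts for each protein pair.
ratios = [fatcat / flexvafp for fatcat, flexvafp in afp_counts]

# Implied running-time ratio under y = a * x**2 (the coefficient a cancels).
time_ratios = [r ** 2 for r in ratios]

print(f"AFP ratio: {min(ratios):.1f} to {max(ratios):.1f}")
print(f"implied time ratio: {min(time_ratios):.0f} to {max(time_ratios):.0f}")
```

The estimate is only as good as the assumption that both programs follow the same curve; measured running times of FATCAT are not part of Table 5.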

4. Conclusion

In this article, we have developed a new method to identify the local similar fragments between protein structures and to obtain the optimal structure alignment results by introducing as few twists as possible, which is realized by dynamic programming with an improved transition function. The proposed method identifies the local similar fragment pairs (called AFPs) of protein structures by defining a local coordinate system of a C_α atom using three previous consecutive residues, which simplifies the computation at each step of the expansion of AFPs. Our method also allows different AFPs to have different lengths, which can greatly reduce the number of AFPs and improve the efficiency of the proposed method compared with other methods that use constant-length AFPs. In addition, the FlexVAFP method allows nonconsecutive AFPs to share the same transformation during the concatenation, which helps in reducing the number of twists in the structure alignment. Compared with other methods, the proposed structure alignment method can produce competitive results with fewer twists and shorter running time.

Acknowledgments

This work is supported by the National Science Foundation of China (Grant no. 61272213) and the Fundamental Research Funds for Central Universities (Grant no. lzujbky-2016-k07). The authors would also like to thank anonymous reviewers for their valuable comments.

Author Disclosure Statement

No competing financial interests exist.

References

Altschul, S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Baker and Sali. 2001. Protein structure prediction and structural genomics. Science, 294, 93–96.

Brylinski and Skolnick. 2008. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc. Natl. Acad. Sci. U. S. A., 105, 129–134.

Fischer, Elofsson, Rice, et al. 1996. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac. Symp. Biocomput., 1996, 300–318.

Gerstein and Krebs. 1998. A database of macromolecular motions. Nucleic Acids Res., 26, 4280–4290.

Godshall, B.G., Tang, Yang, et al. 2013. An aggregate analysis of many predicted structures to reduce errors in protein structure comparison caused by conformational flexibility. BMC Struct. Biol., 13 Suppl 1, S10.

Hasegawa and Holm. 2009. Advances and pitfalls of protein structural alignment. Curr. Opin. Struct. Biol., 19, 341–348.

Henikoff and Henikoff, J.G. 1992. Amino-acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U. S. A., 89, 10915–10919.

Holm and Sander. 1993. Protein-structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123–138.

Holmes, K.C. 2009. Structural biology: Actin in a twist. Nature, 457, 389–390.

Hu and Yonggang. 2015. Flexible protein structure alignment by variable-length aligned fragment pairs. IEEE Int. Conf. Bioinf. Biomed., 2015, 1280–1286.

Krissinel and Henrick. 2004. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr. D Biol. Crystallogr., 60, 2256–2268.

Lackner, Koppensteiner, W.A., Sippl, M.J., et al. 2000. ProSup: A refined tool for protein structure alignment. Protein Eng., 13, 745–752.

Ma, J.Z., and Wang. 2014. Algorithms, applications, and challenges of protein structure alignment. Adv. Protein Chem. Struct. Biol., 94, 121–175.

Mayr, Domingues, F.S., and Lackner. 2007. Comparative analysis of protein structure alignments. BMC Struct. Biol., 7, 50.

Ortiz, A.R., Strauss, C.E., and Olmea. 2002. MAMMOTH (matching molecular models obtained from theory): An automated method for model comparison. Protein Sci., 11, 2606–2621.

Razmara, Deris, and Parvizpour. 2012. TS-AMIR: A topology string alignment method for intensive rapid protein structure comparison. Algorithms Mol. Biol., 7, 4.

Salem, Zaki, M.J., and Bystroff. 2010. FlexSnap: Flexible non-sequential protein structure alignment. Algorithms Mol. Biol., 5, 12.

Shatsky, Nussinov, and Wolfson, H.J. 2002. Flexible protein alignment and hinge detection. Proteins, 48, 242–256.

Shatsky, Nussinov, and Wolfson, H.J. 2004. FlexProt: Alignment of flexible protein structures without a predefinition of hinge regions. J. Comput. Biol., 11, 83–106.

Shindyalov, I.N., and Bourne, P.E. 1998. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.

Taylor, W.R., and Orengo, C.A. 1989. Protein structure alignment. J. Mol. Biol., 208, 1–22.

Wang, H.W., Chu, C.H., Wang, W.C., et al. 2014. A local average distance descriptor for flexible protein structure comparison. BMC Bioinformatics, 15, 95.

Wang, Ma, Peng, et al. 2013. Protein structure alignment beyond spatial proximity. Sci. Rep., 3, 1448.

Ye and Godzik. 2003. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19 Suppl 2, ii246–ii255.

Zemla. 2003. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res., 31, 3370–3374.