A Novel Algorithm for Enhanced Structural Motif Matching in Proteins

Abstract

As widely discussed in the literature, spatial patterns of amino acids, so-called structural motifs, play an important role in protein function. The functionally responsible part of a protein often lies in an evolutionarily highly conserved spatial arrangement of only a few amino acids, which are held tightly in place by the rest of the structure. These recurring amino acid arrangements can be seen as patterns in three-dimensional space. In general, these motifs can mediate various functional interactions, such as DNA/RNA targeting and binding, ligand interactions, substrate catalysis, and stabilization of the protein structure. Hence, characterizing and identifying such conserved structural motifs can contribute to the understanding of structure–function relationships. For this reason, and because of the rapidly increasing number of solved protein structures, it is highly desirable to identify, understand, and search for structurally scattered amino acid motifs. This work aims at the development and implementation of a novel and robust matching algorithm to detect structural motifs in large sets of target structures. The proposed methods were combined and implemented in a feature-rich and easy-to-use command line software tool written in Java.

1. Introduction

1.1. Biological relevance

The assessment, identification, and linkage of protein structure and function is still a demanding process. Functional relationships can be reflected in binding site similarity even if global structure or sequence differs significantly (Xie et al., 2009). Hence, distantly related proteins can be linked by means of drug targeting and response. The close relationship between structure and function is well established. Spatial arrangements of amino acids, so-called structural motifs, can be responsible for catalytic activity (Hedstrom, 2002), DNA/RNA interaction (Miller et al., 1985; Darnell, 2006), ion fixation (Kong et al., 2007; Ebrahimi et al., 2012; Xue et al., 2008), and structure stabilization (Koutsotoli and Tzakos, 2012) of proteins (Fig. 1 and Table 1).

FIG. 1.

Selection of biologically relevant motifs. (A) The superfamily-representing motifs KDEEH and DSKSD for the enolase superfamily (ES) and the haloacid dehalogenase superfamily (HADS) (Meng et al., 2004). Position-specific exchanges (PSEs) induced by divergent evolution of ES and HADS are depicted by translucent residues. (B) The YCAY RNA binding motif EQIR binds RNA specifically (Darnell, 2006). (C) The structure-stabilizing motif WWP allows two C-H-π interactions between two tryptophan residues sandwiching a proline (Koutsotoli and Tzakos, 2012). (D) Metal ion coordinating motifs: the zinc ion binding motif CCHH of the zinc finger domain (Miller et al., 1985), the diiron catalytic center motif EEHEQ of human H-chain ferritin (Ebrahimi et al., 2012), and the copper ion binding site in Alzheimer's amyloid precursor protein (Kong et al., 2007). (E) Catalytic activity of the serine protease active site triad HDS. Nucleophilic attack toward the carbonyl group of the substrate peptide through serine is indicated by a dashed line (Ekici et al., 2008; Hedstrom, 2002).

Table 1.

Detailed Overview of Selected Biologically Relevant Motifs

Motif	Type	Description	Source	Chain(s)	Residues	Reference
CCHH	intra	zinc finger protein	1G2F	F	F-C207	Miller et al., 1985
		zinc ion binding site			F-C212
					F-H225
					F-H229
DSKSD	intra	haloacid dehalogenase superfamily descriptor	1QQ5	A	A-D8	Koonin and Tatusov, 1994
					A-S114→T
					A-K147→R
					A-S172→DE
					A-D176→E
EEHEQ	intra	ferroxidase diiron catalytic center of human H-chain ferritin	1FHA	A	A-E27	Ebrahimi et al., 2012
					A-E62
					A-H65
					A-E107
					A-Q141
EQIR	intra	YCAY RNA binding motif	1EC6	B	B-E14	Darnell, 2006
					B-Q40
					B-I41
					B-R54
HDS	inter	serine proteases catalytic triad, most widespread form	4CHA	B,C	B-H57	Ekici et al., 2008
					B-D102
					C-S195
HHY	intra	copper ion binding site in Alzheimer's amyloid precursor protein copper-binding domain	2FK1	A	A-H147	Kong et al., 2007
					A-H151
					A-Y168
KDEEH	intra	enolase superfamily descriptor	2MNR	A	A-K164→H	Meng et al., 2004
					A-D195
					A-E221
					A-E247→DN
					A-H297→K
WWP	intra	example for Me/π and XH/π interaction: functionally important structural feature in proteins	4FDI	A	A-W491	Koutsotoli and Tzakos, 2012
					A-W496
					A-P511

Shown for each motif are nomenclature, type, short description, source structure, chain(s) of occurrence, incorporated residues in the format [chain ID]-[amino acid type][residue number]→[exchange] and reference.

For instance, the ability to degrade polypeptides is an essential biological process. This chemical reaction is usually catalyzed by certain substrate-specific enzymes, the proteases. The first identified protease catalytic site was revealed in 1967 by studying α-chymotrypsin with X-ray diffraction (Matthews et al., 1967). Subsequent structure alignment of proteases uncovered remarkable similarity of active sites (Fischer et al., 1994). It was found that a spatial arrangement of three amino acids, a so-called triad, consisting of histidine, aspartate, and in most cases serine (Fig. 1, HDS), is responsible for peptide bond cleavage in proteases. For the purpose of conservation and structure stabilization the catalytic triad is part of an extensive hydrogen bonding network (Hedstrom, 2002). It was reported that, in general, serine attacks the carbonyl group of the peptide bond whereby histidine acts as a general base in the first step. The protonated histidine is then stabilized by formation of hydrogen bonds with aspartate. Finally, by involving water, the amino group is broken (Hedstrom, 2002). Serine proteases are involved in many in vivo reactions like complex cascading inflammation processes, vascular homeostasis, thrombosis (Sharony et al., 2010) or chronic pancreatitis (Witt et al., 2000). Furthermore, decreased levels of the serine protease neurosin were observed in brain tissue of patients suffering from Alzheimer's or Parkinson's disease, although a direct link is still unknown (Ogawa et al., 2000). Additionally, recent studies show evidence that selective inhibition of human immunodeficiency virus associated serine proteases can stop the spreading of the infection by preventing cleavage of viral polyproteins into functional subunits, thereby thwarting virus maturation (Titanji et al., 2013).
Consequently proteases are highly important drug targets (Sanderson, 1999), thorough understanding of their molecular structure is mandatory, and bioinformatic investigation of catalytic mechanisms broken down to structural motifs is worthwhile.

Additional recently discussed topics include Me/π and XH/π interactions induced by aromatic amino acids. Widely present in proteins, such interaction motifs are evidently essential for protein function and folding (Plevin et al., 2010). In this context Koutsotoli and Tzakos described the disruption of the protein–protein interaction network in human host cells by infection of enterohemorrhagic Escherichia coli (EHEC). Here a π-π stacking interaction of two tryptophan residues sandwiching a proline is exploited during EHEC infection process. This so-called C-H-π interaction motif (Fig. 1, WWP) was found to be present in more than 600 cases in the PDB (Koutsotoli and Tzakos, 2012).

Furthermore, the active binding site of human H-chain ferritin (PDB:1FHA) that is essential for bioavailable iron storage (Ebrahimi et al., 2012) can be described as a structural motif consisting of five residues mediating ion binding (Fig. 1, EEHEQ).

These are only a few examples of functionally tailored, highly specific and conserved structural motifs that occur among different protein families and thus lead to proteins that share common functional characteristics by evolutionary means. Moreover, the representation through structural motifs can be used to describe classes of proteins. Function can be preserved during protein evolution even when overall sequence identity or fold diverges, and even if a common function is not obvious, partial reactions can be shared. The motif KDEEH (Fig. 1, KDEEH), based on a mandelate racemase structure (PDB:2MNR), was derived by Meng et al. in 2004 and comprises several structures of the enolase superfamily (ES). Active site exchanges happened during evolution of the ES; nevertheless, a partial chemical reaction persisted as common function: the support of proton abstraction from carbon adjacent to carboxylic acid and the subsequent formation of an enolate anion intermediate (Babbitt et al., 1996). The lysine residue at position 164 can be substituted by histidine, glutamate 247 by aspartate or asparagine, and histidine 297 by lysine. Using this template definition it was possible to represent the ES and their subgroups appropriately (Meng et al., 2004). Another example of a superfamily-representing motif is DSKSD derived from PDB:1QQ5 (Fig. 1, DSKSD). Like the ES, the haloacid dehalogenase superfamily (HADS) is thought to have evolved with a partial reaction constrained: the catalysis of hydrolytic nucleophilic substitution resulting in a covalent bond between a fully conserved aspartate and an atom of the substrate (usually phosphate) (Meng et al., 2004). As with KDEEH, residue definitions are not stringent and substitutions are permitted (Table 1, DSKSD).

1.2. Motivation

On the one hand computational examination of substructure similarities is highly desirable, but on the other hand existing methods are limited or deprecated. Table 2 provides a brief overview of existing approaches, their underlying methodology, availability, and purpose of application. In literature amino acid representation of protein structures in the context of substructure matching was mainly focused on single or a few atoms like C_α-only, C_α and C_β, or pseudo-atom-based sidechain representations (Debret et al., 2009; Moll et al., 2010). While the latter represents—even if only marginal—the different types of amino acids geometrically, C_α-only representation does not allow the geometric mapping of amino acid types. A pseudo-atom sidechain representation as compromise may provide a good trade-off between fuzziness and sensitivity of the search method and is less sensitive to conformational changes (Meng et al., 2004). But the fact that such representations are insufficient and inaccurate in some cases was proven, for example, for the far-reaching ES (Meng et al., 2004). This carries great weight if proteins are of limited sequence identity and hence structural variant, which is known for ES (Meng et al., 2004; He et al., 2013) and even more evident for HADS (Meng et al., 2004; Aravind et al., 1998; Koonin and Tatusov, 1994; Baker et al., 1998).

Table 2.

Summary of Existing Structural Motif Search Approaches

Name	Method	Provided as	Application	Note	Reference
ASSAM	graph theoretical	WS	motif matching	—	Nadzirin et al., 2012
BALLAST	local geometric consideration	none	motif matching	—	He et al., 2013
CatSId	graph theoretical	WS	motif matching and identification	automated-access script provided	Nilmeier et al., 2013
LabelHash	geometric hashing	WS, ST	motif matching	large storage requirement	Moll et al., 2010
ProBIS	graph representation of protein surface	WS, ST	local structure alignments, binding site identification, motif matching	limited to protein surface comparison	Konc and Janezic, 2010
RASMOT-3D PRO	distance comparison	WS	motif matching	results limited	Debret et al., 2009
SPASM	distance comparison	ST	motif matching	deprecated	Kleywegt, 1999

An overview of existing approaches for structural motif searching showing their underlying method, availability as web service (WS) or standalone tool (ST), application, and reference.

Exploring the effect of considering specific atoms for matching, which might be important for or directly involved in active site functionality, is highly desirable, and limitations of previous methods were already recognized (Meng et al., 2004). A holistic representation that allows for incorporating all atoms or arbitrary subsets of atoms can form a case-dependent and adequate mapping of functionally important or structure-forming elements. Hence, a detailed definition of the atom representation is mandatory to retain specificity in certain cases. Nonetheless, previous software lacks such arbitrary atom definitions (Debret et al., 2009; Moll et al., 2010; Nadzirin et al., 2012), and consequently this hurdle has to be overcome.

Furthermore, structural motifs can occur in the same protein chain (intramolecular) or scattered among different protein chains (intermolecular) per se (Koutsotoli and Tzakos, 2012; Ekici et al., 2008; Tsukada and Blow, 1985). Although intramolecular occurrences are more common, intermolecular motifs are often observed in protein–protein interfaces for structure stabilization (Koutsotoli and Tzakos, 2012), and ligand and substrate binding (Ekici et al., 2008), as well as contact sites in general. Therefore intermolecular findings are highly important, albeit most available matching methods are limited to single chains (Debret et al., 2009; Nilmeier et al., 2013; Konc and Janezic, 2010).

The problem of finding a set of geometrically and compositionally defined amino acids (query motif) in a set of target structures is rather complex. The geometric component of the problem can be defined as pattern matching in three-dimensional space, while the compositional component introduces further restrictions regarding the amino acid types matched (He et al., 2013). These can be, for instance, the requirement of exact amino acid matching or the toleration of chemically similar or arbitrary residue substitutions. The compositional constraint can be tightened even further such that amino acid substitutions are restricted to specific positions of the motif and not, for instance, the exchange of tryptophan to glycine in general. This gains importance if structural motifs are considered that contain more than one amino acid of the same type (WWP, Fig. 1). Here it is essential to distinguish between substitutions at different residue positions. Exchanges of the tryptophan residue oriented toward the nitrogen and the C_δ atom of proline are not necessarily identical to those of the tryptophan facing the proline C_β and C_γ atoms. Here so-called position-specific exchanges (PSEs) are crucial and of high interest to restrain structure-disrupting substitutions during matching, especially if motifs occur intermolecularly. However, in nearly all current matching methods an implementation of PSEs is lacking (Nadzirin et al., 2012; He et al., 2013; Debret et al., 2009; Kleywegt, 1999; Konc and Janezic, 2012). Consequently the outlined drawbacks were tackled during development and particular attention was paid to:

• arbitrary and user-definable atom representation of motifs,

• detection of intra- and intermolecular matches,

• and the definition of PSEs.

The lack of an easy-to-use implementation of a structural motif search algorithm underscores the necessity to develop versatile tools suitable for contemporary high-throughput analyses. Nowadays the number of protein-related data increases rapidly and structures are released prior to biochemical and functional annotation due to structural genomics effort and automated structure determination pipelines (Duarte et al., 2012). Hence, screening against libraries of structural motifs, for instance, derived from the Catalytic Site Atlas (CSA) (Torrance et al., 2005), may reveal hidden functional aspects. Furthermore, the identification of a partial chemical reaction for protein superfamilies may be possible and can aid deducing a superfamily representative template (Meng et al., 2004). As a consequence, spatial patterns of amino acids that are directly representing structure–function relationships can be utilized to search databases of protein structures with high sensitivity and allow for uncovering substructure similarity of divergently evolved proteins that are likely to share function (Meng et al., 2004; He et al., 2013). Therefore, the utilization of search methods for structural patterns can aid researchers to identify possible targets for drug repositioning or to discover off-target binding of specific drugs (Kirshner et al., 2013; Xie et al., 2009). Where similar substructure fold is observed it is possible that there are drug-binding capabilities as well, causing unwanted side effects. Additionally, the detection of allosteric protein-binding sites is a reasonable application of structural motif search algorithms. Newly determined or even theoretical (e.g., homology modeling based) structures can be screened against a library of binding sites or other relevant motifs to find and further investigate potential new drug targets (Kirshner et al., 2013).

1.3. Nomenclature

A unified nomenclature is used to increase comprehensibility of the following elucidations and to define an unmistakable description for structural motifs. At the motif level, one-letter codes of motif-incorporated amino acids are concatenated following their ascending sequential order. Unambiguous amino acids are labeled consistently with the chain of occurrence, one-letter notation of amino acid type and residue number according to the corresponding Protein Data Bank (PDB) entry.

For example, consider the widely discussed structural motif present in zinc finger domains, which is constituted of a pair of cysteine residues and a pair of histidine residues each (Miller et al., 1985). This motif, derived from PDB:1G2F, consists of the residues F-C207, F-C212, F-H225, and F-H229 and is therefore denoted as CCHH.

If PSEs are allowed they are indicated by an arrow following the sequence number of the residue that should be declared as a variable. The ES-representing motif constituted of lysine, aspartate, two glutamate residues, and histidine (Meng et al., 2004) was derived from PDB:2MNR and is denoted as KDEEH according to the sequential order of the incorporated amino acids: A-K164 → H, A-D195, A-E221, A-E247 → DN, A-H297 → K.
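The residue notation above lends itself to a small parser. The following is a minimal sketch and not part of the published tool: the names `RESIDUE_PATTERN`, `parse_residue`, and `motif_name` are hypothetical, and it assumes that "ascending sequential order" can be approximated by sorting on the residue number.

```python
import re

# Hypothetical pattern for labels such as "A-K164" or "A-E247 → DN":
# [chain ID]-[amino acid type][residue number]→[exchange]
RESIDUE_PATTERN = re.compile(
    r"^(?P<chain>[A-Za-z0-9])-(?P<aa>[A-Z])(?P<number>-?\d+)"
    r"\s*(?:→\s*(?P<pse>[A-Z]+))?$"
)


def parse_residue(label):
    """Split a residue label into chain, type, number, and allowed labels."""
    match = RESIDUE_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid residue label: {label!r}")
    # The residue's own type is always allowed; PSEs extend the set.
    allowed = {match["aa"]} | set(match["pse"] or "")
    return {
        "chain": match["chain"],
        "type": match["aa"],
        "number": int(match["number"]),
        "allowed": allowed,
    }


def motif_name(labels):
    """Concatenate one-letter codes in ascending residue-number order."""
    residues = sorted((parse_residue(l) for l in labels),
                      key=lambda res: res["number"])
    return "".join(res["type"] for res in residues)


kdeeh = ["A-K164 → H", "A-D195", "A-E221", "A-E247 → DN", "A-H297 → K"]
print(motif_name(kdeeh))
print(sorted(parse_residue(kdeeh[3])["allowed"]))
```

The `allowed` set returned per residue corresponds to the type of the residue plus its permitted exchanges, which is the information a PSE-aware matcher needs for each motif position.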

1.4. Substructure similarity

For the assessment of geometric resemblance of structural motifs and a set of substructural match candidates of a target structure, the least-root-mean-square deviation (LRMSD) is used. The LRMSD (Eq. 1) is defined as the minimal root-mean-square deviation (RMSD) (Eq. 2) over all possible ideal superimpositions computed by singular value decomposition (Golub and Reinsch, 1970) of all permutations of two sets of atoms A and B (Fofanov et al., 2008). Hereby only atoms of the same type are considered for pairwise alignment to reduce computational effort.

$$\mathrm{LRMSD}(A, B) = \min(\mathrm{RMSD}(A, B)) \tag{1}$$

$$\mathrm{RMSD}(A, B) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert a_i - b_i \rVert^2} \tag{2}$$
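As a rough illustration of Eqs. 1 and 2, the sketch below computes the RMSD of paired points and takes the minimum over all pairings of two equally sized point sets. It is a simplification under stated assumptions: the SVD-based superimposition is deliberately omitted, so the points are treated as if they were already in a common frame, and the function names are hypothetical.

```python
import math
from itertools import permutations


def rmsd(a, b):
    """Eq. 2: root-mean-square deviation of paired points a_i, b_i."""
    n = len(a)
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / n)


def min_rmsd_over_pairings(a, b):
    """Eq. 1 without the superimposition step: minimum over all pairings."""
    return min(rmsd(a, perm) for perm in permutations(b))


# Made-up coordinates: the candidate holds the query points in shuffled order.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
candidate = [(0.0, 2.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

print(rmsd(query, candidate))                    # pairing as given
print(min_rmsd_over_pairings(query, candidate))  # best pairing
```

A full implementation would superimpose each pairing optimally before evaluating Eq. 2; restricting pairings to atoms of the same type, as described above, shrinks the set of permutations that have to be tried.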

However, geometric similarity alone gives no indication of the statistical and functional significance of hits (Fofanov et al., 2008). The LRMSD can vary greatly, depending on the number of atoms considered for calculation and the type of amino acids compared (Stark et al., 2003). Therefore the LRMSD is not suitable to infer functional similarity and, further, it is not trivial to define a universal significance threshold (Fofanov et al., 2008). For matches incorporating allowed PSEs, and if atoms beyond C_α are considered, there is a limitation for geometric similarity computation: atoms of different residue types cannot be compared directly. Hence a simplification was introduced that considers only C_α atoms for the alignment of different amino acid types, regardless of which atoms were defined for alignment. This exception only applies when PSEs are defined and hence direct comparison of nonidentical residues is required.

To overcome the incomparability of LRMSD values, several methods were presented that apply a statistical model to estimate match significance (Moll et al., 2010; Stark et al., 2003; Fofanov et al., 2008; Xie et al., 2009). In this work a point-weight corrected model originally developed by Fofanov et al. was used to estimate significance and to diminish the influence of runtime-beneficial geometric constraints (e.g., LRMSD cutoff) by maximum likelihood approximation (Fofanov et al., 2008).

2. Methods

2.1. Algorithm

The following elucidations are supported by Figure 2, which illustrates the spatial abstractions used in the approach. Furthermore, Algorithm 1 shows the corresponding pseudo code listing of the search algorithm.

FIG. 2.

Exemplary illustration of the Fit3D algorithm for motif KDEEH. (A) Determination of maximal spatial extent r and mapping f(Q) for motif KDEEH. (B) Iterative search in the target structure and extraction of local environment $E^t = \{e_1, \ldots, e_{13}\}$ around amino acid t and determination of match candidates $C = \{c_1, c_2, \ldots, c_5\}$.

The set $L = \{P, R, N, \ldots\}$ corresponds to the labels (also known as one-letter notation) of all twenty amino acid types, hence |L| = 20, and $A = \{a_1, a_2, \ldots, a_k\}$ is a set of amino acids. Let $f : A \rightarrow \mathcal{P}(L)$ be a mapping of amino acids to allowed labels. In other words, f can describe both the unambiguous amino acid type and PSEs. Further let g : A → L be a mapping solely representing the unambiguous and invariant assignment of amino acids to their labels.
A query motif is a set of amino acids and corresponding atom coordinates, therefore the query set $Q = \{q_1, q_2, \ldots, q_k\} \subset \mathbb{R}^3$ with k points (for simplification single-atom representation of amino acids is assumed here). Consequently the target structure is represented in the same way: as a set $T = \{t_1, t_2, \ldots, t_n\} \subset \mathbb{R}^3$ with n points. The minimal required geometric similarity (LRMSD threshold) between a candidate match and the query set is defined by parameter ε. The local environment around a target amino acid $t \in T$ is denoted as E^t. Furthermore, C is a set of hit candidates.
The maximal spatial extent of the motif, required for local environment extraction, is defined as the maximum of all pairwise distances between all motif C_α atoms: $r \leftarrow \max(\lVert q_i - q_j \rVert : q_i, q_j \in Q, i \ne j)$. Figure 2A illustratively shows the determination of r and, for the case of motif KDEEH as query set, the mapping f(Q).
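The definition of r translates directly into a few lines of code. This is a minimal sketch with made-up coordinates; `spatial_extent` is a hypothetical name, and each unordered pair is visited only once, which does not change the quadratic cost.

```python
import math
from itertools import combinations


def spatial_extent(points):
    """Maximum pairwise distance between the motif points (r in the text)."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


# Made-up C-alpha coordinates (in Å) of a three-residue motif.
motif = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
r = spatial_extent(motif)
print(r)
```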

The search itself is an iterative process over all points in the target set T (Algorithm 1, line 7). If g(t) is contained in the set of allowed amino acid labels f(q) of at least one $q \in Q$, the local environment E^t around t is extracted within the radius r + δ (see Fig. 2B and Algorithm 1, line 5), where δ is the distance tolerance threshold (default: 1 Å).
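The iteration with label check and environment extraction can be sketched as follows. This is an illustration under assumptions, not the published implementation: residues are reduced to a single point, the target residue t itself is excluded from its environment, the label check is read as "allowed at one or more motif positions", and all names and coordinates are hypothetical.

```python
import math


def allowed_anywhere(label, f):
    """True if the label is permitted at one or more motif positions."""
    return any(label in allowed for allowed in f)


def local_environment(t, target, r, delta=1.0):
    """Indices of target residues within r + delta of the residue at index t."""
    center = target[t][1]
    return [i for i, (_, xyz) in enumerate(target)
            if i != t and math.dist(center, xyz) <= r + delta]


# Hypothetical data: f lists the allowed labels per motif position,
# target holds (label, coordinate) pairs, r is the motif's spatial extent.
f = [{"K", "H"}, {"D"}, {"E"}]
target = [("K", (0.0, 0.0, 0.0)), ("D", (4.0, 0.0, 0.0)),
          ("G", (5.0, 0.0, 0.0)), ("E", (9.0, 0.0, 0.0))]
r = 4.5

for t, (label, _) in enumerate(target):
    if allowed_anywhere(label, f):  # skip residues that can never match
        print(t, label, local_environment(t, target, r))
```

A linear scan per target residue is used here for brevity; a spatial index would avoid visiting every residue for each t.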

If pairwise distance filtering of E^t is enabled, only pairs $(e_v, e_w) \in E^t$ are kept for which a corresponding pair $(q_x, q_y) \in Q$ exists (with v ≠ w, x ≠ y), such that the amino acid labels are compatible and the distance is similar: $\lVert e_v - e_w \rVert \le \lVert q_x - q_y \rVert + \delta$ with $e_v, e_w \in E^t$, $q_x, q_y \in Q$, and $g(e_{v,w}) \subset f(q_{x,y})$. Subsequently all combinations of the local environment, $\mathcal{P}(E^t)$, are calculated.
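The pairwise distance filter can be sketched as below. The sketch assumes that a residue of the environment is retained if it takes part in one or more supported pairs, which is one possible reading of the criterion; the dictionary layout and the function names are hypothetical.

```python
import math
from itertools import combinations, permutations


def pair_is_supported(ev, ew, query, delta=1.0):
    """True if some query pair (q_x, q_y), x != y, fits labels and distance."""
    d_env = math.dist(ev["xyz"], ew["xyz"])
    for qx, qy in permutations(query, 2):
        labels_ok = (ev["label"] in qx["allowed"]
                     and ew["label"] in qy["allowed"])
        if labels_ok and d_env <= math.dist(qx["xyz"], qy["xyz"]) + delta:
            return True
    return False


def filter_environment(env, query, delta=1.0):
    """Keep residues that take part in one or more supported pairs."""
    kept = set()
    for v, w in combinations(range(len(env)), 2):
        if pair_is_supported(env[v], env[w], query, delta):
            kept.update((v, w))
    return [env[i] for i in sorted(kept)]


# Made-up two-residue query (K with PSE to H, and D) and a small environment.
query = [{"allowed": {"K", "H"}, "xyz": (0.0, 0.0, 0.0)},
         {"allowed": {"D"}, "xyz": (3.0, 0.0, 0.0)}]
env = [{"label": "K", "xyz": (0.0, 0.0, 0.0)},
       {"label": "D", "xyz": (3.5, 0.0, 0.0)},
       {"label": "D", "xyz": (10.0, 0.0, 0.0)},
       {"label": "G", "xyz": (1.0, 0.0, 0.0)}]

kept = filter_environment(env, query)
print([(e["label"], e["xyz"]) for e in kept])
```

In this example the distant aspartate and the glycine are discarded, so fewer combinations remain for the subsequent alignment step.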

For each combination C where |C| = k and compatibility of amino acid labels is given according to $g(c) \subset f(q), \forall c \in C, \forall q \in Q$ (see yellow spheres in Fig. 2B) the optimal superimposition is determined by permutation. For each permutation the geometric similarity d is calculated; LRMSD(C, Q) determines the least-root-mean-square deviation (LRMSD) (Fofanov et al., 2008) between candidate set C and query set Q (|C| = |Q| = k), measured over all possible candidate-query alignments. Only if $d \le \epsilon$ (geometric similarity is below the LRMSD threshold) is the candidate set considered to be a match. Illustratively, in Figure 2B this is the case for the candidates $C = \{c_1, c_2, \ldots, c_5\}$ shown in red.
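The candidate evaluation can be illustrated with the following sketch. Several assumptions apply: label compatibility is interpreted position by position (residue i of a pairing must be allowed at motif position i), the superimposition step is omitted so that coordinates are compared in a common frame, and `find_matches` as well as the data layout are hypothetical.

```python
import math
from itertools import combinations, permutations


def rmsd(a, b):
    """Root-mean-square deviation of paired points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def find_matches(env, query, epsilon):
    """Return (d, residues) for every k-subset of env with d <= epsilon."""
    k = len(query)
    q_xyz = [q["xyz"] for q in query]
    matches = []
    for combo in combinations(env, k):
        best = None
        for perm in permutations(combo):
            # Only pairings with compatible labels are evaluated.
            if all(c["label"] in q["allowed"] for c, q in zip(perm, query)):
                d = rmsd([c["xyz"] for c in perm], q_xyz)
                if best is None or d < best[0]:
                    best = (d, perm)
        if best is not None and best[0] <= epsilon:
            matches.append(best)
    return matches


# Made-up HDS-like query and environment.
query = [{"allowed": {"H"}, "xyz": (0.0, 0.0, 0.0)},
         {"allowed": {"D"}, "xyz": (3.0, 0.0, 0.0)},
         {"allowed": {"S"}, "xyz": (0.0, 4.0, 0.0)}]
env = [{"label": "S", "xyz": (0.1, 4.0, 0.0)},
       {"label": "H", "xyz": (0.0, 0.0, 0.0)},
       {"label": "D", "xyz": (3.0, 0.0, 0.2)},
       {"label": "A", "xyz": (1.0, 1.0, 1.0)}]

matches = find_matches(env, query, epsilon=0.5)
for d, residues in matches:
    print(round(d, 3), [c["label"] for c in residues])
```

Skipping pairings with incompatible labels before computing d mirrors the observation that far fewer than k! alignments are evaluated in practice.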

2.2. Time complexity

The motif spatial extent r can be found in time $\mathcal{O}(k^2)$ by computing k² distances for all pairs of C_α atoms in Q. Furthermore the hit candidates $\mathcal{P}(E^t)$ can be calculated within $\mathcal{O}\left(\binom{l}{k-1}\right)$ time with |E^t| = l. This is due to the fact that the current target amino acid is already part of each candidate and hence only k−1 amino acids of E^t have to be considered for combination. To find the optimal superimposition, k! calculations have to be performed in the worst case. In fact only amino acids with compatible labels are aligned, which reduces the number of alignment steps significantly in practice. The complexity of the algorithm depends on local filtering and is therefore stated separately for searches with and without it. In general the application of filtering speeds up runtime substantially, especially for motifs with large spatial extent r.
The worst-case time complexity of the algorithm without pairwise distance filtering of the local environment is $\mathcal{O}\left(k^2 + n \binom{l}{k-1} k! \, \Theta(k)\right)$, where Θ(k) is the time required to evaluate the geometric constraints of a set of k amino acids. If filtering is enabled, it takes an additional $\mathcal{O}(l^2)$ time to compute the pairwise distances of the environment, but afterward fewer combinations have to be computed, because the Φ(l) amino acids that do not fulfill the pairwise distance constraints are discarded: $\mathcal{O}\left(k^2 + n \left(l^2 + \binom{l - \Phi(l)}{k-1}\right) k! \, \Theta(k)\right)$. This carries considerable weight if the query motif has a large spatial extent and the environment is consequently big.
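The first two steps of this analysis can be illustrated with a short sketch. It assumes that the spatial extent r is the largest pairwise C_α distance within Q and that the environment E^t contains all residues whose C_α atom lies within a given radius of the target residue t. Both readings, as well as the function names, are assumptions made for illustration; the pairwise distance filter is not included.

```python
import math
from itertools import combinations

def spatial_extent(query_ca):
    """Largest pairwise distance among the query C-alpha coordinates (O(k^2) pairs)."""
    return max((math.dist(p, q) for p, q in combinations(query_ca, 2)), default=0.0)

def environment(target_index, structure_ca, radius):
    """Indices of all other residues whose C-alpha lies within `radius` of the target."""
    t = structure_ca[target_index]
    return [i for i, p in enumerate(structure_ca)
            if i != target_index and math.dist(t, p) <= radius]

def candidates(target_index, structure_ca, radius, k):
    """Yield candidate sets of size k: the fixed target plus k-1 environment residues."""
    env = environment(target_index, structure_ca, radius)
    for rest in combinations(env, k - 1):
        yield (target_index,) + rest

# Toy example with made-up coordinates.
query_ca = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
structure_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
r = spatial_extent(query_ca)
print(r, list(candidates(0, structure_ca, r, k=3)))
```

Fixing the target residue is what reduces the enumeration from k to k−1 chosen residues per candidate set.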

However, the complexity of the Fit3D motif matching algorithm strongly depends on the density and size of the local environment. For target amino acids t buried in the protein, l tends to be larger than for target amino acids at the protein surface. In the worst case l could be as large as n, which practically never happens because the definition of a small and locally occurring structural motif implies that l ≪ n.
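To give a feeling for how strongly the bound depends on l, the following sketch evaluates the per-target term $\binom{l - \Phi(l)}{k-1} k!$ for a few environment sizes. The function name and the example values of l, k, and Φ(l) are chosen purely for illustration and are not taken from the benchmark.

```python
from math import comb, factorial

def worst_case_alignments(l, k, discarded=0):
    """Upper bound on alignments per target residue: C(l - discarded, k - 1) * k!."""
    return comb(l - discarded, k - 1) * factorial(k)

for l in (10, 20, 40):
    unfiltered = worst_case_alignments(l, k=5)
    # Hypothetical filter that discards half of the environment.
    filtered = worst_case_alignments(l, k=5, discarded=l // 2)
    print(l, unfiltered, filtered)
```

For k = 5, doubling the environment from l = 10 to l = 20 raises the bound from 25,200 to 581,400 alignments, which is why a densely packed environment dominates the runtime.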

2.3. Algorithm evaluation

For validation of the presented algorithm a two-step gold standard was defined to determine coverage of a foreground and a background benchmark dataset. The validation of substructure matching algorithms based on ES structures and CSA-derived motifs has been applied successfully several times in the literature (Moll et al., 2010; He et al., 2013; Fofanov et al., 2008). Based on the assumption that statistically significant matches can be considered true positives, validation was conducted on the basis of p-value estimation of hits with a well-established cutoff of 0.001 (Moll et al., 2010; Moll and Kavraki, 2008). The limiting geometric threshold ε that defines the maximal allowed LRMSD was set to 4.0 Å to cover a good proportion of the right tail of the LRMSD distribution. In all, 31,133 structures of the nonredundant PDB (nrPDB) chain set as of March 14, 2014, with a BLAST p-value of 10⁻⁸⁰ (Altschul et al., 1990) were considered as the background dataset. Furthermore, the set of 41 structures by Meng et al. (2004) was used as foreground for validation based on the ES. Additionally, ES-based evaluation was conducted for different atom representations of motif KDEEH to show their crucial influence on sensitivity and specificity: all non-hydrogen atoms, backbone atoms, sidechain atoms, and C_α atoms. All CSA-derived motifs were considered to be representatives and used for evaluation if each of the following conditions was fulfilled (adapted from Moll et al., 2010):

• the motif consists of 3–5 residues,

• the parent structure is fully classified by the Enzyme Commission (EC) number,

• the corresponding EC class has more than 50 structures,

• and the motif has a maximal spatial extent of 20 Å.

Employing these filter criteria for selection yielded a total number of 157 CSA-derived motifs where p-values could be computed, spanning 51 different EC classes and all six top-level classifications.
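The selection criteria listed above amount to a simple predicate. The sketch below uses a hypothetical `Motif` record whose field names are invented for illustration; how the values are obtained from the CSA and the PDB is outside its scope. "Fully classified" is interpreted here as an EC number with four numeric levels, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Motif:
    residue_count: int      # number of residues in the motif
    ec_number: str          # EC number of the parent structure, e.g. "3.2.1.1"
    ec_class_size: int      # number of structures sharing this EC number
    spatial_extent: float   # maximal spatial extent in angstroms

def is_fully_classified(ec_number):
    """True if all four EC levels are given as numbers (no '-' placeholders)."""
    parts = ec_number.split(".")
    return len(parts) == 4 and all(p.isdigit() for p in parts)

def is_representative(m):
    """Apply the four selection criteria to a single motif."""
    return (3 <= m.residue_count <= 5
            and is_fully_classified(m.ec_number)
            and m.ec_class_size > 50
            and m.spatial_extent <= 20.0)

# Made-up example values.
print(is_representative(Motif(3, "3.2.1.1", 120, 12.5)))
```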

2.4. Implementation

The open source and platform-independent Java implementation of the algorithm was designed to be simple to use and flexible in application, which is why the software is deployed as a command line tool. Extensive use of the BioJava framework (Prlic et al., 2012) made it unnecessary to implement PDB file parsers and structure alignment methods de novo. The Fit3D algorithm was implemented on top of the BioJava data structures. Hence, special attention could be paid to runtime optimization.

3. Results and Discussion

3.1. Algorithm evaluation

The results of algorithm validation with motif KDEEH are shown in Figures 3 and 4. All-atom as well as sidechain representation resulted in perfect coverage of the foreground dataset (sensitivity of 100%), which dropped significantly if only backbone atoms were used as motif representation. Fewer than half of the foreground structures were identified if only C_α atoms were used for matching. A similar behavior was observed for specificity, which is even slightly better for all-atom representation (99.68%) than for sidechain-based motif abstraction (98.11%). For backbone-only representation of KDEEH a significant loss of sensitivity was observed (85.37%), and the loss was even higher if only C_α atoms were considered. In general, all-atom motif mapping performed best, followed closely by sidechain representation. In contrast, C_α representation performed worst for all measurements with extraordinarily low sensitivity and specificity.

FIG. 3.

Enolase superfamily algorithm validation results. The sensitivity or true positive rate (TPR) and specificity or true negative rate (TNR) as well as their inverse measurements false negative rate (FNR) and false positive rate (FPR) are shown in dependence of atom representation for the enolase superfamily motif.

FIG. 4.

Influence of motif representing atoms. (A) The matches found for the enolase superfamily motif KDEEH with least-root-mean-square deviation (LRMSD) values ≤1.0 Å. The atoms of the query motif are depicted as transparent spheres with the radius r=1.0 Å. Position-specific exchanges (PSEs) occurred during matching. (B) The logarithmic scaled frequency plot of LRMSD values of matches in dependence of motif representing atoms. The significance thresholds of p-value 0.001 for each variant are indicated by dashed lines.

By comparing the false positives identified according to the dataset defined by Meng et al. (2004) with the up-to-date set of ES structures derived from the Structure Function Linkage Database (SFLD) (Akiva et al., 2014), it is even possible to reassess some false positives as true positives. The discrepancy in size between the two datasets is remarkable: 41 structures for Meng et al. (2004) compared to 351 structures in SFLD as of June 10, 2014. For example, the structures PDB:4GFI and PDB:3RIT, which were matched as false positives, have now been identified as belonging to the ES, as reported by the SFLD. In contrast to 99 false positives with respect to the Meng et al. (2004) dataset, only 40 remain if the SFLD dataset is considered as foreground, increasing the specificity of the method slightly from 99.68% to 99.87%. Additionally, the structures PDB:2QGY, PDB:2PPG, PDB:2OQH, PDB:3CYJ, PDB:1WUE, PDB:2OZ8, and PDB:2POZ have only putatively defined functions but may be related to the ES (Moll et al., 2010). These results are a first indication of the ability of the proposed search algorithm to identify remotely homologous structures.
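The reported rates follow directly from the confusion counts. The sketch below recomputes them under the assumption that the number of negatives is roughly the size of the background set (31,133); the exact count used in the evaluation is not stated here, so this is a consistency check rather than a reproduction. The count of 35 recovered foreground structures is likewise inferred from the reported backbone-only sensitivity of 85.37% and is not taken from the source.

```python
def sensitivity(tp, fn):
    """True positive rate: share of foreground structures that were matched."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of background structures that were not matched."""
    return tn / (tn + fp)

background = 31_133  # assumed number of negatives (size of the background set)

# 99 false positives with the Meng et al. (2004) foreground, 40 with the SFLD foreground.
print(f"{specificity(background - 99, 99):.2%}")
print(f"{specificity(background - 40, 40):.2%}")

# 35 of 41 foreground structures (inferred count) for backbone-only matching.
print(f"{sensitivity(35, 6):.2%}")
```

Under these assumptions the three values round to 99.68%, 99.87%, and 85.37%, in line with the figures quoted in the text.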

A frequency plot showing the LRMSD distribution in dependence of motif representation is given in Figure 4B. The corresponding alignments of sets of KDEEH matches with similarities to the query motif ≤1.0 Å illustrate different motif representations in Figure 4A. As expected, the LRMSD distribution for all representation types has a smaller peak in the left region (see Fig. 4B) and diverges at the right tail, in accordance with observations of previous studies (Fofanov et al., 2008; Moll et al., 2010; Stark et al., 2003). Furthermore, the corresponding LRMSD at the p-value threshold of 0.001 is shifted toward a lower value if reduced and insufficient atom sets are considered: the nonspecificity is directly reflected in the LRMSD distribution and thus also in the statistical significance. This is also perceptible on closer inspection of Figure 4A, where the query motif KDEEH is shown as a sphere model with a radius of 1.0 Å for each sphere. The sphere model is intended to illustrate the area of atom deviation from match to query, which is enormous for C_α-only representation, although all matches are still below the significance threshold. In contrast, match deviations are significantly lower for the other representation types, with no obvious differences at first glance. Note that outlying amino acids occur due to allowed residue substitutions (PSEs) of KDEEH and contributed to the LRMSD calculation only with their C_α atoms.

It was clearly shown that atom representation of structural motifs plays a key role when it comes to matching in target structures. It was expected that multi-atom motif representation performs best, as already suggested for the ES (Meng et al., 2004). The unavailability of arbitrary atom selection is a known deficiency of existing approaches, and with our program it could be addressed appropriately. For the ES, which is known to have highly variable C_α atom positions, all-atom representation performed best, although the performance for sidechain motif abstraction was close. Both variants reached a perfect sensitivity and a nearly ideal specificity, albeit sidechain representation led to a slight loss of specificity compared to all-atom representation. Further on, backbone-only matching lags behind with a still acceptable but not desirable sensitivity, while C_α-only abstraction showed its strong inaccuracy.

In conclusion, it was confirmed that multi-atom representation is the method of choice and a worthwhile key feature of the Fit3D software. However, this statement cannot be generalized to every motif and has to be reconsidered for each specific search task. For instance, if variable regions of the motif are known or positional variance is even favored, it may be reasonable to constrain atom representation.

The results of the algorithm validation using CSA-derived motifs are shown in Figure 5. The CSA motifs were represented by all non-hydrogen atoms, but the results diverged considerably. While the sensitivity ranged from 0% to 95.60%, specificity was consistently high, which means that no excessive matches in the background dataset occurred. If we consider for example the best performing motifs derived from PDB:1STC (cAMP-dependent protein kinase, EC 2.7.11.11), PDB:1BSJ (peptide deformylase, EC 3.5.1.88), PDB:1BBS (renin, EC 3.4.23.15) and PDB:1AMY (α-amylase, EC 3.2.1.1), it is conspicuous that three of them span the same EC top-level class (hydrolases, EC 3). More detailed investigation reveals that the CSA entries derived from PDB:1BBS and PDB:1AMY are both catalytic triads consisting of two aspartate residues and one glutamate (PDB:1AMY) or serine (PDB:1BBS). According to SFLD, α-amylase (PDB:1AMY) is known to catalyze hydrolysis of internal α-glucosidic linkages in starch and other related oligo- and polysaccharides (Furnham et al., 2014). Renin (PDB:1BBS) instead catalyzes, according to the ENZYME database (Bairoch, 2000), the cleavage of Leu-Xaa (Xaa can be any amino acid) bonds in angiotensinogen. Hence, the catalytic triads seem to be very descriptive for their EC class and therefore highly conserved. Other CSA-based motifs, like the aspartate, lysine and tyrosine triad from fructose-bisphosphate aldolase (PDB:1OK4, EC 4.1.2.13), were not able to detect even one hit in the respective foreground dataset. We suppose that the observed divergence in the performance of CSA-derived motifs is induced by unknown PSEs. This issue could be addressed if multiple sequence alignments of proteins in the motif-associated EC class are performed, as already suggested in the literature (Moll et al., 2010; He et al., 2013). On the other hand, CSA-derived substructures are not per se structural motifs in the common sense.
The CSA developers did not intend to define motifs descriptive for EC classes, nor was the CSA designed to cover all EC classes (Torrance et al., 2005). Instead, the goal of allowing functional annotation of disparate proteins led to CSA development (Torrance et al., 2005). Torrance et al. considered the foreground for each motif to be the “CSA family” of PSI-BLAST (Altschul et al., 1997) identified relatives, rather than the full fourth-level EC classification. Furthermore, a single EC class is likely to have more than one representative CSA entry, and these entries can differ greatly in their descriptive abilities. This is the case for the entries PDB:1BP2 and PDB:1CJY, both catalytic sites of phospholipases (EC 3.1.1.4). While entry PDB:1BP2 is a catalytic triad of histidine, glycine and aspartate, the active site of PDB:1CJY consists of four amino acids: serine, two glycine residues and an aspartate. Although both catalyze the same reaction, the cleavage of ester bonds, they differ greatly in their ability to describe their EC class, and consequently the applied validation approach fails. The triad derived from PDB:1BP2 reached a sensitivity of 71.48% and a specificity of 99.37%, while CSA entry PDB:1CJY covered only 0.39% of the foreground and was 99.55% specific. These hurdles concerning CSA-derived motifs have already been the subject of research, and some suggestions to overcome them were presented (Bryant et al., 2010; Chen et al., 2007). One idea is the combination of multiple motifs describing one EC class to increase performance. Bryant and colleagues introduced a method in which multiple and potentially overlapping motifs can be combined into a single function prediction test (Bryant et al., 2010). Another approach encompasses the further optimization of motifs by geometric sieving (Chen et al., 2007), a preliminary variant of which was already implemented in LabelHash (Moll et al., 2011).
The application of these methods could help to increase the representative level of CSA-derived structural motifs and should be considered in future work. All in all, the validation results for the ES-descriptive motif KDEEH and the CSA library showed that clearly and meticulously defined motifs that were derived from literature-supported superfamily structures are more descriptive for EC classes, and hence these motifs can be considered to be of high quality. This corresponds to observations reported in previous studies (Moll et al., 2010; He et al., 2013).

FIG. 5.

Sensitivity and specificity for Catalytic Site Atlas–derived motifs. Sensitivity or true positive rate (TPR) and specificity or true negative rate (TNR) of the Fit3D search algorithm. Each data point represents a single Catalytic Site Atlas motif; considered matches have p-values ≤0.001. Ideal validation results would be located in the upper right corner of the scatterplot.

3.2. Software

Fit3D was packaged as a single software tool programmed in Java in the form of an executable JAR file. Furthermore, it was purposefully implemented as a standalone command line tool. This form of application allows flexibility, automation, and above all easy integration into problem-specific workflows, especially because of its platform independence thanks to Java. Even extraordinarily large datasets, for instance the entire PDB consisting of over 100,000 structures, can be searched conveniently and in an automated way. Primary aspects during development of the Fit3D software were usability and the capability of parameter-free motif matching. This could be realized by the underlying search algorithm, for which the only parameter that may influence results critically is the distance tolerance δ. However, δ is set to a reasonable default value of 1.0 Å, so it does not have to be adjusted in most cases. To run Fit3D the user has to define only two variables: a query motif structure in PDB format and a search target or a list of targets separated by line breaks. A minimal command line call to start a search for motif KDEEH in a set of target structures can look as simple as the following command:

    java -jar Fit3D.jar -m motif_KDEEH.pdb -l nrpdb.032614

where -m defines the motif structure in PDB format and -l a list of target structures in plain text format separated by line breaks.
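For use inside a scripted workflow, the call can be assembled programmatically. The following sketch formats a target list and builds the argument vector for the command shown above. Only the `-m` and `-l` options mentioned in the text are used; the helper names, the file names, and the example identifiers are placeholders, and the format of each list entry (here one PDB identifier per line) is an assumption.

```python
def format_target_list(targets):
    """Join target entries into plain text with one entry per line."""
    return "\n".join(targets) + "\n"

def build_fit3d_command(motif_path, target_list_path, jar_path="Fit3D.jar"):
    """Argument vector for a minimal Fit3D search (-m motif, -l target list)."""
    return ["java", "-jar", str(jar_path), "-m", str(motif_path), "-l", str(target_list_path)]

# Placeholder inputs; the text would be written to targets.txt before the search.
list_text = format_target_list(["4GFI", "3RIT"])
command = build_fit3d_command("motif_KDEEH.pdb", "targets.txt")
print(" ".join(command))
# With Java and the JAR available, subprocess.run(command, check=True)
# from the standard library would start the search.
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file paths contain spaces.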

The Fit3D software is freely available and extensively documented online. The Supplementary Material (available at www.liebertpub.com/cmb) contains all motif structures mentioned in Table 1 in PDB format. Furthermore, the CSA-derived motifs used for validation are included as well. Additionally, all foreground and background datasets that were used for the validation of the algorithm are included in plain text format. The latest version of the Fit3D implementation is also provided.

4. Conclusions

The challenges were met successfully: an enhanced algorithm for structural motif matching was developed and validated using gold standard methods. By utilizing multi-atom representation of amino acids it could be shown that the sensitivity of our algorithm was superior and nevertheless accompanied by high specificity. Furthermore, algorithmic robustness is guaranteed due to the small number of internal parameters. The novel algorithm was implemented in an easy-to-use command line software tool called Fit3D, which was released under the terms of public licensing and is freely available and ready to be optimized, adapted, or expanded by researchers. Fit3D provides an attractive and improved alternative to existing web services (Nadzirin et al., 2012; Debret et al., 2009; Moll et al., 2011) or standalone tools (Moll et al., 2010). Its application is a double-sided approach: on the one hand it is possible to search for known structural motifs in a large set of target proteins, for example, in the form of a CSA-derived library like the one by Nilmeier et al. (2013). On the other hand, one can screen structures for newly discovered motifs to assess biological function. Due to the rapid growth of automated structure determination methods through structural genomics efforts, protein structures are often solved prior to biochemical and functional characterization (Duarte et al., 2012). Hence, our method can be envisioned to be very helpful for researchers dealing with protein crystallography and function determination. The field of drug design and research is also addressed by this approach; the mechanism of drug effect often lies in the inhibition of protein active sites, which are in turn describable through structural motifs. According to the sequence-to-structure-to-function paradigm it can be stated that three-dimensional information is important for the protein chemistry and therefore active sites are conserved (Fetrow and Skolnick, 1998).
Additionally, it is evident that ligand binding site resemblance is possible even when global sequence or structure similarity cannot be detected (Xie et al., 2009). It is of high interest for pharmacists to investigate active-site mechanisms and structures in a high-throughput manner. The utilization of search methods for structural patterns can help researchers to discover off-target binding of specific drugs (Kirshner et al., 2013; Xie et al., 2009). Furthermore, it is conceivable that the approach can be combined with methods for automatic deduction of structural motifs based on functional subfamilies within diverse superfamilies (Redfern et al., 2009). The Fit3D implementation can be seen as a cutting-edge tool in the field of bioinformatic substructure research and computational biology and will be extended and applied in further research.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Acknowledgments

We would like to thank the Free State of Saxony and the Saxon Ministry of Science and Fine Arts for funding this work.

References

Akiva, Brown, Almonacid D.E., et al. 2014. The structure-function linkage database. Nucleic Acids Res., 42, D521–D530.

Altschul S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Altschul S.F., Madden T.L., Schaffer A.A., et al. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Aravind, Galperin M.Y., and Koonin E.V. 1998. The catalytic domain of the P-type ATPase has the haloacid dehalogenase fold. Trends Biochem. Sci., 23, 127–129.

Babbitt P.C., Hasson M.S., Wedekind J.E., et al. 1996. The enolase superfamily: a general strategy for enzyme-catalyzed abstraction of the alpha-protons of carboxylic acids. Biochemistry, 35, 16489–16501.

Bairoch 2000. The ENZYME database in 2000. Nucleic Acids Res., 28, 304–305.

Baker A.S., Ciocci M.J., Metcalf W.W., et al. 1998. Insights into the mechanism of catalysis by the P-C bond-cleaving enzyme phosphonoacetaldehyde hydrolase derived from gene sequence analysis and mutagenesis. Biochemistry, 37, 9305–9315.

Bryant D.H., Moll, Chen B.Y., et al. 2010. Analysis of substructural variation in families of enzymatic proteins with applications to protein function prediction. BMC Bioinformatics, 11, 242.

Chen B.Y., Fofanov V.Y., Bryant D.H., et al. 2007. The MASH pipeline for protein function prediction and an algorithm for the geometric refinement of 3D motifs. J. Comput. Biol., 14, 791–816.

Darnell R.B. 2006. Developing global insight into RNA regulation. Cold Spring Harb. Symp. Quant. Biol., 71, 321–327.

Debret, Martel, and Cuniasse 2009. RASMOT-3D PRO: a 3D motif search webserver. Nucleic Acids Res., 37, W459–464.

Duarte J.M., Srebniak, Scharer M.A., et al. 2012. Protein interface classification by evolutionary analysis. BMC Bioinformatics, 13, 334.

Ebrahimi K.H., Hagedoorn P.L., and Hagen W.R. 2012. A synthetic peptide with the putative iron binding motif of amyloid precursor protein (APP) does not catalytically oxidize iron. PLoS ONE, 7, e40287.

Ekici O.D., Paetzel, and Dalbey R.E. 2008. Unconventional serine proteases: variations on the catalytic Ser/His/Asp triad configuration. Protein Sci., 17, 2023–2037.

Fetrow J.S., and Skolnick 1998. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol., 281, 949–968.

Fischer, Wolfson, Lin S.L., et al. 1994. Three-dimensional, sequence order-independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding. Protein Sci., 3, 769–778.

Fofanov, Chen, Bryant, et al. 2008. A statistical model to correct systematic bias introduced by algorithmic thresholds in protein structural comparison algorithms. In Bioinformatics and Biomedicine Workshops, 2008. IEEE International Conference on, pp. 1–8.

Furnham, Holliday G.L., de Beer T.A., et al. 2014. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res., 42, D485–D489.

Golub G.H., and Reinsch 1970. Singular value decomposition and least squares solutions. Numerische Mathematik, 14, 403–420.

He, Vandin, Pandurangan, et al. 2013. Ballast: a ball-based algorithm for structural motifs. J. Comput. Biol., 20, 137–151.

Hedstrom 2002. Serine protease mechanism and specificity. Chem. Rev., 102, 4501–4524.

Kirshner D.A., Nilmeier J.P., and Lightstone F.C. 2013. Catalytic site identification: a web server to identify catalytic site structural matches throughout PDB. Nucleic Acids Res., 41, W256–265.

Kleywegt G.J. 1999. Recognition of spatial motifs in protein structures. J. Mol. Biol., 285, 1887–1897.

Konc, and Janezic 2010. ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment. Bioinformatics, 26, 1160–1168.

Konc, and Janezic 2012. ProBiS-2012: web server and web services for detection of structurally similar binding sites in proteins. Nucleic Acids Res., 40, W214–221.

Kong G.K., Adams J.J., Harris H.H., et al. 2007. Structural studies of the Alzheimer's amyloid precursor protein copper-binding domain reveal how it binds copper ions. J. Mol. Biol., 367, 148–161.

Koonin E.V., and Tatusov R.L. 1994. Computer analysis of bacterial haloacid dehalogenases defines a large superfamily of hydrolases with diverse specificity. Application of an iterative approach to database search. J. Mol. Biol., 244, 125–132.

Koutsotoli, and Tzakos A.G. 2012. Host-pathogen crosstalking: the mastery of taking the helm of the host. Structure, 20, 1613–1615.

Matthews B.W., Sigler P.B., Henderson, et al. 1967. Three-dimensional structure of tosyl-alpha-chymotrypsin. Nature, 214, 652–656.

Meng E.C., Polacco B.J., and Babbitt P.C. 2004. Superfamily active site templates. Proteins, 55, 962–976.

Miller, McLachlan A.D., and Klug 1985. Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes. EMBO J., 4, 1609–1614.

Moll, Bryant D.H., and Kavraki L.E. 2010. The LabelHash algorithm for substructure matching. BMC Bioinformatics, 11, 555.

Moll, Bryant D.H., and Kavraki L.E. 2011. The LabelHash server and tools for substructure-based functional annotation. Bioinformatics, 27, 2161–2162.

Moll, and Kavraki L.E. 2008. Matching of structural motifs using hashing on residue labels and geometric filtering for protein function prediction. Comput. Syst. Bioinformatics Conf., 7, 157–168.

Nadzirin, Gardiner E.J., Willett, et al. 2012. SPRITE and ASSAM: web servers for side chain 3D-motif searching in protein structures. Nucleic Acids Res., 40, W380–386.

Nilmeier J.P., Kirshner D.A., Wong S.E., et al. 2013. Rapid catalytic template searching as an enzyme function prediction procedure. PLoS ONE, 8, e62535.

Ogawa, Yamada, Tsujioka, et al. 2000. Localization of a novel type trypsin-like serine protease, neurosin, in brain tissues of Alzheimer's disease and Parkinson's disease. Psychiatry Clin. Neurosci., 54, 419–426.

Plevin M.J., Bryce D.L., and Boisbouvier 2010. Direct detection of CH/pi interactions in proteins. Nat. Chem., 2, 466–471.

Prlic, Yates, Bliven S.E., et al. 2012. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28, 2693–2695.

Redfern O.C., Dessailly B.H., Dallman T.J., et al. 2009. FLORA: a novel method to predict protein function from structure in diverse superfamilies. PLoS Comput. Biol., 5, e1000485.

Sanderson P.E. 1999. Small, noncovalent serine protease inhibitors. Med. Res. Rev., 19, 179–197.

Sharony, Yu P.J., Park, et al. 2010. Protein targets of inflammatory serine proteases and cardiovascular disease. J. Inflamm. (Lond.), 7, 45.

Stark, Sunyaev, and Russell R.B. 2003. A model for statistical significance of local similarities in structure. J. Mol. Biol., 326, 1307–1316.

Titanji B.K., Aasa-Chapman, Pillay, et al. 2013. Protease inhibitors effectively block cell-to-cell spread of HIV-1 between T cells. Retrovirology, 10, 161.

Torrance J.W., Bartlett G.J., Porter C.T., et al. 2005. Using a library of structural templates to recognise catalytic sites and explore their evolution in homologous families. J. Mol. Biol., 347, 565–581.

Tsukada, and Blow D.M. 1985. Structure of alpha-chymotrypsin refined at 1.68 A resolution. J. Mol. Biol., 184, 703–711.

Witt, Luck, Hennies H.C., et al. 2000. Mutations in the gene encoding the serine protease inhibitor, Kazal type 1 are associated with chronic pancreatitis. Nat. Genet., 25, 213–216.

Xie, Xie, and Bourne P.E. 2009. A unified statistical model to support local sequence order independent similarity searching for ligand-binding sites and its application to genome-based drug discovery. Bioinformatics, 25, i305–i312.

Xue, Davis A.V., Balakrishnan, et al. 2008. Cu(I) recognition via cation-pi and methionine interactions in CusF. Nat. Chem. Biol., 4, 107–109.
