Combined Topological Data Analysis and Geometric Deep Learning Reveal Niches by the Quantification of Protein Binding Pockets

Abstract

Protein pockets are essential for many proteins to carry out their functions. Locating and measuring protein pockets, as well as studying the anatomy of pockets, helps us further understand protein function. Most research studies focus on learning either local or global information from protein structures. However, there is a lack of studies that leverage the power of integrating both local and global representations of these structures. In this work, we combine topological data analysis (TDA) and geometric deep learning (GDL) to analyze the putative protein pockets of enzymes. TDA captures blueprints of the global topological invariant of protein pockets, whereas GDL decomposes the fingerprints into building blocks of these pockets. This integration of local and global views provides a comprehensive and complementary understanding of the protein structural motifs (niches for short) within protein pockets. We also analyze the distribution of the building blocks making up the pocket and profile the predictive power of coupling local and global representations for the task of discriminating between enzymes and nonenzymes, as well as predicting the enzyme class. We demonstrate that our representation learning framework for macromolecules is particularly useful when the structure is known, and the scenarios heavily rely on local and global information.

1. INTRODUCTION

Proteins are biological macromolecules responsible for carrying out many of the essential functions of cells. Understanding protein function remains a fundamental aim to understand life at the molecular level. Although the availability of protein sequence and structure information has grown exponentially, the experimental determination of the function of a protein is still limited by time and cost. To address this limitation, a variety of computational methods have been developed to predict protein function (Zhou et al., 2019). The key to these computational approaches is to infer protein function by finding proteins with similar sequence, structure, or other characteristics. For example, the shape and properties of the protein surface determine what interactions are possible with ligands and other macromolecules (Coleman and Sharp, 2010).

Among the multiple elements of protein structures, voids, pockets, and channels are important features of the protein surface; thus, they play a crucial role in many protein functions (Liang et al., 1998). For example, locating and measuring protein pockets and cavities has been shown to be useful for computer-aided drug design (Stank et al., 2016). Furthermore, studying the anatomy of protein pockets and cavities with geometry and topology helps us understand the shape and topological niches, referred to as protein structural motifs (Tian and Liang, 2018).

Toward this goal, a natural approach is to use topological data analysis (TDA) to capture the global information within protein structure data (Chazal and Michel, 2021). For example, persistent homology (PH), a main workhorse of TDA, can represent macromolecules as persistence barcodes, diagrams, or landscapes that serve as input to machine learning models (Bubenik and Dłotko, 2017). PH can also reduce structural complexity while preserving topologically invariant properties. The encoded features, including connected components, loops, voids, and higher-order properties along with their persistence, are descriptors of global information in Euclidean space ℝⁿ. Traditional TDA-based methods cannot capture local structural information, since topology studies properties of spaces that are invariant under continuous deformation (Fasy and Wang, 2016). As a result, PH only captures changes in topological invariants, together with their persistence, and is not sensitive to homotopic shape evolution. Fortunately, recent efforts have aimed at addressing the shortcomings of TDA in the study of protein morphology. For example, Cang et al. (2015) proposed a topological approach for protein classification, while Kovacev-Nikolic et al. (2016) profiled the persistence landscapes of protein structures. Both studies demonstrated that TDA- and PH-based methods can be effectively used to analyze protein structures. In some cases, these studies have also shown that TDA-based approaches can even identify protein superdomains (Cang et al., 2015) or the patterns of maltose-binding protein (Kovacev-Nikolic et al., 2016). However, proteins with similar pocket shapes can have distinct functions. In addition, substitutions of key residues can substantially alter function or physicochemical properties. Furthermore, in the case of enzymes, enzyme-substrate and intra-enzyme interactions should be included and modeled, as they may carry critical information for binding affinity. Therefore, there is a need for methods that further incorporate biochemistry and biophysics information as a means to better understand enzyme structure–function relationships (Bourlieu et al., 2020).

To this end of integration of biochemical and biophysical features toward uncovering shapes and heterogeneous properties of structural data, an alternative approach involves the use of computational geometry such as geometric deep learning (GDL). GDL is an umbrella term for techniques aiming at generalizing deep neural models to non-Euclidean domains such as graphs, hypergraphs, manifolds, and so on. Compared with TDA, GDL is locally aware and gives us a zoomed-in view of the data while mapping it to the representation space coupled with domain knowledge (Atz et al., 2021). GDL can handle information beyond distance, mesh, shape descriptors, or curvature descriptors, including extended geometry-associated features such as node or edge labels, as well as different types of interactions, to name a few. In the context of proteomics, amino acid residues are the building blocks of proteins; thus, they are typically represented as nodes, while hyperedges are used to model the pairwise and higher-order relationships between interacting residues (Maruyama et al., 2001; Freudenberg et al., 2002; Ye et al., 2007). In the resulting hypergraph of an underlying protein (pocket) structure, both pairwise and higher-order residue interactions can naturally encode the conservation and “hyper-conservation” of the interacting residues (Ye et al., 2007). This “hyper-conservation” can only be captured by hypergraph-based approaches. Moreover, pockets with similar topology but different microenvironments caused by residue composition (e.g., charged, pH, hydrophilic, and hydrophobic) are likely to exhibit distinct binding affinity profiles. Therefore, GDL is a suitable approach for exploring these properties and revealing significant residue interactions in protein families derived from structural and functional constraints. Finally, GDL is also able to make complementary contributions to the structure profiles obtained by TDA.

Inspired by recent efforts showing that global structural topology and local geometry refinement can be mutually beneficial for protein pocket mining (Roy and Zhang, 2012; Swenson et al., 2020), we combine TDA and GDL to analyze putative protein pockets of enzymes. For a ligand-binding pocket, the initial motivation for PH was to characterize the shape of protein–ligand complexes: capturing the appearance and disappearance of topological features quantifies shapes and sizes. Recent efforts in GDL have also been made to improve the molecular descriptors of protein–ligand complexes by mining distributive patterns (Saha et al., 2019). Therefore, in this work, we study the topological invariants of the pockets using PH, as well as their distributive information using fully labeled hypergraphlets (Lugo-Martinez et al., 2021), and feed both to neural network learners to find niches (Huang and Yang, 2021; Carrière et al., 2020).

1.1. Contributions

Given the success of TDA and GDL within the proteomics space, we hypothesize that the combination of TDA and GDL will reveal niches of protein binding pockets by leveraging the quantitative power of top-down and bottom-up representations. In particular, the contributions of this work are listed as follows: (1) we show that GDL successfully decomposes pocket blueprints into quantifiable “hyper-conservation” features, whereas TDA captures the global topological invariant of pockets in terms of structure. The combination of both views gives us a comprehensive and complementary understanding of the niches within the space of protein pockets, (2) we analyze and evaluate the efficacy of integrating local and global representations for discriminating between enzymes and nonenzymes, as well as predicting enzyme classes. Furthermore, we show that these predictions are supported by enzymology, and (3) we construct a novel representation learning framework for proteins. This framework is particularly useful when the structural information is known, and the downstream task is heavily based on local and global information.

1.2. Related work

1.2.1. Multiparameter persistent homology

Multiparameter persistent homology (MPH) is an extension of the PH of a single filtered space. As an active area of TDA, MPH can capture the topological invariants of interest by considering a multifiltered space. MPH computes (n-parameter) persistence modules (algebraic invariants of data) by applying homology with field coefficients to a multifiltration (Botnan and Lesnick, 2022). MPH lets us interpret and compare data across different types and scales simultaneously. For example, MPH has been used to study immune cell distributions with differing oxygenation levels (Vipond et al., 2021). In this work, we apply MPH to protein pockets across different levels of pocket confidence.

1.2.2. Hypergraph kernels and hypergraph neural networks

Hypergraphs, a generalization of graphs, provide a flexible and accurate model to encode higher-order relationships inherently found in many disciplines. In particular, hypergraphlets, small hypergraphs rooted at a vertex of interest, have been successfully used to probe large hypergraphs, as hypergraphs can be thought of as being composed of a collection of independent hypergraphlets (Gaudelet et al., 2018; Lugo-Martinez et al., 2021). Furthermore, Lugo-Martinez et al. (2021) present a generalized algorithm for counting hypergraphlets as a means of defining a kernel method on vertex- and edge-labeled hypergraphs for analysis and learning.

To take advantage of the expressiveness of hypergraphs, researchers have tried to adapt graph neural networks (GNNs) to hypergraph neural networks (HGNNs) for graph representation learning. The challenge is how to learn powerful representative embeddings without losing such higher-order information. Huang and Yang (2021) proposed a unified framework for graph and HGNNs to unify the message passing process with minimal effort. The message passing in hypergraphs is shown to be as powerful as the one-dimensional generalized Weisfeiler–Lehman (1-GWL) algorithm in terms of distinguishing nonisomorphic hypergraphs (Böker, 2019).

1.2.3. Topological layers

Many studies have incorporated topological invariants for end-to-end learning with neural networks. Hofer et al. (2019) proposed the first topological layer using the idea of Gaussian transformation in persistence diagrams. They also proposed a novel type of readout operation to leverage PH computed via a real-valued, learnable filter function layer (Hofer et al., 2020). A more comprehensive layer for persistence and topological signatures, called PersLay, was described by Carrière et al. (2020). PersLay is an end-to-end, differentiable framework for learning versatile PH descriptors in a neural network, which allows the topological features of data to be learned automatically. PersLay supports various vectorization methods to obtain better learnable representations of persistence diagrams. Finally, Horn et al. (2021) proposed a topological neural network that is strictly more expressive than message passing GNNs.

2. METHODS

2.1. Background and notation

Here, we review the background on protein pockets, as well as some basic concepts and notations of TDA and GDL.

2.1.1. Protein binding pocket

As mentioned earlier, protein binding pockets (ligand binding sites or simply pockets) play an important role in protein function, as well as drug design. Pockets are regions with specific sizes, shapes, and physicochemical properties. Ligands, in turn, are specific small molecules that can fit into pockets and bind to host proteins. Different approaches have been used to predict ligand binding sites, including geometric-, energetic-, consensus-, template-, conservation-, and knowledge-based methods. A comprehensive review of these approaches is provided by Krivák and Hoksza (2018).

For simplicity, in this work, we treat a protein P of length n as a sequence of amino acid residues denoted as $S$ = s₁s₂s₃ ⋯ s_n, where each s_i ∈ $S$ represents 1 of the 20 common amino acids (Supplementary Table S1). These residues are quite diverse in terms of geometry, charge, hydrophobicity, polarity, and other properties. A protein pocket P_C with n amino acids is denoted as $P_{C} = s_{1}^{'} s_{2}^{'} \dots s_{n}^{'}$, where $s_{1}^{'}, s_{2}^{'}, \dots, s_{n}^{'}$ is a set of amino acid residues that are spatially close in the 3D structure; these residues are not necessarily contiguous in the protein sequence.

2.1.2. Vietoris–Rips complex

A simplex is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. A k-simplex is a k-dimensional polytope that is the convex hull of its k + 1 vertices. A simplicial complex $K$ is a set composed of simplices and satisfies the following conditions: (1) every face of a simplex from $K$ is also in $K$ and (2) the nonempty intersection of any two simplices σ₁, σ₂ ∈ $K$ is a face of both σ₁ and σ₂.

A Vietoris–Rips complex consists of all those simplices whose vertices are at a pairwise distance less than or equal to r, defined as: $\mathrm{VR}_{r} (K) = \{σ \subseteq K \mid \forall u, v \in σ, ‖ u - v ‖ \leq r\}$

In this work, we only consider the metric space ( $X$ , $d_{X}$ ), where $X$ is the set of alpha-carbon coordinates of the pockets and $d_{X}$ is the Euclidean distance between them.
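The definition above can be made concrete with a minimal sketch. It enumerates the simplices of a Vietoris–Rips complex up to a chosen dimension by brute force, which is only practical for small point sets. The function name `vietoris_rips`, the `max_dim` cutoff, and the toy coordinates are illustrative assumptions, not part of the original pipeline.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, r, max_dim=2):
    """Return the simplices of VR_r (as index tuples) up to dimension max_dim."""
    n = len(points)
    # Pairs of points whose Euclidean distance is at most r.
    close = {
        (i, j)
        for i, j in combinations(range(n), 2)
        if dist(points[i], points[j]) <= r
    }
    simplices = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(1, max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            # A k-simplex is included iff all of its vertex pairs are close.
            if all(pair in close for pair in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices


# Toy alpha-carbon coordinates in angstroms (made-up values).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0), (10.0, 10.0, 10.0)]
complex_6 = vietoris_rips(coords, r=6.0)
complex_4 = vietoris_rips(coords, r=4.0)
```

With r = 6 Å the first three points form a filled triangle, while with r = 4 Å only two edges remain; the distant fourth point stays isolated in both cases.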

2.1.3. Multiparameter persistent homology and landscape

Multiparameter persistent homology is an extension of (single-parameter) persistent homology (Vipond, 2020; Carriere and Blumberg, 2020). Formally, MPH is induced by a multifiltration function f: X → ℝ^d. For any a, b ∈ ℝ^d, we write a ≺ b when a_i ≤ b_i for all i. Then, the sublevel sets $F_{r} = \{x \in X \mid f (x) \leq r\}$ satisfy F_a ⊆ F_b whenever a ≺ b. For a family of multiparameters r₁, r₂, ⋯, r_n ∈ ℝ^d, the sets $F_{r_{i}}$ together with the inclusions $F_{r_{i}} \subseteq F_{r_{j}}$ for r_i ≤ r_j are called the multifiltration of f. To obtain homology, we apply the homology functor H_k, which maps topological spaces to vector spaces. H_k (F_r) represents the kth topological feature of F_r. The sequence of vector spaces connected by linear maps (H_k (F_a) → H_k (F_b)) is called the persistence module of f, denoted as M (f). The canonical decomposition of a persistence module is a direct sum of simple modules: $M (f) \simeq \bigoplus_{i \in I} I (α_{b_{i}}, α_{d_{i}}),$ where $I$ (α_b, α_d) with α_b < α_d is the interval module. Intuitively, an interval module represents a topological feature that appears at parameter α_b and disappears at parameter α_d in the filtration. A representation of the decomposition of M (f) in the plane is called the persistence diagram. The single-parameter persistence landscape of M (f) given by the index set $\{(α_{b_{j}}, α_{d_{j}}), j \in J\}$ is then defined as: $λ (k, t) = \operatorname{kmax}_{j \in J} \{λ ((α_{b_{j}}, α_{d_{j}})) (1, t)\}$ where kmax denotes the kth largest value of the indexed set, and $λ ((α_{b_{j}}, α_{d_{j}}))$ is the landscape associated with the interval module $I (α_{b_{j}}, α_{d_{j}})$ .

The multiparameter persistence landscape is similarly defined as: $λ (k, x) = \sup \{ε \geq 0 : β^{x - h, x + h} \geq k, \forall h \geq 0, ‖ h ‖_{\infty} \leq ε\}$

The multiparameter persistence landscape considers the maximal radius over which k features persist in every (positive) direction of x (Vipond, 2020).
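As a small illustration of the single-parameter case, the sketch below evaluates λ(k, t) from a list of (birth, death) intervals. It assumes the standard tent-shaped landscape for a single interval module, max(0, min(t − b, d − t)); the helper names `tent` and `landscape` and the toy barcode are hypothetical.

```python
def tent(birth, death, t):
    """Landscape of a single interval module I(birth, death), evaluated at t."""
    return max(0.0, min(t - birth, death - t))


def landscape(intervals, k, t):
    """k-th largest tent value at t (k is 1-based); 0 if fewer than k intervals."""
    values = sorted((tent(b, d, t) for b, d in intervals), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0


# Toy barcode: three features given as (birth, death) pairs.
bars = [(0.0, 4.0), (1.0, 3.0), (2.5, 3.5)]
first_at_2 = landscape(bars, k=1, t=2.0)   # longest-lived feature dominates
second_at_2 = landscape(bars, k=2, t=2.0)
third_at_2 = landscape(bars, k=3, t=2.0)   # third bar is not yet born at t = 2
```

The multiparameter landscape replaces the tent height with the largest radius over which at least k features persist in every positive direction, which requires rank invariants and is not covered by this sketch.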

2.1.4. Fully labeled hypergraphs

A hypergraph G is a pair (V, E), where V is the vertex set and E is a family of subsets of V called hyperedges. Any hyperedge e ∈ E is a nonempty subset of V and can connect any number of vertices. In a vertex-labeled hypergraph, a node labeling function f_V is defined as f_V: V → Σ, where Σ is a finite alphabet. Analogously, in a hyperedge-labeled hypergraph, another labeling function f_E is defined as f_E: E → Ξ, where Ξ is also a finite alphabet. Finally, a fully labeled hypergraph G is a 6-tuple (V, E, f_V, f_E, Σ, Ξ), where each node v ∈ V has a corresponding vertex label f_V (v) ∈ Σ and each hyperedge e ∈ E has a corresponding hyperedge label f_E (e) ∈ Ξ.

A hypergraphlet is a small (typically up to four nodes), simple, connected, rooted hypergraph (Lugo-Martinez et al., 2021). An n-hypergraphlet is a hypergraphlet of n nodes. Supplementary Figure S1 in the Supplementary Data shows all unlabeled hypergraphlets for n ∈ {1, 2, 3}.
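The 6-tuple definition maps naturally onto a small data structure. The following is a minimal sketch, assuming hyperedges are stored as frozensets (so parallel hyperedges are not represented, in line with simple hypergraphs). The class name `LabeledHypergraph` and the label symbols are placeholders rather than the alphabets in the Supplementary Data.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledHypergraph:
    """Fully labeled hypergraph (V, E, f_V, f_E, Sigma, Xi)."""
    sigma: frozenset                                   # vertex alphabet
    xi: frozenset                                      # hyperedge alphabet
    vertex_label: dict = field(default_factory=dict)   # f_V: vertex -> label
    edge_label: dict = field(default_factory=dict)     # f_E: hyperedge -> label

    def add_vertex(self, v, label):
        if label not in self.sigma:
            raise ValueError(f"unknown vertex label: {label!r}")
        self.vertex_label[v] = label

    def add_hyperedge(self, vertices, label):
        e = frozenset(vertices)
        if not e:
            raise ValueError("a hyperedge must be a nonempty subset of V")
        if not e <= set(self.vertex_label):
            raise ValueError("hyperedge contains unknown vertices")
        if label not in self.xi:
            raise ValueError(f"unknown hyperedge label: {label!r}")
        self.edge_label[e] = label

    def incident_edges(self, v):
        """Hyperedges that contain vertex v."""
        return [e for e in self.edge_label if v in e]


# Toy example with placeholder alphabets.
g = LabeledHypergraph(sigma=frozenset("PNAH"), xi=frozenset("DAE"))
for vertex, lab in [(0, "P"), (1, "N"), (2, "A")]:
    g.add_vertex(vertex, lab)
g.add_hyperedge({0, 1}, "D")
g.add_hyperedge({0, 1, 2}, "A")
```

Keeping the alphabets explicit makes it possible to reject labels outside Σ or Ξ at construction time.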

2.1.5. From protein structures to hypergraphs

There is a long history of modeling protein structures using hypergraphs (Maruyama et al., 2001; Freudenberg et al., 2002; Ye et al., 2007; Zhang et al., 2022; Jiang et al., 2023). Previous studies have demonstrated the need for a hypergraph-based representation over a graph-based representation for accurately modeling protein structures across different tasks (Ye et al., 2007; Zhang et al., 2022; Jiang et al., 2023). In this work, we further expand previous work on hypergraph-based modeling of protein structures by encoding different types of bonds into the representation. In order to obtain hypergraph-based representations, protein structures were modeled as fully labeled hypergraphs G = (V, E, f_V, f_E, Σ, Ξ), where each amino acid residue was represented as a vertex, the vertex alphabet Σ was derived from the physicochemical properties of amino acids, hyperedges were defined by different combinations of bond or interaction types (e.g., hydrogen bond, spatial proximity, or electrostatic), and the hyperedge alphabet Ξ was derived by assigning a unique label to each biochemically possible combination of bond/interaction types. In the case of spatial proximity, for a given amino acid residue, we compute a sphere of a prespecified radius (6 Å in our case). Then, all residues within the sphere are considered spatially close. We summarize the vertex alphabets (Supplementary Table S1) and edge alphabet (Supplementary Table S2) used in this study in the Supplementary Data.
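The spatial proximity rule can be sketched as follows. The sketch assumes that each residue contributes one candidate hyperedge consisting of itself and every residue whose alpha carbon lies within the sphere; the text does not spell out this exact construction, so treat it as one plausible reading. The function name and coordinates are illustrative.

```python
from math import dist


def proximity_hyperedges(ca_coords, radius=6.0):
    """One candidate hyperedge per residue: the residue plus all others within radius."""
    edges = set()
    for ci in ca_coords:
        # The center itself is always a member because its distance is 0.
        members = frozenset(
            j for j, cj in enumerate(ca_coords) if dist(ci, cj) <= radius
        )
        if len(members) > 1:      # skip residues with no spatial neighbor
            edges.add(members)    # the set drops duplicate spheres
    return edges


# Toy alpha-carbon coordinates in angstroms (made-up values).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_edges = proximity_hyperedges(toy_coords)
```

In the full representation, such proximity relations would be combined with other bond or interaction types before a hyperedge label from Ξ is assigned.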

2.2. Datasets

In this work, we focus on the quantification of protein binding pockets from protein 3D structures. Binding pockets are highly related to protein functions, especially enzymatic functions (Stank et al., 2016). Therefore, we consider two enzyme-related biological tasks: (1) classification between enzymes and nonenzymes from protein structures and (2) prediction of the enzyme class from protein structures. For the former, we compiled two publicly available datasets: D&D and PROTEINS. D&D is a widely used dataset originally published by Dobson and Doig (2003), composed of 1178 proteins categorized as enzymes (691) and nonenzymes (487). PROTEINS (Dobson and Doig, 2003; Borgwardt et al., 2005) is another popular dataset of protein structures composed of 1128 proteins categorized as enzymes (665) and nonenzymes (463). For the latter task, we collected another publicly available dataset ENZYMES (Borgwardt et al., 2005). ENZYMES contains 600 protein structures listed in the BRENDA enzyme database (Schomburg et al., 2002). The proteins are further annotated based on the Enzyme Commission (EC) number, which is a numerical classification scheme for enzymes. In particular, the annotations are divided into six groups: Oxidoreductases (EC 1), Transferases (EC 2), Hydrolases (EC 3), Lyases (EC 4), Isomerases (EC 5), and Ligases (EC 6), respectively (McDonald and Tipton, 2023). In this multiclass classification task, we have 100 annotated structures for each group.

2.3. Identifying protein pockets

Let P be a protein structure of interest. We first predicted the binding pockets of P using P2Rank (Krivák and Hoksza, 2018), a software tool for the prediction of ligand binding sites from protein structures. Let $P_{C}$ denote the resulting set of predicted binding pockets of P across different levels of pocket confidence. Among the predicted pockets $P_{C_{i}} \in P_{C}$ , we only considered five nested putative pockets such that $P_{C_{1}} \subseteq P_{C_{2}} \subseteq P_{C_{3}} \subseteq P_{C_{4}} \subseteq P_{C_{5}}$ , denoted as $P_{C}^{'} = \{P_{C_{1}}, ..., P_{C_{5}}\}$ . Once the protein pockets $P_{C}^{'}$ had been identified, we learned their local and global representations, as described in the next two subsections, respectively.

2.4. Learning pocket geometry on a hypergraph

As described in Section 2.1.5, we defined an improved representation of protein structures as fully labeled hypergraphs. This enriched hypergraph-based representation enables a more accurate modeling of the biochemical information within protein structures than the previous graph-based protein structure models (Lugo-Martinez et al., 2016).

Let G be a fully labeled hypergraph G = (V, E, f_V, f_E, Σ, Ξ) corresponding to protein structure P, and let $P_{C}^{'}$ be the associated protein pockets $P_{C}^{'} = \{P_{C_{1}}, ..., P_{C_{5}}\}$ . The workflow for local representations takes each protein pocket $P_{C_{i}} \in P_{C}^{'}$ and first uses hypergraphlets to probe the hypergraph-based representation of the pocket as follows: for each $v \in P_{C_{i}}$ , the hypergraphlet count vector is computed as: $ϕ_{n} (v) = (φ_{n_{1}}, φ_{n_{2}}, \dots, φ_{n_{κ (n, Σ, Ξ)}})$ where $φ_{n_{i}}$ is the count of the ith fully labeled n-hypergraphlet rooted at v and $κ (n, Σ, Ξ)$ is the total number of vertex- and hyperedge-labeled n-hypergraphlets (Lugo-Martinez et al., 2021). Given n, Σ, and Ξ, $κ (n, Σ, Ξ) = \sum_{i = 1}^{| S (n) |} m_{i} (n, Σ, Ξ) \cdot | S_{i} (n) |$ , which is fully described in Supplementary Table S3.

The count vector for each vertex is then normalized and fed into a message-passing HGNN as the initial vertex embedding ( $x_{i}^{0}$ ). As a baseline, we also feed the HGNN with one-hot embeddings of the 20 amino acids. The message-passing process in an HGNN (Huang and Yang, 2021) is as follows: $\text{(MP)} \quad \begin{cases} h_{e} = φ_{1} (\{x_{j}\}_{j \in e}) \\ \tilde{x}_{i} = φ_{2} (x_{i}, \{h_{e}\}_{e \in E_{•}}) \end{cases}$ where both φ₁ and φ₂ are permutation-invariant functions that aggregate information from vertices and hyperedges, respectively. The exact form we used is the hypergraph equivalent of the graph convolutional network with initial residual and identity mapping (GCNII), a powerful convolutional approach with initial residual connection and identity mapping mechanisms (Chen et al., 2020): $\text{(MP)} \quad \begin{cases} \hat{x}_{i} = \frac{1}{\sqrt{d_{i}}} \sum_{e \in \tilde{E}_{•}} \frac{1}{\sqrt{d_{e}}} h_{e} \\ \tilde{x}_{i} = ((1 - β) I + β W) ((1 - α) \hat{x}_{i} + α x_{i}^{0}) \end{cases}$ where α and β are hyperparameters, I is the identity matrix, W is a learnable weight matrix, $x_{i}^{0}$ is the initial embedding of vertex v_i, $\tilde{x}_{i}$ is the output embedding of vertex v_i after one round of message passing, d_i is the number of extended neighbor nodes of v_i, h_e is the hyperedge embedding, d_e is the average degree of a hyperedge, and $\tilde{E}_{•}$ is the set of extended edges as originally described by Huang and Yang (2021). This produces a vectorized local representation for the protein pocket $P_{C_{i}}$ denoted as $R_{l_{i}}$ .
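One round of the GCNII-style update above can be sketched in plain Python. Several choices here are assumptions made only for illustration: φ₁ is taken to be the mean of the member embeddings, d_i is taken to be the number of hyperedges incident to vertex i, d_e is the average of the member vertex degrees, and the caller is expected to pass the extended edge set (including any self-loops) directly. The function name `hgnn_round` is hypothetical, and W is a fixed matrix rather than a learned one.

```python
from math import sqrt


def hgnn_round(x, x0, hyperedges, W, alpha, beta):
    """One message-passing round: vertices -> hyperedges -> vertices."""
    n, dim = len(x), len(x[0])
    # phi_1 (assumed): hyperedge embedding is the mean of its member embeddings.
    h = [[sum(x[j][c] for j in e) / len(e) for c in range(dim)] for e in hyperedges]
    # d_i (assumed): number of hyperedges incident to vertex i.
    deg_v = [sum(1 for e in hyperedges if i in e) for i in range(n)]
    # d_e: average degree of the vertices in hyperedge e.
    deg_e = [sum(deg_v[j] for j in e) / len(e) for e in hyperedges]
    out = []
    for i in range(n):
        # x_hat_i: degree-normalized sum over incident hyperedge embeddings.
        x_hat = [0.0] * dim
        for e, h_e, d_e in zip(hyperedges, h, deg_e):
            if i in e:
                scale = 1.0 / (sqrt(deg_v[i]) * sqrt(d_e))
                for c in range(dim):
                    x_hat[c] += scale * h_e[c]
        # Initial residual: mix the aggregate with the initial embedding x_i^0.
        z = [(1 - alpha) * x_hat[c] + alpha * x0[i][c] for c in range(dim)]
        # Identity mapping: apply ((1 - beta) I + beta W) to z.
        Wz = [sum(W[r][c] * z[c] for c in range(dim)) for r in range(dim)]
        out.append([(1 - beta) * z[r] + beta * Wz[r] for r in range(dim)])
    return out


# Toy example: three vertices with 2-dimensional embeddings and two hyperedges.
x_init = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hyperedges = [{0, 1}, {0, 1, 2}]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = hgnn_round(x_init, x_init, toy_hyperedges, identity, alpha=0.1, beta=0.5)
```

Setting α = 1 returns the (transformed) initial embedding unchanged by the neighborhood, which is a quick sanity check on the residual term.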

This process is repeated for the remaining putative pockets in $P_{C}^{'}$ , thus producing the local representation for each of the five pockets as R_l₁, R_l₂, R_l₃, R_l₄, and R_l₅. Finally, the local representations of the nested putative pockets are concatenated into the final local representation R_l for $P_{C}^{'}$ as R_l = Concat (R_l₁, R_l₂, R_l₃, R_l₄, R_l₅). Figure 1b–c (bottom) illustrates an example of this workflow for local representations on the bioF enzyme (8-amino-7-oxononanoate synthase) along with the corresponding protein structure (PDB ID: 1DJ9) and a putative pocket (Fig. 1a).

FIG. 1.

Global and local representation of a putative pocket for 8-amino-7-oxononanoate synthase (AONS) of Escherichia coli. (a) AONS (Gene: bioF, PDB ID: 1DJ9) is an enzyme that catalyzes the decarboxylative condensation of pimeloyl-CoA and L-alanine to produce 8-amino-7-oxononanoate (AON). (b) A P2Rank score filtration shows alpha carbon atoms (C_α) around the putative pocket (top), along with the corresponding fully labeled hypergraphs (bottom). (c) Discrete multiparameter persistent homology and topological layers are applied to obtain the vectorized global representation (top). Hypergraphlet counts (one-hot embedding is a special case) are the input of a hypergraph neural network in which local information is passed via hyperedges (bottom). The resulting concatenated vector contains global and local information without any supervision.

2.5. Learning pocket topology on a multiparameter perspective

Let $P_{C}^{'}$ be the identified protein pockets $P_{C}^{'} = \{P_{C_{1}}, ..., P_{C_{5}}\}$ corresponding to the protein structure P. The global topological representation R_g of pockets $P_{C_{i}} \in P_{C}^{'}$ is captured by a biparameter PH, where the first filtration parameter is the distance r and the second is the P2Rank score t, which measures the confidence of the pocket prediction. Therefore, the sublevel set in our task is as follows: $F_{(r, k)} = \{σ \subseteq K \mid \forall u, v \in σ, ‖ u - v ‖ \leq r, t_{u} < k, t_{v} < k\}$ where k is the threshold on the pocket score and $K = \bigcup_{i = 1}^{5} K_{i}$ is the simplicial complex of the five putative nested pockets over the metric space ( $X$ , $d_{X}$ ) of all alpha carbon (C_α) atoms with the Euclidean distance. Here, u and v represent alpha carbon atoms with residue-wise P2Rank pocket scores (Krivák and Hoksza, 2018) t_u and t_v, respectively.
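The biparameter sublevel set can be illustrated with a brute-force sketch that keeps a simplex only if all of its vertices are pairwise within distance r and all have a score below k. The function name `sublevel_complex`, the `max_dim` cutoff, and the toy coordinates and scores are assumptions for illustration only.

```python
from itertools import combinations
from math import dist


def sublevel_complex(coords, scores, r, k, max_dim=2):
    """Simplices whose vertices are pairwise within r and all have score < k."""
    allowed = [i for i, t in enumerate(scores) if t < k]
    simplices = [(i,) for i in allowed]
    for dim in range(1, max_dim + 1):
        for sigma in combinations(allowed, dim + 1):
            if all(
                dist(coords[u], coords[v]) <= r
                for u, v in combinations(sigma, 2)
            ):
                simplices.append(sigma)
    return simplices


# Toy alpha-carbon coordinates (angstroms) and residue-wise scores (made up).
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0)]
residue_scores = [0.2, 0.4, 0.8]
small = sublevel_complex(ca, residue_scores, r=4.0, k=0.5)
large = sublevel_complex(ca, residue_scores, r=6.0, k=0.9)
```

Because both conditions only loosen as r and k grow, the complex at a smaller parameter pair is always contained in the complex at a larger one, which is the nesting property a multifiltration requires.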

The filtration value for the distance r ranges from 0 to the maximum diameter of all pockets in steps of 0.05 Å. The filtration value for the pocket score k ranges from 0 to 1, with steps given by the five quintiles of all residue-wise scores in the same protein. For proteins with no or too few putative pockets, we also treat the residues between consecutive quintiles (0, 0.2, 0.4, 0.6, 0.8, 1) of P2Rank scores as putative pockets. The final filtration value space is the Cartesian product of distance values and pocket scores. It is worth noting that we take all pockets together and use MPH to analyze the global topological information, which differs from the procedure for capturing local information described in the previous subsection.
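The filtration grid described above might be assembled as in the following sketch. It assumes the score thresholds are the 20th, 40th, 60th, and 80th percentiles of the residue-wise scores plus the maximum score, which is one reading of "five quintiles"; the function name and the example values are hypothetical.

```python
from itertools import product
from statistics import quantiles


def filtration_grid(max_diameter, residue_scores, step=0.05):
    """Cartesian product of distance values and quintile score thresholds."""
    n_steps = int(round(max_diameter / step))
    # Rounding keeps the distance values free of accumulated float noise.
    distances = [round(i * step, 10) for i in range(n_steps + 1)]
    # Four interior cut points split the scores into five groups;
    # the maximum score closes the last group (assumed convention).
    cuts = quantiles(residue_scores, n=5, method="inclusive")
    thresholds = cuts + [max(residue_scores)]
    return list(product(distances, thresholds))


# Toy input: pockets up to 1 angstrom wide and eleven residue scores.
toy_scores = [i / 10 for i in range(11)]
grid = filtration_grid(max_diameter=1.0, residue_scores=toy_scores)
```

Each grid point (r, k) then indexes one sublevel complex of the biparameter filtration.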

Next, the persistence of our biparameter filtration is computed, and the persistence diagrams are obtained. The biparameter persistence diagrams are the input of a neural network with PersLay layers; PersLay is a unified topological layer that captures topological signatures. The unified operation on a persistence diagram is given as: $\mathrm{TopoLay} (Dg) = \mathrm{op} (\{w (p) \cdot ϕ (p)\}_{p \in Dg}),$ where op (•) is a permutation-invariant operation, and w (•) and $ϕ$ (•) are the weight function and the transformation function for points in persistence diagrams, respectively.

Topological signatures are automatically calculated in a topological layer with different weights and transformation functions. In this work, we focus on the topological landscape, which uses a constant weight w = 1 and a triangle point transformation $ϕ_{Λ} (p) = [Λ (z_{1}), Λ (z_{2}), \dots, Λ (z_{n})]^{⊤}$ , where Λ (z) = max {0, y − |z − x|} is a peak function at the diagram point p = (x, y). The values z₁, ⋯, z_n are the sample locations at which the landscape is evaluated. The kth-order persistence landscape can be extracted by setting op to the kth-largest-value operation. Finally, each persistence diagram Dg is passed through such a layer to obtain the global representation R_g. Given that the second filtration value k is discrete, the persistence diagram can be decomposed into diagrams for the sublevel sets at the five pocket-score intervals, which map to the corresponding geometry. Therefore, the resulting global representation R_g for $P_{C}^{'} = \{P_{C_{1}}, ..., P_{C_{5}}\}$ is R_g = Concat (R_g₁, R_g₂, R_g₃, R_g₄, R_g₅). Figure 1b and c (top) illustrates an example of this workflow for global representations on the same bioF enzyme (8-amino-7-oxononanoate synthase) along with the corresponding protein structure (PDB ID: 1DJ9) and a putative pocket (Fig. 1a).
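The generic operation op({w(p) · ϕ(p)}) specialized to the landscape can be written as a short, non-learnable sketch. It uses the peak function exactly as stated in the text, with each diagram point given as the peak location (x, y); the function names are illustrative and this is not the PersLay implementation, which is a differentiable layer with trainable parameters.

```python
def triangle(point, samples):
    """Triangle transformation: evaluate max(0, y - |z - x|) at each sample z."""
    x, y = point
    return [max(0.0, y - abs(z - x)) for z in samples]


def topo_layer(diagram, samples, k, weight=lambda p: 1.0):
    """op = k-th largest value (1-based), taken coordinate-wise over all points."""
    transformed = [
        [weight(p) * value for value in triangle(p, samples)] for p in diagram
    ]
    result = []
    for i in range(len(samples)):
        column = sorted((vec[i] for vec in transformed), reverse=True)
        # Fewer than k points contribute nothing at this sample location.
        result.append(column[k - 1] if k <= len(column) else 0.0)
    return result


# Toy diagram of two peaks and two sample locations.
toy_diagram = [(1.0, 2.0), (2.0, 1.0)]
toy_samples = [1.0, 2.0]
first_landscape = topo_layer(toy_diagram, toy_samples, k=1)
second_landscape = topo_layer(toy_diagram, toy_samples, k=2)
```

Because sorting ignores the order of the input points, the output is permutation invariant, as required of op.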

The final representation R for $P_{C}^{'}$ is the concatenation of the paired local (R_l) and global (R_g) representations, that is, R = Concat (R_l, R_g).

2.6. Revealing niches by statistical analysis

The count vector representation provides a very useful distribution of fully labeled hypergraphlets to build the hypergraph, showing the frequency of higher-order interactions. To statistically analyze the enrichment of these interactions, we compare our hypergraphlet-based counting with the configuration model proposed by Chodrow (2020). Usually used as a null model, this configuration model builds random hypergraphs by holding constant node degree and edge dimension sequences but generates multiple configurations.

In this work, we extend the configuration model by adding both node labels from an alphabet Σ and hyperedge labels from an alphabet Ξ. In the null model, during sampling from the configuration, we randomly assign each node or edge label according to its natural abundance. As an example, consider the toy pocket with three residues and two interactions shown in Figure 2a. If the background abundances of the node labels P, N, and A are r_P, r_N, and r_A, and the background abundances of the hyperedge labels D and A are $r_{D}^{'}$ and $r_{A}^{'}$ , then the probability of generating such a motif is $ℙ (G_{\text{fully labeled}}) = ℙ (G_{\text{unlabeled}}) \cdot \binom{3}{1} r_{P} r_{N} r_{A} r_{D}^{'} r_{A}^{'}$ where ℙ (G_unlabeled) is from the original hypergraph configuration model (Chodrow, 2020). The equivalence classes for unlabeled hypergraphlets are shown in Supplementary Table S3.

FIG. 2.

Examples of unlabeled hypergraphlets and fully labeled hypergraphlets. The string representation of each hypergraphlet is (a) PNA|DA (Type VI), (b) HP^†|E (Type I), and (c) PNP^†|D (Type II), where the top node is the root. The colors are consistent with Figure 6 (see Supplementary Fig. S1 and Supplementary Table S1).

We sample from our fully labeled configuration model multiple times and compute the frequency of each motif. We then obtain the over- or under-expression of each pattern from the difference between the observed count f_let and the mean frequency sampled from the null model, $〈 \hat{f}_{let} 〉$ . The total number of motifs is kept the same in the counting and simulation procedures. The abundance difference Δ_let is $Δ_{let} = \frac{f_{let} - \langle \hat{f}_{let} \rangle}{f_{let} + \langle \hat{f}_{let} \rangle + ε}$

Following Milo et al. (2004), we set the smoothing parameter ε = 4 to avoid unrealistically large values when f_let and $\hat{f}_{let}$ are both small.

The ensemble of over- and under-expression values of all the nth higher-order motifs is called the hypergraph significance profile (HSP). The normalized HSP Δ_n is the fingerprint of the local structure of the hypergraph (Lotito et al., 2022) and has the same length as the count vector $ϕ_{n}$ : $Δ_{n} = (Δ_{let, n_{1}}, Δ_{let, n_{2}}, \dots, Δ_{let, n_{κ (n, Σ, Ξ)}}) / {(\sum_{i} Δ_{let, n_{i}}^{2})}^{\frac{1}{2}}$
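The abundance difference and the normalized profile can be computed directly from the formulas above. In this sketch, the null-model frequencies are passed in as a list of count vectors, one per draw; the sampler itself is not shown, and the function names and toy counts are illustrative assumptions.

```python
from math import sqrt


def abundance_difference(observed, null_mean, eps=4.0):
    """Delta_let = (f - <f_hat>) / (f + <f_hat> + eps)."""
    return (observed - null_mean) / (observed + null_mean + eps)


def significance_profile(observed_counts, null_samples, eps=4.0):
    """Normalized hypergraph significance profile (unit Euclidean norm)."""
    n_draws = len(null_samples)
    # Mean count of each motif over all draws from the null model.
    means = [
        sum(sample[i] for sample in null_samples) / n_draws
        for i in range(len(observed_counts))
    ]
    deltas = [
        abundance_difference(f, m, eps) for f, m in zip(observed_counts, means)
    ]
    norm = sqrt(sum(d * d for d in deltas))
    # An all-zero profile is returned as is to avoid dividing by zero.
    return [d / norm for d in deltas] if norm > 0 else deltas


# Toy counts for three motifs and two null-model draws (made-up numbers).
observed = [10, 2, 0]
null_draws = [[3, 2, 5], [5, 2, 3]]
profile = significance_profile(observed, null_draws)
```

Positive entries mark motifs that are over-represented relative to the null model, and negative entries mark under-represented ones.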

2.7. Evaluation methodology

In this section, we describe the evaluation methodology for each prediction task. For the classification task using the D&D and PROTEINS datasets, we first systematically evaluate the performance based on different representations. To test the effects of global and local information, we evaluate the global and local representations in isolation and in combination (by concatenation). To evaluate the impact of using different numbers of pockets, we retain only the local and global representations of the top k (k = 1, 2, 3, 4, 5) pockets. To test whether our local representation improves prediction performance, we measure the performance of the proposed hypergraphlet-based embeddings across three state-of-the-art methods: GIN (Xu et al., 2018), MEWISPool (Nouranizadeh et al., 2021), and DDGK (Al-Rfou et al., 2019). For this comparison, the initial node features are one-hot encodings of the 20 amino acid types. We then replace them with our hypergraphlet counts as initial node features. Next, we test whether accuracy increases when global information is added to the corresponding model. This is accomplished by concatenating our global features to the output of each model’s final pooling or readout function, before the output layer. The layer dimensions are modified accordingly.

For the enzyme class prediction task, we also test the effects of global and local information as well as the impact of using different numbers of pockets; however, we only report the performance based on top k (k = 1, 3, 5) putative pockets, respectively.

In the evaluation of each method, we use 10-fold cross-validation: in each iteration, 10% of the samples are selected as the test set, whereas the remaining 90% are used for training. A shallow neural network was used to construct our predictors and perform the comparative evaluation. We implemented the models in PyTorch (Paszke et al., 2017), an automatic differentiation library, with the default settings for linear layers. Rectified linear unit (ReLU) activation functions were used. The model was trained using the backpropagation algorithm and the cross-entropy loss function. The initial learning rate was set to 0.05, and the decay rate was set to 0.99. The best model within 100 epochs was saved as the final one.
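The protocol above can be sketched with the standard library alone. The sketch assumes that folds are formed by shuffling with a fixed seed and that the decay rate is applied multiplicatively once per epoch; neither detail is stated in the text, and the function names are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # assumption: shuffled split
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def learning_rate(epoch, initial=0.05, decay=0.99):
    """Assumption: exponential schedule, lr = initial * decay ** epoch."""
    return initial * decay ** epoch

# 1139 kept D&D proteins (Table 1) split into 10 folds.
splits = list(k_fold_indices(1139))
final_lr = learning_rate(100)   # learning rate after 100 epochs
```

Under this assumed schedule, the learning rate falls to roughly a third of its initial value by epoch 100, so the "best model within 100 epochs" is selected while the step size is still appreciable.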

3. RESULTS

In the Results section, we first report the overall performance of our proposed framework on two function prediction tasks: classifying enzymes versus nonenzymes and predicting the enzyme class. Then, we evaluate both the local geometric representation and the global topological representation and their expressive power. We extensively evaluate how well the captured geometric and topological features align with experimentally verified structures in biochemistry, including mechanisms based on spectroscopic, kinetic, and crystallographic studies (Webster et al., 2000). That is, we align local and global information with reference pocket properties from enzymology. Finally, we give case studies of geometry and topology.

3.1. Enzyme classification

To study the impact of learned representations, we evaluated the classification performance on the enzyme dataset (Dobson and Doig, 2003). Enzymes are specialized functional proteins that speed up a specific type of biochemical reaction. The place where the substrate binds is called the active site. Active sites are almost always among the putative binding pockets (Dobson and Doig, 2003). Among the 1178 proteins in the D&D dataset, 691 are enzymes and 487 are nonenzymes. After removing proteins with poor structures or without pocket predictions, we keep 1139 of the 1178 proteins, distributed as 666 enzymes and 473 nonenzymes (Supplementary Table R1). We perform a similar screening process on the PROTEINS dataset and keep 1092 of the 1128 proteins, comprising 640 of 665 enzymes and 452 of 463 nonenzymes (Supplementary Table R2). Table 1 summarizes the two enzyme classification datasets.

Table 1.
Enzyme Classification Dataset Summary

Dataset	D&D	PROTEINS
No. of total entries	1178	1128
No. of original enzymes	691	665
No. of original nonenzymes	487	463
No. of kept entries	1139	1092
No. of kept enzyme structures	666	640
No. of kept nonenzyme structures	473	452

We first study the power of the global and local representations in isolation. As shown in Table 2 and Figure 3, fully labeled hypergraphlets performed better than one-hot amino acid encoding (accuracy: 0.721 vs. 0.707). More importantly, enzymes are better identified by combining local and global representations (accuracy, combined: 0.761; global: 0.741; local: 0.721), where the best performance is achieved by integrating global information into hypergraphlet-based counts. Furthermore, we study the effect of the number of pockets. In Figure 3, we vary the number of pockets from the top 1 to the top 5 and compare the predictive accuracy. The global representation is always better than the local representation, except for the top 1 pocket, and the combined representation outperforms either representation in isolation. Overall, the best accuracy for all three approaches is always achieved with the top 4 or top 5 pockets.

Table 2.

Classification Accuracy for Local and Global Representation on the D&D Dataset and PROTEINS Dataset

Representation	D&D	PROTEINS
Global^a	0.741 ± 0.012	0.752 ± 0.009
Local^b	0.707 ± 0.005	0.701 ± 0.006
Local^b + Global^a	0.756 ± 0.022	0.760 ± 0.019
Local^c	0.721 ± 0.021	0.746 ± 0.025
Local^c + Global^a	0.761 ± 0.013	0.782 ± 0.015

Mean and standard deviation of binary classification in a 10-fold cross-validation using shallow (five-layer) neural networks.

The bold data indicates the highest value for each dataset (column).

^a Topological landscape extracted by the operator kth max.

^b One-hot embedding as initial node features.

^c Hypergraphlet counting embedding as initial node features.

FIG. 3.

Performance comparison for different numbers of putative pockets. The values represent the mean accuracy based on each representation: Global°, Local^†, and Local^† + Global°.

Next, we compare our approaches with state-of-the-art methods based on kernels or GNNs. As shown in Table 3, accuracy improves for every model when the global representation is added. The highest average accuracy, 0.865, is achieved by MEWISPool (Nouranizadeh et al., 2021), a GNN-based approach with maximum entropy weighted independent set pooling, combined with the global features. For DDGK (Al-Rfou et al., 2019), a graph kernel-based deep learning method, the best performance is achieved with one-hot embeddings rather than hypergraphlet counts. This might be due to the overuse of kernel tricks in both steps. We note that the global information performs surprisingly well and further improves the state-of-the-art models. These results demonstrate the advantage of our representation.

Table 3.

Classification Accuracy with Different Models and Representations on D&D Dataset

Node representation	Model	Without global^a	With global^a
Local^b	NN	0.707 ± 0.005	0.756 ± 0.022
Local^c	NN	0.721 ± 0.021	0.761 ± 0.013
Local^b	GIN(sum) (Xu et al., 2018)	0.752 ± 0.034	0.789 ± 0.028
Local^c	GIN(sum)	0.773 ± 0.021	0.802 ± 0.019
Local^b	MEWISPool (Nouranizadeh et al., 2021)	0.843 ± 0.002	0.865 ± 0.008
Local^c	MEWISPool	0.856 ± 0.006	0.861 ± 0.015
Local^b	DDGK (Al-Rfou et al., 2019)	0.831 ± 0.027	0.853 ± 0.019
Local^c	DDGK	0.827 ± 0.017	0.848 ± 0.012

Mean and standard deviation of binary classification with or without topological features. Some performance of state-of-the-art methods is based on original 1178 proteins, whereas ours is based on 1139 proteins (Table 1).

The bold data indicates the highest value for each dataset (column).

^a Topological landscape extracted by the operator kth max.

^b One-hot embedding as initial node features.

^c Hypergraphlet counting embedding as initial node features.

NN, neural networks.

3.2. Enzyme class prediction

To comprehensively and robustly evaluate our framework, we predict the enzyme class using the ENZYMES dataset. Among the 600 enzyme structures, each class (i.e., EC numbers 1–6) has 100 structures. After filtering out proteins with poor structures or pocket predictions, we keep 597 out of 600. The resulting distribution of structures per class is shown in Table 4, and the detailed list of protein structures is provided in Supplementary Table R3.

Table 4.
Distribution of Structures in Six Enzyme Commission Number Classes of ENZYMES Dataset

	EC1	EC2	EC3	EC4	EC5	EC6
No. of structures	100	100	99	100	100	98

EC, Enzyme Commission.

Table 5 shows the results for the task of predicting the enzyme class. Similar to enzyme classification, combining hypergraphlet-based local and global representations outperforms the local-only and global-only approaches across different numbers of top pockets. For example, for the top 5 pockets, combining local and global representations (accuracy: 0.722) outperforms both local-only (accuracy: 0.683) and global-only (accuracy: 0.604). In this task, we also provide evidence that local representations outperform global representations. The results also show that the number of pockets plays a central role in the prediction of the enzyme class, which is consistent with the functional bases and rational nomenclature of enzymes (McDonald and Tipton, 2023). Next, we compare our Local + Global results with other state-of-the-art methods for this task in the published literature. FGW sp (Vayer et al., 2018) is an optimal transport-based method that achieves an average accuracy of 0.712 on the same dataset. Depthwise Separable Graph Convolution Network (DSGCN) (Balcilar et al., 2020), a GNN-based method aware of the spectral and spatial domains, achieved an accuracy of 0.784 on the same data for this prediction task. DSGCN also provides a spectral analysis of convolution frequency profiles, which can partially explain its predictive power.

Table 5.

Enzyme Commission Number Prediction Accuracy for Local and Global Representation and Comparison of Different Number of Putative Pockets on ENZYMES Dataset

Representation	Top 1	Top 3	Top 5
Global^a	0.551 ± 0.036	0.594 ± 0.019	0.604 ± 0.016
Local^b	0.608 ± 0.007	0.649 ± 0.006	0.641 ± 0.007
Local^b + Global^a	0.601 ± 0.020	0.657 ± 0.014	0.675 ± 0.013
Local^c	0.631 ± 0.009	0.668 ± 0.010	0.683 ± 0.005
Local^c + Global^a	0.671 ± 0.023	0.713 ± 0.009	0.722 ± 0.018

Mean and standard deviation of multiclass classification in a 10-fold cross-validation using shallow (five-layer) neural networks.

The bold data indicates the highest value for each dataset (column).

^a Topological landscape extracted by the operator kth max.

^b One-hot embedding as initial node features.

^c Hypergraphlet counting embedding as initial node features.

3.3. Statistical analysis of niches

Finally, we investigate the power of statistical analyses to associate pocket niches with higher-order motifs (Lotito et al., 2022). We calculate the frequency and normalized HSP for fully labeled 1-, 2-, and 3-hypergraphlets in the top 1 pocket (Fig. 4).

FIG. 4.

The frequency profile of 7,287 fully labeled motifs, including all 1-hypergraphlets, 2-hypergraphlets, and 3-hypergraphlets, in the scheme of positively charged (P)/negatively charged (N)/other amino acids (O). The corresponding normalized HSP Δ_n values for all motifs are in descending order of Δ_let. HSP, hypergraph significance profile.

Taking the charge property as an example, we find the most abundant and significant 1-, 2-, and 3-hypergraphlets and relate them to enzymology studies. Here, we denote a fully labeled hypergraphlet by its corresponding string representation (Supplementary Fig. S1). The most frequent and significant 1-, 2-, and 3-hypergraphlets are a noncharged amino acid (“O”), two proximal noncharged amino acids (OO|D, Type I), and three proximal noncharged amino acids (OOO|DD, OOO|DDD, OOO|DDDD, Type II to X). One positively charged amino acid with two accompanying noncharged amino acids (OOP|DD, Type IV or VI) in the pocket comes immediately after in the list. Another example is a salt bridge, where a glutamic acid and a lysine show an electrostatic interaction and a hydrogen bond (Horovitz et al., 1990). The occurrence of such salt bridges can be captured by a few 3-hypergraphlets, such as PNO|ID (Type III or V), the 264th most significant hypergraphlet, consistent with the widespread occurrence of salt bridges within proteins (Horovitz et al., 1990). Unsurprisingly, our 5 physicochemical-based vertex-labeling schemes and 15 interaction types (Supplementary Table S2) contribute substantially to incorporating domain knowledge into downstream biological analysis tasks. The emergence of task-specific motif families could be leveraged for interpretability through the HSP.

3.4. Case study of geometry

Here, we give a detailed case study of the protein AONS shown in Figure 1. The pyridoxal phosphate (PLP) cofactor (one substrate of 8-amino-7-oxononanoate synthase [AONS]) is covalently bound to Lys236, while His133 and His207 are important for binding (Fig. 5). Figure 6 shows that the proximity relationship is captured by a “D” hyperedge. Asp204 is hydrogen bonded to PLP, and O3 is hydrogen bonded to His207; such a scenario is captured by another labeled relationship in the hypergraph, “A,” which denotes a hydrogen bond without proximity. Figure 2a shows the exact hypergraphlet (PNA|DA, Type VI) representing this biochemical relationship. In the counting step, once the hypergraphlet is matched, the corresponding count increases by one.
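The counting step can be pictured as tallying string representations of matched hypergraphlets and reading them out in a fixed order to form a count vector. The sketch below assumes the matching itself (finding rooted, labeled sub-hypergraphs) has already produced the strings; the function names, the list of matches, and the vocabulary are hypothetical.

```python
from collections import Counter

def count_hypergraphlets(matched_codes):
    """Tally matched hypergraphlets by string representation, e.g. 'PNA|DA'."""
    counts = Counter()
    for code in matched_codes:
        counts[code] += 1   # each match increases the count by one
    return counts

def count_vector(counts, vocabulary):
    """Order the tallies by a fixed vocabulary of hypergraphlet codes."""
    return [counts.get(code, 0) for code in vocabulary]

# Hypothetical matches found around one root residue.
matches = ["PNA|DA", "OO|D", "PNA|DA"]
vocabulary = ["OO|D", "PNA|DA", "OOP|DD"]
vector = count_vector(count_hypergraphlets(matches), vocabulary)
```

Keeping the vocabulary order fixed across residues and proteins is what makes the resulting vectors comparable as node features.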

FIG. 5.

The binding pocket and interaction illustration of protein AONS. The top 1 pocket $P_{C_{1}}$ is shown as the cyan mesh. The zoomed-in view of the protein, the ligand, and their interactions is shown in the right panel.

FIG. 6.

The hypergraph illustration of the top 1 pocket for protein AONS. Left, part of the true 2D ligand interaction diagram. Right, the corresponding fully labeled hypergraph of the pocket.

In addition, the pyridoxal enzyme (Webster et al., 2000) interaction can be inferred if we add the hydrophobic node label for Ala206 and Thr233 and consider the relationship “E” that captures both proximity and hydrogen bond. Figure 2b shows the corresponding fully labeled hypergraphlet for this case.

However, it is not enough to use only the top 1 pocket. Glu175 is a negatively charged residue that can polarize the hydroxyl group of Ser179 (Webster et al., 2000). Our top 1 hypergraph $G_{C_{1}}$ fails to capture it, but Glu175 is present in $G_{C_{2}}, G_{C_{3}}, G_{C_{4}}$ , and $G_{C_{5}}$ . Interestingly, by considering the node labels for positively (P) and negatively (N) charged residues and polar residues (P^†), we can identify the His207-Ser179-Glu175 system. Figure 2c shows the fully labeled 3-hypergraphlet corresponding to this configuration, which is missing in $G_{C_{1}}$ but present in $G_{C_{2}}, \dots, G_{C_{5}}$ (Fig. 6, right). Nested pockets are necessary to dissect pockets of different sizes or pockets that interact widely with two or more functional groups.

We argue that the nested fully labeled hypergraphs are simple but enriched representations of the microenvironment within the pocket (Fig. 6). Furthermore, hypergraphlets enable the incorporation of domain knowledge via the node and hyperedge labeling alphabets, thus enabling the study of distinct complex biological interactions at the local scale.

3.5. Case study of topology

As for the global information captured by TDA, we compared the persistence diagrams for the AONS enzyme and its five putative pockets. Figure 7 shows the Vietoris–Rips complex of AONS (Fig. 7a) and five nested putative pockets (Fig. 7b–f) for alpha carbon atoms at a maximum cutoff of r = 7.5 Å. While most of the topological features are present, loops or voids are more prominent in these pockets.
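To make the construction concrete, the following minimal sketch builds only the 1-skeleton (edges) of a Vietoris–Rips complex at a single cutoff and counts connected components, that is, the homology dimension h = 0 features at that scale. It assumes the convention that two points are joined when their distance is at most the cutoff; the coordinates are made up and the function names are hypothetical. A full persistence computation over all scales and dimensions would normally rely on a dedicated TDA library.

```python
from itertools import combinations
from math import dist

def rips_edges(points, cutoff):
    """Edges of the Vietoris-Rips 1-skeleton: pairs within the cutoff (in Å)."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if dist(points[i], points[j]) <= cutoff]

def count_components(n_points, edges):
    """Number of connected components, computed with union-find."""
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n_points)})

# Made-up alpha-carbon coordinates (Å): three chained residues and one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = rips_edges(coords, cutoff=7.5)
n_components = count_components(len(coords), edges)
```

In this toy input the first and third points are 7.6 Å apart, just beyond the cutoff, yet they stay in one component through the middle point, while the distant point forms its own component.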

FIG. 7.

Illustration of the constructed Vietoris–Rips complex of (a) protein AONS P and (b)–(f) the top 5 putative pockets $P_{C_{5}} \supseteq P_{C_{4}} \supseteq P_{C_{3}} \supseteq P_{C_{2}} \supseteq P_{C_{1}}$ . The global shape of our pockets matches the shape of our ligand (Fig. 5) quite well. MPH with nested pockets and the distance (Å) shape outputs persistence diagrams (bottom) and captures all connected components, loops, and voids with homology dimension h = 0, 1, 2. MPH, multiparameter persistent homology.

The persistence diagram of the AONS complex is too noisy to extract pocket information. For instance, in the Vietoris–Rips complexes of the top four putative pockets, the shape is consistent with that of the ligand KAM, where the benzene ring and phosphonooxymethyl group are at the bottom, and the keto-aminopelargonic side is at the top. On the contrary, the top 1 pocket is smaller and only captures the rich interaction void near the benzene ring and phosphonooxymethyl group. The scattered points (h = 0, 1, 2) give an overview of the size and surface area of the narrow pocket. It is worth noting that the operation in the topological layer will select landscapes that correspond to the most persistent structures.

4. CONCLUSION

We present a representation learning framework for macromolecules (Fig. 1), which is particularly useful if the structure of the underlying macromolecule is known. Our comprehensive evaluation shows that learned representations encode informative and biochemically explainable local and global features. Extended statistical approaches for hypergraph-based motifs identify some favorable patterns in protein structures that are consistent with enzymology. However, we note that hypergraphlet-based enumeration methods can become computationally expensive when we consider densely connected hypergraphs. Overall, our work shows evidence that methods combining labeled hypergraphlet-based inference and persistence topology-based analysis are competitive with other approaches.

Footnotes

ACKNOWLEDGMENT

The authors would like to thank Prof. Karsten Borgwardt for providing the corresponding PDB IDs for the PROTEINS and ENZYMES datasets.

AUTHORS’ CONTRIBUTIONS

P.J. and J.L.-M. conceived and designed the experiments. P.J. performed the experiments. P.J. and J.L.-M. analyzed the data. All authors contributed to the writing of the article. All authors read and approved the final article.

DATA AVAILABILITY

All the data underlying this article were collected from publicly available sources in the TUDatasets repository, https://chrsmrrs.github.io/datasets/docs/datasets/. In addition, we provide all the corresponding PDB entries for each dataset in the external Excel files. The hypergraphlet code and documentation are available at https://github.com/jlugomar/hypergraphlet-kernels. All other code and documentation will be available upon acceptance at .

AUTHOR DISCLOSURE STATEMENT

The authors declare that they have no competing interests.

FUNDING INFORMATION

No funding was received for this article.

SUPPLEMENTARY MATERIAL

References

Al-Rfou, Perozzi, Zelle. Ddgk: Learning graph representations for deep divergence graph kernels. In: The World Wide Web Conference. 2019; pp. 37–48.

Atz, Grisoni, Schneider. Geometric deep learning on molecular representations. Nat Mach Intell, 2021; 3(12):1023–1032.

Balcilar, Renton, Héroux, et al. Bridging the gap between spectral and spatial domains in graph neural networks. arXiv, 2020.

Böker. Color refinement, homomorphisms, and hypergraphs. In: Graph-Theoretic Concepts in Computer Science: 45th International Workshop, WG 2019, Vall de Núria, Spain, June 19–21, 2019, Revised Papers. Springer; 2019; pp. 338–350.

Borgwardt, Ong, Schönauer, et al. Protein function prediction via graph kernels. Bioinformatics, 2005; 21(suppl_1):i47–i56.

Botnan, Lesnick. An introduction to multiparameter persistence. arXiv, 2022.

Bourlieu, Astruc, Barbe, et al. Enzymes to unravel bioproducts architecture. Biotechnol Adv, 2020; 41:107546.

Bubenik, Dłotko. A persistence landscapes toolbox for topological statistics. J Symbolic Computation, 2017; 78:91–114.

Cang, Mu, Wu, et al. A topological approach for protein classification. Computational and Mathematical Biophysics, 2015; 3(1).

Carriere, Blumberg. Multiparameter persistence image for topological machine learning. Adv Neural Inf Process Syst, 2020; 33:22432–22444.

Carrière, Chazal, Ike, et al. Perslay: A neural network layer for persistence diagrams and new graph topological signatures. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2020; pp. 2786–2796.

Chazal, Michel. An introduction to topological data analysis: Fundamental and practical aspects for data scientists. Front Artif Intell, 2021; 4:667963.

Chen, Wei, Huang, et al. Simple and deep graph convolutional networks. In: International Conference on Machine Learning. PMLR; 2020; pp. 1725–1735.

Chodrow. Configuration models of random hypergraphs. J Complex Netw, 2020; 8(3):cnaa018.

Coleman, Sharp. Protein pockets: Inventory, shape, and comparison. J Chem Inf Model, 2010; 50(4):589–603; doi: 10.1021/ci900397t

Dobson, Doig. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol, 2003; 330(4):771–783.

Fasy, Wang. Exploring persistent local homology in topological data analysis. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016; pp. 6430–6434.

Freudenberg, Zimmer, Hanisch, et al. A hypergraph-based method for unification of existing protein structure- and sequence-families. In Silico Biol, 2002; 2(3):339–349.

Gaudelet, Malod-Dognin, Pržulj. Higher-order molecular organization as a source of biological function. Bioinformatics, 2018; 34(17):i944–i953; doi: 10.1093/bioinformatics/bty570

Hofer, Kwitt, Niethammer. Learning representations of persistence barcodes. J Mach Learn Res, 2019; 20(126):1–45.

Hofer, Graf, Rieck, et al. Graph filtration learning. In: International Conference on Machine Learning. PMLR; 2020; pp. 4314–4323.

Horn, De Brouwer, Moor, et al. Topological graph neural networks. arXiv, 2021.

Horovitz, Serrano, Avron, et al. Strength and co-operativity of contributions of surface salt bridges to protein stability. J Mol Biol, 1990; 216(4):1031–1044.

Huang, Yang. Unignn: A unified framework for graph and hypergraph neural networks. arXiv, 2021.

Jiang, Wang, Feng, et al. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Adv Sci, 2023; 10(11):2206151.

Kovacev-Nikolic, Bubenik, Nikolić, et al. Using persistent homology and dynamical distances to analyze protein binding. Stat Appl Genet Mol Biol, 2016; 15(1):19–38.

Krivák, Hoksza. P2rank: Machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J Cheminform, 2018; 10(1):39.

Liang, Woodward, Edelsbrunner. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design. Protein Sci, 1998; 7(9):1884–1897.

Lotito, Musciotto, Montresor, et al. Higher-order motif analysis in hypergraphs. Commun Phys, 2022; 5(1):79.

Lugo-Martinez, Pejaver, Pagel, et al. The loss and gain of functional amino acid residues is a common mechanism causing human inherited disease. PLOS Comput Biol, 2016; 12(8):e1005091–e23; doi: 10.1371/journal.pcbi.1005091

Lugo-Martinez, Zeiberg, Gaudelet, et al. Classification in biological networks with hypergraphlet kernels. Bioinformatics, 2021; 37(7):1000–1007; doi: 10.1093/bioinformatics/btaa768

Maruyama, Shoudai, Furuichi, et al. Learning conformation rules. In: Discovery Science. (Jantke and Shinohara, eds). Springer Berlin Heidelberg: Berlin, Heidelberg; 2001; pp. 243–257.

McDonald, Tipton. Enzyme nomenclature and classification: The state of the art. FEBS J, 2023; 290(9):2214–2231.

Milo, Itzkovitz, Kashtan, et al. Superfamilies of evolved and designed networks. Science, 2004; 303(5663):1538–1542.

Nouranizadeh, Matinkia, Rahmati, et al. Maximum entropy weighted independent set pooling for graph neural networks. arXiv, 2021.

Paszke, Gross, Chintala, et al. Automatic differentiation in pytorch. 2017.

Roy, Zhang. Recognizing protein-ligand binding sites by global structural alignment and local geometry refinement. Structure, 2012; 20(6):987–997.

Saha, Katebi, Dhifli, et al. Discovery of functional motifs from the interface region of oligomeric proteins using frequent subgraph mining. IEEE/ACM Trans Comput Biol Bioinform, 2019; 16(5):1537–1549.

Schomburg, Chang, Schomburg. Brenda, enzyme data and metabolic information. Nucleic Acids Res, 2002; 30(1):47–49.

Stank, Kokh, Fuller, et al. Protein binding pocket dynamics. Acc Chem Res, 2016; 49(5):809–815.

Swenson, Krishnapriyan, Buluc, et al. Persgnn: Applying topological data analysis and geometric deep learning to structure-based protein function prediction. arXiv, 2020.

Tian, Liang. On quantification of geometry and topology of protein pockets and channels for assessing mutation effects. In: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI). IEEE; 2018; pp. 263–266.

Vayer, Chapel, Flamary, et al. Optimal transport for structured data with application on graphs. arXiv, 2018.

Vipond. Multiparameter persistence landscapes. J Machine Learning Res, 2020; 21(1):2262–2299.

Vipond, Bull, Macklin, et al. Multiparameter persistent homology landscapes identify immune cell spatial patterns in tumors. Proc Natl Acad Sci USA, 2021; 118(41):e2102166118.

Webster, Alexeev, Campopiano, et al. Mechanism of 8-amino-7-oxononanoate synthase: Spectroscopic, kinetic, and crystallographic studies. Biochemistry, 2000; 39(3):516–528.

Xu, Hu, Leskovec, et al. How powerful are graph neural networks? arXiv, 2018.

Ye, Friedman, Bailey-Kellogg. Hypergraph model of multi-residue interactions in proteins: Sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination. J Comput Biol, 2007; 14(6):777–790.

Zhang, Li, Xiao, et al. Hypergraph convolutional networks via equivalency between hypergraphs and undirected graphs. arXiv, 2022.

Zhou, Jiang, Bergquist, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol, 2019; 20(1):244–209.

