A New Structure Feature Introduced to Predict Protein

Abstract

Interactions between proteins often depend on both the sequence features and the structure features of the proteins. Both kinds of features help machine learning methods predict protein–protein interaction (PPI) sites. In this study, we introduced a new structure feature, the concave–convex feature of the protein surface, which was computed from the structural data of proteins in the Protein Data Bank database. We then constructed a prediction model combining protein sequence features and structure features, named SSPPI_Ensemble (Sequence and Structure geometric feature-based PPI site prediction). Three sequence features were used: PSSMs (Position-Specific Scoring Matrices), HMM (Hidden Markov Models) and the raw protein sequence. The Dictionary of Secondary Structure in Proteins and the concave–convex feature were used as the structure features. Compared with other prediction methods on the same test datasets, our method achieved better performance or showed clear advantages, confirming that the proposed concave–convex feature is useful in predicting PPI sites.

1. INTRODUCTION

Protein–protein interaction (PPI) is an important medium for organisms to perform their functions and establish their life processes. Many protein functions are realized through the interaction between proteins (Athanasios et al., 2017; Wang et al., 2007), such as signal transduction, immune response and cell proliferation.

Many efficient and low-cost computational approaches have been used to predict PPIs over the past decade. These approaches can be divided into two categories: (I) predicting whether a protein will interact with other proteins (Chen et al., 2019; Yang et al., 2020; Yu et al., 2020), and (II) predicting PPI sites on a protein. In the first category, many approaches have achieved high accuracy. For example, Yang created a graph-based deep learning method that used both protein sequence information and PPI network structure to predict PPIs (Yang et al., 2020). Liu trained the MindSpore ProteinBERT (MP-BERT) model, a Bidirectional Encoder Representation from Transformers, using protein pairs as inputs, making it suitable for identifying PPIs and their respective interaction sites (Liu et al., 2023). Yu proposed a novel prediction pipeline for PPI based on gradient tree boosting, where improved strategies for extracting sequence features and removing redundant features were employed to select an optimal feature subset (Yu et al., 2020). In the second category, owing to the rapid development of high-throughput sequencing technology, a large amount of protein sequence data is available, and many sequence-based prediction methods have been proposed (Hosseini et al., 2024; Hosseini and Ilie, 2022; Li et al., 2021; Manfredi et al., 2023; Mihel et al., 2008; Wang et al., 2019; Zeng et al., 2020). However, when predicting PPI sites, it is difficult to achieve high accuracy using sequence features alone. Some researchers have investigated the contribution of spatial geometric features in predicting PPI sites and obtained encouraging results (Dai and Bailey-Kellogg, 2021; Gainza et al., 2020; Northey et al., 2018). Alternatively, graph convolutional neural networks could be used to introduce potential structural features (Ding et al., 2024; Yuan et al., 2021). Therefore, it is imperative to develop a prediction method that combines sequence and geometric structure features.

The local features of proteins are also crucial for improving the performance of a prediction model. The most common method for constructing local features is a sliding window over protein sequence positions (Li et al., 2021; Zeng et al., 2020). However, local features constructed in this way only reflect the neighborhood relationships of amino acids along the protein sequence. To reflect the neighborhood relationships of amino acids in three-dimensional space, a distance matrix was constructed based on the Protein Point Cloud, and the local features were constructed from this distance matrix (Yuan et al., 2021).

In this study, we introduced a new structure feature, the concave–convex feature, to predict PPI sites. We proposed a deep neural network model to predict PPI sites, called SSPPI (Sequence and Structure geometric feature-based PPI site prediction). The Sequence features, i.e., PSSM (Position-Specific Scoring Matrices), HMM (Hidden Markov Models) and raw protein sequence, and the Structure features, i.e., DSSP (Dictionary of Secondary Structure in Proteins) (secondary structure feature) and the concave–convex feature, were used as the input of our model. In particular, we constructed the local protein features based on both sequence positions and the distance matrix, and then fused the two types of local features through two different fusion methods. To the best of our knowledge, this is the first attempt to introduce the concave–convex feature, which proved useful for predicting PPI sites, and the first attempt to fuse the two types of local features. SSPPI is available at: http://github.com/llwcool/SSPPI.

2. MATERIALS AND METHODS

2.1. Datasets

Three widely used benchmark datasets were utilized, i.e., Dset_72, Dset_186 (Murakami and Mizuguchi, 2010) and Dset_164 (Dhole et al., 2014), each named after the number of proteins it contains. Each of these benchmark datasets was created using a set of established criteria (Yuan et al., 2021) for filtering protein–protein complexes found in the PDB (Protein Data Bank). A surface residue (RSA [Relative Solvent Accessibility] > 5%) was defined as a protein–protein interacting residue if it lost more than 1 Å² of absolute solvent accessibility on complex formation. Because the percentage of interacting residues varies from protein to protein, the three datasets were merged into one so that the training and test data are drawn from the same distribution. BLAST (Altschul et al., 1997) was then utilized to eliminate redundant proteins that had over 25% sequence similarity and 90% sequence overlap. This process resulted in a collection of 395 protein chains, from which 335 protein chains were randomly selected for training (Train_335) and the remaining 60 chains were used as the independent test dataset (Dset_60).

To address the issue of outdated raw training datasets, we utilized the latest version of the protein interaction residue chains from the PiSite database (January 2019) (Higurashi et al., 2009). A total of 22,654 proteins were initially extracted from this database. We then filtered out sequences that did not contain interaction residues or were shorter than 50 amino acids, resulting in a refined set of 14,203 sequences. To reduce redundancy, we applied PSI-CD-HIT (Position-Specific Iterative Cluster Database at High Identity with Tolerance) to remove sequences with over 25% similarity, further refining the dataset. Additionally, we excluded Dset_72, Dset_164, Dset_186, Dset_448, Dset_500, Dset_315, and Dset_70 from the training dataset to ensure the uniqueness of the proteins used for training. Proteins lacking published three-dimensional (3D) structures were also removed, resulting in the exclusion of 70 proteins. Proteins with missing amino acid residues in their 3D structures were also filtered out, ensuring the completeness and robustness of the dataset for subsequent analysis. Ultimately, 1,326 proteins were selected to form the SSPPI training dataset, which was then split into a training set (90%) and a validation set (10%). This dataset serves as the foundation for the development and validation of our model. Details of the statistics of these datasets are given in Table 1.

Table 1.
Details of Datasets

Dataset	Interaction residues	Non-Interaction residues	Interaction residue percentage (%)
Dset_60	2075	11069	15.79
Dset_70	2332	9459	17.78
Dset_315	9355	55976	14.32
Dset_72	1923	16217	10.60
Dset_164	6096	27585	18.10
Dset_186	5517	30702	15.23
Train_335	10374	55992	15.63
Train_1326	58782	195978	23.07

2.2. Sequence features

In our work, PSSM (Position-Specific Scoring Matrices), HMM (Hidden Markov Models) and raw protein sequence were used as the Sequence Features of a protein.

2.2.1. PSSM

We used PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) to get the corresponding score matrix S for each protein. A highly conserved position in the multiple sequence alignment has a high score in the PSSM, while a weakly conserved position has a score close to zero. The form of the PSSM is as follows: $S = \begin{bmatrix} s_{1,1} & \dots & s_{1,20} \\ \vdots & \ddots & \vdots \\ s_{i,1} & \dots & s_{i,20} \\ \vdots & \ddots & \vdots \\ s_{n,1} & \dots & s_{n,20} \end{bmatrix}$ (1) where n is the length of a given protein sequence.

2.2.2. HMM

HHblits (Remmert et al., 2012) was used to align the query sequence against the UniClust30 database (Mirdita et al., 2017).

2.2.3. Raw protein sequence

We used a one-hot vector of length 20 to represent a specific amino acid molecule. For the 20 amino acids, our sorting order is the same as the order in which PSSM performs the alignment: $[A R N D C Q E G H I L K M F P S T W V Y]$ (2)

For example, if a position in a raw protein sequence is the amino acid ‘A’, the first element of its one-hot vector is 1 and all other elements are 0. Together with the PSSM and HMM rows, every amino acid was encoded into a 60 (3 × 20)-dimensional vector to represent the Sequence Feature of a protein. The 20-dimensional vectors in PSSM and HMM were normalized to scores between 0 and 1.
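The per-residue encoding described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, and min–max scaling is only an assumed normalization, since the text states that scores are mapped to [0, 1] without naming the method.

```python
AA_ORDER = "ARNDCQEGHILKMFPSTWVY"  # amino acid order from equation (2)

def one_hot(residue):
    """Length-20 one-hot vector; raises ValueError for letters outside AA_ORDER."""
    vec = [0.0] * 20
    vec[AA_ORDER.index(residue)] = 1.0
    return vec

def min_max(row):
    """Assumed normalization: rescale one 20-value profile row into [0, 1]."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [0.0] * len(row)
    return [(x - lo) / (hi - lo) for x in row]

def sequence_feature(residue, pssm_row, hmm_row):
    """Concatenate PSSM, HMM and raw-sequence parts into a 60-dimensional vector."""
    return min_max(pssm_row) + min_max(hmm_row) + one_hot(residue)
```

Calling `sequence_feature("A", pssm_row, hmm_row)` with two 20-value rows yields a 60-value list whose last 20 entries are the one-hot code for alanine.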

2.3. Structure features

2.3.1. DSSP

We used a method similar to that of Yuan et al. (2021) to obtain the structural features in this section. The program DSSP (Kabsch and Sander, 1983) was used, and three types of structural properties were obtained: (1) Secondary structure states are processed into a 9-dimensional vector (an 8-dimensional one-hot encoding represents the 8 classes of amino acid secondary structure states, and the last dimension represents the unknown structure state). (2) Sine and cosine values of the PHI and PSI torsion angles of the peptide chain backbone are calculated and processed into a 4-dimensional vector. (3) Solvent accessible surface area (ASA) is processed into a 1-dimensional vector. Finally, we obtained a 14-dimensional structural feature group, named DSSP.
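As a rough sketch of how the 14 dimensions (9 + 4 + 1) fit together, the snippet below assembles one residue's vector from values that DSSP would report. The function name, the ordering of the eight state codes, and the use of the raw ASA value are assumptions made for illustration only.

```python
import math

# Assumed ordering of the 8 DSSP state codes ('-' stands for coil).
SS_INDEX = {code: i for i, code in enumerate("HBEGITS-")}

def dssp_feature(ss_code, phi_deg, psi_deg, asa):
    """Build a 14-dimensional vector: 9 (state) + 4 (angles) + 1 (ASA)."""
    ss = [0.0] * 9
    ss[SS_INDEX.get(ss_code, 8)] = 1.0  # index 8 marks an unknown state
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    angles = [math.sin(phi), math.cos(phi), math.sin(psi), math.cos(psi)]
    return ss + angles + [asa]
```

Encoding the torsion angles through sine and cosine keeps the representation continuous across the ±180° boundary.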

2.3.2. Concave–convex feature

We constructed the protein surface concave–convex features via the following steps:

1. For a given protein with a PDB structure, we first performed protonation using Reduce (Word et al., 1999), then triangulated the protein surface using the MSMS program (Sanner et al., 1996).
2. Protein meshes were down-sampled and regularized by PyMesh (Zhou, 2018). In this way, the protein surface consisted of n meshes.
3. The concave–convex properties of the protein surface were measured as follows.
(a) As shown in Figure 1, for a given point O on the protein, we create a mesh object with the “form_mesh” function from the PyMesh library, and use the “vertex_normal” attribute of the mesh object to get its normal vector V, so that the tangent plane S can be represented by O and V.
(b) From step 2 above, we can obtain the n triangular meshes that contain the point O. For each mesh $M_{i}$, we compute its barycenter $BC_{i}$ and the mesh normal vector $MNV_{i}$ at this barycenter (Fig. 1).
(c) For each mesh, the line through its barycenter along its normal vector intersects the tangent plane S at one point. Denoting this intersection by $V_{i}$, the coordinates of $V_{i}$ can be obtained from equations (3) and (4). $\left\{ \begin{array}{l} V_{i}.x = BC_{i}.x + m \times MNV_{i}.x \\ V_{i}.y = BC_{i}.y + m \times MNV_{i}.y \\ V_{i}.z = BC_{i}.z + m \times MNV_{i}.z \\ (V_{i}.x - O.x) \times V.x + (V_{i}.y - O.y) \times V.y + (V_{i}.z - O.z) \times V.z = 0 \end{array} \right.$ (3) $m = \frac{(O.x - BC_{i}.x) \times V.x + (O.y - BC_{i}.y) \times V.y + (O.z - BC_{i}.z) \times V.z}{V.x \times MNV_{i}.x + V.y \times MNV_{i}.y + V.z \times MNV_{i}.z}$ (4)
(d) The n intersections and the given point O enclose a polygon $S_{1}$ on the tangent plane S, with area $A_{1}$. If we move the polygon in a direction parallel to its normal vector V, the area of the polygon will increase or decrease, so we can find a polygon $S_{n}$ with the smallest area $A_{n}$. Then, the rate of change of the polygon area, AoD (Area of Distance), can be calculated according to the following equation: $AoD = \frac{A_{1} - A_{n}}{MD}$ (5) where MD is the distance moved, and the sign of MD is determined by the moving direction of the polygon relative to the direction of the normal vector V. The larger the absolute value of AoD, the stronger the degree of convexity or concavity, as shown in the two examples in Figure 2. Then, we used PyMesh to calculate the Gaussian and mean curvatures (GC and MC) at point O, so for each given point we obtain four metrics: AoD, MD, GC and MC.
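Equations (3) and (4) amount to a standard line–plane intersection. The sketch below evaluates them for one mesh, using plain Python sequences for 3D vectors; the function names are hypothetical and the guard for a normal parallel to the plane is an added assumption, since the text does not discuss that case.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))

def intersect_tangent_plane(bc, mnv, o, v, eps=1e-12):
    """Return V_i, the point where the line BC_i + m * MNV_i meets the
    tangent plane through O with normal V (equations (3) and (4)).
    Returns None when MNV_i is parallel to the plane (assumed handling)."""
    denom = dot(v, mnv)
    if abs(denom) < eps:
        return None
    m = dot([oc - bc_c for oc, bc_c in zip(o, bc)], v) / denom  # equation (4)
    return [b + m * n for b, n in zip(bc, mnv)]                # equation (3)
```

For example, with O at the origin, V = (0, 0, 1), a barycenter at (1, 2, 3) and a mesh normal of (0, 0, 1), the intersection is (1, 2, 0).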

It is worth noting that in the PDB data, every amino acid contains many points (atoms), all of which need to be calculated. Then, in order to obtain the AoD of an amino acid, the AoD values for convex and concave points were each divided into 4 bins:

Convex bins: $(- \infty, - 0.5), [- 0.5, - 0.3), [- 0.3, - 0.1), [- 0.1, - 0.001)$ .

Concave bins: $(0.001, 0.1], (0.1, 0.3], (0.3, 0.5], (0.5, + \infty)$ .

If the AoD of a point lies within [−0.001, 0.001], we regard the surface at that point as flat. We calculate the AoD of all points belonging to the same amino acid, and for each bin, we collect the AoD at all points in that range. Thus, AoD is described by a 9-dimensional vector (4 convex bins, 1 flat bin and 4 concave bins). MD was described in the same way, but with a different number of bins (10) and different bin ranges.
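The binning of AoD values can be illustrated with the short sketch below. It assumes that each bin simply counts the points falling in its range and that the flat bin sits between the convex and concave bins; the text fixes the bin edges but not these two details, and the function names are hypothetical.

```python
def aod_bin(aod):
    """Map one AoD value to a bin index 0..8 using the stated bin edges."""
    if aod < -0.5:
        return 0   # (-inf, -0.5)
    if aod < -0.3:
        return 1   # [-0.5, -0.3)
    if aod < -0.1:
        return 2   # [-0.3, -0.1)
    if aod < -0.001:
        return 3   # [-0.1, -0.001)
    if aod <= 0.001:
        return 4   # flat: [-0.001, 0.001]
    if aod <= 0.1:
        return 5   # (0.001, 0.1]
    if aod <= 0.3:
        return 6   # (0.1, 0.3]
    if aod <= 0.5:
        return 7   # (0.3, 0.5]
    return 8       # (0.5, +inf)

def residue_aod_vector(point_aods):
    """Assumed aggregation: count the points of one residue in each bin."""
    vec = [0] * 9
    for aod in point_aods:
        vec[aod_bin(aod)] += 1
    return vec
```

With this layout, the two points of Figure 2 (AoD of −0.31 and 0.01) fall into the second convex bin and the first concave bin, respectively.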

Besides the concave–convex feature we proposed, we also used Gaussian curvature and Mean curvature to describe the convex and concave properties of protein surfaces. Gaussian curvature is the product of the two principal curvatures at a point on a surface, and Mean curvature is their average.

They can both be used to describe the shape of a surface at a certain point, such as convex and concave, and Mean curvature can even describe the trend of curvature change at that point. Based on the positive or negative value of curvatures, the shape of the surface at that point can be determined, such as convex, concave, cylindrical, and hyperbolic (Besl and Jain, 1988).

For example, when Gaussian curvature is positive, a positive Mean curvature indicates that the surface is convex at that point, a negative Mean curvature indicates that the surface is concave at that point, and a zero Mean curvature indicates that the surface is conical at that point. According to Gaussian curvature and Mean curvature at a point on a surface, the shape of the surface can be determined. All possible cases are listed in Table 2.

Therefore, a vector of length 8 obtained from the Gaussian curvature and Mean curvature of the surface is used to describe the concave–convex features of the surface. Finally, we add one dimension to represent an unknown concave–convex feature. In total, we used a 28-dimensional vector to describe the concave–convex feature (9 dimensions for AoD, 10 for MD, 8 for the curvature-based shape and 1 for the unknown state).

FIG. 1.
Measurement of geometric feature of a point on an amino acid (the thick dotted lines with arrow represent the normal vectors (MNVs) of the meshes, then a polygon (S₁) is formed by the intersections of these normal vectors with the tangent plane (S) of the point (O), and the intersection of the thin dotted lines represents the barycenter (BC) of a mesh).

FIG. 2.
(A) A point at PDB 4ZQK_A with high convexity, the AoD of which was calculated to be −0.31. (B) A point at PDB 4ZQK_A with low concavity, the AoD of which was calculated to be 0.01. The first column: the surface of protein where point O is located. The second column: the black and white parts represent the triangular meshes with the vertex O. The third column: the changes of polygon area when moving the tangent plane of the point O (the polygon is surrounded by the intersections of the tangent plane and the normal vectors of the triangular meshes).

Table 2.
Details of Convex and Concave Properties

Gaussian curvature	Mean curvature	Surface shape
Zero	Positive	Cylindrical
Zero	Zero	Flat
Zero	Negative	Hyperbolic
Positive	Positive	Convex
Positive	Zero	Conical
Positive	Negative	Concave
Negative	Positive	Saddle
Negative	Negative	Convex

2.4. Network architecture

As shown in Figure 3, the sequence and structural features of the protein are used as inputs to the prediction model. The input features comprise Seq_Feature and Struct_Feature, which are constructed by different approaches. Local protein features are commonly constructed from protein sequence positions using sliding windows, and Seq_Feature is constructed in this way. Constructing input features through sliding windows enables the model to focus on local features of the sequence, but does not integrate well with structural features. Therefore, Struct_Feature is constructed from local protein features based on spatial distance, through the following steps. (1) From the PDB file of a protein, the atomic coordinates of each amino acid residue are acquired and the Euclidean distances between all residue pairs are calculated. (2) A distance threshold D_cutoff is chosen. If the distance between a residue pair is less than or equal to D_cutoff, the pair is used to construct the protein local features; otherwise, the pair is not used.
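The two construction steps can be sketched as a simple neighbor search. The snippet assumes one representative 3D coordinate per residue (the text does not say which atom is used) and keeps the residue itself in its own neighborhood; `neighbors_within_cutoff` is a hypothetical name.

```python
import math

def neighbors_within_cutoff(coords, d_cutoff):
    """For each residue, list the indices of residues whose Euclidean
    distance to it is less than or equal to d_cutoff (self included)."""
    n = len(coords)
    result = []
    for i in range(n):
        near = [j for j in range(n)
                if math.dist(coords[i], coords[j]) <= d_cutoff]
        result.append(near)
    return result
```

The local structural feature of residue i would then be assembled from the feature vectors of the residues listed in `result[i]`, in the same way that a sliding window gathers sequence neighbors.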

FIG. 3.

The network architecture of the proposed SSPPI model and the Feature Fusion model. SSPPI, Sequence and Structure geometric feature-based PPI site prediction.

Figure 3 shows two different local feature fusion methods, which differ in when the Seq_Feature and the Struct_Feature are fused. In Feature Fusion 1, each local feature is further extracted before fusion. In contrast, in Feature Fusion 2, features of the same type within different local features are fused first, and the different types of features are fused afterwards.

Particularly, in order to enable the model to focus on the positional features of the corresponding amino acids in the protein sequence, Position-index is added to the sequence feature in the Seq_Feature: $P E (pos, 2 i) = \sin (\frac{pos}{10000^{2 i / d}})$ (6) $P E (pos, 2 i + 1) = \cos (\frac{pos}{10000^{(2 i + 1) / d}})$ (7)
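For illustration, the position index can be computed as below. The sketch follows the exponents exactly as printed in equations (6) and (7), that is, each column k uses the exponent k/d; the function name and the list-of-lists layout are assumptions.

```python
import math

def positional_encoding(length, d):
    """Return a length x d table: sine in even columns (equation (6)),
    cosine in odd columns (equation (7)), with exponent k / d for column k."""
    pe = [[0.0] * d for _ in range(length)]
    for pos in range(length):
        for k in range(d):
            angle = pos / (10000 ** (k / d))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe
```

The resulting row for position `pos` would be added element-wise to the feature vector of the residue at that position.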

The Transformer Encoder (Fig. 4) is used to add attention mechanisms to both sequence and structural features. Input features are linearly transformed to generate the Q, K and V vectors, which are then split into n heads and each head has its own set of linear transformation parameters. For each head, a scaled dot-product attention is computed. Specifically, the product of Q and K matrices is calculated, scaled, and passed through the softmax function to derive the attention weights. These weights are then applied to the V matrix to obtain the attention output for each head. Finally, the outputs from all heads are concatenated and linearly transformed to produce the final attention result. The first use of the Transformer Encoder is to extract attention within features. For DSSP, Concave Convex and [PSSM, HMM, Raw protein] in the Seq_Feature or the Struct_Feature, every feature has a separate Transformer Encoder and is used to extract ${TE}_{DSSP}, {TE}_{Cc}$ and ${TE}_{Seq}$ . Then, for each feature extracted from their own Transformer Encoder, two different fusion modules are used to obtain Fused 1 and Fused 2. Because the dimension of the protein feature matrix is relatively small, a modified ResNet, called Mini ResNet, which reduces the number of the convolutional kernels, is used to further extract the features. Finally, a multi-layer perceptron is used to output the prediction results.
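The attention computation for a single head can be sketched in a few lines. This is a generic scaled dot-product attention on plain Python lists, not the authors' Paddle implementation; the linear projections that produce Q, K and V, the head splitting and the final concatenation are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """q, k, v are lists of row vectors; returns one output row per query."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        # Scaled similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
                  for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v))
                    for c in range(len(v[0]))])
    return out
```

When all keys are identical, the weights are uniform and each output row is the mean of the value rows, which gives a quick sanity check.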

FIG. 4.

Transformer Encoder.

Table 3.

Parameters Used in SSPPI

Parameters	Value
[PSSM, HMM, Raw protein]
Num of head	6
Num of hidden layer	6
Dim of feature	60
DSSP
Num of head	2
Num of hidden layer	6
Dim of feature	14
Concave Convex
Num of head	4
Num of hidden layer	6
Dim of feature	28
Batch size	1024
Dropout rate	0.1
Loss function	Focal_loss (α = 0.85, γ = 2.0)
Optimizer	AdamW (β₁ = 0.9, β₂ = 0.99)
Patience in early stop	20
Learning rate	0.001
Regularizer	L2Decay (0.005)

The parameters are divided into four groups: the first three groups are the Transformer Encoder parameters for each feature ([PSSM, HMM, Raw protein], DSSP, and Concave Convex), while the last group consists of the training hyper-parameters.

HMM, Hidden Markov Models; PSSM, Position-Specific Scoring Matrices; DSSP, Dictionary of Secondary Structure in Proteins; SSPPI, Sequence and Structure geometric feature-based PPI site prediction.

2.5. Implementation details

We implemented our model with Paddle 2.2.2, using the following hyper-parameters: a learning rate of 0.001, L2Decay weight decay of 0.005, and a dropout rate of 0.1 to avoid over-fitting. We employed the Focal Loss function (Lin et al., 2020) and the AdamW optimizer for optimization. Detailed parameters are shown in Table 3. Focal Loss was calculated using the equation below: $\text{Focal\_Loss} = \begin{cases} -\alpha \times (1 - \sigma(\text{predict}))^{\gamma} \log(\sigma(\text{predict})), & \text{labels} = 1 \\ -(1 - \alpha) \times \sigma(\text{predict})^{\gamma} \log(1 - \sigma(\text{predict})), & \text{labels} = 0 \end{cases}$ (8) where $\sigma(\text{predict}) = \frac{1}{1 + \exp(-\text{predict})}$; α, which takes values in [0, 1], is used to balance the positive and negative samples; γ is used to balance the easy and hard samples. In our model, $\alpha = 0.85$ and $\gamma = 2.0$.

The training process lasted at most 100 epochs and took approximately 26 seconds per epoch on an NVIDIA Tesla A40 48G GPU.
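A per-sample version of equation (8) is sketched below in plain Python rather than Paddle, with the stated defaults of α = 0.85 and γ = 2.0. The function names are hypothetical, and the sketch has no safeguards against overflow for very large negative logits.

```python
import math

def sigmoid(x):
    """Logistic function applied to a raw model output (logit)."""
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(logit, label, alpha=0.85, gamma=2.0):
    """Focal loss of one residue following equation (8)."""
    p = sigmoid(logit)
    if label == 1:
        # Confident, correct positives (p near 1) are down-weighted.
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    # Confident, correct negatives (p near 0) are down-weighted.
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

With α = 0.85, a misclassified interacting residue contributes more to the loss than a misclassified non-interacting one at the same confidence, which counteracts the class imbalance reported in Table 1.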

2.6. Evaluation metrics

Similar to previous studies, 9 evaluation metrics were used in our work: accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), Recall, F-measure (F1), Matthews’ Correlation Coefficient (MCC), area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC) to evaluate the predictive performance of the models. $ACC = \frac{T P + T N}{T P + F P + T N + F N}$ (9) $SEN = \frac{T P}{T P + F N}$ (10) $SPE = \frac{T N}{F P + T N}$ (11) $PRE = \frac{T P}{T P + F P}$ (12) $Recall = \frac{T P}{T P + F N}$ (13) $F 1 = 2 \times \frac{PRE \times Recall}{PRE + Recall}$ (14) $MCC = \frac{T P \times T N - F P \times F N}{\sqrt{(T P + F P) \times (T P + F N) \times (T N + F P) \times (T N + F N)}}$ (15)

These metrics were calculated using a threshold to convert predicted interaction probabilities into binary predictions. Therefore, it is also necessary to use threshold-independent metrics, such as AUROC and AUPRC, to reveal the overall performance of the model.
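The threshold-dependent metrics in equations (9) to (15) can all be derived from the four confusion-matrix counts, as in the following sketch. The function name is hypothetical, and degenerate cases with a zero denominator are not handled.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute ACC, SEN/Recall, SPE, PRE, F1 and MCC from raw counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn)   # sensitivity, identical to Recall
    spe = tn / (fp + tn)
    pre = tp / (tp + fp)
    f1 = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "SEN": sen, "SPE": spe, "PRE": pre,
            "Recall": sen, "F1": f1, "MCC": mcc}
```

Because equations (10) and (13) are the same expression, SEN and Recall always take the same value.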

3. RESULTS AND DISCUSSIONS

3.1. Comparison of methods for constructing local features

In this study, two methods for constructing local features of proteins are utilized: sliding windows based on protein sequences and spatial distance relationships based on protein structures. To determine the most appropriate sliding window length and spatial distance D_cutoff for constructing local protein features, various experimental combinations were established by adjusting these parameters. The sliding window sizes were set to 1, 3, 5, …, 19, resulting in a total of ten groups. For each set of experiments, the spatial distances D_cutoff were set to 4, 5, …, 15, forming a total of 12 groups. Additionally, comparative experiments were conducted to evaluate the local feature construction using only the sliding window method and only the spatial distance relationship method. The training set is Train_335, and the test set is Dset_60.

To maintain consistency and mitigate the influence of changes in parameter count on experimental outcomes, the model depicted in Figure 3 exclusively utilizes sliding windows to generate local features, constructing only Seq_Feature. To ensure consistency in parameter count, the Struct_Feature component of the model has been replaced with Seq_Feature. Similarly, this modification has been implemented in the model that relies exclusively on spatial distance relationships to construct local features.

Supplementary Table S1 shows the performance of predictive models under different D_cutoff values when the window size is set to 1. Supplementary Tables S4, S5, S6, S7, S8, S9, S10, S11 and S12 show the results for the other window sizes. Based on the data in these tables, we can infer that when the window size is 1 and the spatial distance is 7, the prediction model achieves optimal performance, excelling in all performance metrics. Supplementary Table S2 shows the experimental results of using only the sliding window to construct local features. Supplementary Table S3 shows the experimental results of using only the spatial distance relationship to construct local features.

For a prediction model that constructs local features using only a sliding window, performance reaches its peak when the window length is 15. For a prediction model that constructs local features using only spatial distance relationships, performance is best in threshold-independent metrics such as AUROC and AUPRC when the spatial distance D_cutoff is 14. In terms of the threshold-related comprehensive evaluation metrics, F1 and MCC decreased by 0.002 each compared to the prediction model with a spatial distance D_cutoff of 15, but the difference is not significant.

To investigate the impact of window length and spatial distance on model performance, the average model performance for all spatial distances at each window length is calculated and compared with the model that solely utilizes sliding window-constructed localized features. Similarly, the average model performance for all window lengths at each spatial distance is calculated and compared with the model that uses only spatial distance-constructed localized features. The results are shown in Figures 5 and 6.

FIG. 5.

Comparison of average performance under different window lengths (short dashed line) and performance using only sliding windows to construct local features (long dashed line).

FIG. 6.

Comparison of average performance under different spatial distances (short dashed line) and performance using only spatial distance to construct local features (long dashed line).

In Figures 5 and 6, the points on the short dashed line represent the mean of the model performances, and the black points denote the standard deviations. The long dashed lines depict the performance curves when using only a single type of local feature (either sliding window on the sequence or spatial distance). In Figure 5, the mean and standard deviation at a window length of 0 represent the model’s performance when local features are constructed solely based on spatial distance, without utilizing sliding windows. For the model that uses only sliding window-constructed local features, its performance peaks at a window length of 15. When the window lengths are 1 and 3, combining spatial distance-constructed local features significantly enhances the model’s predictive performance. However, for window lengths greater than 3, combining spatial distance-constructed local features leads to a decrease in predictive performance. In Figure 6, for the model that uses only spatial distance to construct local features, the optimal performance is achieved at a spatial distance of 14. At spatial distances of 4 and 5, incorporating local features constructed from the sequence sliding window slightly improves the model’s prediction performance. Conversely, for spatial distances greater than 5, combining local features from the sequence sliding window reduces the model’s prediction performance.

3.2. Feature ablation experiments

The PPI site prediction model in this study is based on both sequence and structural features of proteins. To demonstrate the relative importance of each adopted feature, we conducted eight groups of feature ablation experiments: using only the sequence feature group (Sequence), using only the structure feature group (Structure), removing only the protein PSSM feature group (-PSSM), removing only the protein HMM feature group (-HMM), removing only the protein raw amino acid sequence feature group (-Raw Protein), removing only the protein secondary structure characterization group (-DSSP), removing only the protein molecular surface concave–convex characterization group (-concave–convex), and using all characterization groups (SSPPI).

To better evaluate the performance of each group, we performed 5-fold cross-validation on Train_335, where the data were randomly split into five folds. Each time, a model was trained on four folds and evaluated on the remaining fold. This process was repeated five times, and the AUROC and AUPRC scores on the five folds were averaged as the overall validation performance.
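The splitting procedure can be sketched as follows. The snippet assumes that the split is made at the protein level and uses a fixed random seed for reproducibility; neither detail is stated in the text, and `k_fold_indices` is a hypothetical helper.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs over n_items."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal random folds
    splits = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        splits.append((sorted(train), sorted(fold)))
    return splits
```

Each item appears in exactly one validation fold, so averaging the per-fold AUROC and AUPRC uses every training protein once for evaluation.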

The results of the feature ablation experiments are shown in Table 4. The performance using the structural feature group is better compared to that using the sequence feature group, indicating that structural features such as secondary structure and molecular surface concave–convex features are more directly relevant to the identification of PPI sites. Interestingly, the performance without the protein secondary structure features is almost identical to the performance without the protein molecular surface concave–convex features. However, the performance using all feature groups is better than using either of these groups alone, suggesting that these two features have a potential complementary relationship in predicting PPI sites.

Table 4.
Results of Ablation Experiment

Feature group	AUROC	AUPRC
Sequence	0.675±0.009	0.264±0.014
Structure	0.694±0.014	0.277±0.025
-PSSM	0.746±0.013	0.333±0.031
-HMM	0.761±0.022	0.362±0.051
-Raw protein	0.766±0.018	0.361±0.041
-DSSP	0.737±0.014	0.324±0.034
-Concave–convex	0.740±0.019	0.332±0.036
SSPPI	0.760±0.013	0.357±0.040

AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve.

3.3. Ensemble models for multiple types of local features

For the predictive model that combines local features constructed using sliding windows and spatial distances, its prediction performance does not show a significant improvement compared to models using a single type of local feature construction method (either using only sliding windows or only spatial distances). The specific results of each type of feature model on Dset_60 were analyzed. Table 5 shows the best performance for each type of feature model: SSPPI, which combines local features constructed using sliding windows and spatial distances, with a sliding window length of 1 and a spatial distance of 7; Sequence, which uses only sliding windows to construct local features, with a sliding window length of 15; and Structure, which uses only spatial distances to construct local features, with a spatial distance of 14. Although SSPPI achieves the best performance in all metrics, it does not show a significant improvement compared to the other two feature models.

Table 5.
The Optimal Performance of Various Types of Feature Models

Feature	AUROC	AUPRC	ACC	PRE	Recall	F1	MCC
SSPPI	0.776	0.394	0.814	0.412	0.414	0.413	0.303
Sequence	0.763	0.374	0.809	0.395	0.395	0.395	0.282
Structure	0.776	0.387	0.811	0.402	0.401	0.401	0.289
SSPPI_Ensemble	0.801	0.432	0.801	0.403	0.533	0.459	0.345

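ACC, accuracy; AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; F1, F-measure; MCC, Matthews’ Correlation Coefficient; PRE, precision; SEN, sensitivity (Recall); SPE, specificity.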

The bold numbers in the table represent the best results for each metric from different feature models.
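For reference, the threshold-dependent metrics reported in the tables follow the standard confusion-matrix definitions. The sketch below uses those textbook formulas; the function name and the example counts are hypothetical and are not results from this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0   # Recall, also called SEN
    spe = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PRE": pre, "Recall": rec,
            "SPE": spe, "F1": f1, "MCC": mcc}


# Hypothetical counts for an imbalanced problem (10% interface residues).
for name, value in binary_metrics(tp=40, fp=60, tn=840, fn=60).items():
    print(f"{name}: {value:.3f}")
```

Because interface residues are a minority class, ACC can stay high while PRE, Recall and MCC remain modest, which is why the tables report all of them.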

Supplementary Table S13 shows the test results of the feature models on a subset of proteins in the test set. SSPPI does not achieve the best prediction results for all proteins. On Dset_60, the SSPPI feature model achieves the best prediction results for 20 proteins, the Sequence feature model for 16 proteins, and the Structure feature model for 24 proteins. For proteins where SSPPI performs best, such as 1e96_B and 1g14_A, the local features in the sequence and spatial domains complement each other, thereby enhancing the model’s performance. Conversely, for proteins where SSPPI performs worst, such as 1ggp_A, 1gla_G, and 1qa9_A, the interaction between the two types of features may negatively affect the model’s performance. Therefore, besides constructing models that combine sliding window and spatial distance features, it is also beneficial to combine the prediction results of models that use only sliding window features and models that use only spatial distance features.

Based on this, we propose an ensemble model built on multiple local feature models, as illustrated in Figure 7. This model combines the predictions from the different local feature models to improve predictive performance. The parameters of each local feature model are fixed and not updated during the training of the ensemble model. From each type of prediction model, the ensemble takes the output of the penultimate fully connected layer (the penultimate layer is 128 × 32; the last fully connected layer is 32 × 1 and outputs the prediction probability). These outputs are concatenated with the sequence length of the protein and used as input to an MLP.
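The construction of the ensemble input can be sketched as follows. This is a minimal illustration under stated assumptions: three frozen base models (SSPPI, Sequence and Structure) each contribute a 32-dimensional penultimate output, giving a 3 × 32 + 1 = 97-dimensional input; the hidden width, the random initialization, the single hidden layer and the unscaled sequence length are all assumptions, since the text does not specify the MLP configuration.

```python
import math
import random


def build_ensemble_input(penultimate_outputs, sequence_length):
    """Concatenate the 32-d penultimate outputs of the frozen base models
    with the protein's sequence length."""
    features = []
    for vec in penultimate_outputs:
        if len(vec) != 32:
            raise ValueError("expected a 32-dimensional penultimate output")
        features.extend(vec)
    features.append(float(sequence_length))
    return features


class TinyMLP:
    """One-hidden-layer MLP with a sigmoid output (sizes are assumptions)."""

    def __init__(self, n_in, n_hidden=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict_proba(self, x):
        # ReLU hidden layer followed by a sigmoid output unit.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


# Placeholder activations standing in for the three frozen base models.
base_outputs = [[0.1] * 32, [0.2] * 32, [0.3] * 32]
x = build_ensemble_input(base_outputs, sequence_length=150)
print(len(x), TinyMLP(n_in=len(x)).predict_proba(x))
```

Only the MLP weights would be trained in this setup; the base models serve as fixed feature extractors, which matches the statement that their parameters are not updated.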

FIG. 7.

An ensemble model based on multiple local features.

The training parameters and environment of the ensemble model are consistent with those described in Figure 3, and the final results of this model are reported in Table 5. Compared with SSPPI, the ensemble model improves AUROC, AUPRC, Recall, F1 and MCC from 0.776, 0.394, 0.414, 0.413 and 0.303 to 0.801, 0.432, 0.533, 0.459 and 0.345, corresponding to relative improvements of 3.222%, 9.645%, 28.744%, 11.138% and 13.861%, respectively. ACC and PRE decrease slightly, from 0.814 and 0.412 to 0.801 and 0.403.
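The percentage improvements quoted above are relative changes with respect to the SSPPI baseline. A short sketch of the arithmetic, using the values from Table 5 (the function and variable names are illustrative):

```python
def relative_improvement(baseline, improved):
    """Relative change from `baseline` to `improved`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Values taken from Table 5.
ssppi = {"AUROC": 0.776, "AUPRC": 0.394, "Recall": 0.414,
         "F1": 0.413, "MCC": 0.303}
ensemble = {"AUROC": 0.801, "AUPRC": 0.432, "Recall": 0.533,
            "F1": 0.459, "MCC": 0.345}

for metric in ssppi:
    gain = relative_improvement(ssppi[metric], ensemble[metric])
    print(f"{metric}: {gain:.3f}%")
```

The same calculation underlies the comparisons with other predictors reported in Section 3.4.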

Supplementary Figures S1 and S2 show the AUROC and AUPRC results for each type of feature model and the ensemble model, with the ensemble model exhibiting a significantly improved performance compared to the other three models.

3.4. Comparison with other methods

We compared SSPPI_Ensemble with five sequence-based predictors, PSIVER (Murakami and Mizuguchi, 2010), ProNA2020 (Qiu et al., 2020), SCRIBER (Zhang and Kurgan, 2019), DLPred (Zhang et al., 2019), and DELPHI (Li et al., 2021), and four structure-based predictors, DeepPPISP (Zeng et al., 2020), SPPIDER (Porollo and Meller, 2007), MaSIF-site (Gainza et al., 2020), and GraphPPIS (Yuan et al., 2021). Note that our test sets may overlap with the training sets of other methods. If so, the results reported here for those methods would be upper limits.

As shown in Table 6, there is a large performance gap between SSPPI_Ensemble and the five sequence-based methods. The most likely reason is the lack of structural features in these sequence-based methods. As discussed in Section 3.3, the prediction model requires different features for different proteins. For example, by combining different features, SSPPI_Ensemble improves over DELPHI by 14.592%, 35.423%, 23.387%, and 53.333% in AUROC, AUPRC, F1 and MCC, respectively.

Among the four structure-based predictors, we pay particular attention to MaSIF-site, which is based on point cloud geometric neural networks, and GraphPPIS, which is based on graph convolutional neural networks. These two predictors can automatically calculate protein spatial features, so the comparison with them better demonstrates the effectiveness of the structural features proposed in our study. The difference between our model and MaSIF-site lies in whether local features are constructed. MaSIF-site constructs geometric and chemical features and takes the entire protein as input, whereas our model constructs local features based on protein sequence and spatial distance. SSPPI_Ensemble outperforms MaSIF-site in ACC, PRE, F1, MCC and AUROC, but is slightly worse in AUPRC and Recall. This indicates that, through the local feature construction method proposed in this study, the predictors may learn the global features of proteins. GraphPPIS constructs PSSM, HMM and DSSP features and uses the adjacency matrix to choose the input amino acid residues. SSPPI_Ensemble outperforms GraphPPIS in ACC, PRE, AUROC, AUPRC, F1, and MCC, although its Recall is lower (0.533 versus 0.584). This indicates that the molecular surface concave–convex features proposed in this study as a structural feature can effectively improve the performance of predictors.

Table 6.
The Optimal Performance of Various Types of Feature Models on Dset_60

Method	ACC	PRE	Recall	F1	MCC	AUROC	AUPRC
PSIVER	0.561	0.188	0.534	0.278	0.074	0.573	0.190
ProNA2020	0.738	0.275	0.402	0.326	0.176	N/A	N/A
SCRIBER*	0.667	0.253	0.568	0.350	0.193	0.665	0.278
DLPred*	0.682	0.264	0.565	0.360	0.208	0.677	0.294
DELPHI*	0.697	0.276	0.568	0.372	0.225	0.699	0.319
DeepPPISP*	0.657	0.243	0.539	0.335	0.167	0.653	0.276
SPPIDER	0.752	0.331	0.557	0.415	0.285	0.755	0.373
MaSIF-site*	0.780	0.370	0.561	0.446	0.326	0.775	0.439
GraphPPIS*	0.776	0.368	0.584	0.451	0.333	0.786	0.429
SSPPI_Ensemble	0.801	0.403	0.533	0.459	0.345	0.801	0.432

Predictions by the programs marked with * were cited from Yuan et al. (2021). Predictions by PSIVER, ProNA2020 and SPPIDER were generated directly from their web servers. ProNA2020 makes only binary predictions; thus, its AUROC and AUPRC are not calculated. The performance of SSPPI_Ensemble is based on Train_335. The bold numbers in the table represent the best results for each metric from different models on Dset_60.

FIG. 8.

ROC and PR curves for the test sets: (a) and (b) Dset_60, (c) and (d) Dset_70, (e) and (f) Dset_315. PR, precision-recall; ROC, receiver operating characteristic.

To further validate the performance of SSPPI_Ensemble, we applied the model to our newly constructed datasets, which include updated protein sequences and interactions. As shown in Table 7 and Figure 8, SSPPI_Ensemble is competitive with other methods across the three test datasets. Although it performs slightly worse than Seq-InSite (Hosseini et al., 2024) in several metrics, it shows clear advantages in others. On the Dset_70 dataset, SSPPI_Ensemble achieves the best performance in three key metrics: SPE (specificity), PRE (precision) and ACC (accuracy), with values of 0.921, 0.463, and 0.789, respectively, which surpass Seq-InSite’s values of 0.864, 0.447, and 0.781.

Table 7.

The Optimal Performance of Various Types of Feature Models on Different Test Datasets

Dataset	Method	SEN	SPE	PRE	ACC	F1	MCC	AUROC	AUPRC
Dset_60	Seq-InSite*	0.448	0.897	0.448	0.826	0.448	0.345	0.798	0.430
	PITHIA	0.317	0.872	0.317	0.784	0.317	0.189	0.708	0.288
	RGN*	0.443	0.896	0.443	0.824	0.443	0.338	0.783	0.427
	SSPPI_Ensemble	0.425	0.893	0.427	0.819	0.426	0.319	0.783	0.425
Dset_70	Seq-InSite*	0.447	0.864	0.447	0.781	0.447	0.311	0.766	0.440
	PITHIA	0.369	0.844	0.369	0.750	0.367	0.213	0.700	0.369
	SSPPI_Ensemble	0.269	0.921	0.463	0.789	0.341	0.237	0.733	0.412
Dset_315	Seq-InSite*	0.398	0.899	0.398	0.827	0.398	0.297	0.782	0.380
	PITHIA	0.301	0.883	0.301	0.800	0.301	0.184	0.700	0.268
	RGN*	0.302	0.883	0.302	0.800	0.302	0.185	0.674	0.267
	SSPPI_Ensemble	0.395	0.898	0.393	0.827	0.394	0.292	0.779	0.388

Predictions by the programs marked with * were cited from Hosseini et al. (2024). The performance of SSPPI_Ensemble is based on Train_1326. The bold numbers in the table represent the best results for each metric from different models on the three test datasets.

On the Dset_315 dataset, SSPPI_Ensemble achieves the highest AUPRC score of 0.388, outperforming Seq-InSite, PITHIA (Hosseini and Ilie, 2022), and RGN (Wang et al., 2022) by 2.105%, 44.776%, and 45.318%, respectively. For other metrics, our method is nearly equivalent to Seq-InSite, with SEN values of 0.395 versus 0.398, SPE values of 0.898 versus 0.899 and ACC values of 0.827 versus 0.827. Furthermore, SSPPI_Ensemble outperforms PITHIA and RGN across all metrics on this dataset. For instance, SSPPI_Ensemble’s MCC value of 0.292 is notably higher than PITHIA’s 0.184 and RGN’s 0.185, further confirming the effectiveness of our approach.

These results highlight the power of the proposed structural three-dimensional point cloud concave–convex feature in capturing PPI site information. The feature is particularly effective in revealing surface shape variations and identifying potential binding sites. By integrating sequence context information and spatial distance relationships between amino acids, our ensemble model enhances the prediction performance of PPI sites.

4. CONCLUSION

Predicting PPI sites is important for understanding the function of proteins. In this study, we combined sequence features (PSSM, HMM and raw protein sequence) and structure features (DSSP and concave–convex) to predict PPI sites. Most importantly, we proposed a novel method to calculate the geometrical features of amino acids based on the concave–convex properties of the protein surface. We measured the degree of concavity or convexity by moving the tangent plane at a point of an amino acid and examining the rate of change of the area of the polygon formed by the intersections of the plane and the normal vectors on the triangular meshes of the point. Combining these two types of features and using the Transformer-encoder, mini-Resnet, and suitable ensemble methods, we proposed a novel prediction model, SSPPI_Ensemble. Compared with state-of-the-art methods, we obtained better prediction results or results with clear advantages, to which the structure features make a significant contribution. From the ablation experiments, we found that the concave–convex feature contributes about as much to the performance as DSSP. Prediction of PPI sites is still a challenging task. It is necessary to introduce more features (such as co-evolution) or create novel features to further improve the prediction performance.

Footnotes

ACKNOWLEDGMENTS

The authors thank Professor David R. Westhead (Leeds University, UK) for his valuable comments and useful suggestions on the revision of this article and Professor Fuyi Li (Northwest A&F University, China) for making some corrections to the grammar of this article.

AUTHORS’ CONTRIBUTIONS

L.L. and J.G. performed conceptualization, data curation, software development, formal analysis, and writing. H.D. conducted part of the investigation, and S.C. contributed to the writing. J.Y. and L.H. provided conceptualization, methodology, writing, and supervision.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was supported by a grant from the Modern Agriculture and Rural Revitalization Bureau, Yangling Demonstration Zone (TG20250009).

SUPPLEMENTARY MATERIAL

References

Altschul, Madden, Schaffer, et al. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res, 1997; 25(17):3389–3402; doi: 10.1093/nar/25.17.3389

Athanasios, Charalampos, Vasileios, et al. Protein-protein interaction (PPI) network: Recent advances in drug discovery. Curr Drug Metab, 2017; 18(1):5–10; doi: 10.2174/138920021801170119204832

Besl, Jain. Segmentation through variable-order surface fitting. IEEE Trans Pattern Anal Machine Intell, 1988; 10(2):167–192; doi: 10.1109/34.3881

Chen, Ju, Zhou, et al. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics, 2019; 35(14):i305–i314; doi: 10.1093/bioinformatics/btz328

Dai, Bailey-Kellogg. Protein interaction interface region prediction by geometric deep learning. Bioinformatics, 2021; 37(17):2580–2588; doi: 10.1093/bioinformatics/btab154

Dhole, Singh, Pai, et al. Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier. J Theor Biol, 2014; 348:47–54; doi: 10.1016/j.jtbi.2014.01.028

Ding, Li, Han, et al. MEG-PPIS: A fast protein-protein interaction site prediction method based on multi-scale graph information and equivariant graph neural network. Bioinformatics, 2024; 40(5); doi: 10.1093/bioinformatics/btae269

Gainza, Sverrisson, Monti, et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods, 2020; 17(2):184–192; doi: 10.1038/s41592-019-0666-6

Higurashi, Ishida, Kinoshita. PiSite: A database of protein interaction sites using multiple binding states in the PDB. Nucleic Acids Res, 2009; 37(Database issue):D360–D364; doi: 10.1093/nar/gkn659

Hosseini, Ilie. PITHIA: Protein interaction site prediction using multiple sequence alignments and attention. Int J Mol Sci, 2022; 23(21):12814; doi: 10.3390/ijms232112814

Hosseini, Golding, Ilie. Seq-InSite: Sequence supersedes structure for protein interaction site prediction. Bioinformatics, 2024; 40(1); doi: 10.1093/bioinformatics/btad738

Kabsch, Sander. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983; 22(12):2577–2637; doi: 10.1002/bip.360221211

Li, Golding, Ilie. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics, 2021; 37(7):896–904; doi: 10.1093/bioinformatics/btaa750

Lin, Goyal, Girshick, et al. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell, 2020; 42(2):318–327; doi: 10.1109/TPAMI.2018.2858826

Liu, Gao, Ren, et al. Protein-protein interaction and site prediction using transfer learning. Brief Bioinform, 2023; 24(6); doi: 10.1093/bib/bbad376

Manfredi, Savojardo, Martelli, et al. ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences. J Mol Biol, 2023; 435(14):167963; doi: 10.1016/j.jmb.2023.167963

Mihel, Sikic, Tomic, et al. PSAIA - protein structure and interaction analyzer. BMC Struct Biol, 2008; 8:21; doi: 10.1186/1472-6807-8-21

Mirdita, von den Driesch, Galiez, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res, 2017; 45(D1):D170–D176; doi: 10.1093/nar/gkw1081

Murakami, Mizuguchi. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics, 2010; 26(15):1841–1848; doi: 10.1093/bioinformatics/btq302

Northey, Baresic, Martin ACR. IntPred: A structure-based predictor of protein-protein interaction sites. Bioinformatics, 2018; 34(2):223–229; doi: 10.1093/bioinformatics/btx585

Porollo, Meller. Prediction-based fingerprints of protein-protein interactions. Proteins, 2007; 66(3):630–645; doi: 10.1002/prot.21248

Qiu, Bernhofer, Heinzinger, et al. ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence. J Mol Biol, 2020; 432(7):2428–2443; doi: 10.1016/j.jmb.2020.02.026

Remmert, Biegert, Hauser, et al. HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods, 2012; 9(2):173–175; doi: 10.1038/nmeth.1818

Sanner, Olson, Spehner. Reduced surface: An efficient way to compute molecular surfaces. Biopolymers, 1996; 38(3):305–320; doi: 10.1002/(SICI)1097-0282(199603)38:3%3C305::AID-BIP4%3E3.0.CO;2-Y

Wang, Wang, Wu, et al. Analysis on multi-domain cooperation for predicting protein-protein interactions. BMC Bioinformatics, 2007; 8:391; doi: 10.1186/1471-2105-8-391

Wang, Chen, Han, et al. RGN: Residue-based graph attention and convolutional network for protein–protein interaction site prediction. J Chem Inf Model, 2022; 62(23):5961–5974; doi: 10.1021/acs.jcim.2c01092

Wang, Yu, Ma, et al. Protein-protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics, 2019; 35(14):2395–2402; doi: 10.1093/bioinformatics/bty995

Word, Lovell, Richardson, et al. Asparagine and glutamine: Using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol, 1999; 285(4):1735–1747; doi: 10.1006/jmbi.1998.2401

Yang, Fan, Song, et al. Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC Bioinformatics, 2020; 21(1):323; doi: 10.1186/s12859-020-03646-8

Yu, Chen, Zhou, et al. GTB-PPI: Predict protein-protein interactions based on L1-regularized logistic regression and gradient tree boosting. Genomics Proteomics Bioinformatics, 2020; 18(5):582–592; doi: 10.1016/j.gpb.2021.01.001

Yuan, Chen, Zhao, et al. Structure-aware protein-protein interaction site prediction using deep graph convolutional network. Bioinformatics, 2021; 38(1):125–132; doi: 10.1093/bioinformatics/btab643

Zeng, Zhang, Wu, et al. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics, 2020; 36(4):1114–1120; doi: 10.1093/bioinformatics/btz699

Zhang, Li, Quan, et al. Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network. Neurocomputing, 2019; 357:86–100; doi: 10.1016/j.neucom.2019.05.013

Zhang, Kurgan. SCRIBER: Accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics, 2019; 35(14):i343–i353; doi: 10.1093/bioinformatics/btz324

Zhou. PyMesh - geometry processing library for python. 2018. Available from: https://github.com/PyMesh/PyMesh [Last accessed: March 9, 2023].
