Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins

Abstract

Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. In the current study, we propose a novel machine-learning model to identify PVPs. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we applied PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, the chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. This indicates that the proposed computational model may become a powerful predictor for identifying PVPs.

1. INTRODUCTION

Phages are a group of viruses that can infect and replicate within bacteria. A recent study reported that phages may become an appropriate alternative to traditional drugs due to their safety and effectiveness in living organisms (Lyon, 2017). Phage proteins are an important target for pharmaceutical research in the discovery of new drugs (Sorokulova et al., 2014).

Currently, mass spectrometry is a common method used to identify phage virion proteins (PVPs) (Lavigne et al., 2009). However, this experiment-based method is expensive, time-consuming, and labor-intensive. Therefore, with the fast proliferation of protein sequences in the post-genome age, there is an urgent demand to exploit automated approaches for fast and reliable identification of PVPs.

In recent years, much effort has been devoted to predicting PVPs using machine-learning models. In 2013, Feng et al. (2013) proposed the first computational model to identify PVPs. The model employed amino acid composition (AAC) and dipeptide composition (DC) to represent protein sequences. A filter approach, correlation-based feature selection combined with a best-first search strategy, was applied to reduce the influence of redundant and noisy information. Naïve Bayes was used as the classifier to discriminate PVPs from phage non-virion proteins (non-PVPs).

This model achieved a classification accuracy of 79.15% based on the jackknife test. One year later, the same group proposed a new method to improve the classification performance. In the new model, sequences were first encoded by g-gap DC, and the analysis of variance (ANOVA) with incremental feature selection was employed to find the optimal feature subset.

Finally, a support vector machine (SVM) was used to distinguish PVPs from non-PVPs. This predictor achieved a maximum accuracy of 85.02% in the jackknife test (Ding et al., 2014). Manavalan et al. (2018) constructed another predictor called PVP-SVM to explore this issue. In this predictor, five different feature descriptors were used to represent protein sequences, and a feature selection strategy was employed to select the optimal features.

Finally, SVM was used to predict PVPs. This predictor obtained classification accuracies of 87.9% and 79.8% on training and independent datasets, respectively. To further improve the classification performance, Arif and co-workers developed Pred-BVP-Unb to predict PVPs (Arif et al., 2020). In their predictor, three features, including composition and translation, split AAC, and bi-profile position specific scoring matrix, were used to describe protein sequences.

A synthetic minority oversampling technique was adopted to address the imbalanced data issue, and a feature selection approach was used to choose the optimal features. The selected features were provided to an SVM classifier to identify PVPs. This predictor achieved classification accuracies of 92.54% and 83.06% on the benchmark and independent datasets, respectively.

Charoenkwan and colleagues developed a sequence-based predictor named Meta-iPVP to investigate this problem (Charoenkwan et al., 2020c). First, seven different feature encodings were employed to denote protein sequences. The GA-SAR approach coupled to the SVM model was used to rank the most effective features. These effective features were input into a classifier to distinguish PVPs from non-PVPs. The predictor achieved classification accuracies of 84.6% and 81.7% on the training and independent datasets, respectively.

Recently, Ahmad et al. (2022) proposed a new computational model called SCORPION to identify PVPs. The model employed 13 different kinds of features to encode sequences. A two-step feature selection approach was adopted to select the optimal features, and promising classification results were obtained by this predictor. In addition, some other related studies were reported in this field (Charoenkwan et al., 2020b; Meng et al., 2020; Pan et al., 2018; Zhang et al., 2015).

Although the aforementioned studies have advanced our understanding of PVPs, there is still great room for improvement in classification performance. Therefore, in this work, we propose a novel classification model to distinguish PVPs from non-PVPs. First, 50 different kinds of physicochemical (PC) properties were employed to encode protein sequences.

Two different approaches, that is, Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC) (Reshef et al., 2011), were utilized to capture important interaction information between different PC properties. To capture more complex correlation information, PCC and MIC were used once again. The least absolute shrinkage and selection operator (LASSO) algorithm (Liu et al., 2009) was then adopted to remove irrelevant and redundant features. Finally, the selected features were fed into an SVM to discriminate PVPs from non-PVPs. Figure 1 presents the flowchart of the proposed method.

FIG. 1.

The flowchart of the proposed classification model.

2. METHODS AND MATERIALS

2.1. Datasets

An effective dataset is crucial for the success of the experimental process. In the current work, we used the same datasets as in prior studies (Ahmad et al., 2022; Charoenkwan et al., 2020c) to validate the proposed model. These datasets include two different sets, namely a training dataset and an independent dataset, which can be formulated as follows: $\begin{cases} S_{train} = S_{train}^{+} \cup S_{train}^{-} \\ S_{ind} = S_{ind}^{+} \cup S_{ind}^{-} \end{cases}$ (1)

where $S_{train}$ is the training dataset, which contains 500 samples, of which 250 are positive samples ($S_{train}^{+}$) and the other 250 are negative samples ($S_{train}^{-}$). The independent dataset ($S_{ind}$) contains 126 samples, of which 63 are positive ($S_{ind}^{+}$) and 63 are negative ($S_{ind}^{-}$). Within each group, the similarity between any two sequences is no more than 40%, as determined with the CD-HIT software (Huang et al., 2010).

2.2. Feature representation

PC properties are effective feature descriptors that have been widely used in various studies (Xiao et al., 2015; Xiao et al., 2012; Zou et al., 2022a; Zuo et al., 2020). Therefore, in this study, we adopted the same 50 types of PC properties as in previous studies (Zou and Yin, 2021; Zou et al., 2022c) to encode protein sequences. Accordingly, we can use the following formula to represent a protein sequence: $PC = \begin{bmatrix} PC_{1}^{1} & PC_{2}^{1} & \dots & PC_{50}^{1} \\ PC_{1}^{2} & PC_{2}^{2} & \dots & PC_{50}^{2} \\ \vdots & \vdots & \ddots & \vdots \\ PC_{1}^{L} & PC_{2}^{L} & \dots & PC_{50}^{L} \end{bmatrix}$ (2)

where $PC_{i}^{k}$ represents the $i$-th PC property of the $k$-th amino acid residue in the sequence, and $L$ is the length of the sequence. To obtain useful information from the PC matrix, the PCC and MIC approaches were utilized.
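The encoding in Eq. (2) can be sketched as a simple table lookup. The following is a minimal illustration only: the 50 property values are not listed in this article, so `PROPERTY_TABLE` is filled with random placeholder numbers, and the names `encode_sequence`, `PROPERTY_TABLE`, and `AA_INDEX`, as well as the choice to skip non-standard residues, are assumptions of this sketch.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder table (20 amino acids x 50 properties). The real values would come
# from the 50 physicochemical indices used in the paper; random numbers are used
# here only to keep the sketch runnable.
rng = np.random.default_rng(0)
PROPERTY_TABLE = rng.normal(size=(len(AMINO_ACIDS), 50))
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Return the L x 50 PC matrix of Eq. (2) for a protein sequence.

    Residues outside the 20 standard amino acids are skipped (an assumption).
    """
    rows = [AA_INDEX[aa] for aa in sequence.upper() if aa in AA_INDEX]
    return PROPERTY_TABLE[rows]


pc = encode_sequence("MKTAYIAKQR")
print(pc.shape)  # one row per residue, one column per PC property
```

Each row of the returned matrix corresponds to one residue, so identical residues at different positions share identical rows.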

By using the PCC approach, a protein can be described as $PCC = \begin{bmatrix} p_{1,1} & p_{1,2} & \dots & p_{1,50} \\ p_{2,1} & p_{2,2} & \dots & p_{2,50} \\ \vdots & \vdots & \ddots & \vdots \\ p_{50,1} & p_{50,2} & \dots & p_{50,50} \end{bmatrix}$ (3)

where $p_{i,j} = \frac{\mathrm{cov}(PC_{i}, PC_{j})}{\sigma_{PC_{i}} \sigma_{PC_{j}}}$, $\mathrm{cov}(PC_{i}, PC_{j})$ denotes the covariance between the $i$-th and the $j$-th type of PC property, and $\sigma_{PC_{i}}$ and $\sigma_{PC_{j}}$ are the standard deviations of the $i$-th and the $j$-th PC property, respectively.
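As an illustration of Eq. (3), the sketch below computes the 50 × 50 PCC matrix from an L × 50 PC matrix with NumPy. The function name `pcc_matrix`, the random stand-in data, and the decision to replace undefined correlations (zero-variance properties) with 0 are assumptions of this sketch rather than details given in the article.

```python
import numpy as np


def pcc_matrix(pc):
    """50 x 50 Pearson correlation matrix between the columns of an L x 50 PC matrix."""
    pc = np.asarray(pc, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(pc, rowvar=False)  # columns are the variables
    # A property that is constant along the sequence has zero variance and yields
    # NaN; mapping those entries to 0 is an assumption of this sketch.
    return np.nan_to_num(corr, nan=0.0)


rng = np.random.default_rng(1)
pc = rng.normal(size=(120, 50))  # stand-in for a real PC matrix with L = 120
pcc = pcc_matrix(pc)
print(pcc.shape)
```

Because the matrix is symmetric with a unit diagonal, only the 1225 entries above the diagonal carry distinct information.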

In addition, another descriptor called MIC, a non-linear correlation analysis method, was also introduced to capture the interaction between different PC properties. In the following, we provide a detailed description of this method. Given two variables A and B, a grid G (with a rows and b columns) can be drawn on the scatterplot of these two variables. The grid G partitions the data so that the relationship between the two variables can be captured using mutual information (MI). The MI can be calculated using the following formula (Zou and Yang, 2020): $MI(A, B) = H(A) - H(A \mid B) = H(B) - H(B \mid A) = H(A) + H(B) - H(A, B)$ (4)

where $H(A)$ and $H(B)$ are the entropies of A and B, $H(A \mid B)$ and $H(B \mid A)$ are the conditional entropies, and $H(A, B)$ is the joint entropy of A and B. Suppose $I_{g}$ is the MI of the probability distribution partitioned by G. For fair comparison between grids of different resolutions, a normalization operation was conducted on $I_{g}$, and the value of $I_{g}$ was normalized to [0, 1]. The entry $m_{a \times b}$ of the characteristic matrix was calculated by: $m_{a \times b} = \frac{\max_{g \in G_{a \times b}} I_{g}}{\log \min(a, b)}$ (5)

MIC is the maximal value of $m_{a \times b}$ over all ordered pairs (a, b). In this work, we computed $m_{a \times b}$ over all (a, b) with $ab < K$, where $K = n^{0.6}$ and n is the length of the variable (i.e., the number of data points). Thus, the MIC value between A and B is defined as: $MIC_{AB} = \max_{ab < K} m_{a \times b}$ (6)
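To make Eqs. (4) to (6) concrete, the following sketch evaluates a simplified version of MIC. It is not the full algorithm: instead of maximizing the MI over all a-by-b grids, it uses a single equal-frequency grid for each (a, b), so each normalized value is only a lower bound on $m_{a \times b}$. The function names and this simplification are assumptions of the sketch; a dedicated MIC implementation should be preferred in practice.

```python
import numpy as np


def equal_frequency_bins(values, n_bins):
    """Assign each point to one of n_bins bins holding (almost) equal numbers of points."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return ranks * n_bins // len(values)


def grid_mutual_information(x, y, a, b):
    """MI (natural log) of the distribution induced by an a-by-b equal-frequency grid."""
    bx, by = equal_frequency_bins(x, a), equal_frequency_bins(y, b)
    joint = np.zeros((a, b))
    np.add.at(joint, (bx, by), 1.0)  # count the points in each grid cell
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))


def approx_mic(x, y, exponent=0.6):
    """Max over grids with a*b < n**exponent of MI / log(min(a, b)), as in Eqs. (5)-(6)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    k = len(x) ** exponent
    best = 0.0  # stays 0 if no grid satisfies a*b < K (very short inputs)
    for a in range(2, int(k) + 1):
        for b in range(2, int(k) + 1):
            if a * b < k:
                m_ab = grid_mutual_information(x, y, a, b) / np.log(min(a, b))
                best = max(best, m_ab)
    return best


x = np.linspace(0.0, 1.0, 100)
print(approx_mic(x, x ** 3))  # 1.0 for any strictly increasing relationship
```

Note that the binning relies on ranks, so tied values (common when a PC property takes only 20 distinct values along a sequence) are split between bins in input order; this is another simplification of the sketch.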

If A and B represent two different PC properties, the MIC value between them is computed as the non-linear connection of the corresponding PC properties. Thus, the non-linear connection (i.e., MIC) among 50 PC properties can be represented by: $MIC = \begin{bmatrix} c_{1,1} & c_{1,2} & \dots & c_{1,50} \\ c_{2,1} & c_{2,2} & \dots & c_{2,50} \\ \vdots & \vdots & \ddots & \vdots \\ c_{50,1} & c_{50,2} & \dots & c_{50,50} \end{bmatrix}$ (7)

where $c_{i,j}$ $(i = 1, 2, \dots, 50; j = 1, 2, \dots, 50)$ represents the non-linear connectivity between the $i$-th and the $j$-th PC property, which can be obtained by using Eqs. (4–6).

MIC is an effective approach to measure the correlations between different PC properties; however, this type of correlation can only capture the interaction between two different PC properties. It is unknown whether connection among more than two PC properties may benefit the classification task. To further gain high-order connection information, we use Eqs. (3–6) once again. The high-order PCC (HOPCC) and high-order MIC (HOMIC) based on PCC and MIC among all 50 PC properties are denoted as: $HOPCC = \begin{bmatrix} hp_{1,1} & hp_{1,2} & \dots & hp_{1,50} \\ hp_{2,1} & hp_{2,2} & \dots & hp_{2,50} \\ \vdots & \vdots & \ddots & \vdots \\ hp_{50,1} & hp_{50,2} & \dots & hp_{50,50} \end{bmatrix}$ (8) $HOMIC = \begin{bmatrix} hm_{1,1} & hm_{1,2} & \dots & hm_{1,50} \\ hm_{2,1} & hm_{2,2} & \dots & hm_{2,50} \\ \vdots & \vdots & \ddots & \vdots \\ hm_{50,1} & hm_{50,2} & \dots & hm_{50,50} \end{bmatrix}$ (9)

where $hp_{i,j}$ $(i = 1, 2, \dots, 50; j = 1, 2, \dots, 50)$ is the PCC value between the $i$-th and the $j$-th columns of the connection matrix shown in Eq. (3); $hm_{i,j}$ $(i = 1, 2, \dots, 50; j = 1, 2, \dots, 50)$ is defined analogously, as the MIC value between the $i$-th and the $j$-th columns of the connection matrix shown in Eq. (7). Figure 2 provides an example to illustrate how HOMIC is constructed from the PC properties.

FIG. 2.

An example to illustrate how to obtain HOMIC from PC matrix. HOMIC, high-order MIC; PC, physicochemical.
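The high-order step can be sketched for the PCC case as follows: the columns of the low-order 50 × 50 matrix are treated as new variables and correlated with one another. The function name `high_order_pcc` and the random stand-in data are assumptions of this sketch; HOMIC would follow the same pattern with a MIC routine in place of `np.corrcoef`.

```python
import numpy as np


def high_order_pcc(pcc):
    """HOPCC of Eq. (8): Pearson correlation between the columns of the PCC matrix."""
    return np.corrcoef(np.asarray(pcc, dtype=float), rowvar=False)


rng = np.random.default_rng(2)
pc = rng.normal(size=(150, 50))      # stand-in PC matrix (L = 150)
pcc = np.corrcoef(pc, rowvar=False)  # low-order correlation, Eq. (3)
hopcc = high_order_pcc(pcc)          # high-order correlation, Eq. (8)
print(hopcc.shape)
```

Two properties obtain a high value in HOPCC when their correlation profiles with all 50 properties are similar, even if their direct correlation is weak, which is the sense in which the second pass captures interactions among more than two properties.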

2.3. Feature extraction and feature selection

We collected the correlation information between different PC properties as discriminative features to distinguish PVPs from non-PVPs. However, many of these features may be irrelevant or redundant, which can lead to over-fitting. Therefore, a feature selection strategy was adopted in this work. The LASSO algorithm was employed to select the most discriminative features from the feature pool.
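One possible way to turn the four 50 × 50 matrices into a single feature vector is sketched below. The article does not state how the matrices are vectorized, so taking the strict upper triangle of each matrix (1225 values) and concatenating the four blocks in the order PCC, MIC, HOPCC, HOMIC is purely an assumption of this sketch, as are the function names.

```python
import numpy as np


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a symmetric n x n matrix."""
    m = np.asarray(matrix, dtype=float)
    rows, cols = np.triu_indices(m.shape[0], k=1)
    return m[rows, cols]


def build_feature_vector(pcc, mic, hopcc, homic):
    """Concatenate the four descriptor blocks into one vector (assumed layout)."""
    return np.concatenate([upper_triangle(x) for x in (pcc, mic, hopcc, homic)])


demo = np.eye(50)  # placeholder matrices, used only to show the resulting size
features = build_feature_vector(demo, demo, demo, demo)
print(features.shape)
```

Under this assumed layout each protein is described by 4 × 1225 = 4900 candidate features, far more than the 500 training samples, which is why a sparsity-inducing selection step is useful.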

Assume that $F \in R^{m \times n}$ denotes the feature matrix, which means there are m samples included in the dataset, and each sample is represented by an n-dimensional vector. The label set is $l = [l_{1}, l_{2}, \dots, l_{m}]^{T} \in R^{m \times 1}$, where $l_{i}$ is the label of the $i$-th sample. To select the most discriminative features, we can solve the following problem (Liu et al., 2009): $\min_{x} \frac{1}{2} \| F x - l \|_{2}^{2} + \lambda \| x \|_{1}$ (10)

where $x \in R^{n \times 1}$; if the $i$-th element of x is non-zero, the $i$-th column of F (i.e., the $i$-th feature) is selected. $\lambda$ controls the sparsity level: the larger the value of $\lambda$, the fewer features are retained. In this work, the value of $\lambda$ is 0.1.
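The selection rule can be illustrated with a small coordinate-descent solver. This sketch assumes the standard LASSO objective $\frac{1}{2} \| F x - l \|_{2}^{2} + \lambda \| x \|_{1}$ and is not the SLEP implementation cited above; the function names are hypothetical. Other solvers may scale the data-fit term differently, so a value such as $\lambda = 0.1$ is not directly transferable between implementations.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_coordinate_descent(F, l, lam, n_iter=200):
    """Minimize 0.5 * ||F x - l||^2 + lam * ||x||_1 by cyclic coordinate descent."""
    F, l = np.asarray(F, dtype=float), np.asarray(l, dtype=float)
    n = F.shape[1]
    x = np.zeros(n)
    col_sq = (F ** 2).sum(axis=0)
    residual = l - F @ x
    for _ in range(n_iter):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue  # an all-zero feature can never be selected
            # correlation of column j with the residual that excludes feature j
            rho = F[:, j] @ residual + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            residual += F[:, j] * (x[j] - new_xj)
            x[j] = new_xj
    return x


def selected_features(x, tol=1e-12):
    """Indices of the columns of F whose coefficient is non-zero."""
    return np.flatnonzero(np.abs(x) > tol)


# Toy example: with an identity design the solution is the soft-thresholded label vector.
x = lasso_coordinate_descent(np.eye(3), [3.0, 0.05, -2.0], lam=0.1)
print(x, selected_features(x))
```

In the toy example the second coefficient is shrunk exactly to zero, so only the first and third features are kept; increasing `lam` would remove more features.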

2.4. Classifier

The SVM is a useful machine-learning algorithm (Cortes and Vapnik, 1995), and it has been successfully used in various fields (Chen et al., 2020; Xiao et al., 2019; Chen et al., 2016; Zou and Yang, 2019). Therefore, in this study, an SVM with a radial basis function kernel was adopted as the classifier to discriminate PVPs from non-PVPs. A grid search strategy was adopted to find the best combination of the regularization parameter C and kernel parameter γ. The search space for both parameters is [2⁻¹⁵, 2¹⁵], with step sizes of 2² and 2⁻², respectively.
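The search grid described above can be written out explicitly. The sketch below assumes that the step sizes are multiplicative, i.e., the exponent of C increases by 2 from −15 to 15 and the exponent of γ decreases by 2 from 15 to −15; this reading, the function name `grid_search`, and the user-supplied `score_fn` callable are assumptions. In practice `score_fn` would train an RBF-kernel SVM and return a cross-validated accuracy; a toy function is used here so that the example runs without extra dependencies.

```python
import itertools
import math

# Exponent grids: C = 2^-15, 2^-13, ..., 2^15 and gamma = 2^15, 2^13, ..., 2^-15
C_grid = [2.0 ** e for e in range(-15, 16, 2)]
gamma_grid = [2.0 ** e for e in range(15, -16, -2)]


def grid_search(score_fn):
    """Return (best_score, C, gamma); score_fn(C, gamma) is a hypothetical scoring callable."""
    return max((score_fn(C, g), C, g) for C, g in itertools.product(C_grid, gamma_grid))


def toy_score(C, gamma):
    """Stand-in for cross-validated accuracy, peaked at C = 2^5 and gamma = 2^-13."""
    return -((math.log2(C) - 5) ** 2 + (math.log2(gamma) + 13) ** 2)


best_score, best_C, best_gamma = grid_search(toy_score)
print(len(C_grid) * len(gamma_grid), best_C, best_gamma)
```

Under this reading each parameter takes 16 values, giving 256 candidate combinations, and the optimal values reported in Section 3.1 (C = 2⁵ with γ = 2⁻¹³ or 2⁻¹¹) all lie on the grid.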

2.5. Cross-validation

Three approaches, namely the K-fold cross-validation test, the leave-one-out cross-validation (LOOCV) test (also known as the jackknife test), and the independent test, are commonly used to validate the performance of a classification model. Among these methods, the jackknife test is considered the most objective because it generates stable results: every sample is tested exactly once, so no random partitioning of the data is involved.

Therefore, in this study, we adopted the jackknife test to validate the proposed classification model, as in various previous studies (Tang et al., 2018; Wang et al., 2015; Zou et al., 2022b). Briefly, suppose that there are m samples included in the dataset. In each LOOCV trial, m-1 samples are selected to train the model, and the remaining one is used for testing. This process is repeated m times until each sample has been used for testing.
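The jackknife procedure can be sketched as a simple loop. To keep the example free of external dependencies, a nearest-centroid rule is used as a stand-in classifier; it is not the SVM used in this work, and the function names and toy data are assumptions of the sketch.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, x_test):
    """Stand-in classifier (NOT the paper's SVM): label of the closest class mean."""
    labels = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x_test, axis=1))]


def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: each sample is tested once on a model trained on the rest."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = len(y)
    correct = 0
    for i in range(m):
        mask = np.arange(m) != i  # hold out sample i
        correct += int(predict(X[mask], y[mask], X[i]) == y[i])
    return correct / m


# Toy data: two well-separated groups of three points each
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(X, y))
```

Any classifier with the same `predict(X_train, y_train, x_test)` signature can be passed in, and any feature selection or parameter tuning would ideally be repeated inside each trial so that the held-out sample never influences the model.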

2.6. Performance evaluation

The classification accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC) were employed to evaluate the performance of our proposed method. They are defined as follows (Basith et al., 2018; Charoenkwan et al., 2020a; Dai et al., 2021a; Li et al., 2020; Lissabet et al., 2019):

$Acc = \frac{TP + TN}{TP + FN + TN + FP} \times 100\%$ (11) $Sn = \frac{TP}{TP + FN} \times 100\%$ (12) $Sp = \frac{TN}{TN + FP} \times 100\%$ (13) $MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (14)

where TP (true positive) is the number of correctly recognized PVPs, TN (true negative) is the number of correctly classified non-PVPs, FP (false positive) is the number of non-PVPs classified as PVPs, and FN (false negative) is the number of PVPs classified as non-PVPs. In addition to the four indicators mentioned earlier, the area under the receiver operating characteristic (ROC) curve (AUC) was also adopted to evaluate the performance of the model.
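As a worked example of these indicators, the sketch below computes Acc, Sn, Sp, and MCC from the four confusion-matrix counts, using the standard definition of MCC, and computes AUC as the probability that a positive sample receives a higher score than a negative one. The function names are hypothetical, and the counts in the example are back-calculated from the Sn and Sp values reported for the proposed method in Table 1 (250 PVPs and 250 non-PVPs); they are not given explicitly in the article.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Acc, Sn, Sp (in percent) and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp) * 100
    sn = tp / (tp + fn) * 100
    sp = tn / (tn + fp) * 100
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Sn = 92.80% of 250 PVPs -> TP = 232, FN = 18; Sp = 90.00% of 250 non-PVPs -> TN = 225, FP = 25
metrics = classification_metrics(tp=232, fn=18, tn=225, fp=25)
print({name: round(value, 4) for name, value in metrics.items()})
```

With these counts the function reproduces the accuracy of 91.40% and the MCC of 0.8283 listed for the proposed method in Table 1.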

3. RESULTS AND DISCUSSION

3.1. Parameters setting

As mentioned earlier, we used the grid search method to find the optimal combination of the regularization parameter C and kernel parameter γ. We found that the best classification performance was obtained when using the following combination of these parameters: $\begin{cases} C = 2^{5}, \gamma = 2^{-13}, & \text{for the training dataset} \\ C = 2^{5}, \gamma = 2^{-11}, & \text{for the independent dataset} \end{cases}$ (15)

3.2. Classification performance on training dataset

Table 1 presents the classification results of the proposed method on the training dataset, and the ROC curves of these approaches are presented in Figure 3. Table 1 shows that different feature descriptors generated distinct classification results. Compared with the linear (PCC-based) approach, the non-linear (MIC-based) method achieved higher classification accuracy for both low-order and high-order correlation.

FIG. 3.

The ROC curves of different features on the training dataset. ROC, receiver operating characteristic.

Table 1.

Classification Results of Different Feature Approaches on Training Dataset

Method	Acc (%)	Sn (%)	Sp (%)	MCC	AUC
PCC	84.20	83.60	84.80	0.6840	0.9126
MIC	84.80	88.40	81.20	0.6978	0.9027
HOPCC	83.20	85.60	80.80	0.6648	0.9058
HOMIC	88.80	85.60	92.00	0.7776	0.9664
Proposed	91.40	92.80	90.00	0.8283	0.9732

Acc, accuracy; AUC, area under the receiver operating characteristic curve; HOMIC, high-order MIC; HOPCC, high-order PCC; MCC, Matthews correlation coefficient; MIC, maximal information coefficient; PCC, Pearson's correlation coefficient; Sn, sensitivity; Sp, specificity.

For instance, HOMIC obtained a classification accuracy of 88.80% and an AUC of 0.9664, whereas the corresponding values for HOPCC are 83.20% and 0.9058, respectively. Further, when all of these features were combined, the performance improved again: the classification accuracy reached 91.40%, which is 2.60 percentage points higher than that of HOMIC, the second-best approach. The proposed method also obtained the best sensitivity, MCC, and AUC.

3.3. Classification performance on independent dataset

The classification performance of different feature approaches on the independent dataset is summarized in Table 2, and Figure 4 plots the corresponding ROC curves. As shown in this table, the proposed method achieved the best classification results in terms of Acc, Sn, Sp, MCC, and AUC.

FIG. 4.

The ROC curves of different features on the independent dataset.

Table 2.

Classification Results of Different Feature Approaches on Independent Dataset

Method	Acc (%)	Sn (%)	Sp (%)	MCC	AUC
PCC	86.51	87.30	85.71	0.7303	0.9463
MIC	84.92	88.89	80.95	0.7006	0.9012
HOPCC	85.71	92.06	79.37	0.7201	0.9501
HOMIC	95.24	93.65	96.83	0.9052	0.9929
Proposed	96.83	95.24	98.41	0.9370	0.9987

For example, our proposed approach achieved a classification accuracy of 96.83% and an AUC of 0.9987, whereas HOMIC, the second-best method, obtained a classification accuracy of 95.24% and an AUC of 0.9929, which are 1.59 percentage points and 0.0058 lower than those of the proposed method, respectively. This indicates that the proposed method is powerful in identifying PVPs.

3.4. Feature analysis

In this study, four different feature descriptors were adopted to represent samples. In this subsection, we further analyzed the contribution of these four kinds of features. Figure 5 presents the percentage of different features in the final classification process. From this figure, we can see that among these four types of features, HOMIC provides the greatest contribution to the classification task.

FIG. 5.

The contribution of different features on the classification task. (a) Training dataset; (b) independent dataset.

This may explain why HOMIC achieved better classification results than the others as listed in Tables 1 and 2.
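The percentages shown in Figure 5 can be thought of as the share of LASSO-selected features that originate from each descriptor. The sketch below illustrates that bookkeeping under strong assumptions: the feature vector is taken to consist of four consecutive blocks of equal size in the order PCC, MIC, HOPCC, HOMIC, and the selected indices in the example are invented for illustration. Neither the layout nor the indices come from the article.

```python
import numpy as np

BLOCKS = ("PCC", "MIC", "HOPCC", "HOMIC")  # assumed block order


def contribution_by_block(selected_indices, block_size):
    """Percentage of selected features that fall in each descriptor block."""
    idx = np.asarray(selected_indices)
    counts = np.bincount(idx // block_size, minlength=len(BLOCKS))
    return dict(zip(BLOCKS, 100.0 * counts / counts.sum()))


# Hypothetical selected indices, with 1225 features per block
shares = contribution_by_block([3, 1300, 3700, 3800, 4100, 4899], block_size=1225)
print({name: round(float(value), 2) for name, value in shares.items()})
```

In this made-up example four of the six selected features fall in the last block, so HOMIC would account for about two thirds of the final feature set.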

3.5. Effect of classifier

In this work, the SVM was employed to discriminate PVPs from non-PVPs. However, many other classifiers are often used in this area. To explore this issue, we also performed experiments on other classifiers, such as K nearest neighbor (Dhall et al., 2021; Dai et al., 2021b; Hasan et al., 2020; Liu and Chen, 2020; Lin et al., 2019), decision tree (Dhall et al., 2021; Lin et al., 2019), and random forest (Charoenkwan et al., 2021; Dai et al., 2021b; Dhall et al., 2021; Hasan et al., 2020; Lin et al., 2019; Liu and Chen, 2020; Liu et al., 2018).

All of the results are summarized in Figure 6. Different classifiers clearly generate distinct classification results. Compared with the others, the SVM achieved the best classification performance on both datasets.

FIG. 6.

The classification performance of different classifiers. (a) Training dataset; (b) independent dataset.

3.6. Classification performance on other datasets

In addition to the datasets used in this work, other datasets have also been used in PVP studies. To further demonstrate the efficacy of the proposed method, we performed experiments on two datasets collected from previous studies (Arif et al., 2020; Manavalan et al., 2018), which we term S1 and S2. S1 contains 99 PVPs and 208 non-PVPs, whereas S2 includes 30 PVPs and 64 non-PVPs.

All of the results are provided in Table 3. Our proposed approach achieved the highest Acc, Sn, Sp, and MCC on both S1 and S2. Moreover, because both datasets contain roughly twice as many non-PVPs as PVPs, these results suggest that the proposed model copes well with imbalanced data.

Table 3.
Classification Performance of Different Models on S1 and S2

Predictor	S1 Acc (%)	S1 Sn (%)	S1 Sp (%)	S1 MCC	S2 Acc (%)	S2 Sn (%)	S2 Sp (%)	S2 MCC
Naïve Bayes (Feng et al., 2013)	79.15	75.76	80.77	0.55	N/A	N/A	N/A	N/A
PVPred (Ding et al., 2014)	85.02	75.76	89.42	0.66	71.30	60.00	76.50	0.35
PVP-SVM (Manavalan et al., 2018)	86.97	73.73	93.27	0.70	79.80	66.70	85.90	0.53
Pred-BVP-Unb (Arif et al., 2020)	92.54	93.80	91.27	0.85	83.06	86.66	79.68	0.66
iPVP-MCV (Han et al., 2021)	87.90	77.80	92.80	0.72	84.00	66.70	92.20	0.62
Proposed	95.77	93.94	96.63	0.90	97.87	96.67	98.44	0.95

N/A, not available; PVP, phage virion protein; SVM, support vector machine.

3.7. Comparison with the state-of-the-art method

Moreover, we also conducted experiments to compare our proposed method with state-of-the-art classification models on the same datasets. Since previous works used a 10-fold cross-validation test to validate their predictors, for a fair comparison, we also adopted a 10-fold test to measure the performance of the proposed model. To perform a 10-fold cross-validation test, the dataset is randomly divided into 10 parts, with nine of them used as the training set and the other part used as the testing set. Table 4 gives the classification performance of different predictors.
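The 10-fold partitioning described above can be sketched as follows. The function name `k_fold_indices`, the fixed random seed, and the use of a plain (non-stratified) shuffle are assumptions of this sketch; the article does not state whether the folds were stratified by class.

```python
import numpy as np


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(500, k=10))  # 500 samples, as in the training dataset
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For the 500-sample training dataset this gives ten rounds with 450 training and 50 test samples each; because the partition is random, results can vary slightly between runs unless the seed is fixed.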

Table 4.
Performance Comparison Between the Proposed Model and Existing Predictors

Predictor	Training Acc (%)	Training Sn (%)	Training Sp (%)	Training MCC	Independent Acc (%)	Independent Sn (%)	Independent Sp (%)	Independent MCC
Meta-iPVP (Charoenkwan et al., 2020c)	84.60	86.00	83.20	0.698	81.70	88.90	74.60	0.642
iPVP-MCV (Han et al., 2021)	86.40	85.20	87.60	0.728	83.30	88.90	77.80	0.671
SCORPION (Ahmad et al., 2022)	86.80	88.40	85.20	0.741	87.30	84.10	90.50	0.748
Proposed	90.74	91.92	89.56	0.815	96.75	95.87	97.62	0.935

The proposed method achieved the best classification performance on both the training and independent datasets, compared with other predictors. For instance, our proposed method achieved classification accuracies of 90.74% and 96.75% on the training and independent datasets, respectively, whereas the best accuracies of prior studies are 86.80% and 87.30%, which are 3.94 and 9.45 percentage points lower than those of the proposed method.

4. CONCLUSION

In this study, we proposed a novel approach to identify PVPs. In the model, multiple kinds of correlation information were employed to denote protein sequences, and the LASSO algorithm was applied to select the most discriminative features. We obtained 91.40% and 96.83% classification accuracies on the training and independent datasets, respectively.

Compared with the state-of-the-art approaches, our proposed method achieved significant improvement on both the training and independent datasets. The codes and datasets are freely available at https://figshare.com/articles/online_resource/iPVPs/19450913. We hope that the proposed method may play a complementary role in current PVP studies.

ACKNOWLEDGMENT

The authors gratefully acknowledge the partial support from the Jiangxi Science and Technology Normal University.

AUTHORS' CONTRIBUTIONS

All authors agree that they have read and approved the article.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work was supported by the Youth Project of Jiangxi Education Department (GJJ2201350) and the Doctoral start-up fund of Jiangxi Science and Technology Normal University (2022BSQD20).

References

1. Ahmad, Charoenkwan, Quinn JMW, et al. SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins. Sci Rep, 2022; 12:4106.
2. Arif, Ali, Ahmad, et al. Pred-BVP-Unb: Fast prediction of bacteriophage Virion proteins using un-biased multi-perspective properties with recursive feature elimination. Genomics, 2020; 112:1565–1574.
3. Basith, Manavalan, Shin, et al. iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree. Comput Struct Biotechnol J, 2018; 16:412–420.
4. Charoenkwan, Chiangjong, Nantasenamat, et al. StackIL6: A stacking ensemble model for improving the prediction of IL-6 inducing peptides. Brief Bioinform, 2021; 22(6):bbab172.
5. Charoenkwan, Kanthawong, Nantasenamat, et al. iDPPIV-SCM: A sequence-based predictor for identifying and analyzing dipeptidyl peptidase IV (DPP-IV) inhibitory peptides using a scoring card method. J Proteome Res, 2020a; 19:4125–4136.
6. Charoenkwan, Kanthawong, Schaduangrat, et al. PVPred-SCM: Improved prediction and analysis of phage virion proteins using a scoring card method. Cells, 2020b; 9:353.
7. Charoenkwan, Nantasenamat, Hasan, et al. Meta-iPVP: A sequence-based meta-predictor for improving the prediction of phage virion proteins using effective feature representation. J Comput Aided Mol Des, 2020c; 34:1105–1116.
8. Chen, Feng, Nie. iATP: A sequence based method for identifying anti-tubercular peptides. Med Chem, 2020; 16:620–625.
9. Chen, Han, Gao, et al. High-order resting-state functional connectivity network for MCI classification. Human Brain Mapp, 2016; 37:3282–3296.
10. Cortes, Vapnik. Support-vector networks. Mach Learn, 1995; 20:273–297.
11. Dai, Feng, Cui, et al. Iterative feature representation algorithm to improve the predictive performance of N7-methylguanosine sites. Brief Bioinform, 2021a; 22.
12. Dai, Zhang, Tang, et al. BBPpred: Sequence-based prediction of blood-brain barrier peptides with feature representation learning and logistic regression. J Chem Inform Model, 2021b; 61:525–534.
13. Dhall, Patiyal, Sharma, et al. Computer-aided prediction and design of IL-6 inducing peptides: IL-6 plays a crucial role in COVID-19. Brief Bioinform, 2021; 22:936–945.
14. Ding, Feng P-M, Chen, et al. Identification of bacteriophage virion proteins by the ANOVA feature selection and analysis. Mol Biosyst, 2014; 10:2229–2235.
15. Feng P-M, Ding, Chen, et al. Naive Bayes classifier with feature selection to identify phage virion proteins. Comput Math Methods Med, 2013; 2013:530696.
16. Han, Zhu, Ding, et al. iPVP-MCV: A multi-classifier voting model for the accurate identification of phage virion proteins. Symmetry, 2021; 13:1506.
17. Hasan, Schaduangrat, Basith, et al. HLPpred-Fuse: Improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics, 2020; 36:3350–3356.
18. Huang, Niu, Gao, et al. CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics, 2010; 26:680–682.
19. Lavigne, Ceyssens P-J, Robben. Phage proteomics: Applications of mass spectrometry. Methods Mol Biol, 2009; 502:239–251.
20. Li, Dong, Wang, et al. Identification of secreted proteins from malaria protozoa with few features. IEEE Access, 2020; 8:89793–89801.
21. Lin, Chen, Li, et al. Accurate prediction of potential druggable proteins based on genetic algorithm and Bagging-SVM ensemble classifier. Artif Intell Med, 2019; 98:35–47.
22. Lissabet JFB, Bel NLH, Farias. AntiVPP 1.0: A portable tool for prediction of antiviral peptides. Comput Biol Med, 2019; 107:127–130.
23. Liu, Weng, Huang D-S, et al. iRO-3wPseKNC: Identify DNA replication origins by three-window-based PseKNC. Bioinformatics, 2018; 34:3086–3093.
24. Liu, Ji, Ye. SLEP: Sparse learning with efficient projections. Arizona State Univ, 2009; 6:7.
25. Liu, Chen. iMRM: A platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics, 2020; 36:3336–3342.
26. Lyon. Phage therapy's role in combating antibiotic-resistant pathogens. JAMA, 2017; 318:1746–1748.
27. Manavalan, Shin, Lee. PVP-SVM: Sequence-based prediction of phage virion proteins using a support vector machine. Front Microbiol, 2018; 9:476.
28. Meng, Zhang J X, et al. Review and comparative analysis of machine learning-based phage virion protein identification methods. Biochim Biophys Acta Proteins Proteom, 2020; 1868:140406.
29. Pan, Gao, Lin, et al. Identification of bacteriophage virion proteins using multinomial naive Bayes with g-gap feature tree. Int J Mol Sci, 2018; 19:1779.
30. Reshef, Reshef, Finucane, et al. Detecting novel associations in large data sets. Science, 2011; 334:1518–1524.
31. Sorokulova, Olsen, Vodyanoy. Bacteriophage biosensors for antibiotic-resistant bacteria. Exp Rev Med Dev, 2014; 11:175–186.
32. Tang, Zhao, Zou, et al. HBPred: A tool to identify growth hormone-binding proteins. Int J Biol Sci, 2018; 14:957–964.
33. Wang, Zhang, et al. MultiP-SChlo: Multi-label protein subchloroplast localization prediction with Chou's pseudo amino acid composition and a novel multi-label classifier. Bioinformatics, 2015; 31:2639–2645.
34. Xiao, Wang, Chou. iNR-PhysChem: A sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix. PLoS One, 2012; 7.
35. Xiao, Xu Z-C, Qiu W-R, et al. iPSW (2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition. Genomics, 2019; 111:1785–1793.
36. Xiao, Zou, Lin. iMem-Seq: A multi-label learning classifier for predicting membrane proteins types. J Membrane Biol, 2015; 248:745–752.
37. Zhang, Zhang, Gao, et al. An ensemble method to distinguish bacteriophage virion from non-virion proteins based on protein sequence characteristics. Int J Mol Sci, 2015; 16:21734–21758.
38. Zou, Yang, Yin. Identifying N7-methylguanosine sites by integrating multiple features. Biopolymers, 2022a; 113(2):e23480.
39. Zou, Yang, Yin. Identification of tumor homing peptides by utilizing hybrid feature representation. J Biomol Struct Dynam, 2022b:1–8.
40. Zou, Yang, Yin. iTTCA-MFF: Identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics, 2022c; 74(5):447–454.
41. Zou, Yang. Dynamic thresholding networks for schizophrenia diagnosis. Artif Intell Med, 2019; 96:25–32.
42. Zou, Yang. Multiple functional connectivity networks fusion for schizophrenia diagnosis. Med Biol Eng Comput, 2020; 58:1779–1790.
43. Zou, Yin. Identifying dipeptidyl peptidase-IV inhibitory peptides based on correlation information of physicochemical properties. Int J Peptide Res Ther, 2021; 27:2651–2659.
44. Zuo, Zou, Lin, et al. 2lpiRNApred: A two-layered integrated algorithm for identifying piRNAs and their functions based on LFE-GM feature selection. RNA Biol, 2020; 17:892–902.