Abstract
Essential proteins possess critical functions for cell survival. Identifying essential proteins improves our understanding of how a cell works and also plays a vital role in the research fields of disease treatment and drug development. Recently, some machine-learning methods and ensemble learning methods have been proposed to identify essential proteins by introducing effective protein features. However, existing ensemble learning methods have focused mainly on the choice of base classifiers. In this article, we propose a novel ensemble learning framework called multi-ensemble to integrate different base classifiers. The multi-ensemble method adopts the idea of multi-view learning: it selects multiple base classifiers and trains them by continually adding the samples that are predicted correctly by the other base classifiers. We applied multi-ensemble to Yeast data and Escherichia coli data. The results show that our approach achieved better performance than both individual classifiers and the other ensemble learning methods.
1. INTRODUCTION
Essential proteins play an indispensable role in cell survival (Ren and Yan, 2009). Without them, cells lose important functions and may even die. Some studies have pointed out that essential proteins are closely related to human disease genes (Wang et al., 2013; Song et al., 2019a, b). Identifying essential proteins can help scientists understand how a cell works and determine the minimum requirements for cell survival. Hence, identifying essential proteins plays a vital role in the emerging field of synthetic biology and is also crucial for treating diseases and developing new drugs.
Recently, many computational methods have been proposed to identify essential proteins based on one or several of their features. Among the most important are their topological features in a biological network. Previous studies suggested that essential proteins tend to be at the center of the protein-protein interaction (PPI) network because removing them causes the network to collapse (Jeong et al., 2001). Therefore, some centrality methods that measure the importance of a node in the network have been utilized to detect essential proteins, including Degree Centrality (DC) (Vallabhajosyula et al., 2009), Betweenness Centrality (BC) (Freeman, 1977; Joy et al., 2014), Eigenvector Centrality (EC) (Bonacich, 1987), Information Centrality (IC) (Stephenson and Zelen, 1989), Closeness Centrality (CC) (Stefan and Stadler, 2003), Subgraph Centrality (SC) (Ernesto and Rodríguez-Velázquez, 2005), and Edge Clustering Coefficient Centrality (NC) (Wang et al., 2012).
Besides topological features, some biological features are related to the essentiality of proteins. For example, essential proteins are more conserved than nonessential ones (Jordan et al., 2002), and they tend to contain more protein domains that appear less frequently in other proteins (Peng et al., 2015a). Essential proteins perform critical functions in certain subcellular components and under certain conditions (Peng et al., 2015b; Zhang et al., 2019). To improve prediction accuracy, several groups focus on combining biological features with topological features to identify essential proteins (Peng et al., 2015b; Zhang et al., 2019). Because different features capture different knowledge, they complement each other and can describe the essentiality of proteins more comprehensively. Peng et al. (2012) design an iterative method to predict essential proteins by integrating the conservation of proteins with their topological properties in the PPI network. The same group introduces protein domain information into a PPI network to make predictions. Li et al. combine subcellular localization information, orthology information, and PPI networks to predict essential proteins (Li et al., 2016). Considering that PPI networks are error-prone and protein functions are dynamic, some methods [PeC (Min et al., 2013), weighted degree centrality (WDC) (Tang et al., 2012), fusing the dynamic PPI networks of different time points (FDP) (Zhang et al., 2019)] employ gene expression profiles to purify the PPI network or construct a dynamic PPI network and then detect essential proteins based on the filtered PPI network.
The methods mentioned earlier are unsupervised methods that select essential proteins by assigning them ranking scores. The top-ranked proteins are selected as candidate essential proteins (Song et al., 2019a, b). Such methods do not need to know the essentiality labels ahead of time. Supervised machine learning is also a popular way to integrate multiple features and has been applied to predict essential proteins (Wang et al., 2013). Chen and Xu incorporate four types of features, including growth rates of mutants, protein sequence data, PPI data, and gene expression data, into a neural network or support vector machine (SVM) classifier to decide protein dispensability (Chen and Xu, 2005). Gustafson et al. input topological and biological features into a Naive Bayes classifier to predict essential proteins (Gustafson et al., 2006). Hwang et al. use an SVM classifier to combine multiple topological features, such as DC, BC, and CC, and biological features, such as open reading frame, paralogs, and strand, to predict essential proteins (Hwang et al., 2009). Zhong et al. (2013) design a Gene Expression Programming-based method that learns topological features and biological features for predicting essential proteins. Some ensemble learning methods assemble multiple base classifiers to identify essential proteins.
Acencio and Lemke use a voting strategy to integrate multiple decision tree classifiers to make final essentiality predictions (Acencio and Lemke, 2009). They include subcellular localization information, biological features, and network topological features in the classifiers. Deng et al. train and test four classifiers independently, namely a Naive Bayes classifier, a C4.5 decision tree, a CN2 rule, and a logistic regression model, and then produce final prediction results by combining the outputs of these diverse classifiers through an unweighted average (Jingyuan et al., 2011). Diverse classifiers can capture different but complementary features of essential proteins, and ensemble learning methods that integrate them usually achieve better prediction performance than a single classifier. However, previous ensemble learning methods combine different classifiers through straightforward strategies, such as voting and unweighted averaging.
To improve the performance of essential protein prediction, in this work, we present a novel ensemble learning framework, namely multi-ensemble, to integrate different base classifiers. The idea of ensemble learning is to build a powerful prediction classifier by combining the strengths of a collection of base classifiers. Typically, an ensemble learning method is constructed in two steps. First, it generates different base classifiers in a parallel or sequential way. The second step is to combine these classifiers. Bagging and boosting are two popular kinds of ensemble learning methods. Bagging (Breiman, 1996) produces diverse classifiers by subsampling the training data with replacement. These classifiers are independent of each other and are built in parallel. After that, bagging combines them by majority voting to make predictions. Boosting trains classifiers sequentially, and the generation of a classifier influences the generation of its subsequent classifiers. Adaboost (Schapire et al., 1998), a famous boosting ensemble learning method, assigns weights to the data in the training set and to the classifiers and updates the weights based on the misclassification rate. Another example of the boosting method is Xgboost (Chen and Guestrin, 2016), which weights and sums the prediction scores of different regression trees to make predictions.
In contrast to previous ensemble learning methods, our multi-ensemble method uses high-quality samples to train base classifiers and assembles these classifiers by a stacking strategy. In the training process, the base classifiers depend on each other, because their training samples are not fixed, but increase depending on the prediction results of other classifiers. Only the high-quality training samples are selected to generate the base classifiers. A training sample is regarded as a high-quality sample if the other base classifiers classify it as a positive or negative sample by consensus. This idea is similar to the multi-view concept in the tri-training method for semi-supervised learning proposed by Zhou and Li (2005). Finally, a logistic regression (LR) model combines the results of these base classifiers for final prediction. The parameters in the LR model are estimated by minimizing the difference between the real labels and the prediction results generated by these base classifiers over the training data. We employ the multi-ensemble method to assemble several base classifiers, such as softmax regression classifier (Bishop, 2006), decision tree (Quinlan, 1986), random forest (Breiman, 2001), Adaboost (Schapire et al., 1998), and Xgboost (Chen and Guestrin, 2016). The experimental results show that the multi-ensemble method outperforms both individual base classifiers and the other state-of-the-art ensemble learning methods in predicting the essential proteins of Saccharomyces cerevisiae (Yeast) and Escherichia coli.
2. METHODS
The multi-ensemble method is constructed in two crucial steps. First, it generates different base classifiers by feeding high-quality samples. The second step involves combining these classifiers with an LR model. In the first step, the training data are partitioned into several parts. The high-quality training samples that are regarded as positive or negative samples by most of the other classifiers are selected to train the base classifiers. Figure 1 illustrates the framework of the multi-ensemble method for predicting essential proteins.

An overview of multi-ensemble. P_pred1, P_pred2, P_predm are the predicted probabilities of the multiple base classifiers on the samples in Pj and T_pred1, T_pred2, T_predm are the predicted probabilities of the multiple base classifiers on the test set. LR, logistic regression.
2.1. Generating base classifiers
Given a training set, it is randomly divided into two partitions, T and P. T accounts for about a quarter of the training set, and P is the rest of the training set. After that, m initial training sets, denoted by T1, T2…Tm, are generated for m base classifiers by bootstrap sampling with replacement from T. The m initial training sets are input into the base classifiers, respectively. Mathematically, the relationship between Ti and T is as follows.
Given m base classifiers C1, C2…Cm, they are initially trained on the data in T1, T2…Tm, respectively. After that, every base classifier predicts the data in Pi, where i ∈ {1, 2, …, n}. At each round of prediction, the samples in Pi are appended to the training set of a base classifier if most of the other base classifiers consider them high-quality samples. The high-quality samples are determined by the following equation:
where Ci(x) means the output probability that the sample x is considered as positive by the base classifier Ci.
where Sj(x) is 1 if the base classifier Cj considers the sample x to be a high-quality sample, otherwise 0; m is the total number of base classifiers.
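The selection rule can be illustrated with a minimal Python sketch. The exact form of Sj(x) is not reproduced here, so the sketch rests on stated assumptions: a classifier Cj is taken to vote for a sample when its output probability is at least a hypothetical upper threshold `HI` (confidently positive) or at most a hypothetical lower threshold `LO` (confidently negative), and the sample is appended to the training set of Ci when more than half of the other classifiers agree in the same direction. The threshold values and the function names are illustrative only and are not taken from the method description.

```python
from typing import List, Sequence

HI = 0.7  # hypothetical threshold for a confident positive vote (assumption)
LO = 0.3  # hypothetical threshold for a confident negative vote (assumption)


def is_high_quality(probs: Sequence[float], i: int,
                    hi: float = HI, lo: float = LO) -> bool:
    """Decide whether a sample joins the training set of classifier i.

    probs[j] plays the role of C_j(x), the probability that classifier j
    assigns to the positive class. Only classifiers other than i vote.
    """
    others = [p for j, p in enumerate(probs) if j != i]
    pos_votes = sum(1 for p in others if p >= hi)
    neg_votes = sum(1 for p in others if p <= lo)
    needed = len(others) / 2  # strict majority of the other classifiers
    return pos_votes > needed or neg_votes > needed


def select_for_classifier(batch_probs: Sequence[Sequence[float]],
                          i: int) -> List[int]:
    """Return indices of the samples in one part P_i that classifier i receives."""
    return [idx for idx, probs in enumerate(batch_probs)
            if is_high_quality(probs, i)]


# Toy probabilities from m = 3 classifiers on four samples (made-up numbers).
toy_batch = [
    [0.5, 0.9, 0.8],    # the other two classifiers are confidently positive
    [0.5, 0.9, 0.2],    # the other two classifiers disagree
    [0.9, 0.1, 0.05],   # the other two classifiers are confidently negative
    [0.1, 0.5, 0.6],    # the other two classifiers are not confident
]
selected_for_c0 = select_for_classifier(toy_batch, 0)
```

In this sketch a classifier never votes on its own training set, which mirrors the statement that the base classifiers depend on each other.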
After appending the high-quality samples in Pi to the training set of a base classifier, the classifier is trained again and predicts the samples in Pj,(
2.2. Integrating base classifiers
After generating the base classifiers by using high-quality samples, an LR model is adopted to integrate the outputs of multiple base classifiers and to obtain the final predictions. Mathematically, the LR model is defined as follows:
where
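The integration step of Section 2.2 can be sketched in Python as a small stacking example. This is an illustration under assumptions rather than the authors' implementation: the LR model is taken to have one weight per base classifier plus a bias, "minimizing the difference between the real labels and the predictions" is interpreted as minimizing the cross-entropy loss by plain gradient descent, and the names `fit_lr`, `predict_lr`, the learning rate, and the toy probabilities are all hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_lr(w: Sequence[float], row: Sequence[float]) -> float:
    """Combine base-classifier probabilities; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wk * pk for wk, pk in zip(w[1:], row)))


def fit_lr(P: Sequence[Sequence[float]], y: Sequence[int],
           lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit LR weights on base-classifier outputs by batch gradient descent."""
    m, n = len(P[0]), len(P)
    w = [0.0] * (m + 1)
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for row, label in zip(P, y):
            err = predict_lr(w, row) - label  # gradient of cross-entropy w.r.t. z
            grad[0] += err
            for k, pk in enumerate(row):
                grad[k + 1] += err * pk
        w = [wk - lr * g / n for wk, g in zip(w, grad)]
    return w


# Each row holds the outputs of m = 3 base classifiers for one training sample
# (made-up values); y holds the real essentiality labels.
P_pred = [[0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.2, 0.3, 0.1],
          [0.1, 0.4, 0.2], [0.7, 0.9, 0.6], [0.3, 0.2, 0.4]]
y_true = [1, 1, 0, 0, 1, 0]
weights = fit_lr(P_pred, y_true)

# Final prediction for an unseen protein from its three base-classifier outputs.
final_score = predict_lr(weights, [0.85, 0.8, 0.9])
```

The point of learning the weights, as opposed to voting or unweighted averaging, is that a base classifier whose outputs track the real labels more closely can receive a larger weight in the final prediction.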
3. RESULTS
3.1. Datasets
To test the effectiveness of our multi-ensemble method, we apply it to predict essential proteins of S. cerevisiae (Yeast) and E. coli. The Yeast essential proteins were collected from the Munich Information Center for Protein Sequences (MIPS) database (Mewes et al., 2004), the Saccharomyces Genome Database (SGD) (Cherry et al., 1998), the Database of Essential Genes (DEG) (Ren and Yan, 2009), and the Saccharomyces Genome Deletion Project (SGDP) database. The Yeast PPI network data were downloaded from the Database of Interacting Proteins (DIP) (Ioannis et al., 2002) and consist of 5093 proteins and 24,743 edges. Among the 5093 Yeast proteins, 1167 were essential and 3926 were nonessential. The essential proteins and PPI network data of E. coli came from the DEG database and DIP database, respectively. These involve 2727 proteins and 11,803 edges. Among the 2727 E. coli proteins, 254 were essential and 2473 were nonessential.
The topological features have proved to be typical features of essential proteins in the PPI network and exhibit powerful performance in essential protein prediction (Wang et al., 2013; Song et al., 2019a). Moreover, essential proteins are likely to be enriched in certain types of subcellular localization (Li et al., 2016). Similar to previous methods (Zhong et al., 2013), 16 subcellular localization features and 10 topological features were concatenated as input features of the classifiers in the prediction of Yeast essential proteins. The 16 subcellular localization features were downloaded from the Eukaryotic Subcellular Localization DataBase (eSLDB) (Andea et al., 2007), including Vacuole, Vesicles, Peroxisome, Secretory pathway, Transmembrane, Lysosome, Membrane, Mitochondrion, Nucleus, Golgi, Endoplasmic reticulum, Endosome, Extracellular, Cytoplasm, Cell wall, and Cytoskeleton.
Table 1 lists the 10 topological features, including seven centrality methods and three composite features. The seven centrality methods are BC (Freeman, 1977; Joy et al., 2014), CC (Stefan and Stadler, 2003), DC (Vallabhajosyula et al., 2009), EC (Bonacich, 1987), IC (Stephenson and Zelen, 1989), NC (Wang et al., 2012), and SC (Ernesto and Rodríguez-Velázquez, 2005), which were calculated by the Cytoscape plugin CytoNCA (Tang et al., 2015). The three composite features were PeC (Min et al., 2013), WDC (Tang et al., 2012), and ION (Peng et al., 2012). ION predicts essential proteins by integrating orthology data with the PPI network. PeC and WDC combine the PPI network with gene expression profiles to predict essential proteins. The orthology data in ION were from the InParanoid database (Gabriel et al., 2010), and the gene expression profiles in PeC and WDC were from Tu et al. (2005). For predicting E. coli essential proteins, eight topological properties, including BC, CC, DC, EC, IC, NC, SC, and ION, were selected as features of the classifiers.
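To make the notion of a topological feature concrete, the following Python sketch computes two of the listed centralities, DC and CC, on a toy undirected network. It is only an illustration: the features in this work were calculated with CytoNCA, whose normalization may differ, and the sketch assumes the common textbook definitions (DC as the number of neighbors; CC as (n − 1) divided by the sum of shortest-path distances in a connected network). The protein names and helper functions are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from a list of interactions."""
    adj: Dict[str, Set[str]] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj: Dict[str, Set[str]]) -> Dict[str, int]:
    """DC: the number of direct interaction partners of each protein."""
    return {u: len(nbrs) for u, nbrs in adj.items()}


def closeness_centrality(adj: Dict[str, Set[str]]) -> Dict[str, float]:
    """CC: (n - 1) / sum of shortest-path distances, via breadth-first search.

    Nodes that cannot reach every other node get 0.0 in this sketch.
    """
    n = len(adj)
    result: Dict[str, float] = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        result[src] = (n - 1) / total if total > 0 and len(dist) == n else 0.0
    return result


# Toy PPI network with hypothetical protein names.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
toy_graph = build_graph(toy_edges)
dc = degree_centrality(toy_graph)
cc = closeness_centrality(toy_graph)
```

In the toy network, protein A interacts with three of the other four proteins and therefore scores highest on both measures, which is the intuition behind using centrality as an indicator of essentiality.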
Description and Equation of Topological Features
PPI, protein-protein interaction; WDC, weighted degree centrality.
3.2. Evaluation metrics
There are more nonessential proteins than essential proteins in the Yeast and E. coli datasets. The ratio of essential proteins to nonessential proteins is 1:3.36 (1167:3926) in the Yeast dataset and 1:9.74 (254:2473) in the E. coli dataset. Hence, we divided all sample data into five parts at random and kept in each part the same ratio of essential proteins to nonessential proteins as in the original data, that is, 1:3.36 in the Yeast dataset and 1:9.74 in the E. coli dataset. In the course of validation, four of the five parts constituted the training set and the remaining part was the testing set. The process was repeated five times until each one of the five parts had been used for testing. The prediction results were checked against the real labels, and some popular statistical metrics, including specificity (SP), sensitivity (SN), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), F-measure, accuracy (ACC), and Matthews Correlation Coefficient (MCC), were calculated to measure the prediction performance. They were defined as follows:
where TP indicates the number of essential proteins that are correctly predicted as essential, TN indicates the number of nonessential proteins that are correctly predicted as nonessential, FP indicates the number of nonessential proteins that are incorrectly predicted as essential ones, and FN indicates the number of essential proteins that are missed by the predictors. Besides, the area under the receiver operating characteristic (ROC) curve (AUC) was calculated to evaluate the overall performance of each method. In the Results section, we calculate the AUC score separately for each cross-validation fold and average the scores to obtain a final AUC value for each method. In 5-fold cross-validation, each one of the five parts is used for testing once, so every protein receives a probability of being essential from a given method. We rank all proteins in descending order of this probability and select the first 1167 proteins in the Yeast dataset and the first 254 proteins in the E. coli dataset as candidate essential proteins (according to the number of real essential proteins). After that, TP, FP, TN, and FN were calculated for each method based on the candidate essential proteins and the benchmark datasets, from which the SN, SP, FPR, PPV, NPV, F-measure, ACC, and MCC values were obtained.
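The evaluation procedure described above can be sketched in Python as follows. The sketch assumes the standard definitions of these metrics, with the F-measure taken as the harmonic mean of SN and PPV; the function names and the toy scores are hypothetical and are not results from this study.

```python
import math
from typing import Dict, Sequence, Tuple


def top_k_confusion(scores: Sequence[float], labels: Sequence[int],
                    k: int) -> Tuple[int, int, int, int]:
    """Rank by predicted probability, call the top k essential, count TP, FP, TN, FN."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(order[:k])
    tp = sum(1 for i in predicted if labels[i] == 1)
    fp = k - tp
    fn = sum(labels) - tp
    tn = len(labels) - tp - fp - fn
    return tp, fp, tn, fn


def ratio(num: float, den: float) -> float:
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics (assumed definitions)."""
    sn = ratio(tp, tp + fn)           # sensitivity
    sp = ratio(tn, tn + fp)           # specificity
    ppv = ratio(tp, tp + fp)          # positive predictive value
    npv = ratio(tn, tn + fn)          # negative predictive value
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": sn, "SP": sp, "FPR": ratio(fp, fp + tn),
        "PPV": ppv, "NPV": npv,
        "F": ratio(2 * sn * ppv, sn + ppv),
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy example: 8 proteins, 3 of them essential, so k = 3 (made-up scores).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 1, 0, 0, 0]
counts = top_k_confusion(toy_scores, toy_labels, k=sum(toy_labels))
toy_metrics = metrics(*counts)
```

One consequence of choosing k equal to the number of real essential proteins is that TP + FP = TP + FN, so FP = FN and the SN, PPV, and F-measure values of a method coincide under this protocol.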
3.3. Parameter settings
3.3.1. The number of the base classifiers
The multi-ensemble method integrates multiple base classifiers to predict essential proteins. Users can customize the number of base classifiers. To determine a suitable number of base classifiers for essential protein prediction, we ran experiments with 2, 3, 4, and 5 base classifiers on the Yeast data and the E. coli data. The base classifiers include a softmax regression classifier (Bishop, 2006), a decision tree (Quinlan, 1986), a random forest (Breiman, 2001), an Adaboost (Schapire et al., 1998), and an Xgboost (Chen and Guestrin, 2016). Since there are many possible combinations for each number of classifiers, only the combination with the best performance for each number was selected for comparison. Tables 2 and 3 show the comparisons of SN, SP, FPR, PPV, NPV, F-measure, ACC, MCC, and AUC for the multi-ensemble method with different numbers of base classifiers on the Yeast dataset and the E. coli dataset. As can be seen from Tables 2 and 3, the multi-ensemble method achieves the best performance when integrating three base classifiers. Hence, in the following comparisons, the multi-ensemble method integrates three different base classifiers to predict essential proteins.
The Performance Comparisons of the Multi-Ensemble Method with Different Numbers of Base Classifiers on Yeast Dataset
ACC, accuracy; AUC, area under the receiver operating characteristic (ROC) curve; FPR, false positive rate; MCC, Matthews Correlation Coefficient; NPV, negative predictive value; PPV, positive predictive value; SN, sensitivity; SP, specificity. The best and comparable results are in boldface.
The Performance Comparisons of the Multi-Ensemble Method with Different Numbers of Base Classifiers on Escherichia coli Dataset
3.3.2. The size of {T1, T2…Tm}
In our method, T was set to be a quarter of the training set. Bootstrap sampling was done on T to generate multiple datasets with the same size for every base classifier, denoted by {T1, T2…Tm}, where m is the number of base classifiers. In this work, the size of T was set to 1000 for the Yeast data and it was set to 530 for the E. coli data. To test the effect of initial training set size on the performance of our method, we did some comparative experiments by setting different sizes ranging from 500 to 1500 on the Yeast dataset and from 265 to 730 on the E. coli dataset. Tables 4 and 5 show the comparisons of SN, SP, FPR, PPV, NPV, F-MEASURE, ACC, MCC, and AUC for the multi-ensemble method with different sizes of {T1, T2…Tm} on the Yeast dataset and the E. coli dataset. It turns out that setting the size to 1000 for the Yeast dataset and to 530 for the E. coli dataset produces better results.
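The partition and bootstrap step whose size is tuned here can be sketched in Python as follows. The sketch makes several assumptions that go beyond the description above: each bootstrap set Ti is drawn with the same size as T, the remainder P is split into a hypothetical number `n_parts` of roughly equal parts, and all function and parameter names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def split_and_bootstrap(train_ids: Sequence[int], m: int,
                        t_fraction: float = 0.25, n_parts: int = 4,
                        seed: int = 0
                        ) -> Tuple[List[int], List[List[int]], List[List[int]]]:
    """Split the training set into T and P, then draw m bootstrap sets from T.

    Returns (T, parts of P, [T_1, ..., T_m]).
    """
    rng = random.Random(seed)
    ids = list(train_ids)
    rng.shuffle(ids)                          # random partition into T and P
    t_size = int(len(ids) * t_fraction)
    T, P = ids[:t_size], ids[t_size:]
    # Sampling with replacement: a sample may appear several times in a T_i.
    boot_sets = [rng.choices(T, k=len(T)) for _ in range(m)]
    # Hypothetical split of P into n_parts rounds of prediction.
    parts = [P[j::n_parts] for j in range(n_parts)]
    return T, parts, boot_sets


# Toy run: 4000 training samples and m = 3 base classifiers.
T, P_parts, T_sets = split_and_bootstrap(range(4000), m=3)
```

With 4000 training samples and a fraction of one quarter, T holds 1000 samples, which is of the same order as the size chosen for the Yeast data.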
Performance Comparisons of Multi-Ensemble with Different Initial Training Set Sizes on Yeast Dataset
Performance Comparisons of Multi-Ensemble with Different Initial Training Set Sizes on E. coli Dataset
3.4. Comparing with base classifiers
The multi-ensemble method assembles multiple base classifiers to predict essential proteins. To assess its efficiency, we design four different ensemble strategies to combine three different base classifiers, namely multi-ensemble1, multi-ensemble2, multi-ensemble3, and multi-ensemble4. In multi-ensemble1, softmax regression classifier, Xgboost, and Decision tree are selected as base classifiers. Multi-ensemble2 combines softmax regression classifier, Adaboost, and Random Forest. The three base classifiers in multi-ensemble3 are softmax regression classifier, Gradient Boosting Decision Tree (GBDT) (Friedman, 2001), and Random Forest. For multi-ensemble4, the three base classifiers include softmax regression classifier, Xgboost, and Random Forest.
We compare the four multi-ensemble methods with popular individual base classifiers, such as the softmax regression classifier, Decision tree, and Random Forest, as well as state-of-the-art ensemble learning methods, that is, Adaboost, Xgboost, and GBDT. It should be noted that the multi-ensemble method can integrate any type of base classifier to predict essential proteins, whereas the default base classifiers of Adaboost, GBDT, and Xgboost are Decision tree and Regression tree. We also compare the multi-ensemble methods with the LR model that is employed to assemble the predictions of multiple base classifiers. In the multi-ensemble method, the base classifiers are trained on the high-quality samples that are agreed on by most of the other base classifiers. To investigate the effectiveness of this training strategy, we also report the performance of the base classifiers trained on the selected high-quality samples, namely Softmax1, Random Forest1, Decision tree1, Adaboost1, and Xgboost1. Tables 6 and 7 show all the comparison results on the Yeast and E. coli datasets.
Performance Comparisons of Multi-Ensemble and Individual Classifiers on Yeast Dataset
GBDT, Gradient Boosting Decision Tree; LR, logistic regression.
Performance Comparisons of Multi-Ensemble and Individual Classifiers on E. coli Dataset
We note that on the two datasets, all four multi-ensemble methods generate better prediction results than their individual base classifiers, the LR method, and the other ensemble learning methods, such as Adaboost, GBDT, and Xgboost. This suggests that our multi-ensemble method can effectively integrate multiple different base classifiers and improve essential protein prediction. Moreover, it is interesting that almost all the normal base classifiers, such as Softmax, Random Forest, and the Decision tree, produce better prediction results by selecting high-quality samples on the two datasets. For example, on the Yeast data the Decision tree improves its F-measure from 0.515 to 0.5296, although its reported AUC values (0.7592 without and 0.7538 with sample selection) do not show a corresponding gain; on the E. coli data it improves its F-measure from 0.3583 to 0.3661 and its AUC from 0.7364 to 0.7371. For the other ensemble learning methods, the F-measure of Xgboost increases from 0.5278 to 0.5347 on the Yeast dataset and that of Adaboost increases from 0.3425 to 0.3504 on the E. coli dataset. These results indicate that our strategy for generating base classifiers can improve the accuracy of the base classifiers and therefore contributes to more accurate final predictions.
3.5. Comparison with other machine-learning methods
In previous research, some machine-learning methods were proposed to predict essential proteins, such as gene expression programming (GEP) (Zhong et al., 2013), SVM, sequential minimal optimization (SMO), J48, Random Tree, radial basis function (RBF) Network, NaiveBayes, Bayes Network, and NaiveBayes Tree. GEP is a recently proposed method that utilizes gene expression programming to learn topological features and biological features for predicting essential proteins. To further evaluate the performance of our methods, we compare them with GEP and the other popular machine-learning methods. All of the other machine-learning methods were implemented with the WEKA software using default parameters. Table 8 shows the comparison between multi-ensemble and the other machine-learning methods on the Yeast and E. coli datasets in terms of the AUC values. We can see from Table 8 that the AUC values of multi-ensemble4 are 0.7847 on the Yeast data and 0.7828 on the E. coli data, which are 1.17% and 0.38% higher than those of GEP, the best predictor among all of the other machine-learning methods. Multi-ensemble4 integrates three base classifiers, namely the softmax regression classifier, Xgboost, and Random Forest, and produces more accurate predictions of essential proteins than these machine-learning methods.
Area Under the Receiver Operating Characteristic Curve Values Comparison Between Multi-Ensemble and Other Machine-Learning Methods on Yeast and E. coli Datasets
GEP, gene expression programming; RBF, Radial Basis Function Network; SMO, sequential minimal optimization; SVM, support vector machines.
3.6. Comparison with deep learning methods
In recent years, deep learning methods have been widely used in the bioinformatics field. We therefore also compared our method with some state-of-the-art deep learning models, such as LeNet-5 (Lecun et al., 1998), Recurrent Neural Network (RNN) (Graves, 1997), Bidirectional RNN (Schuster and Paliwal, 1997), and CapsNet (Sabour et al., 2017). All parameters of these deep learning models were set to default values, and the size of the convolution kernels was 1 × 2. Tables 9 and 10 show the comparisons between the multi-ensemble method and the deep learning models on the Yeast dataset and the E. coli dataset. Our multi-ensemble4 obtains better prediction results than these deep learning models.
Comparisons Between Multi-Ensemble and Other Deep Learning Methods on Yeast Dataset
BRNN, Bidirectional Recurrent Neural Network; RNN, Recurrent Neural Network.
Comparisons Between Multi-Ensemble and Other Deep Learning Methods on E. coli Dataset
4. CONCLUSIONS
In this article, a novel ensemble framework called multi-ensemble was proposed to predict essential proteins. The multi-ensemble method is constructed in two crucial steps. First, it generates different base classifiers by feeding high-quality samples. The second step involves combining these classifiers with an LR model. In contrast to previous ensemble learning methods, our multi-ensemble method trains the base classifiers by high-quality samples that are determined by most of the other base classifiers. Hence, these base classifiers depend on each other. The multi-ensemble method can flexibly integrate any number and any kind of base classifiers. We employ the multi-ensemble method to assemble several base classifiers, including softmax regression classifier, Decision tree, random forest, Adaboost, and Xgboost, and we find that assembling three base classifiers can produce better results than assembling other numbers of base classifiers. We design four different ensemble strategies to combine three different base classifiers by the multi-ensemble method and apply them to predict essential proteins of Yeast and E. coli datasets. The experimental results on the two datasets show that all of the four multi-ensemble methods outperform both their individual base classifiers, LR method and the other ensemble learning methods, such as Adaboost, GBDT, and Xgboost. Moreover, most of the base classifiers improve their overall performance by selecting high-quality samples. Besides, the multi-ensemble method also has better prediction performance than the other machine-learning methods and the deep learning models.
Footnotes
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicting financial interests.
FUNDING INFORMATION
This work is supported in part by the National Natural Science Foundation of China under Grant No. 61972185, No. 61472133, the Natural Science Foundation of Yunnan Province of China (No. 2019FA024), Yunnan Key Research and Development Program (2018IA054), Yunnan Ten Thousand Talents Plan Young, and Natural Science Foundation of Hunan Province of China under Grant 2018JJ2262.
