A Novel Multi-Ensemble Method for Identifying Essential Proteins

Abstract

Essential proteins possess critical functions for cell survival. Identifying essential proteins improves our understanding of how a cell works and also plays a vital role in the research fields of disease treatment and drug development. Recently, some machine-learning methods and ensemble learning methods have been proposed to identify essential proteins by introducing effective protein features. However, existing ensemble learning methods have focused mainly on the choice of base classifiers. In this article, we propose a novel ensemble learning framework called multi-ensemble to integrate different base classifiers. The multi-ensemble method adopts the idea of multi-view learning: it selects multiple base classifiers and trains them by continually adding the samples that are predicted correctly by the other base classifiers. We applied multi-ensemble to Yeast data and Escherichia coli data. The results show that our approach achieved better performance than both individual classifiers and the other ensemble learning methods.

1. INTRODUCTION

Essential proteins play an indispensable role in cell survival (Ren and Yan, 2009). Without them, cells lose important functions and may even die. Some studies have pointed out that essential proteins are closely related to human disease genes (Wang et al., 2013; Song et al., 2019a, b). Identifying essential proteins can help scientists understand how a cell works and determine the minimum requirements for cell survival. Hence, identifying essential proteins plays a vital role in the emerging field of synthetic biology and is also crucial for treating diseases and developing new drugs.

Recently, many computational methods have been proposed to identify essential proteins based on one or several of their features. Among the most important are their topological features in a biological network. Previous studies supposed that essential proteins tend to lie at the center of the protein-protein interaction (PPI) network because removing them causes the network to collapse (Jeong et al., 2001). Therefore, some centrality methods that measure the importance of a node in the network have been utilized to detect essential proteins, including Degree Centrality (DC) (Vallabhajosyula et al., 2009), Betweenness Centrality (BC) (Freeman, 1977; Joy et al., 2014), Eigenvector Centrality (EC) (Bonacich, 1987), Information Centrality (IC) (Stephenson and Zelen, 1989), Closeness Centrality (CC) (Stefan and Stadler, 2003), Subgraph Centrality (SC) (Ernesto and Rodríguez-Velázquez, 2005), and Edge Clustering Coefficient Centrality (NC) (Wang et al., 2012).

Besides the topological features, there are some biological features related to the essentiality of proteins. For example, essential proteins are more conserved than nonessential ones (Jordan et al., 2002) and they tend to consist of more protein domains that less frequently appear in other proteins (Peng et al., 2015a). Essential proteins implement critical functions in certain subcellular components and under certain conditions (Peng et al., 2015b; Zhang et al., 2019). To improve the prediction accuracy, a group of researchers focus on combining the biological features with the topological features to mark essential proteins (Peng et al., 2015b; Zhang et al., 2019). Since different features contain some knowledge, they complement each other and can comprehensively describe the essentiality of proteins. Peng et al. (2012) design an iterative method to predict essential proteins by integrating the proteins' conservative property and their topological property in the PPI networks. The same group introduces protein domain information into a PPI network to make predictions. Li et al. combine subcellular localization information, orthology information, and PPI networks to predict essential proteins (Li et al., 2016). Considering the error-prone PPI networks and the dynamicity of protein functions, some methods [PeC (Min et al., 2013), weighted degree centrality (WDC) (Tang et al., 2012), fusing the dynamic PPI networks of different time points (FDP) (Zhang et al., 2019)] employ gene expression profiles to purify the PPI network or construct a dynamic PPI network and then propose novel methods to detect essential proteins based on the filtered PPI network.

The methods mentioned earlier are unsupervised methods that select essential proteins by assigning them ranking scores. The proteins ranked in the top list are selected as candidates for essential proteins (Song et al., 2019a, b). This kind of method need not know the essentiality labels ahead of time. The supervised machine-learning method is also a popular way to integrate multiple features and has been applied to predict essential proteins (Wang et al., 2013). Chen and Xu incorporate four types of features, including growth rates of mutants, protein sequence data, PPI data, and gene expression data, into a neural network or support vector machine (SVM) classifier to decide the protein dispensability (Chen and Xu, 2005). Gustafson et al. input the topological features and the biological features into a Naive Bayes classifier to predict essential proteins (Gustafson et al., 2006). Hwang et al. use an SVM classifier to combine multiple topological features, such as DC, BC, and CC, and biological features, such as open reading frame, paralogs, and strand, to predict essential proteins (Hwang et al., 2009). Zhong et al. (2013) design a Gene Expression Programming-based method that learns topological features and biological features for predicting essential proteins. Some ensemble learning methods assemble multiple base classifiers to identify essential proteins.

Acencio and Lemke take a voting strategy to integrate multiple decision tree classifiers to make final essentiality predictions (Acencio and Lemke, 2009). They include subcellular localization information, biological features, and network topological features in the classifiers. Deng et al. train and test four classifiers independently, namely a Naive Bayes classifier, a C4.5 decision tree, a CN2 rule learner, and a logistic regression model, and then produce final prediction results by combining the outputs of these diverse classifiers through an unweighted average approach (Jingyuan et al., 2011). Diverse classifiers can capture different but complementary features of essential proteins, and ensemble learning methods that integrate them usually achieve better prediction performance than a single classifier. However, previous ensemble learning methods combine different classifiers with straightforward strategies, such as voting or unweighted averaging.

To improve the performance of essential protein prediction, in this work, we present a novel ensemble learning framework, namely multi-ensemble, to integrate different base classifiers. The idea of ensemble learning is to build a powerful prediction classifier by combining the strengths of a collection of base classifiers. Typically, an ensemble learning method is constructed in two steps. First, it generates different base classifiers in a parallel or sequential way. The second step is to combine these classifiers. Bagging and boosting are two popular kinds of ensemble learning methods. Bagging (Breiman, 1996) produces diverse classifiers by subsampling the training data with replacement. These classifiers are independent of each other and are built in a parallel style. After that, bagging combines them by majority voting to make predictions. Boosting trains classifiers in a sequential style, and the generation of a classifier influences the generation of its subsequent classifiers. Adaboost (Schapire et al., 1998), a famous boosting ensemble learning method, assigns weights to the data in the training set and to the classifiers and updates the weights based on the misclassification rate. Another example of the boosting method is Xgboost (Chen and Guestrin, 2016), which weights and sums the prediction scores of different regression trees to make predictions.

In contrast to previous ensemble learning methods, our multi-ensemble method uses high-quality samples to train base classifiers and assembles these classifiers by a stacking strategy. In the training process, the base classifiers depend on each other, because their training sets are not fixed but grow depending on the prediction results of the other classifiers. Only the high-quality training samples are selected to generate the base classifiers. A training sample is regarded as a high-quality sample if the other base classifiers agree by consensus that it is a positive or a negative sample. This idea is similar to the multi-view concept in tri-training for semi-supervised learning proposed by Zhou and Li (2005). Finally, a logistic regression (LR) model combines the results of these base classifiers for final prediction. The parameters in the LR model are estimated by minimizing the difference between the real labels and the prediction results generated by these base classifiers over the training data. We employ the multi-ensemble method to assemble several base classifiers, such as softmax regression classifier (Bishop, 2006), decision tree (Quinlan, 1986), random forest (Breiman, 2001), Adaboost (Schapire et al., 1998), and Xgboost (Chen and Guestrin, 2016). The experimental results show that the multi-ensemble method outperforms both individual base classifiers and the other state-of-the-art ensemble learning methods in predicting the essential proteins of Saccharomyces cerevisiae (Yeast) and Escherichia coli.

2. METHODS

The multi-ensemble method is constructed in two crucial steps. First, it generates different base classifiers by feeding high-quality samples. The second step involves combining these classifiers with an LR model. In the first step, the training data are partitioned into several parts. The high-quality training samples that are regarded as positive or negative samples by most of the other classifiers are selected to train the base classifiers. Figure 1 illustrates the framework of the multi-ensemble method for predicting essential proteins.

FIG. 1.

An overview of multi-ensemble. P_pred1, P_pred2, P_predm are the predicted probabilities of the multiple base classifiers on the samples in P_j and T_pred1, T_pred2, T_predm are the predicted probabilities of the multiple base classifiers on the test set. LR, logistic regression.

2.1. Generating base classifiers

Given a training set, we randomly divide it into two partitions, T and P. T accounts for about a quarter of the training set, and P is the rest of the training set. After that, m initial training sets, denoted by T₁, T₂…T_m, are generated for the m base classifiers by bootstrap sampling with replacement from T. The m initial training sets are input into the base classifiers, respectively. Mathematically, the relationship between T_i and T is $T_{1} \subset T, T_{2} \subset T, \dots, T_{m} \subset T$, where $T_{1} \cap T_{2} \cap \dots \cap T_{m} \neq \emptyset$. Meanwhile, P is divided into n independent subsets with no intersection, namely P₁, P₂…P_n. Mathematically, the relationship between P_i and P is $P = P_{1} \cup P_{2} \cup \dots \cup P_{n}$, where $P_{1} \cap P_{2} \cap \dots \cap P_{n} = \emptyset$.
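The partitioning step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `partition_training_set`, the fixed seed, the choice to draw each bootstrap set with the same size as T, and the strided split of P are all assumptions made for the example.

```python
import random

def partition_training_set(samples, m, n, t_fraction=0.25, seed=0):
    """Split samples into T (about a quarter) and P, then derive T_1..T_m and P_1..P_n."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    t_size = round(len(shuffled) * t_fraction)
    T, P = shuffled[:t_size], shuffled[t_size:]
    # Bootstrap sampling with replacement from T, one set per base classifier.
    # Assumption: each bootstrap set has the same size as T.
    bootstrap_sets = [[rng.choice(T) for _ in range(len(T))] for _ in range(m)]
    # Disjoint subsets P_1..P_n whose union is P (every n-th element from offset r).
    p_subsets = [P[r::n] for r in range(n)]
    return T, P, bootstrap_sets, p_subsets

T, P, T_sets, P_sets = partition_training_set(list(range(100)), m=3, n=4)
print(len(T), len(P), [len(p) for p in P_sets])
```

Because the bootstrap sets are drawn with replacement from the same T, they typically overlap, whereas the subsets of P never share a sample.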

Given m base classifiers C₁, C₂…C_m, they are initially trained on the data in T₁, T₂…T_m, respectively. After that, every base classifier predicts the data in P_i, where $i \in \{1, 2, \dots, n\}$. At each round of prediction, the samples in P_i are appended to the training set of a base classifier if most of the other base classifiers regard them as high-quality samples. The high-quality samples are determined by the following equation: $| C_{i} (x) - μ_{i} | > ω * σ_{i}^{2}, i \in [1, m]$ (1)

where C_i(x) is the output probability that the sample x is considered positive by the base classifier C_i. $σ_{i}^{2}$ and $μ_{i}$ are the variance and mean of the output positive probability of the base classifier C_i over all samples. $ω$ is a custom coefficient, which was set to 3 on the Yeast dataset and to 1 on the E. coli dataset. Not all samples improve a base classifier, and some low-quality samples may even introduce errors, so we discard samples whose output positive probabilities are close to the mean value. We expect the probability of being recognized as an essential protein to be close to zero for a nonessential protein and close to 1 for an essential protein. For a base classifier C_i, a sample is appended to its training set if most of the other classifiers regard it as a high-quality sample. Here, "most of the other classifiers" means at least two-thirds of the total number of base classifiers. Mathematically, a sample is appended to the training set of the base classifier C_i if the number of other classifiers recognizing it as a high-quality sample satisfies the following equation: $\sum_{j = 1, j \neq i}^{m} S_{j} (x) \geq \frac{2}{3} m$ (2)

where S_j(x) is 1 if the base classifier C_j regards the sample x as a high-quality sample and 0 otherwise; m is the total number of base classifiers.
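As an illustration of Equations (1) and (2), the following sketch selects the samples that would be appended to the training set of one base classifier. It is only a sketch under stated assumptions: the helper names are hypothetical, and the mean and variance of each classifier's output are computed over the batch of samples passed in, since the text does not state exactly which sample set is used for these statistics.

```python
from statistics import mean, pvariance

def is_high_quality(prob, mu, var, omega):
    """Eq. (1): the positive probability lies far from the classifier's mean output."""
    return abs(prob - mu) > omega * var

def select_for_classifier(i, probs, omega):
    """probs[j][k] is the positive probability of sample k under classifier C_j.

    Returns the indices of samples that at least two-thirds of the m classifiers,
    counted among the classifiers other than C_i, mark as high quality (Eq. 2).
    """
    m = len(probs)
    stats = [(mean(p), pvariance(p)) for p in probs]
    selected = []
    for k in range(len(probs[0])):
        votes = sum(
            is_high_quality(probs[j][k], stats[j][0], stats[j][1], omega)
            for j in range(m) if j != i
        )
        if votes >= 2 * m / 3:
            selected.append(k)
    return selected

probs = [
    [0.90, 0.5, 0.10, 0.5],  # C_1
    [0.95, 0.5, 0.05, 0.5],  # C_2
    [0.50, 0.5, 0.50, 0.5],  # C_3 (uninformative)
]
print(select_for_classifier(2, probs, omega=1))
print(select_for_classifier(0, probs, omega=1))
```

With m = 3 the threshold in Equation (2) equals 2, so both of the other two classifiers must agree before a sample is appended.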

After appending the high-quality samples in P_i to the training set of a base classifier, the classifier is trained again and predicts the samples in P_j ($j \neq i$). The high-quality samples in P_j are selected according to the criteria described earlier and appended to the training set of the base classifier again. The process is repeated n times until all samples in $\{P_{1}, P_{2}, \dots, P_{n}\}$ are predicted by the base classifiers. Ultimately, with the increase in the training samples, the ability of every base classifier is improved and the base classifiers are generated.

2.2. Integrating base classifiers

After generating the base classifiers by using high-quality samples, an LR model is adopted to integrate the outputs of multiple base classifiers and to obtain the final predictions. Mathematically, the LR model is defined as follows: $x = β_{0} + β_{1} x_{1} + β_{2} x_{2} + \dots + β_{m} x_{m}$ (3) $f (x) = \frac{1}{1 + e^{- x}}$ (4)

where $x_{1}, x_{2}, \dots, x_{m}$ are the outputs of multiple base classifiers, and $β_{0}, β_{1}, β_{2}, \dots, β_{m}$ are the intercept and the coefficients of the variables $x_{1}, x_{2}, \dots, x_{m}$. To estimate the coefficients, we collect the prediction results of every base classifier for the samples in $\{P_{1}, P_{2}, \dots, P_{n}\}$ in the course of training the base classifiers. Then, the prediction results and corresponding real labels are used as the input of the LR model to learn its coefficients. This process is also a stacking integration strategy. After the LR model is fitted, the test data are fed into the base classifiers generated at the first step and the LR model combines the outputs of these base classifiers to make the final predictions. Algorithm 1 illustrates the whole process of the multi-ensemble method.
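Equations (3) and (4) can be made concrete with a small pure-Python sketch. The text does not specify how the coefficients are optimized, so plain batch gradient descent on the log-loss is assumed here; the function names, learning rate, and epoch count are illustrative choices, not values from the paper.

```python
import math

def sigmoid(z):
    """Eq. (4): logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_lr(beta, row):
    """Eq. (3) followed by Eq. (4); beta = [β0, β1, ..., βm], row = [x1, ..., xm]."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
    return sigmoid(z)

def fit_lr(X, y, lr=0.5, epochs=2000):
    """Estimate β by batch gradient descent on the log-loss (an assumed optimizer)."""
    m = len(X[0])
    beta = [0.0] * (m + 1)
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for row, label in zip(X, y):
            err = predict_lr(beta, row) - label
            grad[0] += err
            for k in range(m):
                grad[k + 1] += err * row[k]
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Each row holds the positive probabilities produced by three base classifiers.
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.2, 0.1, 0.3], [0.1, 0.3, 0.2]]
y = [1, 1, 0, 0]
beta = fit_lr(X, y)
print(predict_lr(beta, [0.9, 0.9, 0.9]), predict_lr(beta, [0.1, 0.1, 0.1]))
```

Fitting the combiner on base-classifier outputs rather than on raw features is what makes this a stacking strategy: the LR model learns how much to trust each base classifier.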

Algorithm 1: Multi-ensemble: Ensemble model
Input: T: Part of the training set
  P: The remaining part of the training set, P = $P_{1} \cup P_{2} \cup \dots \cup P_{n}$
  Learn: Trains a base classifier
  LR: Logistic regression algorithm that estimates its coefficients
  Test: The test set
Output: pred: Predicted probability values
Begin
  for $i \in \{1, 2, \dots, m\}$ do
    T_i ← BootstrapSample (T)
    C_i ← Learn (T_i)
  end
  for $r \in \{1, 2, \dots, n\}$ do
    for $i \in \{1, 2, \dots, m\}$ do
      $S_{i}$ ← $\emptyset$
      foreach $x \in P_{r}$ do
        $δ_{i} (x)$ ← $|C_{i} (x) - μ_{i}|$
        if $δ_{j} (x) > ω * σ_{j}^{2}$ and $δ_{k} (x) > ω * σ_{k}^{2}$ … and $δ_{m} (x) > ω * σ_{m}^{2}$ ($j, k, \dots, m \neq i$)
        then $S_{i}$ ← $S_{i} \cup \{x\}$
      end
      C_i ← Learn ($S_{i} \cup T_{i}$)
    end
    D ← $\{C_{i} (P_{r}) \cup C_{j} (P_{r}) \dots \cup C_{m} (P_{r})\}^{T}$
  end
  L ← LR (D)
  pred ← L (Test)
End
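The training loop of Algorithm 1 can be sketched end to end in pure Python. Several parts below are assumptions made only to keep the example self-contained: `CentroidClassifier` is a hypothetical stand-in for the real base classifiers, the bootstrap is redrawn until both classes are present, the selection statistics are computed once per round, and selected samples stay in a classifier's training set for later rounds.

```python
import math
import random
from statistics import mean, pvariance

class CentroidClassifier:
    """Hypothetical toy base classifier (not one of the classifiers used in the paper)."""
    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.c_pos = [mean(col) for col in zip(*pos)]
        self.c_neg = [mean(col) for col in zip(*neg)]
        return self

    def predict_proba(self, x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, self.c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, self.c_neg))
        return 1.0 / (1.0 + math.exp(d_pos - d_neg))

def bootstrap(T, rng):
    """Sample |T| items with replacement; redraw until both classes appear (assumption)."""
    while True:
        sample = [rng.choice(T) for _ in T]
        if {label for _, label in sample} == {0, 1}:
            return sample

def multi_ensemble_train(T, P_parts, m, omega, rng):
    """T and every P_r are lists of (features, label) pairs."""
    train = [bootstrap(T, rng) for _ in range(m)]
    clfs = [CentroidClassifier().fit(*zip(*t)) for t in train]
    D = []  # rows of (base-classifier outputs, true label) for the LR combiner
    for part in P_parts:
        probs = [[c.predict_proba(x) for x, _ in part] for c in clfs]
        stats = [(mean(p), pvariance(p)) for p in probs]
        for i in range(m):
            for k, sample in enumerate(part):
                votes = sum(
                    abs(probs[j][k] - stats[j][0]) > omega * stats[j][1]
                    for j in range(m) if j != i
                )
                if votes >= 2 * m / 3:
                    train[i].append(sample)  # high-quality sample for C_i
            clfs[i] = CentroidClassifier().fit(*zip(*train[i]))
        # Outputs of the retrained classifiers on P_r, collected for stacking.
        D.extend(([c.predict_proba(x) for c in clfs], label) for x, label in part)
    return clfs, D

pos = [([1 + 0.01 * k, 1 - 0.01 * k], 1) for k in range(20)]
neg = [([0.01 * k, 0.02 * k], 0) for k in range(20)]
T = pos[:10] + neg[:10]
P_parts = [pos[10:15] + neg[10:15], pos[15:] + neg[15:]]
clfs, D = multi_ensemble_train(T, P_parts, m=3, omega=1, rng=random.Random(0))
print(len(clfs), len(D))
```

The rows collected in `D` are what a logistic regression combiner would be fitted on before the test set is scored.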

3. RESULTS

3.1. Datasets

To test the effectiveness of our multi-ensemble method, we apply it to predict essential proteins of S. cerevisiae (Yeast) and E. coli. The Yeast essential proteins were from the Munich Information Center for Protein Sequences (MIPS) database (Mewes et al., 2004), the Saccharomyces Genome Database (SGD) (Cherry et al., 1998), the Database of Essential Genes (DEG) (Ren and Yan, 2009), and the Saccharomyces Genome Deletion Project (SGDP) database. The Yeast PPI network data were downloaded from the Database of Interacting Proteins (DIP) (Ioannis et al., 2002) and consist of 5093 proteins and 24,743 edges. Among the 5093 Yeast proteins, 1167 were essential and 3926 were nonessential. The essential proteins and PPI network data of E. coli came from the DEG database and DIP database, respectively. These involve 2727 proteins and 11,803 edges. Among the 2727 E. coli proteins, 254 were essential and 2473 were nonessential.

The topological features have proved to be typical features of essential proteins in the PPI network and exhibit powerful performance on essential protein prediction (Wang et al., 2013; Song et al., 2019a). Moreover, essential proteins are likely to be enriched in certain types of subcellular localization (Li et al., 2016). Similar to previous methods (Zhong et al., 2013), 16 subcellular localization features and 10 topological features were concatenated as the input features of the classifiers in the prediction of Yeast essential proteins. The 16 subcellular localization features were downloaded from the Eukaryotic Subcellular Localization DataBase (eSLDB) (Andea et al., 2007), including Vacuole, Vesicles, Peroxisome, Secretory pathway, Transmembrane, Lysosome, Membrane, Mitochondrion, Nucleus, Golgi, Endoplasmic reticulum, Endosome, Extracellular, Cytoplasm, Cell wall, and Cytoskeleton.

Table 1 lists the 10 topological features, including seven centrality methods and three composite features. The seven centrality methods include BC (Freeman, 1977; Joy et al., 2014), CC (Stefan and Stadler, 2003), DC (Vallabhajosyula et al., 2009), EC (Bonacich, 1987), IC (Stephenson and Zelen, 1989), NC (Wang et al., 2012), and SC (Ernesto and Rodríguez-Velázquez, 2005), which were calculated by the Cytoscape plugin CytoNCA (Tang et al., 2015). The three composite features were PeC (Min et al., 2013), WDC (Tang et al., 2012), and ION (Peng et al., 2012). ION predicts essential proteins by integrating orthology data with the PPI network. PeC and WDC combine the PPI network with gene expression profiles to predict essential proteins. The orthology data in ION were from the InParanoid database (Gabriel et al., 2010), and the gene expression profiles in PeC and WDC were from Tu et al. (2005). When predicting E. coli essential proteins, eight topological properties, including BC, CC, DC, EC, IC, NC, SC, and ION, were selected as features of the classifiers.

Table 1.

Description and Equation of Topological Features

Feature name	Description	Equation
BC	Betweenness Centrality	$δ_{uv}(k) = \frac{p(u, k, v)}{p(u, v)}, u \neq k \neq v$, $BC(k) = \sum_{u \in V} \sum_{v \in V} δ_{uv}(k)$
CC	Closeness Centrality	$CC(u) = \frac{N - 1}{\sum_{v \in V} \mathrm{dis}(u, v)}$
DC	Degree Centrality	$DC(v) = \sum_{u} \mathrm{edge}(u, v)$
EC	Eigenvector Centrality	$EC(u) = e_{max}(u)$
IC	Information Centrality	$I_{uv} = (C_{uu} + C_{vv} - C_{uv})^{-1}$, $C = (D - A + J)^{-1}$, $IC(u) = [\frac{1}{N} \sum_{v} \frac{1}{I_{uv}}]^{-1}$
NC	Edge Clustering Coefficient Centrality	$ECC(u, v) = \frac{z_{u,v}}{\min\{DC(u) - 1, DC(v) - 1\}}$, $NC(u) = \sum_{v} ECC(u, v)$
SC	Subgraph Centrality	$SC(u) = \sum_{l = 0}^{\infty} \frac{u_{l}(u)}{l!} = \sum_{v = 1}^{N} [a_{v}(u)]^{2} e^{λ_{v}}$
ION	Integration of the orthology and PPI networks	$h(u, v) = \begin{cases} \frac{ECC(u, v)}{\sum_{ω \in Ne(u)} ECC(u, ω)}, & \text{if } \sum_{ω \in Ne(u)} ECC(u, ω) \neq 0 \\ 0, & \text{otherwise} \end{cases}$, $ION(u) = (1 - α) d(u) + α \sum_{v \in Ne(u)} h(u, v) ION(v)$
PeC	Integration of gene expression profiles and PPI data	$PeC(u) = \sum_{v \in N_{u}} ECC(u, v) \times PCC(u, v)$
WDC	Integration of gene expression profiles and PPI data and addition of the parameters to adjust the proportion	$WDC(u) = \sum_{v \in N_{u}} [ECC(u, v) \times λ + PCC(u, v) \times (1 - λ)]$

PPI, protein-protein interaction; WDC, weighted degree centrality.
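To make two of the entries in Table 1 concrete, the sketch below computes DC and NC on a toy undirected graph stored as an adjacency dictionary. The function names and the toy graph are illustrative. Two points are assumptions not spelled out in the table: $z_{u,v}$ is taken to be the number of triangles that contain the edge (u, v), and ECC is set to 0 when its denominator is 0.

```python
def degree_centrality(adj):
    """DC(v): number of edges incident to v."""
    return {u: len(nbrs) for u, nbrs in adj.items()}

def edge_clustering_coefficient(adj, u, v):
    """ECC(u, v) = z_uv / min(DC(u) - 1, DC(v) - 1)."""
    z = len(adj[u] & adj[v])  # common neighbours = triangles through edge (u, v)
    denom = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return z / denom if denom > 0 else 0.0  # assumption: 0 when undefined

def nc_centrality(adj):
    """NC(u): sum of ECC(u, v) over the neighbours v of u."""
    return {
        u: sum(edge_clustering_coefficient(adj, u, v) for v in nbrs)
        for u, nbrs in adj.items()
    }

# Toy network: a triangle A-B-C with a pendant node D attached to C.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(degree_centrality(adj))  # {'A': 2, 'B': 2, 'C': 3, 'D': 1}
print(nc_centrality(adj))      # {'A': 2.0, 'B': 2.0, 'C': 2.0, 'D': 0.0}
```

In this toy graph C has the highest degree, yet its NC equals that of A and B because the edge to D lies in no triangle, which shows how the two measures can rank nodes differently.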

3.2. Evaluation metrics

There are usually more nonessential proteins than essential proteins in the Yeast and E. coli datasets. The ratio of essential proteins to nonessential proteins is 1:3.36 (1167:3926) in the Yeast dataset and 1:9.74 (254:2473) in the E. coli dataset. Hence, we divided all sample data into five parts at random while keeping in each part the same ratio of essential proteins to nonessential proteins as in the original data, that is, 1:3.36 in the Yeast dataset and 1:9.74 in the E. coli dataset. In the course of validation, four of the five parts constituted the training set and the remaining part was the testing set. The process was repeated five times until each of the five parts had been used for testing. The prediction results were checked against the real labels, and some popular statistical metrics, including specificity (SP), sensitivity (SN), false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), F-measure, accuracy (ACC), and Matthews Correlation Coefficient (MCC), were calculated to measure the prediction performance. They were defined as follows:

$\mathrm{SN} = \frac{TP}{TP + FN}$ (5)

$\mathrm{SP} = \frac{TN}{TN + FP}$ (6)

$\mathrm{FPR} = \frac{FP}{FP + TN}$ (7)

$\mathrm{PPV} = \frac{TP}{TP + FP}$ (8)

$\mathrm{NPV} = \frac{TN}{TN + FN}$ (9)

$\text{F-measure} = \frac{2 * TP}{2 * TP + FP + FN}$ (10)

$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$ (11)

$\mathrm{MCC} = \frac{TP * TN - FP * FN}{\sqrt{(TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)}}$ (12)

where TP indicates the number of essential proteins that are correctly predicted as essential, TN indicates the number of nonessential proteins that are correctly predicted as nonessential, FP indicates the number of nonessential proteins that are incorrectly predicted as essential ones, and FN indicates the number of essential proteins that are missed by the predictors. Besides, the area under the receiver operating characteristic (ROC) curve (AUC) was calculated to evaluate the overall performance of each method. We calculate the AUC score separately for each cross-validation fold and average the scores to obtain a final AUC value for each method. In 5-fold cross-validation, each of the five parts is used once for testing, so every protein receives a probability of being an essential protein from a given method. We rank all proteins in descending order of this probability and select the first 1167 proteins in the Yeast dataset and the first 254 proteins in the E. coli dataset as candidate essential proteins (according to the number of real essential proteins). After that, TP, FP, TN, and FN were calculated for each method based on the candidate essential proteins and the benchmark datasets, and the SN, SP, FPR, PPV, NPV, F-measure, ACC, and MCC values were derived from them.
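The ranking-based evaluation and Equations (5) to (12) can be illustrated with a short sketch. The function names and the six-protein toy input are made up for the example, and the code assumes that k is positive and that both classes occur, so that no denominator is zero.

```python
import math

def top_k_confusion(scores, labels, k):
    """Treat the k highest-scoring proteins as predicted essential; return TP, FP, TN, FN."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(order[:k])
    tp = sum(1 for i in predicted if labels[i] == 1)
    fp = k - tp
    fn = sum(labels) - tp
    tn = len(labels) - tp - fp - fn
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Equations (5) to (12)."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "F-measure": 2 * tp / (2 * tp + fp + fn),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / mcc_den,
    }

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # predicted probabilities
labels = [1, 0, 1, 1, 0, 0]              # 1 = essential
k = sum(labels)                          # as many candidates as real essential proteins
print(metrics(*top_k_confusion(scores, labels, k)))
```

Because the number of selected candidates equals the number of real essential proteins, TP + FP = TP + FN and therefore FP = FN. As a consequence SN, PPV, and F-measure coincide, as do SP and NPV, which is why these columns carry identical values in the result tables.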

3.3. Parameter settings

3.3.1. The number of the base classifiers

The multi-ensemble method integrates multiple base classifiers to predict essential proteins, and users can customize the number of base classifiers. To determine the number of base classifiers with which the multi-ensemble method performs best on essential protein prediction, we ran experiments with 2, 3, 4, and 5 base classifiers on the Yeast data and E. coli data. The base classifiers include a softmax regression classifier (Bishop, 2006), a decision tree (Quinlan, 1986), a random forest (Breiman, 2001), Adaboost (Schapire et al., 1998), and Xgboost (Chen and Guestrin, 2016). Since each number of classifiers allows many different combinations, only the best-performing combination for each number was selected for comparison. Tables 2 and 3 show the comparisons of SN, SP, FPR, PPV, NPV, F-measure, ACC, MCC, and AUC for the multi-ensemble method with different numbers of base classifiers on the Yeast dataset and the E. coli dataset. As can be seen from Tables 2 and 3, the multi-ensemble method achieves the best performance when integrating three base classifiers. Hence, in the following comparisons, the multi-ensemble method integrates three different base classifiers to predict essential proteins.

Table 2.

The Performance Comparisons of the Multi-Ensemble Method with Different Numbers of Base Classifiers on Yeast Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
2 Classifiers	0.5433	0.8642	0.1358	0.5433	0.8642	0.5433	0.7907	0.4075	0.7806
3 Classifiers	0.5467	0.8653	0.1347	0.5467	0.8653	0.5467	0.7927	0.412	0.7847
4 Classifiers	0.5424	0.864	0.136	0.5424	0.864	0.5424	0.7903	0.4064	0.7794
5 Classifiers	0.5407	0.8635	0.1365	0.5407	0.8635	0.5407	0.7895	0.4042	0.7786

ACC, accuracy; AUC, area under the receiver operating characteristic (ROC) curve; FPR, false positive rate; MCC, Matthews Correlation Coefficient; NPV, negative predictive value; PPV, positive predictive value; SN, sensitivity; SP, specificity. The best and comparable results are in boldface.

Table 3.

The Performance Comparisons of the Multi-Ensemble Method with Different Numbers of Base Classifiers on Escherichia coli Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
2 Classifiers	0.3819	0.9365	0.0635	0.3819	0.9365	0.3819	0.8849	0.3184	0.7741
3 Classifiers	0.3898	0.9373	0.0627	0.3898	0.9373	0.3898	0.8863	0.3271	0.7828
4 Classifiers	0.3858	0.9369	0.0631	0.3858	0.9369	0.3858	0.8856	0.3227	0.7756
5 Classifiers	0.3858	0.9369	0.0631	0.3858	0.9369	0.3858	0.8856	0.3227	0.7785

3.3.2. The size of {T₁, T₂…T_m}

In our method, T was set to be a quarter of the training set. Bootstrap sampling was done on T to generate multiple datasets with the same size for every base classifier, denoted by {T₁, T₂…T_m}, where m is the number of base classifiers. In this work, the size of T was set to 1000 for the Yeast data and it was set to 530 for the E. coli data. To test the effect of initial training set size on the performance of our method, we did some comparative experiments by setting different sizes ranging from 500 to 1500 on the Yeast dataset and from 265 to 730 on the E. coli dataset. Tables 4 and 5 show the comparisons of SN, SP, FPR, PPV, NPV, F-MEASURE, ACC, MCC, and AUC for the multi-ensemble method with different sizes of {T₁, T₂…T_m} on the Yeast dataset and the E. coli dataset. It turns out that setting the size to 1000 for the Yeast dataset and to 530 for the E. coli dataset produces better results.

Table 4.

Performance Comparisons of Multi-Ensemble with Different Initial Training Set Sizes on Yeast Dataset

Size	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
500	0.5441	0.8645	0.1355	0.5441	0.8645	0.5441	0.7911	0.4086	0.7811
900	0.5424	0.864	0.136	0.5424	0.864	0.5424	0.7903	0.4054	0.7774
1000	0.5467	0.8653	0.1347	0.5467	0.8653	0.5467	0.7927	0.412	0.7847
1100	0.5467	0.8653	0.1347	0.5467	0.8653	0.5467	0.7927	0.412	0.7832
1500	0.5441	0.8645	0.1355	0.5441	0.8645	0.5441	0.7911	0.4086	0.7815

Table 5.

Performance Comparisons of Multi-Ensemble with Different Initial Training Set Sizes on E. coli Dataset

Size	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
265	0.3583	0.9341	0.0659	0.3583	0.9341	0.3583	0.8805	0.2924	0.7716
430	0.378	0.9361	0.0639	0.378	0.9361	0.378	0.8841	0.3141	0.7754
530	0.3898	0.9373	0.0627	0.3898	0.9373	0.3898	0.8863	0.3271	0.7828
630	0.374	0.9357	0.0643	0.374	0.9357	0.374	0.8834	0.3097	0.7749
730	0.378	0.9361	0.0639	0.378	0.9361	0.378	0.8841	0.3141	0.7791

3.4. Comparing with base classifiers

The multi-ensemble method assembles multiple base classifiers to predict essential proteins. To assess its efficiency, we design four different ensemble strategies to combine three different base classifiers, namely multi-ensemble¹, multi-ensemble², multi-ensemble³, and multi-ensemble⁴. In multi-ensemble¹, softmax regression classifier, Xgboost, and Decision tree are selected as base classifiers. Multi-ensemble² combines softmax regression classifier, Adaboost, and Random Forest. The three base classifiers in multi-ensemble³ are softmax regression classifier, Gradient Boosting Decision Tree (GBDT) (Friedman, 2001), and Random Forest. For multi-ensemble⁴, the three base classifiers include softmax regression classifier, Xgboost, and Random Forest.

We compare the four multi-ensemble methods with the popular individual base classifiers, such as softmax regression classifier, Decision tree, and Random Forest, as well as the state-of-the-art ensemble learning methods, that is, Adaboost, Xgboost, and GBDT. It should be noted that the multi-ensemble method can integrate any type of base classifier to predict essential proteins, whereas the default base classifiers of Adaboost, GBDT, and Xgboost are Decision tree and Regression tree. We also compare the multi-ensemble methods with the LR model that is employed to assemble the predictions of multiple base classifiers. In the multi-ensemble method, the base classifiers are trained with the high-quality samples that are agreed on by most other base classifiers. To investigate the effectiveness of this training strategy, we also compare the performance of the base classifiers that are trained with the selected high-quality samples, namely Softmax¹, Random Forest¹, Decision tree¹, Adaboost¹, and Xgboost¹. Tables 6 and 7 show all the comparison results on the Yeast and E. coli datasets.

Table 6.

Performance Comparisons of Multi-Ensemble and Individual Classifiers on Yeast Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
Softmax	0.5278	0.8597	0.1403	0.5278	0.8597	0.5278	0.7836	0.3875	0.7771
Softmax¹	0.5398	0.8632	0.1368	0.5398	0.8632	0.5398	0.7891	0.4031	0.781
Random Forest	0.5073	0.8535	0.1465	0.5073	0.8535	0.5073	0.7742	0.3608	0.7536
Random Forest¹	0.5236	0.8584	0.1416	0.5236	0.8584	0.5236	0.7817	0.3819	0.762
Decision tree	0.515	0.8558	0.1442	0.515	0.8558	0.515	0.7777	0.3708	0.7592
Decision tree¹	0.5296	0.8602	0.1398	0.5296	0.8602	0.5296	0.7844	0.3897	0.7538
Adaboost	0.5373	0.8625	0.1375	0.5373	0.8625	0.5373	0.7879	0.3997	0.7695
Adaboost¹	0.5244	0.8586	0.1414	0.5244	0.8586	0.5244	0.7821	0.3831	0.7636
GBDT	0.533	0.8612	0.1388	0.533	0.8612	0.533	0.786	0.3942	0.7738
GBDT¹	0.5338	0.8614	0.1386	0.5338	0.8614	0.5338	0.7864	0.3953	0.7679
Xgboost	0.5278	0.8597	0.1403	0.5278	0.8597	0.5278	0.7836	0.3837	0.7707
Xgboost¹	0.5347	0.8617	0.1383	0.5347	0.8617	0.5347	0.7868	0.3964	0.7692
LR	0.5321	0.8609	0.1391	0.5321	0.8609	0.5321	0.7856	0.3931	0.7762
Multi-ensemble¹	0.5441	0.8645	0.1355	0.5441	0.8645	0.5441	0.7911	0.4086	0.7819
Multi-ensemble²	0.5458	0.865	0.135	0.5458	0.865	0.5458	0.7919	0.4108	0.7833
Multi-ensemble³	0.5458	0.865	0.135	0.5458	0.865	0.5458	0.7919	0.4108	0.7831
Multi-ensemble⁴	0.5467	0.8653	0.1347	0.5467	0.8653	0.5467	0.7927	0.412	0.7847

GBDT, Gradient Boosting Decision Tree; LR, logistic regression.

Table 7.

Performance Comparisons of Multi-Ensemble and Individual Classifiers on E. coli Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
Softmax	0.3268	0.9309	0.0691	0.3268	0.9309	0.3268	0.8746	0.2576	0.7611
Softmax¹	0.3425	0.9325	0.0675	0.3425	0.9325	0.3425	0.8775	0.275	0.7659
Random Forest	0.3504	0.9333	0.0667	0.3504	0.9333	0.3504	0.879	0.2837	0.7612
Random Forest¹	0.3701	0.9353	0.0647	0.3701	0.9353	0.3701	0.8827	0.3054	0.7574
Decision tree	0.3583	0.9341	0.0659	0.3583	0.9341	0.3583	0.8805	0.2924	0.7364
Decision tree¹	0.3661	0.9349	0.0651	0.3661	0.9349	0.3661	0.8819	0.301	0.7371
Adaboost	0.3425	0.9325	0.0675	0.3425	0.9325	0.3425	0.8775	0.275	0.7606
Adaboost¹	0.3504	0.9333	0.0667	0.3504	0.9333	0.3504	0.879	0.2837	0.7649
GBDT	0.3425	0.9325	0.0675	0.3425	0.9325	0.3425	0.8775	0.275	0.7622
GBDT¹	0.3465	0.9329	0.0671	0.3465	0.9329	0.3465	0.8783	0.2793	0.7447
Xgboost	0.3701	0.9353	0.0647	0.3701	0.9353	0.3701	0.8827	0.3054	0.7674
Xgboost¹	0.378	0.9361	0.0639	0.378	0.9361	0.378	0.8841	0.3141	0.7694
LR	0.3228	0.9305	0.0696	0.3228	0.9305	0.3228	0.8739	0.2533	0.7617
Multi-ensemble¹	0.374	0.9357	0.0643	0.374	0.9357	0.374	0.8834	0.3097	0.7703
Multi-ensemble²	0.3819	0.9365	0.0635	0.3819	0.9365	0.3819	0.8849	0.3184	0.7787
Multi-ensemble³	0.3858	0.9369	0.0631	0.3858	0.9369	0.3858	0.8856	0.3227	0.7773
Multi-ensemble⁴	0.3898	0.9373	0.0627	0.3898	0.9373	0.3898	0.8863	0.3271	0.7828

On both datasets, all four multi-ensemble methods achieve higher AUC values than their individual base classifiers, the LR method, and the other ensemble learning methods, such as Adaboost, GBDT, and Xgboost. The threshold-based metrics follow the same pattern with one exception: on the E. coli data, the F-measure of multi-ensemble¹ (0.374) is slightly below that of Xgboost¹ (0.378). This suggests that our multi-ensemble method can effectively integrate multiple different base classifiers and improve essential protein prediction. Moreover, most of the ordinary base classifiers, such as Softmax, Random Forest, and Decision tree, obtain higher F-measure values when trained on the selected high-quality samples, although their AUC values do not always increase. For example, Decision tree improves its F-measure from 0.515 to 0.5296 on the Yeast data, while its AUC decreases slightly from 0.7592 to 0.7538; on the E. coli data, it improves its F-measure from 0.3583 to 0.3661 and its AUC from 0.7364 to 0.7371. For the other ensemble learning methods, the F-measure of Xgboost increases from 0.5278 to 0.5347 on the Yeast dataset and that of Adaboost increases from 0.3425 to 0.3504 on the E. coli dataset. These results suggest that our strategy for generating base classifiers can improve the accuracy of the base classifiers and thereby contribute to the final predictions.
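The threshold-based columns in the two tables above (SN, SP, FPR, PPV, NPV, F-measure, ACC, and MCC) can all be derived from the four confusion-matrix counts. The following minimal Python sketch uses the standard definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper. It also illustrates why SN, PPV, and F-measure coincide in every row: this happens whenever the number of false positives equals the number of false negatives, which is consistent with (although not confirmed by this section) predicting as many essential proteins as there are known ones. In that case MCC reduces to SN minus FPR, which matches the tabulated values (e.g., 0.5467 - 0.1347 = 0.412 for multi-ensemble⁴ on the Yeast data).

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics reported in the tables from confusion counts."""
    sn = tp / (tp + fn)    # sensitivity (recall)
    sp = tn / (tn + fp)    # specificity
    fpr = fp / (fp + tn)   # false positive rate, equal to 1 - SP
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    npv = tn / (tn + fn)   # negative predictive value
    f_measure = 2 * sn * ppv / (sn + ppv)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SN": sn, "SP": sp, "FPR": fpr, "PPV": ppv, "NPV": npv,
            "F": f_measure, "ACC": acc, "MCC": mcc}


# Hypothetical counts (NOT from the paper): 100 positives, 400 negatives,
# and exactly 100 samples predicted positive, so that FP == FN.
m = binary_metrics(tp=55, fp=45, tn=355, fn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these toy counts, SN, PPV, and F-measure are all 0.55, SP and NPV are both 0.8875, and MCC equals SN - FPR = 0.4375.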

3.5. Comparison with other machine-learning methods

In previous research, several machine-learning methods were proposed to predict essential proteins, such as gene expression programming (GEP) (Zhong et al., 2013), SVM, sequential minimal optimization (SMO), J48, Random Tree, radial basis function (RBF) Network, NaiveBayes, Bayes Network, and NaiveBayes Tree. GEP is a recently proposed method that uses gene expression programming to learn topological and biological features for predicting essential proteins, and it is the most accurate of these methods on both datasets. To further evaluate the performance of our methods, we compared them with GEP and the other popular machine-learning methods. All of the other machine-learning methods were implemented with the WEKA software using default parameters. Table 8 compares multi-ensemble with the other machine-learning methods on the Yeast and E. coli datasets in terms of AUC. The AUC values of multi-ensemble⁴ are 0.7847 on the Yeast data and 0.7828 on the E. coli data, which are 0.0117 and 0.0038 (1.17 and 0.38 percentage points) higher than those of GEP, the best predictor among the other machine-learning methods. Multi-ensemble⁴ integrates three base classifiers (the softmax regression classifier, Xgboost, and Random Forest) and produces more accurate predictions of essential proteins than these machine-learning methods.

Table 8.

Area Under the Receiver Operating Characteristic Curve Values Comparison Between Multi-Ensemble and Other Machine-Learning Methods on Yeast and E. coli Datasets

Methods	AUC of Yeast data	AUC of E. coli data
Multi-ensemble⁴	0.7847	0.7828
GEP	0.773	0.779
SVM	0.577	0.5
SMO	0.608	0.5
NaiveBayes	0.744	0.7437
Bayes Network	0.731	0.7258
RBF Network	0.669	0.7094
J48	0.687	0.7238
Random Tree	0.612	0.5846
NaiveBayes Tree	0.746	0.7204

GEP, gene expression programming; RBF, radial basis function; SMO, sequential minimal optimization; SVM, support vector machine.
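The AUC gap between multi-ensemble⁴ and GEP in Table 8 can be expressed either as an absolute difference (in percentage points) or relative to the AUC of GEP. The short Python sketch below computes both from the Table 8 values; the helper name `auc_gain` is hypothetical.

```python
# AUC values copied from Table 8.
auc = {
    "Yeast": {"multi-ensemble4": 0.7847, "GEP": 0.773},
    "E. coli": {"multi-ensemble4": 0.7828, "GEP": 0.779},
}


def auc_gain(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = (ours - baseline) * 100
    relative = (ours - baseline) / baseline * 100
    return round(points, 2), round(relative, 2)


gains = {name: auc_gain(v["multi-ensemble4"], v["GEP"]) for name, v in auc.items()}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points} percentage points, +{relative}% relative to GEP")
# Yeast:   +1.17 percentage points, +1.51% relative to GEP
# E. coli: +0.38 percentage points, +0.49% relative to GEP
```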

3.6. Comparison with deep learning methods

In recent years, deep learning methods have been widely used in bioinformatics. We therefore also compared our method with several deep learning models: LeNet-5 (Lecun et al., 1998), Recurrent Neural Network (RNN) (Graves, 1997), Bidirectional RNN (BRNN) (Schuster and Paliwal, 1997), and CapsNet (Sabour et al., 2017). All parameters of these models were set to their default values, and the size of the convolution kernels was 1 × 2. Tables 9 and 10 show the comparisons between the multi-ensemble method and the deep learning models on the Yeast and E. coli datasets, respectively. Multi-ensemble⁴ obtains better results than all four deep learning models on every reported metric.

Table 9.

Comparisons Between Multi-Ensemble and Other Deep Learning Methods on Yeast Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
Multi-ensemble⁴	0.5467	0.8653	0.1347	0.5467	0.8653	0.5467	0.7927	0.412	0.7847
LeNet-5	0.4456	0.8352	0.1648	0.4456	0.8352	0.4456	0.7459	0.2808	0.4734
RNN	0.5398	0.8632	0.1368	0.5398	0.8632	0.5398	0.7891	0.4031	0.7756
BRNN	0.5381	0.8627	0.1373	0.5381	0.8627	0.5381	0.7883	0.4008	0.7803
CapsNet	0.533	0.8612	0.1388	0.533	0.8612	0.533	0.786	0.3942	0.7819

BRNN, Bidirectional Recurrent Neural Network; RNN, Recurrent Neural Network.

Table 10.

Comparisons Between Multi-Ensemble and Other Deep Learning Methods on E. coli Dataset

Methods	SN	SP	FPR	PPV	NPV	F-measure	ACC	MCC	AUC
Multi-ensemble⁴	0.3898	0.9373	0.0627	0.3898	0.9373	0.3898	0.8863	0.3271	0.7828
LeNet-5	0.1181	0.9094	0.0906	0.1181	0.9094	0.1181	0.8357	0.0275	0.3985
RNN	0.0669	0.9042	0.0958	0.0669	0.9042	0.0669	0.8262	−0.0289	0.2746
BRNN	0.3307	0.9313	0.0687	0.3307	0.9313	0.3307	0.8753	0.2620	0.7738
CapsNet	0.315	0.9296	0.0704	0.315	0.9296	0.315	0.8724	0.2446	0.7598
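AUC can be read as the probability that a randomly chosen essential protein receives a higher score than a randomly chosen non-essential one. Values below 0.5, as reported for LeNet-5 and RNN in Table 10, therefore mean that positives are ranked below negatives more often than not on that dataset. The following Python sketch illustrates this pairwise definition; the function name `auc_from_scores` and the toy scores are hypothetical and unrelated to the paper's data.

```python
def auc_from_scores(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): the positives mostly receive LOW scores.
labels = [1, 1, 0, 0, 0]
scores = [0.2, 0.6, 0.5, 0.7, 0.9]

auc_value = auc_from_scores(scores, labels)
auc_flipped = auc_from_scores([-s for s in scores], labels)
print(auc_value, auc_flipped)  # the two values sum to 1
```

Reversing the ranking turns an AUC of a into 1 - a, so an AUC far below 0.5 indicates a systematically inverted ranking rather than an uninformative one.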

4. CONCLUSIONS

In this article, we proposed a novel ensemble framework called multi-ensemble to predict essential proteins. The multi-ensemble method is constructed in two steps. First, it generates different base classifiers by feeding them high-quality samples. Second, it combines these classifiers with an LR model. In contrast to previous ensemble learning methods, our multi-ensemble method trains each base classifier with high-quality samples that are determined by most of the other base classifiers; hence, the base classifiers depend on each other. The multi-ensemble method can flexibly integrate any number and any kind of base classifiers. We employed it to assemble several base classifiers, including the softmax regression classifier, Decision tree, Random Forest, Adaboost, and Xgboost, and found that assembling three base classifiers produces better results than assembling other numbers of base classifiers. We designed four ensemble strategies, each combining three different base classifiers, and applied them to predict essential proteins on the Yeast and E. coli datasets. The experimental results on the two datasets show that all four multi-ensemble methods achieve higher AUC values than their individual base classifiers, the LR method, and the other ensemble learning methods, such as Adaboost, GBDT, and Xgboost. Moreover, most of the base classifiers improve their overall performance by selecting high-quality samples. In addition, the multi-ensemble method achieves better prediction performance than the other machine-learning methods and the deep learning models considered.
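The first step summarized above, choosing high-quality training samples for one base classifier from the predictions of the others, can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name `select_high_quality`, the toy labels and predictions, and the reading of "most of the other base classifiers" as a strict majority are all assumptions. The second step (combining the base classifiers with an LR model) is not shown.

```python
def select_high_quality(preds, y_true, target):
    """Return indices of samples that a strict majority of the classifiers
    other than `target` predicted correctly.

    preds  : list of prediction lists, one per base classifier
    y_true : list of true labels
    target : index of the classifier that will receive the selected samples
    """
    others = [p for k, p in enumerate(preds) if k != target]
    selected = []
    for i, y in enumerate(y_true):
        n_correct = sum(1 for p in others if p[i] == y)
        if 2 * n_correct > len(others):  # assumption: "most" = strict majority
            selected.append(i)
    return selected


# Toy example with three hypothetical base classifiers and five samples.
y_true = [1, 0, 1, 0, 1]
preds = [
    [1, 0, 0, 0, 1],  # classifier 0
    [1, 1, 1, 0, 1],  # classifier 1
    [0, 0, 1, 0, 1],  # classifier 2
]

selections = [select_high_quality(preds, y_true, t) for t in range(len(preds))]
for t, idx in enumerate(selections):
    print(f"samples added for classifier {t}: {idx}")
```

With three base classifiers, each classifier has two peers, so under the strict-majority assumption a sample is selected only when both peers predict it correctly. Each classifier thus receives a different set of additional samples, which is what makes the base classifiers depend on each other.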

Footnotes

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

This work is supported in part by the National Natural Science Foundation of China under Grants No. 61972185 and No. 61472133, the Natural Science Foundation of Yunnan Province of China (No. 2019FA024), Yunnan Key Research and Development Program (2018IA054), Yunnan Ten Thousand Talents Plan Young, and the Natural Science Foundation of Hunan Province of China under Grant 2018JJ2262.

References

Acencio, M.L., and Lemke. 2009. Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information. BMC Bioinformatics, 10, 290.

Andea, Pier Luigi, Piero, et al. 2007. eSLDB: Eukaryotic subcellular localization database. Nucleic Acids Res. 35, 208–212.

Bishop, C.M. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, New York, New York, USA.

Bonacich. 1987. Power and centrality: A family of measures. Am. J. Sociol. 92, 1170–1182.

Breiman. 1996. Bagging predictors. Mach. Learn. 24, 123–140.

Breiman. 2001. Random forests. Mach. Learn. 45, 5–32.

Chen and Guestrin. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, New York, USA. pgs. 785–794.

Chen and Xu. 2005. Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics, 21, 575–581.

Cherry, J.M., Adler, Ball, et al. 1998. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 26, 73–79.

Ernesto, and Rodríguez-Velázquez, J.A. 2005. Subgraph centrality in complex networks. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 71, 056103.

Freeman, L.C. 1977. A set of measures of centrality based on betweenness. Sociometry, 40, 35–41.

Friedman, J.H. 2001. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232.

Gabriel, Thomas, Kristoffer, et al. 2010. InParanoid 7: New algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 38, D196.

Graves. 1997. Long short-term memory. Neural Comput. 9, 1735–1780.

Gustafson, A.M., Snitkin, E.S., Parker, S.C., et al. 2006. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics, 7, 265.

Hwang, Y.C., Lin, C.C., Chang, J.Y., et al. 2009. Predicting essential genes based on network and sequence analysis. Mol. Biosyst. 5, 1672–1678.

Ioannis, Lukasz, Xiaoqun Joyce, et al. 2002. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303.

Jeong, Mason, S.P., Barabasi, A.L., et al. 2001. Lethality and centrality in protein networks. Nature, 411, 41–42.

Jingyuan, Lei, Shengchang, et al. 2011. Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic Acids Res. 39, 795–807.

Jordan, I.K., Rogozin, I.B., Wolf, Y.I., et al. 2002. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res. 12, 962.

Joy, M.P., Brock, Ingber, D.E., et al. 2014. High-betweenness proteins in the yeast protein interaction network. J. Biomed. Biotechnol. 2005, 96.

Lecun, Bottou, Bengio, et al. 1998. Gradient-based learning applied to document recognition. Proc. IEEE, 86, 2278–2324.

[…], Li, Wang, et al. 2016. Predicting essential proteins based on subcellular localization, orthology and PPI networks. BMC Bioinformatics, 17, 279.

Mewes, H.W., F. D., Mayer, K.F., Munsterkotter, et al. 2004. MIPS: Analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res. 34, 169–172.

Min, L.I., Zhang, and Fei. 2013. Essential protein discovery method based on integration of PPI and gene expression data. J. Cent. South Univ. 44, 1024–1029.

Peng, Wang, Cheng, et al. 2015a. UDoNC: An algorithm for identifying essential proteins based on protein domains and protein-protein interaction networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 276–288.

Peng, Wang, et al. 2012. Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks. BMC Syst. Biol. 6, 1–17.

Peng, Wang, et al. 2015b. Rechecking the centrality-lethality rule in the scope of protein subcellular localization interaction networks. PLoS One, 10, e0130743.

Quinlan, J.R. 1986. Induction of decision trees. Mach. Learn. 1, 81–106.

Ren and Yan. 2009. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 37, D455.

Sabour, Frosst, and Hinton, G.E. 2017. Dynamic Routing Between Capsules. Advances in Neu. Info. Proc. Systems, 30, 3856–3866.

Schapire, R.E., Singer, and Singhal. 1998. Boosting and Rocchio applied to text filtering. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pgs. 215–223.

Schuster, and Paliwal, K.K. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681.

Song, Peng, and Wang. 2019a. A random walk-based method to identify driver genes by integrating the subcellular localization and variation frequency into bipartite graph. BMC Bioinformatics, 20, 238.

Song, Peng, and Wang. 2019b. An Entropy-based method for identifying mutual exclusive driver genes in cancer. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 758–768.

Stefan, and Stadler, P.F. 2003. Centers of complex networks. J. Theor. Biol. 223, 45–53.

Stephenson and Zelen. 1989. Rethinking centrality: Methods and examples. Soc. Networks, 11, 1–37.

Tang, Wang, and Yi. 2012. Identifying essential proteins via integration of protein interaction and gene expression data. In IEEE International Conference on Bioinformatics & Biomedicine. USA. pgs. 1–4.

Tang, Li, Wang, et al. 2015. CytoNCA: A cytoscape plugin for centrality analysis and evaluation of protein interaction networks. Biosystems, 127, 67–72.

[…] B.P., Andrzej, Maga, et al. 2005. Logic of the yeast metabolic cycle: Temporal compartmentalization of cellular processes. Science, 310, 1152.

Vallabhajosyula, R.R., Deboki, Samina, et al. 2009. Identifying hubs in protein interaction networks. PLoS One, 4, e5344.

Wang, Li, Wang, et al. 2012. Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 1070–1080.

Wang, Peng, and Wu, F.-X. 2013. Computational approaches to predicting essential proteins: A survey. Proteomics Clin. Appl. 7, 181–192.

Zhang, Peng, Yang, et al. 2019. A novel method for identifying essential genes by fusing dynamic protein–protein interactive networks. Genes, 10, 31.

Zhong, Wang, Peng, et al. 2013. Prediction of essential proteins based on gene expression programming. BMC Genomics, 14, S7.

Zhou, Z.-H., and Li. 2005. Tri-training exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 17, 1529–1541.
