Abstract
Abstract
Many previous works in protein function prediction make predictions one function at a time, fundamentally, which assumes the functional categories to be isolated. However, biological processes are highly correlated and usually intertwined together to happen at the same time; therefore, it would be beneficial to consider protein function prediction as one indivisible task and treat all the functional categories as an integral and correlated prediction target. By leveraging the function–function correlations, it is expected to achieve improved overall predictive accuracy. To this end, we develop a network-based protein function prediction approach, under the framework of multi-label classification in machine learning, to utilize the function–function correlations. Besides formulating the function–function correlations in the optimization objective explicitly, we also exploit them as part of the pairwise protein–protein similarities implicitly. The algorithm is built upon the Green's function over a graph, which not only employs the global topology of a network but also captures its local structures. In addition, we propose an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data. Finally, we quantify the statistical confidence of predicted functions to facilitate post-processing of proteomic analysis. We evaluate the proposed approach on Saccharomyces cerevisiae data, and the experimental results demonstrate very encouraging results.
1. Introduction
1.1. Network-based protein function prediction
Recent availability of protein interaction networks for many species has spurred on the development of network-based computational methods in protein function prediction. Typically, an interaction network is first modeled as a graph, with the vertices representing proteins and the edges representing the detected protein–protein interactions (PPI), followed by a graph-based statistical learning method to infer putative protein functions.
We use an example to illustrate the deficiencies of the aforementioned methods. A small part of the PPI graph constructed from the BioGRID data (Stark et al., 2006) and annotated by MIPS Funcat scheme (Mewes et al., 1999) is shown in Figure 1. The clear oral vertices are unannotated proteins while the elliptical ones are proteins annotated with function “Metabolism” and the rectangular ones are proteins not annotated with the same function. The task is to determine whether the unannotated proteins have the functionality of “Metabolism.” When neighbor-counting approaches are applied, only the annotated proteins contribute to the annotation of an unannotated protein. For example, the functions of “YIL152W” is determined solely by those of “HSP82” and “BUD4,” but the rest of the annotated proteins and their unannotated neighbor “YER071C” are not used. In global optimization approaches, the annotated proteins are always treated the same, no matter how far they are from and how many links they are connected to the unannotated proteins. For example, when the global optimization approaches applied to annotate “YER071C,” “SSB2,” and “BUD4” are treated the same, although the former is closer to “YER071C”; “HSP82” and “CAP1” are also treated the same, although there are two connections from the former to “YER071C” while there is only one from the latter. Function-flow approach (Nabieva et al., 2005) takes care of the distance and link patterns, but it restricts the propagation to a fixed number of steps.

A part of (PPI) graph constructed by BioGRID data, which illustrates the deficiencies of some existing approaches. The oval vertices without background color are the unannotated proteins, while the elliptical vertices with blue background are the annotated proteins associated with function “metabolism,” and the rectangular vertices are the annotated proteins not associated with function “metabolism”.
1.2. Multi-label correlated protein function prediction
Because a protein is usually observed to play several functional roles in different biological processes within an organism, it is natural to annotate it with multiple functions. Thus, protein function prediction is an ideal example of multi-label classification (Wang et al., 2009, 2010b–d, 2011) in machine learning. Multi-label classification, in which each object may belong to more than one class, is an emerging topic driven by the advances of modern technologies in recent years. Placing protein function prediction under the framework of multi-label classification, we use the Green's function approach to integrate the function–function correlations from the theory of reproducing kernel Hilbert space (RKHS) (in Section 2.4). Besides incorporating the function–function correlations as a regularizer in the optimization objective explicitly, we also take advantage of them as part of the pairwise protein similarities implicitly (in Section 2.5). In addition, we propose an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data (in Section 2.6), and quantify the statistical confidence of predicted putative functions for post-processing of proteomic analysis (in Section 2.7).
2. Methods
In this section, we propose a function–function correlated multi-label (FCML) approach using the Green's function on a graph to predict protein functions, which incorporates the function–function correlations in two levels: one from the function perspective to formulate the functionwise similarities explicitly in the optimization objective (in Section 2.4), and the other from the protein similarity perspective using function assignments to model the function correlations implicitly (in Section 2.5). Besides being used in the proposed approach, the latter also provides a means for all other previous related works to exploit the function correlations.
2.1. Notations and problem formalization
In protein function prediction, given K biological functions and n proteins, each protein xi is associated with a set of labels represented by a function assignment indication vector
We formalize a protein interaction network as a graph
In summary, for a protein function prediction task, we are given W and Yl as input, and the outputs of our method are decision values for the predicted putative functions assigned to the unannotated proteins, that is,
2.2. Protein function prediction using the Green's function over a graph
In this section, we first briefly review the Green's function approach for label propagation over a graph, from which we will develop the proposed FCML method in Section 2.4.
The Green's function is of significant importance in solving partial differential equations, because it transforms them into integral equations. In physics, the Green's function G(
To be more specific, given a graph with edge weights W, its combinatorial Laplacian is defined as L = D − W (Chung, 1997), where D = diag(W
where
where
Now we consider all the K functions. Extending Equation (3), we may assign functions to unannotated proteins as follows. Given K biological functions, we may assign functions to unannotated proteins as Ding et al. (2007):
We name Equation (4) simply as multi-label Green's function (MLGF) approach, beyond which we will propose a novel function–function correlated multi-label Green's function approach.
2.3. Utilizing both local and global structure of a PPI network by the Green's function
Before proceeding to propose our new approach, we first point out that the Green's function is closely related to a well-established distance metric on a generic weighted graph, where the edge weight measures the similarity between two end vertices. Based on the derivation of the distance metric, it is easy to see that the Green's function approach not only takes advantage of global topology of a network but also leverages its local structures.
We view a generic weighted graph as a network of electric resistors, where the edge connecting vertices xi and xj is a resistor with resistance rij. The graph edge weight (the pairwise similarity) between vertices xi and xj is wij = 1/rij. Two vertices not connected by a resistor are viewed as equivalently connected by a resistor with rij = ∞ or wij = 0. The most common task on a resistor network is to calculate the effective resistance between different vertices. The effective resistance Rij between vertices xi and xj is equal to 1/(total current between xi and xj) when xi is connected to voltage 1 and xj is connected to voltage 0. Let
where
From the analysis above, we can list some properties of the Green's function. First, G is clearly a semipositive definite function. Second, any function
2.4. Function–function correlated RKHS approach for multi-label classification
Although the function–function correlations are useful to infer putative functions of unannotated proteins, MLGF approach defined in Equation (4) neglects them because it treats the biological functions as isolated. In multi-label scenarios, however, we concentrate on making use of the function–function correlations, which could be defined as
Following Wang et al., (2009), we expect to maximize
Combining Equation (7) with the original RKHS objective in Equation (2), we minimize the following objective:
where
Differentiating J with respect to F, we have:
Because β is usually very small in typical empirical settings, we have:
where
We name Equation (11) as our proposed function–function correlated multi-label (FCML) approach for protein function prediction. By Equation (11) we can compute F in a closed form without iterations, which is more mathematically elegant than related approaches. Moreover, (I − αC)−1 can be seen as another graph, which propagates label influence through the label correlations over the whole network.
which can be further seen as the following iterative process:
Equation (13) reveals the insight of the proposed FCML method. At the initialization step, the decision matrix
2.5. Correlation augmented interaction network
Traditional network-based protein function prediction approaches only use biological interaction networks obtained from experimental data such as those from high-throughput technologies. When viewing protein function prediction as a multi-label classification problem, we can also build a computational interaction network
where WBio is the biological interaction network, which is same as in existing approaches. γ controls the relative importance of WL, and empirically selected as
The true power of the correlation augmented interaction network construction scheme defined in Equation (14) lies in that, the original biological similarities among proteins are augmented by the function assignment similarities, thereby label propagation pathways over a graph are reinforced. Moreover, with this interaction network construction scheme, the correlations among the functional categories are encoded into the graph weights, such that the resulted hybrid graph can be directly used in previous works to enhance their prediction performance. In this work, we use W defined in Equation (14) to compute the Green's function in Equation (1).
Our task in protein function prediction is to assign functions to unannotated proteins upon annotated ones. In order to compute WL, however, we need function assignments of all the proteins including those annotated and unannotated. Therefore, we first initialize unannotated proteins through a majority voting (Schwikowski et al., 2000) approach, which makes predictions using the top three frequent functions of the protein's interacting partners. Note that the class similarity defined in Equation (6) is different from the protein similarity defined here in Eq. (15). The former is a function-wise similarity matrix of size K × K, whereas the latter is a proteinwise similarity matrix of size n × n, although essentially they both convey label correlations.
where r(k) is estimated reliabilities of the corresponding network by expression profile reliability (EPR) index (Deane et al., 2002). Equation (16) reflects the fact that interactions detected in multiple experiments are generally more reliable than those detected by a single experiment (Von Mering et al., 2002).
Because in reality the overlap among different biological networks typically is very small, and the BioGRID PPI network data are fairly comprehensive, in this work we set WBio = W(1), where W(1)(i, j) = 1 if protein xi and xj interact, and 0 otherwise.
By using the graph constructed from Equation (14), in addition to explicitly modeling the function–function correlations as in Equation (11), the correlations are also implicitly incorporated into the network linkages, so that the predictive accuracy can be further enhanced.
2.6. Adaptive decision boundary for function assignment
The MLGF approach defined in Equation (4) and FCML approach defined in Equation (11) produce ranked lists for function/label assignment, therefore decision boundaries are required to make predictions. Most existing research works using ranking lists to predict protein functions normally do not supply a threshold explicitly. Instead, they use a set of ROC curves (or the variant “precision”–“recall” curves) to evaluate the prediction performance. In some of these approaches, a heuristic cutoff point is given at the function assignment step e.g., in the majority voting (MV) approach, Schwikowski et al. (2000) assigned the three most frequently occurring functions among its neighbors to an unannotated protein. However, such threshold might not be the optimal one.
In many semi-supervised learning algorithms, the threshold for classification is usually selected as 0, which again is not necessarily the best choice. We propose an adaptive decision boundary to achieve better performance, which is adjusted such that the weighted training errors of all positive and negative samples are minimized.
Considering the binary classification problem for the k-th class, we denote bk as the decision boundary, S+ and S− as the sets of positive and negative samples for the k-th class, and e+(bk) and e−(bk) as the numbers of misclassified positive and negative training samples. The adaptive (optimal) decision boundary is given by the Bayes' rule
Figure 2 shows the adaptive decision boundary for function “11” (transcription) defined in MIPS Funcat annotation scheme (version 2.1) using BioGRID PPI data (version 2.0.45). In the figure, the areas (probability likelihood) of misclassifications are minimized, and the adaptive decision boundary is different from 0.

Optimal decision boundary to minimize misclassification for function “11” (Transcription) (the black vertical line) is different from 0.
2.7. Statistical confidence of putative protein function
Many existing protein function prediction approaches only assign “yes” or “no” to a protein when deciding its membership to a function. However, due to the high noise in the biological experiments to generate PPI data, it would be better to estimate the probability of a given prediction rather than simply saying “yes” or “no.” Namely, the statistical confidence to a given prediction is necessary and often of great use in post-processing of proteomic analysis. For example, in order to minimize the experimental time, biologists would decide the order of biological experiments according to the confidence values of the putative protein functions.
Quantitatively evaluating the confidence of a prediction is usually not easy, because the underlying probability model and the actual training and testing data distribution are constantly changing for different biological functions. In this study, we adopt the posterior probability as a metric of the confidence for a prediction due to its clear statistical meaning and explicit computational formula.
Let Yik be the ground truth membership of protein xi for the k-th biological function, and
Hastie et al. (Hastie and Tibshirani, 1998) propose to fit Gaussians to the class-conditional densities p(Zik|Yik = ±1). The posterior probability is thus a sigmoid, whose slope is determined by the tied variance. Despite its clear intuition and explicit formulation, this approach is seldom useful in real applications because the assumption of Gaussian class-conditional densities is often violated. Figure 3 shows a plot of class-conditional densities p(Zik|Yik = ±1) for the training data of function “02” (Energy) in MIPS Funcat annotation system. The plot shows histograms of the densities (with bin 0.1 wide), derived from 10-fold cross-validation. Obviously, these densities are far away from Gaussian. In order to tackle this problem, inspired by empirical data, Platt et al. (1999) proposed to fit the conditional-class probabilities implicitly and fit the posterior probability to a parametric form of a sigmoid:

Class-conditional densities for the training data of function “02” (Energy). The solid red line is p(Zsik|Ysik = +1) while the dashed blue line is p(Zsik|Ysik =−1). Obviously, these two histograms are not Gaussian.
The posterior probability defined in Equation (19) takes the form of logistic function, which represents the cumulative distribution function of a big family of exponential distributions. Because of this statistical enrichment, we adopt Equation (19) as our quantitative metric to measure the prediction confidence. The fitted sigmoid curve for the data shown in Figure 3 is plotted in Figure 4.

The fit of a sigmoid to the training data of function “02” (Energy) as shown in Figure 3. Blue markers are the average posterior probabilities computed for the examples falling into a bin of width 0.1. The solid red line is the best-fit sigmoid to the posterior probabilities where A = −2.124e + 004 and B = 2.916.
3. Materials and Data Sets
Two types of data are involved in the experimental evaluations for protein function prediction: function annotation data and PPI data. In this section, we describe the data used in this work.
The functional catalogue (FunCat) (Mewes et al., 1999) is a project under the Munich Information Center for Protein Sequences (MIPS), which is an annotation scheme for the functional description of proteins from prokaryotes, unicellular eukaryotes, plants, and animals. Taking into account the broad and highly diverse spectrum of known protein functions, FunCat of version 2.1 consists of 27 main functional categories. Seventeen of them are involved in annotating Saccharomyces cerevisiae, which covers general fields such as cellular transport, metabolism, and cellular communication/signal transduction. The main branches exhibit a hierarchical, treelike structure with up to six levels of increasing specificity. Although there are still other protein annotation systems such, as the Gene Ontology (Ashburner et al., 2000), we use the Funcat annotation system due to its clear treelike hierarchical structure.
The protein–protein interaction data can be downloaded from the BioGRID database (Stark et al., 2006), and we focus on the S. cerevisiae. By removing the proteins connected by only one PPI, there are 4299 proteins with 72624 PPIs in the BioGRDI database of version 2.0.45 annotated by Funcat annotation scheme, together with 1997 unannotated proteins. All related 17 level-1 biological functions are listed in Table 1.
4. Results and Discussions
In this article, we proposed a function–function correlated multi-label (FCML) approach for protein function prediction to utilize the correlations among the biological functions to improve the overall prediction performance. Using the PPI data from BioGRID database (Stark et al., 2006) and Funcat annotation scheme (Mewes et al., 1999) on S. cerevisiae data, we evaluate our proposed approach and make predictions for unknown proteins.
For statistical metrics, we use the standard precision and F1 score that have been widely used in previous protein function prediction research work. Let TP (true positive) be the number of proteins that we correctly predict to have a given function, FP (false positive) be the number of proteins that we incorrectly predict to have the function, and FN (false negative) be the number of proteins which we incorrectly predict to not have the function. The “precision” is defined as
F1 score is extensively used in previous related work and other domains such as information retrieval. Typically, improving the precision of an algorithm decreases its recall and vice versa, therefore F1 score is a balanced performance metric.
4.1. Evaluation of the function–function correlations
Because function–function correlations are one of the most important mechanisms to improve the prediction performance in our proposed approach, we first evaluate its correctness. Using the FunCat 2.1 annotation data set for S. cerevisiae genome, the function–function correlations defined in Equation (6) are illustrated in the right panel of Figure 5. The high correlation value between functions “40” (Cell Fate) and “43” (Cell Type Differentiation) depicted in this figure shows that they are highly correlated. In addition, as shown in this figure, some other function pairs are also highly correlated, such as functions “11” (Transcription) and “16” (Protein with Binding Function or Cofactor Requirement), “18” (Regulation of Metabolism and Protein Function), and “30” (Cellular Communication/Signal Transduction Mechanism), etc. All these observations comply with the biological nature, which justifies the utility of the function correlations from a biological perspective. Figure 5 will also be used to demonstrate the power of function–function correlations in Section 4.4.

Right panel: Illustration of the correlation matrix defined in Equation (6) among the 17 main functional categories in FunCat 2.1 annotated to S. cerevisiae genome. Left panel: protein, “FUN19” is the testing protein and other neighboring proteins are training data that have functions “11” or “14.” Middle panel: Our FCML method correctly annotates protein “FUN19” with function “16” utilizing function–function correlations.
4.2. Robustness of adaptive decision boundary
Adaptive decision boundary is another important contribution of this article. Instead of heuristically selecting the thresholds by experience like many existing approaches, adaptive decision boundary method principally computes the thresholds for function assignment from the training data to deal with the unbalanced distribution problem between positive and negative training data. We need to evaluate whether the adaptive decision boundary is robust to the amount of training data to compute it.
In order to evaluate the robustness of the adaptive decision boundaries, we compute them using different amounts of training data and report the corresponding prediction performance in five-fold cross-validation. As a demonstration, we randomly select function “11” (Transcription) and conduct the experiments on its data. We randomly select 5%, 10%, 20%, 40%, or 80% of the training data to compute the decision boundaries. The prediction performance measured by F1 score is plotted in Figure 6. From the experimental results, we can see that the prediction performance does not degrade much with the decrease of the amount of training data to compute the adaptive decision boundaries. In other words, adaptive decision boundary is a robust thresholding method as long as the amount of data used to compute it is not very small. For example, 5% of the training data is enough to calculate a valid adaptive decision boundary in the demonstration experiments.

Evaluation on the effectiveness and robustness of adaptive decision boundary. The performance measured by F1 score vs. the percentage of training data used to compute the adaptive decision boundary for function “11” (transcription).
4.3. Improved function prediction in cross-validation
We compare the performances of our proposed multi-label Green's function (MLGF) approach and function–function correlated multi-label (FCML) approach to related commonly used methods, such as majority voting (MV) approach (Schwikowski et al., 2000), global majority voting (GMV) approach (Vazquez et al., 2003), χ2 approach (Hishigaki et al., 2001), functional flow (FF) approach (Nabieva et al., 2005), and kernel logistic regression (KLR) method (Lee et al., 2006). The PPI graph is built from BioGRID data of version 2.0.45 with annotation by MIPS Funcat scheme of version 2.1. The ten-fold cross-validation is used. For four other approaches, we use their respective optimal parameters. In MV approach, we select the three most frequently occurring functions in a protein's neighbors. In χ2 approach, radius = 1 gives the best performance. In FF approach, we assign functions according to the proportions of positive and negative training samples as suggested by Nabieva et al. (2005).
We report the overall prediction performance over all functions using the microaverages of two performance values to address multi-label scenario. The microaverage is computed from the sum of per class contingency table, which can be seen as a weighted average that emphasizes more on the accuracy of classes/functions with more positive samples. The microaveraged precision and F1 score by the compared approaches over all 17 level-1 biological functions are listed in Table 2, which quantify the advantages of our FCML approach over the others with more concrete evidence. Moreover, the performances by FCML approach are consistently better than those of MLGF, which demonstrates that incorporating the inherent correlations among biological functions can improve the prediction performance significantly.
4.4. Demonstration of the effectiveness of function–function correlations
In the above experiments, our FCML method outperforms all other methods. We carefully check those testing proteins, which are incorrectly annotated by other methods but correctly annotated by the FCML method. We find function–function correlations absolutely help the function prediction results. Now we will show one example to demonstrate the effectiveness of function–function correlations. For example, in the left panel of Figure 5, protein “FUN19” is the testing protein and other neighboring proteins are training data that have functions “11” (Transcription) or “14” (Protein Fate [Folding, Modification, Destination]).
In experimental results, protein “FUN19” is annotated with functions “11” and “14” by all six methods. But it is annotated with function “16” (Protein with Binding Function or Cofactor Requirement [Structural or Catalytic]) only by our FCML approach and not by the other approaches. We observe that no proteins directly interacting with “FUN19” are annotated with function “16,” and only a small fraction (90 out of 355) of proteins indirectly interacting with “FUN19” via an intermediate protein are annotated with function “16.” Thus, all five other methods fail to annotate protein “FUN19” with function “16.”
However, a majority of proteins directly interacting with “FUN19” are annotated with either function “11” or function “14.” By scrutinizing the function–function correlation matrix computed from Equation (6) as shown in the right panel of Figure 5, we can see that function “16” has the highest statistical correlations with functions “11” and “14.” Utilizing such function–function correlations, our FCML method correctly annotates protein “FUN19” with function “16” as shown in the middle panel of Figure 5. In other words, the functionwise correlations play a significant role to improve overall predictive accuracy in protein function annotations.
4.5. Prediction and putative functions of unannotated proteins
We apply the proposed FCML approach on the BioGRID data annotated by the MIPS Funcat scheme and predict functions for the unannotated proteins. A list of all putative function predictions for level-1 functions in MIPS Funcat scheme by our algorithm is provided in Table 3, which is supplied in the Appendix of this article. In addition to predicted functions, we also report the corresponding statistical confidence values. For example, we annotate function “11” (Transcription) with statistical confidence of 0.83 and function “32” (Cell Rescue, Defense and Virulence) with statistical confidence of 0.12 to protein “YNR024W.” Namely, our experimental results suggest that protein “YNR024W” is more likely to be annotated with function “11” than function “32.”
5. Conclusions
We proposed a novel function–function correlated multi-label (FCML) protein function prediction approach and showed its promising performance, which outperforms other related approaches. Different from most existing approaches that divide protein function prediction into multiple separate tasks and make predictions fundamentally one function at a time, the proposed FCML approach considers all the biological functions as a single correlated prediction target and predict protein functions via an integral procedure. In the proposed approach, correlations among the functional categories are leveraged. By formulating protein function prediction as a multi-label classification problem, we use the Green's function over a graph to efficiently resolve the problem. The Green's function approach takes advantage of both the full topology of the interaction network toward global optimization and the local structures, such that the deficiencies lying in the existing approaches can be overcome. In addition, we propose an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data and quantify the statistical confidence of predicted functions for post-processing of proteomic analysis.
6. Appendix
6.1. Predicted putative functions for unannotated proteins and corresponding statistical confidence
We apply the proposed FCML approach on the BioGRID data and predict functions for the unannotated proteins. We use MIPS Funcat annotation scheme. A list of all putative function predictions for level-1 functions in MIPS Funcat scheme by our algorithm is provided in Table 3. The nonempty cells indicate the predicted putative functions of the corresponding protein. For example, protein “YAL034C” is predicted to have functions “16” (Protein with Binding Function) and “18” (Regulation of Metabolism and Protein Function).
In addition to predicted putative functions, we also report the corresponding statistical confidence values. For example, we annotate function “11” (Transcription) with statistical confidence of 0.83 and function “32” (Cell Rescue, Defense and Virulence) with statistical confidence of 0.12 to protein “YNR024W”. Namely, our experimental results suggest that protein “YNR024W” is more likely to be annotated with function “11” than function “32”.
Footnotes
Author Disclosure Statement
The authors declare that no competing financial interests exist.
