Function–Function Correlated Multi-label Protein Function Prediction over Interaction Networks

Abstract

Many previous works in protein function prediction fundamentally make predictions one function at a time, which implicitly assumes that the functional categories are isolated. However, biological processes are highly correlated and often intertwined, occurring at the same time; it is therefore beneficial to treat protein function prediction as one indivisible task, with all the functional categories forming an integral and correlated prediction target. Leveraging the function–function correlations is expected to improve overall predictive accuracy. To this end, we develop a network-based protein function prediction approach, under the framework of multi-label classification in machine learning, that utilizes the function–function correlations. Besides formulating the function–function correlations explicitly in the optimization objective, we also exploit them implicitly as part of the pairwise protein–protein similarities. The algorithm is built upon the Green's function over a graph, which not only employs the global topology of a network but also captures its local structures. In addition, we propose an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data. Finally, we quantify the statistical confidence of predicted functions to facilitate post-processing of proteomic analysis. We evaluate the proposed approach on Saccharomyces cerevisiae data, and the experimental results are very encouraging.

1. Introduction

Many existing methods for predicting protein function from protein interaction network data fundamentally make predictions one function at a time. This turns the problem into a form that is convenient for existing machine-learning algorithms, but it abstracts away the function correlations, although most biological functions are interdependent. For example, “Transcription” and “Protein Synthesis” (Mewes et al., 1999) usually appear together, one after another; i.e., they tend to appear in the biological processes involving the same protein. As a result, if a protein is known to be annotated with the function “Transcription,” it is highly probable that the same protein is annotated with the function “Protein Synthesis” as well. In other words, the function–function correlations convey valuable information for understanding the biological processes, which provides an opportunity to improve protein function prediction accuracy. How to effectively exploit function–function correlations is therefore a challenging, yet important, problem in proteomic analysis for protein function prediction. In this study, we tackle this new problem by placing protein function prediction under the framework of multi-label classification, an emerging topic in machine learning, and develop a new graph-based protein function prediction method that takes advantage of function–function correlations.

1.1. Network-based protein function prediction

Recent availability of protein interaction networks for many species has spurred on the development of network-based computational methods in protein function prediction. Typically, an interaction network is first modeled as a graph, with the vertices representing proteins and the edges representing the detected protein–protein interactions (PPI), followed by a graph-based statistical learning method to infer putative protein functions.

Review of related works. The most straightforward method of using network data to predict protein function determines the functions of a protein from the known functions of its neighboring proteins on a PPI network (Schwikowski et al., 2000; Hishigaki et al., 2001; Chua et al., 2006), which leverages only local information of a network. Later researchers used global optimization approaches to improve protein function predictions by taking into account the full topology of networks (Vazquez et al., 2003; Karaoz et al., 2004; Nabieva et al., 2005). All these approaches follow a common scheme: (1) compute a set of ranking lists, and (2) make predictions using certain thresholds on the ranking lists. In step 1, which is the most critical part of the algorithms, they all compute the ranking lists one function at a time and ignore the relationships among the functions. A broad variety of network-based approaches using other models for protein function prediction are surveyed in Sharan et al. (2007).

We use an example to illustrate the deficiencies of the aforementioned methods. A small part of the PPI graph constructed from the BioGRID data (Stark et al., 2006) and annotated by the MIPS Funcat scheme (Mewes et al., 1999) is shown in Figure 1. The clear oval vertices are unannotated proteins, the elliptical ones are proteins annotated with the function “Metabolism,” and the rectangular ones are proteins not annotated with that function. The task is to determine whether the unannotated proteins have the functionality of “Metabolism.” When neighbor-counting approaches are applied, only the annotated neighboring proteins contribute to the annotation of an unannotated protein. For example, the functions of “YIL152W” are determined solely by those of “HSP82” and “BUD4,” while the rest of the annotated proteins and the unannotated neighbor “YER071C” are not used. In global optimization approaches, the annotated proteins are always treated the same, no matter how far they are from the unannotated proteins or how many links connect them to those proteins. For example, when the global optimization approaches are applied to annotate “YER071C,” “SSB2” and “BUD4” are treated the same, although the former is closer to “YER071C”; “HSP82” and “CAP1” are also treated the same, although there are two connections from the former to “YER071C” and only one from the latter. The function-flow approach (Nabieva et al., 2005) takes the distance and link patterns into account, but it restricts the propagation to a fixed number of steps.

FIG. 1.

A part of the PPI graph constructed from BioGRID data, which illustrates the deficiencies of some existing approaches. The oval vertices without background color are the unannotated proteins, the elliptical vertices with blue background are the annotated proteins associated with the function “metabolism,” and the rectangular vertices are the annotated proteins not associated with the function “metabolism.”

Motivation to use the Green's function approach. From the example in Figure 1, we can see that the above existing approaches rely on two assumptions, local consistency and global consistency, which are exactly the foundations of the label propagation approaches for classification in machine learning. This motivates us to formulate protein function prediction over a PPI network as a label propagation problem on a graph. Among existing label propagation methods, we choose to develop our new method from the Green's function approach (Ding et al., 2007; Wang et al., 2009) because of its demonstrated effectiveness in other applications and its clear intuition. Most importantly, the weaknesses of the previous methods can be addressed by the Green's function approach, as detailed in Section 2.4.

1.2. Multi-label correlated protein function prediction

Because a protein is usually observed to play several functional roles in different biological processes within an organism, it is natural to annotate it with multiple functions. Thus, protein function prediction is an ideal example of multi-label classification (Wang et al., 2009, 2010b–d, 2011) in machine learning. Multi-label classification, in which each object may belong to more than one class, is an emerging topic driven by the advances of modern technologies in recent years. Placing protein function prediction under the framework of multi-label classification, we use the Green's function approach to integrate the function–function correlations from the theory of reproducing kernel Hilbert space (RKHS) (in Section 2.4). Besides incorporating the function–function correlations as a regularizer in the optimization objective explicitly, we also take advantage of them as part of the pairwise protein similarities implicitly (in Section 2.5). In addition, we propose an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data (in Section 2.6), and quantify the statistical confidence of predicted putative functions for post-processing of proteomic analysis (in Section 2.7).

2. Methods

In this section, we propose a function–function correlated multi-label (FCML) approach using the Green's function on a graph to predict protein functions. It incorporates the function–function correlations at two levels: one from the function perspective, which formulates the functionwise similarities explicitly in the optimization objective (in Section 2.4), and the other from the protein similarity perspective, which uses function assignments to model the function correlations implicitly (in Section 2.5). Besides being used in the proposed approach, the latter also provides a means for previous related works to exploit the function correlations.

2.1. Notations and problem formalization

In protein function prediction, given K biological functions and n proteins, each protein x_i is associated with a set of labels represented by a function assignment indication vector $\mathbf{y}_i \in \{-1, 0, 1\}^K$ such that y_i(k) = 1 if protein x_i has the k-th function, y_i(k) = −1 if it does not have the k-th function, and y_i(k) = 0 if its function assignment is not yet known a priori. Given l annotated proteins $\{(x_1, \mathbf{y}_1), \ldots, (x_l, \mathbf{y}_l)\}$ where l < n, the goal is to predict functions $\{\mathbf{y}_i\}_{i=l+1}^n$ for the unannotated proteins $\{x_i\}_{i=l+1}^n$. We write $Y = [\mathbf{y}_1, \ldots, \mathbf{y}_n]^T$, and $Y_l = [\mathbf{y}_1, \ldots, \mathbf{y}_l]^T = [\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(K)}]$, where $\mathbf{y}^{(k)} \in \mathbb{R}^l$ is a classwise function assignment vector. We also define $F = [\mathbf{f}_1, \ldots, \mathbf{f}_n]^T \in \mathbb{R}^{n \times K}$ as the decision matrix for prediction, and $\{\mathbf{f}_i\}_{i=l+1}^n$ includes the decision values for prediction.

We formalize a protein interaction network as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The vertices $\mathcal{V}$ correspond to the proteins $\{x_1, \ldots, x_n\}$, and the edges $\mathcal{E}$ are weighted by an n × n similarity matrix W with W_ij indicating the similarity between x_i and x_j. In the simplest case, W is the adjacency matrix of the PPI graph where W_ij = 1 if proteins x_i and x_j interact, and 0 otherwise. In this work, W is computed in Equation (14) to incorporate more useful information.

In summary, for a protein function prediction task, we are given W and Y_l as input, and the outputs of our method are decision values for the predicted putative functions assigned to the unannotated proteins, that is, $\{\mathbf{f}_i\}_{i=l+1}^n$.
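As a minimal illustration of these inputs, the following Python sketch builds W as a plain 0/1 adjacency matrix (the simplest case described above, not the similarity of Equation (14)) and pads Y_l with zero rows for the unannotated proteins. The helper names, the toy edge list, and the labels are hypothetical and are not taken from the evaluation data.

```python
import numpy as np

def adjacency_from_edges(n, edges):
    """Symmetric 0/1 adjacency matrix W of an undirected PPI graph on n proteins."""
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = 1.0
        W[j, i] = 1.0
    return W

def pad_labels(Y_l, n):
    """Stack Y_l (l x K, entries in {-1, 0, 1}) on top of zero rows for the
    n - l unannotated proteins, giving the n x K matrix Y."""
    Y_l = np.asarray(Y_l, dtype=float)
    l, K = Y_l.shape
    Y = np.zeros((n, K))
    Y[:l] = Y_l
    return Y

# Hypothetical toy data: 5 proteins, K = 2 functions, the first l = 3 annotated.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 3)]
W = adjacency_from_edges(5, edges)
Y_l = np.array([[1, 1], [1, -1], [-1, -1]])
Y = pad_labels(Y_l, 5)
```

Rows 3 and 4 of Y are all zeros, which encodes "assignment not yet known" in the notation of this section.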

2.2. Protein function prediction using the Green's function over a graph

In this section, we first briefly review the Green's function approach for label propagation over a graph, from which we will develop the proposed FCML method in Section 2.4.

The Green's function is of significant importance in solving partial differential equations, because it transforms them into integral equations. In physics, the Green's function G(r, r′) represents the field response (i.e., influence) at location r to the presence of a charge at location r′. In machine learning, G(r, r′) quantifies the influence of a labeled data point at r′ on an unlabeled data point at r.

To be more specific, given a graph with edge weights W, its combinatorial Laplacian is defined as L = D − W (Chung, 1997), where D = diag(We) and $\mathbf{e} = [1 \ldots 1]^T$. The Green's function over the graph is defined as the inverse of L with the zero-mode discarded, which is computed as follows (Ding et al., 2007; Wang et al., 2009, 2010a):
\begin{align*}
G = L_+^{-1} = \frac{1}{(D - W)_+} = \sum_{i=2}^n \frac{\mathbf{v}_i \mathbf{v}_i^T}{\lambda_i}, \tag{1}
\end{align*}
where $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ are the eigenvalues of L and $\mathbf{v}_i$ are the corresponding eigenvectors.
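Equation (1) can be sketched directly with an eigendecomposition. The code below is a minimal illustration rather than the authors' implementation; the function name, the tolerance `tol` used to detect zero modes, and the toy network are assumptions. For a connected graph exactly one eigenvalue is discarded.

```python
import numpy as np

def greens_function(W, tol=1e-10):
    """Green's function of Equation (1): inverse of L = D - W with zero modes discarded."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W      # combinatorial Laplacian, D = diag(We)
    lam, V = np.linalg.eigh(L)          # eigenvalues in ascending order
    keep = lam > tol                    # drop the zero mode(s)
    # Sum of v_i v_i^T / lambda_i over the retained eigenpairs.
    return (V[:, keep] / lam[keep]) @ V[:, keep].T

# Hypothetical 4-protein network: a path 0-1-2-3 plus the edge 0-2.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
G = greens_function(W)
```

The result is symmetric, annihilates the constant vector e, and acts as an inverse of L on the subspace orthogonal to e.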

In the corresponding RKHS objective, Equation (2), $\mathcal{K} = G$ is the kernel, $\mu \in (0, 1)$ is a constant to control the smoothness regularizer $\mathrm{tr}(F^T \mathcal{K}^{-1} F)$, and tr(·) denotes the trace of a matrix. If we only consider one biological function, protein function prediction amounts to a two-class classification problem, where the function assignment vector reduces to a scalar, i.e., $y_i \in \{1, 0, -1\}$. Given labeled data $\{(x_i, y_i)\}_{i=1}^l$ and unlabeled data $\{x_i\}_{i=l+1}^n$, the labels of the unlabeled data are computed by influence propagation from the labeled data to the unlabeled data (Ding et al., 2007):
\begin{align*}
y_j = \mathrm{sign}\left(\sum_{i=1}^l G_{ji} y_i\right), \quad l < j \leq n. \tag{3}
\end{align*}

Now we consider all the K functions. Extending Equation (3), we may assign functions to unannotated proteins as follows (Ding et al., 2007):
\begin{align*}
\mathbf{y}_j = \mathrm{sign}(\mathbf{f}_j), \quad l < j \leq n, \quad \text{where} \quad F = GY. \tag{4}
\end{align*}

We refer to Equation (4) simply as the multi-label Green's function (MLGF) approach, on top of which we will propose a novel function–function correlated multi-label Green's function approach.
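A small worked instance of Equation (4) is sketched below. It assumes a connected graph, in which case the zero-mode-discarded inverse of L coincides with the Moore–Penrose pseudoinverse, so `numpy.linalg.pinv` is used as a stand-in for G. The function name and the toy chain of five proteins are hypothetical; unannotated proteins are the all-zero rows of Y and need not be listed last for the formula to apply.

```python
import numpy as np

def mlgf_predict(G, Y):
    """MLGF of Equation (4): decision matrix F = G Y and its sign."""
    F = G @ Y
    return F, np.sign(F)

# Hypothetical chain of 5 proteins: 0 - 1 - 2 - 3 - 4, unit edge weights.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
G = np.linalg.pinv(L, rcond=1e-10, hermitian=True)  # Green's function (connected graph)

# Two functions; only proteins 0 and 4 are annotated, with opposite labels.
Y = np.array([[ 1, -1],
              [ 0,  0],
              [ 0,  0],
              [ 0,  0],
              [-1,  1]], dtype=float)
F, Y_pred = mlgf_predict(G, Y)
# For the first function, F[:, 0] is approximately [2, 1, 0, -1, -2]:
# the influence of each annotated protein decays with distance along the chain.
```

Protein 1, which is closer to the positively labeled protein 0, receives a positive decision value for the first function, while protein 3 receives a negative one. Each column of F is computed independently, which is exactly the isolation between functions that the proposed approach removes.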

2.3. Utilizing both local and global structure of a PPI network by the Green's function

Before proceeding to propose our new approach, we first point out that the Green's function is closely related to a well-established distance metric on a generic weighted graph, where the edge weight measures the similarity between two end vertices. Based on the derivation of the distance metric, it is easy to see that the Green's function approach not only takes advantage of global topology of a network but also leverages its local structures.

We view a generic weighted graph as a network of electric resistors, where the edge connecting vertices x_i and x_j is a resistor with resistance r_ij. The graph edge weight (the pairwise similarity) between vertices x_i and x_j is w_ij = 1/r_ij. Two vertices not connected by a resistor are viewed as equivalently connected by a resistor with r_ij = ∞ or w_ij = 0. The most common task on a resistor network is to calculate the effective resistance between different vertices. The effective resistance R_ij between vertices x_i and x_j is equal to 1/(total current between x_i and x_j) when x_i is connected to voltage 1 and x_j is connected to voltage 0. Let $G = (D - W)_+^{-1}$ be the Green's function on the graph. A remarkable result established in the 1970s (Klein and Randić, 1993) states that
\begin{align*}
R_{ij} = (\mathbf{e}_i - \mathbf{e}_j)^T G (\mathbf{e}_i - \mathbf{e}_j) = G_{ii} + G_{jj} - 2G_{ij}, \tag{5}
\end{align*}

where e_i is a vector of all 0's except for a “1” at the i-th entry. Recalling that the Mahalanobis distance in a metric space is d²(x_i, x_j) = (x_i − x_j)^T M(x_i − x_j), we can view R_ij as a distance on a graph. The same conclusion can also be drawn from the random walk perspective (Ding et al., 2007). In statistics, given pairwise similarity S = (s_ij), a standard way to convert it to a distance is d_ij = s_ii + s_jj − 2s_ij. From Equation (5), we have s_ij = G_ij + const. By ignoring the additive constant, G is the similarity metric underlying the effective resistance distance. Therefore, by simulating label propagation over a graph as current flowing on an electric network, the voltage of each vertex (i.e., the function assigned to a protein) is determined by both the global topology and the local structures of the electric network. In other words, for protein function prediction, the Green's function approach not only targets global optimization but also rewards the local linkage patterns and distance impacts illustrated in Figure 1.
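Equation (5) can be checked numerically against two circuits whose effective resistances are known from elementary physics: two unit resistors in series (resistance 2 between the end points) and a triangle of unit resistors (one resistor in parallel with two in series, giving 2/3). The sketch below assumes connected graphs and uses the pseudoinverse of L as the Green's function; the helper names are hypothetical.

```python
import numpy as np

def greens_function(W):
    """Green's function of a connected graph via the pseudoinverse of L = D - W."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.pinv(L, rcond=1e-10, hermitian=True)

def effective_resistance(G):
    """All pairwise effective resistances R_ij = G_ii + G_jj - 2 G_ij, Equation (5)."""
    d = np.diag(G)
    return d[:, None] + d[None, :] - 2.0 * G

# Two unit resistors in series: 0 - 1 - 2.
W_path = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]], dtype=float)
R_path = effective_resistance(greens_function(W_path))

# Triangle of unit resistors.
W_tri = np.ones((3, 3)) - np.eye(3)
R_tri = effective_resistance(greens_function(W_tri))
```

Here `R_path[0, 2]` is 2 and every off-diagonal entry of `R_tri` is 2/3. The extra path through the third vertex lowers the resistance in the triangle, which is the sense in which G rewards multiple connections between two proteins.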

From the analysis above, we can list some properties of the Green's function. First, G is clearly positive semidefinite. Second, any function $\mathbf{f} \in \mathbb{R}^n$ can be expanded in the basis of G, that is, $(\mathbf{v}_2, \ldots, \mathbf{v}_n)$, plus a constant $\mathbf{e}/\sqrt{n} = \mathbf{v}_1$. Third, for a kernel function $\mathcal{K}$, $\mathcal{K}_{ij}$ measures the similarity between two objects i and j. Therefore, the Green's function is a bona fide kernel, which will be used to derive the proposed approach from RKHS theory (Section 2.4).

2.4. Function–function correlated RKHS approach for multi-label classification

Although the function–function correlations are useful to infer putative functions of unannotated proteins, the MLGF approach defined in Equation (4) neglects them because it treats the biological functions as isolated. In multi-label scenarios, however, we concentrate on making use of the function–function correlations, which can be defined as $C \in \mathbb{R}^{K \times K}$ using cosine similarity as follows (Wang et al., 2009, 2010b–d, 2011):
\begin{align*}
C_{kl} = \cos(\mathbf{y}^{(k)}, \mathbf{y}^{(l)}) = \frac{\langle \mathbf{y}^{(k)}, \mathbf{y}^{(l)} \rangle}{\| \mathbf{y}^{(k)} \| \, \| \mathbf{y}^{(l)} \|}. \tag{6}
\end{align*}
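Equation (6) amounts to normalizing the columns of Y_l and taking their Gram matrix. The following sketch applies it to the {−1, 0, 1} encoding of Section 2.1 as-is; the function name, the handling of all-zero columns, and the toy annotations are assumptions made for illustration.

```python
import numpy as np

def function_correlation(Y_l):
    """Cosine similarity between classwise assignment vectors (columns of Y_l), Equation (6)."""
    Y_l = np.asarray(Y_l, dtype=float)
    norms = np.linalg.norm(Y_l, axis=0)
    norms[norms == 0] = 1.0          # assumption: an all-zero column gets zero correlation
    Yn = Y_l / norms                 # unit-length columns
    return Yn.T @ Yn                 # K x K matrix of pairwise cosines

# Hypothetical annotations of l = 4 proteins with K = 3 functions.
Y_l = np.array([[ 1,  1, -1],
                [ 1,  1,  1],
                [-1, -1,  1],
                [ 1, -1, -1]])
C = function_correlation(Y_l)
# C == [[ 1.0,  0.5, -0.5],
#       [ 0.5,  1.0,  0.0],
#       [-0.5,  0.0,  1.0]]
```

Functions 1 and 2 agree on three of the four proteins and obtain a correlation of 0.5. Because C is a Gram matrix of unit vectors, it is symmetric and positive semidefinite.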

Following Wang et al. (2009), we expect to maximize tr(FCF^T). In order to make a connection with the theory of RKHS, instead of directly using F, we use the kernel-assisted decision matrix $\mathcal{K}^{-\frac{1}{2}} F$, which leads to the following objective to maximize (Wang et al., 2009):
\begin{align*}
J_C(F) = \mathrm{tr}\left(\mathcal{K}^{-\frac{1}{2}} F C F^T \mathcal{K}^{-\frac{1}{2}}\right). \tag{7}
\end{align*}

Combining Equation (7) with the original RKHS objective in Equation (2), we minimize the following objective:
\begin{align*}
J(F) = \beta \| F - Y \|^2 + \mathrm{tr}\left(F^T \mathcal{K}^{-1} F\right) - \alpha \, \mathrm{tr}\left(\mathcal{K}^{-\frac{1}{2}} F C F^T \mathcal{K}^{-\frac{1}{2}}\right). \tag{8}
\end{align*}

Differentiating J with respect to F, we have:
\begin{align*}
\frac{\partial J}{\partial F} = 2\beta(F - Y) + 2\mathcal{K}^{-1} F - 2\alpha \mathcal{K}^{-1} F C = 0 \Longrightarrow F = \frac{1}{\beta I + \mathcal{K}^{-1}} \beta Y + \alpha \frac{1}{\beta I + \mathcal{K}^{-1}} \mathcal{K}^{-1} F C. \tag{9}
\end{align*}

Because β is usually very small in typical empirical settings, we have:
\begin{align*}
\frac{F}{\beta} = \mathcal{K} Y + \alpha \frac{F}{\beta} C \Longrightarrow \tilde{F} = \mathcal{K} Y + \alpha \tilde{F} C = GY + \alpha \tilde{F} C, \tag{10}
\end{align*}
where $\tilde{F} = F/\beta$. Solving Equation (10) for $\tilde{F}$ gives:
\begin{align*}
\tilde{F} = GY (I - \alpha C)^{-1}. \tag{11}
\end{align*}
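The small-β step can be spelled out as follows. This is a sketch that treats $\mathcal{K}$ as invertible; for the Green's function this holds only on the subspace orthogonal to e, which is an assumption made here for illustration.

```latex
\begin{align*}
(\beta I + \mathcal{K}^{-1})^{-1}
  &= \left[\mathcal{K}^{-1}(I + \beta \mathcal{K})\right]^{-1}
   = (I + \beta \mathcal{K})^{-1} \mathcal{K} \approx \mathcal{K}, \\
(\beta I + \mathcal{K}^{-1})^{-1} \mathcal{K}^{-1}
  &= (I + \beta \mathcal{K})^{-1} \approx I,
  \qquad \text{as } \beta \to 0 .
\end{align*}
```

Substituting both approximations into Equation (9) gives $F \approx \beta \mathcal{K} Y + \alpha F C$, and dividing by β yields the relation between $F/\beta$, $\mathcal{K} Y$, and C stated in Equation (10).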

We refer to Equation (11) as our proposed function–function correlated multi-label (FCML) approach for protein function prediction. With Equation (11), the decision matrix can be computed in closed form without iterations, which is more mathematically elegant than related approaches. Moreover, (I − αC)⁻¹ can be seen as another graph, which propagates label influence through the label correlations over the whole network.
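Rearranging Equation (10) as F̃(I − αC) = GY shows that the closed form requires only one K × K linear solve. The sketch below is a minimal illustration under stated assumptions: the graph is connected (so the pseudoinverse of L serves as G), and the correlation matrix C, the labels, and the choice of α are hypothetical values rather than settings from the experiments.

```python
import numpy as np

def fcml_closed_form(G, Y, C, alpha):
    """Solve F (I - alpha C) = G Y for F, i.e. F = G Y (I - alpha C)^{-1}."""
    K = C.shape[0]
    A = np.eye(K) - alpha * C
    # F A = B  is equivalent to  A^T F^T = B^T, which avoids forming the inverse explicitly.
    return np.linalg.solve(A.T, (G @ Y).T).T

# Hypothetical chain of 4 proteins with unit edge weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
G = np.linalg.pinv(L, rcond=1e-10, hermitian=True)

Y = np.array([[1, 1], [-1, 1], [0, 0], [0, 0]], dtype=float)  # two annotated proteins
C = np.array([[1.0, 0.6], [0.6, 1.0]])                        # assumed correlations
alpha = 0.5 / np.linalg.eigvalsh(C).max()                     # keeps I - alpha C invertible
F_tilde = fcml_closed_form(G, Y, C, alpha)
```

Setting α = 0 recovers the MLGF decision matrix GY, so the correlation term acts purely as a correction on top of the network-based propagation.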

A more in-depth look at the FCML method. In practice, we select $\alpha < \frac{1}{\max(\zeta_k)}$, where ζ_k (1 ≤ k ≤ K) are the eigenvalues of C. Under this condition, Equation (11) can be written as follows:
\begin{align*}
\tilde{F} = GY \left(I + \alpha C + \alpha^2 C^2 + \cdots\right), \tag{12}
\end{align*}

which can be further seen as the following iterative process:
\begin{align*}
\begin{cases}
\tilde{F}^{(0)} = GY, \\
\tilde{F}^{(t+1)} = GY + \alpha \tilde{F}^{(t)} C.
\end{cases} \tag{13}
\end{align*}

Equation (13) reveals the insight of the proposed FCML method. At the initialization step, the decision matrix $\tilde{F}^{(0)}$ is initialized via label propagation using the Green's function method, which is exactly the same as the MLGF method defined in Equation (4). Then, at each iteration step, besides retaining the initial information (the first term), the influence of function–function correlations is also taken into account (the second term). For example, consider the case when protein x_i is annotated with the k₁-th function but not the k₂-th function, that is, $Y_{ik_1} = 1$ and $Y_{ik_2} = 0$.
If, through label propagation on the interaction network, protein x_i still cannot acquire the k₂-th function from the network topology, then after the initialization step we have $\tilde{F}_{ik_1}^{(0)} > 0$ and $\tilde{F}_{ik_2}^{(0)} = 0$. On the other hand, if these two functions are correlated, i.e., $C_{k_1k_2} > 0$, then after one iteration step $\tilde{F}_{ik_2}^{(1)} = \tilde{F}^{(0)}_{ik_2} + \left(\alpha \tilde{F}^{(0)}C \right)_{ik_2} > 0$, so protein x_i is likely to be annotated with the k₂-th function due to the function–function correlations.
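The iterative view in Equation (13) can be checked numerically. The sketch below uses hypothetical toy matrices in which protein 0 carries function k₁ only and the Green's function is replaced by an identity matrix, so that no label propagates over the network. It shows that the k₂ score becomes positive after one update, and that many updates approach $GY(I - \alpha C)^{-1}$, the limit of the series in Equation (12). The function name `fcml_iterate` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def fcml_iterate(G, Y, C, alpha, n_iter):
    """Run n_iter updates of Equation (13), starting from F^(0) = G Y."""
    GY = G @ Y
    F = GY.copy()
    for _ in range(n_iter):
        F = GY + alpha * F @ C
    return F

# Toy setup (hypothetical): 2 proteins, 2 functions (k1 = column 0, k2 = column 1).
G = np.eye(2)                      # stand-in Green's function, no network propagation
Y = np.array([[1.0, 0.0],          # protein 0 has function k1 only
              [0.0, 0.0]])         # protein 1 is unannotated
C = np.array([[0.0, 0.8],
              [0.8, 0.0]])         # k1 and k2 are correlated
alpha = 0.5                        # alpha < 1 / 0.8, so the series converges

F0 = fcml_iterate(G, Y, C, alpha, 0)
F1 = fcml_iterate(G, Y, C, alpha, 1)
F_limit = fcml_iterate(G, Y, C, alpha, 200)
F_closed = G @ Y @ np.linalg.inv(np.eye(2) - alpha * C)

print(F0[0, 1], F1[0, 1])               # k2 score of protein 0 before and after one update
print(np.allclose(F_limit, F_closed))   # the iteration approaches the closed form
```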

2.5. Correlation augmented interaction network

Traditional network-based protein function prediction approaches only use biological interaction networks obtained from experimental data, such as those from high-throughput technologies. When viewing protein function prediction as a multi-label classification problem, we can also build a computational interaction network $W_{\rm L} \in \mathbb{R}^{n \times n}$ from the label assignment perspective. As one of our important contributions, we make use of this new computational interaction network and propose a correlation augmented interaction network as follows:
\begin{align*}W = W_{\rm Bio} + \gamma W_{\rm L} , \tag{14}\end{align*}

where W_Bio is the biological interaction network, which is the same as in existing approaches, and γ controls the relative importance of W_L. We select γ empirically as $\gamma = \frac{\sum_{i , j , i \neq j} W_{\rm Bio}(i , j)}{\sum_{i , j , i \neq j} W_{\rm L}(i , j)}$.
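A minimal sketch of Equation (14) and the empirical choice of γ follows, taking both networks as given dense NumPy arrays. The function name `augment_network` and the toy matrices are hypothetical. The sketch also assumes that W_L has a nonzero off-diagonal sum, since γ is otherwise undefined.

```python
import numpy as np

def augment_network(W_bio, W_label):
    """Return (W, gamma) with W = W_bio + gamma * W_label, as in Equation (14)."""
    off_diag = ~np.eye(W_bio.shape[0], dtype=bool)   # mask selecting entries with i != j
    gamma = W_bio[off_diag].sum() / W_label[off_diag].sum()
    return W_bio + gamma * W_label, gamma

# Toy matrices (hypothetical values).
W_bio = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
W_label = np.array([[1.0, 0.5, 0.5],
                    [0.5, 1.0, 0.0],
                    [0.5, 0.0, 1.0]])

W, gamma = augment_network(W_bio, W_label)
print(gamma)   # off-diagonal sums are 4.0 and 2.0, so gamma = 2.0
```

With this choice of γ, the off-diagonal weights contributed by the two networks have equal totals, so neither source dominates the combined graph by scale alone.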

The true power of the correlation augmented interaction network construction scheme defined in Equation (14) lies in the fact that the original biological similarities among proteins are augmented by the function assignment similarities, so that label propagation pathways over the graph are reinforced. Moreover, with this construction scheme, the correlations among the functional categories are encoded into the graph weights, such that the resulting hybrid graph can be directly used in previous works to enhance their prediction performance. In this work, we use W defined in Equation (14) to compute the Green's function in Equation (1).

Protein–protein similarity from function assignments (W_L). Because multiple functions could be assigned to one single protein, the overlap between the function assignments of two proteins can be used to evaluate their similarity. The more functions shared by two proteins, the more similar they are. As a result, besides the class membership indications, the label assignment vector y_i is enriched with characteristic meaning and can be used as an attribute vector to characterize protein x_i. Using cosine similarity, the function assignment similarity between two proteins is computed as:
\begin{align*}W_{\rm L}(i , j) = \cos (\mathbf{y}_i , \mathbf{y}_j) = \frac{\langle \mathbf{y}_i , \mathbf{y}_j \rangle}{\| \mathbf{y}_i \| \, \| \mathbf{y}_j \|} . \tag{15} \end{align*}
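The cosine similarity in Equation (15) can be sketched for a whole label matrix at once, as below. The function name `label_similarity` and the toy label matrix are hypothetical. The handling of proteins with no labels (similarity 0 to every protein) is an assumption of this sketch, because the cosine is undefined for a zero vector.

```python
import numpy as np

def label_similarity(Y):
    """Cosine similarity between the rows of a 0/1 label matrix Y (Equation (15))."""
    Y = np.asarray(Y, dtype=float)
    norms = np.linalg.norm(Y, axis=1)
    # Assumption: rows with no labels keep a zero vector, giving similarity 0.
    safe = np.where(norms > 0, norms, 1.0)
    unit = Y / safe[:, None]
    return unit @ unit.T

Y = np.array([[1, 1, 0],    # protein 0: functions a and b
              [1, 0, 0],    # protein 1: function a
              [0, 0, 1],    # protein 2: function c
              [0, 0, 0]])   # protein 3: no labels yet
W_L = label_similarity(Y)
print(W_L[0, 1])            # one shared function out of two: 1 / sqrt(2)
```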

Our task in protein function prediction is to assign functions to unannotated proteins based on annotated ones. In order to compute W_L, however, we need function assignments for all proteins, both annotated and unannotated. Therefore, we first initialize the unannotated proteins through a majority voting approach (Schwikowski et al., 2000), which makes predictions using the three most frequent functions of the protein's interacting partners. Note that the class similarity defined in Equation (6) is different from the protein similarity defined here in Equation (15). The former is a function-wise similarity matrix of size K × K, whereas the latter is a protein-wise similarity matrix of size n × n, although essentially they both convey label correlations.
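The majority voting initialization described above can be sketched as follows. The data structures (a dictionary of interaction partners and a dictionary of known labels), the protein names, and the function name `majority_vote_init` are hypothetical. Ties between equally frequent functions are broken arbitrarily in this sketch, which the text does not specify.

```python
from collections import Counter

def majority_vote_init(neighbors, labels, top=3):
    """Give each unannotated protein the `top` most frequent functions
    among its annotated interaction partners."""
    init = {}
    for protein, partners in neighbors.items():
        if protein in labels:
            continue                      # already annotated, nothing to initialize
        counts = Counter()
        for partner in partners:
            counts.update(labels.get(partner, ()))
        init[protein] = [f for f, _ in counts.most_common(top)]
    return init

# Hypothetical toy network: P4 is unannotated and interacts with P1, P2, P3.
neighbors = {"P1": ["P4"], "P2": ["P4"], "P3": ["P4"],
             "P4": ["P1", "P2", "P3"]}
labels = {"P1": {"11", "14"}, "P2": {"11"}, "P3": {"11", "14", "16"}}

print(majority_vote_init(neighbors, labels))   # {'P4': ['11', '14', '16']}
```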

Biological protein–protein similarity (W_Bio) and multisource integration. W_Bio in Equation (14) computes the protein–protein similarity from biological experimental data, which is the same as in existing works and can integrate multiple experimental sources. Let W⁽¹⁾ be the graph built from BioGRID PPI data (Ho et al., 2002; Giot et al., 2003), W⁽²⁾ be that from synthetic lethal data (Tong et al., 2004), W⁽³⁾ be that from gene coexpression data (Edgar et al., 2002), W⁽⁴⁾ be that from gene regulation data (Harbison et al., 2004), and so on. Then W_Bio is computed as follows (Pei and Zhang, 2005):
\begin{align*}W_{\rm Bio}(i , j) = 1 - \prod_k \left[1 - r^{(k)}W^{(k)}(i , j) \right] , \tag{16}\end{align*}

where r^(k) is the reliability of the k-th network, estimated by the expression profile reliability (EPR) index (Deane et al., 2002). Equation (16) reflects the fact that interactions detected in multiple experiments are generally more reliable than those detected by a single experiment (Von Mering et al., 2002).
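To make the behavior of Equation (16) concrete, the sketch below combines two toy networks. The function name `integrate_networks` and the reliability values 0.5 and 0.4 are made up for illustration and are not EPR estimates from the paper.

```python
import numpy as np

def integrate_networks(networks, reliabilities):
    """Combine networks as 1 - prod_k (1 - r_k * W_k), following Equation (16)."""
    miss = np.ones_like(networks[0], dtype=float)
    for W_k, r_k in zip(networks, reliabilities):
        miss *= 1.0 - r_k * W_k      # factor for source k not supporting the edge
    return 1.0 - miss

# Two hypothetical binary networks that both report the edge (0, 1).
W1 = np.array([[0, 1], [1, 0]])
W2 = np.array([[0, 1], [1, 0]])
W_bio = integrate_networks([W1, W2], [0.5, 0.4])
print(W_bio[0, 1])   # 1 - (1 - 0.5) * (1 - 0.4), i.e. about 0.7
```

An edge supported by both sources receives a weight of about 0.7, larger than either single reliability, which matches the stated intuition that repeated detection increases confidence.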

Because in reality the overlap among different biological networks typically is very small, and the BioGRID PPI network data are fairly comprehensive, in this work we set W_Bio = W⁽¹⁾, where W⁽¹⁾(i, j) = 1 if protein x_i and x_j interact, and 0 otherwise.

By using the graph constructed from Equation (14), in addition to explicitly modeling the function–function correlations as in Equation (11), the correlations are also implicitly incorporated into the network linkages, so that the predictive accuracy can be further enhanced.

2.6. Adaptive decision boundary for function assignment

The MLGF approach defined in Equation (4) and the FCML approach defined in Equation (11) produce ranked lists for function/label assignment; therefore, decision boundaries are required to make predictions. Most existing works that use ranked lists to predict protein functions do not supply a threshold explicitly. Instead, they use a set of ROC curves (or the variant “precision”–“recall” curves) to evaluate the prediction performance. In some of these approaches, a heuristic cutoff point is given at the function assignment step; e.g., in the majority voting (MV) approach, Schwikowski et al. (2000) assigned to an unannotated protein the three most frequently occurring functions among its neighbors. However, such a threshold might not be the optimal one.

In many semi-supervised learning algorithms, the threshold for classification is usually selected as 0, which again is not necessarily the best choice. We propose an adaptive decision boundary to achieve better performance, which is adjusted such that the weighted training errors of all positive and negative samples are minimized.

Considering the binary classification problem for the k-th class, we denote b_k as the decision boundary, S₊ and S₋ as the sets of positive and negative samples for the k-th class, and e₊(b_k) and e₋(b_k) as the numbers of misclassified positive and negative training samples. The adaptive (optimal) decision boundary is given by Bayes' rule, $b_k^{\rm opt} = \arg \min_{b_k} \left[ \frac{e_{+}(b_k)}{|S_{+}|} + \frac{e_{-}(b_k)}{|S_{-}|} \right]$, and the decision rule is given by:
\begin{align*} x_i \ {\rm acquires \ label} \ k = \begin{cases} + 1 , & \text{if } \tilde{F}_{ik} > b_k^{\rm opt} ; \\ - 1 , & \text{if } \tilde{F}_{ik} < b_k^{\rm opt} . \end{cases} \tag{17}\end{align*}
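One simple way to realize this minimization is an exhaustive search over candidate thresholds, sketched below for a single function k. The candidate set (midpoints between consecutive distinct training scores) is an assumption, since the text does not specify the search procedure. The function name `adaptive_boundary` and the toy scores are hypothetical, and both classes are assumed to be non-empty.

```python
def adaptive_boundary(scores, labels):
    """Pick b minimizing e_plus(b)/|S+| + e_minus(b)/|S-| on training data.

    scores: decision values for one function k
    labels: +1 / -1 ground truth for the same proteins
    """
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # Candidate boundaries: midpoints between consecutive distinct scores.
    values = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def cost(b):
        e_pos = sum(s <= b for s in pos)   # positives that would be predicted -1
        e_neg = sum(s > b for s in neg)    # negatives that would be predicted +1
        return e_pos / len(pos) + e_neg / len(neg)

    return min(candidates, key=cost)

# Hypothetical unbalanced data: 3 positives and 6 negatives.
scores = [1.0, 0.5, -0.25, 0.25, -0.5, -0.75, -1.0, -1.25, -1.5]
labels = [+1, +1, +1, -1, -1, -1, -1, -1, -1]
b_opt = adaptive_boundary(scores, labels)
print(b_opt)   # -0.375
```

On this toy data a threshold of 0 misclassifies one positive and one negative sample (weighted error 1/3 + 1/6), whereas the selected boundary misclassifies only one negative sample (weighted error 1/6).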

Figure 2 shows the adaptive decision boundary for function “11” (transcription) defined in MIPS Funcat annotation scheme (version 2.1) using BioGRID PPI data (version 2.0.45). In the figure, the areas (probability likelihood) of misclassifications are minimized, and the adaptive decision boundary is different from 0.

FIG. 2.

Optimal decision boundary to minimize misclassification for function “11” (Transcription) (the black vertical line) is different from 0.

2.7. Statistical confidence of putative protein function

Many existing protein function prediction approaches only assign “yes” or “no” to a protein when deciding its membership to a function. However, because of the high noise in the biological experiments that generate PPI data, it would be better to estimate the probability of a given prediction rather than simply saying “yes” or “no.” Namely, the statistical confidence of a given prediction is necessary and often of great use in post-processing of proteomic analysis. For example, in order to minimize the experimental time, biologists could decide the order of biological experiments according to the confidence values of the putative protein functions.

Quantitatively evaluating the confidence of a prediction is usually not easy, because the underlying probability model and the actual training and testing data distribution are constantly changing for different biological functions. In this study, we adopt the posterior probability as a metric of the confidence for a prediction due to its clear statistical meaning and explicit computational formula.

Let Y_ik be the ground truth membership of protein x_i for the k-th biological function, and $Z_{ik} \in \{\pm 1\}$ be the predicted membership. We denote the confidence for the prediction as c(Z_ik). Given the prior probabilities P(Y_ik = ±1), either computed from the training data or set equally to 0.5, and the class-conditional densities p(Z_ik|Y_ik = ±1), the posterior probability P(Y_ik = +1|Z_ik) (i.e., the confidence c(Z_ik)) is given by Bayes' rule as follows:
\begin{align*}c(Z_{ik}) = P(Y_{ik} = + 1 \mid Z_{ik}) = \frac{p(Z_{ik} \mid Y_{ik} = + 1) P(Y_{ik} = + 1)}{\sum_{l = \pm 1} p(Z_{ik} \mid Y_{ik} = l) P(Y_{ik} = l)} . \tag{18} \end{align*}

Hastie and Tibshirani (1998) propose to fit Gaussians to the class-conditional densities p(Z_ik|Y_ik = ±1). The posterior probability is thus a sigmoid, whose slope is determined by the tied variance. Despite its clear intuition and explicit formulation, this approach is seldom useful in real applications because the assumption of Gaussian class-conditional densities is often violated. Figure 3 shows a plot of the class-conditional densities p(Z_ik|Y_ik = ±1) for the training data of function “02” (Energy) in the MIPS FunCat annotation system. The plot shows histograms of the densities (with bins 0.1 wide), derived from 10-fold cross-validation. These densities are clearly far from Gaussian. To tackle this problem, inspired by empirical data, Platt et al. (1999) proposed to model the class-conditional densities only implicitly and to fit the posterior probability directly to a parametric sigmoid form:
\begin{align*}c(Z_{ik}) = P(Y_{ik} = + 1 \mid Z_{ik}) = \frac{1}{1 + \exp \left(AZ_{ik} + B \right)} . \tag{19} \end{align*}

FIG. 3.

Class-conditional densities for the training data of function “02” (Energy). The solid red line is p(Z_ik|Y_ik = +1), while the dashed blue line is p(Z_ik|Y_ik = −1). Obviously, these two histograms are not Gaussian.

The posterior probability defined in Equation (19) takes the form of a logistic function, which represents the cumulative distribution function of a large family of exponential distributions. Because of this statistical grounding, we adopt Equation (19) as our quantitative metric to measure the prediction confidence. The fitted sigmoid curve for the data shown in Figure 3 is plotted in Figure 4.
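As a hedged illustration of how the parameters A and B in Equation (19) might be estimated, the sketch below fits them by plain gradient descent on the negative log-likelihood of toy data. The optimizer, learning rate, step count, toy scores, and the function names `platt_confidence` and `fit_platt` are all assumptions of this sketch and are not taken from the paper. For scores of very large magnitude, `math.exp` could overflow, which this minimal version does not guard against.

```python
import math

def platt_confidence(z, A, B):
    """Equation (19): posterior P(Y = +1 | Z = z) as a sigmoid of the score."""
    return 1.0 / (1.0 + math.exp(A * z + B))

def fit_platt(scores, targets, lr=0.1, n_steps=5000):
    """Fit A and B by gradient descent on the negative log-likelihood.

    targets are 1 for positive and 0 for negative training proteins.
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_steps):
        grad_A = grad_B = 0.0
        for z, t in zip(scores, targets):
            p = platt_confidence(z, A, B)
            grad_A += (t - p) * z     # d(NLL)/dA for the form 1 / (1 + exp(A z + B))
            grad_B += (t - p)         # d(NLL)/dB
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Hypothetical, non-separable toy scores.
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
targets = [0, 0, 1, 0, 1, 1]
A, B = fit_platt(scores, targets)
print(A < 0, platt_confidence(2.0, A, B) > platt_confidence(-2.0, A, B))
```

With the sign convention of Equation (19), a negative A means that larger scores map to higher confidence, which agrees with the sign of the fitted A reported for Figure 4.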

FIG. 4.

The fit of a sigmoid to the training data of function “02” (Energy) as shown in Figure 3. Blue markers are the average posterior probabilities computed for the examples falling into a bin of width 0.1. The solid red line is the best-fit sigmoid to the posterior probabilities, where A = −2.124 × 10⁴ and B = 2.916.

3. Materials and Data Sets

Two types of data are involved in the experimental evaluations for protein function prediction: function annotation data and PPI data. In this section, we describe the data used in this work.

The functional catalogue (FunCat) (Mewes et al., 1999) is a project under the Munich Information Center for Protein Sequences (MIPS), which is an annotation scheme for the functional description of proteins from prokaryotes, unicellular eukaryotes, plants, and animals. Taking into account the broad and highly diverse spectrum of known protein functions, FunCat version 2.1 consists of 27 main functional categories. Seventeen of them are involved in annotating Saccharomyces cerevisiae and cover general fields such as cellular transport, metabolism, and cellular communication/signal transduction. The main branches exhibit a hierarchical, treelike structure with up to six levels of increasing specificity. Although there are other protein annotation systems, such as the Gene Ontology (Ashburner et al., 2000), we use the FunCat annotation system due to its clear treelike hierarchical structure.

The protein–protein interaction data can be downloaded from the BioGRID database (Stark et al., 2006), and we focus on S. cerevisiae. After removing the proteins connected by only one PPI, there are 4299 proteins with 72624 PPIs in the BioGRID database (version 2.0.45) annotated by the FunCat annotation scheme, together with 1997 unannotated proteins. All 17 related level-1 biological functions are listed in Table 1.

Table 1.

Function IDs and Names by Funcat Scheme Version 2.1.

ID	Name
‘01’	Metabolism
‘02’	Energy
‘10’	Cell cycle and DNA processing
‘11’	Transcription
‘12’	Protein synthesis
‘14’	Protein fate (folding, modification, destination)
‘16’	Protein with binding function or cofactor requirement (structural or catalytic)
‘18’	Regulation of metabolism and protein function
‘20’	Cellular transport, transport facilitation and transport routes
‘30’	Cellular communication/signal transduction mechanism
‘32’	Cell rescue, defense and virulence
‘34’	Interaction with the environment
‘38’	Transposable elements, viral and plasmid proteins
‘40’	Cell fate
‘41’	Development (systemic)
‘42’	Biogenesis of cellular components
‘43’	Cell type differentiation

4. Results and Discussions

In this article, we proposed a function–function correlated multi-label (FCML) approach for protein function prediction to utilize the correlations among the biological functions to improve the overall prediction performance. Using the PPI data from BioGRID database (Stark et al., 2006) and Funcat annotation scheme (Mewes et al., 1999) on S. cerevisiae data, we evaluate our proposed approach and make predictions for unknown proteins.

For statistical metrics, we use the standard precision and F1 score that have been widely used in previous protein function prediction research. Let TP (true positive) be the number of proteins that we correctly predict to have a given function, FP (false positive) be the number of proteins that we incorrectly predict to have the function, and FN (false negative) be the number of proteins that we incorrectly predict not to have the function. The “precision” is defined as $\frac{\rm TP}{\rm TP + FP}$, and the “recall” (also known as “sensitivity”) is defined as $\frac{\rm TP}{\rm TP + FN}$. We do not report the specificity of the procedures, because even a trivial algorithm that assigns all proteins to membership of −1 will achieve high specificity due to the unbalanced distribution of positive and negative samples in protein annotation data.
In addition, we use the “F1 score” to evaluate precision and recall together. It is the harmonic mean of precision and recall and is defined as follows:
\begin{align*}\hbox{F1 score} = \frac{2 \times {\rm Precision} \times {\rm Recall}}{{\rm Precision} + {\rm Recall}}\end{align*}

The F1 score is extensively used in previous related work and in other domains such as information retrieval. Typically, improving the precision of an algorithm decreases its recall and vice versa; the F1 score is therefore a balanced performance metric.
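The three metrics can be computed directly from the counts defined above, as in the short sketch below. The function name `precision_recall_f1` and the example counts are hypothetical. Returning 0.0 when a denominator is zero is a convention chosen for this sketch and is not stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for one function.
print(precision_recall_f1(tp=30, fp=10, fn=30))   # (0.75, 0.5, 0.6)
```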

4.1. Evaluation of the function–function correlations

Because function–function correlations are one of the most important mechanisms for improving the prediction performance in our proposed approach, we first evaluate their validity. Using the FunCat 2.1 annotation data set for the S. cerevisiae genome, the function–function correlations defined in Equation (6) are illustrated in the right panel of Figure 5. The high correlation value between functions “40” (Cell Fate) and “43” (Cell Type Differentiation) depicted in this figure shows that they are highly correlated. In addition, as shown in this figure, some other function pairs are also highly correlated, such as functions “11” (Transcription) and “16” (Protein with Binding Function or Cofactor Requirement), “18” (Regulation of Metabolism and Protein Function), and “30” (Cellular Communication/Signal Transduction Mechanism), etc. All these observations are consistent with biological knowledge, which justifies the utility of the function correlations from a biological perspective. Figure 5 will also be used to demonstrate the power of function–function correlations in Section 4.4.

FIG. 5.

Right panel: Illustration of the correlation matrix defined in Equation (6) among the 17 main functional categories in FunCat 2.1 annotated to the S. cerevisiae genome. Left panel: Protein “FUN19” is the testing protein, and the other neighboring proteins are training data that have functions “11” or “14.” Middle panel: Our FCML method correctly annotates protein “FUN19” with function “16” utilizing function–function correlations.

4.2. Robustness of adaptive decision boundary

The adaptive decision boundary is another important contribution of this article. Instead of heuristically selecting the thresholds by experience, as many existing approaches do, the adaptive decision boundary method computes the thresholds for function assignment from the training data in a principled way, in order to deal with the unbalanced distribution of positive and negative training data. We need to evaluate whether the adaptive decision boundary is robust to the amount of training data used to compute it.

In order to evaluate the robustness of the adaptive decision boundaries, we compute them using different amounts of training data and report the corresponding prediction performance in five-fold cross-validation. As a demonstration, we randomly select function “11” (Transcription) and conduct the experiments on its data. We randomly select 5%, 10%, 20%, 40%, or 80% of the training data to compute the decision boundaries. The prediction performance measured by F1 score is plotted in Figure 6. From the experimental results, we can see that the prediction performance does not degrade much with the decrease of the amount of training data to compute the adaptive decision boundaries. In other words, adaptive decision boundary is a robust thresholding method as long as the amount of data used to compute it is not very small. For example, 5% of the training data is enough to calculate a valid adaptive decision boundary in the demonstration experiments.

FIG. 6.

Evaluation on the effectiveness and robustness of adaptive decision boundary. The performance measured by F1 score vs. the percentage of training data used to compute the adaptive decision boundary for function “11” (transcription).

4.3. Improved function prediction in cross-validation

We compare the performances of our proposed multi-label Green's function (MLGF) approach and function–function correlated multi-label (FCML) approach to related commonly used methods, namely the majority voting (MV) approach (Schwikowski et al., 2000), the global majority voting (GMV) approach (Vazquez et al., 2003), the χ² approach (Hishigaki et al., 2001), the functional flow (FF) approach (Nabieva et al., 2005), and the kernel logistic regression (KLR) method (Lee et al., 2006). The PPI graph is built from BioGRID data (version 2.0.45) with annotation by the MIPS FunCat scheme (version 2.1). Ten-fold cross-validation is used. For the other approaches, we use their respective optimal parameters. In the MV approach, we select the three most frequently occurring functions in a protein's neighbors. In the χ² approach, radius = 1 gives the best performance. In the FF approach, we assign functions according to the proportions of positive and negative training samples as suggested by Nabieva et al. (2005).

We report the overall prediction performance over all functions using the microaverages of the two performance values to address the multi-label scenario. The microaverage is computed from the sum of the per-class contingency tables, and can be seen as a weighted average that places more emphasis on the accuracy of classes/functions with more positive samples. The microaveraged precision and F1 score of the compared approaches over all 17 level-1 biological functions are listed in Table 2, which quantifies the advantages of our FCML approach over the others with more concrete evidence. Moreover, the performances of the FCML approach are consistently better than those of MLGF, which demonstrates that incorporating the inherent correlations among biological functions can improve the prediction performance significantly.
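The microaverage described above can be sketched as follows: the per-function counts are summed first, and precision and F1 are then computed from the totals. The function name `micro_average` and the two toy count triples are hypothetical and are unrelated to the values in Table 2.

```python
def micro_average(per_class_counts):
    """Micro-averaged precision and F1 from per-function (TP, FP, FN) counts."""
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1

# Hypothetical counts: one frequent function and one rare function.
counts = [(90, 30, 30),   # frequent function, per-class precision 0.75
          (1, 9, 9)]      # rare function, per-class precision 0.10
micro_precision, micro_f1 = micro_average(counts)
print(micro_precision, micro_f1)   # both about 0.70
```

The unweighted mean of the two per-class precisions would be 0.425, whereas the microaverage is about 0.70, which illustrates how the microaverage is dominated by functions with more positive samples.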

Table 2.

Microaverage of Precision and F1 Score by Six Approaches in Comparison over All Main Functional Categories by Funcat Scheme (mean ± std)

Approaches	Average precision	Average F1 score
MV	30.69% ± 1.12%	29.04% ± 1.05%
GMV	31.13% ± 2.14%	22.41% ± 1.75%
χ²	14.8% ± 1.21%	7.60% ± 0.67%
FF	28.01% ± 1.69%	27.05% ± 1.54%
KLR	36.81% ± 2.31%	37.54% ± 2.69%
MLGF	32.45% ± 2.45%	36.36% ± 2.61%
FCML	54.83% ± 2.78%	43.74% ± 2.61%

4.4. Demonstration of the effectiveness of function–function correlations

In the above experiments, our FCML method outperforms all other methods. We carefully examined the testing proteins that are incorrectly annotated by other methods but correctly annotated by the FCML method, and found that function–function correlations clearly help the function prediction results. We show one example to demonstrate the effectiveness of function–function correlations. In the left panel of Figure 5, protein “FUN19” is the testing protein, and the other neighboring proteins are training data that have functions “11” (Transcription) or “14” (Protein Fate [Folding, Modification, Destination]).

In experimental results, protein “FUN19” is annotated with functions “11” and “14” by all six methods. But it is annotated with function “16” (Protein with Binding Function or Cofactor Requirement [Structural or Catalytic]) only by our FCML approach and not by the other approaches. We observe that no proteins directly interacting with “FUN19” are annotated with function “16,” and only a small fraction (90 out of 355) of proteins indirectly interacting with “FUN19” via an intermediate protein are annotated with function “16.” Thus, all five other methods fail to annotate protein “FUN19” with function “16.”

However, a majority of proteins directly interacting with “FUN19” are annotated with either function “11” or function “14.” By scrutinizing the function–function correlation matrix computed from Equation (6), as shown in the right panel of Figure 5, we can see that function “16” has the highest statistical correlations with functions “11” and “14.” Utilizing such function–function correlations, our FCML method correctly annotates protein “FUN19” with function “16,” as shown in the middle panel of Figure 5. In other words, the function-wise correlations play a significant role in improving the overall predictive accuracy of protein function annotations.

4.5. Prediction and putative functions of unannotated proteins

We apply the proposed FCML approach on the BioGRID data annotated by the MIPS Funcat scheme and predict functions for the unannotated proteins. A list of all putative function predictions for level-1 functions in MIPS Funcat scheme by our algorithm is provided in Table 3, which is supplied in the Appendix of this article. In addition to predicted functions, we also report the corresponding statistical confidence values. For example, we annotate function “11” (Transcription) with statistical confidence of 0.83 and function “32” (Cell Rescue, Defense and Virulence) with statistical confidence of 0.12 to protein “YNR024W.” Namely, our experimental results suggest that protein “YNR024W” is more likely to be annotated with function “11” than function “32.”

Table 3.

Statistical Confidence of Predicted Putative Functions for Unannotated Proteins

	Function categories defined in MIPS Funcat annotation scheme
Proteins	“1”	“2”	“10”	“11”	“12”	“14”	“16”	“18”	“20”	“30”	“32”	“34”	“38”	“40”	“41”	“42”	“43”
YAL027W	0.83		0.86	0.55	0.19	0.32		0.12		0.39	0.44	0.49					0.63
YAL034C							0.36	0.12
YAL053W				0.62			0.87	0.61	0.99	0.77	0.88
YAR027W	0.49		0.38	0.31	0.19		0.32	0.18	0.55		0.32	0.18				0.44
YAR028W	0.54	0.27				0.72
YBL046W			0.51	0.67		0.81	0.27	0.24			0.17	0.32		0.17
YBL049W		0.58	0.64		0.22	0.81	0.51	0.11				0.30	0.03	0.15		0.37	0.28
YBL060W						0.29	0.72	0.29	0.85	0.19		0.72				0.92
YBL104C					0.56	0.70	0.56		0.56								0.65
YBR025C		0.11	0.59		0.38	0.60	0.21	0.11			0.21
YBR062C	0.85										0.75	0.23
YBR094W			0.77		0.44		0.78	0.42			0.18
YBR096W						0.68	0.48		0.88	0.23
YBR108W	0.60					0.26			0.39	0.26	0.19					0.48	0.23
YBR137W						0.72	0.47		0.44	0.11
YBR162C			0.62		0.57					0.16		0.54		0.23			0.41
YBR187W				0.91	0.29		0.48	0.12						0.17		0.39
YBR194W			0.42	0.45			0.42
YBR225W	0.66		0.84				0.28				0.15					0.40	0.43
YBR246W		0.50			0.55		0.37				0.39	0.28
YBR255W	0.40					0.39	0.22		0.34			0.20
YBR270C	0.44	0.17			0.45			0.58				0.28
YBR273C	0.34			0.26			0.38							0.10
YBR280C	0.44					0.40
YBR287W				0.47			0.26							0.12		0.29
YCL028W		0.07
YCL045C	0.71					0.40		0.08	0.66	0.17	0.39	0.22				0.43
YCL056C	0.48		0.79	0.28		0.39	0.21				0.44		0.10				0.20
YCR007C		0.38			0.43	0.33		0.35	0.96	0.18		0.87					0.18
YCR016W	0.79	0.09		0.63		0.35	0.86			0.82	0.23		0.03
YCR030C								0.13	0.62		0.83				0.50
YCR043C	0.39		0.72					0.52	0.37	0.23			0.02	0.51		0.64	0.64
YCR061W						0.42			0.61					0.36		0.86	0.45
YCR076C		0.13				0.93	0.36		0.49							0.33
YCR082W			0.62	0.55		0.38	0.33	0.30		0.12	0.12	0.21
YCR095C	0.50					0.25		0.12		0.10	0.18
YDL012C	0.92	0.88			0.30		0.71		0.82		0.89			0.13	0.22	0.83	0.27
YDL063C					0.18	0.29	0.46		0.61
YDL072C						0.55			0.69
YDL089W		0.07			0.24		0.56				0.25
YDL091C	0.65	0.17					0.34		0.47				0.01			0.33
YDL099W									0.28
YDL121C	0.39	0.08				0.76	0.65		0.34	0.12						0.31
YDL123W	0.77	0.29		0.31	0.70	0.81		0.17			0.26		0.01	0.75			0.88
YDL139C						0.67		0.10			0.52		0.14			0.56
YDL156W	0.48	0.14	0.75	0.39				0.66	0.61				0.26
YDL167C			0.59	0.55	0.40		0.26
YDL189W	0.44			0.54	0.25		0.60			0.49	0.38						0.34
YDL204W	0.70		0.61	0.61		0.35	0.27	0.14		0.22	0.56	0.57		0.46		0.68	0.44
YDR049W			0.33	0.97			0.68		0.92	0.52			0.02				0.34
YDR051C			0.35	0.32		0.73	0.52	0.51			0.80	0.26		0.19
YDR056C	0.93	0.58
YDR063W	0.45	0.10				0.51								0.13		0.29
YDR067C			0.57	0.42									0.01
YDR068W		0.63		0.67			0.96	0.64	0.34				0.01				0.39
YDR078C			0.70				0.41				0.26					0.48
YDR084C			0.56				0.34	0.40	0.64
YDR100W	0.74		0.48			0.63	0.68	0.08	0.71	0.54	0.14			0.26		0.64	0.14
YDR105C	0.67	0.42				0.58	0.81		0.80			0.22		0.18
YDR106W											0.23		0.01
YDR126W	0.99	0.49	0.99	0.98			0.94	0.44		0.27	0.97	0.28					0.97
YDR128W		0.18			0.21					0.35	0.40	0.58					0.65
YDR132C			0.74	0.61			0.53	0.12					0.01			0.29
YDR134C	0.46		0.36	0.41			0.27
YDR152W	0.38			0.54			0.26
YDR161W			0.42		0.49		0.44	0.75		0.53							0.37
YDR186C	0.89	0.22	0.61			0.62		0.74	0.58	0.89	0.32			0.40		0.40	0.69
YDR198C		0.21	0.48	0.31						0.16	0.16	0.39				0.37
YDR222W	0.40	0.15				0.29			0.74			0.42				0.37
YDR233C	0.42	0.07				0.31	0.22		0.68
YDR239C	0.65					0.76		0.15	0.93	0.68	0.29	0.82					0.14
YDR266C	0.41		0.62	0.49	0.30	0.38	0.33	0.23	0.46		0.36						0.23
YDR326C		0.40			0.34			0.08		0.11		0.28
YDR339C				0.82		0.62	0.40		0.98		0.29			0.13	0.23		0.21
YDR346C	0.41		0.64	0.53			0.48
YDR348C	0.37		0.63	0.25		0.26	0.57	0.14
YDR357C	0.42			0.72		0.55	0.75	0.12			0.13
YDR361C	0.91	0.42	0.42				0.52			0.26						0.72	0.39
YDR367W					0.16						0.34
YDR374C	0.50	0.14		0.46	0.17	0.63	0.68	0.08	0.63	0.43	0.50	0.52		0.47			0.68
YDR383C	0.43	0.11	0.51	0.39		0.46	0.26	0.31		0.21	0.29	0.32		0.23		0.66	0.50
YDR411C	0.36							0.11
YDR458C			0.91	0.32		0.80		0.26		0.26	0.51	0.20	0.06			0.28
YDR475C	0.58			0.41		0.45	0.22	0.09		0.18	0.19	0.22
YDR476C			0.42				0.60				0.46		0.04			0.48
YDR482C	0.62			0.68			0.41	0.20		0.43			0.04
YDR486C	0.37	0.07				0.27			0.45
YDR505C			0.39	0.49		0.32	0.31
YDR520C	0.50	0.06		0.60							0.29
YDR532C					0.28	0.65										0.32
YEL001C	0.58	0.07		0.43		0.76		0.16	0.42		0.15					0.40
YEL018W						0.41										0.36
YEL043W			0.57			0.26	0.34	0.08	0.43							0.41	0.26
YEL044W			0.60	0.66				0.08						0.27		0.29	0.43
YEL048C		0.10	0.37		0.16			0.08
YER004W					0.56	0.97	0.78	0.17	0.98			0.41			0.45	0.60
YER030W			0.53	0.55			0.30									0.53
YER033C				0.51			0.28			0.18
YER048W-A	0.73	0.59				0.49		0.12	0.56	0.10				0.11			0.28
YER049W		0.08					0.55	0.13				0.33
YER067W	0.45	0.17		0.26		0.48	0.53	0.13
YER071C	0.53	0.13	0.28		0.20	0.57	0.60				0.95	0.15			0.27	0.69
YER092W			0.40	0.26	0.22	0.58	0.53				0.24					0.43
YER113C		0.09				0.33	0.90		0.32			0.15				0.36
YER128W	0.39	0.08				0.40			0.69
YER139C	0.44		0.36	0.75		0.54		0.28	0.29	0.76	0.11	0.74					0.17
YER182W					0.79					0.12		0.20		0.15
YFL034W	0.50	0.08							0.69					0.17		0.31
YFL062W		0.09			0.18	0.82			0.93		0.38	0.80		0.19	0.25		0.50
YFR016C			0.41				0.81		0.54	0.16	0.24	0.31				0.64	0.41
YFR017C	0.48	0.15	0.37
YFR042W	0.53		0.65			0.57	0.38	0.20		0.17	0.60			0.10			0.28
YFR043C	0.56		0.59	0.65			0.53	0.50		0.38	0.55
YFR048W		0.07	0.62			0.33	0.31	0.13			0.15
YGL010W	0.49		0.61			0.30	0.38					0.25					0.18
YGL036W			0.92		0.62		0.54			0.18				0.15			0.49
YGL060W			0.85		0.84								0.63		0.27		0.47
YGL081W		0.24		0.48			0.76							0.20		0.70
YGL083W		0.17		0.29		0.62	0.30				0.50			0.22
YGL108C				0.27			0.76	0.08	0.57				0.01				0.15
YGL131C				0.32	0.42							0.18
YGL168W		0.07	0.32			0.32	0.21
YGL220W		0.13				0.70	0.45	0.34	0.85	0.13	0.12	0.48		0.30		0.78
YGL231C						0.25	0.43		0.42		0.20					0.27
YGL242C	0.69	0.21		0.36									0.04
YGR017W	0.40		0.48	0.65		0.65	0.41	0.21			0.64		0.02	0.13
YGR058W	0.53	0.10		0.64									0.01				0.17
YGR068C		0.17	0.28	0.47		0.34	0.72	0.23			0.47
YGR071C	0.37		0.60			0.32	0.67			0.27	0.24					0.37	0.15
YGR093W	0.53	0.44		0.52	0.93			0.08						0.73
YGR106C	0.41	0.17							0.69		0.22	0.32
YGR122W	0.47	0.13		0.32				0.10						0.15
YGR126W				0.55	0.28		0.49
YGR130C						0.41	0.78									0.45
YGR149W					0.82											0.42
YGR163W	0.80	0.10					0.81	0.18			0.60	0.60		0.58
YGR187C					0.45	0.80	0.45	0.08			0.19		0.02
YGR189C	0.83	0.15			0.40	0.43				0.18				0.58			0.67
YGR196C						0.58		0.30	0.70	0.56	0.36	0.82		0.23		0.45	0.66
YGR206W	0.44					0.70		0.09	0.32	0.11		0.26
YGR237C	0.64		0.63	0.82		0.40	0.45	0.11			0.12	0.44
YGR263C		0.33		0.35		0.25	0.27							0.11
YGR266W		0.34		0.29								0.20
YGR271C-A	0.83	0.32		0.92			0.51	0.60								0.40
YGR283C	0.34	0.08		0.68	0.75	0.35	0.44	0.88
YGR295C						0.67						0.19
YHL006C					0.32	0.24		0.35
YHL014C					0.29			0.19			0.32		0.01
YHL021C	0.36		0.53	0.64	0.32	0.70	0.54		0.48	0.17	0.36	0.16					0.22
YHL029C	0.43					0.36	0.20				0.13
YHL039W	0.65			0.49		0.24	0.47
YHL042W				0.34							0.18				0.50
YHR009C						0.39			0.58					0.19		0.44	0.14
YHR029C	0.94	0.55				0.90	0.99		0.97			0.57		0.98		0.41	0.89
YHR045W	0.74	0.11					0.29	0.14	0.40		0.34		0.06
YHR059W				0.91	0.21		0.39	0.22
YHR087W			0.32	0.33	0.30	0.52	0.45	0.09	0.50
YHR097C							0.71	0.19	0.69					0.12		0.38
YHR105W					0.30	0.75			0.48	0.42							0.26
YHR131C			0.31			0.55		0.11
YHR140W			0.28			0.67			0.80	0.17	0.43					0.80	0.69
YHR151C			0.48	0.64			0.50	0.12
YHR199C		0.10				0.53		0.08	0.82							0.32
YHR207C			0.40	0.73		0.44							0.01
YIL023C		0.58	0.98	0.97		0.87	0.32	0.56			0.17		0.15	0.26	0.50	0.93	0.95
YIL027C	0.72	0.41			0.16	0.34							0.02	0.10			0.19
YIL039W	0.52					0.27			0.38
YIL096C		0.28		0.85			0.75	0.43
YIL108W						0.40	0.26	0.14	0.82		0.18						0.14
YIL127C	0.92	0.55
YIL151C			0.56	0.72				0.32					0.21				0.18
YIL152W			0.67			0.41	0.77	0.14		0.23	0.38	0.18		0.55		0.90	0.64
YIL157C	0.74	0.61				0.58	0.86		0.36	0.40						0.50
YIL161W				0.51			0.60	0.11	0.40
YIR003W	0.46	0.07				0.68	0.29	0.15		0.13	0.16	0.57				0.42
YIR007W		0.33			0.52		0.43	0.21		0.25		0.23
YJL048C	0.61		0.39	0.66	0.34	0.24			0.37					0.24		0.48	0.22
YJL051W	0.75	0.07	0.70	0.69	0.43	0.82	0.68			0.29						0.51	0.20
YJL057C		0.26		0.70		0.83	0.83			0.19				0.46		0.44
YJL058C		0.12					0.42
YJL066C						0.79	0.62		0.93	0.40		0.52		0.37		0.30	0.39
YJL082W	0.47	0.14					0.23	0.39			0.19
YJL097W	0.68	0.18				0.39	0.27		0.41		0.46					0.64
YJL105W	0.50			0.89			0.30	0.18			0.11
YJL107C	0.46		0.67				0.24			0.10	0.29					0.43
YJL122W		0.21			0.71							0.17
YJL123C	0.66		0.32		0.23	0.34		0.31	0.47
YJL149W			0.31			0.29	0.44			0.09				0.12		0.30
YJL151C	0.41	0.09				0.50			0.51		0.23
YJL162C	0.46	0.27		0.93	0.60		0.80	0.13			0.50	0.41
YJL171C		0.22					0.75				0.27	0.54	0.11	0.10
YJL181W			0.40	0.83			0.66	0.08					0.01			0.26
YJL185C		0.09	0.53				0.41	0.20		0.25	0.12	0.38		0.22		0.71	0.31
YJL207C	0.35	0.60						0.11	0.88	0.11	0.36	0.70	0.01	0.24		0.93	0.23
YJR011C	0.38			0.88	0.26	0.41	0.61	0.32		0.09
YJR061W	0.77					0.58		0.09		0.12	0.15		0.01	0.17			0.17
YJR067C		0.14			0.75		0.26		0.53
YJR082C			0.63		0.71	0.46	0.70			0.17						0.39
YJR088C	0.50					0.38		0.09
YJR118C			0.51		0.20		0.40	0.28	0.30	0.21	0.27	0.18	0.01
YJR134C						0.50			0.36	0.20		0.30
YKL023W				0.42	0.25		0.22
YKL037W		0.23				0.61	0.22	0.10	0.33		0.41	0.16				0.49
YKL050C	0.81	0.26	0.47								0.54
YKL061W	0.42					0.44					0.39	0.27	0.02	0.14		0.48
YKL063C			0.57				0.77	0.48	0.62
YKL065C							0.26
YKL069W		0.10				0.36	0.40
YKL075C			0.60	0.74							0.13
YKL094W	0.86	0.23							0.71		0.66	0.22
YKL098W	0.64	0.60	0.44	0.70	0.32							0.29	0.01	0.58		0.33	0.15
YKL151C	0.74			0.51		0.32	0.25		0.47	0.24		0.30
YKL183W	0.81	0.55		0.68	0.83			0.88	0.49	0.70			0.01	0.11			0.42
YKL206C			0.39		0.54	0.75			0.48
YKR071C	0.53	0.20					0.81			0.10	0.47	0.33
YKR077W	0.41		0.49	0.73	0.26	0.40	0.77					0.17	0.02			0.31	0.16
YKR088C						0.29			0.50
YKR100C	0.64						0.31				0.11	0.15
YLL014W				0.52
YLL023C		0.63	0.83	0.79		0.75	0.55	0.73	0.73	0.30	0.88		0.03		0.63		0.42
YLL032C		0.73			0.23										0.36	0.56
YLR021W	0.72	0.07	0.60	0.75				0.33			0.48	0.35		0.26			0.34
YLR030W			0.54			0.37				0.12	0.13
YLR031W		0.51	0.52			0.24	0.27	0.07	0.38								0.14
YLR036C	0.76		0.92	0.97		0.41	0.81						0.02		0.49
YLR050C		0.09				0.52			0.97
YLR064W		0.16				0.77	0.27		0.68		0.50	0.46		0.16		0.62	0.15
YLR065C						0.70		0.16	0.81
YLR072W	0.54				0.19	0.56		0.08			0.15
YLR108C		0.08			0.66								0.02
YLR114C	0.64			0.69			0.26				0.13
YLR173W				0.66	0.70					0.32
YLR177W				0.71			0.88
YLR187W	0.68	0.09
YLR190W			0.33	0.31	0.22	0.60											0.14
YLR196W		0.23	0.61	0.51		0.66	0.34		0.78		0.61	0.58	0.02	0.18		0.77	0.61
YLR199C		0.14				0.79				0.18	0.31			0.12			0.68
YLR218C	0.50		0.30			0.30		0.33		0.16						0.30
YLR241W						0.57			0.40	0.09							0.14
YLR253W	0.60	0.26	0.62			0.37				0.20	0.47					0.44
YLR254C				0.45		0.65					0.28	0.75	0.02
YLR257W	0.80	0.31				0.37	0.24		0.34								0.21
YLR267W			0.74			0.80	0.80	0.51		0.10	0.31	0.18		0.66		0.42	0.43
YLR287C	0.52			0.37	0.58					0.39
YLR315W				0.85			0.51	0.16			0.19
YLR326W		0.11			0.29				0.93	0.49	0.77	0.94		0.17			0.33
YLR352W			0.33			0.27	0.48			0.12				0.13		0.32	0.14
YLR376C					0.61	0.58					0.25
YLR392C			0.31	0.32			0.48	0.12		0.22
YLR407W	0.64	0.11	0.57	0.68			0.33			0.10	0.16	0.21
YLR408C			0.53	0.34		0.34		0.08				0.20					0.57
YLR413W		0.11				0.46	0.52	0.14	0.73
YLR426W	0.44	0.07					0.52	0.30		0.21	0.24	0.18		0.13		0.29
YLR437C	0.44	0.16	0.42				0.38				0.22					0.40
YLR446W		0.29		0.97	0.22						0.46	0.41
YLR455W	0.64		0.33	0.72		0.31	0.57	0.16						0.15
YML011C			0.92	0.57						0.19		0.83	0.02	0.51		0.67	0.85
YML018C		0.08				0.50	0.38	0.27	0.45	0.17		0.20					0.14
YML030W						0.46	0.37	0.08			0.17
YML036W		0.93		0.93	0.23		0.92	0.59	0.98	0.48	0.81		0.07			0.97
YML072C		0.10					0.39			0.10			0.02	0.18		0.26
YML101C	0.48		0.42			0.63				0.12		0.24
YML119W			0.65		0.36	0.67		0.25	0.64	0.35							0.28
YMR003W		0.13			0.21		0.82		0.39	0.45			0.04
YMR010W						0.39	0.32				0.17
YMR031C	0.58		0.71	0.41		0.55	0.24				0.29	0.20		0.11		0.55
YMR067C			0.59			0.33	0.25									0.32	0.17
YMR071C		0.07				0.30	0.36	0.12	0.73			0.35
YMR074C		0.08	0.99	0.90	0.43	0.91	0.98		0.98	0.40	0.97	0.36	0.57	0.79		0.99	0.95
YMR075W	0.42		0.76	0.58		0.47	0.27					0.39					0.34
YMR086W	0.84	0.26	0.51			0.36	0.63	0.12		0.24		0.32		0.27		0.36	0.30
YMR099C	0.67	0.10	0.57				0.36			0.35
YMR102C	0.45		0.61				0.30	0.09			0.13
YMR110C			0.48			0.25	0.34				0.13
YMR111C	0.49	0.09	0.67	0.25			0.51	0.09
YMR122W-A	0.42	0.09	0.52					0.39					0.02			0.42
YMR124W	0.45		0.48	0.50		0.33	0.33	0.14			0.45	0.25	0.01	0.17			0.19
YMR144W		0.46	0.65		0.16	0.81	0.49		0.68		0.44	0.37				0.79
YMR163C						0.67			0.82	0.13
YMR191W	0.53		0.49			0.66	0.42	0.27			0.39			0.14			0.24
YMR221C		0.06				0.90			0.94		0.45	0.43				0.38
YMR233W			0.33			0.66	0.57	0.22			0.19
YMR253C	0.66	0.10				0.63		0.27		0.14	0.18	0.15				0.32
YMR258C						0.25		0.08
YMR259C	0.38	0.06	0.67	0.72		0.25					0.36	0.38				0.42
YMR310C	0.80		0.82	0.76			0.54			0.53	0.69	0.52		0.47		0.40	0.17
YNL022C				0.75										0.44
YNL024C		0.09	0.48	0.46	0.16	0.55	0.24	0.31				0.32		0.14		0.65
YNL035C	0.66	0.24				0.38	0.44	0.09			0.27	0.20	0.04	0.20		0.65
YNL046W			0.68		0.45	0.29	0.70	0.09	0.48	0.36		0.78	0.04	0.77		0.32	0.77
YNL056W	0.37					0.40
YNL087W		0.26		0.37			0.58			0.19	0.19					0.59
YNL092W		0.25	0.74				0.93		0.59				0.05	0.88		0.94	0.81
YNL095C					0.26	0.53		0.19	0.40					0.14			0.31
YNL122C									0.40	0.14
YNL146W					0.45				0.40					0.10		0.92
YNL149C	0.63	0.08					0.28					0.16
YNL155W	0.63	0.22		0.50			0.50	0.20			0.14	0.75		0.79
YNL157W		0.69	0.41				0.70							0.39		0.92	0.54
YNL181W		0.08				0.38	0.26	0.14	0.60							0.58
YNL212W						0.54	0.35			0.10		0.16
YNL215W	0.61		0.57		0.79	0.91				0.42		0.36		0.15			0.32
YNL224C				0.77		0.77		0.25	0.90	0.89	0.15	0.61				0.95	0.88
YNL260C	0.65		0.70	0.63	0.47	0.65	0.26	0.32			0.52	0.25
YNL279W	0.44	0.07				0.25	0.50	0.37						0.13
YNL300W	0.48			0.39		0.37	0.25	0.07
YNL310C	0.52	0.07				0.57	0.40	0.23		0.12							0.19
YNL321W			0.35						0.63				0.01
YNR004W		0.07	0.44	0.65	0.32		0.71	0.29			0.15			0.12		0.36
YNR009W	0.60		0.65			0.45					0.44	0.30		0.18		0.46
YNR014W	0.45	0.51	0.61	0.68				0.13			0.35					0.29
YNR020C			0.49
YNR021W		0.13							0.57
YNR024W				0.83							0.12
YNR065C	0.75			0.64					0.41	0.50	0.51	0.71		0.15		0.42	0.25
YOL070C	0.58		0.89				0.52	0.26		0.11	0.27	0.30	0.01			0.49
YOL087C		0.83		0.92					0.42			0.80		0.23		0.88
YOL098C			0.43			0.60	0.51	0.09			0.49	0.20				0.26	0.37
YOL107W	0.66	0.07				0.55			0.45		0.16	0.22
YOL131W			0.56	0.68		0.25	0.32	0.16			0.20					0.28
YOL137W						0.38
YOR006C	0.91		0.96		0.67	0.93		0.16			0.92	0.63				0.91
YOR007C	0.36		0.57			0.95	0.22		0.29					0.35
YOR042W							0.77			0.38
YOR044W	0.54	0.16				0.80						0.25
YOR051C						0.79	0.36		0.36	0.13	0.25			0.17
YOR059C				0.29			0.32		0.29		0.36	0.53				0.55
YOR066W	0.36		0.61	0.68			0.68									0.27	0.23
YOR086C					0.20		0.32				0.26					0.48	0.15
YOR091W			0.37		0.43		0.58				0.19			0.11		0.26
YOR111W				0.35		0.34	0.61	0.37	0.63	0.16		0.45	0.01	0.27			0.18
YOR112W						0.63	0.56		0.52
YOR141C		0.07	0.43							0.18	0.26	0.21				0.32
YOR164C	0.48					0.52	0.34
YOR173W	0.46		0.45	0.46
YOR175C	0.60	0.08				0.32	0.23		0.80		0.35	0.22	0.01			0.29
YOR189W	0.79		0.46	0.74									0.02			0.45
YOR220W			0.69	0.29		0.39	0.38									0.57	0.14
YOR227W							0.33	0.17			0.16	0.19
YOR252W		0.08			0.37	0.74	0.54	0.09	0.42
YOR264W			0.85					0.17			0.35
YOR289W		0.11			0.71		0.20
YOR311C						0.55			0.41				0.01
YOR342C			0.92		0.62	0.41	0.31	0.14			0.12
YOR352W	0.94	0.83				0.27	0.99	0.08	0.95	0.40	0.35	0.72
YPL005W				0.30		0.38					0.12
YPL009C			0.81	0.38			0.67	0.31		0.09	0.34		0.02	0.11
YPL030W				0.41		0.62		0.71	0.61	0.18		0.82	0.05			0.83	0.20
YPL064C	0.50			0.60			0.29
YPL066W				0.58			0.21									0.42
YPL077C			0.93	0.45	0.28	0.24	0.59		0.46			0.68					0.83
YPL105C					0.83						0.61
YPL109C				0.74	0.75		0.91							0.11			0.29
YPL137C	0.78	0.36	0.79		0.29	0.90						0.32				0.77
YPL144W		0.13	0.41	0.88				0.20			0.44	0.18		0.27
YPL162C						0.96			0.98		0.57					0.56
YPL165C			0.63	0.72				0.09
YPL166W						0.87	0.54		0.32	0.13				0.13		0.52	0.29
YPL183C			0.85	0.93			0.77
YPL189C-A	0.44	0.75				0.38	0.85	0.11	0.93	0.33						0.79
YPL199C			0.74	0.61	0.68	0.76	0.97			0.55	0.80	0.73		0.29	0.30
YPL206C	0.62			0.44		0.34	0.23			0.20		0.19		0.42		0.47	0.14
YPL207W	0.60	0.07			0.19		0.34	0.08			0.20		0.01
YPL222W				0.67		0.52	0.35
YPL247C						0.76	0.31	0.12	0.69		0.17						0.20
YPL263C				0.32		0.45	0.34				0.29
YPL267W			0.68			0.80	0.29	0.52		0.12	0.30					0.66
YPR045C				0.56			0.65	0.15			0.13
YPR063C	0.67	0.23				0.35			0.63		0.39						0.15
YPR071W	0.85					0.78		0.17	0.93	0.16	0.21
YPR114W			0.75						0.29				0.01	0.14
YPR116W		0.22			0.48	0.48							0.02	0.13			0.29
YPR148C		0.10	0.43			0.29	0.34		0.71	0.19		0.19		0.33			0.57
YPR152C				0.67	0.16		0.28	0.08		0.13				0.12
YPR153W	0.98	0.57	0.91	0.85		0.98	0.96	0.62	0.98		0.98					0.89	0.72
YPR174C			0.49	0.36	0.23	0.41	0.51	0.41			0.17

5. Conclusions

We proposed a novel function–function correlated multi-label (FCML) protein function prediction approach and showed promising performance, outperforming other related approaches. Unlike most existing approaches, which divide protein function prediction into multiple separate tasks and fundamentally make predictions one function at a time, the proposed FCML approach treats all the biological functions as a single correlated prediction target and predicts protein functions via an integral procedure that leverages the correlations among the functional categories. By formulating protein function prediction as a multi-label classification problem, we use the Green's function over a graph to solve the problem efficiently. The Green's function approach takes advantage of both the full topology of the interaction network, toward global optimization, and its local structures, so that deficiencies of existing approaches can be overcome. In addition, we proposed an adaptive decision boundary method to deal with the unbalanced distribution of protein annotation data, and we quantified the statistical confidence of predicted functions to facilitate post-processing in proteomic analysis.

6. Appendix

6.1. Predicted putative functions for unannotated proteins and corresponding statistical confidence

We apply the proposed FCML approach to the BioGRID data and predict functions for the unannotated proteins, using the MIPS Funcat annotation scheme. A list of all putative predictions for level-1 functions in the MIPS Funcat scheme is provided in Table 3. The nonempty cells indicate the predicted putative functions of the corresponding protein. For example, protein “YAL034C” is predicted to have functions “16” (Protein with Binding Function) and “18” (Regulation of Metabolism and Protein Function).

In addition to the predicted putative functions, we also report the corresponding statistical confidence values. For example, we annotate protein “YNR024W” with function “11” (Transcription) at a statistical confidence of 0.83 and with function “32” (Cell Rescue, Defense and Virulence) at a statistical confidence of 0.12. That is, our experimental results suggest that protein “YNR024W” is more likely to be annotated with function “11” than with function “32.”
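For readers who want to process Table 3 programmatically, the following is a minimal sketch of how one row could be parsed and its predicted functions ranked by confidence. It assumes the table is available as tab-separated text with empty cells for functions that are not predicted, as in the layout above. The helper names `parse_row` and `rank_functions` are hypothetical and not part of the FCML method.

```python
# Column order of the level-1 MIPS Funcat categories in Table 3.
FUNCAT_COLUMNS = [
    "1", "2", "10", "11", "12", "14", "16", "18", "20",
    "30", "32", "34", "38", "40", "41", "42", "43",
]


def parse_row(line):
    """Parse one tab-separated Table 3 row into (protein, {function: confidence}).

    Empty cells mean the function is not predicted for that protein;
    trailing empty cells may be omitted from the line.
    """
    cells = line.rstrip("\n").split("\t")
    protein, values = cells[0], cells[1:]
    confidences = {}
    for func, cell in zip(FUNCAT_COLUMNS, values):
        if cell.strip():
            confidences[func] = float(cell)
    return protein, confidences


def rank_functions(confidences):
    """Return (function, confidence) pairs sorted from most to least confident."""
    return sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)


# The two rows discussed in the text, copied from Table 3.
row_ynr024w = "YNR024W\t\t\t\t0.83\t\t\t\t\t\t\t0.12"
row_yal034c = "YAL034C\t\t\t\t\t\t\t0.36\t0.12"

for row in (row_ynr024w, row_yal034c):
    protein, conf = parse_row(row)
    print(protein, rank_functions(conf))
```

For “YNR024W,” the ranking places function “11” ahead of function “32,” matching the reading of the table given above.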

Footnotes

Author Disclosure Statement

The authors declare that no competing financial interests exist.

References

Ashburner, Ball, Blake, et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25:25.

Chua, Sung, Wong. 2006. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 22:1623–1630.

Chung. 1997. Spectral Graph Theory. American Mathematical Society: Providence, RI.

Deane, Salwinski, Xenarios, Eisenberg. 2002. Protein interactions: Two methods for assessment of the reliability of high throughput observations. Molecular & Cellular Proteomics, 1:349–356.

Ding, Simon, Jin, et al. 2007. A learning framework using Green's function and kernel regularization with application to recommender system. In Proc. of ACM SIGKDD, 2007; 260–269.

Edgar, Domrachev, Lash. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30:207.

Giot, Bader, Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science, 302:1727–1736.

Harbison, Gordon, Lee, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature, 431:99–104.

Hastie, Tibshirani. 1998. Classification by pairwise coupling. Annals of Statistics, 451–471.

Hishigaki, Nakai, Ono, et al. 2001. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast, 18:523–531.

Ho, Gruhler, Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183.

Karaoz, Murali, Letovsky, et al. 2004. Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl. Acad. Sci. USA, 101:2888–2893.

Klein, Randić. 1993. Resistance distance. J. Math. Chem., 12:81–95.

Lee, Tu, Deng, et al. 2006. Diffusion kernel-based logistic regression models for protein function prediction. Omics, 10:40–55.

Mewes, Heumann, Kaps, et al. 1999. MIPS: a database for genomes and protein sequences. Nucleic Acids Res., 27:44.

Nabieva, Jim, Agarwal, et al. 2005. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21:302–310.

Pei, Zhang. 2005. A topological measurement for weighted protein interaction network. In 2005 IEEE Computational Systems Bioinformatics Conference, 268–278.

Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers.

Schwikowski, Uetz, Fields. 2000. A network of protein-protein interactions in yeast. Nat. Biotechnol., 18:1257–1261.

Sharan, Ulitsky, Shamir. 2007. Network-based prediction of protein function. Mol. System Biol., 3.

Stark, Breitkreutz, Reguly, et al. 2006. BioGRID: a general repository for interaction datasets. Nucleic Acids Res., 34:D535.

Tong, Lesage, Bader, et al. 2004. Global mapping of the yeast genetic interaction network. Science, 303:808–813.

Vazquez, Flammini, Maritan, et al. 2003. Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol., 21:697–700.

Von Mering, Krause, Snel, et al. 2002. Comparative assessment of large-scale data sets of protein–protein interactions. Nature, 417:399–403.

Wang, Ding, Huang. 2010a. Directed graph learning via high-order co-linkage analysis. In Proc. of ECML/PKDD, 2010; 451–466.

Wang, Ding, Huang. 2010b. Multi-label classification: Inconsistency and class balanced k-nearest neighbor. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Wang, Ding, Huang. 2010c. Multi-label linear discriminant analysis. In Proc. of ECCV, 2010; 126–139.

Wang, Huang, Ding. 2009. Image annotation using multi-label correlated Green's function. In Proc. of IEEE ICCV, 2009; 2029–2034.

Wang, Huang, Ding. 2010d. Multi-label feature transform for image classifications. In Proc. of ECCV, 2010; 793–806.

Wang, Huang, Ding. 2011. Image annotation using bi-relational graph of images and semantic labels. In Proc. of IEEE CVPR, 2011; 793–800.