Predicting Protein–Protein Interactions from Multimodal Biological Data Sources via Nonnegative Matrix Tri-Factorization

Abstract

Protein interactions are central to all the biological processes and structural scaffolds in living organisms, because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Several high-throughput methods, for example, yeast two-hybrid system and mass spectrometry method, can help determine protein interactions, which, however, suffer from high false-positive rates. Moreover, many protein interactions predicted by one method are not supported by another. Therefore, computational methods are necessary and crucial to complete the interactome expeditiously. In this work, we formulate the problem of predicting protein interactions from a new mathematical perspective—sparse matrix completion, and propose a novel nonnegative matrix factorization (NMF)-based matrix completion approach to predict new protein interactions from existing protein interaction networks. Through using manifold regularization, we further develop our method to integrate different biological data sources, such as protein sequences, gene expressions, protein structure information, etc. Extensive experimental results on four species, Saccharomyces cerevisiae, Drosophila melanogaster, Homo sapiens, and Caenorhabditis elegans, have shown that our new methods outperform related state-of-the-art protein interaction prediction methods.

1. Introduction

Proteins play an essential role in nearly all cellular functions, such as promoting biochemical reactions and composing cellular structures. The multiplicity of functions that proteins execute in most cellular processes and biochemical events is attributed to their interactions with other proteins. As a result, it is critical to understand protein–protein interactions (PPIs) in both scientific research and practical applications such as new drug development. A variety of techniques are now available to experimental biologists for discovering protein–protein interactions, such as yeast two-hybrid systems (Ito et al., 2000), mass spectrometry (Ho et al., 2002), and many others as surveyed in Shoemaker and Panchenko (2007a). Although these high-throughput experimental methods have accumulated a large amount of data, interactomes of many organisms are far from complete (Shoemaker and Panchenko, 2007a). The low interaction coverage, along with the experimental biases toward certain protein types and cellular localizations reported by most experimental techniques, call for the development of computational methods that are able to predict more reliable putative PPI for further experimental screening (Shoemaker and Panchenko, 2007b).

Computational methods for predicting protein interaction partners in early ages use genomic or protein context to infer functional associations (Shoemaker and Panchenko, 2007b). It is assumed in gene neighbor and gene cluster methods (Bowers et al., 2004b; Ermolaeva et al., 2001; Moreno-Hagelsieb and Collado-Vides, 2002; Strong et al., 2003; Salgado et al., 2000) that genes with closely related functions that encode potentially interacting proteins are often transcribed as a single unit—an operon. The phylogenetic profile methods (Bowers et al., 2004a, 2005; Pagel et al., 2004; Barker and Pagel, 2005) are based on the hypothesis that functionally linked and potentially interacting nonhomologous proteins coevolve and have orthologs in the same subset of fully sequenced organisms. The Rosetta Stone approaches (Marcotte et al., 1999; Enright et al., 1999; Marcotte and Marcotte, 2002; Yanai et al., 2001) infers protein interactions from protein sequences in different genomes. Sequence-based coevolution methods (Goh et al., 2000; Jothi et al., 2006; Ramani and Marcotte, 2003) take the perspective that interacting proteins very often coevolve so that changes in one protein leading to the loss of function or interaction should be compensated by the correlated changes in another protein. More in silico methods to predict protein–protein interactions are surveyed in Shoemaker and Panchenko (2007b).

Recently, machine-learning techniques, such as Bayesian networks (Jansen et al., 2003), decision trees (Zhang et al., 2004; Chen and Liu, 2005), random forest (Qi et al., 2005; Chen and Jeong, 2009), and support vector machines (SVM) with different kernels (Martin et al., 2005; Ben-Hur and Noble, 2005; Shen et al., 2007; Martial et al., 2010), have been successfully applied to predict PPIs. These methods used various data sources to train a classifier to distinguish positive examples of truly interacting protein pairs from the negative examples of noninteracting pairs. However, all these methods suffer from a fundamental difficulty—how to choose the negative samples. Compared to the obvious choice of positive samples from truly interacting protein pairs, negative samples are typically hard to be chosen. First, noninteracting protein pairs refer to those currently without experimental or computational evidence to support a physical interaction or functional association. In reality, however, such protein pairs could interact. Second, the number of noninteracting protein pairs is much larger than the number of the interacting ones, therefore, unbalanced training data often cause skewed prediction models that lead to unsatisfactory prediction results. In most existing classification-based methods, heuristic ways have been employed to tackle these problems.

With the above recognitions, instead of considering PPI prediction as a classification problem, we approach it from a different perspective using matrix completion, which is an important mathematical topic to address the problem to recover a matrix from what appears to be incomplete or even corrupted (Candès and Plan, 2009). An illustrative example to predict missing PPIs using matrix completion is given in Figure 1. Given a protein interaction network, a graph can be naturally constructed as shown in the top left panel of Figure 1, with vertices representing proteins and edges representing known PPIs. The adjacency matrix of the resulted graph is shown in the bottom left panel of Figure 1, where “1” indicates a known PPI and “?” indicates that there currently exists no experimental evidence to support a physical interaction between the corresponding protein pair. Our task is to identify putative interacting protein pairs from those marked as “?”. Using a matrix completion algorithm, the unknown entries of the matrix is filled as in the bottom right panel of Figure 1, where the value filled for an unknown entry indicates how likely it should be recovered. As a result, a list of ranking scores are produced for the noninteracting protein pairs, and prediction can be performed by picking up the top-ranking ones as putative PPIs. As shown in the top right panel of Figure 1, the top three ranked unknown entries are predicted as putative PPIs, which are marked as blue dash lines and correspond to the underlined entries in the filled matrix in the bottom right panel of Figure 1. Obviously, matrix completion only uses truly interacting protein pairs without requiring negative training samples, thus the difficulty in existing classification-based PPI prediction methods is circumvented.

FIG. 1.

Prediction of putative protein–protein interactions (PPIs) can be performed as a process of matrix completion. Top left: Original PPI graph. Bottom left: Adjacency matrix of the original PPI graph, where “1” indicates a known PPI and “?” indicates the protein pair has no experimental support to have a physical link. Bottom right: A process of matrix completion. Top right: Predict missing PPIs by taking the entries with top 3 ranking scores in the filled matrix.

Another important property of protein interaction networks makes matrix completion of particular use in predicting PPIs. Recent studies (Valdar and Thornton, 2001; Teichmann, 2002; Nooren and Thornton, 2003; Aloy et al., 2003; Littler and Hubbard, 2005; Panchenko et al., 2005; Dai and Prasad, 2010) have confirmed that, as opposed to the huge number of protein interactions, the number of interaction types or modes is limited and rather small, (i.e., the adjacency matrices of PPI graphs by nature are low-rank matrices). It is also proved in mathematics (Candès and Recht, 2009) that if the unknown matrix is known to have low rank or approximately low rank, accurate and even exact matrix recovery is possible. Therefore, using matrix completion methods, putative protein interactions can be inferred from incomplete, or even noisy, input PPI graphs.

In this article, we propose a novel nonnegative matrix tri-factorization (NMTF)-based (Lee and Seung, 1999; Ding et al., 2006b) matrix completion approach to predict candidate protein–protein interactions. NMTF focuses on the analysis of data matrices whose elements are nonnegative, such as the adjacency matrix of a PPI graph, and decomposes the input matrix into three nonnegative factor matrices that approximate the input matrix by a low-rank nonnegative representation (Ding et al., 2005, 2006a), which, due to its mathematical elegance, has been widely studied (Wang et al., 2011d) and applied to solve a variety of real-world problems (Wang et al., 2011b, 2011c). We first employ NMTF approach to predict putative protein interactions, which only makes use of the PPI network data. After that, we extend the standard NMTF framework by adding manifold regularization (Gu and Zhou, 2009), such that additional biological data, for example, protein sequences data, protein structures information, and gene expressions, can be incorporated to achieve enhanced PPI prediction performance. Our extension for manifold regularization is different from existing works (Cai et al., 2008; Gu and Zhou, 2009) where we emphasize the orthogonality on the factor matrices, which avoids degenerate solutions and makes our method more robust to parameter selection. Extensive empirical evaluations on four different genomic species have shown encouraging performance, which demonstrate the effectiveness of the proposed methods.

2. Methods

First we briefly formalize the problem of PPI prediction. Given a PPI network, we may construct a graph G = (V, Ω, X), with V corresponding to n = |V| proteins and Ω ⊆ V × V corresponding to known PPIs. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${{\bf X}} \in \{0 , 1 \} ^{n \times n}$$ \end{document} is the adjacency matrix, such that X_ij = 1 if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$(i , j) \in \Omega$$ \end{document} (i.e., there exists a PPI between protein i and protein j), and X_ij = 0 otherwise. Our task is to identify a subset of noninteracting protein pairs M ⊆ (V × V)\Ω, which tend to interact and can be served as potential targets for further experimental screening.

Throughout this article, we denote matrices as boldface, uppercase characters. Given a matrix M, its Frobenius norm and trace are denoted as ∥M∥ and tr (M) respectively. For convenience, given a index set M of a matrix X, we define X_M as following: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}({\bf X}_M) _{ij} = \begin{cases}{\bf X}_{ij} , \quad \ \forall \ (i , j) \in M , \\ 0 \quad \qquad {\rm otherwise.}\end{cases} \tag{1}\end{align*} \end{document}

2.1. Predict new protein interactions via PPI networks

Objective to predict PPIs We first predict protein interactions only using PPI network data. Considering the protein interaction prediction as a matrix completion problem, where the input PPI adjacency matrix X contains missing entries (pairs of proteins whose interactions are yet to be determined), we wish to predict Y, which has full entries (i.e., every element of Y is filled with computed values). Y completes X in the sense that Y_Ω = X_Ω, or more explicitly, Y_ij = X_ij, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\forall (i , j) \in \Omega$$ \end{document} , where Ω denotes the set of edges where the input adjacency matrix X has known values (the set of interacting edges). Mathematically, the PPI prediction problem can be solved as the following optimization problem: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min \limits_{\bf {Y}} \ J_1 = \ \mid\mid {\bf X} - {\bf Y} \mid\mid _ \Omega^2 = \sum \limits_{(i , j) \in \Omega} ({\bf X} - {\bf Y}) _{ij}^2. \tag{2}\end{align*} \end{document}

Due to the low-rank nature of the adjacency matrix as discussed earlier, the completed matrix Y can be factorized and written as Y = HSH^T, where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf H} \in {\mathbb R}_{+} ^{n \times k}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf S} \in {\mathbb R}_{+} ^{k \times k}$$ \end{document} are the factor matrices with nonnegative elements. As a result, Y = HSH^T can be seen as a low-rank representation of the input matrix X with rank of k ≪ n. Thus, we can rewrite Equation (2) as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min \limits_{{\bf H} \geq 0 , {\bf S} \geq 0} \ J_2 = \mid\mid {\bf X} - {\bf HSH}^T \mid\mid _ \Omega^2. \tag{3}\end{align*} \end{document}

Note that, although there exist other low-rank matrix approximation methods, for example, singular value decomposition (SVD), using NMTF as in Equation (3) to constrain the factor matrices H and S to be nonnegative is a natural choice because all the entries of the input PPI adjacency matrix X are positive by definition. Moreover, because of the clustering interpretation of NMTF (Ding et al., 2005, 2006b), other biological data sources can be easily incorporated via manifold regularization as introduced later.

The solution algorithm Different from standard NMTF-based objectives (as in Lee and Seung, 1999; Ding et al., 2005, 2006b; Wang et al., 2011a), which are defined over the whole input nonnegative matrix, the objective in Equation (3) for PPI prediction is defined over a subset of the entries that correspond to known PPIs. Therefore, the solution algorithms to standard NMTF cannot be directly applied to solve Equation (3). To this end, we present an iterative algorithm in Algorithm 1 to solve Equation (3).

The main step of Algorithm 1 is step 4, which solves a symmetric NMF problem (Wang et al., 2011a). Following our earlier publication in Wang et al. (2011a), Equation (5) can be solved by an iterative algorithm with the following updating rules: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\bf H} _ {jk} \leftarrow {\bf H} _ {jk} {\left[\frac {{\bf (ZHS)} _ {jk}} {({\bf HSH} ^T {\bf HS}) _ {jk}} \right] ^ {1 / 4}} , \qquad {\bf S} _ {ik} \leftarrow {\bf S} _ {ik} \frac {({\bf H} ^T {\bf ZH}) _ {ik}} {({\bf H} ^T {\bf HSH} ^T {\bf H}) _ {ik}} , \tag {4} \end{align*} \end{document}

where the superscripts are removed for notation brevity. The correctness and convergence of the updating rules in Equation (4) can be rigorously proved as detailed in Wang et al. (2011a). Apparently, Algorithm 1 converts the difficult problem to solve Equation (3) as a series of standard NMTF problems, for which existing algorithms, such as the updating rules in Equation (4), can be used.

Solving Equation (3) by Algorithm 1 for matrix completion, our NMTF approach to predict PPIs is proposed.

The convergence of the solution algorithm Now we prove the following theorem that guarantees the convergence of Algorithm 1.

Algorithm 1: Algorithm to solve Eq. (3).
Input: Input PPI adjacency matrix X;
Index set of known PPIs Ω.
begin
1. t = 0;
2. Initialize Z = X_Ω;
while not converge do
3. t = t + 1;
4. Solve
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align} \arg\min_{{\bf H} \geq 0 \, {\bf S} \geq 0} \mid \mid {\bf Z}^{(t - 1)} - {\bf H}^{(t)}{\bf S}^{(t)} {{(\bf H)}^{(t)}}^T \mid \mid ^2 \tag{5} \end{align} \end{document}
to obtain H^(t) and S^(t);
5. Compute Y^(t) = H^(t)S^(t) (H^(t))^T;
6. Compute \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf Z}^{(t)} = {\bf X}_ \Omega + {\bf Y}^{(t)}_M$$ \end{document} ;
end
end
Output: Output matrix with filled missing entries Y.

Theorem 1. Algorithm 1 monotonically decreases the objective value of J₁ in Equation (2) in each iteration.

Proof. First, by definition we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf Y}^{(t)} = {\bf Y}_ \Omega^{(t)} + {\bf Y}^{(t)}_M ,$$ \end{document} thus \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mid \mid {\bf X} - {\bf Y}^{(t)} \mid \mid ^2_ \Omega = \mid \mid ({\bf X}_ \Omega + {\bf Y}^{(t)}_M) - {\bf Y}^{(t)} \mid \mid ^2. \tag{6}\end{align*} \end{document}

Then because Algorithm 1 can be summarized as computing \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min_{{\bf Y}^{(t + 1)}} \mid \mid ({\bf X}_ \Omega + {\bf Y}_M^{(t)}) - {\bf Y}^{(t + 1)} \mid \mid ^2 , \ \ t = 0 , 1 , {\ldots} \,, \; {\bf Y}_M^{(0)} = 0 , \tag{7}\end{align*} \end{document}

we can derive \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mid \mid ({\bf X} + {\bf Y}^{(t)}_M) - {\bf Y}^{(t)} \mid \mid ^2 \geq \mid \mid ({\bf X} + {\bf Y}_M^{(t)}) - {\bf Y}^{(t + 1)} \mid \mid ^2. \tag{8}\end{align*} \end{document}

Again, by definition we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf Y}^{(t + 1)} = {\bf Y}^{(t + 1)}_ \Omega + {\bf Y}^{(t + 1)}_M$$ \end{document} , thus \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} \mid \mid ({\bf X} + {\bf Y}^{(t)}_M) - {\bf Y}^{(t + 1)} \mid \mid ^2 & = \mid \mid ({\bf X} - {\bf Y}^{(t + 1)}) _ \Omega + ({\bf Y}^{(t)} - {\bf Y}^{(t + 1)}) _M \mid \mid ^2 & (9) \\ & = \mid \mid {\bf X} - {\bf Y}^{(t + 1)} \mid \mid ^2_ \Omega + \mid \mid {\bf Y}^{(t)} - {\bf Y}^{(t + 1)} \mid \mid ^2_M \geq \mid \mid {\bf X} - {\bf Y}^{(t + 1)} \mid \mid ^2_ \Omega\end{align*} \end{document}

Combining Equations (6), (8), and (9), we obtain \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\mid \mid {\bf X} - {\bf Y}^{(t)} \mid \mid ^2_ \Omega \geq \mid \mid {\bf X} - {\bf Y}^{(t + 1)} \mid \mid ^2_ \Omega \quad {\rm for} \ t = 0 , 1 , 2 , 3 , {\ldots} \,, \tag{10}\end{align*} \end{document}

which proves the theorem. ▪

As in Equation (10), the approximation errors, that is, the objective value J₁ in Equation (2), goes down monotonically but remains bigger than zero, i.e., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\mid \mid {\bf X} - {\bf Y}^{(0)} \mid \mid ^2_ \Omega \geq \mid \mid {\bf X} - {\bf Y}^{(1)} \mid \mid ^2_ \Omega \geq \mid \mid {\bf X} - {\bf Y}^{(2)} \mid \mid ^2_ \Omega \geq \cdots \geq 0$$ \end{document} , therefore Theorem 1 guarantees the convergence of the algorithm.

2.2. Predict new protein interactions from multimodal biological data

In the last subsection, we introduced how to infer putative protein interactions using only the PPI network data, while in practice we may have other biological data as well, such as protein sequence data (Martin et al., 2005; Shen et al., 2007; Martial et al., 2010), 3D protein structures (Ben-Hur and Noble, 2005; Chen and Liu, 2005; Qiu et al., 2007), and so on. To exploit the useful information contained in the biological data from other sources, in this subsection we further develop the proposed NMTF-based matrix completion approach.

An important reason for the popularity of NMTF in statistical learning lies in its close connection to k-means clustering (Ding et al., 2005, 2006b). Specifically, given a symmetric nonnegative input matrix W, the factor matrix H can be seen as the clustering indications of the vertices (Luo et al., 2009). Therefore, if we have biological data other than PPI networks appearing in form of pairwise similarity, we can incorporate them through manifold regularization (Chung, 1997; Shi and Malik, 2000). Specifically, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf W}_{(k)} (0 \leq k \leq K)$$ \end{document} be a set of pairwise similarities constructed from different biological data, an integrated similarity among proteins can be constructed as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf W} = \sum \nolimits_k \eta_k{\bf W}_{(k)} \ {(\eta_k \geq 0, \sum \nolimits_k \eta_k = 1)}$$ \end{document} where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\eta_k$$ \end{document} are parameters to balance the data from different sources. We further develop the objective in Equation (3) as follows: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}\min \limits_{{\bf H} \geq 0 , {\bf S} \geq 0} J_4 = \mid \mid {\bf X} - {\bf HSH}^T \mid \mid _ \Omega^2 + 2 \lambda {\bf tr} ({\bf H}^T ({\bf D} - {\bf W}) {\bf H)} , \qquad s.t. \ {\bf H}^T{\bf D}{\bf H} = I , \tag{11}\end{align*} \end{document}

where D is a diagonal matrix whose diagonal entries D_ii = ∑_j W_ij are the degree of the corresponding data points on W, and λ is a parameter to balance the relative importance of the regularization term, which is empirically selected as λ = 0.01 in all our experimental evaluations. Because H can be seen as the “soft” clustering labels (Ding et al., 2005), the second term in Equation (11) enforces the smoothness over the variation of the clustering labels with respect to the underlying manifold described by W (Cai et al., 2008; Gu and Zhou, 2009), by which additional biological data sources are incorporated.

Equation (11) takes a similar form to Equation (3), which, again, is not a standard NMTF problem. We use Algorithm 1 to solve it by replacing step 4 to minimize the following objective: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}J_4 = \mid \mid {\bf Z} - {\bf HSH}^T \mid \mid ^2 + 2 \lambda{\bf tr} ({\bf H}^T ({\bf D} - {\bf W}) {\bf H}) , \qquad s.t. \ {\bf H} \geq 0 , {\bf S} \geq 0 , {\bf H}^T{\bf DH} = I , \tag{12}\end{align*} \end{document}

which can be solved using the following updating rules: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} {\bf H} _ {ik} \leftarrow {\bf H} _ {ik} \left[\frac {({\bf ZHS} + \lambda {\bf WH}) _ {ik}} {({\bf HSH} ^T {\bf HS} + {\bf DH} {\Lambda}) _ {ik}} \right] ^ \frac {1} {4} , \qquad {\bf S} _ {ik} \leftarrow {\bf S} _ {ik} \left[\frac {({\bf H} ^T {\bf ZH}) _ {ik}} {({\bf H} ^T {\bf HSH} ^T {\bf H}) _ {ik}} \right] ^ \frac {1} {2} , \tag {13} \end{align*} \end{document}

where Λ is the Lagrangian multiplier for the constraint H^TDH = I, and its value is given by Λ =H^T XHS − H^THSH^T HS + λH^TWH. The rigorous proof of the correctness and the convergence of the iteration procedures in Equation (13) can be found in our earlier publication in Wang et al. (2011e).

Note that, the objective in Equation (12) is different from the objectives of many existing related works (Cai et al., 2008; Gu and Zhou, 2009) using NMF in that the orthogonality H^TDH = I is enforced. This is important because the second term of Equation (12) can be written as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\bf tr} ({\bf H}^T ({\bf D} - {\bf W}) {\bf H}) = \sum\nolimits_{k = 1}^K h_k^T ({\bf D} - {\bf W}) h_k$$ \end{document} . Without the orthogonality constraint, different columns become independent of each other, thereby reaching the same minimum with same solution, that is, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$h_1^* = \cdots = h_k^*$$ \end{document} . When λ is not small, degenerate solution will be obtained and the approximation of Z by HSH^T is degraded. We refer interested readers to Wang et al. (2011e) for more detailed discussions on the importance and effectiveness of enforcing orthogonality constraints in manifold regularized NMTF.

Solving Equation (11) for matrix completion, our regularized non-negative matrix tri-factorization (R-NMTF) approach for PPI prediction is proposed, which is able to utilize both PPI network data as well as other biological data.

3. Materials And Data Sources

Protein interaction networks We construct PPI graphs using the protein interaction networks compiled by BioGRID database (Stark et al., 2006). We evaluate our methods on the four species as follows: Saccharomyces cerevisiae, Drosophila melanogaster, Homo sapiens, and Caenorhabditis elegans. For each species, an undirected graph is constructed, with vertices representing proteins and edges representing observed physical interactions. For each graph, we only consider the largest connected component of the physical interaction map from BioGRID database of version 2.0.56. The details of the four PPI graphs are listed in Table 1, where “coverage” computes the percentage of known PPIs against the total number of protein pairs (n × (n − 1)/2).

Table 1.

PPI Graphs of Four Species Constructed Using BioGRID Database of Version 2.0.56

	S. cerevisiae	D. melanogaster	H. sapiens	C. elegans
Proteins	5056	7294	8255	3353
Edges (number/coverage)	9439/0.738%	24960/0.094%	25929/0.076%	6449/0.114%

Protein sequence data We download protein sequence data from GenBank (Benson et al., 2006) and compute the sequence based similarity using the mismatch kernel (Leslie et al., 2003). A protein sequence s_i is first mapped to a feature vector \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\Phi_{k , m}{(s_i)} = \{\phi_ \beta{(\alpha)} \} _{\beta \in {\cal A}^k}$$ \end{document} , where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\cal A}$$ \end{document} is the alphabet of 20 amino acids. The neighborhood \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\cal N}_{k , m} (\alpha)$$ \end{document} of a k-mer α is the set of k-mers that differs in at most m positions. The feature vector encodes all the k-mers in the neighborhood for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$\phi_ \beta (\alpha) = 1 \ {\rm if} \ \alpha \in {\cal N}_{k , m} (\beta)$$ \end{document} , and 0 otherwise. Then the mismatch kernel, thereby the induced pairwise similarity between two protein sequences s_i and s_j, is computed as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$W_{ij}^{(1)} = {\cal K}{(s_i , s_j)} = \langle \Phi_{k , m}{(s_i)} , \Phi_{k , m}{(s_j)} \rangle$$ \end{document} . In our empirical studies, we set k = 6 and m = 1, which is the same as in Martial et al. (2010). In our empirical studies, protein sequence data are used as the addition biological data source, that is, W = W⁽¹⁾.

Protein annotation data We use the functional annotations from Gene Ontology (GO) Consortium (Ashburner et al., 2000), which is a set of structured vocabularies organized in a rooted directed acyclic graph (DAG), describing attributes of gene products (proteins or RNA) in three categories of “cellular component,” “molecular function,” and “biological process.”

4. Experimental Results And Discussions

4.1. Improved prediction capability in cross-validation

We first evaluate the proposed methods and compare their prediction capabilities against three most recent PPI prediction methods.

(1) Tensor product pairwise kernel (TPPK) method (Ben-Hur and Noble, 2005): This method builds a kernel for pairwise objects. In order for a fair comparison, protein sequence and protein interaction network topology are used for kernel construction. PPI prediction is then carried out by the ranking scores for noninteracting protein pairs yielded by an SVM on the resulted score.

(2) Metric learning pairwise kernel (MLPK) method (Qiu et al., 2007): This method represents a pair of objects as the difference between its members, such that the resulted kernel is invariant with respect to the order of the proteins. Again, SVM is used to compute the ranking score for putative PPIs.

(3) Nearest neighbor (NN) (Martial et al., 2010) method: NN is the simplest classification method in machine learning. In Martial et al. (2010), a ranking score for each noninteracting protein pair is computed as

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*}f_{NN} (x_i) = \sum \limits_{x_j \in ({\cal N}_k (x_i) \cap E)} d (x_i , x_j) - \sum \limits_{x_j \in ({\cal N}_k (x_i) \cap ((V \times V) \setminus E))} d (x_i , x_j) , \tag{14}\end{align*} \end{document}

where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $${\cal N}_k (x_i)$$ \end{document} is the set of k-nearest neighbors of x_i, and d (·,·) is distance function built from a kernel by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$d (x_i , x_j) = \sqrt{{\cal K} (x_i , x_i) - 2{\cal K} (x_i , x_j) + {\cal K} (x_j , x_j)}$$ \end{document} . In our evaluations, we use the mismatch kernel for protein sequence data.

Experimental procedures For each method, we perform 20-fold cross-validation as follows. For each trial, we remove 5% known edges (PPIs) from the input graph and try to recover them using the remaining graph, which is repeated by 10 times. The average results over the 10 trials on every species are reported in Figure 2. During each trial, an internal five-fold cross validation is performed for parameter selection. For our NMTF and R-NMTF methods, the parameter is the rank k of the factor matrices H and S. For TPPK and MLPK methods, we use the Gaussian kernel, therefore the parameters are the two regularization parameters. For NN method, we select the k of NN from {1, 2, 3, 5, 10, 15}, which is the same as in Martial et al. (2010). We fine-tune the parameters for best prediction precision for all the compared methods.

FIG. 2.

Precision-recall curves by 20-fold cross-validation on four species by the compared PPI prediction methods. NMFT, nonnegative matrix tri-factorization; R-NMTF, regularized nonnegative matrix tri-factorization; TPPK, tensor product pairwise kernel method; MLPK, metric learning pairwise kernel method; NN, nearest neighbor.

Results Because all compared methods produce a list of ranking scores for noninteracting protein pairs, we employ precision-recall curves to measure the prediction performance. We compute the precisions and recalls when picking up a range of top k noninteracting protein pairs as predictions and average them over the 10 trials. The resulted precision-recall curves of four species are reported in Figure 2. From the results, we can see that both proposed methods, NMTF and R-NMTF, consistently outperform comparable methods, sometimes very significantly. In addition, the prediction performances of R-NMTF method are always better than those of NMTF method, which is consistent with our previous theoretical analysis in that multimodal biological data, i.e., protein interaction network plus protein sequence data, offer enhanced prediction performance. We also observe that the TPPK method achieves similar performances to the proposed NMTF method for D. melanogaster and C. elegans species. However, our NMTF method uses only protein interaction network data, while TPPK method exploits both network data and sequence data.

A more careful analysis on the prediction results shows that the noninteracting protein pairs (including the noninteracting protein paris in the original PPI graphs and those removed due to cross-validation) with high ranking scores identified by the proposed methods typically exhibit high similarities in their functional roles. In Table 2, we list the predicted protein pairs with top 10 highest ranking scores by R-NMTF method on S. cerevisiae species, in which the biological functions of all protein pairs are very similar to each other. For example, “PHO91” works as “Low-affinity phosphate transporter of the vacuolar membrane,” which is also the main functional role of its putative interacting partner “PHO90.” Moreover, both of them have functionalities of “transcription independent of Pi and Pho4p activity” and “overexpression results in vigorous growth.” These observations clearly demonstrate that these two proteins are functionally related and tend to interact with each other, which provides concrete evidence to support that the predicted protein interactions by the proposed R-NMTF are biologically meaningful.

Table 2.

Comparisons of Biological Functionalities of the Predicted Protein Interactions

LAT1	PDX1
Dihydrolipoamide acetyltransferase component (E2) of pyruvate dehydrogenase complex, which catalyzes the oxidative decarboxylation of pyruvate to acetyl-CoA—Lat1p; protein coding	Dihydrolipoamide dehydrogenase (E3)-binding protein (E3BP) of the mitochondrial pyruvate dehydrogenase (PDH) complex, plays a structural role in the complex by binding and positioning E3 to the dihydrolipoamide acetyltransferase (E2) core—Pdx1p; protein coding
PHO91	PHO90
Low-affinity phosphate transporter of the vacuolar membrane; deletion of pho84, pho87, pho89, pho90, and pho91 causes synthetic lethality; transcription independent of Pi and Pho4p activity; overexpression results in vigorous growth—Pho91p; protein coding	Low-affinity phosphate transporter; deletion of pho84, pho87, pho89, pho90, and pho91 causes synthetic lethality; transcription independent of Pi and Pho4p activity; overexpression results in vigorous growth—Pho90p; protein coding
PHO91	PHO87
Low-affinity phosphate transporter of the vacuolar membrane; deletion of pho84, pho87, pho89, pho90, and pho91 causes synthetic lethality; transcription independent of Pi and Pho4p activity; overexpression results in vigorous growth—Pho91p; protein coding	Low-affinity inorganic phosphate (Pi) transporter, involved in activation of PHO pathway; expression is independent of Pi concentration and Pho4p activity; contains 12 membrane-spanning segments—Pho87p; protein coding
PHO87	PHO89
Low-affinity inorganic phosphate (Pi) transporter, involved in activation of PHO pathway; expression is independent of Pi concentration and Pho4p activity; contains 12 membrane-spanning segments—Pho87p; protein coding	Na+/Pi cotransporter, active in early growth phase; similar to phosphate transporters of Neurospora crassa; transcription regulated by inorganic phosphate concentrations and Pho4p—Pho89p; protein coding
COY1	SVP26
Coy1p—Golgi membrane protein with similarity to mammalian CASP; genetic interactions with GOS1 (encoding a Golgi snare protein) suggest a role in Golgi function; protein coding	Integral membrane protein of the early Golgi apparatus and endoplasmic reticulum, involved in COP II vesicle transport; may also function to promote retention of proteins in the early Golgi compartment—Svp26p; protein coding
HOM3	HOM2
Aspartate kinase (L-aspartate 4-P-transferase); cytoplasmic enzyme that catalyzes the first step in the common pathway for methionine and threonine biosynthesis; expression regulated by Gcn4p and the general control of amino acid synthesis—Hom3p; protein coding	Aspartic beta semi-aldehyde dehydrogenase, catalyzes the second step in the common pathway for methionine and threonine biosynthesis; expression regulated by Gcn4p and the general control of amino acid synthesis—Hom2p; protein coding
REC114	REC102
Protein involved in early stages of meiotic recombination; possibly involved in the coordination of recombination and meiotic division; mutations lead to premature initiation of the first meiotic division—Rec114p; protein coding	Protein involved in early stages of meiotic recombination; required for chromosome synapsis; forms a complex with Rec104p and Spo11p necessary during the initiation of recombination—Rec102p; protein coding
PCF11	REF2
Pcf11p—mRNA 3’ end processing factor, essential component of cleavage and polyadenylation factor IA (CF IA), involved in pre-mRNA 3’ end processing and in transcription termination; binds C-terminal domain of largest subunit of RNA pol II (Rpo21p); protein coding	RNA-binding protein involved in the cleavage step of mRNA 3’-end formation prior to polyadenylation, and in snoRNA maturation; part of holo-CPF subcomplex APT, which associates with 3’-ends of snoRNA- and mRNA-encoding genes—Ref2p; protein coding
DIM1	IMP4
Dim1p—Essential 18S rRNA dimethylase (dimethyladenosine transferase), responsible for conserved m6(2)Am6(2)A dimethylation in 3’-terminal loop of 18S rRNA, part of 90S and 40S pre-particles in nucleolus, involved in pre-ribosomal RNA processing; protein coding	Component of the SSU processome, which is required for pre-18S rRNA processing; interacts with Mpp10p; member of a superfamily of proteins that contain a sigma(70)-like motif and associate with RNAs—Imp4p; protein coding The BioGRID Database Seperator
FCY22	FCY2
Fcy22p—Putative purine-cytosine permease, very similar to Fcy2p but cannot substitute for its function; protein coding	Fcy2p—Purine-cytosine permease, mediates purine (adenine, guanine, and hypoxanthine) and cytosine accumulation; protein coding The BioGRID Database Seperator

Note that, in our empirical studies, we only use one additional data source (i.e., protein sequence data) for the purpose of demonstration. In practice, more biological data, when available, could be incorporated through a proper kernel construction under the R-NMTF prediction framework to achieve better prediction results.

4.2. Conserved functional similarity of predicted new protein interactions

It is generally believed that interacting protein pairs tend to have similar functional roles. We test this hypothesis by evaluating the function similarities between our predicted interacting proteins on S. cerevisiae species. We use the Resnik score (Resnik, 1995) on the GO annotation system to measure the functional similarities between protein pairs.

Specifically, given a term t in GO, its information content is quantified as −log, where the probability of the term is taken to be its frequency among annotations to all GO terms. If a term t is annotated to a protein, it must be annotated with all the ancestors of the term t. Therefore, the higher the term is located in the GO hierarchy, the less its information content. In the case of GO, a term might have multiple parent terms, so that a pair of terms might have more than one path of common ancestors. Denoting the set of all common-ancestor terms of t₁ and t₂ as A(t₁, t₂) we define the similarity between two terms, t₁ and t₂, as sim \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$ (t_1, t_2) = \max \nolimits_{a \in A} (t_1, t_2) - \log p(a) $$ \end{document} . Because an individual protein could have multiple functions, we compute the similarity of two proteins, p₁ and p₂, by matching each function of p₁ to its most similar function of p₂, and the average is taken over all such pairs of functions as sim \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$(p_1 \rightarrow p_2) = {\rm avg} \left[\sum{_{s \in p_1}} \max \nolimits_{{t \in p_2}} \ {\rm sim} (s , t) \right]$$ \end{document} . As this measurement is not symmetric with respect to p₁ and p₂, the final semantic similarity between the two proteins is defined as sim \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} $$(p_1 , p_2) = \frac {1} {2} \times \left[{\rm sim} \ (p_1 \rightarrow p_2) + {\rm sim} \ (p_2 \rightarrow p_1) \right]$$ \end{document} .

The average Resnik scores of known interacting protein pairs, noninteracting protein pairs, and predicted putative interacting protein pairs by the proposed NMTF and R-NMTF methods are reported in Table 3, from which we have the following observations.

Table 3.

Resnik Scores of Different Protein Pairs

	Average ± standard deviation
Known interacting protein pairs	7.160 ± 1.321
Noninteracting protein pairs	3.481 ± 1.129
Predicted putative interacting protein pairs by NMTF method	6.246 ± 1.031
Predicted putative interacting protein pairs by R-NMTF method	6.737 ± 0.927

NMTF, nonnegative matrix tri-factorization; R-NMTF, regularized nonnegative matrix tri-factorization.

First, the average Resnik score of known interacting protein pairs is higher than that of noninteracting protein pairs, which is consistent with the aforementioned assumption that the interacting protein pairs have more similar biological functions than those not interacting.

Second, the average Resnik score of the predicted putative interacting protein pairs by the proposed NMTF and R-NMTF methods are close to that of known interacting protein pairs. Namely, the proteins in the predicted PPIs are highly functionally similar, which confirms the correctness of the proposed methods from protein function similarity perspective.

Last, but not least, the predicted protein interactions by R-NMTF method have higher semantic score than those by NMTF method. Therefore, prediction by R-NMTF method using multimodal biological data sources is more advantageous than that by NMTF method using only one data source.

4.3. Capability to predict new protein interactions

The ultimate goal of computational methods to predict protein interactions is to discover new interacting protein pairs that can be served as potential targets for experimental screening. Therefore, we predict protein interactions on S. cerevisiae species by the proposed R-NMTF method on the PPI graph using BioGRID data of version 2.0.56 (published on August 31, 2009).

We examine the prediction results and find that among the top 200 putative PPIs predicted by our R-NMTF method, 87 protein pairs are consistent with other evidence (i.e., they are verified by experimental results and appear in recent published literatures). For example, protein “BNI1” and protein “CTF3” are not documented as a PPI in BioGRID data of version 2.0.56 but predicted to be interacting by our method. By a careful document survey, we notice that the experimental supporting document to this protein pair only appears in early 2010 (Vizeacoumar et al., 2010). This result firmly confirms the effectiveness of our method in predicting new protein interactions. Other predicted putative protein pairs together with the PubMed document IDs for their supporting literatures are listed in Appendix Table 5. Most them are also incorporated in the most recent BioGRID data of version 3.1.69 (published on September 30, 2010). The high overlaps between our predictions and the results in existing literature give a solid support of the usefulness of the proposed method.

4.4. Improved protein function prediction using predicted PPI networks

Protein interaction networks are broadly used in various biological applications, whose performances are inevitably determined by the quality input PPI graphs. Therefore, we assess the quality of predicted PPI networks in protein function prediction on S. cerevisiae species.

We predict protein functions on the original PPI graph constructed from the BioGRID database, the PPI graph filled by the top 1000 putative interacting protein pairs predicted by NMTF method, and that by R-NMTF method. We make predictions using the following three benchmark graph-based protein function prediction methods:

(1) Majority voting (MV) (Schwikowski et al., 2000) method: This method assigns functions to a protein via its connecting neighbors in certain ranges.

(2) Iterative majority voting (IMV) (Vazquez et al., 2003) method: This method is the same as the MV method, but iteratively repeats the function assignment process until certain conditions are satisfied.

(3) Function flow (FF) (Nabieva et al., 2005) method: This method formulates the annotation problem as a minimum multiway-cut problem, where the goal is to assign a unique function to all unannotated proteins so as to minimize the cost of edges connecting proteins with different assignments.

We implement these methods following the details in the original literatures. Because FF method produces a ranking list of predicted protein functions, we select a threshold such that the prediction precision is maximized. Five-fold cross-validation is performed to predict the functions in “biological process” of GO. The average prediction precision over all test functions and five trials of cross-validation of the involved methods on different PPI graphs are reported in Table 4.

Table 4.

Performance of Protein Function Prediction by Involved Method on Compared PPI Graphs

	MV	IMV	FF
Original PPI graph	30.12%	30.92%	32.99%
Predicted PPI graph by NMTF method	34.85%	35.21%	36.02%
Original PPI graph by R-NMTF method	35.98%	36.33%	38.19%

PPI, protein–protein interaction; MV, majority voting; IMV, iterative majority voting; FF, function flow.

The results in Table 4 show that the function prediction performance for all three methods are improved when the predicted PPI graphs are used. Such results experimentally prove that the predicted PPI graphs have higher quality than the original one, which demonstrates that the filled putative protein interactions by the proposed methods are largely biologically meaningful. Thus, we can tentatively conclude that the proposed NMTF and R-NMTF indeed can improve the protein interaction networks. Again, multimodal biological data sources based R-NMTF method is better than single data source-based NMTF method.

5. Conclusions

In this article, instead of considering protein–protein interaction as a binary classification problem, as in many existing works, we formulated it as a matrix completion problem. Taking this different perspective, the difficulty of selecting negative training samples in classification-based methods is averted. Moreover, because the number of protein interaction types is small, recovery of missing PPIs from an incomplete observed protein interaction network can be suitably solved under the framework of matrix completion. We first proposed to use NMF approach to predict PPIs from protein interaction network data, and then extended it through manifold regularization to incorporate multimodal biological data sources. We have conducted extensive empirical studies to evaluate different aspects of the proposed methods on four genomic species including S. cerevisiae, D. melanogaster, H. sapiens, and C. elegans. Promising results in the experiments validate our methods, which are consistent with our theoretical analysis.

6. Appendix

6.1. Predicted new protein interactions with highest-ranking scores by R-NMTF method

In Table 5, we list a subset (87 out of 200) of the predicted putative protein interactions with highest ranking scores on the original S. cerevisiae PPI graph by the proposed R-NMTF method. These protein pairs are already discovered to be interacting in existing literatures. The document ID in PubMed of the supporting literatures are listed in the third column of Table 5.

Table 5.

Predicted PPIs with Highest Ranking Scores by R-NMTF Method

Interacting protein A	Interacting protein B	PubMed ID of the supporting literature
BNI1	CTF3	20065090
ACS2	RTC1	18676811
BNA4	VHS3	20093466
ALG3	ERG2	19325107
APQ12	NUR1	20093466
AVT5	ELP2	20093466
BOS1	YDR186C	20093466
BMS1	IPT1	20093466
ATP5	PTC1	20093466
ATP5	CHK1	20093466
ARF1	GET2	16269340
AFT1	RPN4	20439772
BCY1	NFT1	20093466
ANP1	VID22	20093466
ANP1	MKK1	20093466
BEM1	UBP14	20093466
BNI1	MXR2	20065090
ATG8	APE2	20093466
ARP8	BUD13	20041197
ARP8	ESC2	20041197
ARO7	CSF1	20093466
ARP8	GSG1	20041197
ALG8	SPF1	19325107
ALP1	ARG1	16941010
ATP14	MDM35	20093466
AFG3	PET8	20093466
BIM1	RAD55	20065090
ARP9	PYC2	20093466
BAT1	SGF29	20093466
BRE2	ESBP6	20093466
ANP1	HXK2	20093466
APC5	RPN10	20093466
ABD1	MET6	20093466
ACT1	RIM101	20093466
ARO2	DTD1	20093466
BRE5	RPS10A	20508643
BUD14	PIG1	19841731
ABD1	PAC1	20093466
BET5	GSG1	19416478
BNI1	YPL066W	20093466
ARF1	IST2	20093466
BRE2	NUP188	20093466
ACH1	CIT1	20093466
AMD1	YJL070C	18719252
ARC18	PTC6	20093466
ATP18	ATP18	16716082
ARL1	MRN1	20093466
BNI4	VAC14	20093466
ARP1	KEM1	20093466
ALG8	APM3	20093466
BIM1	SAC6	20065090
ATP18	AEP2	20093466
BNI1	RTT109	20065090
ARP9	YJR079W	20093466
ALG9	SAP155	20093466
AHA1	SPT3	20093466
BCK1	MTC5	20093466
AAD3	MDM12	20093466
ALG9	SPT4	19325107
AIM22	IMP2	20093466
BRE5	TIF1	20508643
AVO1	GAA1	20101242
ARP8	URE2	20041197
BAT1	BCH1	20093466
BRR2	RAX1	20093466
AIM26	BCK1	14764870
APC5	MUM2	20093466
ALD5	OAR1	20093466
ALK2	LSM12	20489023
BNI1	DIP5	20065090
ALT1	MCX1	20093466
ACS2	OLA1	16429126
ALB1	TIF6	16651379
ATG21	INP53	20093466
BTS1	TLG2	19325107
APL1	SPG1	20093466
BRE5	HTB1	20508643
BIM1	RPL19B	20093466
ALG1	OPI3	20093466
ALB1	REI1	16651379
APS1	HOM6	20093466
AEP2	FYV10	20093466
AIM44	RAD5	20093466
BIM1	PET8	20065090
ARL3	ARL1	20093466
ANP1	PUS2	20093466
BOS1	YGR021W	20093466

References

Aloy

, Ceulemans

, Stark

et al. 2003. The relationship between sequence and interaction divergence in proteins. J. Mol. Biol., 332:989–998.

Ashburner

, Ball

, Blake

et al. 2000. Gene ontology: tool for the unification of biology. Nat. Genet., 25:25–29.

Barker

, Pagel

2005. Predicting functional gene-links from phylogenetic-statistical analyses of whole genomes. PLoS Comp. Biol., 1:24–31.

Ben-Hur

, Noble

2005. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21:i38.

Benson

, Karsch-Mizrachi

, Lipman

2006. GenBank. Nucleic Acids Res., 34:D16–20.

Bowers

, Cokus

, Eisenberg

et al. 2004a. Use of logic relationships to decipher protein network organization. Science, 306:2246.

Bowers

, O'Connor

, Cokus

et al. 2005. Utilizing logical relationships in genomic data to decipher cellular processes. FEBS Journal, 272:5110–5118.

Bowers

, Pellegrini

, Thompson

et al. 2004b. Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol., 5:R35.

Cai

, He

, Wu

et al. 2008. Non-negative matrix factorization on manifold. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 63–72.

10.

Candès

, Plan

2009. Matrix completion with noise. Proceedings of the IEEE.

11.

Candès

, Recht

2009. Exact matrix completion via convex optimization. Found. Comut. Math., 9:717–772.

12.

Chen

, Jeong

2009. Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics, 25:585.

13.

Chen

, Liu

2005. Prediction of protein–protein interactions using random decision forest framework. Bioinformatics, 21:4394.

14.

Chung

1997. Spectral Graph Theory. Amer. Math. Society.

15.

Dai

, Prasad

2010. Low-rank matrix completion for inference of protein-protein interaction networks. In AIP Conference Proceedings. 1281:1531.

16.

Ding

, He

, Simon

2005. On the equivalence of nonnegative matrix factorization and spectral clustering. Proc. SIAM Data Mining Conf, 606–610.

17.

Ding

, Li

, Jordan

2006a. Convex and semi-nonnegative matrix factorizations for clustering and low-dimension representation. Technical Report LBNL-60428. Lawrence Berkeley National Laboratory.

18.

Ding

, Li

, Peng

et al. 2006b. Orthogonal nonnegative matrix t-factorizations for clustering. Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, 126–135.

19.

Enright

, Iliopoulos

, Kyrpides

et al. 1999. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90.

20.

Ermolaeva

, White

, Salzberg

2001. Prediction of operons in microbial genomes. Nucleic Acids Res., 29:1216.

21.

Goh

, Bogan

, Joachimiak

et al. 2000. Co-evolution of proteins with their interaction partners1. J. Mol. Biol., 299:283–293.

22.

, Zhou

2009. Co-clustering on manifolds. Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, 359–368.

23.

, Gruhler

, Heilbut

et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415:180–183.

24.

Ito

, Tashiro

, Muta

et al. 2000. Toward a protein–protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proceedings of the National Academy of Sciences of the United States of America, 97:1143.

25.

Jansen

, Yu

, Greenbaum

et al. 2003. A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science, 302:449.

26.

Jothi

, Cherukuri

, Tasneem

et al. 2006. Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein–protein interactions. J. Mol. Biol., 362:861–875.

27.

Lee

, Seung

1999. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791.

28.

Leslie

, Eskin

, Weston

et al. 2003. Mismatch string kernels for SVM protein classification. Adv Neural Inf Process Syst., 1441–1448.

29.

Littler

, Hubbard

2005. Conservation of orientation and sequence in protein domain-domain interactions. J. Mol. Biol., 345:1265–1279.

30.

Luo

, Ding

, Huang

et al. 2009. Non-negative laplacian embedding. In 2009 Ninth IEEE International Conference on Data Mining, 337–346.

31.

Marcotte

, Marcotte

2002. Predicting functional linkages from gene fusions with confidence. Applied Bioinformatics, 1:93–100.

32.

Marcotte

, Pellegrini

, Ng

et al. 1999. Detecting protein function and protein–protein interactions from genome sequences. Science, 285:751.

33.

Martial

, Michael

, Jean-Philippe

et al. 2010. Large-scale prediction of protein–protein interactions from structures. BMC Bioinformatics, 11.

34.

Martin

, Roe

, Faulon

2005. Predicting protein–protein interactions using signature products. Bioinformatics, 21:218.

35.

Moreno-Hagelsieb

, Collado-Vides

2002. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics, 18:S329.

36.

Nabieva

, Jim

, Agarwal

et al. 2005. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21:i302.

37.

Nooren

, Thornton

2003. Structural characterisation and functional significance of transient protein–protein interactions. J. Mol. Biol., 325:991–1018.

38.

Pagel

, Wong

, Frishman

2004. A domain interaction map based on phylogenetic profiling. J. Mol. Biol., 344:1331–1346.

39.

Panchenko

, Wolf

, Panchenko

et al. 2005. Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins: Structure, Function, and Bioinformatics, 61:535–544.

40.

, Klein-Seetharaman

, Bar-Joseph

2005. Random forest similarity for protein–protein interaction prediction from multiple sources. Pac Symp Biocomput., 10:531–542.

41.

Qiu

, Hue

, Ben-Hur

et al. 2007. A structural alignment kernel for protein structures. Bioinformatics, 23:1090.

42.

Ramani

, Marcotte

2003. Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol., 327:273–284.

43.

Resnik

1995. Using information content to evaluate semantic similarity in a taxonomy. Arxiv preprint cmp-lg/9511007.

44.

Salgado

, Moreno-Hagelsieb

, Smith

et al. 2000. Operons in Escherichia coli: genomic analyses and predictions. Proceedings of the National Academy of Sciences of the United States of America, 97:6652.

45.

Schwikowski

, Uetz

, Fields

2000. A network of protein–protein interactions in yeast. Nat. Biotech., 18:1257–1261.

46.

Shen

, Zhang

, Luo

et al. 2007. Predicting protein–protein interactions based only on sequences information. Proceedings of the National Academy of Sciences, 104:4337.

47.

Shi

, Malik

2000. Normalized cuts and image segmentation. IEEE. Trans. on Pattern Analysis and Machine Intelligence, 22:888–905.

48.

Shoemaker

, Panchenko

2007a. Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLoS Comp. Biol., 3:337–334.

49.

Shoemaker

, Panchenko

2007b. Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comp. Biol., 3:595–601.

50.

Stark

, Breitkreutz

, Reguly

et al. 2006. BioGRID: a general repository for interaction datasets. Nucleic Acids Res., 34:D535.

51.

Strong

, Mallick

, Pellegrini

et al. 2003. Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol., 4:R59.

52.

Teichmann

2002. The constraints protein-protein interactions place on sequence divergence. J. Mol. Biol., 324:399–407.

53.

Valdar

, Thornton

2001. Protein-protein interfaces: analysis of amino acid conservation in homodimers. Proteins: Structure, Function, and Bioinformatics, 42:108–124.

54.

Vazquez

, Flammini

, Maritan

et al. 2003. Global protein function prediction from protein-protein interaction networks. Nat. biotechnology, 21:697–700.

55.

Vizeacoumar

, Van Dyk

, Vizeacoumar

et al. 2010. Integrating high-throughput genetic interaction mapping and high-content screening to explore yeast spindle morphogenesis. J. Cell. Biol., 188:69.

56.

Wang

, Huang

, Ding

2011a. Simultaneous clustering of multi-type relational data via symmetric nonnegative matrix tri-factorization. In The 20th ACM Conference on Information and Knowledge Management.

57.

Wang

, Huang

, Nie

et al. 2011b. Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization. In Proceedings of the 34th international ACM SIGIR conference (ACM SIGIR 2011), 933–942.

58.

Wang

, Nie

, Huang

, Ding

2011c. Dyadic transfer learning for cross-domain image classification. In 2011 IEEE International Conference on Computer Vision (IEEE ICCV 2011), 551–556.

59.

Wang

, Nie

, Huang

et al. 2011d. Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI 2011).

60.

Wang

, Nie

, Huang

et al. 2011e. Nonnegative matrix tri-factorization based high-order co-clustering and its fast implementation. In 2011 Eleventh IEEE International Conference on Data Mining.

61.

Yanai

, Derti

, DeLisi

2001. Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes. Proceedings of the National Academy of Sciences of the United States of America, 98:7940.

62.

Zhang

, Wong

, King

et al. 2004. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics, 5:38.