Abstract
Abstract
Biological networks reconstruction is a crucial step towards the functional characterization and elucidation of living cells. Computational methods for inferring the structure of these networks are of paramount importance since they provide valuable information regarding organization and behavior of the cell at a system level and also enable careful design of wet-lab experiments. Despite many recent advances, according to the scientific literature, there is room for improvements from both the efficiency and the accuracy point of view in link prediction algorithms. In this article, we propose a new method for the inference of biological networks that makes use of a notion of similarity between graph vertices within the framework of graph regularization for ranking the links to be predicted. The proposed approach results in more accurate classification rates in a wide range of experiments, while the computational complexity is reduced by two orders of magnitude with respect to many current state-of-the-art algorithms.
1. Introduction
The network reconstruction problem is commonly recast as a pattern recognition problem on graphs whose nodes represent a (sub)set of molecules (e.g., proteins, genes, enzymes) of a given organism and whose edges (either directed or undirected) are representative of particular biological properties of the system (e.g., protein physical interaction, gene regulation, reaction catalysis). The reconstruction problem, therefore, can be reduced to the inference of interactions among vertices of these graphs, given a partial (and often affected by experimental errors) knowledge of the network. The whole process could in principle benefit from the inclusion of heterogeneous biological information from a wide spectrum of data sources: genomic sequence similarity, gene expression levels, and cellular localization (as well as other features) can be integrated into the network under study, with the aim of getting more accurate predictions (Bleakley et al., 2007; Yip and Gerstein, 2009; Kashima et al., 2009b).
1.1. Related work
In the rich literature concerning network inference, a major line of research has been devoted to the development of machine learning supervised or semi-supervised methods. Supervised learning (SL) is a pattern recognition technique that aims at learning a classification function from the labeled (known) data given as input to the classifier during the training phase. Semi-supervised learning (SSL) is a novel learning paradigm that takes into account not only labeled data, but also unlabeled data, during the learning phase. The aim is to complement the information given as input to the classification algorithms with the reciprocal relationships of the whole input data in order to the enrich the learning process. This framework is particularly useful if labels are scarce and difficult to obtain. Its usefulness and effectiveness has been demonstrated in a wide and heterogeneous number of experimental settings (Chapelle et al., 2006).
Among the class of SL algorithms applied to the problem of biological network inference, many approaches have exploited support vector machines (which are considered state of the art statistical learning algorithms). For instance, Ben-Hur and Noble (2005) proposed to frame the network reconstruction problem into a binary classification problem by taking protein pairs as input of the classifier algorithm and by using ad hoc developed kernels named pairwise kernels (P-SVM) for representing similarities between proteins. Bleakley et al. (2007) proposed a divide and conquer strategy to reduce space and time complexity with respect to SVM-based algorithms (i.e., learning subnetworks around each given vertex of the graph). This approach has been used for inferring metabolic (Bleakley et al., 2007) and regulatory networks (Mordelet and Vert, 2008).
For the semi-supervised approach to the problem, we recall here the work of Yip and Gerstein (2009), which refined the local model approach by expansion of the training set of each vertex. The basic assumption the authors made is that accurate predictions can be used to widen the training set in order to ensure a robust set of examples to the classifier algorithm.
Another SSL algorithm that has been recently proposed is a link mining method named LinkPropagation (Kashima et al., 2009a,b), which exploits the so-called label propagation method (Zhu et al., 2003; Chapelle et al., 2006) to predict the existence of a given link in a graph, given the partial knowledge of the network and the similarities among the nodes. Informally, label propagation consists of an application of the guilty by association principle: two vertices of a graph that are similar (according to a given measure) are considered to belong to the same class (e.g., functional category). Hence, if the class label for one of the two nodes is known, the label can be propagated and the annotation transferred.
Although many approaches have been proposed during the last years, many challenging questions are still considered to be open. Mainly, the inherent noisy nature of datasets and their size prompt the need for both accurate and efficient algorithms design. On the one hand, data provided as input to learning algorithms is incomplete and noisy (false positive rates for yeast, worm, and fly data are estimated to range from 25% to 45% while false negatives range from 75% to 90%, according to Huang et al., 2007), hence justifying the search for noise-resilient approaches. On the other hand, high-throughput techniques produce data at increasing rates, thus making of primary importance the development of efficient solutions.
In this work, we investigate a new approach for the network reconstruction problem based on regularization theory on graphs via the Laplace operator. In particular, we aim at taking advantage of the so-called Regularized Laplacian (Smola and Kondor, 2003).
The Regularized Laplacian (hereafter also denoted as RL) is a kernel matrix with several interesting theoretical properties (Smola and Kondor, 2003), whose usefulness has been demonstrated in graph mining when applied to citations of research articles (Shimbo et al., 2007). Interestingly, this type of regularization on graphs is intertwined with a graph-based similarity measure known as Matrix-Forest similarity metrics (hereafter also denoted as MFSM). MFSM has been proposed as an alternative way of measuring the affinity between objects represented as nodes of a graph (Chebotarev and Shamis, 1997, 1998). It integrates indirect paths between vertices of the network and is related to the Laplacian of the adjacency matrix of the underlying graph, from which it can be computed (Fouss et al., 2007). The concept of MFSM has been recently exploited in collaborative recommendation systems, with interesting results in terms of accuracy of the retrieval system under study (Fouss et al., 2007).
By means of the regularization operator, we aim at deriving a proximity measure between each pair nodes of the graphs under study. This type of measure can be naturally used as the score to rank links to be inferred. The rationale behind our approach is to exploit the inherent capability of the regularization approach of handling noisy and missing information, in order to correctly recover the structure of the underlying networks. The proposed method (hereafter denoted as RL-BNI,
This article is organized as follows: in Section 2.1, we describe our approach to the problem, while in Section 3, we introduce the experimental protocol we used for cross-validation. Then, in Section 4, we present and discuss experimental results, and in Section 5, we conclude with some final considerations and sketch some possible future research directions.
2. Methods
2.1. The proposed approach
In order to detail the proposed algorithm, we first recall some preliminary definitions. Let
With the help of the above introduced definitions, we may now sketch the RL-BNI algorithm. We are given in input an m × m matrix
where ∣E∣ is the number of training examples (i.e., known edges), ∣E+∣ is the number of positive training examples, and ∣E−∣ is the number of negative training examples. This pre-processing allows us to set target values according to Fisher discriminant following Kashima et al. (2009a).
Notwithstanding its simplicity, the performance of RL-BNI is interestingly effective, as we demonstrate below.
2.2. Computational complexity
The computation of the normalized Laplacian has time complexity O(m2), while the computation of the Regularized Laplacian kernel is O(m3) (since it entails the inversion of a m × m matrix). Hence, the resulting time complexity of RL-BNI is O(m3)
Regarding the space complexity, the computation of the normalized Laplacian costs O(m2). Since computing the RL-kernel has memory requirement O(m2), we can state that the overall space complexity of the proposed approach is O(m2).
Table 1 summarizes the time and space asymptotic complexities of RL-BNI compared to the time and space complexities of LinkPropagation and P-SVM (which are considered two state-of-the-art algorithms). The time complexity of RL-BNI is two orders of magnitude lower than the complexity of LinkPropagation and three orders of magnitude lower than the complexity of P-SVM, while the memory requirement is the same for RL-BNI and LinkPropagation (i.e., two orders of magnitude lower than the space complexity required by P-SVM).
3. Experimental Protocol And Data
3.1. Competing methods
In order to carry out a comprehensive evaluation of our method, we compared it with a state-of-the-art semi-supervised algorithm, namely LinkPropagation (Kashima et al., 2009b), with a random-walk with restart algorithm (RWR) (Tong et al., 2006; Gallagher et al., 2008) and with a diffusion-based kernel (DK) (Kondor and Lafferty, 2002; Smola and Kondor, 2003).
3.1.1. LinkPropagation
The LinkPropagation algorithm extends the guilty-by-association principle to the problem of link mining by considering pairs of vertices instead of isolated nodes. Within this setting, two pairs of nodes can be considered to have similar link weights if they are (enough) similar to each other. This latter method has been renamed, by analogy, link propagation principle. The main feature of LinkPropagation is its capability of handling multiple associations among nodes, hence allowing one to embed in the link strength the information regarding two node pairs in two different networks. LinkPropagation has been used for simultaneously predicting the networks of multiple species (Kashima et al., 2009b).
3.1.2. Random walk with restart
We also chose to implement, for comparison purpose, another link prediction method that we selected as representative of pure topology-based algorithms. In particular, we opted for a random-walk with restart algorithm. Approaches that make use of random-walk similarity among nodes of a graph are routinely used in many information retrieval, machine learning, and computational biology applications (Page et al., 1998; Gallagher et al., 2008; Kohler et al., 2008; Morrison et al., 2005), hence making it a natural choice for network analysis.
The RWR algorithm aims at deriving the proximity between two nodes i and j by simulating a random walk process: the random walker starts from node i and, at each iteration, moves to one of its neighbors with probability proportional to the weight of the arc that join them (Gallagher et al., 2008; Tong et al., 2006; Morrison et al., 2005). At each iteration, the walker has also a probability equal to (1 − d) of being teleported back to starting node i (where 0 ≤ d ≤ 1 is a parameter known as restart probability or damping factor). The proximity score between nodes i and j is then defined as the steady-state probability that the random walker will finish its walk at node j. Given a graph
Matrix
3.1.3. Diffusion kernel
The diffusion kernel has been introduced by Kondor and Lafferty (2002) and has been demonstrated to be a useful method for representing the structure of pairwise similarities among points of a data space in many applications. In this context, we make use of the regularization framework that includes DK as special case, as defined by Smola and Kondor (2003). In particular, given a graph with adjacency matrix
This kernel has an interesting interpretation in terms of diffusion processes: if we hypothetically inject a given quantity at node i and let it diffuse through the graph, the result of this quantity flow measured at node j (at steady state) is
3.2. Data and experimental settings
To evaluate our approach, we adopted the same experimental setting used by Kashima et al. (2009b) to validate LinkPropagation. In particular, we used the metabolic networks of C. elegans, H. pylori, and S. cerevisiae as gold standard. The nodes of each network are enzymes. Links connect two nodes if the pair of enzymes associated to the nodes catalyze successive reactions in a given metabolic pathway (Kanehisa et al., 2008; Kashima et al., 2009b). Table 2 summarizes the number of nodes and edges for each of the investigated networks. We also used the same cross-species similarities (i.e., the normalized Smith-Waterman scores of pairwise local protein alignments) and the intra-species similarities (i.e., the gene expression data) used by Kashima et al. (2009b) to construct the nine similarity matrices used as input for inference by LinkPropagation.2
We compared the proposed method with the LinkPropagation algorithm, with the random-walk based algorithm and with a diffusion kernel algorithm, to evaluate its predictive accuracy. For RWR, we tried different values of the restart probability (d). Results refer to d = 0.85, which provided the higher accuracy levels. For DK and RL-BNI, we tested different values to be assigned to parameter σ, taken, respectively, from the set {0.01, 0.1, 0.5, 1, 5, 10, 20, 30, 50, 80, 10}. The metabolic networks have been randomly sampled to obtain training and test data. We chose three different ratios of the training set: particularly, the 25%, 50%, and 75% of all the node pairs have been used as training data. The results have been evaluated by taking into consideration two types of metrics: the AUC (Area under the ROC curve) (Gribskov and Robinson, 1996) and the AUF (Area under the FDR curve) (Bleakley et al., 2007). The AUC summarizes the ROC curve which plots the true positives as a function of false positives when a threshold used for predicting interactions from the ranking scores is varied. The AUF summarizes the FDR curves which represents the ratio of false positives among all positive predictions. The AUC is a widely adopted measure to evaluate classification algorithms (in fact it has been used as a metric to benchmark LinkPropagation), but we decided also to take into account the AUF since biological network reconstruction is a type of problem that presents many more negative than positive examples. This prompts the need for a metric (such as the FDR) able to capture the fact that, among the first-ranked predictions, there should be enough true positives (Bleakley et al., 2007). We also recall here that a perfect classifier would present AUC equal to 1 and AUF equal to 0, while a random classifier would obtain an AUC of 0.5 and an AUF equal to the ratio of not connected nodes between all pairs (which is very close to 1). Motivated by the above considerations we run, for each ratio of training set, five experiment trials and we computed average values for either AUC and AUF.
4. Results And Discussion
The results are reported in Tables 3–5 (for the AUC metrics) and in Tables 6–8 (for the AUF metrics). Each table reports, for any of the three metabolic networks, the AUC and the AUF of LinkPropagation with individual (LPind) and simultaneous (LPind) inference, RWR, DK, and RL-BNI. For DK and RL-BNI, we decided to report, for the sake of conciseness, only the first two best results corresponding to given σ values. In particular, we found that the best accuracies are achieved by DK with σ = 5 and σ = 10, and by RL-BNI with σ = 1 and σ = 5. Results are reported for different values of the training set ratio: 25% for Tables 3 and 6, 50% for Tables 4 and 7, and 75% for Tables 5 and 8.
The best result for each reconstructed network is highlighted in bold.
The best result for each reconstructed network is highlighted in bold.
The best result for each reconstructed network is highlighted in bold.
The best result for each reconstructed network is highlighted in bold.
The best result for each reconstructed network is highlighted in bold.
The best result for each reconstructed network is highlighted in bold.
The proposed approach outperforms its competing methods in a wide range of experimental conditions. In particular, LinkPropagation gives slightly better results in terms of AUC when the ratio of training data is 25% (for S. cerevisiae network reconstruction and also in terms of the total AUC). When the ratio of training data is raised to 50% and, even more, when this ratio becomes 75%, the performance of RL-BNI is markedly more accurate than that of LinkPropagation, RWR and DK. It is worthwhile to notice that RWR becomes competitive with LinkPropagation on the 50% training set (with AUC values higher with respect to LinkPropagation with individual inference), and it outperforms both variants of LinkPropagation for the 75% training set ratio, while its performance sharply degrades when run on networks with lower training set ratio. We also may observe the good performance of DK that achieved AUC values higher than those of LPind and LPsim with training data ratio of 75%, while being between LPind and LPsim with training data ratios of 25% and 50%.
The cross-validation evaluated in terms of AUF provides further evidence of the effectiveness of the graph regularization method. In fact, when algorithms are compared according to the false discovery rate, RL-BNI always reaches the lowest values of AUF in the whole range of experimental conditions.
We also decided to evaluate the efficiency in terms of computational resources of the proposed approach, by comparing the CPU time needed by RL-BNI with that of LinkPropagation. The timing results refer to reconstruction tasks within the above described experimental setting and are reported in Table 9. Each result is the average of a given inference task (composed of three networks) on 5 trials (corresponding to the fivefold cross-validation runs). All algorithms have been implemented in Matlab and run on an Intel Pentium4 CPU 2.80-GHz PC with 2 GB of RAM, with Linux operating system. The comparison clearly demonstrates the impact of the lower time complexity of the RL-BNI algorithm with respect to LinkPropagation. This impact should be even sharper when scaling to bigger datasets.
CPU time is expressed in seconds.
5. Conclusion
We have introduced a new method based on graph regularization (i.e., RL-BNI) for the reconstruction of biological networks. The approach we developed is a link prediction method that relies on the proximity scores obtained from the regularization procedure for ranking the links to be inferred. The proposed algorithm has been tested on a biological benchmark in the inference of metabolic networks of S. cerevisiae, H. pylori, and C. elegans. The results of the cross-validation experiments demonstrate the capability of RL-BNI of ranking the links to be inferred with higher level of accuracies with respect to its competing algorithms in a broad range of experimental conditions.
The strength of the proposed approach is represented by its capability of obtaining (in most cases) improved predictive accuracies at reduced computational requirements. Moreover, the accuracy levels remain competitive even when RL-BNI is compared with methods that exploit additional information sources, such as LinkPropagation with the simultaneous inference option.
There are a few directions that could be interesting to investigate in the future. First of all, different methods of graph-based regularization could be investigated and perhaps combined. A careful choice of these different combinations could also lead to improved results. Another line of research which might be pursued is the improvement of computational complexities which, despite the improved efficiency of RL-BNI with respect to other algorithms, could become too demanding for the inference of very large scale networks.
