The IsoRank algorithm of Singh, Xu, and Berger was a pioneering algorithmic advance that applied spectral methods to the problem of cross-species global alignment of biological networks. We develop a new IsoRank approximation that exploits the mathematical properties of IsoRank’s linear system to solve the problem in quadratic time with respect to the maximum size of the two protein–protein interaction (PPI) networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality, our approximation performs better than the exact IsoRank, which is doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms. We also present an analogous approximation to IsoRankN, which extends the network alignment to more than two species.
INTRODUCTION
As has been noted in the social network and pattern recognition application communities as well (Emmert-Streib et al., 2016; Khan et al., 2012), spectral methods have enjoyed some success in many variants of the approximate graph matching problem, including the Global Network Alignment problem studied by computational biologists for aligning protein–protein interaction (PPI) networks from different species. The first spectral method for the PPI global alignment problem proposed by the biological community was the IsoRank method of Singh et al. (2008). This method for global biological network alignment is still used today. With the advent of many more experiments, the size of the networks available to match has grown considerably, making running IsoRank unfeasible on these networks without access to substantial computational resources. In this article, we develop a new IsoRank approximation that exploits the eigenproperties of IsoRank’s linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while still remaining computationally inexpensive. Finally, we analogously show how to incorporate approximate IsoRank into the IsoRankN algorithm (Liao et al., 2009), where IsoRankN is a method that extends IsoRank to multiple network alignments.
The quality of our approximation is investigated in two different settings. First, in synthetic experiments, we generate random graphs using the Erdős and Rényi (1960) and Albert and Barabási (2002) models and ask IsoRank to recover the graph isomorphism between our graphs and a random node permutation. We parameterize the node similarity with a noise parameter , which scales from 0 to 1 (where 1 indicates a random node permutation). For three ranges of the noise level, we measure the quality of the approximation recovered by our IsoRank approximation: a low-error regime where IsoRank itself will recover the matching of the true isomorphism, an intermediate where exact IsoRank only recovers 80%−90% of the true matching, and a noisy where exact IsoRank recovers only 70%−75% of the true matching. In each case, we show how close approximate IsoRank comes, as well as demonstrating the considerable savings in CPU time. Next, we return to the biological domain and consider real-world networks from five species: mouse (Mus musculus), rat (Rattus norvegicus), fly (Drosophila melanogaster), baker’s yeast (Saccharomyces cerevisiae), and human (Homo sapiens). In this case, as in the original IsoRank, sequence similarity (as measured by BLAST [Altschul et al., 1990]) becomes the measure of node similarity, where more distant species will typically have less sequence similarity, but we are no longer in the setting where we are trying to recover a permutation of a single network: we do not even have the same number of nodes because of gene duplication and loss. As evolutionary distance increases, there will also be increasing network rewiring, resulting in inserted and deleted edges in each PPI network. Finally, the networks differ because we have only a noisy and incomplete sample of the true PPI networks. Because we of course do not have a ground truth alignment in this setting as we do for the synthetic data, we instead adopt the most commonly reported measures of alignment quality [see Hashemifar and Xu (2014) and Section 3.2 below], including graph-theoretical measures that report edge and subgraph consistency across the alignment, as well as functional coherence measures that report how often aligned proteins are known to be involved in similar biological functions across the two species. While these are the quality measures we adopt here, we note that alternative ways of measuring the quality of biological network alignment appear in El-Kebir et al. (2015), Mamano and Hayes (2017), Neyshabur et al. (2013), and Vijayan et al. (2015).
We additionally compare exact and approximate IsoRank to Hubalign (Hashemifar and Xu, 2014) and FINAL (Zhang and Tong, 2016), where these alternative alignment methods were chosen because (1) they are both in the set of recent well-performing competing network alignment algorithms, and (2) from this set, they are the ones we found that can also be successfully run on a desktop computer (with sufficient memory). While Approximate IsoRank is consistently slightly worse than exact IsoRank on the graph theoretical measures, when compared with Hubalign and FINAL, Approximate IsoRank is often better and, when not, usually a close second. Most interestingly, across the board, we find that Approximate IsoRank performs better than exact IsoRank, according to our functional enrichment measures, and also has better functional enrichment in the majority of experiments than the competing alignment methods. Perhaps this is because Approximate IsoRank is less sensitive to missing and incorrect edges than exact IsoRank.
Finally, we extend our approximation to the IsoRankN algorithm of Liao et al. (2009), which produces a network alignment across multiple species. We discuss the performance of exact and Approximate IsoRankN for cross-species functional orthology clustering.
ALGORITHM
IsoRank
Let us first recall the IsoRank algorithm (Singh et al., 2008) and some basic techniques from linear algebra (Strang, 2003). We consider first the version of IsoRank that only tries to match network topology but does not also incorporate a measure of node–node similarity. The key observation is that IsoRank looks at the tensor product of two graphs, which is essential for developing our approximation algorithm. The relation to the tensor product was not known by the developers of IsoRank but was also previously observed by Zhang et al. (Kazemi and Grossglauser, 2016; Zhang and Tong, 2016; Zhang et al., 2017).
Consider two simply connected, undirected, and possibly weighted graphs and . Their adjacency matrices are and , respectively, and weighted degree matrices are , , and , , respectively. Thus, the corresponding transition matrices are
which are both column stochastic, i.e., and . and can be viewed as transition matrices defining the random walk in the PageRank algorithm that underlies the IsoRank algorithm.
Next we consider the tensor product of the two graphs, which is a graph with nodes and its adjacency matrix is defined as
where denotes the Kronecker product (Schafer, 2017). Similarly, the transition matrix of the tensor product graph is
where is the degree matrix of the tensor product graph. This is because
and
As suggested in Singh et al. (2008), the basic version of IsoRank (i.e., without the node–node similarity matrix E) tries to find , , such that
This is done by the power method, which is the following iterative procedure, for
The initial value should be chosen based on data.
A modified version of IsoRank tries to find , , such that, for where , which contains the domain-specific node–node similarity information to improve the graph alignment result. For example, in the original IsoRank article of Singh et al. (2008), besides the two PPI networks of the species, additional sequence similarity data, which are presented by , are required in order to perform the graph matching. Here, denotes how similar the amino acid sequence of the protein of is to that of the protein of . This sequence similarity information is obtained using the standard BLAST sequence alignment (Altschul et al., 1990). Since we use the tensor product notation in our article, we have transformed E to , where converts a matrix to a vector.
As suggested in Singh et al. (2008), (2) is again solved iteratively as follows, for
Other approximation methods can also be used to solve (2) (e.g., Kazemi and Grossglauser, 2016).
Remark 1. Note that, in both iterative procedures, the normalization step is not needed if we have , , , and This is because
Approximate IsoRank
Let us first consider the basic version of IsoRank (1) without taking into account node similarity, and the key here is to notice that IsoRank is actually working on a tensor product graph of the original two graphs. Therefore, based on the Perron–Frobenius theorem (Smyth, 2002), the basic IsoRank (1) tries to find the right eigenvector of the transition matrix P that corresponds to the unique eigenvalue 1. Therefore, is nothing but the steady-state distribution (SSD) of the tensor product graph, which can be computed directly without using the power method at all since IsoRank works on a tensor product graph. Using the properties of the transition matrix P of a simply connected, undirected, and possibly weighted graph, the SSD can be obtained without any iterative procedure as follows;
where and (which is the weighted total degree). This means, for the basic version IsoRank (1) (without the node-similarity measure), we have directly, and there is no need of the power method or any iterative procedure, which involves matrix vector multiplications.
Now let us consider the general version of IsoRank (2), which also incorporates the extra sequence similarity information. Note that when , (2) reduces to (1) and, therefore, . In contrast, when , we have . Thus, for , we can use the following convex combination to get an approximation:
Again, there is no iterative procedure and no matrix–vector multiplications at all. We call this approximation represented by Equation 5 either or Base. The reason we call this approximation Base is that it can be used either directly or as the initial guess of an iterative procedure for solving (2).
According to (2) and (5), we have
Let us introduce , , , and .
Then we have
In contrast, multiplying on the left of both sides of (2), we obtain
This implies
Since is symmetric and its largest eigenvalue is 1, its eigenvalues are , . The corresponding eigenvectors are denoted by , , with and , . It is easy to check that . Since is symmetric, , , form an orthonormal basis and, thus with . Note that
Now we are ready to present the following theorem, which measures the approximation error of Base.
Theorem 1.
LetFor, we havewhere.
Proof: Based on (6), for ,
This means,
Therefore,
This completes the proof.
As we mentioned, if we are not satisfied with Base, we can use it as an initial guess and perform power iterations. In practice, since it provides a good initial guess, we found that usually only one step of power iteration is needed, which provides another approximation as follows:
Because this is the result of performing only one linear step, we also refer to this approximation of Equation (7) as Line. Note that, due to the fact that , the matrix–vector multiplication step can be computed without explicitly forming P. We can use the following equivalent expression to compute :
where is the inverse operation of .
Computational complexity
To determine the computational cost of using the iterative method (3) to solve IsoRank (2), we note that our approximations Base and Line mainly depend on the matrix–vector multiplication for computing in the iteration (if the implementation uses the tensor product, then this step becomes matrix–matrix multiplication) and vector addition for computing , . Let us assume the complexity for matrix–vector multiplication (or matrix–matrix multiplication in the tensor product implementation) is and the complexity for vector addition is . Then the complexity of using the iterative method (3) to solve IsoRank (2) is where k is the number of iterations used in the implementation. Using the standard convergence analysis of (3), at least steps are required to reach a given relative error tolerance (measured in norm). Here, is the second largest eigenvalue of P, which happens to be the second largest eigenvalue of since P and are similar.
For our approximation = Base, since we only use vector addition once, the complexity is Finally, the approximation Line does one step of iteration using as the initial guess. Thus, its computational cost is .
As we can see, our approximation’s main computational advantage is that there is no need for iterations. Consider the worst case that , let us choose as used in our experiments. The iterative method (3) needs about iterations to achieve . This implies that our approximations are about 30 times faster than iterating to convergence here. Of course, the computational saving will depend on specific implementations and computers.
EXPERIMENTS
Synthetic networks
Our first set of experiments uses the Erdős–Rényi (ER) random graphs (Erdős and Rényi, 1960) and the Barabási–Albert (BA) random graphs (Albert and Barabási, 2002). For the ER graphs, to ensure with high probability that the resulting random graphs are connected, we set the probability that two nodes are connected to be . For the BA graphs, we use an initial seed graph consisting of five nodes and edge set , forming a path. Each new node is then connected to exactly five existing nodes with a probability proportional to the number of edges the existing nodes already have.
To apply and test the IsoRank algorithm (2) and our two approximation algorithms, i.e., Base and Line, we first generate a simply connected random graph and then randomly permute the nodes to obtain another random graph. As suggested in Singh et al. (2008), we choose in (2) for all the tests. In addition, we generate different by adding a different level of random noise to the edge weights of the correct random permutation. Starting with the correct random permutation, whose values are either 0 or 1, for a “good” , we add random noise between . For a “bad” , we add random noise between . When we add random noise between , we refer to the resulting as “intermediate.” Once the random noise is added, we take the absolute value component-wise and then normalize to ensure that and all the components are nonnegative. Intuitively, the “good” provides more information about the correct alignment, resulting in better accuracy than other choices of . Finally, since exact IsoRank(2) is solved by an iterative procedure, we stop the iteration when the Frobenius norm (denoted by , and equal to the square root of the summed squares of absolute values of the elements of A) of the difference between two successive iterations is less than or the maximal number of iterations, which is set to be 100, is reached. In all our experiments, it took less than 20 iterations to converge.
We report the accuracy, CPU time, error, and error. The accuracy of the alignment is measured in terms of how many pairs of nodes between the original network and permuted network are predicted correctly (in percentage). Since we implement all the algorithms in Matlab, the CPU time is measured in seconds using the Matlab built-in functions tic and toc. Finally, we use the standard norm to measure the error between the exact IsoRank result and its approximations. To verify the theory, we also look at the error using the weighted norm. The numerical results are reported in Tables 1 and 2. Note that both exact IsoRank and its approximations recover the correct alignment in the “good” regime, but even exact IsoRank does not with more noise in the node-similarity measure. However, both Approximate IsoRank approximations are very close to exact IsoRank in both norms and have much faster running times.
Comparison of Exact and Approximate IsoRank on Erdős–Rényi (ER) Graphs
Good
Intermediate
Bad
Acc.
CPU
Acc
CPU
Acc
CPU
,
,
,
100%
1.44
—
—
83.9%
1.41
—
—
71.6%
1.53
—
—
100%
0.19
4.92e-6
1.55e-7
83.3%
0.19
4.90e-6
1.56e-7
71.1%
0.20
5.05e-6
1.57e-7
100%
0.31
4.54e-7
1.45e-8
83.8%
0.29
4.58e-7
1.46e-8
71.6%
0.32
4.66e-7
1.47e-8
,
,
,
100%
8.21
—
—
83.6%
7.88
—
—
71.2%
8.36
—
—
100%
0.83
2.25e-6
6.49e-8
83.0%
0.83
2.31e-6
6.70e-8
70.7%
0.84
2.29e-6
6.63e-8
100%
1.45
1.99e-7
5.80e-9
83.6%
1.43
2.08e-7
6.08e-9
71.2%
1.45
2.05e-7
5.97e-9
,
,
,
100%
42.56
—
—
83.1%
45.66
—
—
71.7%
44.26
—
—
100%
2.98
1.04e-6
2.78e-8
82.4%
3.27
1.05e-6
2.82e-8
71.1%
3.12
1.03e-6
2.77e-8
100%
6.33
8.81e-8
2.38e-9
83.1%
6.62
8.97e-8
2.43e-9
71.7%
6.47
8.78e-8
2.37e-9
R = Exact IsoRank; = Base; =Line. Small, medium, and large ER graphs are generated with good, intermediate, and bad noise measures of pairwise node similarity.
Comparison of Exact and Approximate IsoRank on Barabási–Albert (BA) Graphs
Good
Intermediate
Bad
Acc.
CPU
Acc
CPU
Acc
CPU
,
,
,
100%
3.37
—
—
87.3%
3.22
—
—
75.3%
3.27
—
—
100%
0.21
4.69e-5
2.66e-6
84.9%
0.19
4.41e-5
2.57e-6
73.2%
0.22
4.40e-5
2.63e-6
100%
0.38
1.19e-5
4.28e-7
87.2%
0.35
1.14e-5
4.04e-7
75.3%
0.40
1.07e-5
4.18e-7
,
,
,
100%
13.81
—
—
86.9%
14.03
—
—
72.9%
14.27
—
—
100%
0.84
2.41e-5
1.32e-6
83.9%
0.83
2.37e-5
1.33e-6
71.0%
0.94
2.27e-5
1.31e-6
100%
1.57
6.77e-6
2.10e-7
86.8%
1.57
6.39e-6
2.11e-7
73.0%
1.81
6.09e-6
2.08e-7
,
,
,
100%
38.85
—
—
86.6%
37.43
—
—
73.2%
38.62
—
—
100%
3.15
1.21e-5
6.60e-7
84.1%
3.08
1.22e-5
6.55e-7
71.1%
3.15
1.30e-5
6.63e-7
100%
5.18
3.62e-6
1.05e-7
86.5%
5.11
3.48e-6
1.04e-7
73.0%
5.17
3.72e-6
1.06e-7
R = Exact IsoRank; = Base; =Line. Small, medium, and large BA graphs are generated with good, intermediate, and bad noise measures of pairwise node similarity.
Quality measures for global network alignments
On synthetic networks, we had the ground truth alignment and could measure performance relative to how successful each algorithm was in recovering this alignment. For real-world networks, there is no ground truth available, and other measurements of alignment quality must be chosen in order to quantify the quality of the alignment. Following Hamp et al. (2013), we used edge correctness (EC), largest common connected subgraph (LCCS), and average functional similarity (AFS) to measure the quality of the alignments. The first two are graph-theoretical measures that compare the topological properties of neighborhoods of nodes that are matched, while the third also measures the biological coherence of the alignments using the cellular functions defined by the gene ontology (GO) (GO Consortium, 2004). Details of these measures are given in the following paragraphs.
Edge correctness
Let and represent two networks, with and let be a one-to-one mapping between the vertices of the two networks. For an edge , we call to be a translation of from to . Then, the metric EC, which measures the percentage of the edge-translation of edges in that is an actual edge in , is mathematically described as
Since PPI connections are (weakly) preserved during evolution, higher EC values imply better protein pairings.
Largest common connected subgraph
Given two graphs, , with , and a one-to-one mapping , let E denote the set of edges of that map to edges in using g. Then, denotes the number of edges in the largest connected component of E. Like EC, LCCS also measures the number of edges that are preserved across the species by the mapping g.
Average functional similarity
The two previous metrics calculate how well the protein mapping preserves the structural features of the PPI networks. Since important biological functional pathways are also broadly preserved during evolution, we also expect a good mapping to strongly preserve protein function across species. Following Hashemifar and Xu (2014), we also measure the AFS across species, measuring the similarity of sets of Gene Ontology Consortium (2004) (GO) terms assigned to matched nodes. Recall that the GO is a hierarchical and species-agnostic directed acyclic graph representation of protein functional labels (i.e., GO terms) composed of three domains (molecular function [MF], biological process [BP], and cellular component [CC]), each with a separate root. The GO taxonomy is hierarchical, so the GO terms closer to the roots represent more general protein functions, which become increasingly specialized as we move farther away from the roots. All nodes are associated with their known GO terms as collected in the GO database (GO Consortium (2004)). For each protein mapping , we use Schlicker’s similarity (i.e., {MF, BP, CC}) (Hashemifar and Xu, 2014; Schlicker et al., 2006) to find the similarity of GO terms of p with that of . So, the is computed as
where |g| represents the size of one-to-one protein mappings in g (if the alignment is global, ). Mappings with higher AFS based on the known functional labels are preferred. We depart from Hashemifar and Xu (2014) to match recommended best practice in the field (Devkota et al., 2022) and enforce some specificity of protein function by including only the GO labels whose shortest distance from a root is 5 in our computation of the AFS measure.
Biological networks
We extracted PPI networks from the IntAct database (Hermjakob et al., 2004); these networks are constructed using all physical interactions detected using coimmunoprecipitation (Anderson, 1998) and yeast 2-hybrid (Kohalmi et al., 1998) experiments. We generated the networks for five species: S = {Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, Rattus norvegicus}. We obtain pairwise node similarity scores from the blastp sequence similarity bit scores (assigning a 0% similarity if blastp does not align the sequences) and, as above and as recommended by Singh et al. (2008), set the parameter that trades off between network and node similarity to be 0.6. The graph properties of these constructed PPI networks are provided in Table 3.
Graph Properties of IntAct Networks
Network
#nodes
#edges
Avg degree
Avg cc
human
32431
301529
18.595
0.069
fly
11247
42555
7.567
0.027
mouse
13967
36541
5.232
0.051
rat
10792
23315
4.320
0.075
yeast
6478
70638
21.808
0.285
Table includes information regarding the number of nodes, number of edges, average node degree, and average clustering coefficient for human, fly, mouse, rat, and yeast networks.
For the 10 distinct pairwise combinations of these five species, for each pair of corresponding PPI networks, we performed experiments using the true IsoRank similarity matrix R and the two approximations Base and = Line. Finally, for each obtained similarity matrix and , we used the greedy matching algorithm to find the top 2000 protein mappings , and , respectively.
Graph theoretical results on biological networks
In order to see how the IsoRank approximations compare with the exact IsoRank mappings, we performed a similar analysis to that done in Section 3.1. For a given pair of species , let be the true IsoRank similarity matrix, and let and be the approximations Base and Line described in (5) and (7), respectively, for a given -value. After using the greedy matching algorithm on , , and , let , , and be the corresponding list of 2000 top one-to-one cross-species protein matches for the species pair in Exact IsoRank and Approximate IsoRank versions Base and Line. We generated results for each and ; results were remarkably stable across different values; therefore, we present only the results with recommended (Singh et al., 2008). We also computed the EC and LCCS scores for the matchings. Results appear in Table 4.
L2 Approximation Loss and Matching Similarity for R = Exact IsoRank, as Compared with Approximations Base and Line
Orgs
Norms
Matching sim.
Edge cor.
LCCS
N1
N2
R
R
F
B
0.000947
0.000088
0.507
0.810
0.074
0.178
0.185
695
825
808
F
M
0.001208
0.000121
0.810
0.941
0.028
0.073
0.072
15
23
23
F
R
0.003482
0.000335
0.378
0.516
0.015
0.037
0.051
53
42
77
H
B
0.000371
0.000032
0.976
0.991
0.213
0.216
0.216
1174
1207
1202
H
F
0.000483
0.000032
0.973
0.994
0.063
0.065
0.064
21
132
134
H
M
0.000571
0.000059
0.986
0.998
0.086
0.093
0.093
158
264
264
H
R
0.001522
0.000158
0.903
0.982
0.018
0.040
0.041
76
408
388
M
B
0.001115
0.000178
0.780
0.941
0.089
0.193
0.191
389
586
581
M
R
0.005213
0.000599
0.779
0.940
0.041
0.129
0.128
57
373
363
R
B
0.004086
0.000525
0.353
0.469
0.107
0.135
0.393
426
679
1135
Edge correctness (EC) and largest common connected subgraph (LCCS) similarities for each of R, and . F = fly network, B = Baker’s yeast network, H = human network, M = mouse network, R = rat network. for all experiments.
Across all species pairs, we find that the approximation loss of approximate IsoRank as compared with exact IsoRank is always , and as expected lower for Line than Base. Similarly, in all species pairs, the matching similarity of the top 2000 nodes to exact IsoRank is superior for Line as compared with Base. The disparity is most pronounced in the fly-yeast pair, where the difference is . However, for pairs where one of the organisms is human, there is relatively minor difference between the mapping produced by the two approximations, and the mappings themselves are closer to the true mapping.
Table 4 results also show that the best EC and LCCS values always come from either the true IsoRank R or its Line approximation. Out of the 10 cross-species experiments tested for , R and both produced the best EC scores in 4 settings each (the remaining two being ties). For LCCS, Line produced better LCCS scores more often (five times) than exact IsoRank (three times). This demonstrates that in a realistic biological setting, performing just a single iteration of our approximation scheme is enough to retain, and sometimes even enrich, the network-theoretical properties of the true IsoRank mappings in biological networks.
Functional similarity results on biological networks
We used the AFS measure, described above, to measure how functionally similar the protein pairs mapped by IsoRank and its approximations are. The AFS results are provided in Table 5. Since the AFS results were not sensitive to -values (fixing GO category, species pairs and type of IsoRank, and the expected s.d. obtained by setting was ), we report in Table 5 and present the obtained AFS scores across all GO categories and species pairs.
Average Functional Similarity Values for Top 2000 Matching-Obtained Approximations ( = Base and = Line) and True IsoRank () Matrices for All Three Gene Ontology Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”), with
MF
BP
CC
N1
N2
F
B
0.513
0.515
0.517
0.261
0.266
0.268
0.539
0.547
0.550
F
M
0.574
0.573
0.570
0.268
0.270
0.269
0.579
0.586
0.583
F
R
0.495
0.429
0.391
0.190
0.170
0.154
0.429
0.399
0.369
H
B
0.595
0.594
0.595
0.362
0.362
0.361
0.503
0.503
0.502
H
F
0.630
0.627
0.628
0.335
0.332
0.332
0.585
0.582
0.581
H
M
0.661
0.659
0.659
0.348
0.346
0.345
0.575
0.574
0.573
H
R
0.622
0.621
0.621
0.308
0.308
0.308
0.464
0.464
0.464
M
B
0.515
0.523
0.521
0.265
0.273
0.274
0.526
0.530
0.529
M
R
0.613
0.600
0.600
0.297
0.293
0.292
0.536
0.530
0.530
R
B
0.477
0.426
0.360
0.216
0.188
0.159
0.409
0.377
0.345
The columns “N1” and “N2” represent IntAct networks for the different pairs of organisms. Best performing results are in bold.
The results are in stark contrast to the results obtained from the graph-theoretical measures. Unlike EC and LCCS outputs, the mapping obtained from the Base approximation (i.e., ) gave the highest AFS scores in the majority of cases across all GO categories (though in most cases by a slight margin). So, for the biological function measures, it seems that the Approximate IsoRank is superior to the exact IsoRank. Even in cases where it did not produce the highest-scoring one-to-one mapping, it closely followed the best scoring algorithm. This implies that if the downstream application of interest is using a better-annotated species to assign GO functional labels in a less well-annotated species, perhaps because it manages to somewhat denoise the network (see Section 5), the simple approximate IsoRank Base may be superior to the original IsoRank for this task.
Competing methods
We further evaluated our Base approximation by comparing its EC, LCCS, and AFS scores against the results from two recent network alignment methods: fast attributed network alignment (FINAL) (Zhang and Tong, 2016) and Hubalign (Hashemifar and Xu, 2014). These methods were chosen among the many competing global network alignment methods because they both have been shown to produce good alignments, and they are relatively fast. The results appear in Tables 6 and 7. FINAL takes in two networks (and optional edge/node attributes, where we use the node pairwise similarity matrix as in the competing methods) and finds an internetwork pairwise alignment matrix that solves a topology-sensitive optimization problem. It generates the one-to-one mappings by performing greedy alignment on this optimized matrix. For our experiments, we generated the top 2000 FINAL 1–1 node alignments.
EC and LCCS Results for Top 2000 Matching Obtained from Approximate IsoRank (, ) and Two Competing Methods (FINAL and Hubalign), for All Gene Ontology Hierarchies
N1
N2
FINAL
Hubalign
(a) EC
F
B
0.0740
0.0996
0.0301
F
M
0.0284
0.0450
0.0166
F
R
0.0150
0.0062
0.0033
H
B
0.2137
0.0969
0.0408
H
F
0.0638
0.0197
0.0061
H
M
0.0862
0.0292
0.0194
H
R
0.0189
0.0101
0.0037
M
B
0.0896
0.1256
0.0443
M
R
0.0413
0.0658
0.0352
R
B
0.1077
0.0514
0.0156
(b) LCCS
F
B
695
14
38
F
M
15
6
5
F
R
53
3
1
H
B
1174
1155
1481
H
F
21
27
35
H
M
158
401
836
H
R
76
29
26
M
B
389
31
51
M
R
57
57
55
R
B
426
22
23
The columns “N1” and “N2” represent IntAct networks for organisms: Human (H), Mouse (M), Rat (R), Fly (F) and Baker’s yeast (B). Best performing results are in bold.
EC, edge correctness; LCCS, largest common connected subgraph.
Average Functional Similarity Results for Top 2000 Matching Obtained from Approximate IsoRank (, ) and Two Competing Methods (FINAL and Hubalign) for MF, BP, and CC Gene Ontology Hierarchies
MF
BP
CC
N1
N2
FINAL
Hub.
FINAL
Hub.
FINAL
Hub.
F
B
0.513
0.537
0.521
0.261
0.287
0.277
0.539
0.545
0.538
F
M
0.574
0.593
0.536
0.268
0.279
0.246
0.579
0.598
0.557
F
R
0.495
0.482
0.478
0.190
0.185
0.186
0.429
0.419
0.420
H
B
0.595
0.557
0.521
0.362
0.329
0.295
0.503
0.486
0.464
H
F
0.631
0.606
0.513
0.334
0.318
0.248
0.585
0.562
0.487
H
M
0.661
0.647
0.614
0.348
0.344
0.308
0.575
0.581
0.546
H
R
0.622
0.588
0.579
0.308
0.282
0.271
0.464
0.436
0.437
M
B
0.515
0.552
0.520
0.265
0.288
0.269
0.526
0.542
0.523
M
R
0.613
0.614
0.614
0.297
0.298
0.301
0.537
0.536
0.539
R
B
0.477
0.472
0.463
0.216
0.209
0.207
0.409
0.402
0.400
The columns “N1” and “N2” represent IntAct networks for the different pairs of organisms: Human (H), Mouse (M), Rat (R), Fly (F) and Baker’s yeast (B). Best performing results are in bold.
Hubalign operates by running a minimum degree heuristic algorithm to estimate the topological and functional importance of proteins and comparing it with proteins from another network to form alignments. Similar to FINAL and Base, we wanted to obtain the top 2000 one-to-one Hublign mappings, but their current C++ implementation can only return the complete global alignment as output. To solve this issue, we reimplemented the Hubalign code from scratch. Furthermore, we made an additional change to the alignment implementation, where we postponed the selection of relatively “weak” alignment between single degree nodes by greedily selecting the stronger node mappings first. We made this change primarily to make the Hubalign results computationally feasible, but we tested our alternative Hubalign with this postponement against the original Hubalign on the smallest two networks (aligning fly and yeast), and, as expected, since we are preferring stronger node mappings, all three reported metrics, EC, LCCS, and AFS, got substantially better with our change. The python source code of our Hubalign implementation is freely available in our Github repository.
EXTENDING APPROXIMATE ISORANK TO MORE THAN TWO SPECIES
An extension of IsoRank, called IsoRankN (or IsoRank-Nibble) (Liao et al., 2009), has been developed to perform network alignment across multiple species. However, unlike in the two-species setting, IsoRankN does not operate by generating similarity scores between every possible node configuration, as that would result in the formation of a higher-order tensor instead of a similarity matrix when the species count is > 2. IsoRankN instead uses the collection of PPIs to produce functionally enriched cross-species protein clusters, where the functional similarity between proteins within each cluster is informed by the graph-spectral properties of the networks and the sequence relationships between the proteins across species.
Given K PPI networks , the sequence similarity matrices between every network pair, the threshold parameter β, IsoRank weight , and the PageRank-Nibble constant , the IsoRankN cross-species clustering is performed in the following way.
Run the original IsoRank on every pair of networks to obtain scores on all edges of the functional similarity graph.
For every protein v, compute the star , where is the neighborhood of v in the functional similarity graph.
Pick an arbitrary remaining PPI network and order the proteins by the sum of edge weights in the induced graph on . Following the ordering and excluding proteins already assigned to clusters, spectrally partition to obtain .
Merge every pair of clusters and in which and .
Repeat steps 3 and 4 until all proteins are assigned to a cluster.
Base and Line approximations for IsoRankN
Observing that the IsoRankN algorithm also suffers from the same computational bottleneck involving the construction of the IsoRank similarity matrices, we proceeded to reimplement the IsoRankN algorithm, where we replaced the exact IsoRank similarities with the and approximations. We set the IsoRankN parameters to their default values () and computed the cross-species clusters on the IntAct PPI networks of the species: Mouse, Rat, Fly, and Yeast (we did not include the human PPI network in this experiment because its addition caused the overall memory requirement to go beyond 64GB of RAM). As in the original IsoRankN article, we used two entropy metrics on the protein clusters to measure their functional properties, which we call “(un)normalized averaged clusterwide entropy” here. We also calculated the same measure of cluster species diversity by counting the number of clusters with their protein members belonging to all (or some subset of) the species set. In detail, these measures are as follows.
Unnormalized averaged clusterwide entropy (UACE)
For a given cluster C, let be the proportion of proteins in C that have the GO label i. Then, per-cluster UACE (i.e., ) is computed as
where d is the number of all GO labels and the logarithm is base 2. Averaging this score across all clusters, we get the clusterwide entropy score , where N is the total number of clusters obtained.
Normalized averaged clusterwide entropy (NACE)
We get from by normalizing it with the log of cluster size (i.e., . Averaging this across all the clusters, we obtain the final clusterwide NACE.
Note that even though step 3 of the IsoRankN algorithm described above specifies arbitrary selection of the species network during each iteration, the actual IsoRankN implementation kept the order fixed. Since our choice of the PPI networks is different from the original implementation, for the sake of completeness, we reinvestigated the better approach among the two by using the UACE and NACE metrics. For this experiment, we chose the following ordering: (1) baker’s yeast, (2) fly, (3) mouse, and (4) rat. Computing the UACE and NACE results for GO = MF, we found that the randomized versus fixed scores were very close (NACE = 0.98 and UACE = 0.93 for both). Thus, we selected the randomized PPI selection setting as our default option during clustering.
Table 8 shows that, when we take all the clusters into account, the IsoRankN clustering obtained using the = Base approximation produces smaller cluster entropy than the exact IsoRank and the = Line approximation across all entropy metrics and GO hierarchies. However, when we measured the clusters’ species diversity by counting the number of clusters with proteins from more than one species (Table 9), the count was always the lowest for Base. Additionally, the Base approximation also produced far fewer clusters with size () compared with Line () and original IsoRank (). Thus, we recomputed the entropy evaluation by excluding all the size = 2 clusters from all 3 IsoRankN implementations. The entropy results (Table 10) showed that this filtering significantly deteriorated the functional coherence of the = Base clusters.
NACE and UACE Results for the Exact IsoRankN Clustering and Its Two Approximations for All Three Gene Ontology (GO) Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”)
IsoRank
R
Metric type
GO
NACE
BP
1.395
1.454
1.457
CC
0.973
0.986
0.984
MF
0.970
0.986
0.981
UACE
BP
2.714
3.037
3.068
CC
0.922
1.006
1.020
MF
0.854
0.939
0.954
, Base; , Line; R, original IsoRank. Best performing results are in bold.
Table Showing the Number of Clusters (Obtained from Exact IsoRank (R) and Its Approximations Base and Line ( and , Respectively)) Having Proteins from Exactly k out of 4 Number of Species
#species in a cluster (k)
R
1
15046
12977
12317
2
1962
2134
2116
3
186
382
404
4
53
85
102
total
17247
15578
14939
NACE and UACE Results for the Exact IsoRankN Clustering and Its Two Approximations for All Three Gene Ontology (GO) Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”)
IsoRank
R
Metric type
GO
NACE
BP
2.589
2.038
2.040
CC
1.113
1.069
1.065
MF
0.948
1.021
1.027
UACE
BP
9.412
6.226
6.208
CC
2.115
1.628
1.618
MF
1.633
1.376
1.421
The scores are computed by selecting only those clusters with size > 2. Best performing results are in bold.
Thus, the results in Tables 9 and 10 show that the cluster entropy and species diversity are comparable between Line and exact IsoRank. Both produce bigger and more diverse clusters, with higher functional similarity among proteins in the larger clusters than the Base approximation.
DISCUSSION
Synthetic networks
The results on the synthetic networks serve to provide some guidance on which IsoRank approximation should be utilized as a function of the measured node similarities in the two networks. As shown in Tables 1 and 2, for both ER and BA random graphs, when “good” is used, the accuracy for all three algorithms is , i.e., all the algorithms can recover the network alignments perfectly, and the errors, measured in both norms, are very small. However, the original IsoRank R (2) is much slower than the two approximation algorithms. Approximate IsoRank = Base (5) is about 7.6–14.3 times faster, and Approximate IsoRank = Base (7) is about 4.7–6.7 times faster. Since they all recover the perfect alignment, when contains accurate information about the alignment, we suggest using approximate IsoRank = Base (5). When “in-between” is used, although the error between the original IsoRank and approximations is still small, the accuracy for all three algorithms decreases. As we can see, Approximate IsoRank = Base (5) is the fastest among the three, but its accuracy is slightly worse than the original IsoRank R (2). Approximate IsoRank = Base (7) is slightly slower than Approximate IsoRank Base (5) but achieves roughly the same accuracy as the original IsoRank R (2). But Approximate IsoRank Line (7) is still much faster than original IsoRank (2) and, therefore, balancing the accuracy and CPU time, we would suggest using Approximate IsoRank Line (7) in this case. Finally, when is “bad,” we observe the same results as the case is “in-between.” In particular, Approximate IsoRank Base (5) is the fastest, but its accuracy is the worst (but not by much, still close to the accuracy of the original IsoRank [2]). Approximate IsoRank Line (7) seems to achieve the best balance between accuracy and CPU time, and we would suggest using it in this case as well. However, if the CPU time is the priority, then we recommend Approximation IsoRank Base (5) since it is the fastest. Finally, from the numerical experiments, we observe that both our approximation algorithms achieve optimal computational complexity , where , as expected since the synthetic networks considered here are sparse (namely in the range 15.4–18.5 in Table 1 and approximately 5.0 in Table 2) and in our implementation.
Biological networks
Interestingly, we found the Base approximation to IsoRank to be more robust and produce better predictions of protein functional roles across the board than even an exact IsoRank. Since inferring the function of unknown proteins across species is one of the main applications of the Global Alignment of Biological Networks, perhaps given the fact that the networks are not actually isomorphic and we have an imperfect sample of the true edges, somehow our approximation to IsoRank is more robust to noise and therefore works better than producing the best isomorphism as measured by purely graph-theoretical measures. This requires further investigation.
We also approximated the IsoRankN clustering algorithm in the multispecies setting by replacing the original IsoRank similarities with the = Base and = Base approximations. Different from the two species setting, we found that the clusters obtained from Base were less functionally meaningful and less species diverse than those obtained from Line and exact IsoRank. Therefore, for the multispecies clustering problem, we recommend using the -approximation, Line, as it produces comparable results to the original IsoRankN implementation but has a significantly less computational cost.
Footnotes
ACKNOWLEDGMENTS
The authors thank the National Science Foundation for support under NSF grant CCF-1934553. The authors also thank the NSF CC* grant 2018149 for Tufts computing cluster resources.
AUTHORS’ CONTRIBUTIONS
K.D.: conceptualization, methodology, formal analysis, and software; A.B.: formal analysis; X.H.: methodology, formal analysis, and software; L.C.: conceptualization, methodology, formal analysis, and funding. All authors were involved in writing and editing the article.
CODE AND DATA AVAILABILITY
The implementation of approximate IsoRank is available at https://github.com/kap-devkota/approximate_ISORANK. Gene Ontology files are from the official GO website (http://geneontology.org/ OBO data version: releases/2022–12-04, GAF version: 2.2). The biological networks used in the experiments can be downloaded from the IntAct FTP link at https://www.ebi.ac.uk/intact/download/ftp
AUTHOR DISCLOSURE STATEMENT
The authors have no competing interests to declare that are relevant to the content of this article.
FUNDING INFORMATION
This work was supported by the National Science Foundation (NSF), grant CCF-1934553 and the NSF CC* grant 2018149 for Tufts computing cluster resources.
References
1.
AlbertR, BarabásiA-L. Statistical mechanics of complex networks. Rev Mod Phys, 2002; 74(1):47–97; doi: 10.1103/RevModPhys.74.47
2.
AltschulSF, GishW, MillerW, et al.Basic local alignment search tool. J Mol Biol, 1990; 215(3):403–410; doi: 10.1016/S0022-2836(05)80360-2
3.
AndersonNG. Co-Immunoprecipitation. In: Protein Targeting Protocols. (CleggR.A. eds.) Humana Press: Totowa, NJ; 1998. pp. 35–45. 10.1385/0-89603-487-9:35
4.
GO Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 2004; 32(suppl_1):D258–D261; doi: 10.1093/nar/gkh036
5.
DevkotaK, SchmidtH, WerenskiM, et al.Glider: Function prediction from GLIDE-based neighborhoods. Bioinformatics, 2022; 38(13):3395–3406; doi: 10.1093/bioinformatics/btac322
6.
El-KebirM, HeringaJ, KlauGW. Natalie 2.0: Sparse global network alignment as a special case of quadratic assignment. Algorithms, 2015; 8(4):1035–1051; doi: 10.3390/a8041035
7.
Emmert-StreibF, DehmerM, ShiY. Fifty years of graph matching, network alignment and network comparison. Information Sciences, 2016; 346:180–197; doi: 10.1016/j.ins.2016.01.074
8.
ErdősP, RényiA. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 1960; 5(1):17–60; doi: 10.1515/9781400841356.38
9.
HampT, KassnerR, SeemayerS, et al.Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics, 2013; 14(Suppl 3):S7; doi: 10.1186/1471-2105-14-S3-S7
10.
HashemifarS, XuJ. HubAlign: An accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics, 2014; 30(17):i438–i444; doi: 10.1093/bioinformatics/btu450
11.
HermjakobH, Montecchi-PalazziL, LewingtonC, et al.InTact: An open source molecular interaction database. Nucleic Acids Res, 2004; 32(Database issue):D452–D455; doi: 10.1093/nar/gkh052
12.
KazemiE, GrossglauserM. On the structure and efficient computation of IsoRank node similarities. arXiv, 2016; doi: 10.48550/arXiv.1602.00668
13.
KhanAM, GleichDF, PothenA, et al.A multithreaded algorithm for network alignment via approximate matching. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE; 2012. pp. 1–11 10.1109/SC.2012.8
14.
KohalmiSE, ReaderLJV, SamachA, et al.Identification and characterization of protein interactions using the yeast 2-hybrid system, In: Plant Molecular Biology Manual. Springer: Netherlands, Dordrecht; 1998. pp 95–124. 10.1007/978-94-011-5242-6_6
15.
LiaoC-S, LuK, BaymM, et al.IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics, 2009; 25(12):i253–i258; doi: 10.1093/bioinformatics/btp203
16.
MamanoN, HayesWB. SANA: Simulated annealing far outperforms many other search algorithms for biological network alignment. Bioinformatics, 2017; 33(14):2156–2164; doi: 10.1093/bioinformatics/btx090
17.
NeyshaburB, KhademA, HashemifarS, et al.NETAL: A new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 2013; 29(13):1654–1662; doi: 10.1093/bioinformatics/btt202
18.
SchaferRD. An Introduction to Nonassociative Algebras. Courier Dover Publications; 2017.
19.
SchlickerA, DominguesFS, RahnenführerJ, et al.A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics, 2006; 7(1):302–316; doi: 10.1186/1471-2105-7-302
20.
SinghR, XuJ, BergerB. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA, 2008; 105(35):12763–12768; doi: 10.1073/pnas.080662710
21.
SmythM. A spectral theoretic proof of Perron-Frobenius. In: Mathematical Proceedings of the Royal Irish Academy. JSTOR, 2002. pp. 29–35.
22.
StrangG. Introduction to Linear Algebra. 3rd. edition.. Wellesley-Cambridge Press: Wellesley, MA; 2003.
23.
VijayanV, SaraphV, MilenkovićT. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 2015; 31(14):2409–2411; doi: 10.1093/bioinformatics/btv161
24.
ZhangS, TongH. FINAL: Fast attributed network alignment. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. pp. 1345–1354, 10.1145/2939672.293976
25.
ZhangS, TongH, TangJ, et al.iNEAT: Incomplete network alignment. In: 2017 IEEE International Conference on Data Mining (ICDM). IEEE; 2017. pp. 1189–1194 10.1109/ICDM.2017.160