Approximate IsoRank for Scalable and Functionally Meaningful Cross-Species Alignments of Protein Interaction Networks

Abstract

The IsoRank algorithm of Singh, Xu, and Berger was a pioneering algorithmic advance that applied spectral methods to the problem of cross-species global alignment of biological networks. We develop a new IsoRank approximation that exploits the mathematical properties of IsoRank’s linear system to solve the problem in quadratic time with respect to the maximum size of the two protein–protein interaction (PPI) networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality, our approximation performs better than the exact IsoRank, which is doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms. We also present an analogous approximation to IsoRankN, which extends the network alignment to more than two species.

1. INTRODUCTION

As has been noted in the social network and pattern recognition application communities as well (Emmert-Streib et al., 2016; Khan et al., 2012), spectral methods have enjoyed some success in many variants of the approximate graph matching problem, including the Global Network Alignment problem studied by computational biologists for aligning protein–protein interaction (PPI) networks from different species. The first spectral method for the PPI global alignment problem proposed by the biological community was the IsoRank method of Singh et al. (2008). This method for global biological network alignment is still used today. With the advent of many more experiments, the size of the networks available to match has grown considerably, making running IsoRank unfeasible on these networks without access to substantial computational resources. In this article, we develop a new IsoRank approximation that exploits the eigenproperties of IsoRank’s linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while still remaining computationally inexpensive. Finally, we analogously show how to incorporate approximate IsoRank into the IsoRankN algorithm (Liao et al., 2009), where IsoRankN is a method that extends IsoRank to multiple network alignments.

The quality of our approximation is investigated in two different settings. First, in synthetic experiments, we generate random graphs using the Erdős and Rényi (1960) and Albert and Barabási (2002) models and ask IsoRank to recover the graph isomorphism between our graphs and a random node permutation. We parameterize the node similarity with a noise parameter $E$ , which scales from 0 to 1 (where 1 indicates a random node permutation). For three ranges of the noise level, we measure the quality of the approximation recovered by our IsoRank approximation: a low-error regime where IsoRank itself will recover the matching of the true isomorphism, an intermediate $E$ where exact IsoRank only recovers 80%−90% of the true matching, and a noisy $E$ where exact IsoRank recovers only 70%−75% of the true matching. In each case, we show how close approximate IsoRank comes, as well as demonstrating the considerable savings in CPU time. Next, we return to the biological domain and consider real-world networks from five species: mouse (Mus musculus), rat (Rattus norvegicus), fly (Drosophila melanogaster), baker’s yeast (Saccharomyces cerevisiae), and human (Homo sapiens). In this case, as in the original IsoRank, sequence similarity (as measured by BLAST [Altschul et al., 1990]) becomes the measure of node similarity, where more distant species will typically have less sequence similarity, but we are no longer in the setting where we are trying to recover a permutation of a single network: we do not even have the same number of nodes because of gene duplication and loss. As evolutionary distance increases, there will also be increasing network rewiring, resulting in inserted and deleted edges in each PPI network. Finally, the networks differ because we have only a noisy and incomplete sample of the true PPI networks. Because we of course do not have a ground truth alignment in this setting as we do for the synthetic data, we instead adopt the most commonly reported measures of alignment quality [see Hashemifar and Xu (2014) and Section 3.2 below], including graph-theoretical measures that report edge and subgraph consistency across the alignment, as well as functional coherence measures that report how often aligned proteins are known to be involved in similar biological functions across the two species. While these are the quality measures we adopt here, we note that alternative ways of measuring the quality of biological network alignment appear in El-Kebir et al. (2015), Mamano and Hayes (2017), Neyshabur et al. (2013), and Vijayan et al. (2015).

We additionally compare exact and approximate IsoRank to Hubalign (Hashemifar and Xu, 2014) and FINAL (Zhang and Tong, 2016), where these alternative alignment methods were chosen because (1) they are both in the set of recent well-performing competing network alignment algorithms, and (2) from this set, they are the ones we found that can also be successfully run on a desktop computer (with sufficient memory). While Approximate IsoRank is consistently slightly worse than exact IsoRank on the graph theoretical measures, when compared with Hubalign and FINAL, Approximate IsoRank is often better and, when not, usually a close second. Most interestingly, across the board, we find that Approximate IsoRank performs better than exact IsoRank, according to our functional enrichment measures, and also has better functional enrichment in the majority of experiments than the competing alignment methods. Perhaps this is because Approximate IsoRank is less sensitive to missing and incorrect edges than exact IsoRank.

Finally, we extend our approximation to the IsoRankN algorithm of Liao et al. (2009), which produces a network alignment across multiple species. We discuss the performance of exact and Approximate IsoRankN for cross-species functional orthology clustering.

2. ALGORITHM

2.1. IsoRank

Let us first recall the IsoRank algorithm (Singh et al., 2008) and some basic techniques from linear algebra (Strang, 2003). We consider first the version of IsoRank that only tries to match network topology but does not also incorporate a measure of node–node similarity. The key observation is that IsoRank looks at the tensor product of two graphs, which is essential for developing our approximation algorithm. The relation to the tensor product was not known by the developers of IsoRank but was also previously observed by Zhang et al. (Kazemi and Grossglauser, 2016; Zhang and Tong, 2016; Zhang et al., 2017).

Consider two simply connected, undirected, and possibly weighted graphs $G_{1} = (V_{1}, E_{1})$ and $G_{2} = (V_{2}, E_{2})$ . Their adjacency matrices are $A_{1}$ and $A_{2}$ , respectively, and weighted degree matrices are $D_{1} : = diag (d_{1})$ , $d_{1} : = A_{1} 1_{| V_{1} |}$ , and $D_{2} : = diag (d_{2})$ , $d_{2} : = A_{2} 1_{| V_{2} |}$ , respectively. Thus, the corresponding transition matrices are $P_{1} = A_{1} D_{1}^{- 1} \in ℝ^{| V_{1} | \times | V_{1} |} and P_{2} = A_{2} D_{2}^{- 1} \in ℝ^{| V_{2} | \times | V_{2} |},$ which are both column stochastic, i.e., $1_{| V_{1} |}^{T} P_{1} = 1_{| V_{1} |}^{T}$ and $1_{| V_{2} |}^{T} P_{2} = 1_{| V_{2} |}^{T}$ . $P_{1}$ and $P_{2}$ can be viewed as transition matrices defining the random walk in the PageRank algorithm that underlies the IsoRank algorithm.

Next we consider the tensor product of the two graphs, which is a graph with $| V_{1} | | V_{2} |$ nodes and its adjacency matrix is defined as $A = A_{1} ⨂ A_{2} \in ℝ^{| V_{1} | | V_{2} | \times | V_{1} | | V_{2} |},$ where $⨂$ denotes the Kronecker product (Schafer, 2017). Similarly, the transition matrix of the tensor product graph is $\begin{array}{l} P = P_{1} ⨂ P_{2} = (A_{1} D_{1}^{- 1}) ⨂ (A_{2} D_{2}^{- 1}) \\ = (A_{1} ⨂ A_{2}) (D_{1}^{- 1} ⨂ D_{2}^{- 1}) = (A_{1} ⨂ A_{2}) {(D_{1} ⨂ D_{2})}^{- 1} \\ = A D^{- 1} \in ℝ^{| V_{1} | | V_{2} | \times | V_{1} | | V_{2} |} . \end{array}$ where $D = D_{1} ⨂ D_{2} \in ℝ^{| V_{1} | | V_{2} | \times | V_{1} | | V_{2} |}$ is the degree matrix of the tensor product graph. This is because $\begin{array}{l} d : = A 1 = (A_{1} ⨂ A_{2}) 1_{| V_{1} | | V_{2} |} \\ = (A_{1} ⨂ A_{2}) (1_{| V_{1} |} ⨂ 1_{| V_{2} |}) = (A_{1} 1_{| V_{1} |} ⨂ (A_{2} 1_{| V_{2} |}) \\ = d_{1} ⨂ d_{2}, \end{array}$ and $\begin{array}{l} D : = diag (d) = diag (d_{1} ⨂ d_{2}) = diag (d_{1}) ⨂ diag (d_{2}) \\ = D_{1} ⨂ D_{2} . \end{array}$

As suggested in Singh et al. (2008), the basic version of IsoRank (i.e., without the node–node similarity matrix E) tries to find $R \in ℝ^{| V_{1} | | V_{2} |}$ , ${‖ ‖ R ‖ ‖}_{1} = 1$ , such that $R = P R .$ (1)

This is done by the power method, which is the following iterative procedure, for $k = 0, 1, 2, \dots$ ${\bar{R}}^{k + 1} = P R^{k}, R^{k + 1} = \frac{{\bar{R}}^{k + 1}}{{‖ {\bar{R}}^{k + 1} ‖}_{1}} .$

The initial value $R^{0}$ should be chosen based on data.

A modified version of IsoRank tries to find $R$ , ${∥ R ∥}_{1} = 1$ , such that, for $0 \leq α \leq 1$ $R = (1 - α) P R + α E,$ (2)where $E \in ℝ^{| V_{1} | | V_{2} |}$ , ${‖ ‖ E ‖ ‖}_{1} = 1$ which contains the domain-specific node–node similarity information to improve the graph alignment result. For example, in the original IsoRank article of Singh et al. (2008), besides the two PPI networks of the species, additional sequence similarity data, which are presented by $E \in ℝ^{| V_{1} | \times | V_{2} |}$ , are required in order to perform the graph matching. Here, $E_{i j}$ denotes how similar the amino acid sequence of the $i^{t h}$ protein of $G_{1}$ is to that of the $j^{t h}$ protein of $G_{2}$ . This sequence similarity information is obtained using the standard BLAST sequence alignment (Altschul et al., 1990). Since we use the tensor product notation in our article, we have transformed E to $vec (E) \in ℝ^{| V_{1} | | V_{2} |}$ , where $vec (\cdot)$ converts a matrix to a vector.

As suggested in Singh et al. (2008), (2) is again solved iteratively as follows, for $k = 0, 1, 2, \dots$ ${\bar{R}}^{k + 1} = (1 - α) P R^{k} + α E, R^{k + 1} = \frac{{\bar{R}}^{k + 1}}{{‖ {\bar{R}}^{k + 1} ‖}_{1}} .$ (3)

Other approximation methods can also be used to solve (2) (e.g., Kazemi and Grossglauser, 2016).

Remark 1. Note that, in both iterative procedures, the normalization step is not needed if we have $R^{0} \geq 0$ , $‖ ‖ R^{0} ‖ ‖_{1}$ , $E \geq 0$ , and $‖ ‖ E ‖ ‖_{1} = 1.$ This is because $\begin{array}{l} ‖ {\bar{R}}^{k + 1} ‖_{1} = 1_{| V_{1} | | V_{2} |}^{T} {\bar{R}}^{k + 1} = 1_{| V_{1} | | V_{2} |}^{T} ((1 - α) P R^{k} + α E) \\ = (1 - α) 1_{| V_{1} | | V_{2} |}^{T} P R^{k} + α 1_{| V_{1} | | V_{2} |}^{T} E \\ = (1 - α) 1_{| V_{1} | | V_{2} |}^{T} R^{k} + α 1_{| V_{1} | | V_{2} |}^{T} E \\ = (1 - α) ‖ R^{k} ‖_{1} + α ‖ E ‖_{1} = 1. \end{array}$

2.2. Approximate IsoRank

Let us first consider the basic version of IsoRank (1) without taking into account node similarity, and the key here is to notice that IsoRank is actually working on a tensor product graph of the original two graphs. Therefore, based on the Perron–Frobenius theorem (Smyth, 2002), the basic IsoRank (1) tries to find the right eigenvector of the transition matrix P that corresponds to the unique eigenvalue 1. Therefore, $R$ is nothing but the steady-state distribution (SSD) $π$ of the tensor product graph, which can be computed directly without using the power method at all since IsoRank works on a tensor product graph. Using the properties of the transition matrix P of a simply connected, undirected, and possibly weighted graph, the SSD can be obtained without any iterative procedure as follows; $π : = \frac{d}{d_{total}},$ (4)where $d = D 1_{| V_{1} | | V_{2} |}$ and $d_{total} = d^{T} 1_{| V_{1} | | V_{2} |}$ (which is the weighted total degree). This means, for the basic version IsoRank (1) (without the node-similarity measure), we have $R : = π$ directly, and there is no need of the power method or any iterative procedure, which involves matrix vector multiplications.

Now let us consider the general version of IsoRank (2), which also incorporates the extra sequence similarity information. Note that when $α = 0$ , (2) reduces to (1) and, therefore, $R = π$ . In contrast, when $α = 1$ , we have $R = E$ . Thus, for $0 \leq α \leq 1$ , we can use the following convex combination to get an approximation: $R^{0} = (1 - α) π + α E .$ (5)

Again, there is no iterative procedure and no matrix–vector multiplications at all. We call this approximation represented by Equation 5 either $R^{0}$ or Base $R^{≃}$ . The reason we call this approximation Base $R^{≃}$ is that it can be used either directly or as the initial guess of an iterative procedure for solving (2).

According to (2) and (5), we have $\begin{array}{l} R - R^{0} = (1 - α) P R + α E - (1 - α) π - α E \\ = (1 - α) (P R - π) = (1 - α) (P R - P π) \\ = (1 - α) P (R - π) . \end{array}$

Let us introduce $\tilde{R} = D^{- \frac{1}{2}} R$ , $\tilde{π} = D^{- \frac{1}{2}} π$ , $\tilde{E} = D^{- \frac{1}{2}} E$ , and $\tilde{A} = D^{- \frac{1}{2}} A D^{- \frac{1}{2}}$ .

Then we have $\tilde{R} - {\tilde{R}}^{0} = D^{- \frac{1}{2}} (R - R^{0}) = (1 - α) D^{- \frac{1}{2}} P (R - π) = (1 - α) \tilde{A} (\tilde{R} - \tilde{π}) .$

In contrast, multiplying $D^{- \frac{1}{2}}$ on the left of both sides of (2), we obtain $\tilde{R} = (1 - α) \tilde{A} \tilde{R} + α \tilde{E} .$

This implies $\tilde{R} = α {[I - (1 - α) \tilde{A}]}^{- 1} \tilde{E} .$ (6)

Since $\tilde{A}$ is symmetric and its largest eigenvalue is 1, its eigenvalues are $1 = λ_{1} > λ_{2} \geq λ_{3} \geq \dots \geq λ_{n}$ , $n = | V_{1} | | V_{2} |$ . The corresponding eigenvectors are denoted by $v_{i}$ , $i = 1, \dots, n$ , with $‖ ‖ v_{i} ‖ ‖ = 1$ and $v_{i}^{T} v_{j} = 0$ , $i \neq j$ . It is easy to check that $v_{1} = \sqrt{d_{total}} \tilde{π}$ . Since $\tilde{A}$ is symmetric, $v_{i}$ , $i = 1, \dots, n$ , form an orthonormal basis and, thus $\tilde{E} = \sum_{i = 1}^{n} μ_{i} v_{i}$ with $μ_{i} = v_{i}^{T} \tilde{E}$ . Note that $μ_{1} = v_{1}^{T} \tilde{E} = \sqrt{d_{total}} {[D^{- \frac{1}{2}} \frac{D 1_{n}}{d_{total}}]}^{T} D^{- \frac{1}{2}} E = \frac{1}{\sqrt{d_{total}}} .$

Now we are ready to present the following theorem, which measures the approximation error of $R^{0} =$ Base $R^{≃}$ .

Theorem 1.

Let $R^{0} = (1 - α) π + α E .$ For $0 < α \leq 1$ , we have $‖ R - R^{0} ‖_{D^{- 1}}^{2} \leq α^{2} (1 - \frac{1}{d_{total}}) {[\frac{(1 - α) λ_{2}}{1 - (1 - α) λ_{2}}]}^{2} .$ where ${‖ x ‖}_{D^{- 1}}^{2} : = \sqrt{x^{T} D^{- 1} x}$ .

Proof: Based on (6), for $0 < α \leq 1$ , $\begin{array}{l} \tilde{R} = α {[I - (1 - α) \tilde{A}]}^{- 1} [\sum_{i = 1}^{n} μ_{i} v_{i}] \\ = \sum_{i = 1}^{n} \frac{α μ_{i}}{1 - (1 - α) λ_{i}} v_{i} \\ = μ_{1} v_{1} + \sum_{i = 2}^{n} \frac{α μ_{i}}{1 - (1 - α) λ_{i}} v_{i} \\ = \tilde{π} + \sum_{i = 2}^{n} \frac{α μ_{i}}{1 - (1 - α) λ_{i}} v_{i} . \end{array}$

This means, $\begin{array}{l} \tilde{R} - {\tilde{R}}^{0} = (1 - α) \tilde{A} (\tilde{R} - \tilde{π}) \\ = (1 - α) \tilde{A} [\sum_{i = 2}^{n} \frac{α μ_{i}}{1 - (1 - α) λ_{i}} v_{i}] \\ = α \sum_{i = 2}^{n} \frac{(1 - α) λ_{i}}{1 - (1 - α) λ_{i}} μ_{i} v_{i} . \end{array}$

Therefore, $‖ R - R^{0} ‖_{D^{- 1}}^{2} = ‖ \tilde{R} - {\tilde{R}}^{0} ‖^{2} = α^{2} \sum_{i = 2}^{n} {[\frac{(1 - α) λ_{i}}{1 - (1 - α) λ_{i}}]}^{2} μ_{i}^{2} \leq α^{2} {[\frac{(1 - α) λ_{2}}{1 - (1 - α) λ_{2}}]}^{2} \sum_{i = 2}^{n} μ_{i}^{2} = α^{2} {[\frac{(1 - α) λ_{2}}{1 - (1 - α) λ_{2}}]}^{2} (1 - μ_{1}^{2}) .$

This completes the proof.

As we mentioned, if we are not satisfied with Base $R^{≃}$ , we can use it as an initial guess and perform power iterations. In practice, since it provides a good initial guess, we found that usually only one step of power iteration is needed, which provides another approximation $R^{1}$ as follows: $R^{1} = (1 - α) P R^{0} + α E .$ (7)

Because this is the result of performing only one linear step, we also refer to this approximation of Equation (7) as Line $R^{≃}$ . Note that, due to the fact that $P = P_{1} ⨂ P_{2}$ , the matrix–vector multiplication step $y = P x$ can be computed without explicitly forming P. We can use the following equivalent expression to compute $y$ : $y \leftarrow vec (P_{2} {vec}^{- 1} (x) P_{1}^{T}) .$ where ${vec}^{- 1} (\cdot)$ is the inverse operation of $vec (\cdot)$ .

2.3. Computational complexity

To determine the computational cost of using the iterative method (3) to solve IsoRank (2), we note that our approximations Base $R^{≃}$ and Line $R^{≃}$ mainly depend on the matrix–vector multiplication for computing $P x$ in the iteration (if the implementation uses the tensor product, then this step becomes matrix–matrix multiplication) and vector addition for computing $a x + b y$ , $a, b \in ℝ$ . Let us assume the complexity for matrix–vector multiplication (or matrix–matrix multiplication in the tensor product implementation) is $t_{1}$ and the complexity for vector addition is $t_{2}$ . Then the complexity of using the iterative method (3) to solve IsoRank (2) is $k (t_{1} + t_{2}),$ where k is the number of iterations used in the implementation. Using the standard convergence analysis of (3), at least $k \geq \frac{\log ε}{\log (1 - α) λ_{2}}$ steps are required to reach a given relative error tolerance $ε$ (measured in ${‖ ‖ \cdot ‖ ‖}_{D^{- 1}}$ norm). Here, $λ_{2}$ is the second largest eigenvalue of P, which happens to be the second largest eigenvalue of $\tilde{A}$ since P and $\tilde{A}$ are similar.

For our approximation $R^{0}$ = Base $R^{≃}$ , since we only use vector addition once, the complexity is $t_{2} .$ Finally, the approximation Line $R^{≃}$ does one step of iteration using $R^{0}$ as the initial guess. Thus, its computational cost is $t_{2} + (t_{1} + t_{2}) = t_{1} + 2 t_{2}$ .

As we can see, our approximation’s main computational advantage is that there is no need for iterations. Consider the worst case that $λ_{2} \approx 1$ , let us choose $α = 0.6$ as used in our experiments. The iterative method (3) needs about $k = 30$ iterations to achieve $ε = 10^{- 12}$ . This implies that our approximations are about 30 times faster than iterating to convergence here. Of course, the computational saving will depend on specific implementations and computers.

3. EXPERIMENTS

3.1. Synthetic networks

Our first set of experiments uses the Erdős–Rényi (ER) random graphs (Erdős and Rényi, 1960) and the Barabási–Albert (BA) random graphs (Albert and Barabási, 2002). For the ER graphs, to ensure with high probability that the resulting random graphs are connected, we set the probability that two nodes are connected to be $\frac{4 \log n}{n}$ . For the BA graphs, we use an initial seed graph consisting of five nodes and edge set ${(1, 2), (1, 5), (2, 4), (3, 4)}$ , forming a path. Each new node is then connected to exactly five existing nodes with a probability proportional to the number of edges the existing nodes already have.

To apply and test the IsoRank algorithm (2) and our two approximation algorithms, i.e., Base $R^{≃}$ and Line $R^{≃}$ , we first generate a simply connected random graph and then randomly permute the nodes to obtain another random graph. As suggested in Singh et al. (2008), we choose $α = 0.6$ in (2) for all the tests. In addition, we generate different $E$ by adding a different level of random noise to the edge weights of the correct random permutation. Starting with the correct random permutation, whose values are either 0 or 1, for a “good” $E$ , we add random noise between $[- 0.5, 0.5]$ . For a “bad” $E$ , we add random noise between $[- 0.7, 0.7]$ . When we add random noise between $[- 0.6, 0.6]$ , we refer to the resulting $E$ as “intermediate.” Once the random noise is added, we take the absolute value component-wise and then normalize to ensure that ${‖ ‖ E ‖ ‖}_{1} = 1$ and all the components are nonnegative. Intuitively, the “good” $E$ provides more information about the correct alignment, resulting in better accuracy than other choices of $E$ . Finally, since exact IsoRank(2) is solved by an iterative procedure, we stop the iteration when the Frobenius norm (denoted by ${‖ ‖}_{F}$ , and equal to the square root of the summed squares of absolute values of the elements of A) of the difference between two successive iterations is less than $10^{- 12}$ or the maximal number of iterations, which is set to be 100, is reached. In all our experiments, it took less than 20 iterations to converge.

We report the accuracy, CPU time, $‖ ‖ \cdot ‖ ‖$ error, and ${‖ ‖ \cdot ‖ ‖}_{D^{- 1}}$ error. The accuracy of the alignment is measured in terms of how many pairs of nodes between the original network and permuted network are predicted correctly (in percentage). Since we implement all the algorithms in Matlab, the CPU time is measured in seconds using the Matlab built-in functions tic and toc. Finally, we use the standard $‖ ‖ \cdot ‖ ‖$ norm to measure the error between the exact IsoRank result and its approximations. To verify the theory, we also look at the error using the weighted ${‖ ‖ \cdot ‖ ‖}_{D^{- 1}}$ norm. The numerical results are reported in Tables 1 and 2. Note that both exact IsoRank and its approximations recover the correct alignment in the “good” $E$ regime, but even exact IsoRank does not with more noise in the $E$ node-similarity measure. However, both Approximate IsoRank approximations are very close to exact IsoRank in both norms and have much faster running times.

Table 1.
Comparison of Exact and Approximate IsoRank on Erdős–Rényi (ER) Graphs

Good $E$ Intermediate $E$ Bad $E$

Acc. CPU ${‖ \cdot ‖}^{(err)}$ ${‖ \cdot ‖}_{D^{- 1}}^{(err)}$ Acc CPU ${‖ \cdot ‖}^{(err)}$ ${‖ \cdot ‖}_{D^{- 1}}^{(err)}$ Acc CPU ${‖ \cdot ‖}^{(err)}$ ${‖ \cdot ‖}_{D^{- 1}}^{(err)}$

$| V | = 2, 500$ , $| E | = 39, 053$ $| V | = 2, 500$ , $| E | = 38, 802$ $| V | = 2, 500$ , $| E | = 39, 424$

$R$ 100% 1.44 — — 83.9% 1.41 — — 71.6% 1.53 — —

$R^{0}$ 100% 0.19 4.92e-6 1.55e-7 83.3% 0.19 4.90e-6 1.56e-7 71.1% 0.20 5.05e-6 1.57e-7

$R^{1}$ 100% 0.31 4.54e-7 1.45e-8 83.8% 0.29 4.58e-7 1.46e-8 71.6% 0.32 4.66e-7 1.47e-8

$| V | = 5, 000$ , $| E | = 85, 630$ $| V | = 5, 000$ , $| E | = 84, 850$ $| V | = 5, 000$ , $| E | = 85, 248$

$R$ 100% 8.21 — — 83.6% 7.88 — — 71.2% 8.36 — —

$R^{0}$ 100% 0.83 2.25e-6 6.49e-8 83.0% 0.83 2.31e-6 6.70e-8 70.7% 0.84 2.29e-6 6.63e-8

$R^{1}$ 100% 1.45 1.99e-7 5.80e-9 83.6% 1.43 2.08e-7 6.08e-9 71.2% 1.45 2.05e-7 5.97e-9

$| V | = 10, 000$ , $| E | = 184, 460$ $| V | = 10, 000$ , $| E | = 183, 929$ $| V | = 10, 000$ , $| E | = 184, 265$

$R$ 100% 42.56 — — 83.1% 45.66 — — 71.7% 44.26 — —

$R^{0}$ 100% 2.98 1.04e-6 2.78e-8 82.4% 3.27 1.05e-6 2.82e-8 71.1% 3.12 1.03e-6 2.77e-8

$R^{1}$ 100% 6.33 8.81e-8 2.38e-9 83.1% 6.62 8.97e-8 2.43e-9 71.7% 6.47 8.78e-8 2.37e-9

	Good $E$	Intermediate $E$	Bad $E$
	$\| V \| = 2, 500$ , $\| E \| = 39, 053$	$\| V \| = 2, 500$ , $\| E \| = 38, 802$	$\| V \| = 2, 500$ , $\| E \| = 39, 424$
$R$	100%	1.44	—	—	83.9%	1.41	—	—	71.6%	1.53	—	—
$R^{0}$	100%	0.19	4.92e-6	1.55e-7	83.3%	0.19	4.90e-6	1.56e-7	71.1%	0.20	5.05e-6	1.57e-7
$R^{1}$	100%	0.31	4.54e-7	1.45e-8	83.8%	0.29	4.58e-7	1.46e-8	71.6%	0.32	4.66e-7	1.47e-8
	$\| V \| = 5, 000$ , $\| E \| = 85, 630$	$\| V \| = 5, 000$ , $\| E \| = 84, 850$	$\| V \| = 5, 000$ , $\| E \| = 85, 248$
$R$	100%	8.21	—	—	83.6%	7.88	—	—	71.2%	8.36	—	—
$R^{0}$	100%	0.83	2.25e-6	6.49e-8	83.0%	0.83	2.31e-6	6.70e-8	70.7%	0.84	2.29e-6	6.63e-8
$R^{1}$	100%	1.45	1.99e-7	5.80e-9	83.6%	1.43	2.08e-7	6.08e-9	71.2%	1.45	2.05e-7	5.97e-9
	$\| V \| = 10, 000$ , $\| E \| = 184, 460$	$\| V \| = 10, 000$ , $\| E \| = 183, 929$	$\| V \| = 10, 000$ , $\| E \| = 184, 265$
$R$	100%	42.56	—	—	83.1%	45.66	—	—	71.7%	44.26	—	—
$R^{0}$	100%	2.98	1.04e-6	2.78e-8	82.4%	3.27	1.05e-6	2.82e-8	71.1%	3.12	1.03e-6	2.77e-8
$R^{1}$	100%	6.33	8.81e-8	2.38e-9	83.1%	6.62	8.97e-8	2.43e-9	71.7%	6.47	8.78e-8	2.37e-9

R = Exact IsoRank; $R^{0}$ = Base $R^{≃}$ ; $R^{1}$ =Line $R^{≃}$ . Small, medium, and large ER graphs are generated with good, intermediate, and bad noise measures $E$ of pairwise node similarity.

Table 2.

Comparison of Exact and Approximate IsoRank on Barabási–Albert (BA) Graphs

	Good $E$				Intermediate $E$				Bad $E$
	Acc.	CPU	${‖ \cdot ‖}^{(err)}$	${‖ \cdot ‖}_{D^{- 1}}^{(err)}$	Acc	CPU	${‖ \cdot ‖}^{(err)}$	${‖ \cdot ‖}_{D^{- 1}}^{(err)}$	Acc	CPU	${‖ \cdot ‖}^{(err)}$	${‖ \cdot ‖}_{D^{- 1}}^{(err)}$
	$\| V \| = 2, 500$ , $\| E \| = 12, 473$				$\| V \| = 2, 500$ , $\| E \| = 12, 465$				$\| V \| = 2, 500$ , $\| E \| = 12, 475$
$R$	100%	3.37	—	—	87.3%	3.22	—	—	75.3%	3.27	—	—
$R^{0}$	100%	0.21	4.69e-5	2.66e-6	84.9%	0.19	4.41e-5	2.57e-6	73.2%	0.22	4.40e-5	2.63e-6
$R^{1}$	100%	0.38	1.19e-5	4.28e-7	87.2%	0.35	1.14e-5	4.04e-7	75.3%	0.40	1.07e-5	4.18e-7
	$\| V \| = 5, 000$ , $\| E \| = 24, 975$				$\| V \| = 5, 000$ , $\| E \| = 24, 976$				$\| V \| = 5, 000$ , $\| E \| = 24, 974$
$R$	100%	13.81	—	—	86.9%	14.03	—	—	72.9%	14.27	—	—
$R^{0}$	100%	0.84	2.41e-5	1.32e-6	83.9%	0.83	2.37e-5	1.33e-6	71.0%	0.94	2.27e-5	1.31e-6
$R^{1}$	100%	1.57	6.77e-6	2.10e-7	86.8%	1.57	6.39e-6	2.11e-7	73.0%	1.81	6.09e-6	2.08e-7
	$\| V \| = 10, 000$ , $\| E \| = 49, 974$				$\| V \| = 10, 000$ , $\| E \| = 49, 972$				$\| V \| = 10, 000$ , $\| E \| = 49, 973$
$R$	100%	38.85	—	—	86.6%	37.43	—	—	73.2%	38.62	—	—
$R^{0}$	100%	3.15	1.21e-5	6.60e-7	84.1%	3.08	1.22e-5	6.55e-7	71.1%	3.15	1.30e-5	6.63e-7
$R^{1}$	100%	5.18	3.62e-6	1.05e-7	86.5%	5.11	3.48e-6	1.04e-7	73.0%	5.17	3.72e-6	1.06e-7

R = Exact IsoRank; $R^{0}$ = Base $R^{≃}$ ; $R^{1}$ =Line $R^{≃}$ . Small, medium, and large BA graphs are generated with good, intermediate, and bad noise measures $E$ of pairwise node similarity.

3.2. Quality measures for global network alignments

On synthetic networks, we had the ground truth alignment and could measure performance relative to how successful each algorithm was in recovering this alignment. For real-world networks, there is no ground truth available, and other measurements of alignment quality must be chosen in order to quantify the quality of the alignment. Following Hamp et al. (2013), we used edge correctness (EC), largest common connected subgraph (LCCS), and average functional similarity (AFS) to measure the quality of the alignments. The first two are graph-theoretical measures that compare the topological properties of neighborhoods of nodes that are matched, while the third also measures the biological coherence of the alignments using the cellular functions defined by the gene ontology (GO) (GO Consortium, 2004). Details of these measures are given in the following paragraphs.

3.2.1. Edge correctness

Let $(V_{1}, E_{1})$ and $(V_{2}, E_{2})$ represent two networks, with $| V_{1} | \leq | V_{2} |$ and let $g : V_{1} \to V_{2}$ be a one-to-one mapping between the vertices of the two networks. For an edge $(u, v) \in E_{1}$ , we call $(g (u), g (v)) \in V_{2} \times V_{2}$ to be a translation of $(u, v)$ from $G_{1}$ to $G_{2}$ . Then, the metric EC, which measures the percentage of the edge-translation of edges in $E_{1}$ that is an actual edge in $E_{2}$ , is mathematically described as $E C = \frac{| (u, v) \in E_{1} : (g (u), g (v)) \in E_{2} |}{| E_{1} |}$ (8)

Since PPI connections are (weakly) preserved during evolution, higher EC values imply better protein pairings.

3.2.2. Largest common connected subgraph

Given two graphs, $G_{1} = (V_{1}, E_{1}), G_{2} = (V_{2}, E_{2})$ , with $| V_{1} | \leq | V_{2} |$ , and a one-to-one mapping $g : V_{1} \to V_{2}$ , let E denote the set of edges of $E_{1}$ that map to edges in $E_{2}$ using g. Then, $LCCS (g; G_{1}, G_{2})$ denotes the number of edges in the largest connected component of E. Like EC, LCCS also measures the number of edges that are preserved across the species by the mapping g.

3.2.3. Average functional similarity

The two previous metrics calculate how well the protein mapping preserves the structural features of the PPI networks. Since important biological functional pathways are also broadly preserved during evolution, we also expect a good mapping to strongly preserve protein function across species. Following Hashemifar and Xu (2014), we also measure the AFS across species, measuring the similarity of sets of Gene Ontology Consortium (2004) (GO) terms assigned to matched nodes. Recall that the GO is a hierarchical and species-agnostic directed acyclic graph representation of protein functional labels (i.e., GO terms) composed of three domains (molecular function [MF], biological process [BP], and cellular component [CC]), each with a separate root. The GO taxonomy is hierarchical, so the GO terms closer to the roots represent more general protein functions, which become increasingly specialized as we move farther away from the roots. All nodes are associated with their known GO terms as collected in the GO database (GO Consortium (2004)). For each protein mapping $(p, g (p))$ , we use Schlicker’s similarity (i.e., $s_{c} (p, g (p)), c \in$ {MF, BP, CC}) (Hashemifar and Xu, 2014; Schlicker et al., 2006) to find the similarity of GO terms of p with that of $g (p)$ . So, the $(A F S_{c})$ is computed as $A F S_{c} = \frac{1}{| g |} \sum_{u \in dom (g)} s_{c} (u, g (u))$ (9)where |g| represents the size of one-to-one protein mappings in g (if the alignment is global, $| g | = \max (| V_{1} |, | V_{2} |)$ ). Mappings with higher AFS based on the known functional labels are preferred. We depart from Hashemifar and Xu (2014) to match recommended best practice in the field (Devkota et al., 2022) and enforce some specificity of protein function by including only the GO labels whose shortest distance from a root is $\geq$ 5 in our computation of the AFS measure.

3.3. Biological networks

We extracted PPI networks from the IntAct database (Hermjakob et al., 2004); these networks are constructed using all physical interactions detected using coimmunoprecipitation (Anderson, 1998) and yeast 2-hybrid (Kohalmi et al., 1998) experiments. We generated the networks for five species: S = {Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus, Rattus norvegicus}. We obtain pairwise node similarity scores from the blastp sequence similarity bit scores (assigning a 0% similarity if blastp does not align the sequences) and, as above and as recommended by Singh et al. (2008), set the $α$ parameter that trades off between network and node similarity to be 0.6. The graph properties of these constructed PPI networks are provided in Table 3.

Table 3.
Graph Properties of IntAct Networks

Network #nodes #edges Avg degree Avg cc

human 32431 301529 18.595 0.069

fly 11247 42555 7.567 0.027

mouse 13967 36541 5.232 0.051

rat 10792 23315 4.320 0.075

yeast 6478 70638 21.808 0.285

Network	#nodes	#edges	Avg degree	Avg cc
human	32431	301529	18.595	0.069
fly	11247	42555	7.567	0.027
mouse	13967	36541	5.232	0.051
rat	10792	23315	4.320	0.075
yeast	6478	70638	21.808	0.285

Table includes information regarding the number of nodes, number of edges, average node degree, and average clustering coefficient for human, fly, mouse, rat, and yeast networks.

For the 10 distinct pairwise combinations of these five species, for each pair of corresponding PPI networks, we performed experiments using the true IsoRank similarity matrix R and the two approximations $R^{0} =$ Base $R^{≃}$ and $R^{1}$ = Line $R^{≃}$ . Finally, for each obtained similarity matrix $R, R^{0}$ and $R^{1}$ , we used the greedy matching algorithm to find the top 2000 protein mappings $M, M_{0}$ , and $M_{1}$ , respectively.

3.4. Graph theoretical results on biological networks

In order to see how the IsoRank approximations compare with the exact IsoRank mappings, we performed a similar analysis to that done in Section 3.1. For a given pair of species $(A, B), A \neq B$ , let $R (α; A, B)$ be the true IsoRank similarity matrix, and let $R^{0} (α; A, B)$ and $R^{1} (α; A, B)$ be the approximations Base $R^{≃}$ and Line $R^{≃}$ described in (5) and (7), respectively, for a given $α$ -value. After using the greedy matching algorithm on $R (α; A, B)$ , $R^{0} (α; A, B)$ , and $R^{1} (α; A, B)$ , let $M (α; A, B)$ , $M_{0} (α; A, B)$ , and $M_{1} (α; A, B)$ be the corresponding list of 2000 top one-to-one cross-species protein matches for the species pair $(A, B)$ in Exact IsoRank and Approximate IsoRank versions Base $R^{≃}$ and Line $R^{≃}$ . We generated results for each $α \in {0.2, 0.4, 0.6, 0.8}$ and $A, B \in S$ ; results were remarkably stable across different $α$ values; therefore, we present only the results with recommended $α = 0.6$ (Singh et al., 2008). We also computed the EC and LCCS scores for the matchings. Results appear in Table 4.

Table 4.
L2 Approximation Loss and Matching Similarity for R = Exact IsoRank, as Compared with Approximations $R^{0} =$ Base $R^{≃}$ and $R^{1} =$ Line $R^{≃}$

Orgs Norms Matching sim. Edge cor. LCCS

N1 N2 ${‖ R - R^{0} ‖}_{2}$ ${‖ R - R^{1} ‖}_{2}$ $Sim (R, R^{0})$ $Sim (R, R^{1})$ $R^{0}$ $R^{1}$ R $R^{0}$ $R^{1}$ R

F B 0.000947 0.000088 0.507 0.810 0.074 0.178 0.185 695 825 808

F M 0.001208 0.000121 0.810 0.941 0.028 0.073 0.072 15 23 23

F R 0.003482 0.000335 0.378 0.516 0.015 0.037 0.051 53 42 77

H B 0.000371 0.000032 0.976 0.991 0.213 0.216 0.216 1174 1207 1202

H F 0.000483 0.000032 0.973 0.994 0.063 0.065 0.064 21 132 134

H M 0.000571 0.000059 0.986 0.998 0.086 0.093 0.093 158 264 264

H R 0.001522 0.000158 0.903 0.982 0.018 0.040 0.041 76 408 388

M B 0.001115 0.000178 0.780 0.941 0.089 0.193 0.191 389 586 581

M R 0.005213 0.000599 0.779 0.940 0.041 0.129 0.128 57 373 363

R B 0.004086 0.000525 0.353 0.469 0.107 0.135 0.393 426 679 1135

Orgs	Norms	Matching sim.	Edge cor.	LCCS
F	B	0.000947	0.000088	0.507	0.810	0.074	0.178	0.185	695	825	808
F	M	0.001208	0.000121	0.810	0.941	0.028	0.073	0.072	15	23	23
F	R	0.003482	0.000335	0.378	0.516	0.015	0.037	0.051	53	42	77
H	B	0.000371	0.000032	0.976	0.991	0.213	0.216	0.216	1174	1207	1202
H	F	0.000483	0.000032	0.973	0.994	0.063	0.065	0.064	21	132	134
H	M	0.000571	0.000059	0.986	0.998	0.086	0.093	0.093	158	264	264
H	R	0.001522	0.000158	0.903	0.982	0.018	0.040	0.041	76	408	388
M	B	0.001115	0.000178	0.780	0.941	0.089	0.193	0.191	389	586	581
M	R	0.005213	0.000599	0.779	0.940	0.041	0.129	0.128	57	373	363
R	B	0.004086	0.000525	0.353	0.469	0.107	0.135	0.393	426	679	1135

Edge correctness (EC) and largest common connected subgraph (LCCS) similarities for each of R, $R^{0}$ and $R^{1}$ . F = fly network, B = Baker’s yeast network, H = human network, M = mouse network, R = rat network. $α = 0.6$ for all experiments.

Across all species pairs, we find that the approximation loss of approximate IsoRank as compared with exact IsoRank is always $< .0055$ , and as expected lower for $R^{1} =$ Line $R^{≃}$ than $R^{0} =$ Base $R^{≃}$ . Similarly, in all species pairs, the matching similarity of the top 2000 nodes to exact IsoRank is superior for Line $R^{≃}$ as compared with Base $R^{≃}$ . The disparity is most pronounced in the fly-yeast pair, where the difference is $> 0.30$ . However, for pairs where one of the organisms is human, there is relatively minor difference between the mapping produced by the two approximations, and the mappings themselves are closer to the true mapping.

Table 4 results also show that the best EC and LCCS values always come from either the true IsoRank R or its $R^{1} =$ Line $R^{≃}$ approximation. Out of the 10 cross-species experiments tested for $α = 0.6$ , R and $R^{1}$ both produced the best EC scores in 4 settings each (the remaining two being ties). For LCCS, Line $R^{≃}$ produced better LCCS scores more often (five times) than exact IsoRank (three times). This demonstrates that in a realistic biological setting, performing just a single iteration of our approximation scheme is enough to retain, and sometimes even enrich, the network-theoretical properties of the true IsoRank mappings in biological networks.

3.5. Functional similarity results on biological networks

We used the AFS measure, described above, to measure how functionally similar the protein pairs mapped by IsoRank and its approximations are. The AFS results are provided in Table 5. Since the AFS results were not sensitive to $α$ -values (fixing GO category, species pairs and type of IsoRank, and the expected s.d. obtained by setting $α = {0.2, 0.4, 0.6, 0.8}$ was $\approx 0.0015$ ), we report $α = 0.6$ in Table 5 and present the obtained AFS scores across all GO categories and species pairs.

Table 5.
Average Functional Similarity Values for Top 2000 Matching-Obtained Approximations ( $R^{0}$ = Base $R^{≃}$ and $R^{1}$ = Line $R^{≃}$ ) and True IsoRank ( $R$ ) Matrices for All Three Gene Ontology Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”), with $α = 0.6$

MF BP CC

N1 N2 $R^{0}$ $R^{1}$ $R$ $R^{0}$ $R^{1}$ $R$ $R^{0}$ $R^{1}$ $R$

F B 0.513 0.515 0.517 0.261 0.266 0.268 0.539 0.547 0.550

F M 0.574 0.573 0.570 0.268 0.270 0.269 0.579 0.586 0.583

F R 0.495 0.429 0.391 0.190 0.170 0.154 0.429 0.399 0.369

H B 0.595 0.594 0.595 0.362 0.362 0.361 0.503 0.503 0.502

H F 0.630 0.627 0.628 0.335 0.332 0.332 0.585 0.582 0.581

H M 0.661 0.659 0.659 0.348 0.346 0.345 0.575 0.574 0.573

H R 0.622 0.621 0.621 0.308 0.308 0.308 0.464 0.464 0.464

M B 0.515 0.523 0.521 0.265 0.273 0.274 0.526 0.530 0.529

M R 0.613 0.600 0.600 0.297 0.293 0.292 0.536 0.530 0.530

R B 0.477 0.426 0.360 0.216 0.188 0.159 0.409 0.377 0.345

		MF	BP	CC
F	B	0.513	0.515	0.517	0.261	0.266	0.268	0.539	0.547	0.550
F	M	0.574	0.573	0.570	0.268	0.270	0.269	0.579	0.586	0.583
F	R	0.495	0.429	0.391	0.190	0.170	0.154	0.429	0.399	0.369
H	B	0.595	0.594	0.595	0.362	0.362	0.361	0.503	0.503	0.502
H	F	0.630	0.627	0.628	0.335	0.332	0.332	0.585	0.582	0.581
H	M	0.661	0.659	0.659	0.348	0.346	0.345	0.575	0.574	0.573
H	R	0.622	0.621	0.621	0.308	0.308	0.308	0.464	0.464	0.464
M	B	0.515	0.523	0.521	0.265	0.273	0.274	0.526	0.530	0.529
M	R	0.613	0.600	0.600	0.297	0.293	0.292	0.536	0.530	0.530
R	B	0.477	0.426	0.360	0.216	0.188	0.159	0.409	0.377	0.345

The columns “N1” and “N2” represent IntAct networks for the different pairs of organisms. Best performing results are in bold.

H, human; M, mouse; R, rat; F, fly; B, Baker’s yeast.

The results are in stark contrast to the results obtained from the graph-theoretical measures. Unlike EC and LCCS outputs, the mapping obtained from the Base $R^{≃}$ approximation (i.e., $R^{0}$ ) gave the highest AFS scores in the majority of cases across all GO categories (though in most cases by a slight margin). So, for the biological function measures, it seems that the Approximate IsoRank is superior to the exact IsoRank. Even in cases where it did not produce the highest-scoring one-to-one mapping, it closely followed the best scoring algorithm. This implies that if the downstream application of interest is using a better-annotated species to assign GO functional labels in a less well-annotated species, perhaps because it manages to somewhat denoise the network (see Section 5), the simple approximate IsoRank Base $R^{≃}$ may be superior to the original IsoRank for this task.

3.6. Competing methods

We further evaluated our Base $R^{≃}$ approximation by comparing its EC, LCCS, and AFS scores against the results from two recent network alignment methods: fast attributed network alignment (FINAL) (Zhang and Tong, 2016) and Hubalign (Hashemifar and Xu, 2014). These methods were chosen among the many competing global network alignment methods because they both have been shown to produce good alignments, and they are relatively fast. The results appear in Tables 6 and 7. FINAL takes in two networks (and optional edge/node attributes, where we use the node pairwise similarity matrix as in the competing methods) and finds an internetwork pairwise alignment matrix that solves a topology-sensitive optimization problem. It generates the one-to-one mappings by performing greedy alignment on this optimized matrix. For our experiments, we generated the top 2000 FINAL 1–1 node alignments.

Table 6.
EC and LCCS Results for Top 2000 Matching Obtained from Approximate IsoRank ( $R^{0}$ , $α = 0.6$ ) and Two Competing Methods (FINAL and Hubalign), for All Gene Ontology Hierarchies

N1 N2 $R^{0}$ FINAL Hubalign

(a) EC

F B 0.0740 0.0996 0.0301

F M 0.0284 0.0450 0.0166

F R 0.0150 0.0062 0.0033

H B 0.2137 0.0969 0.0408

H F 0.0638 0.0197 0.0061

H M 0.0862 0.0292 0.0194

H R 0.0189 0.0101 0.0037

M B 0.0896 0.1256 0.0443

M R 0.0413 0.0658 0.0352

R B 0.1077 0.0514 0.0156

(b) LCCS

F B 695 14 38

F M 15 6 5

F R 53 3 1

H B 1174 1155 1481

H F 21 27 35

H M 158 401 836

H R 76 29 26

M B 389 31 51

M R 57 57 55

R B 426 22 23

N1	N2	$R^{0}$	FINAL	Hubalign
(a) EC
F	B	0.0740	0.0996	0.0301
F	M	0.0284	0.0450	0.0166
F	R	0.0150	0.0062	0.0033
H	B	0.2137	0.0969	0.0408
H	F	0.0638	0.0197	0.0061
H	M	0.0862	0.0292	0.0194
H	R	0.0189	0.0101	0.0037
M	B	0.0896	0.1256	0.0443
M	R	0.0413	0.0658	0.0352
R	B	0.1077	0.0514	0.0156
(b) LCCS
F	B	695	14	38
F	M	15	6	5
F	R	53	3	1
H	B	1174	1155	1481
H	F	21	27	35
H	M	158	401	836
H	R	76	29	26
M	B	389	31	51
M	R	57	57	55
R	B	426	22	23

The columns “N1” and “N2” represent IntAct networks for organisms: Human (H), Mouse (M), Rat (R), Fly (F) and Baker’s yeast (B). Best performing results are in bold.

EC, edge correctness; LCCS, largest common connected subgraph.

Table 7.

Average Functional Similarity Results for Top 2000 Matching Obtained from Approximate IsoRank ( $R^{0}$ , $α = 0.6$ ) and Two Competing Methods (FINAL and Hubalign) for MF, BP, and CC Gene Ontology Hierarchies

		MF			BP			CC
N1	N2	$R^{0}$	FINAL	Hub.	$R^{0}$	FINAL	Hub.	$R^{0}$	FINAL	Hub.
F	B	0.513	0.537	0.521	0.261	0.287	0.277	0.539	0.545	0.538
F	M	0.574	0.593	0.536	0.268	0.279	0.246	0.579	0.598	0.557
F	R	0.495	0.482	0.478	0.190	0.185	0.186	0.429	0.419	0.420
H	B	0.595	0.557	0.521	0.362	0.329	0.295	0.503	0.486	0.464
H	F	0.631	0.606	0.513	0.334	0.318	0.248	0.585	0.562	0.487
H	M	0.661	0.647	0.614	0.348	0.344	0.308	0.575	0.581	0.546
H	R	0.622	0.588	0.579	0.308	0.282	0.271	0.464	0.436	0.437
M	B	0.515	0.552	0.520	0.265	0.288	0.269	0.526	0.542	0.523
M	R	0.613	0.614	0.614	0.297	0.298	0.301	0.537	0.536	0.539
R	B	0.477	0.472	0.463	0.216	0.209	0.207	0.409	0.402	0.400

The columns “N1” and “N2” represent IntAct networks for the different pairs of organisms: Human (H), Mouse (M), Rat (R), Fly (F) and Baker’s yeast (B). Best performing results are in bold.

MF, Molecular Function; BP; Biological Process; CC, Cellular Component.

Hubalign operates by running a minimum degree heuristic algorithm to estimate the topological and functional importance of proteins and comparing it with proteins from another network to form alignments. Similar to FINAL and Base $R^{≃}$ , we wanted to obtain the top 2000 one-to-one Hublign mappings, but their current C++ implementation can only return the complete global alignment as output. To solve this issue, we reimplemented the Hubalign code from scratch. Furthermore, we made an additional change to the alignment implementation, where we postponed the selection of relatively “weak” alignment between single degree nodes by greedily selecting the stronger node mappings first. We made this change primarily to make the Hubalign results computationally feasible, but we tested our alternative Hubalign with this postponement against the original Hubalign on the smallest two networks (aligning fly and yeast), and, as expected, since we are preferring stronger node mappings, all three reported metrics, EC, LCCS, and AFS, got substantially better with our change. The python source code of our Hubalign implementation is freely available in our Github repository.

4. EXTENDING APPROXIMATE ISORANK TO MORE THAN TWO SPECIES

An extension of IsoRank, called IsoRankN (or IsoRank-Nibble) (Liao et al., 2009), has been developed to perform network alignment across multiple species. However, unlike in the two-species setting, IsoRankN does not operate by generating similarity scores between every possible node configuration, as that would result in the formation of a higher-order tensor instead of a similarity matrix when the species count is > 2. IsoRankN instead uses the collection of PPIs to produce functionally enriched cross-species protein clusters, where the functional similarity between proteins within each cluster is informed by the graph-spectral properties of the networks and the sequence relationships between the proteins across species.

Given K PPI networks $G_{1}, G_{2}, \dots, G_{K}$ , the sequence similarity matrices $E_{i j}, i, j \in {1, \dots, K}$ between every network pair, the threshold parameter β, IsoRank weight $α$ , and the PageRank-Nibble constant $γ$ , the IsoRankN cross-species clustering is performed in the following way.

1.
Run the original IsoRank on every pair of networks to obtain scores $R_{i j}$ on all edges of the functional similarity graph.
2.
For every protein v, compute the star $S_{v} = {v_{j} \in N (v) | w (v, v_{j}) \geq β \max_{j} (w (v, v_{j}))}$ , where $N (v)$ is the neighborhood of v in the functional similarity graph.
3.
Pick an arbitrary remaining PPI network $G_{l}$ and order the proteins $v \in G_{l}$ by the sum of edge weights in the induced graph on $S_{v}$ . Following the ordering and excluding proteins already assigned to clusters, spectrally partition $S_{v}$ to obtain $S_{v}^{}$ .
4.
Merge every pair of clusters $S_{v_{1}}^{}$ and $S_{v_{2}}^{}$ in which $\forall v_{i} \in S_{v_{2}}^{} ∖ {v_{2}}, w (v_{1}, v_{i}) \geq β \max_{j} (w (v_{1}, v_{j}))$ and $\forall v_{j} \in S_{v_{1}}^{*} ∖ {v_{1}}, w (v_{2}, v_{j}) \geq β \max_{k} (w (v_{2}, v_{k}))$ .
5.
Repeat steps 3 and 4 until all proteins are assigned to a cluster.

4.1. Base $R^{≃}$ and Line $R^{≃}$ approximations for IsoRankN

Observing that the IsoRankN algorithm also suffers from the same computational bottleneck involving the construction of the IsoRank similarity matrices, we proceeded to reimplement the IsoRankN algorithm, where we replaced the exact IsoRank similarities with the $R^{0}$ and $R^{1}$ approximations. We set the IsoRankN parameters to their default values ( $α = 0.6, β = 0.01, γ = 0.25$ ) and computed the cross-species clusters on the IntAct PPI networks of the species: Mouse, Rat, Fly, and Yeast (we did not include the human PPI network in this experiment because its addition caused the overall memory requirement to go beyond 64GB of RAM). As in the original IsoRankN article, we used two entropy metrics on the protein clusters to measure their functional properties, which we call “(un)normalized averaged clusterwide entropy” here. We also calculated the same measure of cluster species diversity by counting the number of clusters with their protein members belonging to all (or some subset of) the species set. In detail, these measures are as follows.

4.1.1. Unnormalized averaged clusterwide entropy (UACE)

For a given cluster C, let $p_{i}$ be the proportion of proteins in C that have the GO label i. Then, per-cluster UACE (i.e., $UACE (C)$ ) is computed as $UACE (C) = Entropy (p_{1}, \dots, p_{d}) = - \sum_{i = 1}^{d} p_{i} \log p_{i}$ (10)where d is the number of all GO labels and the logarithm is base 2. Averaging this score across all clusters, we get the clusterwide entropy score $UACE = \frac{1}{N} \sum_{C} U A C E (C)$ , where N is the total number of clusters obtained.

4.1.2. Normalized averaged clusterwide entropy (NACE)

We get $NACE (C)$ from $UACE (C)$ by normalizing it with the log of cluster size (i.e., $NACE (C) = \frac{1}{\log | C |} U ACE (C)$ . Averaging this across all the clusters, we obtain the final clusterwide NACE.

Note that even though step 3 of the IsoRankN algorithm described above specifies arbitrary selection of the species network during each iteration, the actual IsoRankN implementation kept the order fixed. Since our choice of the PPI networks is different from the original implementation, for the sake of completeness, we reinvestigated the better approach among the two by using the UACE and NACE metrics. For this experiment, we chose the following ordering: (1) baker’s yeast, (2) fly, (3) mouse, and (4) rat. Computing the UACE and NACE results for GO = MF, we found that the randomized versus fixed scores were very close (NACE = 0.98 and UACE = 0.93 for both). Thus, we selected the randomized PPI selection setting as our default option during clustering.

Table 8 shows that, when we take all the clusters into account, the IsoRankN clustering obtained using the $R^{0}$ = Base $R^{≃}$ approximation produces smaller cluster entropy than the exact IsoRank and the $R^{1}$ = Line $R^{≃}$ approximation across all entropy metrics and GO hierarchies. However, when we measured the clusters’ species diversity by counting the number of clusters with proteins from more than one species (Table 9), the count was always the lowest for Base $R^{≃}$ . Additionally, the Base $R^{≃}$ approximation also produced far fewer clusters with size $> 2$ ( $676 / 17247$ ) compared with Line $R^{≃}$ ( $2770 / 15578$ ) and original IsoRank ( $2986 / 14939$ ). Thus, we recomputed the entropy evaluation by excluding all the size = 2 clusters from all 3 IsoRankN implementations. The entropy results (Table 10) showed that this filtering significantly deteriorated the functional coherence of the $R^{0}$ = Base $R^{≃}$ clusters.

Table 8.
NACE and UACE Results for the Exact IsoRankN Clustering and Its Two Approximations for All Three Gene Ontology (GO) Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”)

IsoRank $R^{0}$ $R^{1}$ R

Metric type GO

NACE BP 1.395 1.454 1.457

CC 0.973 0.986 0.984

MF 0.970 0.986 0.981

UACE BP 2.714 3.037 3.068

CC 0.922 1.006 1.020

MF 0.854 0.939 0.954

	IsoRank	$R^{0}$	$R^{1}$	R
NACE	BP	1.395	1.454	1.457
CC	0.973	0.986	0.984
MF	0.970	0.986	0.981
UACE	BP	2.714	3.037	3.068
CC	0.922	1.006	1.020
MF	0.854	0.939	0.954

$R^{0}$ , Base $R^{≃}$ ; $R^{1}$ , Line $R^{≃}$ ; R, original IsoRank. Best performing results are in bold.

NACE, normalized averaged clusterwide entropy; UACE, unormalized averaged clusterwide entropy.

Table 9.

Table Showing the Number of Clusters (Obtained from Exact IsoRank (R) and Its Approximations Base $R^{≃}$ and Line $R^{≃}$ ( $R^{0}$ and $R^{1}$ , Respectively)) Having Proteins from Exactly k out of 4 Number of Species

#species in a cluster (k)	$R^{0}$	$R^{1}$	R
1	15046	12977	12317
2	1962	2134	2116
3	186	382	404
4	53	85	102
total	17247	15578	14939

Table 10.

NACE and UACE Results for the Exact IsoRankN Clustering and Its Two Approximations for All Three Gene Ontology (GO) Hierarchies (MF = “Molecular Function,” BP = “Biological Process,” and CC = “Cellular Component”)

	IsoRank	$R^{0}$	$R^{1}$	R
Metric type	GO	$R^{0}$	$R^{1}$	R
NACE	BP	2.589	2.038	2.040
	CC	1.113	1.069	1.065
	MF	0.948	1.021	1.027
UACE	BP	9.412	6.226	6.208
	CC	2.115	1.628	1.618
	MF	1.633	1.376	1.421

The scores are computed by selecting only those clusters with size > 2. Best performing results are in bold.

$R^{0}$ , Base $R^{≃}$ ; $R^{1}$ , Line $R^{≃}$ ; R, original IsoRank.

NACE, normalized averaged clusterwide entropy; UACE, unormalized averaged clusterwide entropy.

Thus, the results in Tables 9 and 10 show that the cluster entropy and species diversity are comparable between Line $R^{≃}$ and exact IsoRank. Both produce bigger and more diverse clusters, with higher functional similarity among proteins in the larger clusters than the Base $R^{≃}$ approximation.

5. DISCUSSION

5.1. Synthetic networks

The results on the synthetic networks serve to provide some guidance on which IsoRank approximation should be utilized as a function of the measured node similarities in the two networks. As shown in Tables 1 and 2, for both ER and BA random graphs, when “good” $E$ is used, the accuracy for all three algorithms is $100 %$ , i.e., all the algorithms can recover the network alignments perfectly, and the errors, measured in both norms, are very small. However, the original IsoRank R (2) is much slower than the two approximation algorithms. Approximate IsoRank $R^{0}$ = Base $R^{≃}$ (5) is about 7.6–14.3 times faster, and Approximate IsoRank $R^{1}$ = Base $R^{≃}$ (7) is about 4.7–6.7 times faster. Since they all recover the perfect alignment, when $E$ contains accurate information about the alignment, we suggest using approximate IsoRank $R^{0}$ = Base $R^{≃}$ (5). When “in-between” $E$ is used, although the error between the original IsoRank and approximations is still small, the accuracy for all three algorithms decreases. As we can see, Approximate IsoRank $R^{0}$ = Base $R^{≃}$ (5) is the fastest among the three, but its accuracy is slightly worse than the original IsoRank R (2). Approximate IsoRank $R^{1}$ = Base $R^{≃}$ (7) is slightly slower than Approximate IsoRank Base $R^{≃}$ (5) but achieves roughly the same accuracy as the original IsoRank R (2). But Approximate IsoRank Line $R^{≃}$ (7) is still much faster than original IsoRank (2) and, therefore, balancing the accuracy and CPU time, we would suggest using Approximate IsoRank Line $R^{≃}$ (7) in this case. Finally, when $E$ is “bad,” we observe the same results as the case $E$ is “in-between.” In particular, Approximate IsoRank Base $R^{≃}$ (5) is the fastest, but its accuracy is the worst (but not by much, still close to the accuracy of the original IsoRank [2]). Approximate IsoRank Line $R^{≃}$ (7) seems to achieve the best balance between accuracy and CPU time, and we would suggest using it in this case as well. However, if the CPU time is the priority, then we recommend Approximation IsoRank Base $R^{≃}$ (5) since it is the fastest. Finally, from the numerical experiments, we observe that both our approximation algorithms achieve optimal computational complexity $O (N^{2})$ , where $N = \max {| V_{1} |, | V_{2} |}$ , as expected since the synthetic networks considered here are sparse (namely $| E | / | V |$ in the range 15.4–18.5 in Table 1 and approximately 5.0 in Table 2) and $t_{1} = t_{2} = O (N^{2})$ in our implementation.

5.2. Biological networks

Interestingly, we found the Base $R^{≃}$ approximation to IsoRank to be more robust and produce better predictions of protein functional roles across the board than even an exact IsoRank. Since inferring the function of unknown proteins across species is one of the main applications of the Global Alignment of Biological Networks, perhaps given the fact that the networks are not actually isomorphic and we have an imperfect sample of the true edges, somehow our approximation to IsoRank is more robust to noise and therefore works better than producing the best isomorphism as measured by purely graph-theoretical measures. This requires further investigation.

We also approximated the IsoRankN clustering algorithm in the multispecies setting by replacing the original IsoRank similarities with the $R^{0}$ = Base $R^{≃}$ and $R^{1}$ = Base $R^{≃}$ approximations. Different from the two species setting, we found that the clusters obtained from Base $R^{≃}$ were less functionally meaningful and less species diverse than those obtained from Line $R^{≃}$ and exact IsoRank. Therefore, for the multispecies clustering problem, we recommend using the $R^{1}$ -approximation, Line $R^{≃}$ , as it produces comparable results to the original IsoRankN implementation but has a significantly less computational cost.

Footnotes

ACKNOWLEDGMENTS

The authors thank the National Science Foundation for support under NSF grant CCF-1934553. The authors also thank the NSF CC* grant 2018149 for Tufts computing cluster resources.

AUTHORS’ CONTRIBUTIONS

K.D.: conceptualization, methodology, formal analysis, and software; A.B.: formal analysis; X.H.: methodology, formal analysis, and software; L.C.: conceptualization, methodology, formal analysis, and funding. All authors were involved in writing and editing the article.

CODE AND DATA AVAILABILITY

The implementation of approximate IsoRank is available at https://github.com/kap-devkota/approximate_ISORANK. Gene Ontology files are from the official GO website (http://geneontology.org/ OBO data version: releases/2022–12-04, GAF version: 2.2). The biological networks used in the experiments can be downloaded from the IntAct FTP link at https://www.ebi.ac.uk/intact/download/ftp

AUTHOR DISCLOSURE STATEMENT

The authors have no competing interests to declare that are relevant to the content of this article.

FUNDING INFORMATION

This work was supported by the National Science Foundation (NSF), grant CCF-1934553 and the NSF CC* grant 2018149 for Tufts computing cluster resources.

References

Albert

, Barabási

A-L

. Statistical mechanics of complex networks. Rev Mod Phys, 2002; 74(1):47–97; doi: 10.1103/RevModPhys.74.47

Altschul

, Gish

, Miller

, et al. Basic local alignment search tool. J Mol Biol, 1990; 215(3):403–410; doi: 10.1016/S0022-2836(05)80360-2

Anderson

. Co-Immunoprecipitation. In: Protein Targeting Protocols. ( Clegg

R.A

. eds.) Humana Press: Totowa, NJ; 1998. pp. 35–45. 10.1385/0-89603-487-9:35

GO Consortium . The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 2004; 32(suppl_1):D258–D261; doi: 10.1093/nar/gkh036

Devkota

, Schmidt

, Werenski

, et al. Glider: Function prediction from GLIDE-based neighborhoods. Bioinformatics, 2022; 38(13):3395–3406; doi: 10.1093/bioinformatics/btac322

El-Kebir

, Heringa

, Klau

. Natalie 2.0: Sparse global network alignment as a special case of quadratic assignment. Algorithms, 2015; 8(4):1035–1051; doi: 10.3390/a8041035

Emmert-Streib

, Dehmer

, Shi

. Fifty years of graph matching, network alignment and network comparison. Information Sciences, 2016; 346:180–197; doi: 10.1016/j.ins.2016.01.074

Erdős

, Rényi

. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 1960; 5(1):17–60; doi: 10.1515/9781400841356.38

Hamp

, Kassner

, Seemayer

, et al. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics, 2013; 14(Suppl 3):S7; doi: 10.1186/1471-2105-14-S3-S7

10.

Hashemifar

, Xu

. HubAlign: An accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics, 2014; 30(17):i438–i444; doi: 10.1093/bioinformatics/btu450

11.

Hermjakob

, Montecchi-Palazzi

, Lewington

, et al. InTact: An open source molecular interaction database. Nucleic Acids Res, 2004; 32(Database issue):D452–D455; doi: 10.1093/nar/gkh052

12.

Kazemi

, Grossglauser

. On the structure and efficient computation of IsoRank node similarities. arXiv, 2016; doi: 10.48550/arXiv.1602.00668

13.

Khan

, Gleich

, Pothen

, et al. A multithreaded algorithm for network alignment via approximate matching. In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE; 2012. pp. 1–11 10.1109/SC.2012.8

14.

Kohalmi

, Reader

LJV

, Samach

, et al. Identification and characterization of protein interactions using the yeast 2-hybrid system, In: Plant Molecular Biology Manual. Springer: Netherlands, Dordrecht; 1998. pp 95–124. 10.1007/978-94-011-5242-6_6

15.

Liao

C-S

, Lu

, Baym

, et al. IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics, 2009; 25(12):i253–i258; doi: 10.1093/bioinformatics/btp203

16.

Mamano

, Hayes

. SANA: Simulated annealing far outperforms many other search algorithms for biological network alignment. Bioinformatics, 2017; 33(14):2156–2164; doi: 10.1093/bioinformatics/btx090

17.

Neyshabur

, Khadem

, Hashemifar

, et al. NETAL: A new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 2013; 29(13):1654–1662; doi: 10.1093/bioinformatics/btt202

18.

Schafer

. An Introduction to Nonassociative Algebras. Courier Dover Publications; 2017.

19.

Schlicker

, Domingues

, Rahnenführer

, et al. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics, 2006; 7(1):302–316; doi: 10.1186/1471-2105-7-302

20.

Singh

, Xu

, Berger

. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA, 2008; 105(35):12763–12768; doi: 10.1073/pnas.080662710

21.

Smyth

. A spectral theoretic proof of Perron-Frobenius. In: Mathematical Proceedings of the Royal Irish Academy. JSTOR, 2002. pp. 29–35.

22.

Strang

. Introduction to Linear Algebra. 3rd. edition.. Wellesley-Cambridge Press: Wellesley, MA; 2003.

23.

Vijayan

, Saraph

, Milenković

. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 2015; 31(14):2409–2411; doi: 10.1093/bioinformatics/btv161

24.

Zhang

, Tong

. FINAL: Fast attributed network alignment. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. pp. 1345–1354, 10.1145/2939672.293976

25.