Abstract
Abstract
1. Introduction
Existing approaches for comparing networks across species differ substantially both in terms of mathematical formulations of the problem as well as applied algorithms. Early procedures searching for local subnetworks conserved across two species (Kelley et al., 2003; Sharan et al., 2005a; Koyuturk et al., 2006) have been complemented by multiple network aligners (Sharan et al., 2005b; Flannick et al., 2006). Global alignment has been formulated and solved by integer quadratic programming (Li et al., 2007). New graph-theoretic algorithms (Narayanan and Karp, 2007; Kalaev et al., 2009) and spectral approaches inspired by Google's page-rank algorithm have been proposed (Singh et al., 2007; Liao et al., 2009). Available methods offer a diverse range of scoring functions, including those motivated by evolutionary considerations (Koyuturk et al., 2006; Hirsh and Sharan, 2006; Berg and Lassig, 2006; Flannick et al., 2008), as well as structural and functional similarity (Kuchaiev and Pržulj, 2011; Ali and Deane, 2009).
Apart from developments in network alignment, multiple studies have focused on constructing and analyzing random graph models that aim to capture the basic processes by which protein networks evolve. Most of these models are based on the principle of duplication and divergence (Sole et al., 2002; Vazquez et al., 2003; Bebek et al., 2005; Berg et al., 2004; Ispolatov et al., 2005b; Ispolatov et al., 2005a; Evlampiev and Isambert, 2008) and have shown to match the observed data in terms of key topological features such as degree distribution and clustering coefficient (Bebek et al., 2006; Hormozdiari et al., 2007). When these models are fitted to experimental data, they provide important insights about the evolutionary forces shaping the network topology. Functional inferences, however, are challenging because nodes in these network models are typically anonymous and cannot be mapped to specific genes or proteins in the real networks. Additionally, these models typically describe the network of a single organism. Hence, although very interesting insights have been made using traditional random graph models, for example, recent inferences of putative ancestral network states (Navlakha and Kingsford, 2011), their ability to jointly model network evolution across species remains limited (Knight and Pinney, 2009).
To provide a model-based approach for comparing networks across species, we have previously developed conserved ancestral protein-protein interactions (CAPPI) (Dutkowski and Tiuryn, 2007), which applies phylogenetic analysis to reconstruct the history of protein sequence evolution based on which it further infers the evolution of protein interactions. CAPPI applies a model closely resembling the random graph models previously proposed, but it additionally adds a speciation event—a natural extension that allows us to consider networks from multiple species. The second crucial difference compared to standard random graph models is that the duplication and speciation events are linked directly to the specific proteins in the underlying biological networks, allowing for functional analysis of real networks. Originally, CAPPI was designed to perform local network alignment, that is, to identify conserved network regions under a user-specified model. It was further extended to handle missing links in networks and predict new protein interactions (Dutkowski and Tiuryn, 2009).
In the present work we take a new direction. First, we develop an expectation-maximization (Dempster et al., 1977) algorithm to learn the parameters of the evolutionary model automatically from experimental data, providing the first cross-species estimations of PPI evolutionary rates under a duplication-and-divergence model. This approach significantly extends CAPPI's network modeling capabilities, allowing for direct estimation of evolutionary parameters, parameter-free identification of conserved network regions, and reconstruction of ancestral and consensus network representations for multiple species. Next, we introduce two instances of the model: one that corresponds to evolution under selective pressure, and the other one that fits the neutral scenario. We train the two models on specific examples of known functional modules and randomly selected network regions. Finally, for every pair of protein families, we compute the likelihood of evolving under the conserved and under the random model. Using this method, we infer the most conserved network regions across seven eukaryotic species and identify cofunctioning protein families whose members participate in major eukaryotic complexes and pathways.
2. Methods
We first briefly summarize the network model (Dutkowski and Tiuryn, 2009) and then present the expectation maximization (EM) algorithm developed in the present study.
2.1. The model of PPI network evolution
Suppose we have a set of protein–protein interaction networks for several species. We first construct a phylogenetic tree for the input species and cluster all protein sequences present in these species to define clusters of paralogous proteins. Each cluster, also called a protein family, corresponds to a single protein in the ancestral species. Each family gives rise to a reconciled phylogenetic tree (Page and Charleston, 1997), which, in addition to speciation, uses duplication events and gene losses to explain the particular protein composition of the family. The reconciled trees are constructed subject to minimization of the number of duplication and loss events.
Assume real-valued parameters Θ = (ps, δs, pd, δd, pm, δm, p*) and an initial network of protein–protein interactions represented by an undirected graph G. The graph G can be thought of as an ancestral network of interactions, where nodes correspond to the proteins and edges connect two interacting partners. The parameter p* represents the (prior) probability of observing an edge in G. The network evolves subject to the following events:
• • •
The above speciation and duplication events can be naturally associated with corresponding events in the reconciled phylogenetic trees, thus representing the evolution of interactions spanning every pair of protein families (for details see Dutkowski and Tiuryn, 2007, 2009).
2.2. EM parameter estimation
Based on this model, we now develop an EM approach for deriving the probabilities of network rewiring events directly from the existing interaction data. We have two types of such events associated with speciation: retaining an edge and introducing an edge. We denote these two types by s and s+, respectively. In a similar fashion we define the events d, d+, m, m+. We introduce the concept of an event in order to easily refer in a uniform way to the corresponding parameters of the model. For this reason, we will also include the event of initiation and denote it by*. This event represents creation of the ancestral graph G by a sequence of edge introductions. Let E = {s, s+, d, d+, m, m+, *} denote the set of all types of events. With each type e ∈ E there is naturally associated a probability pe for this event to occur. Thus, for example, ps retains its meaning from the parameter set Θ, and
Our model consists of two levels: the observable level, which is represented by datasets of measured protein–protein interactions, and the hidden level that represents the true extant interactions together with all protein interactions that were present during the evolution from the ancestral species to the extant species. The speciation and duplication events are internal in the hidden level, while the measurement event serves as a bridge between the hidden and the observable levels.
Let us denote by
For a type
2.2.1. The
\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}
\begin{document}
$${ \cal Q}$$
\end{document}
function
We now derive the
The probability of the data given the parameters Θ can be expressed as the product of the probabilities of the initial interactions and the transition probabilities:
Applying usual transformations we obtain the following formula for
where the mean values are taken with respect to the probability distribution P(−|
2.2.2. The E-step
In the E-step we compute the mean values
2.2.3. The M-step
The M-step of the algorithm determines new parameter values that maximize the
For a given e, we find
Notice that
2.3. Discriminating between selective and neutral evolution
To learn the parameters associated with conserving and neutral network evolution, we use protein interactions among members of functional modules as training samples for the conserving model, and interactions in random network regions for the neutral model. We run the EM procedure separately to obtain these two sets of parameters. Afterward, the two estimated models can be used to discriminate between the conserving and neutral cases. For a given pair of protein families (f1, f2) we define the log-likelihood ratio (LLR score) as:
where
3. Results and Discussion
We applied the proposed framework and reference datasets to estimate two types of parameters of network evolution (Fig. 1). The first type was estimated based on known examples of functional modules, which are assumed to evolve under a conserving evolutionary scenario. The second type was estimated based on randomly selected parts of the network. We expected that the difference between these parameter sets would reflect the difference in interaction conservation among functional modules and remaining network regions evolving at a background rate.

Estimating the parameters of conserving and neutral evolution of protein interaction networks:
3.1. Data acquisition
We downloaded the TreeFam database (Li et al., 2006) as a reliable source of protein families and phylogenies. TreeFam contained over 18,000 gene family trees of which 1,314 (in TreeFam A) were manually curated and are therefore attributed greater confidence. From TreeFam A we selected 573 families that have a representative from each of the seven species: H. sapiens, M. musculus, R. norvegicus, D. melanogaster, C. elegans, S. cerevisiae, and A. thaliana. Next we downloaded all available protein–protein interactions for these species from the IntAct (Hermjakob et al., 2004), DIP (Salwinski et al., 2004), and MINT (Chatr-aryamontri et al., 2007) databases.
Reference datasets of protein pathways and complexes were downloaded from the KEGG and MIPS databases. Based on the reference datasets we prepared two separate sets of examples of conserving PPI evolution. First for the KEGG-based dataset we extracted pairs of different yeast proteins that were together members of the same pathway. For each such pair of proteins, we added the corresponding pair of protein families (to which the proteins belonged to) to our set of conserved examples, but only if there was at least one protein–protein interaction observed between members of these families (in the input PPI databases). We refer to such family pairs as conserved. We identified 553 conserved pairs altogether. Analogous steps were performed for the MIPS dataset identifying 347 family pairs.
Next we generated a random dataset, equal in size to the larger of the two conserved sets, drawing only pairs of protein families for which there was at least one interaction observed between their members. Thus both the conserved and the random set contained pairs of protein families for which some evidence for interaction existed. Pairs in the conserved sets fulfill an additional condition—each family in the pair has at least one protein member in the same KEGG pathway or MIPS complex. Note that in both cases a protein family is allowed to form a pair with itself.
3.2. Parameter estimation
Next we estimated the parameters of the evolutionary model (ps, δs, pd, δd, p*) based on the three training datasets (KEGG-based, MIPS-based, and random). Table 1 shows important differences between the three resulting models. Most notable is the conservation of interactions during speciation (ps), which is greater for the conserved KEGG and MIPS-based models than for the random model. This result supports the notion that interactions within functional modules are strongly conserved across species. Perhaps less expected is the weaker conservation of these interactions during duplication (pd). One plausible explanation is that it allows duplicate members of the functional modules to change function more easily. While the conservation of essential interactions is critical, once a protein duplicates one of the copies may be allowed to diverge and gain new functionality. This provides potential for higher interaction loss as well as for formation of new interactions by adapting old interfaces. It is also interesting to observe that parameter values estimated based on the MIPS data are more extreme than those based on KEGG. This suggests that interactions within complexes are more strongly conserved that those within pathways that generally define broader functional categories (often composed of several complexes).
Finally, we notice that in all three models the prior probability of interaction is close to one, suggesting that the vast majority of present interactions have their evolutionary predecessors in the ancestral species. Note that based on the present day interactomes, the overall prior probability of interaction between ancestral proteins is expected to be ∼0.001. Here however, we only consider a specific subspace of the interactome defined by pairs of protein families for which at least one interaction is observed experimentally.
3.3. Distinguishing conserved pathway and complex regions
The differences between the models (conserving and neutral) enable us to apply the likelihood ratio score to assess the relative interaction conservation between members of arbitrary protein families. First, we analyze the scores among pairs of families used for training the two conserving models. The distribution of LLR scores for pairs of families derived based on KEGG and MIPS data are plotted in Figure 2. For KEGG-based pairs, the conserving model used to calculate the LLR scores was based on the KEGG database. The MIPS-based pairs were scored using the MIPS-based conserving model. Additionally, each pair was scored in the neutral model. As expected, in both cases most KEGG and MIPS-based pairs scored higher under the conserving model (LLR>0). However, we also identified a considerable fraction of pairs that were more likely under the neutral scenario (255 KEGG- and 140 MIPS-based pairs). The histograms (see Fig. 2 A, B) suggest a bimodal distribution where some of the KEGG- and MIPS-based family pairs are significantly more conserved than others. Based on this observation, we selected a stringent LLR threshold value, at which other highly conserved pairs can be identified. Figure 2 provides an overview of the top-scoring KEGG-based associations with LLR>0.2 grouped according to the KEGG pathways that support them. Among the most conserved pathway regions, we identified those involved in proteolysis, cell cycle, metabolism, and DNA replication.

Histograms provide the distribution of log-likelihood ratio (LLR) scores among pairs of protein families containing proteins belonging to common KEGG pathways
3.4. Maps of highly conserved associations
Next we computed the LLR score for all remaining pairs of protein families using the MIPS-based conserving model and the neutral model. Further, we constructed a network of protein families in which the edges are weighted by the LLR score, representing the level of conservation of interactions between the members of adjacent families. Based on the considerations outlined above, we selected a threshold value of 0.4 and retained only the edges in the network with weights above this threshold. Family pairs with no edges above the selected threshold were discarded. Overall 251 protein families and 358 edges were retained.
The resulting network is visualized in Figure 3 (a high-resolution version is also available as Supplementary Fig. 1). The network has a giant component composed of 192 nodes and 316 edges. Additionally there are 29 considerably smaller components. Approximately half of the edges in the network correspond to protein family pairs, which were previously known to be associated within functional modules. In total, 105 of these correspond to pairs present in the MIPS-based dataset used for training. An additional 38 pairs (marked by yellow edges) are present only in the KEGG-based dataset—these were not present in the training set. The remaining 173 edges represent predicted novel family associations.

TreeFam family network identified at LLR threshold of 0.4 and clustered using the affinity propagation algorithm. Identified clusters are numbered and color coded. Thick edges correspond to known associations either based on MIPS (blue), KEGG (yellow), or both (green).
We further raised the LLR threshold value to identify multiple coherent network modules. Table 2 presents the results of gene ontology (GO) enrichment analysis (Bauer et al., 2008) of the 26 connected components of the family network identified as LLR = 0.7 (components are presented in Supplementary Fig. 2). We find that 24 of the 26 modules are significantly enriched for a biological process term (p-value <0.01 after correcting for multiple testing). The first two modules correspond to the proteasome complex involved in proteolysis (see above). Other modules are responsible for highly conserved biological processes such as translation, cell cycle regulation, protein folding, and phosphorylation. Most of the modules have enriched annotations to the same term in multiple species. Our analysis also uncovers possible missing annotations (denoted by NA in Table 2) and suggests candidate GO terms for the corresponding proteins. For example, based on the evidence from four other species, Arabidopsis proteins in component number 20 might be considered for annotation to the cell division term, while rat proteins belonging to the 7-th module are suspected to take part in DNA replication.
The p-values are reported for five of the seven species. At most one significant term is considered for each module. The p-values smaller than 0.01 are presented in bold. NA indicates missing annotation (see text in Section 3.4).
To take advantage of the information coming from weaker connections between families, we clustered the original network using the affinity propagation algorithm (Frey and Dueck, 2007), producing a map of 83 family–family interaction modules (Fig. 3). Repeating the GO enrichment analysis for the identified clusters, we found that 76 of them (92%) have statistically significant biological process annotations in at least one species (p-value <0.01 after correcting for multiple testing) and 52 (63%) have significant annotations to the same term in at least two of the seven species. Supplementary Table 1 lists the annotations of 39 clusters, which have significant annotations to the same term in at least two of the seven species at a more stringent p-value <0.001.
3.5. Assessing robustness
Our initial analysis was focused on exploring the parameter space and identifying unknown conserved family associations. To this end, we used all known MIPS-based family pairs to learn the conserving model and identify novel associations that evolve as restrictively as the known functional components. An important outstanding question is if our method is able to identify MIPS associations held out from training and if it is robust to perturbations in the input data. To investigate this we applied the cross-validation procedure by iteratively learning the evolutionary parameters on 4 of 5 (approximately equal in size) subsets of the data and classifying the pairs in the subset that was held out from training. The held out portion of the data was subsequently used to validate the method's predictions. Note that each of the five subsets contained ∼1/5 of the 347 MIPS-based pairs and ∼1/5 of the 347 pairs selected at random.
Notably, we found that our method, having only five parameters, does not overfit to training data and produces a generalizable discriminant. As a result, we recovered 95 MIPS pairs (84%) above the 0.4 LLR threshold–each pair was identified based on the training data that did not include this pair. At the selected threshold, only 18 pairs (16%) from the random set were identified. Of the 95 true positive pairs, 91 were among the 105 MIPS pairs identified by the original method, which was based on all available data.
4. Conclusions
We have presented a maximum likelihood approach for learning the probabilities of network rewiring events under a simple duplication-and-divergence model. Our method provides insights into the rates of acquisition and loss of interactions under various evolutionary scenarios. It also allows differentiation between the neutral and conserving modes of evolution, thus pointing to regions of enhanced conservation that typically correspond to key functional modules. We anticipate that model-based approaches such as the one proposed here will be essential for understanding the processes shaping biological networks and inferring and transferring functional information across species.
Footnotes
Acknowledgments
The authors would like to thank Trey Ideker for helpful comments on the manuscript. Part of this work was included in the doctoral thesis of the first author (JD). This work was partly supported by the Polish Ministry of Science and Education grant no. N N301 065236.
Author Disclosure Statement
No competing financial interests exist.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
