Finding Alignments of Conserved Graphlets in Protein Interaction Networks

Abstract

As the amount of data describing biological interactions increases, it becomes possible to analyze the complex interactions of genes and proteins across multiple networks at the genome scale. The most popular techniques for studying conservation of patterns in biological networks are network alignment and the identification of network motifs. We show that it is also possible to exhaustively enumerate all graphlet alignments, which consist of at least two vertex-disjoint subgraphs that share a common topology and contain homologous proteins at the same position in the topology. We compare the performance of our algorithm to network alignment algorithms and show that our algorithm is able to cover significantly more proteins in the given networks while maintaining comparable or higher sensitivity and specificity with respect to functional enrichment.

1. Introduction

One of the most popular strategies for investigating the function of genes is to study conservation of patterns and modules in biological networks, which can reveal signaling pathways, protein complexes, and functional modules. One important approach uses network alignment techniques, which include heuristic algorithms (Sharan et al., 2005; Koyutürk et al., 2006; Kalaev et al., 2009), progressive alignment algorithms (Flannick et al., 2006), and parameter learning algorithms (Flannick et al., 2009). Although these algorithms are able to identify biologically conserved regions (Kelley et al., 2003; Berg and Lässig, 2004), they do not enforce strict topological constraints, and it is difficult to use them to study the relationships between topology and function. More recently, global network alignment algorithms have allowed these conservation studies to be performed globally across multiple networks (Singh et al., 2008; Flannick et al., 2009; Liao et al., 2009; Kuchaiev and Pržulj, 2011).

An alternative strategy to analyze biological networks is through the identification of network motifs (Milo et al., 2002; Wuchty et al., 2003), which are over-represented patterns in a network. While most previous approaches for finding network motifs focus on estimating the number of motifs that have a certain topology in biological networks, either through counting (Shen-Orr et al., 2002; Parida, 2007; Alon et al., 2008), sampling (Kashtan et al., 2004; Jiang et al., 2006; Pržulj et al., 2006; Wernicke, 2006), or a combination of these techniques (Grochow and Kellis, 2007), functional linkages of proteins within a motif are ignored.

We investigate the problem of identifying conserved patterns in protein interaction networks by obtaining graphlet alignments, which consist of at least two vertex-disjoint subgraphs that share a common topology and contain homologous proteins at the same position in the topology. Since each topology is represented by a small graphlet, we employ exhaustive enumeration techniques to identify all alignments, which is different from most network alignment algorithms that employ heuristics. By placing a constraint on homology between aligned proteins, our strategy is different from counting the number of network motifs that have a certain topology.

We apply this strategy to protein interaction networks both within and across species and show that our algorithm is able to cover significantly more proteins in the given networks than previous approaches while maintaining comparable or higher sensitivity and specificity with respect to functional enrichment.

2. Methods

We first assume that a single interaction network is given, and our goal is to identify all graphlet alignments within the network (Fig. 1).

FIG. 1.

Illustration of an interaction network G; a topology H; a graphlet alignment that contains m = 3 instances H₁, H₂, and H₃; and its correspondence with an induced subgraph of G³. Each oval represents one set S_k of homologous proteins in $\mathcal{S}$.

Definition 1

Given an undirected graph G and an integer m, G^m is an undirected graph in which each vertex $(v_1, \ldots, v_m)$ corresponds to an m-tuple of vertices in G, and each edge connects $(v_1, \ldots, v_m)$ and $(v'_1, \ldots, v'_m)$ if $(v_i, v'_i)$ are edges in G for all 1 ≤ i ≤ m.
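To make Definition 1 concrete, the following minimal sketch builds G^m explicitly for a toy graph. It is only an illustration: the function name `graph_power` and the toy vertices are assumptions, G is assumed to have no self-loops, and explicit construction is feasible only for very small inputs.

```python
from itertools import product

def graph_power(vertices, edges, m):
    """Build G^m explicitly, following Definition 1 (toy sizes only)."""
    g = {frozenset(e) for e in edges}  # undirected edges of G
    nodes = list(product(vertices, repeat=m))  # every m-tuple of vertices of G
    power_edges = set()
    for a in nodes:
        for b in nodes:
            # a < b visits each unordered pair of m-tuples exactly once.
            if a < b and all(frozenset((x, y)) in g for x, y in zip(a, b)):
                power_edges.add((a, b))
    return nodes, power_edges

# Toy example: the path u - v - w, squared.
nodes, power_edges = graph_power(["u", "v", "w"], [("u", "v"), ("v", "w")], 2)
print(len(nodes), len(power_edges))
```

In this toy case, (u, v) and (v, u) are two different vertices of G², and they are adjacent because both (u, v) and (v, u) are edges of G.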

Definition 2

Given an undirected graph G = (V, E) that represents an interaction network, a connected undirected graph $H = (V_0 = (v_{01}, \ldots, v_{0|V_0|}), E_0)$ that represents the target topology, a collection $\mathcal{S} = \{S_1, \ldots, S_{|\mathcal{S}|}\}$ of sets $S_k \subset V$ of homologous proteins, and an integer m, a graphlet alignment consists of m vertex-disjoint instances $H_1, \ldots, H_m$ in which each $H_i = (V_i = (v_{i1}, \ldots, v_{i|V_0|}), E_i)$ is a subgraph of G that is isomorphic to H, and $v_{0j}, v_{1j}, \ldots, v_{mj}$ are mapped together by isomorphism for 1 ≤ j ≤ |V₀|, with the restriction that the induced subgraph of $\{(v_{11}, \ldots, v_{m1}), \ldots, (v_{1|V_0|}, \ldots, v_{m|V_0|})\}$ on G^m is isomorphic to H, and for each j, $v_{1j}, \ldots, v_{mj}$ are homologous proteins, that is, $\{v_{1j}, \ldots, v_{mj}\} \subseteq S_k$ for some k.
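The conditions of Definition 2 can be checked directly for a candidate alignment. The sketch below is an illustration under stated assumptions, not the authors' code: the name `is_graphlet_alignment` is hypothetical, the vertices of H are numbered 0 to |V₀| − 1, and `columns[j]` holds the m-tuple aligned to vertex j of H, so the isomorphism is checked position by position.

```python
from itertools import combinations

def is_graphlet_alignment(g_edges, h_edges, homolog_sets, columns):
    """Check Definition 2 for columns[j] = (v_1j, ..., v_mj)."""
    g = {frozenset(e) for e in g_edges}
    h = {frozenset(e) for e in h_edges}
    # No vertex of G may appear more than once in the alignment.
    flat = [v for col in columns for v in col]
    if len(flat) != len(set(flat)):
        return False
    # Each m-tuple must lie inside some set S_k of homologous proteins.
    if not all(any(set(col) <= s for s in homolog_sets) for col in columns):
        return False
    # Two m-tuples are adjacent in G^m iff every instance has the edge;
    # the induced subgraph on the m-tuples must match H exactly.
    for j, k in combinations(range(len(columns)), 2):
        adjacent = all(frozenset((a, b)) in g
                       for a, b in zip(columns[j], columns[k]))
        if adjacent != (frozenset((j, k)) in h):
            return False
    return True

# Toy network: instance 1 is a triangle, instance 2 is only a path.
g_edges = [("a1", "b1"), ("b1", "c1"), ("a1", "c1"),
           ("a2", "b2"), ("b2", "c2")]
homolog_sets = [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}]
columns = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
path = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(is_graphlet_alignment(g_edges, path, homolog_sets, columns))
print(is_graphlet_alignment(g_edges, triangle, homolog_sets, columns))
```

The toy input aligns under the path topology but not under the triangle, because only one instance contains the closing edge. This is the sense in which the instances themselves need not be induced subgraphs of G.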

Similar to the motif discovery approach in Grochow and Kellis (2007), we assume that a fixed topology is given, and our algorithm can be applied over different topologies to identify all graphlet alignments. Since the given topology does not contain actual proteins, this strategy is different from other approaches that identify conserved patterns through a query path or a query network (Kelley et al., 2003; Shlomi et al., 2006; Tian et al., 2007; Dost et al., 2008). Since no restrictions are made on the relationships between different sets S_k of homologous proteins, they can overlap in complicated ways. This strategy is different from the one in Koyutürk et al. (2004), in which the same label is used to specify homologous proteins in different graphs. In addition to subgraphs in G, the isomorphism mappings are applied to induced subgraphs in G^m to make sure that H is the densest topology that all instances share (note that the subgraphs in G do not need to be induced). To avoid repetitive structures, we require that each vertex in G appear at most once within an alignment. Note that this does not prevent different homologous proteins from appearing within an instance.

We exhaustively find all graphlet alignments through a branch-and-bound algorithm by generating m-tuples of vertices in G that satisfy the homology constraints and recursively adding them to a growing alignment with m instances. This technique updates the m instances at the same time and is different from the progressive alignment technique used in network alignment algorithms (Flannick et al., 2006). To avoid the generation of redundant alignments, we impose symmetry-breaking conditions.

Grochow and Kellis (2007) considered the problem of enumerating all subgraphs of an undirected graph G that are isomorphic to a given topology H and derived symmetry-breaking conditions to ensure that there is a unique map from H to each instance of H in G. Given a distinct labeling of vertices in G, they imposed conditions of the form $l_H(v) < \min(l_H(u_1), \ldots, l_H(u_k))$, where $v, u_1, \ldots, u_k$ are vertices of H, and l_H(v) is the induced label of vertex v in H given a map from H to G.

To impose symmetry-breaking conditions on graphlet alignments, observe that there is a one-to-one correspondence between a graphlet alignment with m instances and an induced subgraph in G^m that contains m-tuples of homologous proteins as vertices with no repeated vertices in G. For each j, v_ij of H_i in a graphlet alignment corresponds to the ith component of the vertex $(v_{1j}, \ldots, v_{mj})$ in G^m that represents a set of m aligned proteins. Note that v_ij of H_i is mapped from the jth vertex of H, and a map from H to G^m specifies an alignment with m instances (Fig. 1). Note, for example, that (u, v) and (v, u) are two distinct vertices in G².

The problem reduces to enumerating all induced subgraphs of G^m that are isomorphic to H and contain m-tuples of homologous proteins as vertices with no repeated vertices in G. The symmetry-breaking conditions in Grochow and Kellis (2007) can therefore be used to break symmetry of topology in G^m. We have an additional type of symmetry because of the permutation of instances.

Definition 3

Two graphlet alignments with m instances are redundant if they have exactly the same sequence of m-tuples of homologous proteins up to symmetry of topology in G^m and symmetry due to permutation of instances (Fig. 2).

FIG. 2.
Illustration of redundant alignments: The alignments in (a) and (b) and the alignments in (c) and (d) are symmetric in topology, while the alignments in (a) and (c) and the alignments in (b) and (d) have their instances permuted. The symmetry-breaking condition on H is shown below H. The corresponding symmetry-breaking conditions on (a) are shown below (a), in which the first condition is obtained by setting l_H(v₀₃) = l(u₇, u₅) = min(l(u₇),l(u₅)) and l_H(v₀₄) = l(u₈, u₆) = min(l(u₈), l(u₆)) to break symmetry of topology, and the second condition is obtained from the first two-tuple (u₁, u₂) to break symmetry due to permutation of instances.

To break both types of symmetry, we first make sure that all alignments that differ only by a permutation of instances are treated in exactly the same way with respect to topology symmetry breaking by assigning the same label to each m-tuple $(v_1, \ldots, v_m)$ in G^m and all its permutations $(v_{p(1)}, \ldots, v_{p(m)})$, where p is an arbitrary permutation on m vertices. One way to do this is to assign the label $l(v_1, \ldots, v_m) = \min(l(v_1), \ldots, l(v_m))$ to each m-tuple $(v_1, \ldots, v_m)$ in G^m, where l(v) is the distinct label of vertex v in G (Fig. 2). Note that this does not mean these m-tuples are treated in the same way. They are distinct in G^m and can appear in different alignments (Fig. 3). Also note that since no repeated vertices in G are allowed in an alignment, topology symmetry breaking is performed with respect to m-tuples that have distinct labels. Redundant alignments that differ only by a permutation of instances are removed by imposing the condition $l(v_{11}) < \cdots < l(v_{m1})$ on the first m-tuple of the instances $H_1, \ldots, H_m$ (Fig. 2).
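The two labeling rules in this paragraph are small enough to state as code. The sketch below assumes the labeling l(u_i) = i used in the caption of Figure 3; the function names are hypothetical and the topology symmetry-breaking conditions of Grochow and Kellis (2007) are not implemented here.

```python
def tuple_label(label, m_tuple):
    """Label of an m-tuple in G^m: the minimum label of its components."""
    return min(label[v] for v in m_tuple)

def instances_in_canonical_order(label, first_tuple):
    """Condition l(v_11) < ... < l(v_m1) on the first m-tuple."""
    labels = [label[v] for v in first_tuple]
    return all(a < b for a, b in zip(labels, labels[1:]))

# Assumed labeling l(u_i) = i for vertices u1 .. u8.
label = {f"u{i}": i for i in range(1, 9)}

# An m-tuple and its permutation share a label but remain distinct tuples.
print(tuple_label(label, ("u7", "u5")), tuple_label(label, ("u5", "u7")))

# Only one ordering of the instances passes the permutation condition.
print(instances_in_canonical_order(label, ("u1", "u2")))
print(instances_in_canonical_order(label, ("u2", "u1")))
```

Because every permutation of an m-tuple receives the same label, the topology conditions see the same labels regardless of instance order, and the separate ordering condition on the first m-tuple then keeps exactly one instance order.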

FIG. 3.
Illustration of distinct alignments that have the same set of labels {1, 3, 5, 6} in G² given the label assignments l(u_i) = i and l(u_i, u_j) = min(l(u_i), l(u_j)). The alignments in (a), (b), and (c) do not have the same tuples of homologous proteins, while the alignments in (a) and (d) are not symmetric in topology.

Figure 4 shows our algorithm GraphletAlign that enumerates all non-redundant graphlet alignments by applying the procedure ExtendAlignment to grow an alignment recursively (Fig. 5). Unlike other network alignment approaches that combine more than one network into a single graph during preprocessing (Kelley et al., 2003; Sharan et al., 2005; Koyutürk et al., 2006), there is no need to generate the entire graph G^m explicitly, which is not feasible when m is large because of extensive space requirements. During the generation of m-tuples $(v_1, \ldots, v_m)$, it is necessary to consider only vertex combinations within each S_k in $\mathcal{S}$ that are of size at least m. We order vertices in G and H in such a way as to reduce unsuccessful branches that will need to be pruned later.

FIG. 4.
Algorithm GraphletAlign for finding graphlet alignments with m vertex-disjoint instances $H_1, \ldots, H_m$ when given an interaction network G, a topology H, and a collection $\mathcal{S}$ of sets of homologous proteins.

FIG. 5.
Algorithm ExtendAlignment for extending alignments that contain the first j vertices of each of the m vertex-disjoint instances $H_1, \ldots, H_m$ to contain one more vertex when given an interaction network G, a topology H, and a collection $\mathcal{S}$ of sets of homologous proteins.
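The pseudocode of Figures 4 and 5 is not reproduced in this text. As a rough illustration of the recursive extension idea only, the sketch below enumerates alignments on a toy input by brute-force backtracking. It is not GraphletAlign: it applies no symmetry-breaking conditions during the search and instead removes redundant alignments (in the sense of Definition 3) afterwards with a canonical key, it scans all candidate m-tuples rather than only adjacent ones, and all names are hypothetical.

```python
from itertools import permutations

def enumerate_alignments(g_edges, n, h_edges, homolog_sets, m):
    """Enumerate graphlet alignments; H has vertices 0 .. n-1."""
    g = {frozenset(e) for e in g_edges}
    h = {frozenset(e) for e in h_edges}
    # Needed vertices of G^m: ordered m-tuples of distinct proteins
    # drawn from a single homolog set of size at least m.
    candidates = sorted({t for s in homolog_sets if len(s) >= m
                         for t in permutations(sorted(s), m)})
    found = {}

    def consistent(columns, new):
        j = len(columns)  # position of H that `new` would occupy
        for k, col in enumerate(columns):
            if set(col) & set(new):  # a vertex of G would repeat
                return False
            adjacent = all(frozenset((a, b)) in g for a, b in zip(col, new))
            if adjacent != (frozenset((k, j)) in h):  # induced match with H
                return False
        return True

    def extend(columns):
        if len(columns) == n:
            # Canonical key: order instances by their smallest vertex,
            # then forget the order of the m-tuples.
            rows = sorted(zip(*columns), key=min)
            found.setdefault(frozenset(zip(*rows)), list(columns))
            return
        for cand in candidates:
            if consistent(columns, cand):
                extend(columns + [cand])

    extend([])
    return list(found.values())

# Toy input: two disjoint paths aligned under the 3-vertex path topology.
g_edges = [("a1", "b1"), ("b1", "c1"), ("a2", "b2"), ("b2", "c2")]
homolog_sets = [{"a1", "a2"}, {"b1", "b2"}, {"c1", "c2"}]
result = enumerate_alignments(g_edges, 3, [(0, 1), (1, 2)], homolog_sets, 2)
print(result)
```

On this toy input the search reaches the same alignment four times (two topology symmetries times two instance orders), and the canonical key keeps one copy. Imposing symmetry-breaking conditions during the search, as described above, avoids generating the redundant copies in the first place.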

Since for each m-tuple in G^m, the algorithm ExtendAlignment is called recursively on all its adjacent m-tuples at most |V₀|−1 times, the worst-case time complexity is $O(|V|^m \Delta^{|V_0|-1})$, where Δ is the largest vertex degree of G^m. This can be rewritten as $O(|V|^m \delta^{m(|V_0|-1)})$, where δ is the largest vertex degree of G. In reality, the number of needed vertices in G^m is bounded by the number of distinct m-tuples of vertices in G that can be obtained from $\mathcal{S}$, and the requirement of finding induced subgraphs helps to prune many search branches. As long as most vertices in G^m are not close to the maximum degree, the actual computational time should be much lower than the worst-case estimate.
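The step from the first bound to the second follows from the adjacency rule of Definition 1: a neighbor of an m-tuple is obtained by choosing one neighbor in G for each component. A short derivation, written out as a reading aid (the degree notation deg is introduced only for this sketch, and the `align*` environment assumes the amsmath package):

```latex
% deg_G and deg_{G^m} denote vertex degree in G and in G^m (sketch notation).
\begin{align*}
\deg_{G^m}(v_1, \ldots, v_m) &= \prod_{i=1}^{m} \deg_G(v_i) \le \delta^{m}, \\
\Delta &\le \delta^{m}, \\
|V|^{m} \, \Delta^{|V_0| - 1} &\le |V|^{m} \, \delta^{m(|V_0| - 1)}.
\end{align*}
```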

To allow indels between neighboring proteins within the instances, given a parameter d that specifies the maximum length of gaps, we construct a new graph G′ = (V′, E′) from G = (V, E) by setting V′ = V and connecting vertex u to v in G′ if the shortest path distance between u and v is at most d + 1 in G, and we apply the algorithm to G′ instead of G. To allow the identification of graphlet alignments in multiple networks, we combine the given networks into a single graph G (no new edges are added) and assign consecutive labels to vertices in G within each of the networks. We further require that each alignment contains at least one instance from each network.
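The construction of G′ can be sketched with a depth-limited breadth-first search from every vertex. This is a minimal illustration under the assumption that G is given as an adjacency dictionary; the function name `gapped_graph` is hypothetical.

```python
from collections import deque

def gapped_graph(adj, d):
    """Edges of G': pairs at shortest-path distance at most d + 1 in G."""
    new_edges = set()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == d + 1:  # do not search beyond the allowed gap
                continue
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        new_edges.update(frozenset((source, v)) for v in dist if v != source)
    return new_edges

# Toy path a - b - c - e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "e"], "e": ["c"]}
print(len(gapped_graph(adj, 0)), len(gapped_graph(adj, 1)))
```

With d = 0 the result is G itself. With d = 1 the pairs (a, c) and (b, e) are added, so an edge of the topology may skip over one intermediate protein.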

3. Results

We applied the GraphletAlign algorithm, using non-isomorphic topologies H (Fig. 6) generated by the algorithm in McKay (1998), to protein interaction networks from human and mouse in the IntAct database (Hermjakob et al., 2004), to protein interaction networks from fly, worm, and yeast in the DIP database (Xenarios et al., 2000), and to protein interaction networks from E. coli, H. pylori, S. typhimurium, and V. cholerae in the SNDB database (Srinivasan et al., 2006). To obtain the collection $\mathcal{S}$ of sets of homologous proteins, for each protein we identify the top r BLAST hits (Altschul et al., 1990) with e-value below 10⁻⁷ for which the protein itself is also a reciprocal BLAST hit; together these form one set of homologous proteins, where r is a parameter.
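One possible reading of this construction is sketched below. It is an interpretation, not the authors' procedure: it assumes that each protein's hits are already ranked by increasing e-value, that self-hits are ignored, that "reciprocal" means the protein appears among the top r qualifying hits of its hit, and that the protein itself belongs to its own set. The function name and the toy protein identifiers are hypothetical.

```python
def homolog_sets(hits, r, max_evalue=1e-7):
    """hits[p]: list of (protein, e_value) pairs, best hit first."""
    # Top r hits of each protein below the e-value threshold.
    top = {
        p: [q for q, e in ranked if e < max_evalue and q != p][:r]
        for p, ranked in hits.items()
    }
    sets = set()
    for p, partners in top.items():
        # Keep only hits that also list p among their own top r hits.
        group = {p} | {q for q in partners if p in top.get(q, [])}
        if len(group) > 1:
            sets.add(frozenset(group))
    return sets

# Made-up toy hit lists.
hits = {
    "p1": [("q1", 1e-30), ("q2", 1e-10)],
    "q1": [("p1", 1e-30)],
    "q2": [("q3", 1e-20), ("p1", 1e-10)],
    "q3": [("q2", 1e-20)],
}
print(homolog_sets(hits, 1))
print(homolog_sets(hits, 2))
```

Raising r from 1 to 2 in the toy data produces sets that share proteins, which illustrates how the sets S_k can overlap.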

FIG. 6.
Illustration of all non-isomorphic topologies H with three to five vertices. Each topology is assigned a name of the form x-y or x-y-z, where x is the number of vertices in the topology, y is the number of edges, and z distinguishes between topologies that have the same values of x and y.

In order to allow an indel in a graphlet alignment between two sets of m aligned proteins, one from each instance, we impose an additional condition that there is at least one direct connection within one of the instances in the original network. Note that indels are allowed within each edge of the topology according to the way that G′ is constructed to replace G. Table 1 shows the number of graphlet alignments over a few combinations of species when one indel is allowed. For single-species alignments, we use topology size |V₀| = 3 and number of instances m = 2. For two- and three-species alignments, we also use topology size |V₀| = 4. For two-species alignment on fly and yeast, we further use topology size |V₀| = 5. For multiple-species alignments, we set m as the number of species. We set the maximum number of top reciprocal BLAST hits r allowed for each protein to the largest possible value (in multiples of five) so that the number of alignments is between 10⁷ and 10⁸. It takes between half a day and two days to obtain all alignments on a single processor over all topologies in each case.

Table 1.
Number of Graphlet Alignments and Computational Time on Single Species, Two Species, Three Species, and Four Species Alignments When One Indel Is Allowed

Species                        Vertex   Edge    |V₀|   m   r    Alignment    Avg_size   Max_size   Time (min)
Human                          6294     13455   3      2   5    3.0 × 10⁷    45.0       177        883.2
Yeast                          7446     22734   3      2   10   3.6 × 10⁷    59.0       117        662.2
Human–mouse                    7455     14545   3–4    2   —    2.2 × 10⁷    21.4       109        1246.8
Fly–yeast                      12374    40111   3–5    2   15   4.8 × 10⁷    35.1       100        2585.7
Fly–worm–yeast                 14254    42361   3–4    3   40   3.5 × 10⁷    17.9       100        1011.9
E. col–H. pyl–S. typ–V. cho    10339    64890   3      4   35   7.6 × 10⁷    9.3        41         1635.1

Within each entry, vertex is the total number of vertices in the given interaction networks, edge is the total number of edges, |V₀| is the range of size of topology, m is the number of instances, r is the maximum number of top reciprocal BLAST hits allowed for each protein (“—” indicating no constraint), alignment is the total number of alignments over all topologies, avg_size is the average number of distinct proteins within a module instance after postprocessing (only aligned proteins are counted while indels are ignored), max_size is the maximum number of distinct proteins within a module instance, and time is the computational time to obtain all alignments in minutes.

We compare the performance of our algorithm on multiple species alignments to NetworkBLAST-M (Kalaev et al., 2009), which identifies conserved functionally enriched modules based on a representation of multiple networks that is of linear size. On two-species alignments, we further compare to NetworkBLAST (Sharan et al., 2005), which combines more than one given network into a single graph and uses a seed-extension approach to find high-scoring subgraphs that represent alignments, and MaWISh (Koyutürk et al., 2006), which considers an evolutionary model that includes interaction matches, interaction mismatches, and protein duplications to define high-scoring pairwise alignments. On two-species alignment of fly and yeast, we also compare to DOMAIN (Guo and Hartemink, 2009), which incorporates information from domain interactions to produce pairwise alignments based on alignment of edges rather than nodes. On three-species alignment, we also compare to CAPPI (Dutkowski and Tiuryn, 2007), which identifies functional modules by reconstructing an ancestral network based on a network evolution model. On four-species alignment, we also compare to Græmlin (Flannick et al., 2006), which employs a progressive alignment approach that allows a large number of networks to be aligned together. Since our algorithm is the only one among these algorithms that can identify alignments within a single species, no comparisons are performed in this case.

We perform postprocessing to obtain larger modules from the graphlet alignments. In order to avoid exhaustive comparisons between all pairs of alignments when the number of alignments is large, we consider the alignments within each topology in search traversal order and merge all neighboring alignments that have exactly the same m-tuples in G^m at all search levels except for the last level. We merge each instance of an alignment with the corresponding instance of another alignment separately to obtain a module with larger instances.
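This first merging pass can be sketched with a single scan over the alignments. The sketch assumes that each alignment is stored as a list of m-tuples, one per search level, and that the list of alignments is already in search traversal order; the function name `merge_by_prefix` and the toy data are hypothetical.

```python
from itertools import groupby

def merge_by_prefix(alignments):
    """Merge neighboring alignments that agree on all but the last m-tuple."""
    modules = []
    # groupby only groups consecutive items, which matches "neighboring".
    for _, group in groupby(alignments, key=lambda cols: tuple(cols[:-1])):
        group = list(group)
        m = len(group[0][0])
        instances = [set() for _ in range(m)]  # one protein set per instance
        for cols in group:
            for col in cols:
                for i, v in enumerate(col):
                    instances[i].add(v)  # merge instance i separately
        modules.append(instances)
    return modules

alignments = [
    [("a1", "a2"), ("b1", "b2"), ("c1", "c2")],
    [("a1", "a2"), ("b1", "b2"), ("d1", "d2")],
    [("a1", "a2"), ("e1", "e2"), ("c1", "c2")],
]
modules = merge_by_prefix(alignments)
print(modules)
```

The first two toy alignments differ only at the last level and become one module with four proteins per instance, while the third stays separate. No pairwise comparison between all alignments is needed.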

We perform further postprocessing to reduce the number of modules and overlap between modules while increasing the module size. In order not to lose information from our exhaustive set of graphlet alignments, we merge some of these modules instead of removing them. We define the score of a graphlet alignment to be the average minus log e-value from BLAST of all aligned protein pairs, and the score of a module to be the best graphlet alignment score within the module. For each aligned protein, we construct a list of all the modules that contain the protein. We sort the module lists in decreasing order of the number of modules in each list and consider each module list in order. Within each module list, we sort the modules in decreasing order of the number of aligned proteins in the module, then in decreasing order of the number of aligned edges within the ordering, and finally in decreasing order of the module score. We iteratively consider the highest-ranked module in the list. We compare it to the next lower-ranked module and merge them if the merged module does not contain more than 100 aligned proteins within an instance and the overlap of aligned proteins between each pair of corresponding instances is at least 50% with respect to the average number of aligned proteins in the instance pair. We remove the lower-ranked module after merging and replace the highest ranked module with the merged module. We continue to compare the highest-ranked module with the next lower-ranked module until there are no more lower-ranked modules to compare, at which time we remove the module lists of all the aligned proteins that are contained in the highest-ranked module. We continue with the next module list until there are no more module lists left. Table 1 shows that the average size of the modules remains small after merging (although the maximum module size can be large).
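The score and the pairwise merge test described above can be written compactly. In this sketch a module is a list of m sets of aligned proteins, one per instance. The function names are hypothetical, the logarithm base is assumed to be 10 because the text does not state it, and e-values are assumed to be strictly positive.

```python
import math

def alignment_score(evalues):
    """Average of -log10(e-value) over all aligned protein pairs."""
    return sum(-math.log10(e) for e in evalues) / len(evalues)

def can_merge(module_a, module_b, max_size=100, min_overlap=0.5):
    """Size and overlap test applied to each pair of corresponding instances."""
    for inst_a, inst_b in zip(module_a, module_b):
        if len(inst_a | inst_b) > max_size:  # merged instance too large
            return False
        average = (len(inst_a) + len(inst_b)) / 2
        if len(inst_a & inst_b) < min_overlap * average:
            return False
    return True

def merge(module_a, module_b):
    """Union of corresponding instances."""
    return [a | b for a, b in zip(module_a, module_b)]

a = [{1, 2, 3, 4}, {11, 12, 13, 14}]
b = [{3, 4, 5}, {13, 14, 15}]
c = [{4, 9, 10}, {14, 19, 20}]
print(can_merge(a, b), can_merge(a, c))
print(merge(a, b))
```

In the toy data, a and b share two proteins per instance against an average instance size of 3.5, so they pass the 50% test, while a and c share only one and remain separate.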

To investigate functional relationships among aligned proteins within a module, we consider each species separately and use gene ontology (GO) annotations (Ashburner et al., 2000) to determine whether its aligned proteins tend to have related function. We evaluate its functional enrichment by applying the GO Term Finder (Boyle et al., 2004) to the aligned proteins and identifying significant GO terms with Bonferroni corrected p-value below 0.05 within the biological process ontology. We define the specificity to be the percentage of modules that have significant GO terms within each species. We map each significant GO term to all ancestral GO terms with a shortest-path distance of two from the root of the biological process ontology, which represents a subset of high-level GO terms that represent functional categories. We define the sensitivity to be the percentage of these ancestral GO terms that are mapped from at least one significant GO term.
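Both measures reduce to simple counting once the significant GO terms of each module are known. The sketch below assumes the enrichment step has already been run for one species and that modules without any GO annotations have been excluded beforehand; the function names and the placeholder term and category identifiers are hypothetical, not real GO identifiers.

```python
def specificity(significant_terms_per_module):
    """Percentage of modules with at least one significant GO term."""
    enriched = sum(1 for terms in significant_terms_per_module if terms)
    return 100.0 * enriched / len(significant_terms_per_module)

def sensitivity(significant_terms_per_module, to_categories, all_categories):
    """Percentage of high-level categories reached by a significant term."""
    hit = set()
    for terms in significant_terms_per_module:
        for term in terms:
            # Map each significant term to its high-level ancestral terms.
            hit |= to_categories.get(term, set())
    return 100.0 * len(hit & set(all_categories)) / len(all_categories)

# Placeholder data for four modules in one species.
modules = [{"t1"}, set(), {"t2", "t3"}, set()]
to_categories = {"t1": {"c1"}, "t2": {"c1", "c2"}, "t3": {"c2"}}
all_categories = ["c1", "c2", "c3", "c4"]
print(specificity(modules))
print(sensitivity(modules, to_categories, all_categories))
```

Here `to_categories` stands in for the mapping from a GO term to its ancestors at shortest-path distance two from the root of the biological process ontology.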

Figure 7 shows that our algorithm was able to cover significantly more proteins than the other algorithms. Except in a few cases, our algorithm had higher sensitivity and specificity than the other algorithms. Similar to NetworkBLAST and MaWISh, our algorithm returned a larger number of modules than the other algorithms. This is due to the condition that two modules have to satisfy the overlap threshold within each species in order to be merged, so that modules that occupy very different regions within some species will remain separate. Both the protein coverage and the sensitivity decrease as the number of species increases. In general, the number of modules is highly correlated with the number of alignments. It is necessary to allow indels since otherwise protein coverage is much lower. While it is not feasible to use all BLAST hits, protein coverage can be improved by increasing the maximum number of top reciprocal BLAST hits allowed for each protein. When the number of instances is small, it is necessary to consider larger topologies to maintain high sensitivity and specificity.

FIG. 7.
Performance comparisons of GraphletAlign, NetworkBLAST-M, NetworkBLAST, MaWISh, DOMAIN, CAPPI, and Græmlin on single-species, two-species, three-species, and four-species alignments. For GraphletAlign, one indel is allowed and parameter settings are in Table 1. For the other algorithms, the same networks and the same BLAST e-value threshold are used. (a) Number of modules. (b) Protein coverage is the total number of distinct proteins that are covered by these modules within each species (only aligned proteins are counted while indels are ignored). (c) Sensitivity is the percentage of functional categories as defined by all ancestral gene ontology (GO) terms with a shortest path distance of two from the root of the biological process ontology that are mapped from at least one significant GO term within each species. (d) Specificity is the percentage of modules that have significant GO terms within each species while excluding the ones that do not have any GO term annotations.

To investigate whether completely different conserved regions can be obtained from different algorithms within each species, we consider each algorithm and retain only the modules in which all proteins within at least one species are not covered by another algorithm. Note that these modules can still contain proteins that are covered by the other algorithm within some other species. Figure 8 shows that our algorithm identified such modules with respect to each of the other algorithms and that the specificity of these modules remains high, while the other algorithms generally identified fewer such modules with respect to our algorithm.
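The retention criterion can be sketched as a simple filter. This is a hypothetical illustration rather than the code used for Figure 8: the function name and the representation of a module as a mapping from species name to a set of protein identifiers are assumptions made for the example.

```python
def exclusive_modules(modules_x, modules_y):
    """Modules of algorithm X that, in at least one species, contain only
    proteins not covered by any module of algorithm Y (written X\\Y in Fig. 8).

    Each module is a dict mapping a species name to a set of protein IDs.
    """
    # Collect every protein covered by algorithm Y, separately per species.
    covered_y = {}
    for module in modules_y:
        for species, proteins in module.items():
            covered_y.setdefault(species, set()).update(proteins)

    retained = []
    for module in modules_x:
        for species, proteins in module.items():
            # Keep the module if one species has no protein covered by Y.
            if proteins and proteins.isdisjoint(covered_y.get(species, set())):
                retained.append(module)
                break
    return retained
```

A retained module may still share proteins with the other algorithm in its remaining species, since only one species needs to be free of overlap.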

FIG. 8.
Performance comparisons between GraphletAlign and each of the algorithms NetworkBLAST-M, NetworkBLAST, MaWISh, DOMAIN, CAPPI, and Græmlin on multiple species alignments when retaining only the modules in which all proteins within at least one species are not covered by another algorithm. The notation X\Y denotes the performance of algorithm X with respect to another algorithm Y. Each graph shows the same statistics as in Figure 7 except that they are computed only on the retained modules. The notation in (d) is the same as in the other graphs.

Figure 9 shows four graphlet alignments found by our algorithm in the yeast network that link together three mitogen-activated protein (MAP) kinase cascades in the pheromone response pathway, the filamentation/invasion pathway, and the cell integrity pathway, in which the correspondences between the MAPKK, MAPK, and MAP kinases are all in the correct positions (Gustin et al., 1998). Figure 10 shows a module found by our algorithm but not by NetworkBLAST-M that contains cold shock proteins nusA, infB, pnp, and rpsO (Bae et al., 2000), with a strong relationship of the extra protein fusA in H. pylori to cold shock response (Delgado et al., 2008).

FIG. 9.
Mitogen-activated protein (MAP) kinase cascades found in the graphlet alignments of yeast with topology 3-3. Solid lines denote direct interactions, while dashed lines denote indirect interactions. Proteins within the same row are aligned together.

FIG. 10.
A module found by GraphletAlign but not by NetworkBLAST-M, which contains cold shock proteins. Solid lines denote direct interactions, while dashed lines denote indirect interactions.

4. Discussion

We have developed an algorithm for identifying graphlet alignments in protein interaction networks that complements existing algorithms on network alignments and network motifs. We show that it is possible to exhaustively enumerate all non-redundant alignments when the topology is small. Our strategy is successful in achieving a more complete coverage of the conserved functionally enriched modules within the networks while maintaining comparable or higher sensitivity and specificity. Among the algorithms that we have tested, our algorithm is the only one that can identify alignments within a single species.

Although our algorithm is slower than previous algorithms, its running time is approximately linear in the number of graphlet alignments that are generated, and the number of graphlet alignments is highly correlated with the maximum number of top reciprocal BLAST hits r allowed for each protein (Fig. 11). Figures 12 and 13 further show that both sensitivity and specificity increase as r increases, but they level off after r becomes large enough. As the maximum size of topology |V₀| increases, sensitivity stays relatively constant while specificity gradually increases.

FIG. 11.
Computational statistics of GraphletAlign: (a) Running time as a function of the number of graphlet alignments that are generated. (b) Number of graphlet alignments as a function of the maximum number of top reciprocal BLAST hits r allowed for each protein.

FIG. 12.
Sensitivity of GraphletAlign against (a) the maximum number of top reciprocal BLAST hits r allowed for each protein and (b) the maximum size of topology |V₀| that is used.

FIG. 13.
Specificity of GraphletAlign against (a) the maximum number of top reciprocal BLAST hits r allowed for each protein and (b) the maximum size of topology |V₀| that is used.

Our algorithm can be modified to handle other types of biological networks and directed graphs. Other than using BLAST, one can use COG (Tatusov et al., 1997) or Inparanoid (Remm et al., 2001) to define homologous proteins. Other methods such as phylogenetic profiling (Pellegrini et al., 1999) or functional linkages (Li et al., 2005) can be used to define proteins that have related function. Other than using GO terms, one can use resources such as KEGG Orthology (Kanehisa et al., 2004) to define functional categories.

Availability: A software program implementing the algorithm (GraphletAlign) is available online (Hsieh and Sze, 2014).

TABLE 1.
Network	Vertex	Edge	|V₀|	m	r	Alignment	Avg_size	Max_size	Time
Human	6294	13455	3	2	5	3.0 × 10⁷	45.0	177	883.2
Yeast	7446	22734	3	2	10	3.6 × 10⁷	59.0	117	662.2
Human–mouse	7455	14545	3–4	2	—	2.2 × 10⁷	21.4	109	1246.8
Fly–yeast	12374	40111	3–5	2	15	4.8 × 10⁷	35.1	100	2585.7
Fly–worm–yeast	14254	42361	3–4	3	40	3.5 × 10⁷	17.9	100	1011.9
E. col–H. pyl–S. typ–V. cho	10339	64890	3	4	35	7.6 × 10⁷	9.3	41	1635.1

Acknowledgments

This work was supported by the National Science Foundation (MCB-0951120).

Author Disclosure Statement

No competing financial interests exist.

References

Alon, Dao, Hajirasouliha, et al. 2008. Biomolecular network motif counting and discovery by color coding. Bioinformatics, 24, I241–I249.

Altschul S.F., Gish, Miller, et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

Ashburner, Ball C.A., Blake J.A., et al. 2000. Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25–29.

Bae, Xia, Inouye, and Severinov. 2000. Escherichia coli CspA-family RNA chaperones are transcription antiterminators. Proc. Natl. Acad. Sci. USA, 97, 7784–7789.

Berg, and Lässig. 2004. Local graph alignment and motif search in biological networks. Proc. Natl. Acad. Sci. USA, 101, 14689–14694.

Boyle E.I., Weng, Gollub, et al. 2004. GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20, 3710–3715.

Delgado, Zaman, Muthaiyan, et al. 2008. The fusidic acid stimulon of Staphylococcus aureus. J. Antimicrob. Chemother., 62, 1207–1214.

Dost, Shlomi, Gupta, et al. 2008. QNet: a tool for querying protein interaction networks. J. Comput. Biol., 15, 913–925.

Dutkowski, and Tiuryn. 2007. Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics, 23, I149–I158.

Flannick, Novak, Srinivasan B.S., et al. 2006. Græmlin: general and robust alignment of multiple large interaction networks. Genome Res., 16, 1169–1181.

Flannick, Novak, Do C.B., et al. 2009. Automatic parameter learning for multiple local network alignment. J. Comput. Biol., 16, 1001–1022.

Grochow J.A., and Kellis. 2007. Network motif discovery using subgraph enumeration and symmetry-breaking. Lect. Notes Bioinformatics, 4453, 92–106.

Guo, and Hartemink A.J. 2009. Domain-oriented edge-based alignment of protein interaction networks. Bioinformatics, 25, I240–I246.

Gustin M.C., Albertyn, Alexander, and Davenport. 1998. MAP kinase pathways in the yeast Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev., 62, 1264–1300.

Hermjakob, Montecchi-Palazzi, Lewington, et al. 2004. IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452–D455.

Hsieh M.-F., and Sze S.-H. 2014. A software program implementing the algorithm GraphletAlign: http://faculty.cse.tamu.edu/shsze/graphletalign

Jiang, Tu, Chen, and Sun. 2006. Network motif identification in stochastic networks. Proc. Natl. Acad. Sci. USA, 103, 9404–9409.

Kalaev, Bafna, and Sharan. 2009. Fast and accurate alignment of multiple protein networks. J. Comput. Biol., 16, 989–999.

Kanehisa, Goto, Kawashima, et al. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280.

Kashtan, Itzkovitz, Milo, and Alon. 2004. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics, 20, 1746–1758.

Kelley B.P., Sharan, Karp R.M., et al. 2003. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. USA, 100, 11394–11399.

Koyutürk, Grama, and Szpankowski. 2004. An efficient algorithm for detecting frequent subgraphs in biological networks. Bioinformatics, 20, SI200–SI207.

Koyutürk, Kim, Topkara, et al. 2006. Pairwise alignment of protein interaction networks. J. Comput. Biol., 13, 182–199.

Kuchaiev, and Pržulj. 2011. Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, 27, 1390–1396.

Li, Pellegrini, and Eisenberg. 2005. Detection of parallel functional modules by comparative analysis of genome sequences. Nat. Biotechnol., 23, 253–260.

Liao C.-S., Lu, Baym, et al. 2009. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics, 25, I253–I258.

McKay B.D. 1998. Isomorph-free exhaustive generation. J. Algorithms, 26, 306–324.

Milo, Shen-Orr, Itzkovitz, et al. 2002. Network motifs: simple building blocks of complex networks. Science, 298, 824–827.

Parida. 2007. Discovering topological motifs using a compact notation. J. Comput. Biol., 14, 300–323.

Pellegrini, Marcotte E.M., Thompson M.J., et al. 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96, 4285–4288.

Pržulj, Corneil D.G., and Jurisica. 2006. Efficient estimation of graphlet frequency distributions in protein-protein interaction networks. Bioinformatics, 22, 974–980.

Remm, Storm C.E.V., and Sonnhammer E.L.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314, 1041–1052.

Sharan, Suthram, Kelley R.M., et al. 2005. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. USA, 102, 1974–1979.

Shen-Orr S.S., Milo, Mangan, and Alon. 2002. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet., 31, 64–68.

Shlomi, Segal, Ruppin, and Sharan. 2006. QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics, 7, 199.

Singh, Xu, and Berger. 2008. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA, 105, 12763–12768.

Srinivasan B.S., Novak A.F., Flannick J.A., et al. 2006. Integrated protein interaction networks for 11 microbes. Lect. Notes Bioinformatics, 3909, 1–14.

Tatusov R.L., Koonin E.V., and Lipman D.J. 1997. A genomic perspective on protein families. Science, 278, 631–637.

Tian, McEachin R.C., Santos, et al. 2007. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 23, 232–239.

Wernicke. 2006. Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 3, 347–359.

Wuchty, Oltvai Z.N., and Barabási A.-L. 2003. Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat. Genet., 35, 176–179.

Xenarios, Rice D.W., Salwinski, et al. 2000. DIP: the Database of Interacting Proteins. Nucleic Acids Res., 28, 289–291.