Abstract
Abstract
1. Introduction
Apart from the general agreement that protein complexes form dense subgraphs in an interaction network (Spirin and Mirny, 2003), which leads to the strategy of first generating small dense subgraphs and either extending or merging these subgraphs to construct protein complexes (Bader and Hogue, 2003; Li et al., 2007), detailed understanding of the organization of protein complexes remains inadequate. To improve the modeling of complexes, recent approaches separate the tasks of predicting a core complex and its attachment proteins (Leung et al., 2009; Wu et al., 2009).
We observe that most complexes either reside in very dense regions, in which most algorithms are able to identify, or they reside in regions with low neighborhood density, in which most algorithms are less successful to identify. We investigate the following algorithm to consider these two types of complexes separately. Given a protein interaction network, we first identify all the maximal cliques. For each maximal clique, we count the number of other maximal cliques that overlap significantly with it and use it to define neighborhood density. We subdivide these cliques into two sets, with one containing cliques with low neighborhood density and the other containing cliques with high neighborhood density.
Since the maximal cliques with low neighborhood density are likely to correspond to the core region of a complex, we extend each clique to include more proteins as long as the density remains high. This allows each prediction to become a dense subgraph that is not necessarily a clique. Since the maximal cliques with high neighborhood density have overlap with many other maximal cliques, using these cliques directly, as predicted complexes, will lead to significant overestimate of the number of complexes. We extract the most shared sets of proteins from these cliques. We obtain the set of predicted complexes by collecting the above two types of predictions.
We compare the performance of our algorithm to other complex prediction algorithms on a few protein interaction networks and show that our algorithm is able to identify complexes more accurately with respect to complex agreement, complex accuracy, and protein pair agreement measures.
2. Methods
2.1. Neighborhood density
Given a protein interaction network represented by a graph G=(V, E), in which each vertex represents a protein and each edge represents interaction between two proteins, we first obtain the set
Given two sets C1 and C2, we define their similarity to be S(C1, C2)=|C1 ∩ C2|2/(|C1||C2|) (Bader and Hogue, 2003; Altaf-Ul-Amin et al., 2006; Li et al., 2007; Chua et al., 2008; Moschopoulos et al., 2009; Wu et al., 2009; Jung et al., 2010; Shi et al., 2011). For each maximal clique C, we count the number of other maximal cliques C′ with S(C, C′)≥t, where t is a given threshold. We define a clique C to have low neighborhood density if this number is below a given threshold c, and it has high neighborhood density otherwise. The worst case time complexity of this step is
2.2. Cliques with low neighborhood density
Given an induced subgraph G′=(V′, E′) of a graph G=(V, E), we define its density to be D(V′)=2|E′|/(|V′|(|V′|−1)) (Bader and Hogue, 2003; Altaf-Ul-Amin et al., 2006; Li et al., 2007; Moschopoulos et al., 2009; Wu et al., 2009; Shi et al., 2011). For each maximal clique with low neighborhood density, we iteratively identify the best vertex to add so that the density of the enlarged subgraph remains high and the length of the shortest path between two vertices in the enlarged subgraph remains small. We repeat the procedure until no more changes can be made, and use the resulting dense subgraphs as predicted complexes (Fig. 1). Since these subgraphs reside in regions with low neighborhood density, they are likely to function as an independent unit.

Algorithm NDComplexL is used to obtain predicted complexes from maximal cliques with low neighborhood density.
For each potential vertex to add, it takes O(|V|) time to compute the density of the enlarged subgraph. By precomputing all vertex pairs that have shortest paths of length at most two, the worst case time complexity of the procedure is
2.3. Cliques with high neighborhood density
For each maximal clique with high neighborhood density, if its most similar maximal clique overlaps significantly with it, we replace it by the intersection of the two cliques. We repeat the procedure until no more changes can be made, and use the remaining cliques as predicted complexes (Fig. 2). This procedure reduces the number of highly overlapping predictions significantly, since many cliques become identical after the intersections.

Algorithm NDComplexH is used to obtain predicted complexes from maximal cliques with high neighborhood density.
By labeling each clique with a positive integer and picking the one with the lowest label when resolving ties in similarity, we can guarantee that at least two cliques become identical after each iteration, and the procedure takes at most
3. Results
3.1. Performance evaluation
We use a combined set of true complexes that contain at least two proteins to which we compare the predicted complexes, including 214 curated complexes from the Munich Information Center for Protein Sequences (MIPS) database (Mewes et al., 2004), 101 curated complexes from Aloy et al. (2004), and 363 complexes extracted from the Saccharomyces Genome Database (SGD; Cherry et al., 1998), according to GO slim complex annotations (Friedel et al., 2009). To reduce evaluation bias, we have removed complexes that contain more than 100 proteins from the SGD complexes. We removed the duplicates from the combined set, resulting in a total of 574 true complexes. Most of these complexes are small, with the average number of proteins within a complex being 9.4 (Table 1).
Complex denotes the number of complexes that have number of proteins specified by Protein.
We used the yeast protein interaction network from the MIPS database (Mewes et al., 2004), the yeast protein interaction network from the Database of Interacting Proteins (DIP database) (Xenarios et al., 2000), and the yeast protein interaction network from the Biological General Repository for Interaction Datasets (BioGRID database) (Stark et al., 2006) to obtain separate sets of predicted complexes from each network. For the BioGRID network, only physical interactions are included. These networks have different densities, with the BioGRID network being the densest (Table 2).
Protein, interaction, avg_deg, max_deg, and density denote the number of proteins, the number of interactions, the average vertex degree, the maximum vertex degree, and the density of the network, respectively.
3.2. Performance comparisons
We compared the performance of our algorithm NDComplex to the Markov cluster algorithm (MCL) (Enright et al., 2002), which subdivides a given graph into clusters by Markov clustering, to the Molecular complex detection algorithm (MCODE) (Bader and Hogue, 2003), which uses a seed-extension algorithm to identify complexes in dense regions, to the Dense-neighborhood extraction using connectivity and confidence features algorithm (DECAFF) (Li et al., 2007), which is based on the identification and merging of dense subgraphs, and to the Core-attachment based method (COACH) (Wu et al., 2009), which is based on the prediction of a core complex and its attachment proteins. We also compared the performance of our algorithm to the Maximal clique algorithm (CLIQUE), which uses the set of maximal cliques with at least three vertices as predicted complexes.
For MCL, we set inflation to 3.5. For MCODE, we set depth to 100, haircut to true, fluff to false, and fluff density threshold to 0.2. For DECAFF, we implement steps 1 to 4 of Algorithm 1 in Li et al. (2007), including the local clique detection step, the hub removal step, and the merging step, and set the density threshold to 0.8 and the neighborhood affinity threshold to 0.5. Since the other algorithms do not use functional information, we skip the filtering steps 5 to 9 in DECAFF that use the MIPS functional catalog (Mewes et al., 2004). For COACH, we set the neighborhood affinity threshold to 0.1. For NDComplex, we set the similarity threshold t and the occurrence threshold c during the computation of neighborhood density to 0.3 and 3 respectively. We set the density threshold d to 0.7 during the computation of predicted complexes in regions with low neighborhood density, and the similarity threshold s to 0.2 during the computation of predicted complexes in regions with high neighborhood density. These parameters are determined by testing a few combinations and choosing the one that gives the best overall performance on the test sets. We collect the predicted complexes from both types of regions for performance evaluation, with each distinct prediction counted once.
3.3. Complex agreement measure
Given a similarity threshold u, we evaluate the agreement between a set of true complexes and a set of predicted complexes by defining the precision (PC) to be the ratio of the number of predicted complexes P that have a true complex C with similarity S(C,P)≥u to the total number of predicted complexes, and the recall (RC) to be the ratio of the number of true complexes C that have a predicted complex P with similarity S(C, P)≥u to the total number of true complexes (Chua et al., 2008; Moschopoulos et al., 2009; Wu et al., 2009; Jung et al., 2010; Shi et al., 2011). We compute the F-measure=2×(PC×RC)/(PC+RC).
Figure 3 shows that NDComplex has the best overall performance when the similarity threshold u is low, while MCODE and COACH perform better when the similarity threshold u is high, which corresponds to a small number of almost perfect predictions. DECAFF has the next best performance, followed by CLIQUE and MCL.

Performance of complex prediction algorithms on each protein interaction network with respect to the complex agreement measure over different similarity thresholds u between a true complex and a predicted complex. For DECAFF, only steps 1 to 4 of Algorithm 1 in Li et al. (2007) are included, while the filtering steps 5 to 9 that are based on the use of functional information are skipped.
3.4. Complex accuracy measure
In addition to using the complex agreement measure, we use the complex accuracy measure in Brohée and van Helden (2006), Friedel et al. (2009), and Tu et al. (2011) that does not rely on the use of a similarity threshold. Given a set of true complexes
Figure 4 shows that NDComplex has the best accuracy in almost all cases, followed by CLIQUE, DECAFF, and COACH. The performance differences between NDComplex and MCL, MCODE, DECAFF, or COACH are the largest on the BioGRID network. DECAFF puts emphasis on sensitivity, while MCODE puts emphasis on PPV.

Performance of complex prediction algorithms on each protein interaction network with respect to the complex accuracy measure.
3.5. Protein pair agreement measure
We evaluate the protein pair agreement by defining a true positive (TP) to be a protein pair that is within the same true complex and within the same predicted complex, a false positive (FP) to be a protein pair that is within the same predicted complex but not in the same true complex, and a false negative (FN) to be a protein pair that is within the same true complex but not in the same predicted complex. We compute the precision PC=|TP|/(|TP|+|FP|), the recall RC=|TP|/(|TP|+|FN|), and the F-measure=2×(PC×RC)/(PC+RC).
When the true complexes and the predicted complexes both consist of disjoint sets of proteins, this measure evaluates the resemblance of the two sets. Since a protein can appear in more than one complex, this measure evaluates how accurately an algorithm can predict whether a given pair of proteins are within the same complex or not, although this is only an approximation since the set of true complexes may not be complete.
Figure 5 shows that COACH and NDComplex have the best performance with respect to protein pairs. The performance differences between these algorithms and MCL, MCODE, or DECAFF are especially large on the BioGRID network, with CLIQUE performing better than MCL, MCODE, and DECAFF. MCL and DECAFF have low performance on the BioGRID network due to a large number of false positive protein pairs. MCODE puts emphasis on precision, while DECAFF puts emphasis on recall.

Performance of complex prediction algorithms on each protein interaction network with respect to the protein pair agreement measure.
4. Discussion
We have developed an algorithm to identify complexes from protein interaction networks based on using different strategies for regions with different neighborhood densities. We have shown that this approach is very effective and it achieves the best performance with respect to complex agreement, complex accuracy, and protein pair agreement measures. Among the three networks that we have tested, we found that each algorithm performs the best on the BioGRID network, which has the highest density, with weaker performance on the DIP network, followed by the MIPS network, which has the lowest density. On the BioGRID network, our algorithm NDComplex performs the best with respect to the complex agreement measure when the similarity threshold u is low, and with respect to the complex accuracy measure, while COACH and NDComplex have the best and comparable performance with respect to the protein pair agreement measure.
Ideally, the correspondences between the true complexes and the predicted complexes should be one-to-one. To evaluate the degree of success of each algorithm in reaching this goal, we use the separation measure in Brohée and van Helden (2006) and Tu et al. (2011). Given a true complex C and a predicted complex P, define their separation Sep
Table 3 shows that while MCODE has the highest separation, it produces a small number of predictions that cover very few proteins. MCL achieves the next highest separation, but it produces a disjoint partition of proteins, which is biologically inaccurate since it does not allow a protein to appear in multiple predicted complexes. COACH achieves a better separation than NDComplex, but it covers less proteins. CLIQUE and DECAFF cover the most interactions, but the number of predictions is very high and they have the worst separation. NDComplex achieves better separation than CLIQUE and DECAFF.
Complex denotes the number of predicted complexes; protein and interaction denote the number of proteins and interactions covered by these complexes, respectively, in the network; and separation denotes the separation between the set of true complexes and the set of predicted complexes.
Table 4 shows the statistics of maximal cliques and predicted complexes during different stages of our algorithm NDComplex. Within each network, the number of maximal cliques with high neighborhood density is much larger than the number of maximal cliques with low neighborhood density, which means that there are significant overlaps among the maximal cliques with high neighborhood density. For regions with low neighborhood density, our strategy can add a large number of proteins to a prediction, which is especially evident on the BioGRID network. While the density of these predictions remains high, the number of predictions decreases since some of them become identical after the addition of proteins. The situation is very different for regions with high neighborhood density, when the number of predictions decreases drastically after the intersections of cliques, and each of these predictions contains a very small number of proteins, which means that there are only a small number of highly shared parts within these cliques. The predictions from these two types of regions are distinct (compare to Table 3), with almost all predictions coming from regions with low neighborhood density. Since this is sufficient to ensure that the predictions have high quality, a complex can be modeled mostly as an independent unit with dense inside connections and sparse outside connections, while the small number of predictions from very dense regions are still needed to reduce false negatives.
Clique_l and complex_l denote cliques with low neighborhood density and predicted complexes from these cliques, respectively; and clique_h and complex_h denote cliques with high neighborhood density and predicted complexes from these cliques, respectively. In each case, number denotes the total number of cliques or predicted complexes, and avg_pro and max_pro denote the average and maximum number of proteins within a clique or a predicted complex, respectively.
Although the worst case time complexity of our algorithm NDComplex is high, the actual running time is not high (Table 5). For sparse networks such as MIPS and DIP, it takes less than an hour to complete all the steps, and the time to obtain the maximal cliques dominates. For denser networks such as BioGRID, it takes about a day, and the time to obtain predicted complexes from cliques with high neighborhood density dominates.
Clique denotes the time to obtain all the maximal cliques; neighbor denotes the time to compute the neighborhood density of these cliques; and complex_l and complex_h denote the time to obtain predicted complexes from cliques with low neighborhood density and high neighborhood density, respectively.
To investigate the effect of different parameters on our algorithm NDComplex, we picked a few sets of parameters that are close to the one we choose and examine the performance differences. Figure 6 shows that NDComplex is not very sensitive to parameters.

Performance of our algorithm NDComplex on the protein interaction network from the BioGRID database with respect to the complex accuracy measure and the protein pair agreement measure over different parameter settings (t,c,d), where t and c are the similarity threshold and the occurrence threshold, respectively, during the computation of neighborhood density, and d is the density threshold during the computation of predicted complexes in regions with low neighborhood density. The similarity threshold during the computation of predicted complexes in regions with high neighborhood density is fixed to s = 0.2. The last set of parameters (0.3,3,0.7) is the one we chose.
To illustrate the differences in predictions that can be obtained from different algorithms, we examine the predicted complex from each algorithm on the BioGRID network that has the highest similarity to the true complex that contains the MAP kinase cascade of the pheromone response pathway and the filamentation/invasion pathway (Gustin et al., 1998) from the MIPS database. Figure 7 shows that CLIQUE finds the complex that contains the MAP kinase cascade of the filamentation/invasion pathway, while the protein Fus3 in the true complex is not included. NDComplex expands the complex to include two extra proteins Bem1 and Hek2, in addition to all the proteins in the true complex. COACH and DECAFF return the largest complexes that contain a few extra proteins, while MCL and MCODE do not return a complex that has overlap to the true complex.

Predicted complex from each algorithm on the protein interaction network from the BioGRID database that has the highest similarity to the true complex that contains the MAP kinase cascade of the pheromone response pathway and the filamentation/invasion pathway from the MIPS database. No predicted complexes from MCL or MCODE have overlap to the true complex and are not shown.
One drawback of our algorithm is that it takes exponential time to identify all the maximal cliques, thus it is not likely that it will scale up to handle dense networks. One future direction is to investigate whether it is possible to develop polynomial time heuristics without a large decrease in prediction performance.
Footnotes
Acknowledgment
This work was supported by grants from the National Science Foundation (CCF-0830455, MCB-0951120).
Disclosure Statement
The authors declare that no competing financial interests exist.
